
UPSTREAM PR #20087: Hybrid model cache: add --checkpoint-every-nb #1222

Open

loci-dev wants to merge 6 commits into main from loci/pr-20087-batch-checkpoints

Conversation


@loci-dev loci-dev commented Mar 4, 2026

Note

Source pull request: ggml-org/llama.cpp#20087

Add an option to create checkpoints after processing every n batches during prompt processing.

Hopefully solves #19794, #19298, #18497, and similar issues.

Usage: `llama-server -m model.gguf --checkpoint-every-nb 3` creates a checkpoint every 3 batches.


loci-review bot commented Mar 4, 2026

Overview

Analysis of 112,748 functions across 15 binaries reveals minimal performance impact from adding checkpoint management functionality. Modified: 180 functions (0.16%), new: 6, removed: 0, unchanged: 112,562 (99.84%). All changes confined to command-line argument parsing infrastructure, with no modifications to inference hot paths.

Power Consumption Changes:

  • build.bin.llama-cvector-generator: 358,322→357,555 nJ (-0.214%)
  • build.bin.llama-tts: 363,680→362,840 nJ (-0.231%)
  • build.bin.libllama.so: 256,709→256,709 nJ (0.000%)
  • build.bin.libmtmd.so, llama-bench, llama-gemma3-cli, llama-gguf-split, llama-llava-cli, llama-minicpmv-cli, llama-quantize, llama-qwen2vl-cli, llama-tokenize, libggml-cpu.so, libggml-base.so, libggml.so: 0.000% change

Function Analysis

Lambda #16 (--dry-allowed-length handler) in llama-cvector-generator and llama-tts shows response time increases of 13,120% and 13,081% respectively (14.5ns→1,923ns and 14.5ns→1,918ns). Source code unchanged; regression stems from infrastructure overhead changes in argument parsing framework. Executes once during startup.

Lambda #34 (--main-gpu handler) shows 4,831% and 4,715% response time increases (111ns→5,473ns and 113ns→5,450ns) due to increased overhead in llama_supports_gpu_offload() backend enumeration. No source changes detected.

Lambda #18 regressions (7,758% and 7,739%) are false positives caused by lambda position renumbering: the addition of --checkpoint-every-nb shifted subsequent lambdas, causing comparisons between functionally different handlers.

Other analyzed functions (lambdas #35, #57, #60) show 435-673% increases from compiler optimization differences affecting inlining decisions. All changes occur in one-time initialization code with cumulative overhead of ~8.4 microseconds per application launch, negligible compared to model loading time (seconds).

Additional Findings

Zero impact on inference operations: no changes to matrix operations, attention mechanisms, KV cache, quantization kernels, or GPU backends (CUDA, Metal, HIP). The new --checkpoint-every-nb feature is opt-in (disabled by default), providing crash recovery for long-context scenarios without affecting users who don't enable it. Core inference library (libllama.so) shows no measurable power consumption change, confirming inference efficiency is maintained.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev


stojanai commented Mar 4, 2026

For llama-tts, I need a flamegraph from before and after to understand this.

pwilkin and others added 4 commits on March 4, 2026 13:01:

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
…ort form `-ctxcp` for `--ctx-checkpoints`)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

loci-review bot commented Mar 5, 2026

No summary available at this time. Visit Loci Inspector to review detailed analysis.

@loci-dev force-pushed the main branch 9 times, most recently from 61601b2 to 56aaa36 on March 13, 2026 02:16
@loci-dev force-pushed the main branch 9 times, most recently from e3ea641 to efc22ce on March 19, 2026 02:18
@loci-dev force-pushed the main branch 9 times, most recently from 88f82d8 to 8c39ead on March 25, 2026 02:17