
UPSTREAM PR #20087: Hybrid model cache: add --checkpoint-every-nb #1222

Open

loci-dev wants to merge 6 commits into main from loci/pr-20087-batch-checkpoints

Conversation


@loci-dev loci-dev commented Mar 4, 2026

Note

Source pull request: ggml-org/llama.cpp#20087

Add an option to create checkpoints after processing every n batches during prompt processing.

Hopefully solves #19794, #19298, #18497, and similar issues.

Usage: `llama-server -m model.gguf --checkpoint-every-nb 3` creates a checkpoint every 3 batches.


loci-review bot commented Mar 4, 2026

Overview

Analysis of 112,748 functions across 15 binaries reveals minimal performance impact from adding checkpoint management functionality. Modified: 180 functions (0.16%), new: 6, removed: 0, unchanged: 112,562 (99.84%). All changes confined to command-line argument parsing infrastructure, with no modifications to inference hot paths.

Power Consumption Changes:

  • build.bin.llama-cvector-generator: 358,322→357,555 nJ (-0.214%)
  • build.bin.llama-tts: 363,680→362,840 nJ (-0.231%)
  • build.bin.libllama.so: 256,709→256,709 nJ (0.000%)
  • build.bin.libmtmd.so, llama-bench, llama-gemma3-cli, llama-gguf-split, llama-llava-cli, llama-minicpmv-cli, llama-quantize, llama-qwen2vl-cli, llama-tokenize, libggml-cpu.so, libggml-base.so, libggml.so: 0.000% change

Function Analysis

Lambda #16 (--dry-allowed-length handler) in llama-cvector-generator and llama-tts shows response time increases of 13,120% and 13,081% respectively (14.5ns→1,923ns and 14.5ns→1,918ns). Source code unchanged; regression stems from infrastructure overhead changes in argument parsing framework. Executes once during startup.

Lambda #34 (--main-gpu handler) shows 4,831% and 4,715% response time increases (111ns→5,473ns and 113ns→5,450ns) due to increased overhead in llama_supports_gpu_offload() backend enumeration. No source changes detected.

Lambda #18 regressions (7,758% and 7,739%) are false positives caused by lambda position renumbering: the addition of --checkpoint-every-nb shifted subsequent lambdas, causing comparisons between functionally different handlers.

Other analyzed functions (lambdas #35, #57, #60) show 435-673% increases from compiler optimization differences affecting inlining decisions. All changes occur in one-time initialization code with cumulative overhead of ~8.4 microseconds per application launch, negligible compared to model loading time (seconds).

Additional Findings

Zero impact on inference operations: no changes to matrix operations, attention mechanisms, KV cache, quantization kernels, or GPU backends (CUDA, Metal, HIP). The new --checkpoint-every-nb feature is opt-in (disabled by default), providing crash recovery for long-context scenarios without affecting users who don't enable it. Core inference library (libllama.so) shows no measurable power consumption change, confirming inference efficiency is maintained.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev


stojanai commented Mar 4, 2026

For llama-tts, I need a flamegraph from before and after to understand this.

pwilkin and others added 4 commits on March 4, 2026 13:01:

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
…ort form `-ctxcp` for `--ctx-checkpoints`)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

loci-review bot commented Mar 5, 2026

No summary available at this time. Visit Loci Inspector to review detailed analysis.

@loci-dev force-pushed the main branch 9 times, most recently from 61601b2 to 56aaa36 on March 13, 2026 02:16
@loci-dev force-pushed the main branch 9 times, most recently from e3ea641 to efc22ce on March 19, 2026 02:18
@loci-dev force-pushed the main branch 9 times, most recently from 88f82d8 to 8c39ead on March 25, 2026 02:17