
[Hybrid] Create checkpoints while processing the prompt #17428

Open
whoreson wants to merge 2 commits into ggml-org:master from whoreson:master

Conversation

@whoreson
Contributor

Currently, llama-server creates a single checkpoint after processing e.g. 600000 tokens with IBM Granite, and that checkpoint will likely get erased soon after.

I found this state of affairs insufferable, so I had Gemini solve the issue and generate half of the requested checkpoints during prompt processing (PP).

Now it's actually usable.

Here is the result:

slot update_slots: id  0 | task 487 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 2048, progress = 0.235375
slot update_slots: id  0 | task 487 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id  0 | task 487 | created prompt processing context checkpoint 2 of 256 (pos_min = 4095, pos_max = 4095, size = 147.481 MiB)
slot update_slots: id  0 | task 487 | prompt processing progress, n_tokens = 6144, batch.n_tokens = 2048, progress = 0.353063
slot update_slots: id  0 | task 487 | n_tokens = 6144, memory_seq_rm [6144, end)
slot update_slots: id  0 | task 487 | created prompt processing context checkpoint 3 of 256 (pos_min = 6143, pos_max = 6143, size = 147.481 MiB)
slot update_slots: id  0 | task 487 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 2048, progress = 0.470750
slot update_slots: id  0 | task 487 | n_tokens = 8192, memory_seq_rm [8192, end)
slot update_slots: id  0 | task 487 | created prompt processing context checkpoint 4 of 256 (pos_min = 8191, pos_max = 8191, size = 147.481 MiB)
slot update_slots: id  0 | task 487 | prompt processing progress, n_tokens = 10240, batch.n_tokens = 2048, progress = 0.588438
slot update_slots: id  0 | task 487 | n_tokens = 10240, memory_seq_rm [10240, end)
slot update_slots: id  0 | task 487 | created prompt processing context checkpoint 5 of 256 (pos_min = 10239, pos_max = 10239, size = 147.481 MiB)
slot update_slots: id  0 | task 487 | prompt processing progress, n_tokens = 12288, batch.n_tokens = 2048, progress = 0.706126
slot update_slots: id  0 | task 487 | n_tokens = 12288, memory_seq_rm [12288, end)
slot update_slots: id  0 | task 487 | created prompt processing context checkpoint 6 of 256 (pos_min = 12287, pos_max = 12287, size = 147.481 MiB)
slot update_slots: id  0 | task 487 | prompt processing progress, n_tokens = 14336, batch.n_tokens = 2048, progress = 0.823813
slot update_slots: id  0 | task 487 | n_tokens = 14336, memory_seq_rm [14336, end)
slot update_slots: id  0 | task 487 | created prompt processing context checkpoint 7 of 256 (pos_min = 14335, pos_max = 14335, size = 147.481 MiB)
slot update_slots: id  0 | task 487 | prompt processing progress, n_tokens = 16384, batch.n_tokens = 2048, progress = 0.941501
slot update_slots: id  0 | task 487 | n_tokens = 16384, memory_seq_rm [16384, end)
slot update_slots: id  0 | task 487 | created prompt processing context checkpoint 8 of 256 (pos_min = 16383, pos_max = 16383, size = 147.481 MiB)
slot update_slots: id  0 | task 487 | prompt processing progress, n_tokens = 17402, batch.n_tokens = 1018, progress = 1.000000
slot update_slots: id  0 | task 487 | prompt done, n_tokens = 17402, batch.n_tokens = 1018
...
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 0.923
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 1886 | processing task
slot update_slots: id  0 | task 1886 | new prompt, n_ctx_slot = 300032, n_keep = 256, task.n_tokens = 17352
slot update_slots: id  0 | task 1886 | n_past = 17338, slot.prompt.tokens.size() = 18790, seq_id = 0, pos_min = 18789, n_swa = 1
slot update_slots: id  0 | task 1886 | restored context checkpoint (pos_min = 16383, pos_max = 16383, size = 147.481 MiB)
slot update_slots: id  0 | task 1886 | n_tokens = 16384, memory_seq_rm [16384, end)
slot update_slots: id  0 | task 1886 | prompt processing progress, n_tokens = 17352, batch.n_tokens = 968, progress = 1.000000
slot update_slots: id  0 | task 1886 | prompt done, n_tokens = 17352, batch.n_tokens = 968

@Mushoz

Mushoz commented Nov 21, 2025

You can simply start llama-server with --swa-full. Then you don't need any checkpoints anymore, as the KV-cache can be reused with as many (or as few) tokens that have the same prefix.

@whoreson
Contributor Author

You can simply start llama-server with --swa-full. Then you don't need any checkpoints anymore, as the KV-cache can be reused with as many (or as few) tokens that have the same prefix.

most uninformed comment of the month award

@Mushoz

Mushoz commented Nov 22, 2025

My apologies. I mistakenly thought Granite was using sliding window attention (for which my comment would have made sense), but it's not. Disregard my previous comment. Having said that, there is no need to be unfriendly towards people trying to be helpful, even if they made a mistake.

@whoreson
Contributor Author

i can still feel very friendly even when mildly annoyed (useful trait when keeping women around)

@JohannesGaessler
Contributor

@whoreson please stay on topic and refrain from needlessly agitating specific persons or groups of people.

@whoreson
Contributor Author

Well, cudadev sir.

Anyway, this is the wrong place for this: creating a usable cache is a library task; it doesn't belong in an HTTP server.

On that note, it would make a lot of sense if the (horrible, incompatible, screen-space-wasting) web UI were finally separated out of llama-server and communicated with it via HTTP. But I don't think there's any chance of that happening, considering that llama.cpp has had exactly zero feature-freeze/release periods in all of its existence, just code being shovelled in endlessly.

@whoreson
Contributor Author

Now that Qwen 3.5 is out, can you guys FINALLY fix this functionality so that I won't have to track all the server code reorganizations locally? Jesus


@aagit
Contributor

aagit commented Mar 1, 2026

Hi,

The second commit is fully orthogonal; perhaps it should be moved to a different PR, so we can focus on converging on an optimal solution for KV-cache truncation with checkpoints.

My reason for checkpointing at a fixed (albeit configurable) interval in my PR is that the token interval controls the worst-case maximum latency that you get. If the context is small I would like fewer checkpoints; if it's large I would like more checkpoints. I think the interval should be fixed in tokens.

Still, it's suboptimal in my approach that ctx-checkpoints is left to the user to set right. I would prefer if there were no reservation and checkpoints were allocated dynamically, with a lifetime tracking the KV cache, and freed when the KV cache drops below the checkpoint. Then the parameter and the ctx-checkpoint count would disappear. I have no idea whether that's feasible or whether there's some blocker to achieving it.

So I did the minimal tweak that provides checkpoints at a fixed token interval, leveraging the configurable logical batch size.

Our PRs are needed every time you modify a file in the context while coding; without them, code-editor enhancements like the gptel PR adding context LRU management and opt-in KV-cache preloading won't work at all.

Thanks!

