
[Hybrid] Create checkpoints while processing the prompt #17428

Open
whoreson wants to merge 2 commits into ggml-org:master from whoreson:master

Conversation

@whoreson
Contributor

Currently, llama-server creates a single checkpoint after processing e.g. 600000 tokens with IBM Granite, and that checkpoint will likely get erased soon after.

I found this state of affairs insufferable, so I had Gemini solve the issue and generate half of the requested checkpoints during prompt processing (PP).

Now it's actually usable.

Here is the result:

slot update_slots: id  0 | task 487 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 2048, progress = 0.235375
slot update_slots: id  0 | task 487 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id  0 | task 487 | created prompt processing context checkpoint 2 of 256 (pos_min = 4095, pos_max = 4095, size = 147.481 MiB)
slot update_slots: id  0 | task 487 | prompt processing progress, n_tokens = 6144, batch.n_tokens = 2048, progress = 0.353063
slot update_slots: id  0 | task 487 | n_tokens = 6144, memory_seq_rm [6144, end)
slot update_slots: id  0 | task 487 | created prompt processing context checkpoint 3 of 256 (pos_min = 6143, pos_max = 6143, size = 147.481 MiB)
slot update_slots: id  0 | task 487 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 2048, progress = 0.470750
slot update_slots: id  0 | task 487 | n_tokens = 8192, memory_seq_rm [8192, end)
slot update_slots: id  0 | task 487 | created prompt processing context checkpoint 4 of 256 (pos_min = 8191, pos_max = 8191, size = 147.481 MiB)
slot update_slots: id  0 | task 487 | prompt processing progress, n_tokens = 10240, batch.n_tokens = 2048, progress = 0.588438
slot update_slots: id  0 | task 487 | n_tokens = 10240, memory_seq_rm [10240, end)
slot update_slots: id  0 | task 487 | created prompt processing context checkpoint 5 of 256 (pos_min = 10239, pos_max = 10239, size = 147.481 MiB)
slot update_slots: id  0 | task 487 | prompt processing progress, n_tokens = 12288, batch.n_tokens = 2048, progress = 0.706126
slot update_slots: id  0 | task 487 | n_tokens = 12288, memory_seq_rm [12288, end)
slot update_slots: id  0 | task 487 | created prompt processing context checkpoint 6 of 256 (pos_min = 12287, pos_max = 12287, size = 147.481 MiB)
slot update_slots: id  0 | task 487 | prompt processing progress, n_tokens = 14336, batch.n_tokens = 2048, progress = 0.823813
slot update_slots: id  0 | task 487 | n_tokens = 14336, memory_seq_rm [14336, end)
slot update_slots: id  0 | task 487 | created prompt processing context checkpoint 7 of 256 (pos_min = 14335, pos_max = 14335, size = 147.481 MiB)
slot update_slots: id  0 | task 487 | prompt processing progress, n_tokens = 16384, batch.n_tokens = 2048, progress = 0.941501
slot update_slots: id  0 | task 487 | n_tokens = 16384, memory_seq_rm [16384, end)
slot update_slots: id  0 | task 487 | created prompt processing context checkpoint 8 of 256 (pos_min = 16383, pos_max = 16383, size = 147.481 MiB)
slot update_slots: id  0 | task 487 | prompt processing progress, n_tokens = 17402, batch.n_tokens = 1018, progress = 1.000000
slot update_slots: id  0 | task 487 | prompt done, n_tokens = 17402, batch.n_tokens = 1018
...
slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.999 (> 0.100 thold), f_keep = 0.923
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 1886 | processing task
slot update_slots: id  0 | task 1886 | new prompt, n_ctx_slot = 300032, n_keep = 256, task.n_tokens = 17352
slot update_slots: id  0 | task 1886 | n_past = 17338, slot.prompt.tokens.size() = 18790, seq_id = 0, pos_min = 18789, n_swa = 1
slot update_slots: id  0 | task 1886 | restored context checkpoint (pos_min = 16383, pos_max = 16383, size = 147.481 MiB)
slot update_slots: id  0 | task 1886 | n_tokens = 16384, memory_seq_rm [16384, end)
slot update_slots: id  0 | task 1886 | prompt processing progress, n_tokens = 17352, batch.n_tokens = 968, progress = 1.000000
slot update_slots: id  0 | task 1886 | prompt done, n_tokens = 17352, batch.n_tokens = 968

@Mushoz

Mushoz commented Nov 21, 2025

You can simply start llama-server with --swa-full. Then you don't need any checkpoints anymore, as the KV-cache can be reused with as many (or as few) tokens that have the same prefix.

@whoreson
Contributor Author

You can simply start llama-server with --swa-full. Then you don't need any checkpoints anymore, as the KV-cache can be reused with as many (or as few) tokens that have the same prefix.

most uninformed comment of the month award

@Mushoz

Mushoz commented Nov 22, 2025

My apologies. I mistakenly thought Granite was using sliding window attention (for which my comment would have made sense), but it's not. Disregard my previous comment. Having said that, there is no need to be unfriendly towards people trying to be helpful, even if they made a mistake.

@whoreson
Contributor Author

i can still feel very friendly even when mildly annoyed (useful trait when keeping women around)

@JohannesGaessler
Contributor

@whoreson please stay on topic and refrain from needlessly agitating specific persons or groups of people.

@whoreson
Contributor Author

Well, cudadev sir.

Anyway, this is the wrong place for this: creating a usable cache is a library task; it doesn't belong in an HTTP server.

On that note, it would make a lot of sense if the (horrible, incompatible, screen-space-wasting) web UI were finally separated out of llama-server and communicated with it via HTTP. But I don't think there's any chance of that happening, considering that llama.cpp has had exactly zero feature-freeze/release periods in all of its existence, just code being shovelled in endlessly.

@whoreson
Contributor Author

Now that Qwen 3.5 is out, can you guys FINALLY fix this functionality so that I won't have to track all the server code reorganizations locally? Jesus


@aagit
Contributor

aagit commented Mar 1, 2026

Hi,

The second commit is fully orthogonal; perhaps it should be moved to a different PR, so we can focus on converging on an optimal solution for KV-cache truncation with checkpoints.

My reason for checkpointing at a fixed (albeit configurable) interval in my PR is that the token interval controls the worst-case maximum latency that you get. If the context is small I would like fewer checkpoints; if it's large I would like more checkpoints. I think the interval should be fixed in tokens.

Still, it's suboptimal in my approach that ctx-checkpoints is left to the user to set right. I would prefer if there were no reservation and checkpoints were allocated dynamically, with a lifetime tracking the KV cache, and freed when the KV cache drops below the checkpoint. Then the parameter and the ctx-checkpoint count would disappear. I have no idea whether that's feasible or whether there's some blocker to achieving it.

So I did the minimal tweak that provides checkpoints at a fixed token interval, leveraging the configurable logical batch size.

Our PRs are needed every time you modify a file in the context while coding; without them, code-editor enhancements like the gptel PR adding context LRU management and opt-in KV-cache preloading won't work at all.

Thanks!

