Hybrid model cache: add --checkpoint-every-nb #20087
Conversation
|
Thank you for your great work, as always. I wanted to ask: does this fix prompt caching not working after(!) the context size has been exceeded? Or is that a problem that is impossible to solve for RNN models? For non-RNN models the solution is simple: the beginning of the prompt is truncated by the UI and then sent to the backend to keep the chat rolling, and prompt caching still works. This, however, doesn't work with Qwen 3.5. IMO this is a dealbreaker for this kind of architecture, as not everyone can run 250k+ context (and even that fills up eventually), so you either start a new chat regularly or live with very long prompt processing on each enquiry. I am a bit confused about the many PRs around prompt caching for hybrid models, to be honest, as prompt caching for Qwen 3.5 has been working as expected for quite a while now, as long as (and that is the important part) you stay within your context window. |
|
@Dampfinchen to be honest, the very idea of the hybrid architecture is that you can run a very long context cheaply, so this eliminates the need for context truncation. Unfortunately, due to how recurrent states are constructed, there is no way to take a "partial range" of a recurrent model's cache - once any prefix is invalid, you have to reprocess. This solution is mostly intended for situations where agentic coders keep a certain prefix of the prompt, but not long enough for the default solution (n_tokens - 512) to work. The idea behind this is basically - build up snapshots incrementally, and if at any time any prefix is needed then one of the checkpoints should probably work. However, to reiterate, there is no possible solution for a situation where the prefix itself changes. For example, there's a long-standing issue with Claude Code which inserts a custom header that gets added to the beginning of the prompt each time - this breaks any sort of checkpointing. Likewise, if you have the exact datetime in your prompt, this will force reprocessing every time. There is nothing that can be done in those cases - those have to be fixed on the client side. |
Exactly this bug is how I ended up here. Just switched from Opencode to Claude Code this afternoon, and was wondering why Qwen3.5-122B is reprocessing every single prompt. Will this flag (--checkpoint-every-nb) be enabled by default, or does it have to be set explicitly? Thanks for the great work! |
I built it locally, and it looks like it is disabled by default. Here is the relevant section from I am still having to rebuild the cache in OpenCode after sending a new message after the context window gets above ~50,000 tokens (tried with The strange thing is that tool calls don't trigger this issue (using OpenCode), only manual messages. Makes me wonder if this is some sort of issue with OpenCode, or maybe some PEBKAC issue on my part. EDIT: I am successfully working around the issue with |
|
@pwilkin I don't think this flag is needed. Which case does it fix that is not already covered by the existing logic? |
|
@ggerganov From all the reports, I gathered that there are cases where the agent's reprocessing cuts off a larger part of the prompt while still keeping a reasonably large prefix. A notable example is when the agent doesn't pass reasoning content back to the model, so the last reasoning content gets cut off (and it's often more than 512 tokens). |
|
Another thing this helps with is keeping a checkpoint with all (or most) of the agent's fixed tools / instructions header, which can go up to 20k tokens in some cases. |
This change is not going to fix that - when the reasoning is removed, the prefix changes. So making checkpoints during reasoning will just increase memory usage without reducing computation. |
No, because in a typical agentic scenario, you're going to have something like this:

[AGENT INSTRUCTIONS][USER QUERY][AGENT REASONING][AGENT RESPONSE]

Now, if reasoning is removed, you'll have:

[AGENT INSTRUCTIONS][USER QUERY][AGENT RESPONSE][USER REQUERY]

Although some part is removed, you're still left with the [AGENT INSTRUCTIONS][USER QUERY] part, which can be quite large, esp. if files have been attached. |
|
I have a case where the prompt looks like |
|
The

I think a better example where the current logic on

With a fresh

Ideally, the best logic is to create a checkpoint before each user message. But it might be more complicated to do that than the basic interval-based checkpointing proposed here.

@smilediver If you have control over the client, you can send |
It is if you start a new conversation, yes. But if you resume a session, as people often do with those agents, then it doesn't work anymore. There are a lot of real-life edge cases here, which is why I thought that this approach would be the simplest to capture most of them. |
ggerganov
left a comment
I think we can bump the default number of checkpoints to 32 and enable this new functionality by default to make checkpoints every 8192 tokens.
You can set the env variable CLAUDE_CODE_ATTRIBUTION_HEADER to 0 to avoid the custom header in CC. |
|
FWIW, to keep it minimal, I used the logical batch size instead of a new config option: #19970. Hardcoding it to 8k isn't ideal. |
Yes, luckily gptel never alters the prompt, and the prompt performs much better after the context anyway... verified both with rg-edit and synthmerge_bench for all models except Gemini 2.5 with reasoning enabled. When you're editing files in context, you're inevitably going to truncate context. The gptel context LRU management PR I posted makes sure the most frequently edited files are pushed to the end of context for this reason; it gives an extra noticeable boost. |
Yeah, I know that one :) but what I mean is that if you don't use that option, there is literally no way to avoid reprocessing. |
|
@ggerganov aight, I changed the defaults to 32 checkpoints and checkpointing every 8192 tokens. |
Force-pushed 68b87a8 to 6874373 (compare)
Force-pushed 6874373 to c9d8cdc (compare)
Force-pushed c9d8cdc to 516c5d6 (compare)
ggerganov
left a comment
Ideally, the best logic is to create a checkpoint before each user message. But it might be more complicated to do that than the basic interval-based checkpointing proposed here.
If you are interested in improving this further, I think this can be implemented during chat formatting: emit a list of token positions based on the locations of the user messages, and then use this list to determine the checkpoint locations during prompt processing.
|
I have just stumbled across this comment: https://www.reddit.com/r/LocalLLaMA/comments/1rlkptk/comment/o8ta804/
Might or might not have some relevancy here. |
|
@ggerganov might try this, but that's largely nontrivial - when we parse the chat message, we first detokenize it, so at the point of parsing the parser doesn't know what the token positions are. I know @ngxson has been thinking of making the Jinja engine token-aware, but IIRC that's a work in progress as well. |
|
I suggest adding a timestamp to each cache checkpoint. For cache management of the same prompt, whenever the best checkpoint is restored, only its timestamp needs to be updated. In addition, checkpoints in the cache stack that have not been hit for a long time can be discarded first. |
Thank you for your explanation and your fantastic work. I have nothing to add except for disagreeing with "run very long context cheaply, so this eliminates the need for context truncation." Even with longer context, you will run into this issue eventually. Regardless of whether you use 10K context or 300K, context will fill up eventually and then you run into this issue; with more context it just takes longer to happen. To make matters worse, agentic workflows and vision tasks easily fill up context in no time. With non-hybrid models, you just keep cruising without having to worry about it. I'm really not a fan of the "just use more context" approach: even for very efficient models like Qwen 3.5, more context still adds more memory and needs more compute, which not everyone has, especially in these times. To be honest, I think the concept of having a context that eventually fills up and has to be reset is fundamentally flawed; imagine if the human brain worked like that. In 5 years I predict we will look at context in LLMs the same way we look at dial-up internet now. Anyways, thank you again for your great work. I realize there is nothing you can do about this. I hope Qwen will change the architecture in the future so at least a rolling context is viable again. But more so, for the future I hope the whole concept of having a context for LLMs will be completely overhauled. |
@Dampfinchen Just to make sure I understand, with non-recurrent models did you use the |
|
Sorry to bother you. I downloaded the latest code from the main branch and compiled it using ROCm, but it doesn't work with this parameter. After checking the code modified in this PR, I noticed that another parameter is used in the code. |
|
I just tested the new upstream commit with default settings, and as anticipated, my reproducer (available in my previous PR) shows a regression in truncated kvcache lookup time. Default 8k Settings:
Why would I care about 2 extra seconds after truncation?
The razor coding workflow gives the full source file that is going to be rewritten in point 1, but by ripping through the source with ripgrep in point 2, the input and output of the LLM (provided to the LLM after context) are as razor-thin as possible, greatly reducing the generation wait time and improving the accuracy of the output too. I never rewrite files in full; that's not workable with a dozen files to edit in one go, all 10k lines (not tokens) long. The Issue: Suggestion to avoid an extra ~2 sec delay after kvcache truncation if your workflow is similar
This is combined with context managed in LRU fashion, which also improves accuracy further. The files that are most frequently modified by llama.cpp are refiled at the end of context using a gptel-context-optimizer module hooking into file save/invalidate hooks, so the truncation most frequently happens at the end of the kvcache. Thank you for merging this configurable solution upstream so I can use these new models in addition to my devstral-2-small default pick! |
Yeah. When the prompt grows larger than the maximum context size set with -c and the UI, the UI connected to the llama.cpp server has to truncate older messages to keep a rolling chat window going. With non-recurrent models this is not an issue, as the cache-reuse function works as expected. But with hybrid models this is not supported. This means that every time I send a message to the AI, it has to reprocess the entire prompt again and again, which slows things down a lot.
But yeah, nothing llama.cpp can do about that. Perhaps some UI trickery could work, like truncating a huge chunk of the prompt at once when it gets near the maximum context; after that initial processing you'd be safe from reprocessing until the context is almost equal to the max context again and the thing repeats. I think koboldcpp's smart context does that.
|
I just want to thank you for your efforts. (Hope this comment does not bother anyone after the closed PR. :) ) |
Key fix: 17a4258 kv-cache: fix M-RoPE checkpoints (ggml-org#20132)
- Qwen3.5 uses M-RoPE (n_pos_per_embd > 1 via qwen35.rope.dimension_sections)
- Old checkpoint save/restore code omitted llama_kv_cell_ext data for M-RoPE
- Caused access violation (0xc0000005) in llama.dll during prompt cache update
- This was the root cause of repeated server crashes

Also includes: f5ddcd1 Checkpoint every n tokens (ggml-org#20087)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
|
Never seen this before, might be a regression introduced here: gpt-oss-120b Seems like an off-by-one issue, |
|
What is your server command? |
Nothing fancy, although the auto-fit will offload to CPU since I cannot load the model entirely in VRAM. No other logs, the loop filled my scrollback buffer. I'll troubleshoot a bit more and see if I can reliably reproduce. Wanted to share in case there was something obvious to the smarter people in the room. |
|
Yes, the logic is a bit flaky there - it's most likely an off-by-one error. With #20277 it will be easier to trace it down. |
Add an option to create checkpoints after processing every n batches during prompt processing. Hopefully solves #19794, #19298, #18497 and similar.
Usage:
llama-server -m model.gguf --checkpoint-every-nb 3 creates a checkpoint every 3 batches.