server : make 2 checkpoints near the end of the prompt#20288
Conversation
Thanks, this helps a lot. From reading the patch, my understanding is that the new logic intentionally keeps two near-end checkpoints:
My guess is that this is a tradeoff between:
Is that the right way to think about it? If so, what is the main practical benefit of still keeping a small tail (for example ~64 tokens) uncheckpointed, instead of trying to checkpoint all the way to the end of the prompt? Is that mainly for robustness when the last user message changes slightly or prompt boundaries shift?

Also, in my testing, this behavior does not show up with Qwen3-30B-A3B: replay/regenerate there can have near-zero prompt re-eval, while Qwen3.5-35B-A3B re-evaluates the tail much more noticeably. Is that difference mainly because Qwen3-30B-A3B can reuse the live KV/cache path more directly, while Qwen3.5-35B-A3B relies more on this checkpoint heuristic due to its hybrid/recurrent-like architecture?

Finally, do you think near-zero or zero prompt re-eval is achievable in principle for Qwen3.5 regenerate/replay, or is some small amount of tail reprocessing still expected by design?
Yes, that's correct.
The main restriction is to guarantee that no "reasoning" tokens get included in the checkpoint, because they will be removed for the next user message. I guess 64 tokens is quite big, considering that typically only one "reasoning" token gets added (f.ex …)
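To make the constraint concrete, here is a minimal illustrative sketch (not llama.cpp code; the function name and token values are hypothetical): a checkpoint taken at position `pos` is only reusable if the new prompt's first `pos` tokens are identical, so a checkpoint placed past a "reasoning" token that gets stripped on the next turn is wasted.

```python
# Illustrative sketch: a checkpoint at `checkpoint_pos` is reusable only if
# the new prompt's first `checkpoint_pos` tokens are unchanged.
def reusable(checkpoint_pos: int, old_tokens: list[int], new_tokens: list[int]) -> bool:
    return (checkpoint_pos <= len(new_tokens)
            and old_tokens[:checkpoint_pos] == new_tokens[:checkpoint_pos])

old = [1, 2, 3, 4, 9]      # 9 stands in for a trailing "reasoning" token
new = [1, 2, 3, 4, 5, 6]   # next turn: reasoning token removed, new text appended
print(reusable(4, old, new))  # checkpoint placed before the reasoning token -> True
print(reusable(5, old, new))  # checkpoint covering the reasoning token -> False
```

This is why the secondary checkpoint keeps a small uncheckpointed tail: it must stay strictly before any tokens that could be rewritten or removed on the next turn.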
Thanks a lot for this! I just tested this branch and it is working well and reduces the time to first token quite a lot :)
Thanks, could you confirm that changing the secondary checkpoint from 64 -> 4 tokens works OK?
It seems to be working okay with 4 as well for Qwen3.5 with the default template in both instruct and reasoning modes. |
The longest I know of is gpt-oss, at 3 tokens. Seems like a safe assumption.
* server : make 2 checkpoints near the end of the prompt * cont : adjust checkpoints
fix #20239 (comment)
In some cases, reprocessing the last 512 tokens of the prompt can be too slow. In other cases, that reprocessing is necessary in order to allow mutating the last user message.
Make 2 checkpoints to satisfy both needs.
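The placement logic can be sketched roughly as follows (a minimal illustration, not the actual llama.cpp implementation; the offsets and function name are illustrative, using 512 tokens for the far checkpoint and 4 for the near one as discussed above):

```python
# Hypothetical sketch of the two-checkpoint placement.
FAR_OFFSET = 512   # coarse checkpoint: survives larger edits to the tail
NEAR_OFFSET = 4    # fine checkpoint: leaves only a tiny tail to re-evaluate

def checkpoint_positions(n_prompt_tokens: int) -> list[int]:
    """Return token positions at which to snapshot the state, skipping
    positions that would fall before the start of the prompt."""
    positions = []
    for offset in (FAR_OFFSET, NEAR_OFFSET):
        pos = n_prompt_tokens - offset
        if pos > 0 and pos not in positions:
            positions.append(pos)
    return positions

print(checkpoint_positions(2048))  # -> [1536, 2044]
print(checkpoint_positions(100))   # -> [96] (prompt too short for the far checkpoint)
```

On regenerate, the near checkpoint means only the last few tokens need re-evaluation; if the user mutates the last message more substantially, the far checkpoint still avoids reprocessing the whole prompt.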