
server : make 2 checkpoints near the end of the prompt #20288

Merged

ggerganov merged 2 commits into master from gg/server-ckpt-near-end on Mar 10, 2026

Conversation

@ggerganov
Member

fix #20239 (comment)

In some cases, reprocessing the last 512 tokens of the prompt could be too slow. In other cases it is necessary in order to allow mutating the last user message.

Make 2 checkpoints to satisfy all needs.

@alex2robotic

alex2robotic commented Mar 9, 2026

> fix #20239 (comment)
>
> In some cases, reprocessing the last 512 tokens of the prompt could be too slow. In other cases it is necessary in order to allow mutating the last user message.
>
> Make 2 checkpoints to satisfy all needs.

Thanks, this helps a lot.

From reading the patch, my understanding is that the new logic intentionally keeps two near-end checkpoints:

  • one around n_ubatch before the end of the prompt
  • one around 64 tokens before the end

My guess is that this is a tradeoff between:

  1. reducing TTFT for replay/regenerate of the same request, and
  2. still allowing the last user message to be edited without falling back too far.

Is that the right way to think about it?

If so, what is the main practical benefit of still keeping a small tail (for example ~64 tokens) uncheckpointed, instead of trying to checkpoint all the way to the end of the prompt? Is that mainly for robustness when the last user message changes slightly or prompt boundaries shift?

Also, in my testing, this behavior does not show up with Qwen3-30B-A3B: replay/regenerate there can have near-zero prompt re-eval, while Qwen3.5-35B-A3B re-evaluates the tail much more noticeably. Is that difference mainly because Qwen3-30B-A3B can reuse the live KV/cache path more directly, while Qwen3.5-35B-A3B relies more on this checkpoint heuristic due to its hybrid/recurrent-like architecture?

Finally, do you think near-zero or zero prompt re-eval is achievable in principle for Qwen3.5 regenerate/replay, or is some small amount of tail reprocessing still expected by design?

@ggerganov
Member Author

> Is that the right way to think about it?

Yes, that's correct.

> If so, what is the main practical benefit of still keeping a small tail (for example ~64 tokens) uncheckpointed, instead of trying to checkpoint all the way to the end of the prompt? Is that mainly for robustness when the last user message changes slightly or prompt boundaries shift?

The main restriction is to guarantee that no "reasoning" tokens will get included in the checkpoint because they will be removed for the next user message. I guess 64 tokens is quite big, considering that typically only one "reasoning" token gets added (f.ex <think>). We should probably reduce this to 4?

@schynce

schynce commented Mar 9, 2026

Thanks a lot for this! I just tested this branch and it is working well and reduces the time to first token quite a lot :)

@ggerganov
Member Author

Thanks, could you confirm that changing the secondary checkpoint from 64 -> 4 tokens works OK?

@schynce

schynce commented Mar 9, 2026

> Thanks, could you confirm that changing the secondary checkpoint from 64 -> 4 tokens works OK?

It seems to be working okay with 4 as well for Qwen3.5 with the default template in both instruct and reasoning modes.

@ggerganov
Member Author

ggerganov commented Mar 9, 2026

@pwilkin @aldehir Is my assumption correct that "begin thinking" tokens that get appended after a new user message and before the generation starts, cannot be more than 4 (typically a single token)? Or do we know about models with longer "begin thinking" incantations?

@aldehir
Contributor

aldehir commented Mar 9, 2026

The longest I know is gpt-oss at 3 tokens: <|channel|>, analysis, <|message|>.

Seems like a safe assumption.

@ggerganov ggerganov marked this pull request as ready for review March 10, 2026 06:02
@ggerganov ggerganov requested a review from ngxson as a code owner March 10, 2026 06:02
@ggerganov ggerganov merged commit a7b3dee into master Mar 10, 2026
16 of 75 checks passed
@ggerganov ggerganov deleted the gg/server-ckpt-near-end branch March 10, 2026 12:28
ProgenyAlpha pushed a commit to ProgenyAlpha/llama.cpp that referenced this pull request Mar 12, 2026
* server : make 2 checkpoints near the end of the prompt

* cont : adjust checkpoints
Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026
* server : make 2 checkpoints near the end of the prompt

* cont : adjust checkpoints

Successfully merging this pull request may close these issues.

Qwen3.5 35B in llama-server keeps re-evaluating ~512 tail tokens on every turn
