
server : make 2 checkpoints near the end of the prompt #20288

Merged

ggerganov merged 2 commits into master from gg/server-ckpt-near-end on Mar 10, 2026

Conversation

@ggerganov
Member

fix #20239 (comment)

In some cases, reprocessing the last 512 tokens of the prompt could be too slow. In other cases it is necessary in order to allow mutating the last user message.

Make 2 checkpoints to satisfy all needs.

@alex2robotic

alex2robotic commented Mar 9, 2026

> fix #20239 (comment)
>
> In some cases, reprocessing the last 512 tokens of the prompt could be too slow. In other cases it is necessary in order to allow mutating the last user message.
>
> Make 2 checkpoints to satisfy all needs.

Thanks, this helps a lot.

From reading the patch, my understanding is that the new logic intentionally keeps two near-end checkpoints:

  • one around n_ubatch before the end of the prompt
  • one around 64 tokens before the end

My guess is that this is a tradeoff between:

  1. reducing TTFT for replay/regenerate of the same request, and
  2. still allowing the last user message to be edited without falling back too far.

Is that the right way to think about it?

If so, what is the main practical benefit of still keeping a small tail (for example ~64 tokens) uncheckpointed, instead of trying to checkpoint all the way to the end of the prompt? Is that mainly for robustness when the last user message changes slightly or prompt boundaries shift?

Also, in my testing, this behavior does not show up with Qwen3-30B-A3B: replay/regenerate there can have near-zero prompt re-eval, while Qwen3.5-35B-A3B re-evaluates the tail much more noticeably. Is that difference mainly because Qwen3-30B-A3B can reuse the live KV/cache path more directly, while Qwen3.5-35B-A3B relies more on this checkpoint heuristic due to its hybrid/recurrent-like architecture?

Finally, do you think near-zero or zero prompt re-eval is achievable in principle for Qwen3.5 regenerate/replay, or is some small amount of tail reprocessing still expected by design?

@ggerganov
Member Author

> Is that the right way to think about it?

Yes, that's correct.

> If so, what is the main practical benefit of still keeping a small tail (for example ~64 tokens) uncheckpointed, instead of trying to checkpoint all the way to the end of the prompt? Is that mainly for robustness when the last user message changes slightly or prompt boundaries shift?

The main restriction is to guarantee that no "reasoning" tokens will get included in the checkpoint because they will be removed for the next user message. I guess 64 tokens is quite big, considering that typically only one "reasoning" token gets added (f.ex <think>). We should probably reduce this to 4?

@schynce

schynce commented Mar 9, 2026

Thanks a lot for this! I just tested this branch and it is working well and reduces the time to first token quite a lot :)

@ggerganov
Member Author

Thanks, could you confirm that changing the secondary checkpoint from 64 -> 4 tokens works OK?

@schynce

schynce commented Mar 9, 2026

> Thanks, could you confirm that changing the secondary checkpoint from 64 -> 4 tokens works OK?

It seems to be working okay with 4 as well for Qwen3.5 with the default template in both instruct and reasoning modes.

@ggerganov
Member Author

ggerganov commented Mar 9, 2026

@pwilkin @aldehir Is my assumption correct that "begin thinking" tokens that get appended after a new user message and before the generation starts, cannot be more than 4 (typically a single token)? Or do we know about models with longer "begin thinking" incantations?

@aldehir
Contributor

aldehir commented Mar 9, 2026

The longest I know is gpt-oss at 3 tokens: <|channel|>, analysis, <|message|>.

Seems like a safe assumption.

@ggerganov ggerganov marked this pull request as ready for review March 10, 2026 06:02
@ggerganov ggerganov requested a review from ngxson as a code owner March 10, 2026 06:02
@ggerganov ggerganov merged commit a7b3dee into master Mar 10, 2026
16 of 75 checks passed
@ggerganov ggerganov deleted the gg/server-ckpt-near-end branch March 10, 2026 12:28
ProgenyAlpha pushed a commit to ProgenyAlpha/llama.cpp that referenced this pull request Mar 12, 2026
* server : make 2 checkpoints near the end of the prompt

* cont : adjust checkpoints
Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026
* server : make 2 checkpoints near the end of the prompt

* cont : adjust checkpoints

Successfully merging this pull request may close these issues.

Qwen3.5 35B in llama-server keeps re-evaluating ~512 tail tokens on every turn
