Hybrid model cache: add --checkpoint-every-nb #20087
Conversation
|
Thank you for your great work, as always. I wanted to ask: does this fix prompt caching not working after(!) the context size has been exceeded? Or is that a problem that is impossible to solve for RNN models? For non-RNN models the solution is simple: the beginning of the prompt is truncated by the UI and then sent to the backend to keep the chat rolling, and prompt caching still works. This, however, doesn't work with Qwen 3.5. IMO this is a dealbreaker for this kind of architecture, as not everyone can run 250k+ context (and even that fills up eventually), so you either start a new chat regularly or live with very long prompt processing on each enquiry. I am a bit confused about the many PRs around prompt caching for hybrid models, to be honest, as prompt caching for Qwen 3.5 has been working as expected for quite a while now, as long as (and that is the important part) you stay within your context window. |
|
@Dampfinchen to be honest, the very idea of the hybrid architecture is that you can run a very long context cheaply, so this eliminates the need for context truncation. Unfortunately, due to how recurrent states are constructed, there is no way to take a "partial range" of a recurrent model's cache - once any prefix is invalid, you have to reprocess. This solution is mostly intended for situations where agentic coders keep a certain prefix of the prompt, but not long enough for the default solution (n_tokens - 512) to work. The idea behind this is basically - build up snapshots incrementally, and if at any time any prefix is needed then one of the checkpoints should probably work. However, to reiterate, there is no possible solution for a situation where the prefix itself changes. For example, there's a long-standing issue with Claude Code which inserts a custom header that gets added to the beginning of the prompt each time - this breaks any sort of checkpointing. Likewise, if you have the exact datetime in your prompt, this will force reprocessing every time. There is nothing that can be done in those cases - those have to be fixed on the client side. |
Exactly this bug is how I ended up here. Just switched from Opencode to Claude Code this afternoon, and was wondering why Qwen3.5-122B is reprocessing every single prompt. Will this flag (--checkpoint-every-nb) be enabled by default, or does it have to be set explicitly? Thanks for the great work! |
I built it locally, and it looks like it is disabled by default. Here is the relevant section from I am still having to rebuild the cache in OpenCode after sending a new message after the context window gets above ~50,000 tokens (tried with The strange thing is that tool calls don't trigger this issue (using OpenCode), only manual messages. Makes me wonder if this is some sort of issue with OpenCode, or maybe some PEBKAC issue on my part. EDIT: I am successfully working around the issue with |
|
@pwilkin I don't think this flag is needed. Which case does it fix that is not already covered by the existing logic? |
|
@ggerganov From all the reports, I gathered that there are cases where the agent's reprocessing cuts off a larger part of the prompt while still keeping a reasonably large prefix. A notable example is when the agent doesn't pass reasoning content back to the model, so the last reasoning content gets cut off (and it's often more than 512 tokens). |
|
Another thing this helps with is keeping a checkpoint with all (or most) of the agent's fixed tools / instructions header, which can go up to 20k tokens in some cases. |
This change is not going to fix that - when the reasoning is removed, the prefix changes. So making checkpoints during reasoning will just increase memory usage without reducing computation. |
No, because in a typical agentic scenario, you're going to have something like this:

[AGENT INSTRUCTIONS][USER QUERY][AGENT REASONING][AGENT RESPONSE]

Now, if reasoning is removed, you'll have:

[AGENT INSTRUCTIONS][USER QUERY][AGENT RESPONSE][USER REQUERY]

Although some part is removed, you're still left with the [AGENT INSTRUCTIONS][USER QUERY] part, which can be quite large, esp. if files have been attached. |
|
I have a case where the prompt looks like |
|
The

I think a better example where the current logic on

With a fresh

Ideally, the best logic is to create a checkpoint before each user message. But it might be more complicated to do that than the basic interval-based checkpointing proposed here.

@smilediver If you have control over the client, you can send |
It is if you start a new conversation, yes. But if you resume a session, as people often do with those agents, then it doesn't work anymore. There are a lot of real-life edge cases here, which is why I thought that this approach would be the simplest to capture most of them. |
ggerganov
left a comment
I think we can bump the default number of checkpoints to 32 and enable this new functionality by default to make checkpoints every 8192 tokens.
You can set the env variable CLAUDE_CODE_ATTRIBUTION_HEADER to 0 to avoid the custom header in CC. |
|
FWIW, to keep it minimal, I used the logical batch size instead of a new config option: #19970. Hardcoding it to 8k isn't ideal. |
Yes, luckily gptel never alters the prompt, and the prompt performs much better after the context anyway... verified both with rg-edit and synthmerge_bench for all models except Gemini 2.5 with reasoning enabled. When you're editing files in context, you're inevitably going to truncate context. The gptel context LRU management PR I posted makes sure the most frequently edited files are pushed to the end of context for this reason; it gives an extra noticeable boost. |
Yeah, I know that one :) but what I mean is that if you don't use that option, there is literally no way to avoid reprocessing. |
|
@ggerganov aight, I changed the defaults to 32 checkpoints and checkpointing every 8192 tokens. |
Force-pushed 68b87a8 to 6874373 (compare)
Force-pushed 6874373 to c9d8cdc (compare)
Force-pushed c9d8cdc to 516c5d6 (compare)
ggerganov
left a comment
Ideally, the best logic is to create a checkpoint before each user message. But it might be more complicated to do that than the basic interval-based checkpointing proposed here.
If you are interested in improving this further, I think this can be implemented during chat formatting: emit a list of token positions based on the locations of the user messages, and then use this list to determine the checkpoint locations during prompt processing.
|
I have just stumbled across this comment: https://www.reddit.com/r/LocalLLaMA/comments/1rlkptk/comment/o8ta804/
Might or might not have some relevancy here. |
|
@ggerganov might try this, but that's largely nontrivial - when we parse the chat message, we first detokenize it, so at the point of parsing the parser doesn't know what the token positions are. I know @ngxson has been thinking of making the Jinja engine token-aware, but IIRC that's a work in progress as well. |
|
I suggest adding a timestamp to each cache checkpoint. For cache management of the same prompt, whenever the best checkpoint is restored, only its timestamp needs to be updated. In addition, checkpoints in the cache stack that have not been hit for a long time can be discarded first. |
Thank you for your explanation and your fantastic work. I have nothing to add except for disagreeing with "run very long context cheaply, so this eliminates the need for context truncation." Even with longer context, you will run into this issue eventually. Regardless of whether you use 10K context or 300K, context will fill up eventually and then you run into this issue; with more context it just takes longer to happen. To make matters worse, agentic workflows and vision tasks easily fill up context in no time. With non-hybrid models, you just keep cruising without having to worry about it. I'm really not a fan of the "just use more context" approach: even for very efficient models like Qwen 3.5, more context still adds more memory and needs more compute, which not everyone has, especially in these times. To be honest, I think the concept of having a context that eventually fills up and has to be reset is fundamentally flawed; imagine if the human brain worked like that. In 5 years I predict we will look at context in LLMs the same way we look at dial-up internet now. Anyways, thank you again for your great work. I realize there is nothing you can do about this. I hope Qwen will change the architecture in the future so at least a rolling context is viable again. But more so, for the future I hope the whole concept of having a context for LLMs will be completely overhauled. |
@Dampfinchen Just to make sure I understand, with non-recurrent models did you use the |
|
Sorry to bother you. I downloaded the latest code from the main branch and compiled it using ROCm, but it doesn't work with this parameter. After checking the code modified in this PR, I noticed that another parameter is used in the code. |
|
I just tested the new upstream commit with default settings, and as anticipated, my reproducer (available in my previous PR) shows a regression in truncated kvcache lookup time. Default 8k Settings:
Why would I care about 2 extra seconds after truncation?
The razor coding workflow gives the full source file that is going to be rewritten in point 1, but by ripping through the source with ripgrep in point 2, the input and output of the LLM (provided to the LLM after context) are as razor-thin as possible, greatly reducing the generation wait time and improving the accuracy of the output too. I never rewrite files in full; that's not workable with a dozen files to edit in one go, all 10k lines (not tokens) long. The Issue: Suggestion to avoid an extra ~2 sec delay after kvcache truncation if your workflow is similar
This is combined with context managed in LRU fashion, which also improves accuracy further. The files that are most frequently modified by llama.cpp are refiled at the end of context using a gptel-context-optimizer module hooking into file save/invalidate hooks, so the truncation most frequently happens at the end of the kvcache. Thank you for merging this configurable solution upstream so I can use these new models in addition to my devstral-2-small default pick! |
Yeah. When the prompt grows larger than the maximum context size set with -c and the UI, the UI connected to the llama.cpp server has to truncate older messages to keep a rolling chat window going. With non-recurrent models this is not an issue, as the cache-reuse function works as expected. But with hybrid models this is not supported. This means that every time I send a message to the AI, it has to reprocess the entire prompt again and again, which slows things down a lot.
But yeah, nothing llama.cpp can do about that. Perhaps some UI trickery could work, like truncating a huge chunk of the prompt at once when it gets near the maximum context; after that initial processing you'd be safe from reprocessing until the context is almost equal to the max context again and the thing repeats. I think koboldcpp's smart context does that.
|
I just want to thank you for your efforts. (Hope this comment does not bother anyone after the closed PR. :) ) |
Key fix: 17a4258 kv-cache: fix M-RoPE checkpoints (ggml-org#20132)
- Qwen3.5 uses M-RoPE (n_pos_per_embd > 1 via qwen35.rope.dimension_sections)
- Old checkpoint save/restore code omitted llama_kv_cell_ext data for M-RoPE
- Caused access violation (0xc0000005) in llama.dll during prompt cache update
- This was the root cause of repeated server crashes

Also includes: f5ddcd1 Checkpoint every n tokens (ggml-org#20087)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
|
Never seen this before, might be a regression introduced here: gpt-oss-120b Seems like an off-by-one issue, |
|
What is your server command? |
Nothing fancy, although the auto-fit will offload to CPU since I cannot load the model entirely in VRAM. No other logs, the loop filled my scrollback buffer. I'll troubleshoot a bit more and see if I can reliably reproduce. Wanted to share in case there was something obvious to the smarter people in the room. |
|
Yes, the logic is a bit flaky there - it's most likely an off-by-one error. With #20277 it will be easier to trace it down. |
Add an option to create checkpoints after processing every n batches during prompt processing. Hopefully solves #19794, #19298, #18497 and similar.
Usage:
llama-server -m model.gguf --checkpoint-every-nb 3 creates a checkpoint every 3 batches.