
Hybrid model cache: add --checkpoint-every-nb #20087

Merged
pwilkin merged 1 commit into ggml-org:master from pwilkin:batch-checkpoints
Mar 6, 2026

Conversation

@pwilkin
Contributor

@pwilkin pwilkin commented Mar 3, 2026

Add an option to create checkpoints after processing every n batches during prompt processing.

Hopefully solves #19794 #19298 #18497 and similar.

Usage: llama-server -m model.gguf --checkpoint-every-nb 3: creates a checkpoint every 3 batches.
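The batch-interval logic behind the flag can be sketched as follows (a minimal Python sketch of the idea, not the actual C++ server code; the function name is hypothetical):

```python
def should_checkpoint(batch_index: int, every_nb: int) -> bool:
    """Decide whether to snapshot the recurrent state after this batch.

    every_nb mirrors --checkpoint-every-nb; -1 (the initial default)
    disables periodic checkpoints entirely.
    """
    if every_nb <= 0:
        return False
    # Snapshot after every every_nb-th processed batch (1-based count).
    return (batch_index + 1) % every_nb == 0

# With --checkpoint-every-nb 3, checkpoints land after batches 3, 6, 9, ...
checkpoint_batches = [i + 1 for i in range(12) if should_checkpoint(i, 3)]
```

With the default logical batch size of 2048 tokens, `--checkpoint-every-nb 3` therefore yields a checkpoint roughly every 6144 tokens of prompt.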

@pwilkin pwilkin requested review from ggerganov and ngxson as code owners March 3, 2026 18:27
@pwilkin pwilkin changed the title Add --checkpoint-every-nb Hybrid model cache: add --checkpoint-every-nb Mar 3, 2026
@Dampfinchen

Dampfinchen commented Mar 3, 2026

Thank you for your great work, as always. I wanted to ask: does this fix prompt caching not working after(!) the context size has been exceeded? Or is that a problem that is impossible to solve for RNN models? For non-RNN models, the solution is simple: the beginning of the prompt is truncated by the UI and then sent to the backend to keep the chat rolling, and prompt caching still works. This, however, doesn't work with Qwen 3.5.

IMO this is a dealbreaker for this kind of architecture, as not everyone can run 250k+ context (and even that fills up eventually), so either you start a new chat regularly or live with very long prompt processing on each enquiry.

I am a bit confused about the many PRs around prompt caching for hybrid models to be honest, as prompt caching for Qwen 3.5 has been working as expected for quite a while now, as long as, and that is the important part, you stay within your context window.

@pwilkin
Contributor Author

pwilkin commented Mar 3, 2026

@Dampfinchen to be honest, the very idea of the hybrid architecture is that you can run a very long context cheaply, so this eliminates the need for context truncation. Unfortunately, due to how recurrent states are constructed, there is no way to take a "partial range" of a recurrent model's cache - once any prefix is invalid, you have to reprocess.

This solution is mostly intended for situations where agentic coders keep a certain prefix of the prompt, but not long enough for the default solution (n_tokens - 512) to work. The idea behind this is basically - build up snapshots incrementally, and if at any time any prefix is needed then one of the checkpoints should probably work.

However, to reiterate, there is no possible solution for a situation where the prefix itself changes. For example, there's a long-standing issue with Claude Code which inserts a custom header that gets added to the beginning of the prompt each time - this breaks any sort of checkpointing. Likewise, if you have the exact datetime in your prompt, this will force reprocessing every time. There is nothing that can be done in those cases - those have to be fixed on the client side.
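The "one of the checkpoints should probably work" idea can be illustrated with a small sketch (hypothetical helper, assuming checkpoints are identified by the number of tokens they cover):

```python
def best_checkpoint(checkpoint_sizes: list[int], valid_prefix: int) -> int:
    """Return the deepest checkpoint that still lies inside the valid prefix.

    A recurrent state cannot be partially restored, so any checkpoint
    taken past the first changed token is unusable; the best we can do
    is restore the deepest snapshot within the unchanged prefix and
    reprocess only the tokens after it.  Returns 0 if none is usable.
    """
    usable = [n for n in checkpoint_sizes if n <= valid_prefix]
    return max(usable, default=0)

# Snapshots were taken at 6144, 12288 and 18432 tokens; the client kept
# a 13000-token prefix, so we restore at 12288 and reprocess the
# remaining 712 tokens instead of all 13000.
resume_at = best_checkpoint([6144, 12288, 18432], 13000)
```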

@jhemmond

jhemmond commented Mar 4, 2026

there's a long-standing issue with Claude Code which inserts a custom header that gets added to the beginning of the prompt each time - this breaks any sort of checkpointing.

Exactly this bug is how I ended up here. Just switched from Opencode to Claude Code this afternoon, and was wondering why Qwen3.5-122B is reprocessing every single prompt. Will this flag (--checkpoint-every-nb) be enabled by default or has to be set explicitly?

Thanks for the great work!

@arcanemachine

arcanemachine commented Mar 4, 2026

Will this flag (--checkpoint-every-nb) be enabled by default or has to be set explicitly?

I built it locally, and it looks like it is disabled by default. Here is the relevant section from llama-server --help:

--checkpoint-every-nb N                 create a checkpoint every n batches during prefill (processing), -1 to
                                        disable (default: -1)                                                                

I am still having to rebuild the cache in OpenCode after sending a new message after the context window gets above ~50,000 tokens (tried with --checkpoint-every-nb 3, then also tried with that flag and --swa-full, still getting the cache issue):

srv  log_server_r: done request: POST /v1/chat/completions 192.168.1.100 200
slot print_timing: id  3 | task 3166 | 
prompt eval time =     534.69 ms /    96 tokens (    5.57 ms per token,   179.54 tokens per second)
       eval time =    2018.21 ms /    68 tokens (   29.68 ms per token,    33.69 tokens per second)
      total time =    2552.90 ms /   164 tokens
slot      release: id  3 | task 3166 | stop processing: n_tokens = 55002, truncated = 0
srv  update_slots: all slots are idle
srv  params_from_: Chat format: peg-constructed
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 0.918 (> 0.100 thold), f_keep = 0.918
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> ?top-p -> min-p -> ?xtc -> ?temp-ext -> dist 
slot launch_slot_: id  3 | task 3235 | processing task, is_child = 0
slot update_slots: id  3 | task 3235 | new prompt, n_ctx_slot = 120064, n_keep = 0, task.n_tokens = 55049
slot update_slots: id  3 | task 3235 | n_past = 50518, slot.prompt.tokens.size() = 55002, seq_id = 3, pos_min = 55001, n_swa = 1
slot update_slots: id  3 | task 3235 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  3 | task 3235 | erased invalidated context checkpoint (pos_min = 53491, pos_max = 53491, n_tokens = 53492, n_swa = 1, size = 75.376 MiB)
slot update_slots: id  3 | task 3235 | erased invalidated context checkpoint (pos_min = 53599, pos_max = 53599, n_tokens = 53600, n_swa = 1, size = 75.376 MiB)
slot update_slots: id  3 | task 3235 | erased invalidated context checkpoint (pos_min = 53689, pos_max = 53689, n_tokens = 53690, n_swa = 1, size = 75.376 MiB)
slot update_slots: id  3 | task 3235 | erased invalidated context checkpoint (pos_min = 53898, pos_max = 53898, n_tokens = 53899, n_swa = 1, size = 75.376 MiB)
slot update_slots: id  3 | task 3235 | erased invalidated context checkpoint (pos_min = 53981, pos_max = 53981, n_tokens = 53982, n_swa = 1, size = 75.376 MiB)
slot update_slots: id  3 | task 3235 | erased invalidated context checkpoint (pos_min = 54383, pos_max = 54383, n_tokens = 54384, n_swa = 1, size = 75.376 MiB)
slot update_slots: id  3 | task 3235 | erased invalidated context checkpoint (pos_min = 54529, pos_max = 54529, n_tokens = 54530, n_swa = 1, size = 75.376 MiB)
slot update_slots: id  3 | task 3235 | erased invalidated context checkpoint (pos_min = 54838, pos_max = 54838, n_tokens = 54839, n_swa = 1, size = 75.376 MiB)
slot update_slots: id  3 | task 3235 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  3 | task 3235 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.037203
slot update_slots: id  3 | task 3235 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id  3 | task 3235 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 2048, progress = 0.074406
slot update_slots: id  3 | task 3235 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id  3 | task 3235 | prompt processing progress, n_tokens = 6144, batch.n_tokens = 2048, progress = 0.111610
slot update_slots: id  3 | task 3235 | n_tokens = 6144, memory_seq_rm [6144, end)
slot update_slots: id  3 | task 3235 | 3 batches since last checkpoint at 0, creating new checkpoint during processing at position 8192
slot update_slots: id  3 | task 3235 | prompt processing progress, n_tokens = 8192, batch.n_tokens = 2048, progress = 0.148813
slot update_slots: id  3 | task 3235 | created context checkpoint 1 of 8 (pos_min = 6143, pos_max = 6143, n_tokens = 6144, size = 75.376 MiB)
...

The strange thing is that tool calls don't trigger this issue (using OpenCode), only manual messages. Makes me wonder if this is some sort of issue with OpenCode, or maybe some PEBKAC issue on my part.

EDIT: I am successfully working around the issue with --ctx-checkpoints 128.

@allkhor

allkhor commented Mar 4, 2026

@jhemmond, take a look at #20003 — maybe this will help you.

@ggerganov
Member

@pwilkin I don't think this flag is needed. Which case does it fix that is not already covered by the existing logic?

@pwilkin
Contributor Author

pwilkin commented Mar 4, 2026

@ggerganov From all the reports, I gathered that there are cases where the agent's reprocessing cuts off a larger part of the prompt while still keeping a reasonably large prefix. A notable example is when the agent doesn't pass reasoning content back to the model, so the last reasoning content gets cut off (and it's often more than 512 tokens).

@pwilkin
Contributor Author

pwilkin commented Mar 4, 2026

Another thing this helps with is keeping a checkpoint with all (or most) of the agent's fixed tools / instructions header, which can go up to 20k tokens in some cases.

@whoreson
Contributor

whoreson commented Mar 4, 2026

@pwilkin I don't think this flag is needed. Which case does it fix that is not already covered by the existing logic?

I explained this 4 months ago here: #17428

@ggerganov
Member

a notable example of such cases is when the agent doesn't pass back reasoning content to the model, so the last reasoning content gets cut off (and it's often more than 512 tokens).

This change is not going to fix that - when the reasoning is removed, the prefix changes. So making checkpoints during reasoning will just increase memory usage without reducing computation.

@pwilkin
Contributor Author

pwilkin commented Mar 4, 2026

This change is not going to fix that - when the reasoning is removed, the prefix changes. So making checkpoints during reasoning will just increase memory usage without reducing computation.

No, because in a typical agentic scenario, you're going to have something like this:

[AGENT INSTRUCTIONS][USER QUERY][AGENT REASONING][AGENT RESPONSE]

Now, if reasoning is removed, you'll have

[AGENT INSTRUCTIONS][USER QUERY][AGENT RESPONSE][USER REQUERY]

Although some part is removed, you're still left with the [AGENT INSTRUCTIONS][USER QUERY] part, which can be quite large, esp. if files have been attached.

@smilediver

I have a case where the prompt looks like [append_part][dynamic_part], where the model is called with the same append_part and different dynamic_part several times (different queries for the same data). After that, append_part is appended and the process is repeated.

@ggerganov
Member

The [AGENT INSTRUCTIONS][USER QUERY] is checkpointed with the existing logic, so it's not a good example - it should still work even without this PR.

I think a better example where the current logic on master would fail is if you continue an existing session that looks like this for example:

system
user0
assistant0
user1
assistant1
...
userN
assistantN
userN+1

With a fresh llama-server we will create only one checkpoint right before userN+1. Now if we want to go back to an earlier message (e.g. userN-1) and branch from there, we won't have checkpoint data (because the llama-server has just started). I guess this is something that can happen often with OpenCode, so it's good to support it.

Ideally, the best logic is to create a checkpoint before each user message. But it might be more complicated to do that than the basic interval-based checkpointing proposed here.

@smilediver If you have control over the client, you can send [append_part] alone once at the start to force a checkpoint creation and then follow up with the different dynamic parts. This is something we utilize in the llama.vim plugin to reduce computation.
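That two-phase trick can be sketched as client code (hypothetical payloads for the OpenAI-compatible /v1/chat/completions endpoint that llama-server exposes; the sketch only builds the request bodies and makes no network call):

```python
def build_requests(append_part: str, dynamic_parts: list[str]):
    """Phase 1: send the shared prefix alone so the server processes and
    checkpoints it.  Phase 2: each real query reuses that cached prefix."""
    warmup = {
        "messages": [{"role": "user", "content": append_part}],
        "max_tokens": 1,  # generate almost nothing; we only want the cache
    }
    queries = [
        {"messages": [{"role": "user", "content": append_part + part}]}
        for part in dynamic_parts
    ]
    return warmup, queries

warmup, queries = build_requests("SHARED DATA\n", ["query 1", "query 2"])
```

The same pattern works for any client that can issue a cheap warm-up request before its real queries, as llama.vim does.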

@pwilkin
Contributor Author

pwilkin commented Mar 4, 2026

The [AGENT INSTRUCTIONS][USER QUERY] is checkpointed with the existing logic, so it's not a good example - it should still work even without this PR.

It is if you start a new conversation, yes. But if you resume a session, as people often do with these agents, then it doesn't work anymore. There are a lot of real-life edge cases here, which is why I thought this approach would be the simplest way to capture most of them.

Member

@ggerganov ggerganov left a comment


I think we can bump the default number of checkpoints to 32 and enable this new functionality by default to make checkpoints every 8192 tokens.

@LucaAmigoni

However, to reiterate, there is no possible solution for a situation where the prefix itself changes. For example, there's a long-standing issue with Claude Code which inserts a custom header that gets added to the beginning of the prompt each time - this breaks any sort of checkpointing. Likewise, if you have the exact datetime in your prompt, this will force reprocessing every time. There is nothing that can be done in those cases - those have to be fixed on the client side.

You can set the env variable CLAUDE_CODE_ATTRIBUTION_HEADER to 0 to avoid the custom header in CC.

@aagit
Contributor

aagit commented Mar 4, 2026

FWIW, to keep it minimal, I used the logical batch size instead of a new config option in #19970. Hardcoded to 8k isn't ideal.

@aagit
Contributor

aagit commented Mar 4, 2026

However, to reiterate, there is no possible solution for a situation where the prefix itself changes. For example, there's a long-standing issue with Claude Code which inserts a custom header that gets added to the beginning of the prompt each time - this breaks any sort of checkpointing. Likewise, if you have the exact datetime in your prompt, this will force reprocessing every time. There is nothing that can be done in those cases - those have to be fixed on the client side.

Yes, luckily gptel never alters the prompt, and the prompt performs much better after the context anyway; verified both with rg-edit and synthmerge_bench for all models except Gemini 2.5 with reasoning enabled. When you're editing files in context, you're inevitably going to truncate the context. The gptel context LRU management PR I posted makes sure the most frequently edited files are pushed to the end of the context for this reason, which gives an extra noticeable boost.

@pwilkin
Contributor Author

pwilkin commented Mar 5, 2026

You can set the env variable CLAUDE_CODE_ATTRIBUTION_HEADER to 0 to avoid the custom header in CC.

Yeah, I know that one :) but what I mean is that if you don't use that option, there is literally no way to avoid reprocessing.

@pwilkin
Contributor Author

pwilkin commented Mar 5, 2026

@ggerganov aight, I changed the defaults to 32 checkpoints and checkpointing every 8192 tokens.

@pwilkin pwilkin force-pushed the batch-checkpoints branch from 68b87a8 to 6874373 Compare March 5, 2026 22:58
@pwilkin pwilkin requested a review from aldehir as a code owner March 5, 2026 22:58
@pwilkin pwilkin force-pushed the batch-checkpoints branch from 6874373 to c9d8cdc Compare March 5, 2026 23:00
@pwilkin pwilkin force-pushed the batch-checkpoints branch from c9d8cdc to 516c5d6 Compare March 5, 2026 23:00
Member

@ggerganov ggerganov left a comment


Ideally, the best logic is to create a checkpoint before each user message. But it might be more complicated to do that than the basic interval-based checkpointing proposed here.

If you are interested to improve this further, I think this can be implemented during chat formatting to emit a list of token positions based on the locations of the user messages. And then use this list to determine the checkpoint locations during prompt processing.
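That idea, deriving checkpoint positions from the locations of user messages during chat formatting, could look roughly like this (all names hypothetical; the real implementation would live in the C++ chat-formatting code):

```python
def checkpoint_positions(turns: list[tuple[str, int]]) -> list[int]:
    """Token positions at which to checkpoint: right before each user
    message (skipping a user turn that starts the prompt), so the server
    can branch from any earlier turn without a full reprocess.

    `turns` is a list of (role, token_count) pairs in prompt order.
    """
    positions, pos = [], 0
    for role, n_tokens in turns:
        if role == "user" and pos > 0:
            positions.append(pos)
        pos += n_tokens
    return positions

# system (100 tok) + user0 (50) + assistant0 (200) + user1 (40):
# checkpoints would go right before user0 (pos 100) and user1 (pos 350).
positions = checkpoint_positions(
    [("system", 100), ("user", 50), ("assistant", 200), ("user", 40)]
)
```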

@Dampfinchen

Dampfinchen commented Mar 6, 2026

I have just stumbled across this comment: https://www.reddit.com/r/LocalLLaMA/comments/1rlkptk/comment/o8ta804/

I noticed Qwen always reprocesses its last response; this is because the chat template is configured to not include previous thinking in the context. This is fine, it's a tradeoff and that's how Qwen was trained.

The problem is, the same happens with thinking disabled. I think the cause is that even with thinking disabled there are empty think tags in the template, which are also removed from the conversation, causing a context shift and forcing reprocessing.

It would be nice to modify the template to keep empty tags in the conversation history.

Might or might not have some relevancy here.

@pwilkin
Contributor Author

pwilkin commented Mar 6, 2026

@ggerganov might try this, but that's largely nontrivial - when we parse the chat message, we first detokenize it, so at the point of parsing the parser doesn't know what the token positions are. I know @ngxson has been thinking of making the Jinja engine token-aware, but IIRC that's a work in progress as well.

@pwilkin pwilkin merged commit f5ddcd1 into ggml-org:master Mar 6, 2026
78 checks passed
@JamePeng

JamePeng commented Mar 6, 2026

I suggest adding a timestamp to each cache entry. For cache management of identical prompts, every time the best cache is restored, only its timestamp needs to be updated. In addition, cache blocks that have not been hit for a long time can be discarded first.
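A minimal sketch of that suggestion (hypothetical class; timestamps are passed in explicitly for clarity):

```python
class CheckpointLRU:
    """Keep a bounded pool of checkpoints, evicting the one that has not
    been hit (restored) for the longest time."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.last_hit = {}  # checkpoint position -> timestamp of last hit

    def touch(self, pos: int, now: float) -> None:
        # Restoring (or creating) a checkpoint only updates its timestamp.
        self.last_hit[pos] = now
        if len(self.last_hit) > self.capacity:
            # Discard the checkpoint that has gone longest without a hit.
            coldest = min(self.last_hit, key=self.last_hit.get)
            del self.last_hit[coldest]

lru = CheckpointLRU(capacity=2)
lru.touch(6144, now=1.0)
lru.touch(12288, now=2.0)
lru.touch(6144, now=3.0)   # a cache hit refreshes 6144
lru.touch(18432, now=4.0)  # pool full: 12288 is coldest and is evicted
```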

@Dampfinchen

Dampfinchen commented Mar 6, 2026

@Dampfinchen to be honest, the very idea of the hybrid architecture is that you can run a very long context cheaply, so this eliminates the need for context truncation. Unfortunately, due to how recurrent states are constructed, there is no way to take a "partial range" of a recurrent model's cache - once any prefix is invalid, you have to reprocess.

This solution is mostly intended for situations where agentic coders keep a certain prefix of the prompt, but not long enough for the default solution (n_tokens - 512) to work. The idea behind this is basically - build up snapshots incrementally, and if at any time any prefix is needed then one of the checkpoints should probably work.

However, to reiterate, there is no possible solution for a situation where the prefix itself changes. For example, there's a long-standing issue with Claude Code which inserts a custom header that gets added to the beginning of the prompt each time - this breaks any sort of checkpointing. Likewise, if you have the exact datetime in your prompt, this will force reprocessing every time. There is nothing that can be done in those cases - those have to be fixed on the client side.

Thank you for your explanation and your fantastic work. I have nothing to add except for disagreeing with "run very long context cheaply, so this eliminates the need for context truncation."

Even with longer context, you will run into this issue eventually. Whether you use 10K context or 300K, the context will fill up eventually and then you run into this issue; with more context, it just takes longer to happen. To make matters worse, agentic workflows and vision tasks easily fill up context in no time. With non-hybrid models, you just keep cruising, not having to worry about it.

I'm really not a fan of the "just use more context" approach; even for very efficient models like Qwen 3.5, more context still adds more memory and needs more compute, which not everyone has, especially in these times. To be honest, I think the concept of having a context that eventually fills up and has to be reset is fundamentally flawed; imagine if the human brain worked like that. In 5 years, I predict we will look at context in LLMs the same way we look at dial-up internet now.

Anyways, thank you again for your great work. I realize there is nothing you can do about this. I hope Qwen will change the architecture in the future so that at least a rolling context is viable again. But more so, I hope that in the future the whole concept of having a context for LLMs will be completely overhauled.

@ggerganov
Member

ggerganov commented Mar 6, 2026

With non hybrid models, you just keep cruising not having to worry about it.

@Dampfinchen Just to make sure I understand, with non-recurrent models did you use the --cache-reuse functionality in order to avoid the reprocessing when the context becomes full?

@IIIIIllllIIIIIlllll

Sorry to bother you. I downloaded the latest code from the main branch and compiled it with ROCm, but it doesn't work with this parameter. After checking the code modified in this PR, I noticed that a different parameter name is used in the code.
Am I missing something?

llama-server -m /home/mark/Models/Q8/Qwen3.5-35B-A3B-Q8_0/Qwen3.5-35B-A3B-UD-Q8_K_XL.gguf 
--port 8088 
--mmproj /home/mark/Models/Q8/Qwen3.5-35B-A3B-Q8_0/mmproj-F32.gguf 
--ctx-size 262144 
--flash-attn on --no-mmap 
--temp 1 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 --repeat-penalty 1 --frequency-penalty 0.0 
--batch-size 4096 --ubatch-size 4096 
--parallel 4 
--cache-ram -1 
--cache-type-k f16 --cache-type-v f16 
--threads -1 --seed -1 -dio --no-webui 
--reasoning-budget -1 
--swa-full 
--checkpoint-every-nb 3 
--chat-template-file /home/mark/App/llama.cpp/cache/Qwen3.5-35B-A3B-Q8_0.jinja 
--metrics 
--slot-save-path /home/mark/App/llama.cpp/cache --alias Qwen3.5-35B-A3B-Q8_0 
--timeout 36000 --host 0.0.0.0

@aagit
Contributor

aagit commented Mar 6, 2026

I just tested the new upstream commit with default settings, and as anticipated, my reproducer (available in my previous PR) shows a regression in truncated kvcache lookup time.

Default 8k Settings:

kvcache preload: 32.5362 seconds
truncated kvcache lookup: 3.3391 seconds
kvcache preload: 32.5151 seconds
truncated kvcache lookup: 3.3146 seconds

--checkpoint-every-n-tokens 2048 restores the performance to the level of my previous setup:

kvcache preload: 33.0262 seconds
truncated kvcache lookup: 0.7092 seconds

Why would I care about 2 extra seconds after truncation?

  1. Context Ingestion: I add source files (and relevant buffers) to the context using gptel-add-file/gptel-add via a per-project function (after clearing the context).
  2. Refactoring: I traverse a ~100k-file project using ripgrep-edit with regex patterns. I have llama.cpp rewrite the rg-edit buffer under the control of a dynamically generated GBNF grammar, with two algorithmically picked control lines at the start and end of each snippet.
  3. Flush: I close the ripgrep buffer, flushing the changes.

The razor-thin coding workflow gives the full source file that is going to be rewritten in point 1, but by ripping through the source with ripgrep in point 2, the input and output of the LLM (provided to the LLM after the context) are as razor-thin as possible, greatly reducing the generation wait time and improving the accuracy of the output too. I never rewrite files in full; that's not workable with dozens of files to edit in one go, all 10k lines (not tokens) long.

The Issue:
Step 3 inevitably truncates the context, so the extra 2 sec wait is added to most new requests.

Suggestion to avoid an extra ~2 sec delay after kvcache truncation if your workflow is similar

  • --checkpoint-every-n-tokens 2048
  • --ctx-checkpoints (increase as needed, e.g., to 64 to cover the last 128k of context)

This is combined with context managed in LRU fashion, which also improves accuracy further. The files most frequently modified by llama.cpp are refiled at the end of the context using a gptel-context-optimizer module hooking into file save/invalidate hooks, so the truncation most frequently happens at the end of the kvcache.

Thank you for merging this configurable solution upstream so I can use these new models in addition to my devstral-2-small default pick!
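The memory/coverage tradeoff behind those two flags is simple arithmetic (the per-checkpoint size here is taken from the ~75.4 MiB figure in the logs earlier in the thread; it will differ per model):

```python
def checkpoint_plan(n_checkpoints: int, every_n_tokens: int,
                    ckpt_mib: float) -> tuple[int, float]:
    """Tokens of recent context covered by the checkpoint pool, and the
    memory the pool costs at a given per-checkpoint state size."""
    coverage_tokens = n_checkpoints * every_n_tokens
    memory_mib = n_checkpoints * ckpt_mib
    return coverage_tokens, memory_mib

# 64 checkpoints every 2048 tokens cover the last 128k tokens of context:
coverage, memory = checkpoint_plan(64, 2048, 75.4)
```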

@Dampfinchen

Dampfinchen commented Mar 6, 2026

With non hybrid models, you just keep cruising not having to worry about it.

@Dampfinchen Just to make sure I understand, with non-recurrent models did you use the --cache-reuse functionality in order to avoid the reprocessing when the context becomes full?

Yeah. When the prompt is larger than the maximum context size set with -c and the UI, the UI connected to the llama.cpp server has to truncate older messages to keep a rolling chat window going. With non-recurrent models this is not an issue, as the cache-reuse function works as expected. But with hybrid models this is not supported. This means that every time I send a message to the AI, it has to reprocess the entire prompt again and again, which slows things down a lot.

`forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)`

But yeah, there's nothing llama.cpp can do about that. Perhaps some UI trickery could work, like truncating a huge chunk of the prompt at once when it gets near the maximum context; then, after that initial processing, you'd be safe from reprocessing until the context is almost equal to the max again and the cycle repeats. I think koboldcpp's smart context does that.

@Petrox

Petrox commented Mar 7, 2026

I just want to thank you for your efforts. (Hope this comment does not bother anyone after the closed PR. :) )

birdingman0626 added a commit to birdingman0626/llama.cpp_blackwell that referenced this pull request Mar 8, 2026
Key fix: 17a4258 kv-cache: fix M-RoPE checkpoints (ggml-org#20132)
- Qwen3.5 uses M-RoPE (n_pos_per_embd > 1 via qwen35.rope.dimension_sections)
- Old checkpoint save/restore code omitted llama_kv_cell_ext data for M-RoPE
- Caused access violation (0xc0000005) in llama.dll during prompt cache update
- This was the root cause of repeated server crashes

Also includes: f5ddcd1 Checkpoint every n tokens (ggml-org#20087)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@aldehir
Contributor

aldehir commented Mar 9, 2026

Never seen this before, might be a regression introduced here:

gpt-oss-120b
https://cdn.alde.dev/llama.cpp/videos/pp-loop.mp4

Seems like an off-by-one issue, 509/508 = 1.001969.

@ggerganov
Member

What is your server command?

@aldehir
Contributor

aldehir commented Mar 9, 2026

What is your server command?

Nothing fancy, although the auto-fit will offload to CPU since I cannot load the model entirely in VRAM. No other logs, the loop filled my scrollback buffer.

llama-server.exe -m E:\Models\gpt-oss-120b\gpt-oss-120b-F16.gguf -c 64000

ae87863

I'll troubleshoot a bit more and see if I can reliably reproduce. Wanted to share in case there was something obvious to the smarter people in the room.

@ggerganov
Member

Yes, the logic is a bit flaky there - it's most likely an off-by-one error.

With #20277 it will be easier to trace it down.
