
UPSTREAM PR #20428: server: fix multi-turn cache reuse for hybrid/recurrent models #1248

Open
loci-dev wants to merge 2 commits into main from loci/pr-20428-hybrid-cache-reuse

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#20428

Summary

Multi-turn conversations with hybrid models (e.g. Qwen3.5-A3B) currently force full prompt reprocessing on every turn because:

  1. Short prompts never get checkpoints: The pos_max >= 64 minimum threshold prevents checkpoint creation for prompts shorter than ~68 tokens. For hybrid models, even short prompts need checkpoints since the recurrent state cannot be rolled back without one.

  2. seq_rm failure clears everything: When llama_memory_seq_rm fails (the recurrent memory has no cell-level checkpoint at the trim position), the server clears all memory instead of trying to restore from a server-level checkpoint.

Changes

  • Lower checkpoint thresholds for hybrid/recurrent models: Remove the pos_max >= 64 and spacing >= 64 minimums for hybrid/recurrent models while keeping them for SWA-only models. This ensures checkpoints are always created during prompt processing, enabling multi-turn reuse.

  • Restore checkpoint on seq_rm failure: When seq_rm fails, search for the nearest server-level checkpoint before the trim position and restore it, instead of clearing all memory. This is a safety net for cases where the primary checkpoint restore path (the pos_min > pos_min_thold check) doesn't trigger.

Test results (Qwen3.5-35B-A3B, macOS Apple Silicon)

Q4_K_XL quantization:

| Turn | Total prompt | Cached | New processed |
|------|--------------|--------|---------------|
| 1    | 31           | 0      | 31            |
| 2    | 48           | 27     | 21            |

Q8_K_XL quantization:

| Turn | Total prompt | Cached | New processed |
|------|--------------|--------|---------------|
| 1    | 31           | 0      | 31            |
| 2    | 48           | 27     | 21            |
| 3    | 65           | 44     | 21            |

Server logs confirm checkpoints are created and restored:

slot update_slots: created context checkpoint 1 of 32 (pos_min = 26, pos_max = 26, n_tokens = 27, size = 62.813 MiB)
slot update_slots: restored context checkpoint (pos_min = 26, pos_max = 26, n_tokens = 27, n_past = 27, size = 62.813 MiB)

Previously, the logs showed:

slot update_slots: forcing full prompt re-processing due to lack of cache data

Prior work and acknowledgments

This PR builds on the checkpoint infrastructure developed across several PRs:

  • #13194 @ggerganov — kv-cache: add SWA support (foundation)
  • #13833 @ggerganov — SWA cache sizing
  • #13979 @gabe-l-hart — hybrid recurrent cache abstraction
  • #15293 @ggerganov — server: add SWA checkpoints (introduced the checkpoint mechanism this PR extends)
  • #16382 @ddh0 — generalized context checkpointing to all hybrid/recurrent models
  • #17009 @gabe-l-hart — hybrid context shift for multi-turn
  • #18391 @o7si — fix server crash when seq_rm fails for hybrid models
  • #19408 @ggerganov — improved checkpoint logic
  • #20087 @pwilkin — --checkpoint-every-nb for finer-grained rollback
  • #20288 @ggerganov — make 2 checkpoints near end of prompt

Related: #20075 @eauchs — speculative decoding fixes for hybrid SSM/MoE (Qwen3.5)

When llama_memory_seq_rm fails (common for hybrid models like Qwen3.5-A3B
where recurrent memory lacks a cell-level checkpoint at the trim position),
try restoring the nearest server-level checkpoint instead of clearing all
memory and forcing full prompt reprocessing.

This enables multi-turn cache reuse for hybrid models: on turn 2+, the
server restores the checkpoint closest to the common prefix boundary
rather than reprocessing the entire conversation from scratch.
For hybrid models like Qwen3.5-A3B, even short prompts (< 64 tokens) need
checkpoints to enable multi-turn cache reuse. Without a checkpoint, the
recurrent state cannot be rolled back and every new turn forces full prompt
reprocessing.

Remove the pos_max >= 64 and spacing >= 64 thresholds for hybrid/recurrent
models while keeping them for SWA-only models.
@loci-review

loci-review bot commented Mar 12, 2026

No meaningful performance changes were detected across 119965 analyzed functions in the following binaries: build.bin.llama-tts, build.bin.llama-cvector-generator, build.bin.llama-bench, build.bin.libllama.so, build.bin.libmtmd.so, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@krystophny

Sorry, I commented in the wrong PR before.

loci-dev force-pushed the main branch 11 times, most recently from efc22ce to 945fa3a on March 19, 2026 at 02:18
loci-dev force-pushed the main branch 9 times, most recently from 8c39ead to 418d9f2 on March 26, 2026 at 02:17
loci-dev force-pushed the main branch 2 times, most recently from d997939 to 8527fd7 on March 27, 2026 at 02:17