
UPSTREAM PR #20428: server: fix multi-turn cache reuse for hybrid/recurrent models #1248

Open
loci-dev wants to merge 2 commits into main from loci/pr-20428-hybrid-cache-reuse

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#20428

Summary

Multi-turn conversations with hybrid models (e.g. Qwen3.5-A3B) currently force full prompt reprocessing on every turn because:

  1. Short prompts never get checkpoints: The pos_max >= 64 minimum threshold prevents checkpoint creation for prompts shorter than ~68 tokens. For hybrid models, even short prompts need checkpoints since the recurrent state cannot be rolled back without one.

  2. seq_rm failure clears everything: When llama_memory_seq_rm fails (the recurrent memory has no cell-level checkpoint at the trim position), the server clears all memory instead of trying to restore from a server-level checkpoint.

Changes

  • Lower checkpoint thresholds for hybrid/recurrent models: Remove the pos_max >= 64 and spacing >= 64 minimums for hybrid/recurrent models while keeping them for SWA-only models. This ensures checkpoints are always created during prompt processing, enabling multi-turn reuse.

  • Restore checkpoint on seq_rm failure: When seq_rm fails, search for the nearest server-level checkpoint before the trim position and restore it, instead of clearing all memory. This is a safety net for cases where the primary checkpoint restore path (the pos_min > pos_min_thold check) doesn't trigger.

Test results (Qwen3.5-35B-A3B, macOS Apple Silicon)

Q4_K_XL quantization:

| Turn | Total prompt | Cached | New processed |
|------|--------------|--------|---------------|
| 1    | 31           | 0      | 31            |
| 2    | 48           | 27     | 21            |

Q8_K_XL quantization:

| Turn | Total prompt | Cached | New processed |
|------|--------------|--------|---------------|
| 1    | 31           | 0      | 31            |
| 2    | 48           | 27     | 21            |
| 3    | 65           | 44     | 21            |

Server logs confirm checkpoints are created and restored:

slot update_slots: created context checkpoint 1 of 32 (pos_min = 26, pos_max = 26, n_tokens = 27, size = 62.813 MiB)
slot update_slots: restored context checkpoint (pos_min = 26, pos_max = 26, n_tokens = 27, n_past = 27, size = 62.813 MiB)

Previously, the logs showed:

slot update_slots: forcing full prompt re-processing due to lack of cache data

Prior work and acknowledgments

This PR builds on the checkpoint infrastructure developed across several PRs:

  • #13194 @ggerganov — kv-cache: add SWA support (foundation)
  • #13833 @ggerganov — SWA cache sizing
  • #13979 @gabe-l-hart — hybrid recurrent cache abstraction
  • #15293 @ggerganov — server: add SWA checkpoints (introduced the checkpoint mechanism this PR extends)
  • #16382 @ddh0 — generalized context checkpointing to all hybrid/recurrent models
  • #17009 @gabe-l-hart — hybrid context shift for multi-turn
  • #18391 @o7si — fix server crash when seq_rm fails for hybrid models
  • #19408 @ggerganov — improved checkpoint logic
  • #20087 @pwilkin — --checkpoint-every-nb for finer-grained rollback
  • #20288 @ggerganov — make 2 checkpoints near end of prompt

Related: #20075 @eauchs — speculative decoding fixes for hybrid SSM/MoE (Qwen3.5)

When llama_memory_seq_rm fails (common for hybrid models like Qwen3.5-A3B
where recurrent memory lacks a cell-level checkpoint at the trim position),
try restoring the nearest server-level checkpoint instead of clearing all
memory and forcing full prompt reprocessing.

This enables multi-turn cache reuse for hybrid models: on turn 2+, the
server restores the checkpoint closest to the common prefix boundary
rather than reprocessing the entire conversation from scratch.
For hybrid models like Qwen3.5-A3B, even short prompts (< 64 tokens) need
checkpoints to enable multi-turn cache reuse. Without a checkpoint, the
recurrent state cannot be rolled back and every new turn forces full prompt
reprocessing.

Remove the pos_max >= 64 and spacing >= 64 thresholds for hybrid/recurrent
models while keeping them for SWA-only models.
@loci-review

loci-review bot commented Mar 12, 2026

No meaningful performance changes were detected across 119965 analyzed functions in the following binaries: build.bin.llama-tts, build.bin.llama-cvector-generator, build.bin.llama-bench, build.bin.libllama.so, build.bin.libmtmd.so, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@krystophny

Sorry, I commented in the wrong PR before.

loci-dev force-pushed the main branch 11 times, most recently from efc22ce to 945fa3a on March 19, 2026 at 02:18
loci-dev force-pushed the main branch 9 times, most recently from 8c39ead to 418d9f2 on March 26, 2026 at 02:17
loci-dev force-pushed the main branch 2 times, most recently from d997939 to 8527fd7 on March 27, 2026 at 02:17