fix: speculative decoding broken on hybrid SSM/MoE (Qwen3.5 MoE)#20075
eauchs wants to merge 4 commits into ggml-org:master
Conversation
I used Claude to build the latest llama.cpp with this fix and it works, but I don't know how you're getting 63-89% acceptance, since I'm only getting 44% and a bit less than half the t/s on both 27B UD-Q4_K_XL and 122B UD-Q3_K_XL with the 0.8B UD-Q4_K_XL draft. I've also encountered issues with both 27B and 122B looping when using a draft model.
Can you share your logs?

prompt eval time = 207.90 ms / 13 tokens ( 15.99 ms per token, 62.53 tokens per second)
Yep, that explains it right there. The prompt is "Write me a Flappy Bird clone entirely in a single HTML File." for the runs below:

- Qwen3.5 122B
- Qwen3.5 27B
- Qwen3.5 27B, temp set to 0.1 and min_p set to 0.01
- Qwen3.5 27B, Official Precise Coding Settings (temp=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0)
I asked Claude to look at the commit and see if it spots anything, since I certainly can't lol, and here's what it says if it helps any.
For what it's worth, I bumped context to 130k and had Qwen3.5 27B (without a draft model, with the Precise settings) create the diffs in plain text from the HTML for Claude, since it couldn't access the commit.
The soft rollback path (cells[tail_id].pos = p0 - 1) only updated position metadata, leaving SSM tensor state (r_l/s_l) reflecting the post-speculative position. This caused silent state corruption and looping on speculative decoding rejection for recurrent/hybrid models (e.g. Qwen3.5 MoE 27B).

seq_rm now returns false when no checkpoint exists at p0-1, correctly signaling to the caller that rollback requires re-evaluation. The hybrid memory layer already propagates false correctly.

Also add a LLAMA_LOG_DEBUG when the 0.9 cache threshold prevents checkpoint creation, making the behavior visible rather than silent.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
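The return-false contract described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the struct layout, `checkpoints` map, and helper names are assumptions; only the control flow (return false when no checkpoint exists at p0-1, rather than rewinding `pos` alone) mirrors the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using llama_pos = int32_t;

struct recurrent_mem_sketch {
    // hypothetical layout: checkpointed positions per sequence
    std::map<int, std::vector<llama_pos>> checkpoints;

    bool has_checkpoint(int seq_id, llama_pos pos) const {
        auto it = checkpoints.find(seq_id);
        if (it == checkpoints.end()) return false;
        for (llama_pos p : it->second) {
            if (p == pos) return true;
        }
        return false;
    }

    // Remove tokens [p0, p1) from seq_id. Instead of the old "soft rollback"
    // that only rewound cell.pos (leaving the SSM tensor state r_l/s_l stale),
    // return false when no checkpoint exists at p0-1, so the caller knows it
    // must re-evaluate instead of continuing from corrupted state.
    bool seq_rm(int seq_id, llama_pos p0, llama_pos p1) {
        (void) p1;
        if (p0 > 0 && !has_checkpoint(seq_id, p0 - 1)) {
            return false; // rollback impossible: caller must re-evaluate
        }
        // ... restore tensor state from the checkpoint at p0-1 here ...
        return true;
    }
};
```

The key point is that a `false` return is a legitimate answer the caller must handle (by re-evaluating), not an error.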
Thanks for the detailed testing. The acceptance rate difference is expected; it varies with sampling settings and --draft-max.

The looping on 27B was a real bug: the soft rollback path (cells[tail_id].pos = p0 - 1) only rewound position metadata, leaving the SSM tensor state stale.

Fix in latest commit: seq_rm now returns false when no checkpoint exists at p0-1, so the caller re-evaluates instead of continuing from corrupted state. Also added a LLAMA_LOG_DEBUG when the 0.9 threshold prevents checkpoint creation.

Can you retest the 27B looping case with this commit?
…eq_rm"

This reverts commit 9a04ac4.
The checkpoint mechanism in find_slot only triggered when a sequence moved to a new cell (has_cell=false), which never occurs during normal single-sequence autoregressive generation. As a result, seq_rm had no checkpoint to roll back to during speculative decoding rejection.

Fix: add checkpoint creation in the has_cell=true branch. Before the current cell is overwritten with new tokens, its SSM state (r_l/s_l) is copied to a free cell and kept as a checkpoint. This makes the rollback history available for the common single-sequence case.

Also replace the soft rollback in seq_rm (which only rewound position metadata, leaving tensor state corrupted) with a proper return false, signaling to the caller that re-evaluation is required when no checkpoint exists at p0-1.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Using this PR, I am still not able to enable spec decoding.

Full log:

Config:
Just a side note: I think this PR should address the issue for Nemotron 3 models too. Could be worth including them in testing, to show that the solution is indeed general and not Qwen-specific?
Add native MTP support for the dense Qwen 3.5 architecture (0.8B, 2B, 4B, 9B, 27B).

What works:
- MTP graph builder for dense qwen35 (build_mtp_head in qwen35.cpp)
- MTP tensor loading and registration for QWEN35 arch
- GGUF converter handles MTP tensors (mtp.fc, mtp.layers, mtp.norm, etc.)
- Public API: llama_get_mtp_logits(), llama_model_n_mtp_layers()
- Server auto-detects MTP from GGUF metadata
- Speculative state machine for MTP draft token generation
- PR ggml-org#20075 applied: recurrent state checkpoint/restore for hybrid models
- M-RoPE position check relaxed for speculative re-evaluation
- Windows os.kill fix for gateway process detection

What needs work:
- Speculative verify loop conflicts with tool-calling requests (400 error)
- The recommended fix: bypass the speculative framework entirely and implement MTP acceptance directly in the server generation loop (no seq_rm/rollback needed since MTP drafts are produced in-graph)
- MTP attention skipped (projection + FFN path only) due to inp_out_ids token count mismatch

Tested on: RTX 5060 8GB, Windows 11, CUDA 13.2
Model: Qwen3.5-9B with MTP tensors (Q4_K_M quantization)
Base: llama.cpp b8388
Implements recurrent state checkpointing for Qwen3.5 hybrid attention+SSM architecture, enabling speculative decoding that was previously broken due to SSM layers not supporting partial sequence removal. Upstream PR: ggml-org#20075
The recurrent memory was sized to n_seq_max (typically 1 for single sequence), leaving no room for the checkpoint cells that PR ggml-org#20075's seq_rm rollback needs. When speculative decoding is enabled, scale the buffer by 9x (1 current + 8 checkpoint slots per sequence).
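The 9x sizing rule above amounts to a one-line change in the buffer-size computation. A minimal sketch, assuming a free function and constant name that are not the actual identifiers in llama-model.cpp:

```cpp
#include <cstdint>

// hardcoded checkpoint depth from the PR (8 checkpoints per sequence)
constexpr uint32_t CHECKPOINTS_PER_SEQ = 8;

// sketch of the rs buffer sizing: 1 current cell per sequence, plus 8
// checkpoint slots per sequence when speculative decoding is enabled
uint32_t recurrent_rs_size(uint32_t n_seq_max, bool speculative) {
    const uint32_t base = n_seq_max < 1 ? 1 : n_seq_max; // max(1, n_seq_max)
    return speculative ? base * (1 + CHECKPOINTS_PER_SEQ) : base;
}
```

Without this scaling, `-np 1` yields a single cell and the checkpoint copies have nowhere to land, which is exactly the out-of-bounds failure reported further down in this thread.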
ROCm (gfx1151) — crashes in copy_cell()

Tested: Qwen3.5-35B-A3B + Qwen3.5-0.8B draft, llama-server -np 1 -c 8192, b8420 + this PR cherry-picked. Two issues found:

1. common_speculative_is_compat() in speculative.cpp tests seq_rm() on a fresh context — but checkpoints only get created during normal decoding in find_slot(). So the test always fails for recurrent models. Had to bypass it manually to get further.

2. llama-model.cpp sets recurrent_rs_size = max(1, n_seq_max). With -np 1 that's 1 cell. The checkpoint logic wants up to 8 per sequence — nowhere to put them. copy_cell() gets called with next_empty_cell beyond buffer bounds:

Tried bumping rs_size to n_seq_max + 4 (both paths in llama-model.cpp:8085 and :8104) — fixes the OOB but then r_l/s_l tensors aren't backed by a large enough backend buffer → GGML_ASSERT(buffer) in ggml-backend.cpp:194.

Same env as @stephensrmmartin who reported the partial sequence removal error above. All existing testers seem to be on Metal/CUDA — might handle buffer bounds differently?
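The out-of-bounds condition in this report reduces to a simple index check. This sketch only illustrates the arithmetic; `next_empty_cell` and `rs_size` mirror the names used in the report, not the PR's actual code, and the real fix also requires sizing the backing r_l/s_l tensors to match (which is why bumping rs_size alone trips GGML_ASSERT in the backend).

```cpp
#include <cstdint>

// a checkpoint copy is only valid if the target index lies inside the
// recurrent state buffer; with rs_size == n_seq_max (1 under -np 1) there
// is no spare cell, so the copy target is always out of bounds
bool can_copy_cell(uint32_t next_empty_cell, uint32_t rs_size) {
    return next_empty_cell < rs_size;
}
```

On Metal/CUDA the same out-of-range write may silently land in slack space of a larger allocation, which would explain why only ROCm crashes here.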
Speculative decoding on hybrid SSM/MoE models is broken right now. With a draft model you either crash immediately ("the target context does not support partial sequence removal") or end up with garbage loops. Took me a while to track down why.
Two things were wrong in find_slot: empty_cell.src was pointing to orig_cell.src instead of seq_meta.tail (so the graph was reading stale state from the wrong cell), and the copy_cell call was just... missing. On top of that, llama_memory_recurrent has no rollback mechanism at all for the SSM state when draft tokens get rejected, which is what causes the state drift.
Fix adds a checkpoint/restore with a rolling buffer (depth 8 per sequence). On Metal, ggml_backend_tensor_copy is synchronous in ggml 0.9.7 so no barrier needed.
Numbers on M3 Max 128GB:
Qwen3.5-122B-A10B-UD-Q4_K_XL + Qwen3.5-0.8B draft
Baseline: ~20.4 t/s → with patch: 23.5–29.7 t/s, acceptance rate around 63–89% depending on --draft-max
No garbage loops over extended runs
The checkpoint depth (8) and memory guard are hardcoded for now — not sure if it's worth exposing them as llama_context_params, open to feedback. VRAM overhead is n_seq_max × depth × SSM_state_size_per_layer, fine on my end but probably worth discussing for smaller devices.