
server : fix speculative checkpoint bugs on hybrid models#20925

Closed
petter-b wants to merge 28 commits into ggml-org:master from petter-b:feature/checkpoint-fixes-refactored

Conversation

@petter-b

@petter-b petter-b commented Mar 24, 2026

Accidentally submitted to the wrong repo. See petter-b#1.

srogmann and others added 28 commits March 21, 2026 09:41
The extra_args field was silently ignored: it was never appended to the
server command line, so tests that set extra_args were running without
the intended flags.

Sends 20 sequential requests with checkpoints enabled on a hybrid
model with a small context. Uses --draft-p-min 0.99 to force draft
rejection, exercising the checkpoint restore/rewind path. Without
the KV-leak fix, attention KV entries from rejected drafts
accumulate and crash the server.

restore_checkpoint uses the PARTIAL_ONLY flag, which only restores recurrent
state. Attention KV entries from rejected draft tokens at positions
beyond the checkpoint remain as orphans. Add memory_seq_rm to clean
them up after each restore.

Make memory_seq_rm(p0, -1) unconditional in rewind() instead of
calling it only in the else branch (when no checkpoint exists).

For hybrid SSM models, delete_checkpoint() only restores recurrent
state — attention KV entries from the bonus token and unaccepted
draft tokens beyond p0 remain as orphans unless explicitly removed.
10 identical requests with an f16 V cache and checkpoints on Qwen3.5-0.8B
must produce identical output, validating a bit-exact checkpoint round-trip.

Validates that JSON grammar output remains valid after checkpoint restore.
Without sampler state save/restore, rejected speculation advances the
grammar past the rollback point, producing invalid output.

Note: with ngram-mod speculation the bug may not trigger reliably in
automated tests due to draft acceptance patterns. The bug was confirmed
during manual investigation with grammar + checkpoint restore.

Clone sampler state at checkpoint creation and restore it on rollback.
Without this, ghost tokens from rejected drafts remain in the sampler's
prev ring buffer and grammar state, causing invalid output with
grammar-constrained generation.
Reproduces the GGML_ASSERT(logits != nullptr) crash when the KV cache fills
during speculative draft decode with concurrent multi-turn requests.
Tests both the ngram and draft-model paths.

Clear speculative draft state for all slots when llama_decode fails.
Without this, sample_and_accept() tries to access logits for draft
positions that were never decoded, hitting GGML_ASSERT(logits != nullptr).

The PARTIAL_ONLY flag is ignored by the standard KV cache: the full state
is saved on every checkpoint, wasting memory. memory_seq_rm handles
draft cleanup correctly for standard KV, making checkpoints redundant.
Only enable them for recurrent or hybrid models.
A quantized V cache with speculative checkpoints on hybrid models
causes non-deterministic output due to batch-size-dependent floating-point
differences crossing quantization boundaries. Emit a warning at
startup recommending an f16/bf16 V cache.

Uses stories15M (auto-downloaded, a pure transformer) with checkpoint
flags. The standard-KV guard disables checkpoints for this model,
validating that the fallback path remains correct.

restore_checkpoint() only restores recurrent/partial state (PARTIAL_ONLY).
Attention KV entries from rejected drafts remained as orphans. The cleanup
(memory_seq_rm) was in the server callback; any other consumer would need
to independently discover this invariant.

Move the cleanup into sample_and_accept() so the speculative layer owns it.
Change restore_checkpoint return type to include pos_max so the speculative
layer knows where to clean from.

Add per-request counters: spec_cycles (total compute_draft calls),
spec_empty (no n-gram prediction), and spec_skip (full rejection via
checkpoint restore). Emitted in the timings JSON and server logs.

Two env-var-gated debug features:
- LLAMA_DEBUG_KV_DUPLICATES: scan for duplicate (seq_id, pos) pairs
  after apply_ubatch and log any found
- LLAMA_SKIP_SEQRM_AFTER_CKPT: skip memory_seq_rm after checkpoint
  restore to reproduce upstream behavior
@petter-b petter-b requested review from a team and ggerganov as code owners March 24, 2026 01:10
@petter-b petter-b requested a review from a team as a code owner March 24, 2026 01:10
@github-actions github-actions bot added examples python python script changes server labels Mar 24, 2026
@petter-b petter-b closed this Mar 24, 2026
@ggml-gh-bot

ggml-gh-bot bot commented Mar 24, 2026

Hi @petter-b, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

  • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.


Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
