server : fix speculative checkpoint bugs on hybrid models #20925
Closed
petter-b wants to merge 28 commits into ggml-org:master from
Conversation
The extra_args field was silently ignored: it was never appended to the server command line, so tests that set extra_args were running without the intended flags.
Sends 20 sequential requests with checkpoints enabled on a hybrid model with small context. Uses --draft-p-min 0.99 to force draft rejection, exercising the checkpoint restore/rewind path. Without the KV leak fix, attention KV entries from rejected drafts accumulate and crash the server.
restore_checkpoint uses the PARTIAL_ONLY flag, which only restores recurrent state. Attention KV entries from rejected draft tokens at positions beyond the checkpoint remain as orphans. Add memory_seq_rm to clean them up after each restore.
Make memory_seq_rm(p0, -1) unconditional in rewind() instead of only calling it in the else branch (when no checkpoint exists). For hybrid SSM models, delete_checkpoint() only restores recurrent state — attention KV entries from the bonus token and unaccepted draft tokens beyond p0 remain as orphans unless explicitly removed.
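The invariant above can be sketched with a toy stand-in for the attention KV cache (the `kv_entry`/`seq_rm_from`/`rewind` names here are illustrative, not llama.cpp's actual types): the cleanup must run whether or not a checkpoint was restored, because on hybrid models the checkpoint only rolls back recurrent state.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Toy stand-in for the attention KV cache: one entry per (seq, pos) cell.
struct kv_entry { int32_t seq_id; int32_t pos; };
using kv_cache = std::vector<kv_entry>;

// Analogue of memory_seq_rm(seq, p0, -1): drop every attention entry for
// `seq` at positions >= p0.
void seq_rm_from(kv_cache & kv, int32_t seq, int32_t p0) {
    kv.erase(std::remove_if(kv.begin(), kv.end(),
        [&](const kv_entry & e) { return e.seq_id == seq && e.pos >= p0; }),
        kv.end());
}

// Rewind after rejected speculation. The cleanup is unconditional: even
// when a checkpoint was restored, only recurrent state was rolled back,
// so attention entries from the bonus token and unaccepted draft tokens
// beyond p0 must still be removed explicitly.
void rewind(kv_cache & kv, int32_t seq, int32_t p0, bool restored_checkpoint) {
    (void) restored_checkpoint; // not an else-branch anymore
    seq_rm_from(kv, seq, p0);
}
```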
10 identical requests with f16 V cache and checkpoints on Qwen3.5-0.8B must produce identical output, validating bit-exact checkpoint round-trip.
Validates that JSON grammar output remains valid after checkpoint restore. Without sampler state save/restore, rejected speculation advances the grammar past the rollback point, producing invalid output. Note: with ngram-mod speculation the bug may not trigger reliably in automated tests due to draft acceptance patterns. The bug was confirmed during manual investigation with grammar + checkpoint restore.
Clone sampler state at checkpoint creation, restore on rollback. Without this, ghost tokens from rejected drafts remain in the sampler's prev ring buffer and grammar state, causing invalid output with grammar-constrained generation.
Reproduces GGML_ASSERT(logits != nullptr) crash when KV cache fills during speculative draft decode with concurrent multi-turn requests. Tests both ngram and draft model paths.
Clear speculative draft state for all slots when llama_decode fails. Without this, sample_and_accept() tries to access logits for draft positions that were never decoded, hitting GGML_ASSERT(logits != nullptr).
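The failure-path fix can be illustrated with a toy `slot` type (names are illustrative): when the batch decode fails, no draft position got logits, so every slot's pending draft must be dropped before sampling runs.

```cpp
#include <vector>

// Toy per-slot speculative state.
struct slot {
    std::vector<int> draft_tokens; // speculative tokens awaiting verification
    bool has_draft() const { return !draft_tokens.empty(); }
};

// If the decode fails, none of the draft positions were evaluated.
// Clearing every slot's draft prevents the sampling step from indexing
// logits that were never computed (the GGML_ASSERT(logits != nullptr) crash).
void on_decode_failure(std::vector<slot> & slots) {
    for (auto & s : slots) {
        s.draft_tokens.clear();
    }
}
```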
The PARTIAL_ONLY flag is ignored by the standard KV cache, so the full state is saved on every checkpoint, wasting memory. memory_seq_rm already handles draft cleanup correctly for standard KV, making checkpoints redundant there. Only enable them for recurrent or hybrid models.
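The guard reduces to a single predicate over the memory type (the enum here is a toy; the real code would query the model's memory implementation):

```cpp
// Toy classification of the model's memory backend.
enum class memory_kind { kv_standard, recurrent, hybrid };

// Checkpoints only pay off when PARTIAL_ONLY actually snapshots a subset
// of state. The standard KV cache ignores the flag and saves everything,
// and memory_seq_rm already cleans up rejected drafts there.
bool checkpoints_useful(memory_kind kind) {
    return kind == memory_kind::recurrent || kind == memory_kind::hybrid;
}
```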
Quantized V cache with speculative checkpoints on hybrid models causes non-deterministic output due to batch-size-dependent FP differences crossing quantization boundaries. Emit a warning at startup recommending f16/bf16 V cache.
Uses stories15M (auto-downloaded, pure transformer) with checkpoint flags. The standard KV guard disables checkpoints for this model, validating the fallback path remains correct.
restore_checkpoint() only restores recurrent/partial state (PARTIAL_ONLY). Attention KV entries from rejected drafts remained as orphans. The cleanup (memory_seq_rm) was in the server callback — any other consumer would need to independently discover this invariant. Move the cleanup into sample_and_accept() so the speculative layer owns it. Change restore_checkpoint return type to include pos_max so the speculative layer knows where to clean from.
Add per-request counters: spec_cycles (total compute_draft calls), spec_empty (no n-gram prediction), spec_skip (full rejection via checkpoint restore). Emitted in timings JSON and server logs.
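A sketch of the counters and their timings-JSON emission, assuming the field names match the timings keys (the `to_timings_json` helper is illustrative, not the server's actual serializer):

```cpp
#include <string>

// Per-request speculative-decoding counters.
struct spec_counters {
    int spec_cycles = 0; // total compute_draft calls
    int spec_empty  = 0; // cycles where the n-gram model predicted nothing
    int spec_skip   = 0; // cycles fully rejected via checkpoint restore
};

// Illustrative JSON fragment for the timings object.
std::string to_timings_json(const spec_counters & c) {
    return "{\"spec_cycles\":" + std::to_string(c.spec_cycles) +
           ",\"spec_empty\":"  + std::to_string(c.spec_empty) +
           ",\"spec_skip\":"   + std::to_string(c.spec_skip) + "}";
}
```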
Two env-var-gated debug features: - LLAMA_DEBUG_KV_DUPLICATES: scan for duplicate (seq_id, pos) pairs after apply_ubatch, log when found - LLAMA_SKIP_SEQRM_AFTER_CKPT: skip memory_seq_rm after checkpoint restore to reproduce upstream behavior
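The env-var gating pattern can be sketched as follows (the helper name is illustrative; the convention of treating unset, empty, and "0" as disabled is an assumption about the intended semantics):

```cpp
#include <cstdlib>
#include <string>

// A debug feature is enabled when its environment variable is set to a
// non-empty value other than "0".
bool env_flag_enabled(const char * name) {
    const char * v = std::getenv(name);
    return v != nullptr && *v != '\0' && std::string(v) != "0";
}
```

Usage would be e.g. `if (env_flag_enabled("LLAMA_DEBUG_KV_DUPLICATES")) { /* scan for duplicate (seq_id, pos) pairs */ }`, keeping the diagnostic cost at zero in normal runs.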
Hi @petter-b, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Accidentally submitted to the wrong repo. See petter-b#1.