
server : fix speculative checkpoint bugs on hybrid models#20925

Closed
petter-b wants to merge 28 commits into ggml-org:master from petter-b:feature/checkpoint-fixes-refactored

Conversation

@petter-b

@petter-b petter-b commented Mar 24, 2026

Accidentally submitted to the wrong repo. See petter-b#1.

srogmann and others added 28 commits March 21, 2026 09:41
The extra_args field was silently ignored: it was never appended to the
server command line, so tests that set extra_args were running without
the intended flags.

Sends 20 sequential requests with checkpoints enabled on a hybrid
model with a small context. Uses --draft-p-min 0.99 to force draft
rejection, exercising the checkpoint restore/rewind path. Without
the KV-leak fix, attention KV entries from rejected drafts
accumulate and crash the server.

restore_checkpoint uses the PARTIAL_ONLY flag, which only restores recurrent
state. Attention KV entries from rejected draft tokens at positions
beyond the checkpoint remain as orphans. Add memory_seq_rm to clean
them up after each restore.

Make memory_seq_rm(p0, -1) unconditional in rewind() instead of
calling it only in the else branch (when no checkpoint exists).

For hybrid SSM models, delete_checkpoint() only restores recurrent
state — attention KV entries from the bonus token and unaccepted
draft tokens beyond p0 remain as orphans unless explicitly removed.
10 identical requests with an f16 V cache and checkpoints on Qwen3.5-0.8B
must produce identical output, validating a bit-exact checkpoint round-trip.

Validates that JSON grammar output remains valid after checkpoint restore.
Without sampler state save/restore, rejected speculation advances the
grammar past the rollback point, producing invalid output.

Note: with ngram-mod speculation the bug may not trigger reliably in
automated tests due to draft acceptance patterns. The bug was confirmed
during manual investigation with grammar + checkpoint restore.

Clone sampler state at checkpoint creation and restore it on rollback.
Without this, ghost tokens from rejected drafts remain in the sampler's
prev ring buffer and grammar state, causing invalid output with
grammar-constrained generation.
Reproduces the GGML_ASSERT(logits != nullptr) crash when the KV cache fills
during speculative draft decode with concurrent multi-turn requests.
Tests both the ngram and draft-model paths.

Clear speculative draft state for all slots when llama_decode fails.
Without this, sample_and_accept() tries to access logits for draft
positions that were never decoded, hitting GGML_ASSERT(logits != nullptr).

The PARTIAL_ONLY flag is ignored by the standard KV cache: the full state
is saved on every checkpoint, wasting memory. memory_seq_rm handles
draft cleanup correctly for standard KV, making checkpoints redundant.
Only enable them for recurrent or hybrid models.
A quantized V cache with speculative checkpoints on hybrid models
causes non-deterministic output due to batch-size-dependent floating-point
differences crossing quantization boundaries. Emit a warning at
startup recommending an f16/bf16 V cache.

Uses stories15M (auto-downloaded, a pure transformer) with checkpoint
flags. The standard-KV guard disables checkpoints for this model,
validating that the fallback path remains correct.

restore_checkpoint() only restores recurrent/partial state (PARTIAL_ONLY).
Attention KV entries from rejected drafts remained as orphans. The cleanup
(memory_seq_rm) was in the server callback; any other consumer would need
to independently discover this invariant.

Move the cleanup into sample_and_accept() so the speculative layer owns it.
Change restore_checkpoint return type to include pos_max so the speculative
layer knows where to clean from.

Add per-request counters: spec_cycles (total compute_draft calls),
spec_empty (no n-gram prediction), and spec_skip (full rejection via
checkpoint restore). Emitted in the timings JSON and server logs.

Two env-var-gated debug features:
- LLAMA_DEBUG_KV_DUPLICATES: scan for duplicate (seq_id, pos) pairs
  after apply_ubatch and log any found
- LLAMA_SKIP_SEQRM_AFTER_CKPT: skip memory_seq_rm after checkpoint
  restore to reproduce upstream behavior
@petter-b petter-b requested review from a team and ggerganov as code owners March 24, 2026 01:10
@petter-b petter-b requested a review from a team as a code owner March 24, 2026 01:10
@github-actions github-actions bot added examples python python script changes server labels Mar 24, 2026
@petter-b petter-b closed this Mar 24, 2026
@ggml-gh-bot

ggml-gh-bot bot commented Mar 24, 2026

Hi @petter-b, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

  • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.


Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
