
UPSTREAM PR #18471: Add self‑speculative decoding (no draft model required)#750

Open
loci-dev wants to merge 26 commits into main from upstream-PR18471-branch_srogmann-feature/self-speculative

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18471

This PR introduces self-speculative decoding: instead of using a dedicated draft model (which works well when one is available, see #18039), the current token history is used to predict future tokens. This can provide a speedup when the output repeats parts of the prompt, a typical example being many small changes to a large source file.

Example 1 (gpt-oss-120b in VRAM): Translation of a few comments in a Python script (chosen as a favorable case).

slot update_slots: id  3 | task 0 | created context checkpoint 1 of 8 (pos_min = 324, pos_max = 2883, size = 90.030 MiB)
slot print_timing: id  3 | task 0 | 
prompt eval time =     436.48 ms /  2948 tokens (    0.15 ms per token,  6754.03 tokens per second)
       eval time =   18886.86 ms /  3423 tokens (    5.52 ms per token,   181.24 tokens per second)
      total time =   19323.34 ms /  6371 tokens
slot      release: id  3 | task 0 | stop processing: n_tokens = 6370, truncated = 0

Same prompt with --draft-min 12 --draft-max 48 --spec-self 1:

slot update_slots: id  3 | task 0 | created context checkpoint 1 of 8 (pos_min = 324, pos_max = 2883, size = 90.030 MiB)
slot print_timing: id  3 | task 0 | 
prompt eval time =     431.85 ms /  2948 tokens (    0.15 ms per token,  6826.38 tokens per second)
       eval time =    7163.27 ms /  3193 tokens (    2.24 ms per token,   445.75 tokens per second)
      total time =    7595.13 ms /  6141 tokens
draft acceptance rate = 0.76827 ( 2397 accepted /  3120 generated)
slot      release: id  3 | task 0 | stop processing: n_tokens = 6140, truncated = 0

To keep the PR simple, the new argument --spec-self reuses the same draft-min and draft-max values that would configure a potential draft model. When both speculative decoding methods are combined, these values are shared, so min/max cannot be tuned independently for each method.

Example 2 (Qwen3-235B, with heavy offloading):

slot update_slots: id  3 | task 0 | prompt done, n_tokens = 2962, batch.n_tokens = 914
slot print_timing: id  3 | task 0 |
prompt eval time =   15606.37 ms /  2962 tokens (    5.27 ms per token,   189.79 tokens per second)
       eval time =  252551.71 ms /  2973 tokens (   84.95 ms per token,    11.77 tokens per second)
      total time =  268158.08 ms /  5935 tokens
srv  log_server_r: request: POST /v1/chat/completions 192.168.32.208 200

Same prompt with --draft-min 15 --draft-max 40 --spec-self 1:

slot update_slots: id  3 | task 0 | prompt done, n_tokens = 2962, batch.n_tokens = 914
slot print_timing: id  3 | task 0 | 
prompt eval time =   15474.80 ms /  2962 tokens (    5.22 ms per token,   191.41 tokens per second)
       eval time =  141116.29 ms /  2963 tokens (   47.63 ms per token,    21.00 tokens per second)
      total time =  156591.09 ms /  5925 tokens
draft acceptance rate = 0.86304 ( 2382 accepted /  2760 generated)

This speedup (from ~12 to ~21 tokens/s) occurs only in favorable cases with large repeated sections!
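The quoted throughput figures can be sanity-checked directly from the timing lines in the two logs (a quick arithmetic sketch, not part of the PR):

```python
# Tokens and eval times taken from the two server logs above.
baseline = 2973 / 252.55171   # tokens per second without self-speculation
specself = 2963 / 141.11629   # tokens per second with --spec-self 1
print(f"{baseline:.2f} t/s -> {specself:.2f} t/s ({specself / baseline:.2f}x)")
```

This reproduces the ~12 to ~21 tokens/s jump, roughly a 1.8x speedup for this heavily offloaded run.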

The algorithm is simple: search the token history for a pattern of length draft-min and use the draft-max tokens that follow the match as the draft. No further optimizations are implemented. I had the idea for this PR while waiting for a source file to finish generating at 5 t/s ;-)
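The lookup described above can be sketched roughly as follows. This is a hedged Python illustration of the idea, not the actual C++ implementation in the PR; the function name and the backwards-scan details are my own assumptions:

```python
def self_speculative_draft(history, draft_min=12, draft_max=48):
    """Propose draft tokens by matching the tail of the token history
    against an earlier occurrence of the same pattern (sketch only)."""
    if len(history) <= draft_min:
        return []
    pattern = history[-draft_min:]
    # Scan backwards, skipping the trivial match at the end of the history.
    for start in range(len(history) - draft_min - 1, -1, -1):
        if history[start:start + draft_min] == pattern:
            # The tokens that followed the earlier occurrence become the draft.
            return history[start + draft_min:start + draft_min + draft_max]
    return []
```

The drafted tokens are then verified by the target model in a single batch, as with ordinary speculative decoding; only accepted tokens are kept.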

@loci-dev loci-dev force-pushed the main branch 24 times, most recently from ca06125 to 76fc6ba Compare January 2, 2026 00:37
@loci-dev loci-dev force-pushed the upstream-PR18471-branch_srogmann-feature/self-speculative branch from 7b3d537 to 9fee55e Compare January 2, 2026 00:48
@loci-dev loci-dev force-pushed the main branch 3 times, most recently from 86bf5db to 07aff19 Compare January 2, 2026 17:07
@loci-review

loci-review bot commented Jan 16, 2026

Explore the complete analysis in Version Insights

@loci-review

loci-review bot commented Jan 24, 2026

Performance Review Report: llama.cpp Speculative Decoding Enhancement

Executive Summary

Analysis of 16 functions across llama-tts and llama-cvector-generator binaries reveals no performance regressions. All changes stem from intentional feature additions implementing self-speculative decoding configuration infrastructure across 12 commits by Sascha Rogmann.

Total initialization overhead: 40,954 nanoseconds (0.041 milliseconds)

Key Findings

All analyzed functions reside in common/arg.cpp CLI argument parsing code, executing once at startup before model loading or inference. Zero impact on performance-critical paths (matrix operations, attention mechanisms, KV cache, quantization).

Most-Impacted Functions

Lambda #132 (--spec-config parser):

  • Response time: 1,485 ns → 40,027 ns (+38,542 ns)
  • Implements 4-level hierarchical string parsing (semicolon→colon→comma→equals)
  • Newly added functionality, not regression

Lambda #133 (--spec-config parser):

  • Response time: 24 ns → 1,489 ns (+1,465 ns)
  • Parses complex configuration strings like "draft;ngram-cache:n=4,m=2"
  • Completely new in target version
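The four-level split the report describes (semicolon → colon → comma → equals) might look like the following hypothetical Python sketch; the real --spec-config grammar in the PR is C++ and may differ in detail:

```python
def parse_spec_config(s):
    """Sketch of a 4-level config parser: 'draft;ngram-cache:n=4,m=2'."""
    strategies = {}
    for entry in s.split(';'):                    # level 1: strategies
        name, _, opts = entry.partition(':')      # level 2: name vs. options
        params = {}
        for opt in filter(None, opts.split(',')): # level 3: option list
            key, _, value = opt.partition('=')    # level 4: key=value pairs
            params[key] = value
        strategies[name] = params
    return strategies
```

Parsing nested delimiters like this is linear in the string length, which is consistent with the tens-of-microseconds one-time cost measured above.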

Lambda #145 (--spec-config parser):

  • Response time: 29 ns → 438 ns (+409 ns)
  • Multi-level validation with exception handling
  • New feature addition

STL Functions (cbegin/begin):

  • Response time: 84 ns → 265 ns (+181 ns each)
  • Build configuration issue (Debug vs Release mode)
  • Not code changes

Code Changes

The 12 commits introduce comprehensive speculative decoding configuration:

  • New CLI arguments: --spec-config, --spec-draftless, --spec-ngram-size-n/m
  • Parameter namespace consolidation: params.speculative
  • Support for multiple strategies: draft models, ngram-cache, ngram-simple, eagle3, draftless
  • 53 files changed (11 modified, 39 added, 3 deleted)

Performance Context

Relative to typical workloads:

  • Model loading: 1-10 seconds (1,000,000,000-10,000,000,000 ns)
  • Token inference: 10-100 milliseconds (10,000,000-100,000,000 ns)
  • Argument parsing overhead: 40,954 ns (0.041 ms)

Impact: 0.0004% - 0.004% of model loading time
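The quoted percentage range follows directly from the numbers above (a verification sketch, using the report's own figures):

```python
overhead_ns = 40_954                    # total argument-parsing overhead
for load_s in (1, 10):                  # 1-10 s model-loading window
    pct = overhead_ns / (load_s * 1e9) * 100
    print(f"{load_s:>2} s load: {pct:.4f}% startup overhead")
```

A 1 s load gives ~0.0041% and a 10 s load ~0.0004%, matching the stated 0.0004%-0.004% range.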

Power Consumption

Estimated energy impact: 80 microjoules per program startup

  • 25,000× smaller than model loading energy
  • 62,500× smaller than 100-token generation
  • Annual impact (1,000 restarts/day): 0.000008 kWh
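The annual figure can be re-derived from the report's stated assumptions (80 µJ per startup, 1,000 restarts/day):

```python
energy_per_start_j = 80e-6            # 80 microjoules per startup
restarts_per_day = 1_000
annual_j = energy_per_start_j * restarts_per_day * 365
annual_kwh = annual_j / 3.6e6         # 1 kWh = 3.6e6 J
print(f"{annual_kwh:.6f} kWh/year")
```

This comes out to about 8.1e-6 kWh, which rounds to the 0.000008 kWh quoted above.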

Critical Assessment

No performance-critical functions affected. Project insights identify matrix operations (70-90% of inference time), attention mechanisms, and KV cache as critical—none were modified. All changes affect initialization code filtered out for non-server binaries (llama-tts, llama-cvector-generator don't use speculative decoding).

STL regressions: The 215% slowdown in std::vector::cbegin/begin indicates Debug build configuration in target version versus Release in base. This is a build issue, not code regression.

Justification

The 6,099% maximum percentage increase is misleading—it represents new functionality addition (placeholder → production parser) rather than degradation. The implementation prioritizes correctness, flexibility, and user experience appropriately for configuration code. The 40-microsecond absolute overhead is negligible for one-time initialization.

Conclusion

This version successfully adds flexible speculative decoding configuration without compromising llama.cpp's performance. Changes are well-architected with proper validation, error handling, and separation of concerns. Recommendation: Approve for production—no optimization needed.

See the complete breakdown in Version Insights
Have questions? Tag @loci-dev to ask about this PR.

@loci-review

loci-review bot commented Jan 27, 2026

No summary available at this time. Visit Version Insights to review detailed analysis.
