UPSTREAM PR #18471: Add self‑speculative decoding (no draft model required) #750
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
**Performance Review Report: llama.cpp Speculative Decoding Enhancement**

**Executive Summary**

Analysis of 16 functions across the llama-tts and llama-cvector-generator binaries reveals no performance regressions. All changes stem from intentional feature additions implementing self-speculative decoding configuration infrastructure across 12 commits by Sascha Rogmann. Total initialization overhead: 40,954 nanoseconds (0.041 milliseconds).

**Key Findings**

All analyzed functions reside in initialization code.

**Most-Impacted Functions**

- Lambda #132 (`--spec-config` parser)
- Lambda #133 (`--spec-config` parser)
- Lambda #145 (`--spec-config` parser)
- STL functions (`cbegin`/`begin`)
**Code Changes**

The 12 commits introduce comprehensive speculative decoding configuration.

**Performance Context**

Relative to typical workloads, the impact is 0.0004%–0.004% of model loading time.

**Power Consumption**

Estimated energy impact: 80 microjoules per program startup.
**Critical Assessment**

No performance-critical functions are affected. Project insights identify matrix operations (70–90% of inference time), attention mechanisms, and the KV cache as critical; none of these were modified. All changes affect initialization code that is filtered out for non-server binaries (llama-tts and llama-cvector-generator do not use speculative decoding). STL regressions: the 215% slowdown affects only the `cbegin`/`begin` calls in this initialization path.

**Justification**

The 6,099% maximum percentage increase is misleading: it reflects the addition of new functionality (placeholder → production parser) rather than a degradation. The implementation appropriately prioritizes correctness, flexibility, and user experience for configuration code. The 40-microsecond absolute overhead is negligible for one-time initialization.

**Conclusion**

This version successfully adds flexible speculative decoding configuration without compromising llama.cpp's performance. The changes are well-architected, with proper validation, error handling, and separation of concerns. Recommendation: approve for production; no optimization needed.
Mirrored from ggml-org/llama.cpp#18471
This PR introduces self-speculative decoding: instead of using a dedicated draft model (which is good, if available, see #18039), the current token history is used to predict future tokens. This can provide a speedup in cases where the output repeats parts of the prompt. A typical example is making many small changes in a large source file.
Example 1 (`gpt-oss-120b` in VRAM): translation of a few comments in a Python script (chosen as a favorable case).

Same prompt with `--draft-min 12 --draft-max 48 --spec-self 1`:

To keep the PR simple, the new argument `--spec-self` reuses the same `draft-min` and `draft-max` values as used for a potential draft model. When combining both speculative decoding methods, these values are shared (no independent tuning of min/max for each method).

Example 2 (`Qwen3-235B`, with heavy offloading):

Same prompt with `--draft-min 15 --draft-max 40 --spec-self 1`:

This speedup factor (from ~12 to ~21 tokens/s) occurs only in favorable cases with large repeated sections!
The algorithm is simple: search for a pattern of length `draft-min` in the token history and use the subsequent `draft-max` tokens for speculation. No further optimizations are implemented. I had the idea for this PR while waiting for a source file to finish at 5 t/s ;-)
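The search described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: the function name `self_draft`, the plain `int` token type, and the backward linear scan are all placeholders for whatever the real implementation uses.

```cpp
#include <cstddef>
#include <vector>

// Sketch of self-speculative drafting: take the last `draft_min` tokens of
// the history as a pattern, look for an earlier occurrence of that pattern,
// and propose the up-to-`draft_max` tokens that followed it as the draft.
// Returns an empty vector when no earlier occurrence exists.
static std::vector<int> self_draft(const std::vector<int> & history,
                                   size_t draft_min, size_t draft_max) {
    const size_t n = history.size();
    if (n < draft_min + 1) {
        return {};
    }
    // the pattern is the most recent draft_min tokens
    const int * pattern = history.data() + (n - draft_min);
    // scan backwards so the most recent earlier occurrence wins
    for (size_t start = n - draft_min; start-- > 0; ) {
        bool match = true;
        for (size_t j = 0; j < draft_min; ++j) {
            if (history[start + j] != pattern[j]) { match = false; break; }
        }
        if (!match) {
            continue;
        }
        // draft the tokens that followed the earlier occurrence
        std::vector<int> draft;
        for (size_t k = start + draft_min; k < n && draft.size() < draft_max; ++k) {
            draft.push_back(history[k]);
        }
        return draft;
    }
    return {};
}
```

The drafted tokens are then verified in a single batched forward pass, as in ordinary speculative decoding; on a mismatch the usual fallback to one-token-at-a-time generation applies. The scan is O(n · draft_min), which matches the "no further optimizations" note above.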