
UPSTREAM PR #17808: server: improve speed of speculative decoding#463

Open
loci-dev wants to merge 1 commit into main from
upstream-PR17808-branch_ngxson-xsn/server_improve_spec

Conversation


@loci-dev loci-dev commented Dec 6, 2025

Mirrored from ggml-org/llama.cpp#17808

Fix ggml-org/llama.cpp#12968

I'm testing with llama-server --fim-qwen-7b-spec, but it seems like the quality degraded significantly. Not sure if this is expected (since we no longer sample a single token at a time like before).

TODO: leave a drawing here to explain how it works


loci-review bot commented Dec 6, 2025

Explore the complete analysis in the Version Insights.

Performance Analysis Summary: PR #463

Analysis Overview:
Comparison of version f4ec65a5-3bd7-43ad-b473-ceef01e93350 against base 6ffb5ba3-7159-4bcb-bcc4-0ff094d14c42 for the llama.cpp server speculative decoding optimization.


Summary

This PR refactors speculative decoding to batch draft tokens with the main model inference, eliminating a separate llama_decode() call per iteration. The analysis shows no measurable performance differences between versions, with all binaries reporting 0.0% power consumption change and no function-level Response Time or Throughput Time variations. The code changes are structurally sound, moving draft generation before batch construction and adding rollback logic for token management, but the optimization benefits are not captured in the static analysis environment.

@loci-dev force-pushed the main branch 27 times, most recently from a2add8a to 6d9272a on December 9, 2025 at 09:10
@loci-dev force-pushed the main branch 30 times, most recently from ef96f85 to adf9533 on December 14, 2025 at 12:13
