UPSTREAM PR #17808: server: improve speed of speculative decoding#463
Conversation
Version Insights Performance Analysis Summary (PR #463): This PR refactors speculative decoding to batch draft tokens with the main model inference, eliminating a separate …
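The batching idea in the summary can be sketched roughly as follows. This is an illustrative sketch only, not the llama.cpp server code: the function names, the list-based "model" interfaces, and the greedy acceptance rule are all assumptions made for the example. The point it shows is that the drafted tokens are verified in one target-model batch rather than with a separate decode per token.

```python
# Hypothetical sketch of a batched speculative-decoding round.
# `target_logits_fn(tokens)` returns one "predicted next token" per position,
# as if the target model scored the whole batch in a single pass;
# `draft_next_fn(tokens)` returns the draft model's next-token guess.
# Neither corresponds to a real llama.cpp API.

def speculative_step(target_logits_fn, draft_next_fn, context, n_draft):
    # 1) Draft n_draft tokens autoregressively with the small model.
    drafted = []
    work = list(context)
    for _ in range(n_draft):
        t = draft_next_fn(work)
        drafted.append(t)
        work.append(t)

    # 2) Verify context + drafted tokens in ONE target-model batch
    #    (instead of a separate decode call per draft token).
    preds = target_logits_fn(list(context) + drafted)

    # 3) Accept the longest prefix of drafted tokens the target agrees with;
    #    the first disagreement is replaced by the target's own token.
    accepted = []
    for i, t in enumerate(drafted):
        target_tok = preds[len(context) + i - 1]
        if target_tok == t:
            accepted.append(t)
        else:
            accepted.append(target_tok)
            break
    else:
        # All drafts accepted: the final batch position yields a bonus token.
        accepted.append(preds[-1])
    return accepted
```

With a good draft model most rounds accept several tokens, so the target model advances multiple positions per decode batch.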
Force-pushed from a2add8a to 6d9272a
Force-pushed from ef96f85 to adf9533
Mirrored from ggml-org/llama.cpp#17808
Fixes ggml-org/llama.cpp#12968
I'm testing with `llama-server --fim-qwen-7b-spec`, but it seems like the quality degraded significantly. Not sure if this is expected (as we no longer sample a single token like before).

TODO: leave a drawing here to explain how it works.