server: improve speed of speculative decoding#1119
Conversation
change logs rpc: add recompute spec dec fix
@magikRUKKOLA At some point you were actively pursuing speculative decoding. Can you check whether this PR works for you and improves performance? Thanks!
Unfortunately I'm getting a segfault on master. I might have messed something up. Hm...

/opt/ik_llama.cpp/ik_llama.cpp/build/bin/llama-server \
--model /opt/THIREUS/Qwen3-Coder-480B-A35B-Instruct-5.1546bpw/Qwen3-Coder-480B-A35B-Instruct-THIREUS-BF16-SPECIAL_TENSOR-00001-of-00748.gguf \
--alias THIREUS/Qwen3-Coder-480B-A35B-Instruct-5.1546bpw \
--ctx-size $((96 * 1024)) \
--model-draft /opt/ubergarm/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-IQ1_KT.gguf \
--draft-max 16 \
--draft-params "--seed 3407 --split-mode graph --gpu-layers 99 -ctk q4_0 -ctv q4_0 -khad --merge-qkv -cuda fusion=1" \
--ctx-size-draft $((96 * 1024)) \
-b 4096 -ub 4096 \
--mlock \
--temp 0.7 --top-k 20 --top-p 0.8 --min-p 0.1 --repeat-penalty 1.05 \
-ctk q8_0 -ctv q8_0 \
--merge-qkv \
-amb 512 \
--seed 3407 \
--split-mode layer \
-ts 1,1,1 \
--main-gpu 2 \
-khad \
--tensor-split 1,1,1 \
--main-gpu 1 \
--cpu-moe \
--gpu-layers 99 \
--threads $(grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}' | xargs -I{} echo "{}-0" | bc) \
--threads-draft $(grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}' | xargs -I{} echo "{}-0" | bc) \
--host 0.0.0.0 \
--port 8080 \
--log-enable \
--logdir /var/log/ \
--jinja \
--special \
--verbose-prompt --verbosity 2 \
--prompt-cache "$HOME/.cache/ik_llama.cpp/prompt-cache.bin" --prompt-cache-all \
--slot-save-path "$HOME/.cache/ik_llama.cpp/slot.bin" \
--lookup-cache-dynamic "$HOME/.cache/ik_llama.cpp/slot.bin" \
--keep -1 \
--slot-prompt-similarity 0.35 \
--metrics \
-cuda fusion=1
@magikRUKKOLA This is something that used to work but is now causing a segmentation fault?
Yeah, and interestingly, the segfault only occurs if a sufficiently long prompt (20k+ context) is specified. The only thing I changed since last time is the usage of the
I can reproduce it in this PR with 20K+ context too. |
@magikRUKKOLA Test again. The draft model's batch size was still 2048, which is too small. Setting it to the draft's context size fixed it.
Uh oh! Indeed, that was the problem. Compared to the previous Qwen3-Coder results (#839): back then the boost was from 6.6 tps to 7.81 tps (without vs. with spec. decoding), i.e. +18%. With today's results it's from 6.61 tps to 8.64 tps (+31%). So the gain from speculative decoding itself improved by about 72% (that is, +31% vs. +18%).
Port ggml-org/llama.cpp#17808
This is an improvement to how llama-server does speculative decoding: it now generates more draft tokens, for about a 10% improvement in tg (token generation) speed.

Old behavior: llama-server always decoded a batch with n_tokens = 1 first to generate a single token without speculative decoding, then called llama_speculative_gen_draft to generate another 4 draft tokens, then went back to generating a single token again.

Other things added in the PR: