
server: improve speed of speculative decoding#1119

Merged
ikawrakow merged 2 commits into main from fcp/speculative_imprv
Jan 10, 2026

Conversation

@firecoperana
Collaborator

Port ggml-org/llama.cpp#17808
This improves how llama-server does speculative decoding: it now generates more draft tokens per iteration, giving about a 10% improvement in tg speed.
Old behavior: llama-server always decoded a batch with n_tokens = 1 first to generate a single token without speculative decoding, then called llama_speculative_gen_draft to generate another 4 draft tokens, then went back to generating a single token again.
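The control-flow change can be sketched with a toy simulation (illustrative names only, not the real ik_llama.cpp API — a scripted token stream stands in for the target model, and `draft_tokens` for the draft model):

```python
# Toy simulation of why larger draft batches help. TARGET plays the role of
# the target model's output; draft_tokens() plays the draft model, guessing
# correctly except at absolute positions 6, 13, 20, ...

TARGET = list(range(20))  # tokens the target model would emit one by one

def draft_tokens(pos, n_draft):
    # Wrong guess (t + 100) whenever (pos + i + 1) is divisible by 7.
    return [t if (pos + i + 1) % 7 else t + 100
            for i, t in enumerate(TARGET[pos:pos + n_draft])]

def speculative_generate(n_draft):
    """Each iteration is one batched target decode that verifies the draft:
    the longest matching prefix is accepted, plus the one token the target
    produces itself on that pass (n_draft=0 degenerates to plain decoding)."""
    out, decode_calls = [], 0
    while len(out) < len(TARGET):
        pos = len(out)
        draft = draft_tokens(pos, n_draft)
        decode_calls += 1
        accepted = 0
        for i, t in enumerate(draft):
            if t != TARGET[pos + i]:
                break
            accepted += 1
        # accepted draft tokens + the "bonus" token from the verify decode
        out.extend(TARGET[pos:pos + accepted + 1])
    return out, decode_calls

for n in (0, 4, 16):
    out, calls = speculative_generate(n)
    assert out == TARGET  # output is always identical; only the cost changes
    print(f"n_draft={n:2d}: {calls} target decode calls")
```

With this toy acceptance pattern, plain decoding needs one target decode per token, while larger drafts amortize several tokens per verify pass; the real-world speedup depends on the draft model's acceptance rate, which this PR now logs.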
Other things added in the PR:

  1. Output the acceptance rate for speculative decoding in the log
  2. Change the log output for prompt-evaluation and token-evaluation speed
  3. Add graph recompute in RPC
  4. Allow rpc-server to be used for the draft model

Commits:

rpc: add recompute

spec dec fix
@firecoperana firecoperana requested a review from ikawrakow January 8, 2026 04:42
@ikawrakow
Owner

@magikRUKKOLA At some point you were actively pursuing speculative decoding. Can you check whether this PR works for you and improves performance? Thanks!

@magikRUKKOLA

@ikawrakow

Unfortunately, I'm getting a segfault on master.

VERB [           process_token] next token | tid="140737345380352" timestamp=1767858335 id_slot=0 id_task=0 token=1986 token_text="This" has_next_token=true n_remain=-1 n_decoded=1 stopped_eos=false stopped_word=false stopped_limit=false stopping_word=""
VERB [            update_slots] max possible draft | tid="140737345380352" timestamp=1767858335 id_slot=0 n_draft_max=16
VERB [       server_sent_event] data stream, to_send: %s | ="data: {\"choices\":[{\"finish_reason\":null,\"index\":0,\"delta\":{\"role\":\"assistant\",\"content\":null}}],\"created\":1767858335,\"id\":\"chatcmpl-FqRqXoJvBFns2EBfQfk36EhXNAhilCcp\",\"model\":\"\",\"object\":\"chat.completion.chunk\",\"usage\":{\"completion_tokens\":1,\"prompt_tokens\":22367,\"total_tokens\":22368}}\n\n"
VERB [       server_sent_event] data stream, to_send: %s | ="data: {\"choices\":[{\"finish_reason\":null,\"index\":0,\"delta\":{\"content\":\"This\"}}],\"created\":1767858335,\"id\":\"chatcmpl-FqRqXoJvBFns2EBfQfk36EhXNAhilCcp\",\"model\":\"\",\"object\":\"chat.completion.chunk\",\"usage\":{\"completion_tokens\":1,\"prompt_tokens\":22367,\"total_tokens\":22368}}\n\n"

Thread 1 "llama-server" received signal SIGSEGV, Segmentation fault.
0x00005555557a968a in llama_batch_add(llama_batch&, int, int, std::vector<int, std::allocator<int> > const&, bool) ()
(gdb) bt full
#0  0x00005555557a968a in llama_batch_add(llama_batch&, int, int, std::vector<int, std::allocator<int> > const&, bool) ()
No symbol table info available.
#1  0x000055555581df0b in llama_speculative_gen_draft(llama_speculative*, llama_speculative_params, std::vector<int, std::allocator<int> > const&, int) ()
No symbol table info available.
#2  0x00005555556ddcc2 in server_context::update_slots() ()
No symbol table info available.
#3  0x0000555555682a90 in server_queue::start_loop() ()
No symbol table info available.
#4  0x00005555555dcc14 in main ()
No symbol table info available.

I might have messed something up. Hm...

/opt/ik_llama.cpp/ik_llama.cpp/build/bin/llama-server \
    --model /opt/THIREUS/Qwen3-Coder-480B-A35B-Instruct-5.1546bpw/Qwen3-Coder-480B-A35B-Instruct-THIREUS-BF16-SPECIAL_TENSOR-00001-of-00748.gguf \
    --alias THIREUS/Qwen3-Coder-480B-A35B-Instruct-5.1546bpw \
    --ctx-size $((96 * 1024)) \
    --model-draft /opt/ubergarm/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-IQ1_KT.gguf \
    --draft-max 16 \
    --draft-params "--seed 3407 --split-mode graph --gpu-layers 99 -ctk q4_0 -ctv q4_0 -khad --merge-qkv -cuda fusion=1" \
    --ctx-size-draft $((96 * 1024)) \
    -b 4096 -ub 4096 \
    --mlock \
    --temp 0.7 --top-k 20 --top-p 0.8 --min-p 0.1 --repeat-penalty 1.05 \
    -ctk q8_0 -ctv q8_0 \
    --merge-qkv \
    -amb 512 \
    --seed 3407 \
    --split-mode layer \
    -ts 1,1,1 \
    --main-gpu 2 \
    -khad \
    --tensor-split 1,1,1 \
    --main-gpu 1 \
    --cpu-moe \
    --gpu-layers 99 \
    --threads $(grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}' | xargs -I{} echo "{}-0" | bc) \
    --threads-draft $(grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}' | xargs -I{} echo "{}-0" | bc) \
    --host 0.0.0.0 \
    --port 8080 \
    --log-enable \
    --logdir /var/log/ \
    --jinja \
    --special \
    --verbose-prompt --verbosity 2 \
    --prompt-cache "$HOME/.cache/ik_llama.cpp/prompt-cache.bin" --prompt-cache-all \
    --slot-save-path "$HOME/.cache/ik_llama.cpp/slot.bin" \
    --lookup-cache-dynamic "$HOME/.cache/ik_llama.cpp/slot.bin" \
    --keep -1 \
    --slot-prompt-similarity 0.35 \
    --metrics \
    -cuda fusion=1

@ikawrakow
Owner

@magikRUKKOLA This is something that used to work but is now causing a segmentation fault?

@magikRUKKOLA

magikRUKKOLA commented Jan 8, 2026

@ikawrakow

Yeah, and interestingly, the segfault only occurs if a sufficiently long prompt (20k+ ctx) is specified.

The only thing I changed since last time is the use of --draft-params (I also incorrectly specified graph split mode for the draft model, but that should not affect anything).

@firecoperana
Collaborator Author

I can reproduce it in this PR with 20K+ context too.

@firecoperana
Collaborator Author

@magikRUKKOLA Test again. The batch size of the draft model was still 2048, which is too small. Setting it to the draft model's context size fixed it.
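A minimal sketch of the failure mode as described (assumed mechanics in toy Python, not the actual C++ code): the draft model's batch was preallocated for 2048 tokens, but with a 20k+ prompt the draft path tries to add far more tokens than that, and llama_batch_add apparently does no bounds check — hence the SIGSEGV in the backtrace above rather than a clean error.

```python
# Toy stand-in for llama_batch: fixed capacity decided at allocation time.
class Batch:
    def __init__(self, capacity):
        self.capacity = capacity
        self.tokens = []

    def add(self, token):
        if len(self.tokens) >= self.capacity:
            # In the real C++ there is no such check -- the write just
            # lands past the end of the allocated buffers (segfault).
            raise IndexError("llama_batch overflow")
        self.tokens.append(token)

def fill(batch, prompt):
    """Add every prompt token to the batch; returns the token count."""
    for t in prompt:
        batch.add(t)
    return len(batch.tokens)
```

`fill(Batch(2048), range(20000))` overflows, while a batch sized to the draft context (e.g. 96 * 1024 here) holds the whole prompt — which is the fix applied in this PR.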

@ikawrakow ikawrakow merged commit c193166 into main Jan 10, 2026
@magikRUKKOLA

magikRUKKOLA commented Jan 10, 2026

@firecoperana

Test again. The batch size of the draft model was still 2048, which is too small. Setting it to the draft model's context size fixed it.

Uh oh!

Indeed that was the problem.

Compared with the previous results for Qwen3-Coder ( #839 ): back then, speculative decoding boosted tg speed from 6.6 t/s to 7.81 t/s (+18%); with today's build it goes from 6.61 t/s to 8.64 t/s (+31%). So the gain from speculative decoding itself is roughly 70% larger than before (+31% vs +18%).
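The percentages quoted above can be reproduced from the raw t/s numbers; the exact ratio of the two gains works out to ~1.7x, matching the rounded figures:

```python
# Check the speed-up arithmetic from the numbers quoted above.
before_base, before_spec = 6.60, 7.81   # tg t/s without / with spec. decoding (#839)
after_base, after_spec = 6.61, 8.64     # same comparison with this PR

gain_before = before_spec / before_base - 1
gain_after = after_spec / after_base - 1

print(f"before: +{gain_before:.0%}")               # +18%
print(f"after:  +{gain_after:.0%}")                # +31%
print(f"ratio:  {gain_after / gain_before:.2f}x")  # ~1.7x
```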

Details
/opt/ik_llama.cpp/ik_llama.cpp/build/bin/llama-server \
    --model /opt/THIREUS/Qwen3-Coder-480B-A35B-Instruct-5.1546bpw/Qwen3-Coder-480B-A35B-Instruct-THIREUS-BF16-SPECIAL_TENSOR-00001-of-00748.gguf \
    --alias THIREUS/Qwen3-Coder-480B-A35B-Instruct-5.1546bpw \
    --ctx-size $((96 * 1024)) \
    --model-draft /opt/ubergarm/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-IQ1_KT.gguf \
    --draft-max 16 \
    --draft-params "--seed 3407 -b 4096 -ub 4096 --split-mode layer --gpu-layers 99 -ctk q4_0 -ctv q4_0 -khad --merge-qkv -cuda fusion=1" \
    --ctx-size-draft $((96 * 1024)) \
    -b 4096 -ub 4096 \
    --mlock \
    --temp 0.7 --top-k 20 --top-p 0.8 --min-p 0.1 --repeat-penalty 1.05 \
    -ctk q8_0 -ctv q8_0 \
    --merge-qkv \
    -amb 512 \
    --seed 3407 \
    --split-mode layer \
    -ts 1,1,1 \
    --main-gpu 2 \
    -khad \
    --tensor-split 1,1,1 \
    --main-gpu 1 \
    --cpu-moe \
    --gpu-layers 99 \
    --threads $(grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}' | xargs -I{} echo "{}-0" | bc) \
    --threads-draft $(grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}' | xargs -I{} echo "{}-0" | bc) \
    --host 0.0.0.0 \
    --port 8080 \
    --log-enable \
    --logdir /var/log/ \
    --jinja \
    --special \
    --verbose-prompt --verbosity 2 \
    --prompt-cache "$HOME/.cache/ik_llama.cpp/prompt-cache.bin" --prompt-cache-all \
    --slot-save-path "$HOME/.cache/ik_llama.cpp/slot.bin" \
    --lookup-cache-dynamic "$HOME/.cache/ik_llama.cpp/slot.bin" \
    --keep -1 \
    --slot-prompt-similarity 0.35 \
    --metrics \
    -cuda fusion=1

@firecoperana firecoperana deleted the fcp/speculative_imprv branch January 15, 2026 21:48