server: improve speed of speculative decoding#1119
Conversation
change logs rpc: add recompute spec dec fix
@magikRUKKOLA At some point you were actively pursuing speculative decoding. Can you check whether this PR works for you and improves performance? Thanks!
Unfortunately I'm getting a segfault on master. I might have messed something up. Hm...

/opt/ik_llama.cpp/ik_llama.cpp/build/bin/llama-server \
--model /opt/THIREUS/Qwen3-Coder-480B-A35B-Instruct-5.1546bpw/Qwen3-Coder-480B-A35B-Instruct-THIREUS-BF16-SPECIAL_TENSOR-00001-of-00748.gguf \
--alias THIREUS/Qwen3-Coder-480B-A35B-Instruct-5.1546bpw \
--ctx-size $((96 * 1024)) \
--model-draft /opt/ubergarm/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-IQ1_KT.gguf \
--draft-max 16 \
--draft-params "--seed 3407 --split-mode graph --gpu-layers 99 -ctk q4_0 -ctv q4_0 -khad --merge-qkv -cuda fusion=1" \
--ctx-size-draft $((96 * 1024)) \
-b 4096 -ub 4096 \
--mlock \
--temp 0.7 --top-k 20 --top-p 0.8 --min-p 0.1 --repeat-penalty 1.05 \
-ctk q8_0 -ctv q8_0 \
--merge-qkv \
-amb 512 \
--seed 3407 \
--split-mode layer \
-ts 1,1,1 \
--main-gpu 2 \
-khad \
--tensor-split 1,1,1 \
--main-gpu 1 \
--cpu-moe \
--gpu-layers 99 \
--threads $(grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}' | xargs -I{} echo "{}-0" | bc) \
--threads-draft $(grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}' | xargs -I{} echo "{}-0" | bc) \
--host 0.0.0.0 \
--port 8080 \
--log-enable \
--logdir /var/log/ \
--jinja \
--special \
--verbose-prompt --verbosity 2 \
--prompt-cache "$HOME/.cache/ik_llama.cpp/prompt-cache.bin" --prompt-cache-all \
--slot-save-path "$HOME/.cache/ik_llama.cpp/slot.bin" \
--lookup-cache-dynamic "$HOME/.cache/ik_llama.cpp/slot.bin" \
--keep -1 \
--slot-prompt-similarity 0.35 \
--metrics \
-cuda fusion=1
@magikRUKKOLA This is something that used to work but is now causing a segmentation fault?
Yeah, and interestingly, the segfault only occurs if a sufficiently long prompt (20k+ context) is specified. The only thing I changed since last time is the usage of the
I can reproduce it in this PR with 20K+ context too. |
@magikRUKKOLA Test again. The draft model's batch size was still 2048, which is too small. Setting it to the draft's context size fixed it.
Uh oh! Indeed, that was the problem. Compared to the previous Qwen3-Coder results (#839): back then the boost was from 6.6 tps to 7.81 tps (without vs. with spec. decoding), i.e. +18%. With today's results it's from 6.61 tps to 8.64 tps (+31%). So the gain from speculative decoding itself improved by about 72% (that is, +31% vs. +18%).
Port ggml-org/llama.cpp#17808
This is an improvement to how llama-server does speculative decoding: it now generates more draft tokens, for about a 10% improvement in tg (token generation) speed.

Old behavior: llama-server always decoded a batch with n_tokens = 1 first to generate a single token without speculative decoding, then called llama_speculative_gen_draft to generate another 4 draft tokens, then went back to generating a single token again.

Other things added in the PR: