@WoosukKwon (Collaborator) commented Mar 17, 2025

This PR optimizes the rejection sampler introduced in #13933 with custom Triton kernels.

The Triton kernels bring the following benefits:

  1. The logits tensors now use the flattened shape [num_tokens, vocab_size] instead of [batch_size, max_spec_len, vocab_size], which substantially reduces GPU memory usage (see the sketch after this list).
  2. Zero synchronization between CPU and GPU.
  3. No inefficient data movement (e.g., chains of cat, gather, and similar ops).
  4. (Arguably) easier-to-read code.
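
To make benefit 1 concrete, here is a minimal sketch (not code from this PR; the batch size, draft lengths, and the Llama 3.1 vocabulary size of 128,256 are illustrative assumptions) comparing the memory footprint of the padded and flattened logits layouts:

```python
import torch

# Illustrative sizes (assumptions, not from this PR): 64 requests, up to 8
# draft tokens each, and the Llama 3.1 vocabulary of 128,256 entries.
batch_size, max_spec_len, vocab_size = 64, 8, 128_256
num_draft = torch.randint(1, max_spec_len + 1, (batch_size,))
num_tokens = int(num_draft.sum())  # total draft tokens actually present

# Padded layout [batch_size, max_spec_len, vocab_size]: every request pays
# for max_spec_len slots, even when it has fewer draft tokens.
padded_elems = batch_size * max_spec_len * vocab_size
# Flattened layout [num_tokens, vocab_size]: only real tokens get logits.
flat_elems = num_tokens * vocab_size

bytes_per_elem = 2  # fp16
print(f"padded: {padded_elems * bytes_per_elem / 1e9:.3f} GB, "
      f"flattened: {flat_elems * bytes_per_elem / 1e9:.3f} GB")
```

With uniformly random draft lengths, the flattened layout needs roughly half the memory of the padded one on average, and the gap widens when most requests have short drafts.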

Performance benchmark: Llama 3.1 8B, ShareGPT, 1xH100, temperature 0.1

SD config: `--speculative-model "[ngram]" --ngram_prompt_lookup_min 5 --ngram-prompt-lookup-max 5 --num_speculative_tokens 3`

| Configuration    | Throughput (reqs/s) |
|------------------|---------------------|
| main (w/o SD)    | 51.49               |
| main (w/ SD)     | 54.41               |
| This PR (w/ SD)  | 64.16               |

This is a 25% throughput increase over main without SD (64.16 / 51.49 ≈ 1.25) and an 18% increase over main with SD (64.16 / 54.41 ≈ 1.18).

Accuracy benchmark: GSM8K, Llama 3.1 8B Instruct, 5 shots

| Config | Temperature | Exact match |
|--------|-------------|-------------|
| w/o SD | 0.0         | 75.7        |
| w/o SD | 1.0         | 50.9        |
| w/ SD  | 0.0         | 75.9        |
| w/ SD  | 1.0         | 51.8        |

Signed-off-by: Woosuk Kwon <[email protected]>
@LiuXiaoxuanPKU (Collaborator) left a comment:


Finished reviewing rejection_sampler.py; will continue with the other files tonight.

@LiuXiaoxuanPKU (Collaborator) left a comment:


LGTM, thanks!

@WoosukKwon merged commit 99abb8b into main on Mar 18, 2025 (29 of 32 checks passed).
@WoosukKwon deleted the v1-opt-rej branch on March 18, 2025 at 21:31.
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
On the new constants in rejection_sampler.py:

```python
GREEDY_TEMPERATURE: tl.constexpr = -1
# Maximum number of speculative draft tokens allowed per request in a single
# step. This value is chosen to be large enough to handle typical use cases.
MAX_SPEC_LEN = 32
```
@mmyxym commented:
Hi @WoosukKwon, is there any limitation that requires MAX_SPEC_LEN to be 32? Can it be larger? Thanks.

@WoosukKwon (Collaborator, Author) replied:
@mmyxym There's no blocker to making it 64; everything should work if you just change the number. I just thought 32 would be enough for all practical use cases.
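
For context, here is a minimal sketch, not the PR's actual kernel, of how such a constant is typically used as a Triton compile-time bound; the kernel name, tensor layout, and helper tensors are hypothetical. One practical reason to stick with values like 32 or 64 is that tl.arange requires a constexpr length that is a power of two:

```python
import torch
import triton
import triton.language as tl

MAX_SPEC_LEN = 32  # compile-time bound on draft tokens per request

@triton.jit
def accepted_prefix_kernel(
    accepted_ptr,   # [num_reqs, MAX_SPEC_LEN] int32 accept flags (0 or 1)
    num_draft_ptr,  # [num_reqs] int32: actual draft tokens per request
    out_ptr,        # [num_reqs] int32: tokens accepted before first rejection
    max_spec_len: tl.constexpr,
):
    req = tl.program_id(0)
    n = tl.load(num_draft_ptr + req)
    offs = tl.arange(0, max_spec_len)  # length must be a constexpr power of two
    mask = offs < n
    flags = tl.load(accepted_ptr + req * max_spec_len + offs, mask=mask, other=0)
    # Count rejections strictly before each position (exclusive cumulative sum);
    # a token is in the accepted prefix iff no rejection precedes it.
    rejected_before = tl.cumsum(1 - flags, axis=0) - (1 - flags)
    prefix = ((rejected_before == 0) & (flags == 1) & mask).to(tl.int32)
    tl.store(out_ptr + req, tl.sum(prefix, axis=0))

# Hypothetical usage: one Triton program per request, no CPU-GPU synchronization.
num_reqs = 4
accepted = torch.tensor([[1, 1, 0, 1] + [0] * 28] * num_reqs,
                        dtype=torch.int32, device="cuda")
num_draft = torch.full((num_reqs,), 4, dtype=torch.int32, device="cuda")
out = torch.empty(num_reqs, dtype=torch.int32, device="cuda")
accepted_prefix_kernel[(num_reqs,)](accepted, num_draft, out,
                                    max_spec_len=MAX_SPEC_LEN)
print(out)  # [2, 2, 2, 2]: two tokens accepted before the first rejection
```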

Labels: ready, speculative-decoding, v1