[Spec Decode] Add TP parallel Ngram #26056
base: main
Conversation
Code Review
This pull request effectively refactors the N-gram proposer to be tensor-parallelism aware by centralizing the computation on rank 0 and broadcasting the results. This is a solid approach to optimize for multi-GPU environments. The addition of a distributed test is also a valuable contribution to ensure the correctness of this new logic.
I've identified one issue where the number of CPU threads for the N-gram lookup is still being divided by the tensor-parallel size, which is a leftover from the previous implementation. This would limit the performance on the leader rank. I've included a specific comment with a suggestion to address this.
Overall, this is a great enhancement. Once the suggested change is made, this should be ready to go.
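A minimal sketch of the kind of change the review is suggesting; the helper name and the exact config plumbing are assumptions for illustration, not the actual vLLM code:

```python
import os

def ngram_lookup_thread_count(tp_size: int) -> int:
    """Hypothetical helper illustrating the suggested fix.

    Leftover behavior: the CPU thread budget for the n-gram lookup was
    split across TP ranks, e.g. (os.cpu_count() or 1) // tp_size, even
    though only rank 0 now runs the lookup. Since the leader rank does
    all the work, it should keep the full budget instead of a 1/tp_size
    share.
    """
    return max(1, os.cpu_count() or 1)
```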
This pull request has merge conflicts that must be resolved before it can be merged.
Hi @ekagra-ranjan, not sure if you have insights on why n-gram is not working on Qwen MoE? #26594
Follow-up on #24986.
The previous PR split the n-gram lookup computation among X CPU threads, with each thread handling BS/X requests.
However, with TP > 1, every rank spawned its own X threads, creating TP*X threads in total and allocating more threads than necessary.
This PR makes the optimization TP aware: only rank 0 performs all the CPU lookup work on X threads and then broadcasts the results to the other ranks, as sketched below. The multithreading that was disabled in the previous PR is re-enabled here.
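A minimal sketch of the rank-0-compute-then-broadcast pattern described above, written against plain torch.distributed with a toy prompt-lookup; the names `_lookup_one` and `propose_tp_aware` and the use of `broadcast_object_list` are illustrative assumptions, not the actual vLLM proposer or its TP group helpers:

```python
from concurrent.futures import ThreadPoolExecutor
import torch.distributed as dist

def _lookup_one(token_ids, min_n=5, max_n=15, k=15):
    # Toy prompt-lookup: find an earlier occurrence of the trailing n-gram
    # and propose the (up to k) tokens that followed it.
    for n in range(max_n, min_n - 1, -1):
        if len(token_ids) <= n:
            continue
        tail = token_ids[-n:]
        for start in range(len(token_ids) - n - 1, -1, -1):
            if token_ids[start:start + n] == tail:
                return token_ids[start + n:start + n + k]
    return []

def propose_tp_aware(batch_token_ids, num_threads=8):
    rank, world = dist.get_rank(), dist.get_world_size()
    drafts = None
    if rank == 0:
        # Only the leader rank spends CPU threads on the lookup,
        # fanning the batch out over `num_threads` workers.
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            drafts = list(pool.map(_lookup_one, batch_token_ids))
    if world > 1:
        # The other TP ranks receive rank 0's drafts so every rank feeds
        # identical speculative tokens into verification.
        payload = [drafts]
        dist.broadcast_object_list(payload, src=0)
        drafts = payload[0]
    return drafts
```

Because every rank ends up with the same drafts, acceptance behavior should match the TP=1 case while only one rank spends CPU time on the lookup.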
Testing
Added new distributed ngram test
VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=2 -m tests.distributed.test_ngram | grep 'successfully passed!'
Ensuring AL (acceptance length) remains the same on TP=1 and TP=2, with and without this PR
cmd:
time VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py --method ngram --model-dir meta-llama/Llama-3.1-8B-Instruct --prompt_lookup_min 5 --prompt_lookup_max 15 --num_spec_tokens 15 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --print-output --tp 2
Output