
Conversation

@ekagra-ranjan (Contributor) commented Oct 1, 2025

Follow-up to #24986

The previous PR split the ngram lookup computation among X CPU threads, where each thread handled BS/X requests.
However, in a TP > 1 setting, every rank would spawn its own threads, allocating TP*X threads in total, which is more than necessary.

This PR makes the optimization TP-aware: only rank 0 performs the CPU ngram lookup on X CPU threads and then broadcasts the results to the other ranks. Multithreading, which was disabled in the previous PR, is re-enabled here.
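
For illustration, here is a minimal sketch of the rank-0-only lookup plus broadcast pattern described above, assuming an already-initialized torch.distributed process group (e.g., launched via torchrun). The helper name `ngram_lookup_for_request`, the use of `ThreadPoolExecutor`, and `broadcast_object_list` are illustrative assumptions, not the actual vLLM implementation.

```python
from concurrent.futures import ThreadPoolExecutor

import torch.distributed as dist


def ngram_lookup_for_request(token_ids: list[int]) -> list[int]:
    # Placeholder for the per-request CPU ngram match; returns draft tokens.
    return token_ids[-2:]


def propose_drafts_tp_aware(all_requests: list[list[int]],
                            num_threads: int = 8) -> list[list[int]]:
    """Rank 0 runs the CPU ngram lookup on `num_threads` threads and
    broadcasts the proposals; the other TP ranks only receive them."""
    rank = dist.get_rank()
    payload = [None]
    if rank == 0:
        # Split the batch across CPU threads (roughly BS / num_threads each).
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            payload[0] = list(pool.map(ngram_lookup_for_request, all_requests))
    # Non-zero ranks skip the CPU work entirely and wait for the broadcast.
    dist.broadcast_object_list(payload, src=0)
    return payload[0]
```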

Testing

  1. Added new distributed ngram test
    VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=2 -m tests.distributed.test_ngram | grep 'successfully passed!'

  2. Ensuring the acceptance length (AL) remains the same on TP=1 and TP=2 with and without this PR
    cmd: time VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py --method ngram --model-dir meta-llama/Llama-3.1-8B-Instruct --prompt_lookup_min 5 --prompt_lookup_max 15 --num_spec_tokens 15 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --print-output --tp 2

Output

# tp 2
mean acceptance length: 3.25
--------------------------------------------------
acceptance at token 0: 0.60
acceptance at token 1: 0.41
acceptance at token 2: 0.31
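
For reference, a hedged sketch of how the mean acceptance length relates to the per-position rates printed above: assuming the mean equals one target/bonus token plus the sum of the per-position acceptance rates over all 15 speculative positions (only the first three are shown), the reported 3.25 is consistent. The variable names below are illustrative, not taken from spec_decode.py.

```python
# Hedged sketch: relating mean acceptance length (AL) to per-position
# acceptance rates. Assumes AL = 1 + sum of per-position rates over all
# speculative positions; only the first three rates appear in the output
# above, so this reproduces only a lower bound on the 3.25.
acceptance_rate_per_position = [0.60, 0.41, 0.31]  # positions 0..2 from above
partial_al = 1 + sum(acceptance_rate_per_position)
print(f"partial mean acceptance length (3 of 15 positions): {partial_al:.2f}")
```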

Signed-off-by: Ekagra Ranjan <[email protected]>
@gemini-code-assist bot left a comment


Code Review

This pull request effectively refactors the N-gram proposer to be tensor-parallelism aware by centralizing the computation on rank 0 and broadcasting the results. This is a solid approach to optimize for multi-GPU environments. The addition of a distributed test is also a valuable contribution to ensure the correctness of this new logic.

I've identified one issue where the number of CPU threads for the N-gram lookup is still being divided by the tensor-parallel size, which is a leftover from the previous implementation. This would limit the performance on the leader rank. I've included a specific comment with a suggestion to address this.

Overall, this is a great enhancement. Once the suggested change is made, this should be ready to go.
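
To make the flagged issue concrete, here is a hypothetical before/after of the thread-count computation the review describes. The names `num_ngram_threads` and `tp_size` are assumptions for illustration, not vLLM's actual variables.

```python
# Hypothetical illustration of the leftover flagged above: dividing the thread
# count by the TP size (per-rank scheme) vs. letting rank 0 use all X threads.
tp_size = 2
num_ngram_threads = 8  # X in the PR description

# Leftover per-rank split: limits the leader rank to X / TP threads.
threads_per_rank = max(1, num_ngram_threads // tp_size)

# TP-aware scheme: rank 0 performs all lookups and can use all X threads.
threads_on_rank_0 = num_ngram_threads
```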

Signed-off-by: Ekagra Ranjan <[email protected]>
mergify bot commented Oct 7, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ekagra-ranjan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 7, 2025
Neo9061 commented Oct 13, 2025

Hi @ekagra-ranjan, not sure if you have insights on why n-gram is not working on Qwen MoE? #26594
