[Spec Decode] Add TP parallel Ngram #26056
base: main
Conversation
Code Review
This pull request effectively refactors the N-gram proposer to be tensor-parallelism aware by centralizing the computation on rank 0 and broadcasting the results. This is a solid approach to optimize for multi-GPU environments. The addition of a distributed test is also a valuable contribution to ensure the correctness of this new logic.
I've identified one issue where the number of CPU threads for the N-gram lookup is still being divided by the tensor-parallel size, which is a leftover from the previous implementation. This would limit the performance on the leader rank. I've included a specific comment with a suggestion to address this.
Overall, this is a great enhancement. Once the suggested change is made, this should be ready to go.
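A minimal sketch of the kind of change the review is suggesting; the helper name and the exact config plumbing are assumptions for illustration, not the actual vLLM code:

```python
import os

def ngram_lookup_thread_count(tp_size: int) -> int:
    """Hypothetical helper illustrating the suggested fix.

    Leftover behavior: the CPU thread budget for the n-gram lookup was
    split across TP ranks, e.g. (os.cpu_count() or 1) // tp_size, even
    though only rank 0 now runs the lookup. Since the leader rank does
    all the work, it should keep the full budget instead of a 1/tp_size
    share.
    """
    return max(1, os.cpu_count() or 1)
```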
This pull request has merge conflicts that must be resolved before it can be merged.
Hi @ekagra-ranjan, not sure if you have insights on why n-gram is not working on Qwen MoE? #26594
Follow-up on #24986.
The previous PR split the n-gram lookup computation among X CPU threads, with each thread handling BS/X requests.
However, with TP > 1, every rank spawned its own X threads, creating TP*X threads in total and allocating more threads than necessary.
This PR makes the optimization TP aware: only rank 0 performs all the CPU lookup work on X threads and then broadcasts the results to the other ranks, as sketched below. The multithreading that was disabled in the previous PR is re-enabled here.
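A minimal sketch of the rank-0-compute-then-broadcast pattern described above, written against plain torch.distributed with a toy prompt-lookup; the names `_lookup_one` and `propose_tp_aware` and the use of `broadcast_object_list` are illustrative assumptions, not the actual vLLM proposer or its TP group helpers:

```python
from concurrent.futures import ThreadPoolExecutor
import torch.distributed as dist

def _lookup_one(token_ids, min_n=5, max_n=15, k=15):
    # Toy prompt-lookup: find an earlier occurrence of the trailing n-gram
    # and propose the (up to k) tokens that followed it.
    for n in range(max_n, min_n - 1, -1):
        if len(token_ids) <= n:
            continue
        tail = token_ids[-n:]
        for start in range(len(token_ids) - n - 1, -1, -1):
            if token_ids[start:start + n] == tail:
                return token_ids[start + n:start + n + k]
    return []

def propose_tp_aware(batch_token_ids, num_threads=8):
    rank, world = dist.get_rank(), dist.get_world_size()
    drafts = None
    if rank == 0:
        # Only the leader rank spends CPU threads on the lookup,
        # fanning the batch out over `num_threads` workers.
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            drafts = list(pool.map(_lookup_one, batch_token_ids))
    if world > 1:
        # The other TP ranks receive rank 0's drafts so every rank feeds
        # identical speculative tokens into verification.
        payload = [drafts]
        dist.broadcast_object_list(payload, src=0)
        drafts = payload[0]
    return drafts
```

Because every rank ends up with the same drafts, acceptance behavior should match the TP=1 case while only one rank spends CPU time on the lookup.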
Testing
Added new distributed ngram test
VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=2 -m tests.distributed.test_ngram | grep 'successfully passed!'
Ensuring AL (acceptance length) remains the same on TP=1 and TP=2, with and without this PR
cmd:
time VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py --method ngram --model-dir meta-llama/Llama-3.1-8B-Instruct --prompt_lookup_min 5 --prompt_lookup_max 15 --num_spec_tokens 15 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --print-output --tp 2
Output