[V1] Support MP Executor for multi node distributed inference #23691
Conversation
@luccafong has imported this pull request. If you are a Meta employee, you can view this in D81078278.
@njhill @youkaichao I have addressed the comments and args, and retested all combinations on lm_eval.
njhill left a comment:
Thanks @luccafong. A few small things...
houseroad left a comment:
Offline synced with @youkaichao and @njhill, and aligned.
Let's land this for now.
Review comment on the new ParallelConfig fields:

node_rank: int = 0
"""distributed node rank for multi-node distributed
inference when distributed_executor_backend is mp."""
nnodes: int = 1
I don't think it's a good idea to add nnodes as an attribute to the very general ParallelConfig class when it is only meaningful if distributed_executor_backend is mp. Many AI labs rely heavily on Ray, and when the distributed executor backend is ray, nnodes will always be wrong here: it'll stay 1 even if tensor_parallel is set to something like 16 on 8xH200.
It's not intuitive for self.vllm_config.parallel_config.nnodes to return 1 under the ray backend, and it's not enough to just state in the docstring that it's only for mp. IMO we need to make sure that nnodes always reports the correct number of nodes no matter the parallel backend. Similarly, node_rank also needs to make sense for ray. If it's impossible to provide good values for ray, we should at least rename it to something like nnodes_for_mp_backend.
nnodes_for_mp_backend might be pretty hard to use; would setting them to None by default be a better way to resolve the conflict and confusion? We could then raise an error if the user sets them for the ray backend. @patrickvonplaten
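A minimal sketch of that suggestion, assuming the fields default to None and are validated against the executor backend; the trimmed-down class and the validation logic below are illustrative, not vLLM's actual implementation:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParallelConfig:
    """Trimmed-down illustration; the real ParallelConfig has many more fields."""
    distributed_executor_backend: Optional[str] = None  # e.g. "mp" or "ray"
    # Hypothetical: default to None so a stale value of 1 is never reported
    # for backends (such as ray) that do not use these fields.
    nnodes: Optional[int] = None
    node_rank: Optional[int] = None

    def __post_init__(self) -> None:
        if self.distributed_executor_backend == "mp":
            # Fall back to single-node defaults for the mp backend.
            self.nnodes = self.nnodes if self.nnodes is not None else 1
            self.node_rank = self.node_rank if self.node_rank is not None else 0
        elif self.nnodes is not None or self.node_rank is not None:
            # Reject explicit multi-node MP settings under other backends (e.g. ray).
            raise ValueError(
                "--nnodes/--node-rank are only supported with the mp "
                "distributed executor backend")
```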
Purpose
Support the MP executor for multi-node distributed inference when no Ray setup is available. (Compatible with DP hybrid/internal/external LB.)
vllm serve "model_name" -tp <TP_Size> -dp <DP_Size> -pp <PP_Size> --nnodes <# nodes> --node-rank <rank of node> --master-addr <leader_host_ip> [--master-port <port>] [--headless (if not exposing an API endpoint)] [--data-parallel-external-lb (for external LB) or --data-parallel-hybrid-lb]

See concrete examples in the test plan below.
Note for DP Compatibility:
- `--data-parallel-rank`, `--data-parallel-local-size` and `--data-parallel-start-rank` are derived from `--node-rank` for all 3 DP LB modes, and this behaves the same as passing them explicitly as before (illustrated in the sketch below).
- `--headless`: a node will auto-start a headless DP engine if it is the leader (driver) of a DP group, or a headless executor instance if it only hosts non-driver workers.

Architecture change
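The following is a rough Python sketch of how the per-node DP settings could be derived from `--nnodes`/`--node-rank` as described in the note above; the helper name, dict keys, and even-split assumption are illustrative, not vLLM's actual implementation:

```python
# Illustrative sketch (not vLLM's actual implementation) of how per-node
# data-parallel settings could be derived from --nnodes/--node-rank, assuming
# DP ranks are split evenly across nodes.

def derive_dp_settings(dp_size: int, nnodes: int, node_rank: int) -> dict:
    assert dp_size % nnodes == 0 or nnodes % dp_size == 0, \
        "this sketch only handles even splits"
    if dp_size >= nnodes:
        # Several DP ranks per node (e.g. -dp=4 over 2 nodes).
        local_size = dp_size // nnodes
        start_rank = node_rank * local_size
    else:
        # One DP rank spans several nodes (e.g. -dp=2, -tp=4 over 4 nodes):
        # only every (nnodes // dp_size)-th node leads a DP rank.
        nodes_per_dp = nnodes // dp_size
        local_size = 1 if node_rank % nodes_per_dp == 0 else 0
        start_rank = node_rank // nodes_per_dp
    return {
        "data_parallel_local_size": local_size,
        "data_parallel_start_rank": start_rank,
        "is_dp_driver_node": local_size > 0,
    }


# Example matching the "DP 2 (External) * TP 4" test plan below:
# node 0 -> drives DP rank 0, node 1 -> headless executor for DP rank 0,
# node 2 -> drives DP rank 1, node 3 -> headless executor for DP rank 1.
print(derive_dp_settings(dp_size=2, nnodes=4, node_rank=2))
```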
Test Plan
TP=4 (2 Instances)
CUDA_VISIBLE_DEVICES=0,1 vllm serve "Qwen/Qwen3-1.7B" -tp=4 --max-model-len=32768 --nnodes 2 --node-rank 0 --master-addr 127.0.0.1 --port 8000 > /tmp/test_rank0.log 2>&1 &
CUDA_VISIBLE_DEVICES=2,3 vllm serve "Qwen/Qwen3-1.7B" -tp=4 --max-model-len=32768 --nnodes 2 --node-rank 1 --master-addr 127.0.0.1 --headless > /tmp/test_rank1.log 2>&1 &
PP=2 x TP=2 (2 Instances)
CUDA_VISIBLE_DEVICES=0,1 vllm serve "Qwen/Qwen3-1.7B" -pp=2 -tp=2 --max-model-len=32768 --nnodes 2 --node-rank 0 --master-addr 127.0.0.1 --port 8000 > /tmp/test_rank0.log 2>&1 &
CUDA_VISIBLE_DEVICES=2,3 vllm serve "Qwen/Qwen3-1.7B" -pp=2 -tp=2 --max-model-len=32768 --nnodes 2 --node-rank 1 --master-addr 127.0.0.1 --headless > /tmp/test_rank1.log 2>&1 &
DP 2 (External) * TP 4 (2 in node, 2 across node) (4 Instances)
CUDA_VISIBLE_DEVICES=0,1 vllm serve "Qwen/Qwen3-1.7B" -dp=2 -tp=4 --max-model-len=32768 --nnodes 4 --node-rank 0 --master-addr 127.0.0.1 --port 8000 --data-parallel-external-lb > /tmp/test_dp0.log 2>&1 &
CUDA_VISIBLE_DEVICES=2,3 vllm serve "Qwen/Qwen3-1.7B" -dp=2 -tp=4 --max-model-len=32768 --nnodes 4 --node-rank 1 --master-addr 127.0.0.1 --headless > /tmp/test_dp1.log 2>&1 &
CUDA_VISIBLE_DEVICES=4,5 vllm serve "Qwen/Qwen3-1.7B" -dp=2 -tp=4 --max-model-len=32768 --nnodes 4 --node-rank 2 --master-addr 127.0.0.1 --port 8001 --data-parallel-external-lb > /tmp/test_dp2.log 2>&1 &
CUDA_VISIBLE_DEVICES=6,7 vllm serve "Qwen/Qwen3-1.7B" -dp=2 -tp=4 --max-model-len=32768 --nnodes 4 --node-rank 3 --master-addr 127.0.0.1 --headless > /tmp/test_dp3.log 2>&1 &
DP 2 (Internal with headless) * TP 4 (2 in node, 2 across node) (4 Instances)
CUDA_VISIBLE_DEVICES=0,1 vllm serve "Qwen/Qwen3-1.7B" -dp=2 -tp=4 --max-model-len=32768 --nnodes 4 --node-rank 0 --master-addr 127.0.0.1 --port 8000 > /tmp/test_dp0.log 2>&1 &
CUDA_VISIBLE_DEVICES=2,3 vllm serve "Qwen/Qwen3-1.7B" -dp=2 -tp=4 --max-model-len=32768 --nnodes 4 --node-rank 1 --master-addr 127.0.0.1 --headless > /tmp/test_dp1.log 2>&1 &
CUDA_VISIBLE_DEVICES=4,5 vllm serve "Qwen/Qwen3-1.7B" -dp=2 -tp=4 --max-model-len=32768 --nnodes 4 --node-rank 2 --master-addr 127.0.0.1 --headless > /tmp/test_dp2.log 2>&1 &
CUDA_VISIBLE_DEVICES=6,7 vllm serve "Qwen/Qwen3-1.7B" -dp=2 -tp=4 --max-model-len=32768 --nnodes 4 --node-rank 3 --master-addr 127.0.0.1 --headless > /tmp/test_dp3.log 2>&1 &
DP * 4 (Internal) * TP 2 (intra-node) (4 Instances)
CUDA_VISIBLE_DEVICES=0,1 vllm serve "Qwen/Qwen3-1.7B" -dp=4 -tp=2 --max-model-len=32768 --nnodes 4 --node-rank 0 --master-addr 127.0.0.1 --port 8000 > /tmp/test_dp0.log 2>&1 &
CUDA_VISIBLE_DEVICES=2,3 vllm serve "Qwen/Qwen3-1.7B" -dp=4 -tp=2 --max-model-len=32768 --nnodes 4 --node-rank 1 --master-addr 127.0.0.1 --headless > /tmp/test_dp1.log 2>&1 &
CUDA_VISIBLE_DEVICES=4,5 vllm serve "Qwen/Qwen3-1.7B" -dp=4 -tp=2 --max-model-len=32768 --nnodes 4 --node-rank 2 --master-addr 127.0.0.1 --headless > /tmp/test_dp2.log 2>&1 &
CUDA_VISIBLE_DEVICES=6,7 vllm serve "Qwen/Qwen3-1.7B" -dp=4 -tp=2 --max-model-len=32768 --nnodes 4 --node-rank 3 --master-addr 127.0.0.1 --headless > /tmp/test_dp3.log 2>&1 &
DP * 4 (External) * TP 2 (intra-node) (4 Instances)
CUDA_VISIBLE_DEVICES=0,1 vllm serve "Qwen/Qwen3-1.7B" -dp=4 -tp=2 --max-model-len=32768 --nnodes 4 --node-rank 0 --master-addr 127.0.0.1 --port 8000 --data-parallel-external-lb > /tmp/test_dp0.log 2>&1 &
CUDA_VISIBLE_DEVICES=2,3 vllm serve "Qwen/Qwen3-1.7B" -dp=4 -tp=2 --max-model-len=32768 --nnodes 4 --node-rank 1 --master-addr 127.0.0.1 --port 8001 --data-parallel-external-lb > /tmp/test_dp1.log 2>&1 &
CUDA_VISIBLE_DEVICES=4,5 vllm serve "Qwen/Qwen3-1.7B" -dp=4 -tp=2 --max-model-len=32768 --nnodes 4 --node-rank 2 --master-addr 127.0.0.1 --port 8002 --data-parallel-external-lb > /tmp/test_dp2.log 2>&1 &
CUDA_VISIBLE_DEVICES=6,7 vllm serve "Qwen/Qwen3-1.7B" -dp=4 -tp=2 --max-model-len=32768 --nnodes 4 --node-rank 3 --master-addr 127.0.0.1 --port 8003 --data-parallel-external-lb > /tmp/test_dp3.log 2>&1 &
DP * 2(external) * 2(internal) Hybrid. * TP * 2 (2 Instances)
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve "Qwen/Qwen3-1.7B" -dp=4 -tp=2 --max-model-len=32768 --nnodes 2 --node-rank 0 --master-addr 127.0.0.1 --data-parallel-hybrid-lb > /tmp/test_dp0.log 2>&1 &
CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve "Qwen/Qwen3-1.7B" -dp=4 -tp=2 --max-model-len=32768 --nnodes 2 --node-rank 1 --master-addr 127.0.0.1 --data-parallel-hybrid-lb --port 8002 > /tmp/test_dp1.log 2>&1 &
DP * 2 (internal) * PP * 2 * TP * 2 (across nodes) (8 Instances)
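As a hypothetical illustration of this layout (assuming one worker per node and a conventional DP-major rank ordering, which may not match vLLM's actual ordering), the eight instances map to roles roughly as follows:

```python
# Hypothetical placement for DP=2 x PP=2 x TP=2 across 8 single-GPU nodes,
# following the internal-LB pattern above: only node 0 exposes the API,
# all other nodes run headless. The rank ordering is assumed for illustration.
DP, PP, TP = 2, 2, 2

for node_rank in range(DP * PP * TP):
    dp_rank, rem = divmod(node_rank, PP * TP)
    pp_rank, tp_rank = divmod(rem, TP)
    role = "API server + driver" if node_rank == 0 else "headless"
    print(f"node {node_rank}: DP={dp_rank} PP={pp_rank} TP={tp_rank} -> {role}")
```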
Test Result
Evals were run on all of the above combinations and are on par with the baseline; below are some examples.
TP Eval vs. Baseline
TP Perf vs. Baseline
TP x PP Eval
TP x PP Perf
Perf (Multi Instance) vs. Baseline (Single Instance)
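As a quick functional check of any of the deployments above (separate from the lm_eval runs), one can query the OpenAI-compatible endpoint exposed by the non-headless rank-0 node; a minimal sketch, where the port and model name are assumed to match the TP=4 launch command:

```python
# Minimal smoke test against the OpenAI-compatible endpoint served by the
# rank-0 (non-headless) node. The port and model name must match the launch
# command; this is an illustrative check, not the lm_eval harness used above.
import json
import urllib.request

payload = {
    "model": "Qwen/Qwen3-1.7B",
    "prompt": "The capital of France is",
    "max_tokens": 16,
}
req = urllib.request.Request(
    "http://127.0.0.1:8000/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["text"])
```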