[compile] Enable sequence parallelism matching w/o custom ops enabled #27126
Conversation
Force-pushed from c1efc65 to ed10d76
ProExpertProg left a comment
Thanks for taking this on! Could you just add me as a co-author on one of the commits?
| """Base helper for RMSNorm and RMSNorm + Quantization functionalization.""" | ||
| def get_first_out_wrapper(fn): | ||
| @functools.wraps(fn) | ||
| def wrapper(*args): |
Does this work? I thought that during tracing, the pattern-matching tracer would treat *args as a single parameter.
Yes! I updated the test to assert on the number of all_reduce/all_gather ops in the graph.
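A rough sketch of that idea (assumed helper and illustrative op names, not the actual vLLM test code): walk the traced FX graph, count matching collective ops, and assert on the totals.

```python
import torch


def count_ops(graph: torch.fx.Graph, target) -> int:
    # Count call_function nodes whose target is the given op overload.
    return sum(
        1
        for node in graph.nodes
        if node.op == "call_function" and node.target == target
    )


# Example usage inside a test (gm is an fx.GraphModule captured by the pass;
# the op names below are illustrative placeholders):
# assert count_ops(gm.graph, torch.ops.vllm.all_reduce.default) == 0
# assert count_ops(gm.graph, torch.ops.vllm.all_gather.default) == num_layers
```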
Force-pushed from ed10d76 to 5d66118
ProExpertProg left a comment
@cascade812 could you take a look at this please?
Sure!

@angelayi I hit the below error if I don't specify …
@angelayi It seems odd to me that enabling AsyncTP results in higher latency for Llama-70B. In our earlier benchmark, we observed about a 10% reduction in average latency for the prefill stage with AsyncTP enabled, for the same model on 4xH200.
ProExpertProg left a comment
We no longer have to skip the FP4 tests!
    # If no fusion, the original ops are checked
    elif RMSNorm.enabled():
        return [
            torch.ops._C.fused_add_rms_norm.default,
May I ask why we can't have fused_add_rms_norm_static_fp8_quant fused in all cases?
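For context, a minimal sketch (illustrative only, not the exact test helper) of the kind of dispatch the hunk above implements: which ops the test expects to find in the graph depends on whether fusion and the custom RMSNorm op are enabled. It assumes vLLM's compiled `_C` extension is loaded.

```python
import torch


def ops_to_check(fusion_enabled: bool, rms_norm_custom_op: bool) -> list:
    if fusion_enabled:
        # Fusion pass rewrites RMSNorm + static FP8 quant into a single op.
        return [torch.ops._C.fused_add_rms_norm_static_fp8_quant.default]
    elif rms_norm_custom_op:
        # No fusion: the plain custom RMSNorm op should appear in the graph.
        return [torch.ops._C.fused_add_rms_norm.default]
    else:
        # Custom op disabled: RMSNorm is left as decomposed native torch ops,
        # so there is no single op to look for.
        return []
```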
ZJY0516 left a comment
I think this would make the logic clearer and more straightforward
Signed-off-by: angelayi <[email protected]>
Signed-off-by: Luka Govedič <[email protected]>
[compile] Enable sequence parallelism matching w/o custom ops enabled (vllm-project#27126)
Signed-off-by: angelayi <[email protected]>
Signed-off-by: Luka Govedič <[email protected]>
Signed-off-by: ProExpertProg <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>
Bump vLLM version to v0.11.2

What's broken and changed by vLLM:
1. structured_output is broken by vllm-project/vllm#26866
2. get_mrope_input_positions is broken by vllm-project/vllm#28399
3. graph mode is broken by vllm-project/vllm#25110; we'll upgrade torch to 2.8 to fix the problem later
4. embedding is broken by vllm-project/vllm#27583
5. `get_attn_backend_cls` and the attention backend are broken by vllm-project/vllm#28534
6. spec decode is broken by vllm-project/vllm#28771
7. the sp feature is broken by vllm-project/vllm#27126
8. mtp is broken by vllm-project/vllm#27922
9. lora is broken by vllm-project/vllm#21068
10. execute_model is broken by vllm-project/vllm#26866
11. the `VLLM_DISABLE_SHARED_EXPERTS_STREAM` env is broken by vllm-project/vllm#28159
12. kv cache is broken by vllm-project/vllm#27753
13. dp is broken by vllm-project/vllm#25110

What's broken and changed by ourselves:
1. qwen vl is broken by vllm-project/vllm#28455; we'll remove model files in the future to avoid this kind of error
2. Engine core is broken by vllm-project/vllm#23691; we'll remove the patch file in the future
3. Ascend scheduler is broken by vllm-project/vllm#28733; we'll remove the Ascend scheduler later
4. qwen3-next is broken by vllm-project/vllm#28083; we'll remove model files in the future to avoid this kind of error
5. qwen vl is broken by vllm-project/vllm#27764; we'll remove model files in the future

Known issues:
1. ray doesn't work
2. the accuracy of qwen3-next is not correct
3. qwen3-vl is broken
4. prefix cache + ascend scheduler + deepseek v2 lite is broken

- vLLM version: v0.11.2

Signed-off-by: wangxiyuan <[email protected]>
Co-authored-by: MengqingCao <[email protected]>
Co-authored-by: hfadzxy <[email protected]>
Co-authored-by: leo-pony <[email protected]>
Co-authored-by: 22dimensions <[email protected]>
Co-authored-by: shen-shanshan <[email protected]>
Purpose
Based on #24604, this PR modifies the sequence-parallelism pass so that it matches the custom-op patterns without requiring the custom ops to be enabled.
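To illustrate the idea (a hedged sketch with assumed shapes, dtypes, and eps, not code from this PR): the pass now has to recognize both the decomposed native form of RMSNorm that torch.compile sees when the custom op is disabled, and the opaque custom-op call it sees when the op is enabled.

```python
import torch


def rms_norm_native(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    # What the graph contains when the +rms_norm custom op is disabled:
    # plain elementwise/reduction torch ops after decomposition.
    x_f32 = x.to(torch.float32)
    var = x_f32.pow(2).mean(dim=-1, keepdim=True)
    return (x_f32 * torch.rsqrt(var + eps)).to(x.dtype) * weight


def rms_norm_custom(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    # What the graph contains when the custom op is enabled: a single call
    # into vLLM's C++ kernel (requires vLLM's _C extension to be loaded).
    out = torch.empty_like(x)
    torch.ops._C.rms_norm(out, x, weight, eps)
    return out
```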
Test Plan
pytest -sv tests/compile/test_sequence_parallelism.py

Performance numbers
I did some benchmarking with the command on an H100 (without FlashInfer), while varying:
- `"pass_config": {"enable_async_tp": true, "enable_sequence_parallelism": true}` vs. `"pass_config": {"enable_async_tp": false, "enable_sequence_parallelism": false}`
- `"custom_ops": ["+quant_fp8", "+rms_norm"]` vs. `"custom_ops": []`