
Conversation

@angelayi
Contributor

@angelayi angelayi commented Oct 17, 2025

Purpose

Based on #24604, this modifies the sequence-parallelism pass to do custom-op pattern matching without needing to enable the custom op.
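
In short (an illustrative sketch, not the actual pass code): the RMSNorm/quant patterns are matched against whichever lowering is actually in the graph, so matching also works when the model uses the native PyTorch implementation rather than the torch.ops._C custom op. The two graph forms the matcher has to cover look roughly like this (the native math is standard RMSNorm; the custom-op signature is an assumption):

import torch

def rms_norm_native(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Form traced when "+rms_norm" is NOT in custom_ops: plain PyTorch RMSNorm math.
    variance = x.float().pow(2).mean(dim=-1, keepdim=True)
    return (x.float() * torch.rsqrt(variance + eps)).to(x.dtype) * weight

# Form traced when "+rms_norm" IS in custom_ops (out-variant custom op; exact signature assumed):
#   torch.ops._C.rms_norm(out, x, weight, eps)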

Test Plan

pytest -sv tests/compile/test_sequence_parallelism.py

Performance numbers

I did some benchmarking on H100 (without FlashInfer) with the following command,

VLLM_DISABLE_COMPILE_CACHE=1 VLLM_USE_STANDALONE_COMPILE=1 VLLM_LOGGING_LEVEL=DEBUG vllm bench latency --model=nvidia/Llama-3.3-70B-Instruct-FP8 --output-len 1 --input-len 8192 --batch-size 1 --tensor-parallel-size 8 --load-format dummy --num_iters_warmup 5 --num_iters 15 -O '{"level": 3, "use_inductor_graph_partition": false, "splitting_ops": [], "cudagraph_mode": "FULL"}' --no-enable-prefix-caching

while varying (the combination this PR newly enables is spelled out after the list):

  • "pass_config": {"enable_async_tp": true, "enable_sequence_parallelism": true} vs. "pass_config": {"enable_async_tp": false, "enable_sequence_parallelism": false}
  • "custom_ops":["+quant_fp8", "+rms_norm"] vs. "custom_ops":[]
[image: benchmark results]

Collaborator

@ProExpertProg ProExpertProg left a comment


Thanks for taking this on! Could you just add me as a co-author on one of the commits?

"""Base helper for RMSNorm and RMSNorm + Quantization functionalization."""
def get_first_out_wrapper(fn):
@functools.wraps(fn)
def wrapper(*args):
Collaborator


Does this work? I thought that during tracing, the pattern-matching tracer would think that args is a single parameter.

Contributor Author


Yes! I updated the test to assert the number of all_reduce/all_gather ops in the graph.
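
For illustration, the assertion boils down to counting collective nodes in the post-pass FX graph, roughly like this (a minimal sketch; the helper and the op names are illustrative, not the test's actual code):

import torch
import torch.fx

def count_ops(graph: torch.fx.Graph, target) -> int:
    # Count call_function nodes whose target is the given op overload.
    return sum(1 for n in graph.nodes if n.op == "call_function" and n.target == target)

# After the sequence-parallelism pass, the per-layer all_reduce should be replaced by
# reduce_scatter + all_gather, so the test can assert counts along these lines:
#   assert count_ops(graph, torch.ops.vllm.all_reduce.default) == 0
#   assert count_ops(graph, torch.ops.vllm.all_gather.default) == num_hidden_layers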

Collaborator

@ProExpertProg ProExpertProg left a comment


@cascade812 could you take a look at this please?

@ProExpertProg
Collaborator

Also @angelayi, I just noticed there are no e2e tests - could you make the existing e2e tests (tests/distributed/test_sequence_parallelism.py or something like that) use no custom ops by default, as well as add tests to test_fusions_e2e.py (feel free to grab from #27062)?

@cascade812
Contributor

@cascade812 could you take a look at this please?

Sure!

@ProExpertProg ProExpertProg added the ready (ONLY add when PR is ready to merge/full CI is needed) label Oct 22, 2025
@ProExpertProg ProExpertProg enabled auto-merge (squash) October 23, 2025 06:26
@cascade812
Contributor

@angelayi I get the error below if custom_ops=["+rms_norm"] is not specified:

torch._inductor.exc.InductorError: RuntimeError: The size of tensor a (s72) must match the size of tensor b ((s72//2)) at non-singleton dimension 0)

@cascade812
Contributor

@angelayi It seems odd to me that enabling AsyncTP results in higher latency for Llama-70B. From our earlier benchmarks, we observed about a 10% reduction in average latency for the prefill stage with AsyncTP enabled for the same model on 4xH200.

Collaborator

@ProExpertProg ProExpertProg left a comment


We no longer have to skip the FP4 tests!

# If no fusion, the original ops are checked
elif RMSNorm.enabled():
    return [
        torch.ops._C.fused_add_rms_norm.default,
Contributor


May I ask why we can't have fused_add_rms_norm_static_fp8_quant fused in all cases?
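
For context, a hypothetical sketch of the selection the fragment above appears to be part of (branch structure assumed from the visible lines and the PR description, not copied from the diff):

import torch
from vllm.model_executor.layers.layernorm import RMSNorm  # import path assumed

def ops_to_check(fusion_enabled: bool) -> list:
    if fusion_enabled:
        # Fusion pass ran: RMSNorm + static FP8 quant collapse into one fused kernel.
        return [torch.ops._C.fused_add_rms_norm_static_fp8_quant.default]
    elif RMSNorm.enabled():
        # If no fusion, the original custom ops are checked.
        return [torch.ops._C.fused_add_rms_norm.default]
    else:
        # Custom op disabled: patterns were matched on native torch ops,
        # so no _C RMSNorm ops are expected in the graph.
        return []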

Contributor

@ZJY0516 ZJY0516 left a comment


I think this would make the logic clearer and more straightforward

@ProExpertProg ProExpertProg enabled auto-merge (squash) November 14, 2025 20:49
Signed-off-by: Luka Govedič <[email protected]>
@ProExpertProg ProExpertProg merged commit f36292d into vllm-project:main Nov 15, 2025
50 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in torch.compile integration Nov 15, 2025
geodavic pushed a commit to geodavic/vllm that referenced this pull request Nov 16, 2025
…vllm-project#27126)

Signed-off-by: angelayi <[email protected]>
Signed-off-by: Luka Govedič <[email protected]>
Signed-off-by: ProExpertProg <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>
Signed-off-by: George D. Torres <[email protected]>
khluu pushed a commit that referenced this pull request Nov 16, 2025
(cherry picked from commit f36292d)
bwasti pushed a commit to bwasti/vllm that referenced this pull request Nov 17, 2025
wangxiyuan added a commit to vllm-project/vllm-ascend that referenced this pull request Nov 26, 2025
Bump vLLM version to v0.11.2

What's broken and changed by vLLM:
1. structured_output is broken by
vllm-project/vllm#26866
2. get_mrope_input_positions is broken by
vllm-project/vllm#28399
3. graph mode is broken by
vllm-project/vllm#25110 we'll upgrade torch to
2.8 to fix the problem later
4. embedding is broken by
vllm-project/vllm#27583
5. `get_attn_backend_cls` and the attention backend are broken by
vllm-project/vllm#28534
6. spec decode is broken by
vllm-project/vllm#28771
7. sp feature is broken by
vllm-project/vllm#27126
8. mtp is broken by vllm-project/vllm#27922
9. lora is broken by vllm-project/vllm#21068
10. execute_model is broken by
vllm-project/vllm#26866
11. `VLLM_DISABLE_SHARED_EXPERTS_STREAM` env is broken by
vllm-project/vllm#28159
12. kv cache is broken by vllm-project/vllm#27753
13. dp is broken by vllm-project/vllm#25110

 
What's broken and changed by ourselves:
1. qwen vl is broken by vllm-project/vllm#28455
We'll remove model files in the future to avoid this kind of error
2. Engine core is broken by
vllm-project/vllm#23691 We'll remove the patch
file in the future.
3. Ascend scheduler is broken by
vllm-project/vllm#28733 We'll remove ascend
scheduler later.
4. qwen3-next is broken by
vllm-project/vllm#28083 We'll remove model files
in the future to avoid this kind of error
5. qwen vl is broken by vllm-project/vllm#27764.
We'll remove model files in the future

Known issue:
1. ray doesn't work 
2. the accuracy of qwen3-next is not correct
3. qwen3-vl is broken
4. prefix cache + ascend scheduler + deepseek v2 lite is broken.

Co-authored-by: MengqingCao <[email protected]>
Co-authored-by: hfadzxy <[email protected]>
Co-authored-by: leo-pony <[email protected]>
Co-authored-by: 22dimensions <[email protected]>
Co-authored-by: shen-shanshan <[email protected]>


- vLLM version: v0.11.2

---------

Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: MengqingCao <[email protected]>
Signed-off-by: hfadzxy <[email protected]>
Signed-off-by: leo-pony <[email protected]>
Co-authored-by: MengqingCao <[email protected]>
Co-authored-by: hfadzxy <[email protected]>
Co-authored-by: leo-pony <[email protected]>
Kurumi5210 pushed a commit to lidenghui1110/vllm-ascend that referenced this pull request Nov 26, 2025
bringlein pushed a commit to bringlein/vllm that referenced this pull request Nov 26, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Nov 29, 2025
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025

Labels

ci/build, ready (ONLY add when PR is ready to merge/full CI is needed), torch.compile

Projects

Status: Done


5 participants