Flashinfer_CUTLASS_MOE fuses quantization for TP #27223
```diff
@@ -1769,29 +1769,6 @@ def apply(
                 expert_map=expert_map,
                 apply_router_weight_on_input=apply_router_weight_on_input,
             )
-        elif (
-            self.allow_flashinfer
-            and self.flashinfer_moe_backend == FlashinferMoeBackend.CUTLASS
-        ):
```
Contributor
Will this break PP mode? When running DeepSeek with PP, it goes to the last clause, which uses cutlass_moe_fp4 and breaks on SM120, whereas on SGLang, flashinfer_cutlass works with SM120 PP.

Contributor
Well, #27123 added this back and solved my issue.
```diff
-            from vllm.model_executor.layers.fused_moe.flashinfer_cutlass_moe import (  # noqa: E501
-                flashinfer_cutlass_moe_fp4,
-            )
-
-            assert self.moe_quant_config is not None
-
-            return flashinfer_cutlass_moe_fp4(
```
Member
Can you update for compressed-tensors too?
vllm/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py, Line 516 in f9e7ad5

Collaborator
But by deleting this I'm just trying to understand if this is the plan for all cases that use FlashInfer, regardless of distributed strategies, or whether

Collaborator
I don't see any reason to use the modular kernels for cases that aren't using some kind of all2all communication. In this particular case I think @wenscarl and @leejnau figured out that this was dead code because the CUTLASS case always created a modular kernel. I'm not sure whether the same holds true for compressed_tensors.

Contributor (Author)
@bnellnm this PR is updated with an additional
```diff
-                hidden_states=x,
-                w1=layer.w13_weight,
-                w2=layer.w2_weight,
-                topk_weights=topk_weights,
-                topk_ids=topk_ids,
-                quant_config=self.moe_quant_config,
-                inplace=False,
-                activation=activation,
-                global_num_experts=global_num_experts,
-                expert_map=expert_map,
-                apply_router_weight_on_input=apply_router_weight_on_input,
-            )
         else:
             # If no modular kernel is provided, use cutlass_moe_fp4 for TP case
             # only (no EP).
```
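As an aside on the dead-code observation in the thread above: a minimal, self-contained sketch (hypothetical class and return values, not the actual vLLM types) of why the removed elif branch could never run once the CUTLASS backend always assembles a modular kernel at init time:

```python
# Illustrative sketch only -- simplified stand-ins, not the real vLLM classes.
# If the CUTLASS backend always builds a modular ("fused experts") kernel during
# init, the apply() dispatch below shows why the direct
# flashinfer_cutlass_moe_fp4 branch can never be reached.


class NvFp4MoEMethodSketch:
    def __init__(self, allow_flashinfer: bool, backend: str):
        self.allow_flashinfer = allow_flashinfer
        self.flashinfer_moe_backend = backend
        # Assumption from the discussion: CUTLASS always gets a modular kernel,
        # so fused_experts is never None for that backend.
        self.fused_experts = object() if backend == "CUTLASS" else None

    def apply(self, x):
        if self.fused_experts is not None:
            return "modular-kernel path"
        elif self.allow_flashinfer and self.flashinfer_moe_backend == "CUTLASS":
            # Dead code: every CUTLASS config already took the branch above.
            return "direct flashinfer_cutlass_moe_fp4 call"
        else:
            return "cutlass_moe_fp4 fallback (TP only, no EP)"


if __name__ == "__main__":
    assert NvFp4MoEMethodSketch(True, "CUTLASS").apply(None) == "modular-kernel path"
    assert NvFp4MoEMethodSketch(False, "TRTLLM").apply(None) == "cutlass_moe_fp4 fallback (TP only, no EP)"
```

Whether the same invariant holds for the compressed-tensors path is the question the thread above leaves open.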
Why are we always setting use_dp=False? Doesn't flashinfer_cutlass_moe_fp4 also support DP?

flashinfer_cutlass_moe_fp4 (dead code) is removed in this PR. Previously, flashinfer_cutlass_moe_fp4 was meant for the TP case only; for DP, the fused expert is assembled elsewhere.

How is TP + FlashInfer CUTLASS handled now? Could we remove this method altogether, or do we see any cases where it would still be used?

https://github.com/vllm-project/vllm/pull/27223/files#diff-5bb9585da825481e1ae1534657a703f846325c7dd72cfddb0c41f878db33d78aR82 differentiates TP vs DP, but in either case a fused MoE expert is assembled. I agree that flashinfer_cutlass_moe_fp4 should be removed, but compressed-tensors TP could still depend on it, so its removal is out of scope for this PR.
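For illustration, a minimal sketch (hypothetical helper and names, assuming only the behavior described above) of the idea that TP and DP are told apart when the fused MoE expert is assembled, rather than inside a TP-only entry point:

```python
# Illustrative sketch only -- hypothetical names, not the actual vLLM API.
# Both TP and DP end up with an assembled fused-MoE expert; only the
# prepare/finalize (token-movement) side differs, which is what a
# use_dp-style flag would select.

from dataclasses import dataclass


@dataclass
class ParallelConfig:
    tp_size: int = 1
    dp_size: int = 1


def assemble_fused_moe_expert(cfg: ParallelConfig) -> str:
    """Return a description of the kernel assembled for this parallel layout."""
    use_dp = cfg.dp_size > 1
    if use_dp:
        # DP: tokens must be gathered/scattered across DP ranks around the MoE.
        prepare_finalize = "cross-rank prepare/finalize"
    else:
        # TP only: no cross-rank token movement, so quantization can be fused
        # directly into the FlashInfer CUTLASS expert kernel.
        prepare_finalize = "local (no-op) prepare/finalize"
    return prepare_finalize + " + FlashInfer CUTLASS FP4 experts"


if __name__ == "__main__":
    print(assemble_fused_moe_expert(ParallelConfig(tp_size=8, dp_size=1)))
    print(assemble_fused_moe_expert(ParallelConfig(tp_size=1, dp_size=4)))
```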
flashinfer_cutlass_moe_fp4should be removed. But compressed tensor TP could still depends on it. So its removal is out-of-scope for this PR.