[LoRA] Support FusedMoE LoRA Triton kernel for mxfp4 #28971
Conversation
Code Review
This pull request introduces support for FusedMoE LoRA with Triton kernels for mxfp4 models, which provides a significant performance improvement as shown in the benchmarks. The changes are well-structured, adding the necessary logic to select the Triton backend and adapting the kernels for this new path. However, I've identified a critical issue where attributes in OAITritonExperts are used without being initialized, which could lead to a runtime error in non-LoRA use cases. Please address this to ensure the stability of the implementation.
vllm/model_executor/layers/fused_moe/gpt_oss_triton_kernels_moe.py (outdated review thread, resolved)
nice speedup!
```python
    modular_triton_fused_moe,
    try_get_optimal_moe_config,
)
from vllm.model_executor.layers.fused_moe.fused_moe_modular_method import (
```
@xyang16 can you add a unit test for gpt-oss lora + triton_kernels. The test can be predicated on has_triton_kernels like in https://github.com/vllm-project/vllm/blob/main/tests/kernels/moe/test_gpt_oss_triton_kernels.py
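A minimal sketch of that gating, assuming has_triton_kernels is importable from vllm.utils (the helper is named in the comment above, but the import path and the test body here are assumptions, not the PR's actual test):

```python
import pytest

from vllm.utils import has_triton_kernels  # import path is an assumption

# Skip the whole module when the optional triton_kernels package is absent,
# mirroring tests/kernels/moe/test_gpt_oss_triton_kernels.py.
if not has_triton_kernels():
    pytest.skip("triton_kernels not installed", allow_module_level=True)


def test_gpt_oss_lora_triton_moe():
    # Placeholder body; the real coverage lives in the
    # test_modular_oai_triton_moe.py file added in this PR.
    ...
```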
Added test_modular_oai_triton_moe.py. Thanks!
varun-sundar-rabindranath left a comment:
The changes to triton_kernel_fused_experts are invasive, and it is a bit confusing to reason about the fused_act=True, fuse_sum=True and fuse_act=False, fused_sum=False cases, as the assumptions and expectations around matmul_ogs are different in each.

The main differences between the non-LoRA and the LoRA case:
- For the non-LoRA case, no assumptions are made about the sizes of the matmul_ogs output tensors; the only requirement is that the second matmul_ogs must return a tensor of size [M, K]. For the LoRA case, we expect the outputs to be of a specific shape - this pattern is similar to TritonExperts.
- For the non-LoRA case, there are no requirements on the gather_indx and scatter_indx sizes. The LoRA case requires the tensors in these objects to be of a specific shape.

For these reasons, I think it will be better to create a separate implementation of the BaseOAITritonExperts class for the LoRA case, naming it something like UnfusedOAITritonExperts. Apart from being easier to assert expectations for, with this we can create correctly and adequately sized workspaces for both workspace13 and workspace2 and reuse them properly in the implementation. Please refer to TritonExperts; I think the implementation here would be very similar, and all the logic could be contained within the apply function, thus not disturbing the existing triton_kernel_fused_experts function. Something like:
```python
def apply():
    routing_data, gather_indx, scatter_indx = self._make_routing_data(
        topk_ids, topk_weights, local_num_experts
    )
    matmul_ogs(..., y=intermediate_cache1)
    activation(intermediate_cache2, intermediate_cache1)
    matmul_ogs(..., y=intermediate_cache3)
```
```python
if with_lora_support:
    return get_mxfp4_backend_with_lora()
```
I think we should keep get_mxfp4_backend_with_lora() and return the appropriate backend from within that function, because there is no guarantee that the logic below will choose a LoRA-compatible backend.
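A self-contained sketch of that shape (the enum members and the fallback here are simplified stand-ins, not vLLM's actual selection logic):

```python
from enum import Enum


class Mxfp4Backend(Enum):
    # Stub enum for illustration; the real backend enum in vLLM has more members.
    MARLIN = "marlin"
    TRITON = "triton"


def get_mxfp4_backend_with_lora() -> Mxfp4Backend:
    # Only ever return a LoRA-compatible backend (the Triton kernel path).
    return Mxfp4Backend.TRITON


def get_mxfp4_backend(with_lora_support: bool) -> Mxfp4Backend:
    if with_lora_support:
        # Keep LoRA selection in its own function so the general logic
        # below can never hand back a backend without LoRA support.
        return get_mxfp4_backend_with_lora()
    # General (non-LoRA) selection, reduced to a single choice here.
    return Mxfp4Backend.MARLIN
```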
Added back get_mxfp4_backend_with_lora. Thanks for review!
vllm/model_executor/layers/fused_moe/gpt_oss_triton_kernels_moe.py (outdated review thread, resolved)
```python
    precision_config=quant_config.w1_precision,
    gammas=gammas if apply_router_weight_on_input else None,
    fused_activation=act,
)
```
@xyang16 To guarantee that matmul_ogs will return an output shaped [M * topk, N], I think it is better to pass in the output tensor ourselves using the argument y. Note that matmul_ogs actually checks that the output is of the expected size here: https://github.com/triton-lang/triton/blob/c3c476f357f1e9768ea4e45aa5c17528449ab9ef/python/triton_kernels/triton_kernels/matmul_ogs.py#L180. That way it is guaranteed that matmul_ogs will respect the contract.
Same for the second matmul_ogs also.
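A sketch of that change inside triton_kernel_fused_experts (the keyword arguments come from the diff above; the leading positional arguments, dtype, and device choices are assumptions, and this is not the PR's actual code - see #29219):

```python
# Preallocate the first GEMM's output so matmul_ogs' own shape check
# enforces the [M * topk, N] contract instead of relying on its default
# output allocation.
intermediate_cache1 = torch.empty(
    (M * topk, N), dtype=hidden_states.dtype, device=hidden_states.device
)
matmul_ogs(
    ...,  # hidden_states, w1, bias, routing_data as in the existing call
    gather_indx=gather_indx,
    precision_config=quant_config.w1_precision,
    gammas=gammas if apply_router_weight_on_input else None,
    fused_activation=act,
    y=intermediate_cache1,  # matmul_ogs verifies y has the expected shape
)
```

The same pattern would apply to the second matmul_ogs, with an [M, K] output buffer (or [M * topk, K] for the unfused LoRA path).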
Raised a separate PR #29219 for this
```diff
     apply_router_weight_on_input: bool,
     weight_and_reduce_impl: mk.TopKWeightAndReduce,
-) -> None:
+) -> torch.Tensor:
```
This violates the base class contract at
```python
def finalize(
```
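For reference, a paraphrase of the base-class contract being pointed to (from vllm.model_executor.layers.fused_moe.modular_kernel; the exact signature may differ by version): finalize writes its result into the caller-provided output tensor and returns None, which is why an override returning torch.Tensor breaks it.

```python
from abc import ABC, abstractmethod

import torch


class FusedMoEPrepareAndFinalize(ABC):  # paraphrased, not the exact vLLM source
    @abstractmethod
    def finalize(
        self,
        output: torch.Tensor,
        fused_expert_output: torch.Tensor,
        topk_weights: torch.Tensor,
        topk_ids: torch.Tensor,
        apply_router_weight_on_input: bool,
        weight_and_reduce_impl: "TopKWeightAndReduce",
    ) -> None:
        # Implementations reduce/scatter fused_expert_output into `output`
        # in place; nothing is returned.
        raise NotImplementedError
```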
Changed back. Thanks.
This pull request has merge conflicts that must be resolved before it can be merged.
@jeejeelee I have resolved the conflicts. Thanks a lot!
@varun-sundar-rabindranath I have rerun the gpt-oss eval numbers, but I had to set PIECEWISE mode; fixing the !!! issue is beyond the scope of this PR. Thanks!
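For context, one way to pin the cudagraph mode to PIECEWISE for such an eval run (a sketch; the exact config key and accepted values depend on the vLLM version, and the model name is only an example):

```python
from vllm import LLM

# Request piecewise CUDA graphs instead of the FULL_AND_PIECEWISE default
# mentioned later in this thread.
llm = LLM(
    model="openai/gpt-oss-20b",  # example model, not necessarily the one evaluated
    enable_lora=True,
    compilation_config={"cudagraph_mode": "PIECEWISE"},
)
```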
@xyang16 All tests pass correctly, but I realized that the GPT-OSS LoRA test does not select the Triton backend. Could you verify whether the Triton backend can pass correctly?
@jeejeelee Thanks a lot! Triton backend can pass with slight modification of
@xyang16 Any plan to fix this issue? If we have one, I think we can land this PR now.
@jeejeelee Yes, I think the issue should be fixed soon. I talked about this issue with @varun-sundar-rabindranath, and he suggested I create an issue for it and see how other people think it should be fixed. Maybe the quickest fix is to not make FULL_AND_PIECEWISE the default (FULL_AND_PIECEWISE became the default starting from 0.11.0).
jeejeelee left a comment:
Thank you for the contribution.
Reverted by #29697, please redo
Purpose
This PR adds support for the FusedMoE LoRA Triton kernel for mxfp4 models.
The scatter in matmul_ogs does y[dst_indx // n_expts_act, :] += x[src_indx, :], i.e. it scatter-sums across the top-k experts and collapses M * topk rows down to M rows. Therefore, we need to set routing_data.n_expts_act to 1 so it doesn't sum across multiple experts, which unfuses moe_sum from the second matmul_ogs.
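A small, self-contained illustration of that scatter behaviour in plain PyTorch (not the actual Triton kernel; the identity gather/scatter indices are chosen only to keep the example short):

```python
import torch

M, topk, K = 2, 2, 4  # 2 tokens, top-2 experts, hidden size 4
x = torch.arange(M * topk * K, dtype=torch.float32).reshape(M * topk, K)
src_indx = torch.arange(M * topk)  # identity gather for simplicity
dst_indx = torch.arange(M * topk)

# n_expts_act = topk: dst_indx // topk maps each token's topk rows onto the
# same output row, so the scatter also performs moe_sum -> shape [M, K].
fused = torch.zeros(M, K)
fused.index_add_(0, dst_indx // topk, x[src_indx])

# n_expts_act = 1: dst_indx // 1 keeps M * topk distinct rows, so nothing is
# summed across experts and moe_sum stays unfused -> shape [M * topk, K].
unfused = torch.zeros(M * topk, K)
unfused.index_add_(0, dst_indx // 1, x[src_indx])

print(fused.shape)    # torch.Size([2, 4])
print(unfused.shape)  # torch.Size([4, 4])
```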
Test Plan

Test Result
Tests passed.
Benchmark
Baseline (marlin):
PR (triton):
Install triton_kernels
Accuracy Testing