QWEN3 Thinking Fused MoE kernels Optimization configs #24330
Conversation
Force-pushed from 8612a69 to 3916916.
Code Review
This pull request introduces optimized configurations for Fused MoE kernels, specifically for Qwen3 models on NVIDIA H100, B200, and GB200 GPUs. The changes consist of adding new and updating existing JSON configuration files with tuned parameters for fp8_w8a8 data types. The provided benchmark results demonstrate significant performance improvements in output token throughput and latency, which is a great contribution. The new configurations appear valid and consistent with the expected parameters for Triton kernels. Overall, this is a solid performance optimization.
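For context, each of these JSON files maps a batch size (M) to a set of Triton tile parameters for the fused MoE kernel. A minimal illustrative entry is sketched below; the batch sizes and values are placeholders, not the tuned numbers from this PR:

```json
{
  "1": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 128,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 1,
    "num_warps": 4,
    "num_stages": 3
  },
  "64": {
    "BLOCK_SIZE_M": 64,
    "BLOCK_SIZE_N": 128,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 8,
    "num_warps": 8,
    "num_stages": 4
  }
}
```

At runtime the kernel looks up the entry closest to the actual batch size, which is why tuning these tables per GPU (H100, B200, GB200) and per expert shape yields the throughput gains reported below.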
houseroad left a comment:
Let's only check in the B200 files?
houseroad left a comment:
Thanks for the tuned fused moe config!
Signed-off-by: Saman Keon <[email protected]>
Head branch was pushed to by a user without write access
Force-pushed from 1211e2e to 64c26a7.
…4330) Signed-off-by: Saman Keon <[email protected]>
…4330) Signed-off-by: Saman Keon <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
Purpose
Optimize the Fused MoE kernel configurations for the QWEN3 Thinking model.
We see a 13.7% improvement in Output token throughput (tok/s), a 12% improvement in Median TPOT (ms), and a 17% improvement in P99 TPOT (ms).
Test Plan
Server side:
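The exact commands are not preserved on this page; the sketch below is representative only, and the model name, tensor-parallel size, and port are assumptions rather than values from this PR:

```bash
# Hypothetical launch command: model name, TP size, and port are assumed
vllm serve Qwen/Qwen3-235B-A22B-Thinking-2507-FP8 \
  --tensor-parallel-size 8 \
  --port 8000
```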
Bench:
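A matching benchmark sketch, assuming the `vllm bench serve` subcommand and a random-prompt workload (all parameter values are assumptions):

```bash
# Hypothetical benchmark command: sequence lengths and prompt count are assumed
vllm bench serve \
  --model Qwen/Qwen3-235B-A22B-Thinking-2507-FP8 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 1024 \
  --num-prompts 200
```

This benchmark reports the output token throughput and the Median/P99 TPOT (time per output token) figures quoted in the results.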
Test Result
BEFORE:
AFTER:
We see a 13.7% improvement in Output token throughput (tok/s), a 12% improvement in Median TPOT (ms), and a 17% improvement in P99 TPOT (ms).