[Kernel][Performance] Tweak MoE Batched silu_mul_fp8_quant_deep_gemm kernel #21193
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add the ready label to the PR.
Code Review
This pull request introduces performance optimizations to the `silu_mul_fp8_quant_deep_gemm` Triton kernel. The changes involve switching from a manual `while` loop to `tl.range` to enable software pipelining, and tuning the `num_warps` and `NUM_STAGES` parameters.
The code modifications are correct and follow Triton best practices for performance. The provided micro-benchmarks demonstrate a significant performance improvement, which validates the tuning choices. The changes are well contained and improve the efficiency of the kernel as intended. I have no further comments.
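For illustration, here is a minimal, hedged sketch of the loop pattern the review describes: a `tl.range` loop with an explicit `num_stages`, plus `num_warps` as a launch-time knob. This is not the actual vLLM kernel (it omits the fp8 quantization and the batched-expert layout of the real `silu_mul_fp8_quant_deep_gemm`); the kernel name, `BLOCK` size, and grid are illustrative assumptions.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _silu_mul_sketch(x_ptr, out_ptr, n_elements,
                     BLOCK: tl.constexpr, NUM_STAGES: tl.constexpr):
    # Each program walks a strided range of tiles. Using tl.range instead of
    # a manual `while` loop exposes the loop structure to the compiler, and
    # num_stages lets it software-pipeline: the global loads for iteration
    # i+1 are issued while iteration i is still computing.
    num_tiles = tl.cdiv(n_elements, BLOCK)
    for tile in tl.range(tl.program_id(0), num_tiles, tl.num_programs(0),
                         num_stages=NUM_STAGES):
        offs = tile * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n_elements
        gate = tl.load(x_ptr + offs, mask=mask)             # first half: gate
        up = tl.load(x_ptr + n_elements + offs, mask=mask)  # second half: up
        tl.store(out_ptr + offs, gate * tl.sigmoid(gate) * up, mask=mask)


# num_warps is a launch-time option; NUM_STAGES is a constexpr argument.
n = 8192
x = torch.randn(2 * n, device="cuda")
out = torch.empty(n, device="cuda")
_silu_mul_sketch[(32,)](x, out, n, BLOCK=256, NUM_STAGES=4, num_warps=4)
```

With a manual `while` loop the compiler generally cannot pipeline across iterations; `tl.range` makes the prefetch depth an explicit, tunable parameter.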
Purpose
Tweak the `num_warps` and `NUM_STAGES` (number of pipeline stages for prefetching) values of the kernel.

Local micro-benchmark numbers:
main:
This PR:
Note: micro-benchmarking script from https://github.com/tlrmchlsmth/ptgq_fp8
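As a hedged sketch of how such a tuning sweep can be done (the script linked above is the one actually used; the `sweep` helper and value grids below are assumptions, and the kernel is the illustrative `_silu_mul_sketch` from the review comment above, not the vLLM kernel):

```python
import itertools

import torch
import triton


def sweep(n: int = 8192) -> None:
    """Time the illustrative kernel across a small grid of tuning knobs."""
    x = torch.randn(2 * n, device="cuda")
    out = torch.empty(n, device="cuda")
    for num_warps, num_stages in itertools.product((1, 2, 4, 8), (2, 3, 4)):
        ms = triton.testing.do_bench(
            lambda: _silu_mul_sketch[(32,)](
                x, out, n, BLOCK=256,
                NUM_STAGES=num_stages, num_warps=num_warps,
            )
        )
        print(f"num_warps={num_warps:>2} num_stages={num_stages}: {ms:.4f} ms")


sweep()
```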
E2E Perf
Server command:

```
VLLM_ALL2ALL_BACKEND="deepep_low_latency" VLLM_USE_DEEP_GEMM=1 canhazgpu run -g 2 -- vllm serve Qwen/Qwen3-30B-A3B-FP8 --trust-remote-code --enable-expert-parallel --data-parallel-size 2 --port 9010 --no-enable-prefix-caching
```

Benchmark command:

```
python3 ./benchmarks/benchmark_serving.py --model Qwen/Qwen3-30B-A3B-FP8 --dataset-name sharegpt --port 9010 --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json
```

Methodology: Start the server and execute the benchmark command 3 times. Report the best Total Token Throughput numbers.

main:

This PR:
Test Plan
Local testing:

```
pytest -s tests/kernels/moe/test_silu_mul_fp8_quant_deep_gemm.py
```

E2E testing:

Server command:

```
VLLM_ALL2ALL_BACKEND="deepep_low_latency" VLLM_USE_DEEP_GEMM=1 canhazgpu run -g 2 -- vllm serve Qwen/Qwen3-30B-A3B-FP8 --trust-remote-code --enable-expert-parallel --data-parallel-size 2 --port 9010 --no-enable-prefix-caching
```

lm_eval command:

```
lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-30B-A3B-FP8,base_url=http://127.0.0.1:9010/v1/completions,num_concurrent=30,max_retries=3 --limit 100
```

Test Result
`tests/kernels/moe/test_silu_mul_fp8_quant_deep_gemm.py` passes locally.

lm_eval output:
(Optional) Documentation Update