
Conversation

@samanamp (Contributor) commented Sep 5, 2025

Purpose

Add optimized fused MoE kernel tuning configs for Qwen3 Thinking (Qwen3-235B-A22B-Thinking-2507-FP8).

We see a 13.7% improvement in output token throughput (tok/s), a 12% improvement in median TPOT (ms), and a 17% improvement in P99 TPOT (ms).
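For reference, the added files follow the per-batch-size fused MoE config format: each file maps a token-batch size M to a set of Triton launch parameters for the fused MoE kernel. A minimal sketch of that shape is below; the values are placeholders, not the actual tuned numbers from this PR.

# Illustrative config shape only; the values below are placeholders, not the
# tuned parameters added in this PR. Keys are token-batch sizes M, and each
# entry holds the Triton launch parameters chosen for that M.
example_moe_config = {
    "1":    {"BLOCK_SIZE_M": 16,  "BLOCK_SIZE_N": 64,  "BLOCK_SIZE_K": 128,
             "GROUP_SIZE_M": 1,  "num_warps": 4, "num_stages": 3},
    "64":   {"BLOCK_SIZE_M": 32,  "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
             "GROUP_SIZE_M": 8,  "num_warps": 4, "num_stages": 4},
    "4096": {"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 256, "BLOCK_SIZE_K": 64,
             "GROUP_SIZE_M": 16, "num_warps": 8, "num_stages": 4},
}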

Test Plan

  1. Single benchmark
MODEL=Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
python benchmarks/kernels/benchmark_moe.py --model $MODEL --dtype "fp8_w8a8" --tp 4
  2. Model E2E
    Server side:
MODEL=Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
python -m vllm.entrypoints.openai.api_server --model $MODEL --enable-expert-parallel --max-model-len=65000 -tp 4 --port 8101 --gpu_memory_utilization=0.95

Bench:

python benchmarks/benchmark_serving.py --backend vllm --model $MODEL --dataset-name random --random-input-len 7500 --random-output-len 7500 --max-concurrency 64 --num-prompts 128 --port 8101
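As context for how per-M configs like the ones added here are consumed, below is a simplified, illustrative sketch: at runtime the kernel uses the tuned entry whose batch-size key is closest to the current number of tokens. The helper names are hypothetical, not vLLM's actual API.

import json
from pathlib import Path

def load_moe_config(path: Path) -> dict[int, dict]:
    # Hypothetical helper: load a tuned config JSON and key it by integer batch size M.
    with path.open() as f:
        return {int(m): params for m, params in json.load(f).items()}

def pick_kernel_params(configs: dict[int, dict], num_tokens: int) -> dict:
    # Pick the tuned Triton launch parameters for the closest batch size M.
    nearest_m = min(configs, key=lambda m: abs(m - num_tokens))
    return configs[nearest_m]

# Example usage with a hypothetical file name; real config file names encode
# the expert count, intermediate size, device name, and dtype.
# configs = load_moe_config(Path("E=...,N=...,device_name=...,dtype=fp8_w8a8.json"))
# params = pick_kernel_params(configs, num_tokens=256)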

Test Result

  1. Single benchmark
(Two screenshots of the kernel benchmark results omitted.)
  2. Model E2E

BEFORE:

============ Serving Benchmark Result ============                                                                                                                                           
Successful requests:                     128                                                                                                                                                 
Maximum request concurrency:             64                                                                                                                                                  
Benchmark duration (s):                  576.22                                                                                                                                              
Total input tokens:                      959630                                                                                                                                              
Total generated tokens:                  960000                                                                                                                                              
Request throughput (req/s):              0.22                                                                                                                                                
Output token throughput (tok/s):         1666.03                                                                                                                                             
Total Token throughput (tok/s):          3331.42                                                                                                                                             
---------------Time to First Token----------------                                                                                                                                           
Mean TTFT (ms):                          6815.24                                                                                                                                             
Median TTFT (ms):                        1664.75                                                                                                                                             
P99 TTFT (ms):                           23540.15                                                                                                                                            
-----Time per Output Token (excl. 1st token)------                                                                                                                                           
Mean TPOT (ms):                          37.46                                                                                                                                               
Median TPOT (ms):                        38.16                                                                                                                                               
P99 TPOT (ms):                           40.42                                                                                                                                               
---------------Inter-token Latency----------------                                                                                                                                           
Mean ITL (ms):                           37.46                                                                                                                                               
Median ITL (ms):                         33.29                                                                                                                                               
P99 ITL (ms):                            136.71                                                                                                                                              
==================================================

AFTER:

============ Serving Benchmark Result ============
Successful requests:                     128        
Maximum request concurrency:             64         
Benchmark duration (s):                  506.50     
Total input tokens:                      959630     
Total generated tokens:                  960000     
Request throughput (req/s):              0.25       
Output token throughput (tok/s):         1895.36   
Total Token throughput (tok/s):          3789.99   
---------------Time to First Token----------------
Mean TTFT (ms):                          6768.80   
Median TTFT (ms):                        1666.67   
P99 TTFT (ms):                           23394.16  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          32.82      
Median TPOT (ms):                        33.53      
P99 TPOT (ms):                           33.65      
---------------Inter-token Latency----------------
Mean ITL (ms):                           32.82      
Median ITL (ms):                         30.64      
P99 ITL (ms):                            33.69      
==================================================

Overall: 13.7% higher output token throughput (tok/s), 12% lower median TPOT (ms), and 17% lower P99 TPOT (ms).


Essential Elements of an Effective PR Description Checklist
  • [x] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [x] The test plan, such as providing test command.
  • [x] The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the qwen label Sep 5, 2025
@samanamp samanamp force-pushed the qwen-thinking-kernel-configs branch from 8612a69 to 3916916 on September 5, 2025 16:14
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces optimized configurations for Fused MoE kernels, specifically for Qwen3 models on NVIDIA H100, B200, and GB200 GPUs. The changes consist of adding new and updating existing JSON configuration files with tuned parameters for fp8_w8a8 data types. The provided benchmark results demonstrate significant performance improvements in output token throughput and latency, which is a great contribution. The new configurations appear valid and consistent with the expected parameters for Triton kernels. Overall, this is a solid performance optimization.

@houseroad (Collaborator) left a comment

Let's only check in the B200 files?

@houseroad (Collaborator) left a comment

Thanks for the tuned fused moe config!

@houseroad houseroad added the performance, moe, and ready labels Sep 5, 2025
@houseroad houseroad enabled auto-merge (squash) September 5, 2025 16:33
auto-merge was automatically disabled September 5, 2025 19:27

Head branch was pushed to by a user without write access

@samanamp samanamp force-pushed the qwen-thinking-kernel-configs branch from 1211e2e to 64c26a7 on September 5, 2025 19:27
@houseroad houseroad enabled auto-merge (squash) September 6, 2025 15:57
@houseroad houseroad merged commit 7533495 into vllm-project:main Sep 7, 2025
41 checks passed
@samanamp samanamp deleted the qwen-thinking-kernel-configs branch September 7, 2025 14:56
eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025