
Conversation

@samanamp (Contributor) commented Sep 5, 2025

Purpose

Add optimized fused MoE kernel tuning configs for Qwen3 Thinking (Qwen3-235B-A22B-Thinking-2507-FP8).

We see a 13.7% improvement in output token throughput (tok/s), a 12% improvement in median TPOT (ms), and a 17% improvement in P99 TPOT (ms).
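For reference, the added files follow the per-batch-size fused MoE config format: each file maps a token-batch size M to a set of Triton launch parameters for the fused MoE kernel. A minimal sketch of that shape is below; the values are placeholders, not the actual tuned numbers from this PR.

# Illustrative config shape only; the values below are placeholders, not the
# tuned parameters added in this PR. Keys are token-batch sizes M, and each
# entry holds the Triton launch parameters chosen for that M.
example_moe_config = {
    "1":    {"BLOCK_SIZE_M": 16,  "BLOCK_SIZE_N": 64,  "BLOCK_SIZE_K": 128,
             "GROUP_SIZE_M": 1,  "num_warps": 4, "num_stages": 3},
    "64":   {"BLOCK_SIZE_M": 32,  "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
             "GROUP_SIZE_M": 8,  "num_warps": 4, "num_stages": 4},
    "4096": {"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 256, "BLOCK_SIZE_K": 64,
             "GROUP_SIZE_M": 16, "num_warps": 8, "num_stages": 4},
}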

Test Plan

  1. Single benchmark
MODEL=Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
python benchmarks/kernels/benchmark_moe.py --model $MODEL --dtype "fp8_w8a8" --tp 4
  2. Model E2E
    Server side:
MODEL=Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
python -m vllm.entrypoints.openai.api_server --model $MODEL --enable-expert-parallel --max-model-len=65000 -tp 4 --port 8101 --gpu_memory_utilization=0.95

Bench:

python benchmarks/benchmark_serving.py --backend vllm --model $MODEL --dataset-name random --random-input-len 7500 --random-output-len 7500 --max-concurrency 64 --num-prompts 128 --port 8101
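As context for how per-M configs like the ones added here are consumed, below is a simplified, illustrative sketch: at runtime the kernel uses the tuned entry whose batch-size key is closest to the current number of tokens. The helper names are hypothetical, not vLLM's actual API.

import json
from pathlib import Path

def load_moe_config(path: Path) -> dict[int, dict]:
    # Hypothetical helper: load a tuned config JSON and key it by integer batch size M.
    with path.open() as f:
        return {int(m): params for m, params in json.load(f).items()}

def pick_kernel_params(configs: dict[int, dict], num_tokens: int) -> dict:
    # Pick the tuned Triton launch parameters for the closest batch size M.
    nearest_m = min(configs, key=lambda m: abs(m - num_tokens))
    return configs[nearest_m]

# Example usage with a hypothetical file name; real config file names encode
# the expert count, intermediate size, device name, and dtype.
# configs = load_moe_config(Path("E=...,N=...,device_name=...,dtype=fp8_w8a8.json"))
# params = pick_kernel_params(configs, num_tokens=256)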

Test Result

  1. Single benchmark
(Two screenshots of the kernel benchmark results omitted.)
  2. Model E2E

BEFORE:

============ Serving Benchmark Result ============                                                                                                                                           
Successful requests:                     128                                                                                                                                                 
Maximum request concurrency:             64                                                                                                                                                  
Benchmark duration (s):                  576.22                                                                                                                                              
Total input tokens:                      959630                                                                                                                                              
Total generated tokens:                  960000                                                                                                                                              
Request throughput (req/s):              0.22                                                                                                                                                
Output token throughput (tok/s):         1666.03                                                                                                                                             
Total Token throughput (tok/s):          3331.42                                                                                                                                             
---------------Time to First Token----------------                                                                                                                                           
Mean TTFT (ms):                          6815.24                                                                                                                                             
Median TTFT (ms):                        1664.75                                                                                                                                             
P99 TTFT (ms):                           23540.15                                                                                                                                            
-----Time per Output Token (excl. 1st token)------                                                                                                                                           
Mean TPOT (ms):                          37.46                                                                                                                                               
Median TPOT (ms):                        38.16                                                                                                                                               
P99 TPOT (ms):                           40.42                                                                                                                                               
---------------Inter-token Latency----------------                                                                                                                                           
Mean ITL (ms):                           37.46                                                                                                                                               
Median ITL (ms):                         33.29                                                                                                                                               
P99 ITL (ms):                            136.71                                                                                                                                              
==================================================

AFTER:

============ Serving Benchmark Result ============
Successful requests:                     128        
Maximum request concurrency:             64         
Benchmark duration (s):                  506.50     
Total input tokens:                      959630     
Total generated tokens:                  960000     
Request throughput (req/s):              0.25       
Output token throughput (tok/s):         1895.36   
Total Token throughput (tok/s):          3789.99   
---------------Time to First Token----------------
Mean TTFT (ms):                          6768.80   
Median TTFT (ms):                        1666.67   
P99 TTFT (ms):                           23394.16  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          32.82      
Median TPOT (ms):                        33.53      
P99 TPOT (ms):                           33.65      
---------------Inter-token Latency----------------
Mean ITL (ms):                           32.82      
Median ITL (ms):                         30.64      
P99 ITL (ms):                            33.69      
==================================================

Overall: 13.7% higher output token throughput (tok/s), 12% lower median TPOT (ms), and 17% lower P99 TPOT (ms).


Essential Elements of an Effective PR Description Checklist
  • [x] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [x] The test plan, such as providing test command.
  • [x] The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the qwen label Sep 5, 2025
@samanamp samanamp force-pushed the qwen-thinking-kernel-configs branch from 8612a69 to 3916916 on September 5, 2025 16:14
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces optimized configurations for Fused MoE kernels, specifically for Qwen3 models on NVIDIA H100, B200, and GB200 GPUs. The changes consist of adding new and updating existing JSON configuration files with tuned parameters for fp8_w8a8 data types. The provided benchmark results demonstrate significant performance improvements in output token throughput and latency, which is a great contribution. The new configurations appear valid and consistent with the expected parameters for Triton kernels. Overall, this is a solid performance optimization.

@houseroad (Collaborator) left a comment

Let's only check in the B200 files?

@houseroad (Collaborator) left a comment

Thanks for the tuned fused moe config!

@houseroad houseroad added the performance, moe, and ready labels Sep 5, 2025
@houseroad houseroad enabled auto-merge (squash) September 5, 2025 16:33
auto-merge was automatically disabled September 5, 2025 19:27

Head branch was pushed to by a user without write access

@samanamp samanamp force-pushed the qwen-thinking-kernel-configs branch from 1211e2e to 64c26a7 on September 5, 2025 19:27
@houseroad houseroad enabled auto-merge (squash) September 6, 2025 15:57
@houseroad houseroad merged commit 7533495 into vllm-project:main Sep 7, 2025
41 checks passed
@samanamp samanamp deleted the qwen-thinking-kernel-configs branch September 7, 2025 14:56
eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025