
Conversation

Contributor

@jiahanc jiahanc commented Nov 12, 2025

Purpose

Fixes #28007

  • Add support for multiple routing methods to the FlashInfer FP4 TRT-LLM MoE kernel, so that models like Qwen3 that use renormalize routing are handled correctly
  • Add the FlashInfer TRT-LLM MoE backend to the global_sf list, where it was previously missing
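The routing change referenced in the first bullet can be illustrated in plain Python. This is a hypothetical sketch of what "renormalize" routing means (top-k first, then softmax over only the selected experts), not vLLM's or FlashInfer's actual kernel code; the function name is invented for illustration:

```python
import math

def renormalize_topk(logits, k):
    """Qwen3-style "renormalize" routing: select the k highest-scoring
    experts first, then softmax over only those selected logits so the
    returned expert weights sum to 1. (Default routing instead applies
    softmax over all experts before taking the top-k.)"""
    # Indices of the k largest logits
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, with max-subtraction for stability
    m = max(logits[i] for i in idx)
    exps = [math.exp(logits[i] - m) for i in idx]
    total = sum(exps)
    return {i: e / total for i, e in zip(idx, exps)}

# Example: 4 experts, top-2 routing
weights = renormalize_topk([0.1, 2.0, -1.0, 1.0], k=2)
# experts 1 and 3 are selected; their weights sum to 1
```

The practical consequence is that the selected experts' weights always sum to 1 regardless of how much probability mass the unselected experts held, which is what the fused TRT-LLM MoE path needs to match Qwen3's reference routing.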

Test Plan

VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_FLASHINFER_MOE_BACKEND=latency vllm serve nvidia/Qwen3-235B-A22B-FP4   --max-num-batched-tokens 8192     --max-model-len 16384     --no-enable-prefix-caching     --cuda_graph_sizes 1024     --async-scheduling  -tp 2   --enable-expert-parallel
lm_eval --model local-completions --tasks gsm8k --model_args model=nvidia/Qwen3-235B-A22B-FP4,base_url=http://0.0.0.0:8000/v1/completions,max_retries=3,tokenized_requests=False,timeout=1200,max_gen_toks=2048,max_length=8192 --batch_size 2048 --trust_remote_code --limit 0.5

Test Result

[2025-11-12 20:50:33] INFO evaluation_tracker.py:280: Output path not provided, skipping saving results aggregated
local-completions (model=nvidia/Qwen3-235B-A22B-FP4,base_url=http://0.0.0.0:8000/v1/completions,max_retries=3,tokenized_requests=False,timeout=1200,max_gen_toks=2048,max_length=8192,trust_remote_code=True), gen_kwargs: (None), limit: 0.5, num_fewshot: None, batch_size: 2048
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9348|±  |0.0096|
|     |       |strict-match    |     5|exact_match|↑  |0.9348|±  |0.0096|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the frontend and performance (Performance-related issues) labels Nov 12, 2025
@mergify mergify bot added the nvidia label Nov 13, 2025
@jiahanc jiahanc marked this pull request as ready for review November 13, 2025 17:14
@jiahanc jiahanc requested a review from pavanimajety November 13, 2025 17:16
@jiahanc jiahanc changed the title [Performance] update nvfp4 code to support renorm routing [Performance][Fix] update nvfp4 code to support renorm routing Nov 13, 2025
Member

@yewentao256 yewentao256 left a comment


Thanks for the work!

Collaborator

@pavanimajety pavanimajety left a comment


Thanks for the fix, LGTM.

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Nov 13, 2025
@pavanimajety pavanimajety added the ready label (ONLY add when PR is ready to merge/full CI is needed) Nov 13, 2025
Member

@mgoin mgoin left a comment


LGTM, thank you

@mgoin mgoin enabled auto-merge (squash) November 14, 2025 17:46
@pavanimajety pavanimajety enabled auto-merge (squash) November 15, 2025 04:55
@vllm-bot vllm-bot merged commit 561253b into vllm-project:main Nov 17, 2025
52 of 53 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Nov 17, 2025
bwasti pushed a commit to bwasti/vllm that referenced this pull request Nov 17, 2025
bringlein pushed a commit to bringlein/vllm that referenced this pull request Nov 26, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
charlotte12l pushed a commit to charlotte12l/vllm that referenced this pull request Dec 5, 2025
Zhathw pushed a commit to Zhathw/vllm that referenced this pull request Dec 6, 2025

Labels

  • frontend
  • nvidia
  • performance (Performance-related issues)
  • ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug]: Can't run Flashinfer MoE TRTLLM backend FP4 for Qwen3 235B

5 participants