
Conversation


@jiahanc jiahanc commented Oct 24, 2025

Purpose

  • Integrate multiple routing methods for the FP8 flashinfer TRTLLM MoE path (currently only DS and Llama4 are supported)
  • Add FP8 flashinfer TRTLLM MoE support for Qwen3 and Qwen3-Next
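For reviewers unfamiliar with the routing split: each MoE architecture scores and selects experts differently, so the kernel needs to know which routing variant to apply. The sketch below is illustrative only, not the vLLM/flashinfer API; `RoutingMethodType` and `renormalize_topk` are hypothetical names, and only the softmax-top-k-renormalize variant (the Qwen3-style one, as I understand it) is filled in. DS-style routing (grouped top-k over sigmoid scores) would be a separate branch.

```python
import math
from enum import Enum, auto


class RoutingMethodType(Enum):
    """Hypothetical stand-in for the per-architecture routing variants."""
    DEEPSEEK = auto()      # grouped top-k over sigmoid scores (DS-style)
    LLAMA4 = auto()
    RENORMALIZE = auto()   # softmax top-k, renormalized (Qwen3-style)


def renormalize_topk(logits, top_k):
    """Softmax over router logits, keep top-k experts, renormalize weights.

    Returns (expert_ids, expert_weights), weights summing to 1.
    """
    # Numerically stable softmax over the router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the top_k highest-probability experts.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ids = ranked[:top_k]
    kept = [probs[i] for i in ids]
    # Renormalize the kept weights so they sum to 1.
    s = sum(kept)
    weights = [w / s for w in kept]
    return ids, weights
```

The actual kernels take a routing-method identifier so the fused MoE can apply the right scoring in one pass instead of doing it in Python.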

Test Plan

Qwen3-Next-80B-A3B-Instruct-FP8 on 2xB200 TP2

VLLM_USE_FLASHINFER_MOE_FP8=1 VLLM_FLASHINFER_MOE_BACKEND=latency VLLM_USE_DEEP_GEMM=0 VLLM_USE_TRTLLM_ATTENTION=0 VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
    --max-num-batched-tokens 8192 \
    --max-model-len 16384 \
    --no-enable-prefix-caching \
    --async-scheduling \
    --compilation_config.pass_config.enable_fi_allreduce_fusion true \
    --compilation_config.pass_config.enable_noop true \
    --compilation_config.cudagraph_mode FULL_DECODE_ONLY \
    --compilation_config.splitting_ops [] \
    -tp 2 
lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-Next-80B-A3B-Instruct-FP8,base_url=http://0.0.0.0:8000/v1/completions,max_retries=3,tokenized_requests=False,timeout=1200,max_gen_toks=2048,max_length=8192 --batch_size 2048 --trust_remote_code --limit 0.5

Qwen3-30B-A3B-Instruct-2507-FP8 on 2xB200 TP2

VLLM_USE_FLASHINFER_MOE_FP8=1 VLLM_FLASHINFER_MOE_BACKEND=latency VLLM_USE_DEEP_GEMM=0 vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 \
    --max-num-batched-tokens 8192 \
    --max-model-len 16384 \
    --no-enable-prefix-caching \
    --async-scheduling \
    --compilation_config.pass_config.enable_fi_allreduce_fusion true \
    --compilation_config.pass_config.enable_noop true \
    --compilation_config.cudagraph_mode FULL_DECODE_ONLY \
    --compilation_config.splitting_ops [] \
    -tp 2
lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-30B-A3B-Instruct-2507-FP8,base_url=http://0.0.0.0:8000/v1/completions,max_retries=3,tokenized_requests=False,timeout=1200,max_gen_toks=2048,max_length=8192 --batch_size 2048 --trust_remote_code --limit 0.5
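The lm_eval invocations above go through the server's OpenAI-compatible /v1/completions endpoint. Before launching the full gsm8k run, a quick smoke test against the same endpoint can confirm the server is up; this is a minimal sketch assuming the base URL and model name from the serve commands above (the prompt is arbitrary):

```python
import json
import urllib.request

# Matches the serve commands above.
BASE_URL = "http://0.0.0.0:8000/v1/completions"


def build_payload(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Minimal OpenAI-compatible completion request body."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }


def smoke_test(model: str) -> str:
    """POST one completion request and return the generated text."""
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_payload(model, "2 + 2 =", 8)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["choices"][0]["text"]
```

Call `smoke_test("Qwen/Qwen3-Next-80B-A3B-Instruct-FP8")` with the server running; a sane numeric continuation means the endpoint is ready for the benchmark.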

Test Result

Qwen3-Next-80B-A3B-Instruct-FP8

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9515|±  |0.0084|
|     |       |strict-match    |     5|exact_match|↑  |0.9197|±  |0.0106|

Qwen/Qwen3-30B-A3B-Instruct-2507-FP8

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9379|±  |0.0094|
|     |       |strict-match    |     5|exact_match|↑  |0.9364|±  |0.0095|


@jiahanc jiahanc changed the title [Performance] Support flashinfer TRTLLM MOE on Qwen3 and Qwen3next [Performance] Support flashinfer FP8 TRTLLM MOE on Qwen3 and Qwen3next Oct 24, 2025
@mergify mergify bot added the qwen Related to Qwen models label Oct 24, 2025
@jiahanc jiahanc changed the title [Performance] Support flashinfer FP8 TRTLLM MOE on Qwen3 and Qwen3next [Performance] Support flashinfer FP8 TRTLLM MOE on Qwen3 and Qwen-3next Oct 24, 2025
@jiahanc jiahanc changed the title [Performance] Support flashinfer FP8 TRTLLM MOE on Qwen3 and Qwen-3next [Performance] Support FP8 flashinfer TRTLLM MOE on Qwen3 and Qwen-3next Oct 24, 2025
@jiahanc jiahanc force-pushed the qwen3next_trtllmgen_moe branch from 9aaf36c to aa947da Compare October 24, 2025 23:53
@jiahanc jiahanc force-pushed the qwen3next_trtllmgen_moe branch from c3863df to 15b457c Compare October 28, 2025 17:15

mergify bot commented Oct 29, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jiahanc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 29, 2025
@jiahanc jiahanc force-pushed the qwen3next_trtllmgen_moe branch from 15b457c to fccb4d0 Compare October 29, 2025 21:10
@mergify mergify bot removed the needs-rebase label Oct 29, 2025
@jiahanc jiahanc marked this pull request as ready for review October 30, 2025 16:50
@mergify mergify bot added the ci/build label Oct 30, 2025

jiahanc commented Oct 30, 2025

Qwen3-Next-80B-A3B-Instruct-FP8 on 1xB200 1k/1k benchmark
[benchmark chart attached as image]

@jiahanc jiahanc force-pushed the qwen3next_trtllmgen_moe branch from 08dcd1b to 2b9022e Compare October 30, 2025 17:00

jiahanc commented Oct 30, 2025

@mgoin @pavanimajety could you help review this PR?
The pre-commit failure is unrelated to any file changed here; it may have been introduced in another PR.


mxz297 commented Oct 31, 2025

If this PR is merged, can vllm still run with an older flashinfer? We are internally just upgrading to flashinfer nightly-v0.4.1-20251027, and this PR seems to bump the flashinfer version again. Is it possible to keep backward compatibility with older flashinfer versions?

cc @houseroad @yeqcharlotte


jiahanc commented Oct 31, 2025

> If this PR is merged, can vllm still run with an older flashinfer? We are internally just upgrading to flashinfer nightly-v0.4.1-20251027, and this PR seems to bump the flashinfer version again. Is it possible to keep backward compatibility with older flashinfer versions?
>
> cc @houseroad @yeqcharlotte

Hi @mxz297,
There is no API change compared to v0.4.1. :)


pavanimajety commented Nov 7, 2025

@alexm-redhat / @mgoin Could you please review? Thanks!


@pavanimajety pavanimajety left a comment


LGTM, thanks for the PR


jiahanc commented Nov 7, 2025

There is already a PR updating the FlashInfer version: #27952.
Merge #27952 first and I will rebase this PR.

@jiahanc jiahanc force-pushed the qwen3next_trtllmgen_moe branch from 5e99086 to 1b3d32c Compare November 8, 2025 00:33

jiahanc commented Nov 8, 2025

> There is already a PR updating the FlashInfer version: #27952. Merge #27952 first and I will rebase this PR.

Rebased. Ready to merge 😄

@jiahanc jiahanc force-pushed the qwen3next_trtllmgen_moe branch from 1b3d32c to 09dd654 Compare November 8, 2025 00:35
Signed-off-by: jiahanc <[email protected]>
@jiahanc jiahanc force-pushed the qwen3next_trtllmgen_moe branch from 09dd654 to ec5ba87 Compare November 8, 2025 02:19

@mgoin mgoin left a comment


LGTM to get this in now. We should open an issue to use RoutingMethod more broadly.

@mgoin mgoin merged commit 34553b9 into vllm-project:main Nov 10, 2025
64 checks passed
@mgoin mgoin added this to NVIDIA Nov 11, 2025
@mgoin mgoin moved this to Done in NVIDIA Nov 11, 2025
@mgoin mgoin added the nvidia label Nov 11, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Nov 13, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

Labels

ci/build · nvidia · qwen · ready

Projects

Status: Done
