
Conversation


@jiahanc jiahanc commented Oct 24, 2025

Purpose

  • Integrate multiple routing methods for the FP8 flashinfer TRTLLM MoE path (currently only DS and Llama4 are supported)
  • Add FP8 flashinfer TRTLLM MoE support for Qwen3 and Qwen3-Next
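For reviewers unfamiliar with the routing split: each MoE architecture scores and selects experts differently, so the kernel needs to know which routing variant to apply. The sketch below is illustrative only, not the vLLM/flashinfer API; `RoutingMethodType` and `renormalize_topk` are hypothetical names, and only the softmax-top-k-renormalize variant (the Qwen3-style one, as I understand it) is filled in. DS-style routing (grouped top-k over sigmoid scores) would be a separate branch.

```python
import math
from enum import Enum, auto


class RoutingMethodType(Enum):
    """Hypothetical stand-in for the per-architecture routing variants."""
    DEEPSEEK = auto()      # grouped top-k over sigmoid scores (DS-style)
    LLAMA4 = auto()
    RENORMALIZE = auto()   # softmax top-k, renormalized (Qwen3-style)


def renormalize_topk(logits, top_k):
    """Softmax over router logits, keep top-k experts, renormalize weights.

    Returns (expert_ids, expert_weights), weights summing to 1.
    """
    # Numerically stable softmax over the router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the top_k highest-probability experts.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ids = ranked[:top_k]
    kept = [probs[i] for i in ids]
    # Renormalize the kept weights so they sum to 1.
    s = sum(kept)
    weights = [w / s for w in kept]
    return ids, weights
```

The actual kernels take a routing-method identifier so the fused MoE can apply the right scoring in one pass instead of doing it in Python.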

Test Plan

Qwen3-Next-80B-A3B-Instruct-FP8 on 2xB200 TP2

VLLM_USE_FLASHINFER_MOE_FP8=1 VLLM_FLASHINFER_MOE_BACKEND=latency VLLM_USE_DEEP_GEMM=0 VLLM_USE_TRTLLM_ATTENTION=0 VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
    --max-num-batched-tokens 8192 \
    --max-model-len 16384 \
    --no-enable-prefix-caching \
    --async-scheduling \
    --compilation_config.pass_config.enable_fi_allreduce_fusion true \
    --compilation_config.pass_config.enable_noop true \
    --compilation_config.cudagraph_mode FULL_DECODE_ONLY \
    --compilation_config.splitting_ops [] \
    -tp 2 
lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-Next-80B-A3B-Instruct-FP8,base_url=http://0.0.0.0:8000/v1/completions,max_retries=3,tokenized_requests=False,timeout=1200,max_gen_toks=2048,max_length=8192 --batch_size 2048 --trust_remote_code --limit 0.5

Qwen3-30B-A3B-Instruct-2507-FP8 on 2xB200 TP2

VLLM_USE_FLASHINFER_MOE_FP8=1 VLLM_FLASHINFER_MOE_BACKEND=latency VLLM_USE_DEEP_GEMM=0 vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 \
    --max-num-batched-tokens 8192 \
    --max-model-len 16384 \
    --no-enable-prefix-caching \
    --async-scheduling \
    --compilation_config.pass_config.enable_fi_allreduce_fusion true \
    --compilation_config.pass_config.enable_noop true \
    --compilation_config.cudagraph_mode FULL_DECODE_ONLY \
    --compilation_config.splitting_ops [] \
    -tp 2
lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-30B-A3B-Instruct-2507-FP8,base_url=http://0.0.0.0:8000/v1/completions,max_retries=3,tokenized_requests=False,timeout=1200,max_gen_toks=2048,max_length=8192 --batch_size 2048 --trust_remote_code --limit 0.5
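The lm_eval invocations above go through the server's OpenAI-compatible /v1/completions endpoint. Before launching the full gsm8k run, a quick smoke test against the same endpoint can confirm the server is up; this is a minimal sketch assuming the base URL and model name from the serve commands above (the prompt is arbitrary):

```python
import json
import urllib.request

# Matches the serve commands above.
BASE_URL = "http://0.0.0.0:8000/v1/completions"


def build_payload(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Minimal OpenAI-compatible completion request body."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }


def smoke_test(model: str) -> str:
    """POST one completion request and return the generated text."""
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_payload(model, "2 + 2 =", 8)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["choices"][0]["text"]
```

Call `smoke_test("Qwen/Qwen3-Next-80B-A3B-Instruct-FP8")` with the server running; a sane numeric continuation means the endpoint is ready for the benchmark.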

Test Result

Qwen3-Next-80B-A3B-Instruct-FP8

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9515|±  |0.0084|
|     |       |strict-match    |     5|exact_match|↑  |0.9197|±  |0.0106|

Qwen/Qwen3-30B-A3B-Instruct-2507-FP8

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9379|±  |0.0094|
|     |       |strict-match    |     5|exact_match|↑  |0.9364|±  |0.0095|


@jiahanc jiahanc changed the title [Performance] Support flashinfer TRTLLM MOE on Qwen3 and Qwen3next [Performance] Support flashinfer FP8 TRTLLM MOE on Qwen3 and Qwen3next Oct 24, 2025
@mergify mergify bot added the qwen Related to Qwen models label Oct 24, 2025
@jiahanc jiahanc changed the title [Performance] Support flashinfer FP8 TRTLLM MOE on Qwen3 and Qwen3next [Performance] Support flashinfer FP8 TRTLLM MOE on Qwen3 and Qwen-3next Oct 24, 2025
@jiahanc jiahanc changed the title [Performance] Support flashinfer FP8 TRTLLM MOE on Qwen3 and Qwen-3next [Performance] Support FP8 flashinfer TRTLLM MOE on Qwen3 and Qwen-3next Oct 24, 2025
@jiahanc jiahanc force-pushed the qwen3next_trtllmgen_moe branch from 9aaf36c to aa947da Compare October 24, 2025 23:53
@jiahanc jiahanc force-pushed the qwen3next_trtllmgen_moe branch from c3863df to 15b457c Compare October 28, 2025 17:15

mergify bot commented Oct 29, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jiahanc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 29, 2025
@jiahanc jiahanc force-pushed the qwen3next_trtllmgen_moe branch from 15b457c to fccb4d0 Compare October 29, 2025 21:10
@mergify mergify bot removed the needs-rebase label Oct 29, 2025
@jiahanc jiahanc marked this pull request as ready for review October 30, 2025 16:50
@mergify mergify bot added the ci/build label Oct 30, 2025

jiahanc commented Oct 30, 2025

Qwen3-Next-80B-A3B-Instruct-FP8 on 1xB200 1k/1k benchmark
[benchmark chart attached as image]

@jiahanc jiahanc force-pushed the qwen3next_trtllmgen_moe branch from 08dcd1b to 2b9022e Compare October 30, 2025 17:00

jiahanc commented Oct 30, 2025

@mgoin @pavanimajety could you help review this PR?
The pre-commit failure is unrelated to any file changed here; it may have been introduced in another PR.


mxz297 commented Oct 31, 2025

If this PR is merged, can vllm still run with an older flashinfer? We are internally just upgrading to flashinfer nightly-v0.4.1-20251027, and this PR seems to bump the flashinfer version again. Is it possible to keep backward compatibility with older flashinfer versions?

cc @houseroad @yeqcharlotte


jiahanc commented Oct 31, 2025

> If this PR is merged, can vllm still run with an older flashinfer? We are internally just upgrading to flashinfer nightly-v0.4.1-20251027, and this PR seems to bump the flashinfer version again. Is it possible to keep backward compatibility with older flashinfer versions?
>
> cc @houseroad @yeqcharlotte

Hi @mxz297,
There is no API change compared to v0.4.1. :)


pavanimajety commented Nov 7, 2025

@alexm-redhat / @mgoin Could you please review? Thanks!


@pavanimajety pavanimajety left a comment


LGTM, thanks for the PR


jiahanc commented Nov 7, 2025

There is already a PR updating the FlashInfer version: #27952.
Merge #27952 first and I will rebase this PR.

@jiahanc jiahanc force-pushed the qwen3next_trtllmgen_moe branch from 5e99086 to 1b3d32c Compare November 8, 2025 00:33

jiahanc commented Nov 8, 2025

> There is already a PR updating the FlashInfer version: #27952. Merge #27952 first and I will rebase this PR.

Rebased. Ready to merge 😄

@jiahanc jiahanc force-pushed the qwen3next_trtllmgen_moe branch from 1b3d32c to 09dd654 Compare November 8, 2025 00:35
Signed-off-by: jiahanc <[email protected]>
@jiahanc jiahanc force-pushed the qwen3next_trtllmgen_moe branch from 09dd654 to ec5ba87 Compare November 8, 2025 02:19

@mgoin mgoin left a comment


LGTM to get this in now. We should open an issue to use RoutingMethod more broadly.

@mgoin mgoin merged commit 34553b9 into vllm-project:main Nov 10, 2025
64 checks passed
@mgoin mgoin added this to NVIDIA Nov 11, 2025
@mgoin mgoin moved this to Done in NVIDIA Nov 11, 2025
@mgoin mgoin added the nvidia label Nov 11, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Nov 13, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

Labels

ci/build · nvidia · qwen · ready

Projects

Status: Done
