
Conversation

@wenscarl
Contributor

@wenscarl wenscarl commented Oct 2, 2025

Purpose

This matches TRT-LLM usage: use all experts' input scales to compute alpha and quantize. This change doesn't affect perf.
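
For intuition, here is a minimal sketch of the shared-scale idea. Tensor names, shapes, and the alpha formula below are illustrative assumptions, not vLLM's actual parameters:

```python
import torch

# Hypothetical per-expert input scales, standing in for the
# w13_input_scale / w2_input_scale tensors in the checkpoint.
per_expert_input_scale = torch.tensor([0.9, 1.1, 1.0, 1.3])  # [num_experts]

# Use all experts' input scales: reduce with max so that no expert's
# activations overflow the shared quantization range.
global_input_scale = per_expert_input_scale.max()  # 1.3 for every expert

# alpha is then derived from the single global input scale together with
# the per-expert weight scales (the exact formula in the real kernels differs;
# this only shows that alpha depends on the shared scale).
per_expert_weight_scale_2 = torch.tensor([0.5, 0.6, 0.4, 0.7])
alpha = global_input_scale * per_expert_weight_scale_2  # [num_experts]
```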

Test Plan

VLLM_WORKER_MULTIPROC_METHOD="spawn" \
VLLM_USE_STANDALONE_COMPILE=0 \
VLLM_USE_FLASHINFER_MOE_FP4=1 \
VLLM_FLASHINFER_MOE_BACKEND="throughput" \
lm_eval --model vllm --model_args pretrained=nvidia/DeepSeek-R1-FP4,quantization=modelopt_fp4,data_parallel_size=8,enable_expert_parallel=True,tensor_parallel_size=1,max_model_len=2048,enforce_eager=True --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto

Test Result

After change:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9431|±  |0.0064|
|     |       |strict-match    |     5|exact_match|↑  |0.9401|±  |0.0065|

Before change:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9340|±  |0.0068|
|     |       |strict-match    |     5|exact_match|↑  |0.9310|±  |0.0070|
---
<details>
<summary> Essential Elements of an Effective PR Description Checklist </summary>

- [ ] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
- [ ] The test plan, such as providing test command.
- [ ] The test results, such as pasting the results comparison before and after, or e2e results
- [ ] (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.
- [ ] (Optional) Release notes update. If your change is user facing, please update the release notes draft in the [Google Doc](https://docs.google.com/document/d/1YyVqrgX4gHTtrstbq8oWUImOyPCKSGnJ7xtTpmXzlRs/edit?tab=t.0).
</details>

@wenscarl wenscarl changed the title from "Load w13/w2_input_scale for all experts" to "[ModelOpt] Load w13/w2_input_scale for all experts, nvfp4" Oct 3, 2025
@mergify

mergify bot commented Oct 7, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wenscarl.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 7, 2025
@wenscarl wenscarl marked this pull request as ready for review October 7, 2025 02:53

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@mergify mergify bot removed the needs-rebase label Oct 7, 2025
@wenscarl wenscarl force-pushed the nvfp4_glb_sf branch 2 times, most recently from 49543bd to cbfdd6d, October 14, 2025 03:54
@wenscarl wenscarl requested a review from pavanimajety October 17, 2025 14:53
@leejnau
Contributor

leejnau commented Oct 17, 2025

For the nvidia/DeepSeek-R1-0528-FP4-v2 model, in both TP4 and DP4 modes, with the FlashInfer backend, this PR raises the accuracy from ~2% to ~95%.

server (TP4):

VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_FLASHINFER_MOE_BACKEND="throughput" vllm serve nvidia/DeepSeek-R1-0528-FP4-v2 --quantization="modelopt_fp4" --trust-remote-code --gpu_memory_utilization=0.8 --tensor-parallel-size 4 --data-parallel-size 1

client:

python3 tests/evals/gsm8k/gsm8k_eval.py
- commit 29350922c64a808a6de3b0e31fbadc2aebd6ba3f (before this PR): Accuracy: 0.025
- commit 281be34de010fd4b106341e8aa3996f01f121c61 (with this PR): Accuracy: 0.955

server (DP4):

VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_FLASHINFER_MOE_BACKEND="throughput" vllm serve nvidia/DeepSeek-R1-0528-FP4-v2 --quantization="modelopt_fp4" --trust-remote-code --gpu_memory_utilization=0.8 --tensor-parallel-size 1 --data-parallel-size 4

client:

python3 tests/evals/gsm8k/gsm8k_eval.py
- commit 29350922c64a808a6de3b0e31fbadc2aebd6ba3f (before this PR): Accuracy: 0.022
- commit 281be34de010fd4b106341e8aa3996f01f121c61 (with this PR): Accuracy: 0.955

Signed-off-by: Shu Wang. <[email protected]>
@mgoin mgoin added bug Something isn't working quantization ready ONLY add when PR is ready to merge/full CI is needed deepseek Related to DeepSeek models labels Oct 20, 2025
Comment on lines 1639 to 1641
# Case input scale: input_scale loading is only supported for fp8
if "input_scale" in weight_name:
    # this is needed for compressed-tensors only


Nit: looks like these comments are out of date
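
Per the nit, a possible refresh of those stale comments might read as follows (illustrative wording only, not the merged code):

```python
# Case input_scale: per-expert input scales; with this PR they are also
# loaded for NVFP4 MoE, where FlashInfer backends reduce them to a single
# global scale shared by all experts.
if "input_scale" in weight_name:
    ...
```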

Comment on lines +1557 to +1565
allow_flashinfer = getattr(self.quant_method, "allow_flashinfer", False)
moe_backend = getattr(self.quant_method, "flashinfer_moe_backend", None)

use_global_sf = (
    allow_flashinfer
    and is_flashinfer_supporting_global_sf(moe_backend)
    and "input_scale" in weight_name
    and quant_method_name == "ModelOptNvFp4FusedMoE"
)


It would be good to put these three lines together and leave a comment on what use_global_sf means in this case, since we are in fused_moe/layer.py.
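
One way the suggestion could look, as a sketch rather than the committed code:

```python
allow_flashinfer = getattr(self.quant_method, "allow_flashinfer", False)
moe_backend = getattr(self.quant_method, "flashinfer_moe_backend", None)

# use_global_sf: for ModelOpt NVFP4 fused MoE on a FlashInfer backend that
# supports global scale factors, the per-expert input_scale tensors are
# replaced by one scale shared across all experts.
use_global_sf = (
    allow_flashinfer
    and is_flashinfer_supporting_global_sf(moe_backend)
    and "input_scale" in weight_name
    and quant_method_name == "ModelOptNvFp4FusedMoE"
)
```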


@mgoin mgoin left a comment


LGTM. Although checking for local attrs in fused_moe/layer.py doesn't feel good to keep doing, I don't have a better option atm.

@mgoin mgoin merged commit f95da13 into vllm-project:main Oct 21, 2025
57 checks passed
Zhuul pushed a commit to Zhuul/vllm that referenced this pull request Oct 21, 2025
baonudesifeizhai pushed a commit to baonudesifeizhai/vllm that referenced this pull request Oct 21, 2025
albertoperdomo2 pushed a commit to albertoperdomo2/vllm that referenced this pull request Oct 23, 2025
…ct#26135)

Signed-off-by: Shu Wang <[email protected]>
Signed-off-by: Shu Wang. <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Signed-off-by: Alberto Perdomo <[email protected]>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…ct#26135)

Signed-off-by: Shu Wang <[email protected]>
Signed-off-by: Shu Wang. <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Signed-off-by: 0xrushi <[email protected]>
Chenyaaang pushed a commit to Chenyaaang/vllm that referenced this pull request Oct 28, 2025
ilmarkov pushed a commit to neuralmagic/vllm that referenced this pull request Nov 7, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025