
[NVIDIA] Add support for cudnn fp4 gemm via flashinfer#26107

Merged
mgoin merged 5 commits into vllm-project:main from kaixih:add_flashinfer_cudnn_fp4_gemm
Oct 15, 2025
Conversation

@kaixih
Contributor

@kaixih kaixih commented Oct 2, 2025

Purpose

Add support for the cuDNN FP4 GEMM through FlashInfer.
This operator is primarily used in the shared expert and output projection layers (currently available only in nvidia/DeepSeek-R1-FP4-v2).
Preliminary latency benchmarks indicate an ~1% end-to-end performance improvement.

To enable this feature, users need to install the required dependencies and set the environment variable:

pip install nvidia-cudnn-cu12
pip install nvidia-cudnn-frontend
export VLLM_USE_CUDNN_FP4_GEMM=1
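As a preflight sanity check, the optional dependencies can be probed before enabling the flag. This is a minimal sketch, not vLLM's startup logic, and the import names (`cudnn` for nvidia-cudnn-frontend, `flashinfer`) are assumptions based on the packages above.

```python
# Preflight sketch (assumed import names, not vLLM code): only set the
# flag when the optional cuDNN/FlashInfer modules are importable.
import importlib.util
import os

def cudnn_fp4_deps_available() -> bool:
    # nvidia-cudnn-frontend is assumed to expose the `cudnn` module.
    return all(importlib.util.find_spec(m) is not None
               for m in ("cudnn", "flashinfer"))

if cudnn_fp4_deps_available():
    os.environ["VLLM_USE_CUDNN_FP4_GEMM"] = "1"
```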

Test Plan

Accuracy:

  export VLLM_WORKER_MULTIPROC_METHOD="spawn"
  export VLLM_USE_FLASHINFER_MOE_FP8="1"
  export VLLM_FLASHINFER_MOE_BACKEND="latency"
  export VLLM_USE_CUDNN_FP4_GEMM="1"

  model_dir="/model/nvidia-DeepSeek-R1-FP4-v2"
  model_args="pretrained=${model_dir},trust_remote_code=True,tensor_parallel_size=8,quantization=modelopt_fp4"
  lm_eval --model vllm --model_args $model_args \
    --gen_kwargs temperature=0.0 \
    --limit 1319 \
    --trust_remote_code \
    --tasks gsm8k \
    --num_fewshot 5 \
    --batch_size 200

Benchmark:

  cd /scratch/repo/vllm

  export VLLM_WORKER_MULTIPROC_METHOD="spawn"
  export VLLM_USE_CUDNN_FP4_GEMM=$2

  model_dir="/model/nvidia-DeepSeek-R1-FP4-v2"
  vllm bench throughput --model=$model_dir \
    --quantization=modelopt_fp4 \
    --trust_remote_code \
    --tensor-parallel-size=8 \
    --dataset-name=random \
    --input-len=1024 \
    --output-len=1024 \
    --num-prompts 2048 \
    --max-num-seqs 1024

Test Result

Accuracy:

# Before with cutlass gemm
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9484|±  |0.0061|
|     |       |strict-match    |     5|exact_match|↑  |0.9462|±  |0.0062|

# After with cudnn gemm
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9462|±  |0.0062|
|     |       |strict-match    |     5|exact_match|↑  |0.9454|±  |0.0063|
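The before/after scores overlap within the reported standard errors; a quick check using the flexible-extract numbers from the tables above:

```python
# Compare cutlass vs cudnn gsm8k flexible-extract accuracy against the
# reported standard errors (values copied from the tables above).
before, before_se = 0.9484, 0.0061
after, after_se = 0.9462, 0.0062

delta = abs(before - after)
# The 0.0022 gap is well inside the combined standard error.
print(delta <= before_se + after_se)  # True
```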

Benchmark:

# Before with cutlass gemm
Throughput: 5.09 requests/s, 10426.68 total tokens/s, 5213.34 output tokens/s
Total num prompt tokens:  2097152
Total num output tokens:  2097152

# After with cudnn gemm
Throughput: 5.13 requests/s, 10510.00 total tokens/s, 5255.00 output tokens/s
Total num prompt tokens:  2097152
Total num output tokens:  2097152
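The throughput figures above imply a sub-1% end-to-end gain, consistent with the ~1% estimate in the PR description:

```python
# Relative throughput improvement from the reported requests/s figures.
cutlass_rps = 5.09
cudnn_rps = 5.13

gain_pct = (cudnn_rps / cutlass_rps - 1) * 100
print(f"{gain_pct:.2f}%")  # 0.79%
```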

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for cuDNN FP4 GEMM via FlashInfer, which is a great addition for performance on NVIDIA GPUs. The changes are well-contained and primarily involve adding a new environment variable and the corresponding logic to select the cuDNN backend in ModelOptNvFp4LinearMethod.

My review focuses on ensuring the robustness of this new feature. I've identified a critical issue where a missing dependency check could lead to an unhandled runtime error. Adding an explicit check will improve user experience by providing a clear error message if the environment is not set up correctly.

Overall, the implementation is straightforward and the performance gains are a welcome improvement.

Comment on lines +855 to +856
if envs.VLLM_USE_CUDNN_FP4_GEMM:
    self.backend = "flashinfer-cudnn"

critical

To ensure robustness and provide clear error messages, it's important to verify that flashinfer is installed when VLLM_USE_CUDNN_FP4_GEMM is enabled. The current implementation will lead to an ImportError later in the apply method if flashinfer is missing. Adding an assertion here will fail-fast with a more informative message, which is consistent with the check for VLLM_USE_TRTLLM_FP4_GEMM.

Suggested change
  - if envs.VLLM_USE_CUDNN_FP4_GEMM:
  -     self.backend = "flashinfer-cudnn"
  + if envs.VLLM_USE_CUDNN_FP4_GEMM:
  +     assert has_flashinfer(), "CUDNN FP4 GEMM requires FlashInfer"
  +     self.backend = "flashinfer-cudnn"
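Standalone, the suggested fail-fast pattern might look like the sketch below; `has_flashinfer` is a stand-in probe here, not vLLM's actual helper, and the backend names follow the snippet above.

```python
# Sketch of the reviewer's fail-fast suggestion (stand-in helper, not vLLM code).
import importlib.util

def has_flashinfer() -> bool:
    return importlib.util.find_spec("flashinfer") is not None

def select_fp4_backend(use_cudnn: bool) -> str:
    if use_cudnn:
        # Assert up front so a missing dependency fails with a clear
        # message instead of an ImportError later in apply().
        assert has_flashinfer(), "CUDNN FP4 GEMM requires FlashInfer"
        return "flashinfer-cudnn"
    return "cutlass"

print(select_fp4_backend(False))  # cutlass
```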

@mergify

mergify bot commented Oct 6, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kaixih.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 6, 2025
@kaixih kaixih force-pushed the add_flashinfer_cudnn_fp4_gemm branch from 6080885 to 7601836 on October 6, 2025 20:24
@mergify mergify bot removed the needs-rebase label Oct 6, 2025
@kaixih kaixih force-pushed the add_flashinfer_cudnn_fp4_gemm branch 4 times, most recently from 6199104 to edd06d3 on October 6, 2025 21:15
@kaixih kaixih force-pushed the add_flashinfer_cudnn_fp4_gemm branch from edd06d3 to 4733d7c on October 6, 2025 22:18
@mergify

mergify bot commented Oct 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kaixih.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 8, 2025
@mergify mergify bot removed the needs-rebase label Oct 14, 2025
mgoin added 2 commits October 14, 2025 18:06
Signed-off-by: mgoin <[email protected]>
Signed-off-by: mgoin <[email protected]>
@mgoin mgoin added kernel ready ONLY add when PR is ready to merge/full CI is needed labels Oct 14, 2025
Collaborator

@pavanimajety pavanimajety left a comment


Thanks @kaixih and @mgoin, PR looks good to me and all the checks have passed

@mgoin mgoin merged commit de92d91 into vllm-project:main Oct 15, 2025
53 checks passed