
[NVIDIA] Add support for cudnn fp4 gemm via flashinfer#26107

Merged
mgoin merged 5 commits into vllm-project:main from kaixih:add_flashinfer_cudnn_fp4_gemm
Oct 15, 2025
Conversation

@kaixih
Contributor

@kaixih kaixih commented Oct 2, 2025

Purpose

Add support for the cuDNN FP4 GEMM through FlashInfer.
This operator is primarily used in the shared expert and output projection layers (currently available only in nvidia/DeepSeek-R1-FP4-v2).
Preliminary latency benchmarks indicate an ~1% end-to-end performance improvement.

To enable this feature, users need to install the required dependencies and set the environment variable:

pip install nvidia-cudnn-cu12
pip install nvidia-cudnn-frontend
export VLLM_USE_CUDNN_FP4_GEMM=1
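As a preflight sanity check, the optional dependencies can be probed before enabling the flag. This is a minimal sketch, not vLLM's startup logic, and the import names (`cudnn` for nvidia-cudnn-frontend, `flashinfer`) are assumptions based on the packages above.

```python
# Preflight sketch (assumed import names, not vLLM code): only set the
# flag when the optional cuDNN/FlashInfer modules are importable.
import importlib.util
import os

def cudnn_fp4_deps_available() -> bool:
    # nvidia-cudnn-frontend is assumed to expose the `cudnn` module.
    return all(importlib.util.find_spec(m) is not None
               for m in ("cudnn", "flashinfer"))

if cudnn_fp4_deps_available():
    os.environ["VLLM_USE_CUDNN_FP4_GEMM"] = "1"
```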

Test Plan

Accuracy:

  export VLLM_WORKER_MULTIPROC_METHOD="spawn"
  export VLLM_USE_FLASHINFER_MOE_FP8="1"
  export VLLM_FLASHINFER_MOE_BACKEND="latency"
  export VLLM_USE_CUDNN_FP4_GEMM="1"

  model_dir="/model/nvidia-DeepSeek-R1-FP4-v2"
  model_args="pretrained=${model_dir},trust_remote_code=True,tensor_parallel_size=8,quantization=modelopt_fp4"
  lm_eval --model vllm --model_args $model_args \
    --gen_kwargs temperature=0.0 \
    --limit 1319 \
    --trust_remote_code \
    --tasks gsm8k \
    --num_fewshot 5 \
    --batch_size 200

Benchmark:

  cd /scratch/repo/vllm

  export VLLM_WORKER_MULTIPROC_METHOD="spawn"
  export VLLM_USE_CUDNN_FP4_GEMM=$2

  model_dir="/model/nvidia-DeepSeek-R1-FP4-v2"
  vllm bench throughput --model=$model_dir \
    --quantization=modelopt_fp4 \
    --trust_remote_code \
    --tensor-parallel-size=8 \
    --dataset-name=random \
    --input-len=1024 \
    --output-len=1024 \
    --num-prompts 2048 \
    --max-num-seqs 1024

Test Result

Accuracy:

# Before with cutlass gemm
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9484|±  |0.0061|
|     |       |strict-match    |     5|exact_match|↑  |0.9462|±  |0.0062|

# After with cudnn gemm
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9462|±  |0.0062|
|     |       |strict-match    |     5|exact_match|↑  |0.9454|±  |0.0063|
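The before/after scores overlap within the reported standard errors; a quick check using the flexible-extract numbers from the tables above:

```python
# Compare cutlass vs cudnn gsm8k flexible-extract accuracy against the
# reported standard errors (values copied from the tables above).
before, before_se = 0.9484, 0.0061
after, after_se = 0.9462, 0.0062

delta = abs(before - after)
# The 0.0022 gap is well inside the combined standard error.
print(delta <= before_se + after_se)  # True
```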

Benchmark:

# Before with cutlass gemm
Throughput: 5.09 requests/s, 10426.68 total tokens/s, 5213.34 output tokens/s
Total num prompt tokens:  2097152
Total num output tokens:  2097152

# After with cudnn gemm
Throughput: 5.13 requests/s, 10510.00 total tokens/s, 5255.00 output tokens/s
Total num prompt tokens:  2097152
Total num output tokens:  2097152
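The throughput figures above imply a sub-1% end-to-end gain, consistent with the ~1% estimate in the PR description:

```python
# Relative throughput improvement from the reported requests/s figures.
cutlass_rps = 5.09
cudnn_rps = 5.13

gain_pct = (cudnn_rps / cutlass_rps - 1) * 100
print(f"{gain_pct:.2f}%")  # 0.79%
```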

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for cuDNN FP4 GEMM via FlashInfer, which is a great addition for performance on NVIDIA GPUs. The changes are well-contained and primarily involve adding a new environment variable and the corresponding logic to select the cuDNN backend in ModelOptNvFp4LinearMethod.

My review focuses on ensuring the robustness of this new feature. I've identified a critical issue where a missing dependency check could lead to an unhandled runtime error. Adding an explicit check will improve user experience by providing a clear error message if the environment is not set up correctly.

Overall, the implementation is straightforward and the performance gains are a welcome improvement.

Comment on lines +855 to +856
if envs.VLLM_USE_CUDNN_FP4_GEMM:
    self.backend = "flashinfer-cudnn"

critical

To ensure robustness and provide clear error messages, it's important to verify that flashinfer is installed when VLLM_USE_CUDNN_FP4_GEMM is enabled. The current implementation will lead to an ImportError later in the apply method if flashinfer is missing. Adding an assertion here will fail-fast with a more informative message, which is consistent with the check for VLLM_USE_TRTLLM_FP4_GEMM.

Suggested change
  - if envs.VLLM_USE_CUDNN_FP4_GEMM:
  -     self.backend = "flashinfer-cudnn"
  + if envs.VLLM_USE_CUDNN_FP4_GEMM:
  +     assert has_flashinfer(), "CUDNN FP4 GEMM requires FlashInfer"
  +     self.backend = "flashinfer-cudnn"
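Standalone, the suggested fail-fast pattern might look like the sketch below; `has_flashinfer` is a stand-in probe here, not vLLM's actual helper, and the backend names follow the snippet above.

```python
# Sketch of the reviewer's fail-fast suggestion (stand-in helper, not vLLM code).
import importlib.util

def has_flashinfer() -> bool:
    return importlib.util.find_spec("flashinfer") is not None

def select_fp4_backend(use_cudnn: bool) -> str:
    if use_cudnn:
        # Assert up front so a missing dependency fails with a clear
        # message instead of an ImportError later in apply().
        assert has_flashinfer(), "CUDNN FP4 GEMM requires FlashInfer"
        return "flashinfer-cudnn"
    return "cutlass"

print(select_fp4_backend(False))  # cutlass
```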

@mergify

mergify bot commented Oct 6, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kaixih.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 6, 2025
@kaixih kaixih force-pushed the add_flashinfer_cudnn_fp4_gemm branch from 6080885 to 7601836 on October 6, 2025 20:24
@mergify mergify bot removed the needs-rebase label Oct 6, 2025
@kaixih kaixih force-pushed the add_flashinfer_cudnn_fp4_gemm branch 4 times, most recently from 6199104 to edd06d3 on October 6, 2025 21:15
@kaixih kaixih force-pushed the add_flashinfer_cudnn_fp4_gemm branch from edd06d3 to 4733d7c on October 6, 2025 22:18
@mergify

mergify bot commented Oct 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kaixih.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 8, 2025
@mergify mergify bot removed the needs-rebase label Oct 14, 2025
mgoin added 2 commits October 14, 2025 18:06
Signed-off-by: mgoin <[email protected]>
Signed-off-by: mgoin <[email protected]>
@mgoin mgoin added kernel ready ONLY add when PR is ready to merge/full CI is needed labels Oct 14, 2025
Collaborator

@pavanimajety pavanimajety left a comment


Thanks @kaixih and @mgoin, PR looks good to me and all the checks have passed

@mgoin mgoin merged commit de92d91 into vllm-project:main Oct 15, 2025
53 checks passed