[NVIDIA] Add support for cudnn fp4 gemm via flashinfer#26107
mgoin merged 5 commits into vllm-project:main
Conversation
Code Review
This pull request adds support for cuDNN FP4 GEMM via FlashInfer, which is a great addition for performance on NVIDIA GPUs. The changes are well-contained and primarily involve adding a new environment variable and the corresponding logic to select the cuDNN backend in ModelOptNvFp4LinearMethod.
My review focuses on ensuring the robustness of this new feature. I've identified a critical issue where a missing dependency check could lead to an unhandled runtime error. Adding an explicit check will improve user experience by providing a clear error message if the environment is not set up correctly.
Overall, the implementation is straightforward and the performance gains are a welcome improvement.
    if envs.VLLM_USE_CUDNN_FP4_GEMM:
        self.backend = "flashinfer-cudnn"
To ensure robustness and provide clear error messages, it's important to verify that flashinfer is installed when VLLM_USE_CUDNN_FP4_GEMM is enabled. The current implementation will lead to an ImportError later in the apply method if flashinfer is missing. Adding an assertion here will fail fast with a more informative message, which is consistent with the existing check for VLLM_USE_TRTLLM_FP4_GEMM.
Suggested change:
-    if envs.VLLM_USE_CUDNN_FP4_GEMM:
-        self.backend = "flashinfer-cudnn"
+    if envs.VLLM_USE_CUDNN_FP4_GEMM:
+        assert has_flashinfer(), "CUDNN FP4 GEMM requires FlashInfer"
+        self.backend = "flashinfer-cudnn"
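As a standalone illustration of the fail-fast pattern being suggested (the helper below is hypothetical, not vLLM's actual code; vLLM's real logic lives in ModelOptNvFp4LinearMethod and reads envs.VLLM_USE_CUDNN_FP4_GEMM, and the "cutlass" default is assumed for illustration):

```python
def select_fp4_backend(use_cudnn_fp4: bool, flashinfer_installed: bool) -> str:
    """Pick the FP4 GEMM backend, failing fast when FlashInfer is missing.

    Hypothetical sketch mirroring the suggested change above.
    """
    if use_cudnn_fp4:
        # Fail fast with a clear message instead of hitting an
        # ImportError later in the apply() method.
        assert flashinfer_installed, "CUDNN FP4 GEMM requires FlashInfer"
        return "flashinfer-cudnn"
    # Fallback backend name is an assumption for this sketch.
    return "cutlass"
```

With this shape, `select_fp4_backend(True, False)` raises an AssertionError at construction time rather than deferring the failure to the first forward pass.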
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: mgoin <[email protected]>
…26107) Signed-off-by: kaixih <[email protected]> Signed-off-by: mgoin <[email protected]> Co-authored-by: mgoin <[email protected]> Signed-off-by: Alberto Perdomo <[email protected]>
…26107) Signed-off-by: kaixih <[email protected]> Signed-off-by: mgoin <[email protected]> Co-authored-by: mgoin <[email protected]>
…26107) Signed-off-by: kaixih <[email protected]> Signed-off-by: mgoin <[email protected]> Co-authored-by: mgoin <[email protected]> Signed-off-by: 0xrushi <[email protected]>
Purpose
Add support for cuDNN FP4 GEMM through FlashInfer.
This operator is primarily used in the shared expert and output projection layers (currently available only in nvidia/DeepSeek-R1-FP4-v2).
Preliminary latency benchmarks indicate an ~1% end-to-end performance improvement.
To enable this feature, users need to install the required dependencies and set the environment variable VLLM_USE_CUDNN_FP4_GEMM.
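For example, assuming FlashInfer is installed (consult FlashInfer's own install instructions for the wheel matching your CUDA version), the new path can be toggled with the environment variable this PR adds:

```shell
# Opt in to the cuDNN FP4 GEMM path (flag added by this PR)
export VLLM_USE_CUDNN_FP4_GEMM=1
```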
Test Plan
Accuracy:
Benchmark:
Test Result
Accuracy:
Benchmark:
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.