[Perf] Refactor cudagraph_support to enable full CUDA graphs for spec decoding with FlashInfer #28479
Conversation
Signed-off-by: Benjamin Chislett <[email protected]>
Documentation preview: https://vllm--28479.org.readthedocs.build/en/28479/
Code Review
This pull request introduces a well-designed refactoring to enable more flexible and dynamic CUDA graph support for attention backends. By making `cudagraph_support` a private member and introducing a new `get_cudagraph_support` method, the code now determines CUDA graph capability dynamically on a per-backend, per-KV-group basis. This change is crucial for enabling full CUDA graph support for speculative decoding with FlashInfer on specific hardware like Blackwell. The updates to `_check_and_update_cudagraph_mode` and the corresponding documentation changes are clear and correct. Overall, this is a solid performance enhancement with a clean implementation.
LucasWilkinson
left a comment
LGTM
I'd like to work towards reverting #27427 (and moving back to this being an instance property) in the future, but we need broader cudagraph refactors to get there.
vadiklyutiy
left a comment
Previously we used `use_trtllm_attention` to check both prefill and decode. Right now it seems `use_trtllm_attention` is used for checking prefill only, alongside `can_use_trtllm_attention`.
Could we refactor to:
- use proper names like `use_trtllm_prefill_attn` and `use_trtllm_decode_attn`
- remove the decode-case handling from `use_trtllm_attention`

A sketch of the suggested split is shown below. Maybe it's worth doing in a separate PR.
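For illustration, a minimal sketch of the split this comment proposes. The function names follow the suggestion, but the signatures and checks here are hypothetical placeholders, not vLLM's real `use_trtllm_attention` logic.

```python
# Hypothetical sketch of the proposed refactor; signatures and checks are
# placeholders, not the actual vLLM helpers.
def use_trtllm_prefill_attn(num_prefill_tokens: int, on_blackwell: bool) -> bool:
    # Prefill-only decision, so callers no longer reuse a combined helper.
    return on_blackwell and num_prefill_tokens > 0


def use_trtllm_decode_attn(num_decode_tokens: int, on_blackwell: bool) -> bool:
    # Decode-only decision, pulled out of use_trtllm_attention as suggested.
    return on_blackwell and num_decode_tokens > 0
```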
vadiklyutiy
left a comment
One more style question:
Is there a reason to hold `[_]cudagraph_support` and `get_cudagraph_support` in the `*MetadataBuilder` classes? Maybe `*Backend(AttentionBackend)` would be a better place?
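A hedged sketch of the placement being asked about, using hypothetical class names; it only illustrates moving the query onto a backend class, not the actual vLLM class hierarchy.

```python
# Hypothetical illustration only: expose cudagraph support from a *Backend
# class instead of its metadata builder. All names are placeholders.
from enum import Enum, auto


class AttentionCGSupport(Enum):
    NEVER = auto()
    UNIFORM_BATCH = auto()


class SomeAttentionBackend:
    @classmethod
    def get_cudagraph_support(cls, vllm_config, kv_cache_spec) -> AttentionCGSupport:
        # If this lived on the backend, callers could query it without first
        # resolving the backend's metadata-builder class.
        return AttentionCGSupport.UNIFORM_BATCH
```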
… decoding with FlashInfer (vllm-project#28479) Signed-off-by: Benjamin Chislett <[email protected]> Signed-off-by: George D. Torres <[email protected]>
… decoding with FlashInfer (vllm-project#28479) Signed-off-by: Benjamin Chislett <[email protected]> Signed-off-by: Bram Wasti <[email protected]>
… decoding with FlashInfer (vllm-project#28479) Signed-off-by: Benjamin Chislett <[email protected]>
Purpose
Revised implementation of #26937
This PR makes `_cudagraph_support` a private member and adds `get_cudagraph_support(vllm_config, kv_cache_spec)`. It also updates `_check_and_update_cudagraph_mode` to consider support per-backend, per-KV-cache group. TRTLLM-gen kernels support full CUDA graphs, but they are only used with FlashInfer on Blackwell under certain conditions.
It might not be safe to always set FlashInfer's cudagraph_support to UNIFORM_BATCH, but we can still set it when we know the TRTLLM-gen backend will be used.
Also updates the docs to reflect FlashInfer's CUDA graph compatibility and fills in the missing entry for FlashInferMLA.
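As a minimal sketch of the pattern described above (not the actual vLLM implementation): a class-level default stays in a private `_cudagraph_support`, while `get_cudagraph_support(vllm_config, kv_cache_spec)` can upgrade it dynamically. The `trtllm_gen_will_be_used` helper and class names below are hypothetical, standing in for the real Blackwell/TRTLLM eligibility checks.

```python
# Minimal sketch, not vLLM's real code: assumes an AttentionCGSupport enum and
# a hypothetical trtllm_gen_will_be_used() eligibility check.
from enum import Enum, auto


class AttentionCGSupport(Enum):
    NEVER = auto()
    UNIFORM_SINGLE_TOKEN_DECODE = auto()
    UNIFORM_BATCH = auto()
    ALWAYS = auto()


def trtllm_gen_will_be_used(vllm_config, kv_cache_spec) -> bool:
    # Placeholder for the real checks (Blackwell GPU, supported head size,
    # dtype, etc.) that decide whether TRTLLM-gen kernels are selected.
    return False


class AttentionMetadataBuilderSketch:
    # Static default, now private; callers should use get_cudagraph_support().
    _cudagraph_support = AttentionCGSupport.NEVER

    @classmethod
    def get_cudagraph_support(cls, vllm_config, kv_cache_spec) -> AttentionCGSupport:
        # Default behavior: fall back to the static class-level value.
        return cls._cudagraph_support


class FlashInferLikeBuilderSketch(AttentionMetadataBuilderSketch):
    _cudagraph_support = AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE

    @classmethod
    def get_cudagraph_support(cls, vllm_config, kv_cache_spec) -> AttentionCGSupport:
        # Advertise UNIFORM_BATCH (full cudagraph for uniform decode batches)
        # only when we know the TRTLLM-gen kernels will actually be used,
        # since it is not safe to claim that support unconditionally.
        if trtllm_gen_will_be_used(vllm_config, kv_cache_spec):
            return AttentionCGSupport.UNIFORM_BATCH
        return cls._cudagraph_support
```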
FIX #26856
Test Plan
See #26937 for functional correctness testing / benchmarking. Rerunning on this branch gives the same results.
The local test run passes for `tests/v1/attention`.