Conversation

Collaborator

@benchislett benchislett commented Nov 11, 2025

Purpose

Revised implementation of #26937

This PR makes _cudagraph_support a private member and adds get_cudagraph_support(vllm_config, kv_cache_spec) as the accessor. It also updates _check_and_update_cudagraph_mode to consider support per-backend and per-KV-cache group.

The TRTLLM-gen kernels support full cuda graphs, but they are only used with FlashInfer on Blackwell under certain conditions.
It might not be safe to change FlashInfer's cudagraph_support to UNIFORM_BATCH unconditionally, but we can still set it when we know the TRTLLM-gen backend will be used.

This PR also updates the docs to reflect FlashInfer's cuda graph compatibility, and fills in the missing entry for FlashInferMLA.

FIX #26856
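
For readers outside the diff, here is a minimal sketch of the idea behind the new hook. It is illustrative only: the real method takes (vllm_config, kv_cache_spec), the AttentionCGSupport stand-in below is simplified, and the conditions gating the TRTLLM-gen decode path are more involved than the two booleans used here.

```python
from enum import IntEnum


class AttentionCGSupport(IntEnum):
    # Simplified stand-in for vLLM's cudagraph-support levels; member names
    # follow the PR discussion, the real enum may differ.
    NEVER = 0
    UNIFORM_SINGLE_TOKEN_DECODE = 1
    UNIFORM_BATCH = 2
    ALWAYS = 3


def get_cudagraph_support(is_blackwell: bool,
                          trtllm_decode_will_be_used: bool) -> AttentionCGSupport:
    """Hypothetical per-config support hook for a FlashInfer-like builder.

    Instead of reading a static class attribute, the scheduler asks the
    metadata builder what it can support for this configuration. Report
    UNIFORM_BATCH (full cuda graphs for uniform batches, e.g. spec decode)
    only when the TRTLLM-gen decode kernels are known to be used, and stay
    conservative otherwise.
    """
    if is_blackwell and trtllm_decode_will_be_used:
        return AttentionCGSupport.UNIFORM_BATCH
    return AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE


# Example: on Blackwell with the TRTLLM-gen decode path active, the backend
# can advertise UNIFORM_BATCH support instead of relying on a fixed attribute.
print(get_cudagraph_support(is_blackwell=True, trtllm_decode_will_be_used=True))
```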

Test Plan

See #26937 for functional correctness testing / benchmarking. Rerunning on this branch gives the same results.

Local test run passes for tests/v1/attention.

Signed-off-by: Benjamin Chislett <[email protected]>
Signed-off-by: Benjamin Chislett <[email protected]>

mergify bot commented Nov 11, 2025

Documentation preview: https://vllm--28479.org.readthedocs.build/en/28479/

@mergify mergify bot added documentation Improvements or additions to documentation nvidia rocm Related to AMD ROCm v1 labels Nov 11, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a well-designed refactoring to enable more flexible and dynamic CUDA graph support for attention backends. By making cudagraph_support a private member and introducing a new get_cudagraph_support method, the code now determines CUDA graph capability dynamically on a per-backend, per-KV-group basis. This change is crucial for enabling full CUDA graph support for speculative decoding with FlashInfer on specific hardware like Blackwell. The updates to _check_and_update_cudagraph_mode and the corresponding documentation changes are clear and correct. Overall, this is a solid performance enhancement with a clean implementation.

Collaborator

@LucasWilkinson LucasWilkinson left a comment

LGTM

I'd like to work towards reverting #27427 (and moving back to this being an instance property) in the future, but we need broader cudagraph refactors to get there.

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Nov 11, 2025
@benchislett benchislett added ready ONLY add when PR is ready to merge/full CI is needed and removed rocm Related to AMD ROCm labels Nov 11, 2025
@mergify mergify bot added the rocm Related to AMD ROCm label Nov 11, 2025
Collaborator

@vadiklyutiy vadiklyutiy left a comment

Previously, we used use_trtllm_attention for checking both prefill and decode.
Right now it seems use_trtllm_attention is used for the prefill check only, alongside can_use_trtllm_attention.

Could we refactor:

  • use proper names like use_trtllm_prefill_attn and use_trtllm_decode_attn (sketched below)
  • remove the decode-case handling from use_trtllm_attention

Maybe it's worth doing in a separate PR.
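
A hypothetical sketch of the split suggested above. The function names are the reviewer's proposal, and the conditions are placeholders rather than vLLM's actual criteria:

```python
def use_trtllm_prefill_attn(num_prefill_tokens: int, has_sinks: bool) -> bool:
    # Placeholder prefill-only check; the real criteria live in vLLM's
    # FlashInfer utilities and are more involved.
    return num_prefill_tokens > 0 and not has_sinks


def use_trtllm_decode_attn(max_seq_len: int, uniform_batch: bool) -> bool:
    # Placeholder decode-only check, separated out so cudagraph support can
    # key off the decode path alone.
    return uniform_batch and max_seq_len <= 128 * 1024
```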

Collaborator

@vadiklyutiy vadiklyutiy left a comment

One more style point: is there some reason to keep [_]cudagraph_support and get_cudagraph_support in the *MetaBuilder classes? Maybe *Backend(AttentionBackend) would be a better place?

@benchislett
Collaborator Author

@vadiklyutiy

  • Currently, we have to force the use of TRTLLM attention for decodes if it is supported, so that we can statically decide whether the backend can support UNIFORM_BATCH cuda graphs or not. Ideally, we would only lock in to that decision when we know that UNIFORM_BATCH is actually being used, and otherwise we would switch dynamically. So I am hoping to keep use_trtllm_attention untouched so that we can re-enable it in the near future.
  • I don't know exactly why it's defined on the builder and not the backend, but this seems like a reasonable convention for now. I expect it will be refactored separately in the future. These interfaces are often undergoing changes, so I am trying to keep this PR's diff minimal until they become more cemented.
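
For context on how the reported levels could feed into _check_and_update_cudagraph_mode, here is a rough illustration (not the actual vLLM code) of resolving a single level across KV-cache groups; it reuses the AttentionCGSupport stand-in from the sketch in the PR description:

```python
def resolve_cudagraph_support(
        per_group_support: list[AttentionCGSupport]) -> AttentionCGSupport:
    # A captured graph must be valid for every attention backend involved,
    # so the runner can only commit to the weakest level reported across
    # all KV-cache groups.
    if not per_group_support:
        return AttentionCGSupport.ALWAYS  # no attention layers: unconstrained
    return min(per_group_support)


# Example: one group reporting UNIFORM_BATCH and another ALWAYS resolves to
# UNIFORM_BATCH for the whole model.
print(resolve_cudagraph_support(
    [AttentionCGSupport.UNIFORM_BATCH, AttentionCGSupport.ALWAYS]))
```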

@LucasWilkinson LucasWilkinson merged commit 3044195 into vllm-project:main Nov 12, 2025
59 of 60 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Nov 12, 2025
geodavic pushed a commit to geodavic/vllm that referenced this pull request Nov 16, 2025
… decoding with FlashInfer (vllm-project#28479)

Signed-off-by: Benjamin Chislett <[email protected]>
Signed-off-by: George D. Torres <[email protected]>
bwasti pushed a commit to bwasti/vllm that referenced this pull request Nov 17, 2025
… decoding with FlashInfer (vllm-project#28479)

Signed-off-by: Benjamin Chislett <[email protected]>
Signed-off-by: Bram Wasti <[email protected]>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025