[Bugfix] Fix ChunkedLocalAttention CUDA Graph setting #28739
Conversation
Signed-off-by: Benjamin Chislett <[email protected]>
Code Review
This pull request provides a crucial bugfix for ChunkedLocalAttention. The root cause, an incorrect inheritance of get_cudagraph_support from the underlying attention backend's builder, led to CUDA graphs being improperly enabled and causing incorrect model outputs. The proposed solution correctly overrides the get_cudagraph_support method to always return AttentionCGSupport.NEVER, effectively disabling CUDA graphs for this attention mechanism as intended. This change is robust, well-targeted, and prevents a critical correctness issue. The addition of an issubclass assertion is a good defensive measure. The implementation is clean and I have no further suggestions for improvement.
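A minimal sketch of the shape of the fix described above, assuming the chunked-local builder is produced by dynamically subclassing the underlying backend's metadata builder; the factory name, module paths, and exact method signature here are illustrative, not the precise vLLM identifiers:

```python
# Hedged sketch only; the real class names, module paths, and the exact
# signature of get_cudagraph_support in vLLM may differ.
from vllm.v1.attention.backends.utils import (AttentionCGSupport,
                                              AttentionMetadataBuilder)


def make_chunked_local_attention_builder(underlying_builder_cls):
    # Defensive check noted in the review: the wrapped class must really
    # be an attention metadata builder.
    assert issubclass(underlying_builder_cls, AttentionMetadataBuilder)

    class ChunkedLocalAttentionBuilder(underlying_builder_cls):
        @classmethod
        def get_cudagraph_support(cls, vllm_config, kv_cache_spec):
            # Override the implementation inherited from the underlying
            # backend (e.g. FlashInfer), which reported CUDA graph support;
            # chunked local attention must keep CUDA graphs disabled for now.
            return AttentionCGSupport.NEVER

    return ChunkedLocalAttentionBuilder
```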
mgoin
left a comment
LGTM as an immediate fix, thank you
LucasWilkinson
left a comment
LGTM; thanks for the quick fix!
@benchislett does this fix mean we can re-enable llama4 E2E fusion tests using the FI attention backend on Blackwell?
I'm not sure about the overall status of Llama4 support, but this fix should definitely clear up a major blocker for Blackwell. Seems like it's worth trying it out to see where it's at. CC @pavanimajety @xinli-sw who might have some context?
Purpose
Bugfix for incorrect outputs on llama4 caused by the refactor in #28479: ChunkedLocalAttentionBuilder incorrectly inherits FlashInfer's get_cudagraph_support method, so the _cudagraph_support attribute it sets is ignored and CUDA graphs are always used. A simplified illustration of the pitfall is sketched below.
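A self-contained, simplified illustration of the inheritance pitfall (plain Python stand-ins, not vLLM code): setting a class attribute on the subclass has no effect when the inherited classmethod never consults it.

```python
# Simplified illustration only; class names and the enum are stand-ins for
# the real vLLM builder classes and AttentionCGSupport.
from enum import Enum


class CGSupport(Enum):
    NEVER = 0
    ALWAYS = 1


class FlashInferLikeBuilder:
    @classmethod
    def get_cudagraph_support(cls, *args):
        # The underlying backend decides support itself and never looks at
        # cls._cudagraph_support.
        return CGSupport.ALWAYS


class ChunkedLocalAttentionBuilder(FlashInferLikeBuilder):
    # Before the fix, only this attribute was set; the inherited classmethod
    # above ignores it, so CUDA graphs stayed enabled.
    _cudagraph_support = CGSupport.NEVER


print(ChunkedLocalAttentionBuilder.get_cudagraph_support())  # CGSupport.ALWAYS
```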
Test Plan
Breaking command:
Test Result
The breaking command now works. There is ongoing discussion about making ChunkedLocalAttention compatible with CUDA graphs, but for now support should stay as NEVER to avoid incorrect outputs.
FIX #28604