
Conversation

@benchislett (Collaborator) commented Nov 14, 2025

Purpose

Bugfix for incorrect outputs on Llama 4 caused by the refactor in #28479: ChunkedLocalAttentionBuilder incorrectly inherits FlashInfer's get_cudagraph_support method, causing _cudagraph_support to be ignored and CUDA graphs to always be used.
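
For context, here is a minimal, self-contained sketch of the failure mode and of the override described above. The class bodies are simplified stand-ins rather than the actual vLLM source; only the names AttentionCGSupport, _cudagraph_support, and get_cudagraph_support are taken from the PR description.

```python
from enum import Enum, auto


class AttentionCGSupport(Enum):
    # Simplified stand-in for vLLM's CUDA-graph support levels.
    NEVER = auto()
    ALWAYS = auto()


class FlashInferBuilder:
    # Underlying backend: reports CUDA-graph support from its own
    # classmethod rather than consulting _cudagraph_support.
    _cudagraph_support = AttentionCGSupport.ALWAYS

    @classmethod
    def get_cudagraph_support(cls) -> AttentionCGSupport:
        return AttentionCGSupport.ALWAYS


class ChunkedLocalAttentionBuilder(FlashInferBuilder):
    # Bug: setting only the class attribute is not enough, because the
    # inherited get_cudagraph_support above still reports ALWAYS.
    _cudagraph_support = AttentionCGSupport.NEVER

    @classmethod
    def get_cudagraph_support(cls) -> AttentionCGSupport:
        # Fix: override explicitly so CUDA graphs are never enabled for
        # chunked local attention.
        return AttentionCGSupport.NEVER


assert ChunkedLocalAttentionBuilder.get_cudagraph_support() is AttentionCGSupport.NEVER
```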

Test Plan

Breaking command:

python examples/offline_inference/basic/generate.py --model=nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 --kv-cache-dtype=fp8 --max-model-len=1024

Test Result

The command above now works. There is ongoing discussion about making ChunkedLocalAttention compatible with CUDA graphs, but for now its support level should stay at NEVER to avoid errors.

FIX #28604

Signed-off-by: Benjamin Chislett <[email protected]>
@benchislett benchislett self-assigned this Nov 14, 2025
@benchislett benchislett added the bug (Something isn't working) and llama (Related to Llama models) labels Nov 14, 2025
@mergify mergify bot added the nvidia label Nov 14, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request provides a crucial bugfix for ChunkedLocalAttention. The root cause, an incorrect inheritance of get_cudagraph_support from the underlying attention backend's builder, led to CUDA graphs being improperly enabled, which in turn caused incorrect model outputs. The proposed solution overrides get_cudagraph_support to always return AttentionCGSupport.NEVER, disabling CUDA graphs for this attention mechanism as intended. The change is well-targeted and prevents a critical correctness issue, and the added issubclass assertion is a good defensive measure. The implementation is clean and I have no further suggestions for improvement.
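
The issubclass assertion mentioned above could look roughly like the following; all class names here are illustrative stand-ins, not the exact vLLM types checked in the PR.

```python
class AttentionMetadataBuilder:
    """Illustrative stand-in for the base metadata-builder interface."""


class ChunkedLocalAttentionBuilder(AttentionMetadataBuilder):
    """Illustrative stand-in for the builder being wired up by the fix."""


# Defensive measure in the spirit of the review: fail loudly if the builder
# is not actually a metadata builder, instead of silently inheriting the
# wrong get_cudagraph_support behaviour at runtime.
assert issubclass(ChunkedLocalAttentionBuilder, AttentionMetadataBuilder), (
    "ChunkedLocalAttentionBuilder must subclass AttentionMetadataBuilder"
)
```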

@mgoin mgoin added the ready (ONLY add when PR is ready to merge/full CI is needed) label Nov 14, 2025
@mgoin (Member) left a comment

LGTM as an immediate fix, thank you

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Nov 14, 2025
@mgoin mgoin enabled auto-merge (squash) November 14, 2025 18:53
@LucasWilkinson (Collaborator) left a comment

LGTM; thanks for the quick fix!

@vllm-bot vllm-bot merged commit bf3ffb6 into vllm-project:main Nov 14, 2025
46 of 51 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Nov 14, 2025
geodavic pushed a commit to geodavic/vllm that referenced this pull request Nov 16, 2025
bwasti pushed a commit to bwasti/vllm that referenced this pull request Nov 17, 2025
@ProExpertProg (Collaborator) commented

@benchislett does this fix mean we can re-enable llama4 E2E fusion tests using the FI attention backend on Blackwell?

@benchislett (Collaborator, Author) commented

I'm not sure about the overall status of Llama4 support, but this fix should definitely clear up a major blocker for Blackwell. It seems worth trying out to see where things stand. CC @pavanimajety @xinli-sw, who might have some context?

bringlein pushed a commit to bringlein/vllm that referenced this pull request Nov 26, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025

Development

Successfully merging this pull request may close these issues.

[Bug]: Llama4 on B200 flashinfer produces garbage

5 participants