[Bugfix] Fix ChunkedLocalAttention CUDA Graph setting #28739
Conversation
Signed-off-by: Benjamin Chislett <[email protected]>
Code Review
This pull request provides a crucial bugfix for ChunkedLocalAttention. The root cause, an incorrect inheritance of get_cudagraph_support from the underlying attention backend's builder, led to CUDA graphs being improperly enabled and causing incorrect model outputs. The proposed solution correctly overrides the get_cudagraph_support method to always return AttentionCGSupport.NEVER, effectively disabling CUDA graphs for this attention mechanism as intended. This change is robust, well-targeted, and prevents a critical correctness issue. The addition of an issubclass assertion is a good defensive measure. The implementation is clean and I have no further suggestions for improvement.
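A minimal sketch of the shape of the fix described above, assuming the chunked-local builder is produced by dynamically subclassing the underlying backend's metadata builder; the factory name, module paths, and exact method signature here are illustrative, not the precise vLLM identifiers:

```python
# Hedged sketch only; the real class names, module paths, and the exact
# signature of get_cudagraph_support in vLLM may differ.
from vllm.v1.attention.backends.utils import (AttentionCGSupport,
                                              AttentionMetadataBuilder)


def make_chunked_local_attention_builder(underlying_builder_cls):
    # Defensive check noted in the review: the wrapped class must really
    # be an attention metadata builder.
    assert issubclass(underlying_builder_cls, AttentionMetadataBuilder)

    class ChunkedLocalAttentionBuilder(underlying_builder_cls):
        @classmethod
        def get_cudagraph_support(cls, vllm_config, kv_cache_spec):
            # Override the implementation inherited from the underlying
            # backend (e.g. FlashInfer), which reported CUDA graph support;
            # chunked local attention must keep CUDA graphs disabled for now.
            return AttentionCGSupport.NEVER

    return ChunkedLocalAttentionBuilder
```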
mgoin
left a comment
LGTM as an immediate fix, thank you
LucasWilkinson
left a comment
LGTM; thanks for the quick fix!
@benchislett does this fix mean we can re-enable llama4 E2E fusion tests using the FI attention backend on Blackwell?
I'm not sure about the overall status of Llama4 support, but this fix should definitely clear up a major blocker for Blackwell. Seems like it's worth trying it out to see where it's at. CC @pavanimajety @xinli-sw who might have some context?
Purpose
Bugfix for incorrect outputs on llama4 caused by the refactor in #28479: ChunkedLocalAttentionBuilder incorrectly inherits FlashInfer's get_cudagraph_support method, so the _cudagraph_support attribute it sets is ignored and CUDA graphs are always used. A simplified illustration of the pitfall is sketched below.
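A self-contained, simplified illustration of the inheritance pitfall (plain Python stand-ins, not vLLM code): setting a class attribute on the subclass has no effect when the inherited classmethod never consults it.

```python
# Simplified illustration only; class names and the enum are stand-ins for
# the real vLLM builder classes and AttentionCGSupport.
from enum import Enum


class CGSupport(Enum):
    NEVER = 0
    ALWAYS = 1


class FlashInferLikeBuilder:
    @classmethod
    def get_cudagraph_support(cls, *args):
        # The underlying backend decides support itself and never looks at
        # cls._cudagraph_support.
        return CGSupport.ALWAYS


class ChunkedLocalAttentionBuilder(FlashInferLikeBuilder):
    # Before the fix, only this attribute was set; the inherited classmethod
    # above ignores it, so CUDA graphs stayed enabled.
    _cudagraph_support = CGSupport.NEVER


print(ChunkedLocalAttentionBuilder.get_cudagraph_support())  # CGSupport.ALWAYS
```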
Test Plan
Breaking command:
Test Result
The breaking command now works. There is ongoing discussion about making ChunkedLocalAttention compatible with CUDA graphs, but for now support should stay as NEVER to avoid incorrect outputs.
FIX #28604