Prefer FlashAttention MLA as default over FlashMLA #27363
Conversation
💡 Codex Review
vllm/vllm/config/multimodal.py, lines 160 to 176 in c8a0a46:

```python
@field_validator("mm_encoder_attn_backend", mode="before")
@classmethod
def _validate_mm_encoder_attn_backend(cls, value: object) -> _Backend | None:
    from vllm.attention.backends.registry import (
        _Backend as BackendEnum,
    )
    from vllm.attention.backends.registry import (
        backend_name_to_enum,
    )

    if value is None or isinstance(value, BackendEnum):
        return value
    if isinstance(value, str):
        candidate = backend_name_to_enum(value.upper())
        if candidate is not None:
            return candidate
```
The commit removes backend_name_to_enum from the attention backend registry, but MultiModalConfig._validate_mm_encoder_attn_backend still imports and calls it. Importing vllm.config.multimodal now raises ImportError: cannot import name 'backend_name_to_enum', preventing multimodal configurations from even loading. Either reintroduce the helper or update the validator to use _Backend[...] directly.
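A minimal drop-in sketch of the second option for the validator shown above, resolving the name through the enum itself; the trailing `ValueError` is illustrative, not necessarily the existing behavior:

```python
@field_validator("mm_encoder_attn_backend", mode="before")
@classmethod
def _validate_mm_encoder_attn_backend(cls, value: object) -> "_Backend | None":
    # Resolve via the enum directly instead of the removed backend_name_to_enum helper.
    from vllm.attention.backends.registry import _Backend as BackendEnum

    if value is None or isinstance(value, BackendEnum):
        return value
    if isinstance(value, str):
        try:
            # _Backend[...] raises KeyError for unknown backend names.
            return BackendEnum[value.upper()]
        except KeyError:
            pass
    raise ValueError(f"Invalid mm_encoder_attn_backend value: {value!r}")
```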
Code Review
This pull request introduces a significant and well-executed refactoring of the attention backend selection logic. The new implementation is more structured, robust, and maintainable by centralizing configuration validation within the backend classes themselves and using a priority-based selection mechanism. This also correctly implements the preference for FlashAttention MLA over FlashMLA on Hopper GPUs as intended. The changes are extensive and touch many parts of the attention subsystem, but they represent a clear improvement in design. I've identified a critical issue in the implementation for FlashAttentionBackend and FlashAttnMLABackend that could lead to a runtime crash, and have provided suggestions for a fix.
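For readers unfamiliar with the pattern the review describes, here is a hypothetical sketch of backend-centralized validation plus priority-based selection. The class names, `validate_configuration` hook, and the block-size/dtype checks are illustrative assumptions, not the PR's actual API:

```python
from dataclasses import dataclass


@dataclass
class AttentionSelectorConfig:
    use_mla: bool
    dtype: str
    block_size: int


class FlashAttnMLABackend:
    @classmethod
    def validate_configuration(cls, cfg: AttentionSelectorConfig) -> list[str]:
        """Return reasons this backend cannot serve cfg (empty list = usable)."""
        reasons = []
        if not cfg.use_mla:
            reasons.append("requires an MLA model")
        if cfg.dtype not in ("bfloat16", "float16"):
            reasons.append(f"unsupported dtype: {cfg.dtype}")
        return reasons


class FlashMLABackend:
    @classmethod
    def validate_configuration(cls, cfg: AttentionSelectorConfig) -> list[str]:
        reasons = []
        if not cfg.use_mla:
            reasons.append("requires an MLA model")
        if cfg.block_size != 64:
            reasons.append("requires block_size=64")
        return reasons


def select_backend(priority, cfg):
    """Walk the priority list and return the first backend with no objections."""
    for backend in priority:
        if not backend.validate_configuration(cfg):
            return backend
    raise RuntimeError("no attention backend supports this configuration")


cfg = AttentionSelectorConfig(use_mla=True, dtype="bfloat16", block_size=16)
print(select_backend([FlashAttnMLABackend, FlashMLABackend], cfg).__name__)
# -> FlashAttnMLABackend (FlashMLA is skipped because block_size != 64)
```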
Force-pushed from c8a0a46 to db96fd6.
This pull request has merge conflicts that must be resolved before it can be merged.
Related issue: #27787
Signed-off-by: Matthew Bonanni <[email protected]>
Force-pushed from db96fd6 to a30f40e.
LucasWilkinson left a comment:
Thanks!
```diff
-        AttentionBackendEnum.FLASHMLA,
+        AttentionBackendEnum.FLASH_ATTN_MLA,
+        AttentionBackendEnum.FLASHMLA,
```
Just with respect to Blackwell, do we actually know how flashmla and fa mla work on there? I'm curious if fa mla is even a valid option
I actually marked FA MLA as supporting only CC 9.x, so it'll get skipped on Blackwell regardless of its spot in the priority list. It could probably just be removed from the list, but I'll test it out anyway to check. FlashMLA supports Blackwell but I haven't benchmarked it
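To make the gating concrete, a hedged sketch (hypothetical names, not the PR's actual API) of why a CC 9.x-only backend falls out of the running on Blackwell no matter where it sits in the priority list:

```python
# Hypothetical sketch: the compute-capability check dominates priority order.

def supports_fa_mla(compute_capability: tuple[int, int]) -> bool:
    # FA MLA marked as Hopper-only (CC 9.x) in this sketch.
    return compute_capability[0] == 9


def supports_flashmla(compute_capability: tuple[int, int]) -> bool:
    # FlashMLA supports Hopper and Blackwell in this sketch.
    return compute_capability[0] in (9, 10)


# Priority order after this PR: FA MLA first, FlashMLA second.
PRIORITY = [("FLASH_ATTN_MLA", supports_fa_mla), ("FLASHMLA", supports_flashmla)]


def select(compute_capability: tuple[int, int]) -> str:
    for name, supported in PRIORITY:
        if supported(compute_capability):
            return name
    raise RuntimeError("no supported MLA backend")


print(select((9, 0)))   # Hopper    -> FLASH_ATTN_MLA
print(select((10, 0)))  # Blackwell -> FLASHMLA (FA MLA skipped by the CC check)
```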
Signed-off-by: Matthew Bonanni <[email protected]>
Signed-off-by: Matthew Bonanni <[email protected]>
Signed-off-by: Matthew Bonanni <[email protected]>
Signed-off-by: Xingyu Liu <[email protected]>
Purpose
Based on benchmarking in #26835, FlashAttention MLA is faster than FlashMLA across head counts, batch sizes, and query lengths on Hopper. This PR sets it as the preferred backend.
This PR is based on #24794; that should be merged first.
Note: the results shown below include the fix from #27368; FlashMLA performance is worse prior to that bugfix.
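For users who want to compare backends or keep the previous default, a hedged usage sketch of forcing FlashMLA via vLLM's existing VLLM_ATTENTION_BACKEND override; this assumes the override and the "FLASHMLA" backend name are unchanged by this PR, and the model name is just an example MLA model:

```python
import os

# Force the FlashMLA backend; set before importing/constructing the engine so
# the backend selector sees it. (Assumes the existing VLLM_ATTENTION_BACKEND
# override still applies after this PR.)
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHMLA"

from vllm import LLM

llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite", trust_remote_code=True)
print(llm.generate("Hello")[0].outputs[0].text)
```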
Test Plan
Test Result