[CustomOp][MM] Extract MMEncoderAttention as CustomOp and replace the backend of QwenVisionAttention with it. #30125
Conversation
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: tjtanaa <[email protected]>
Signed-off-by: shen-shanshan <[email protected]>
Code Review
This pull request is a good step towards refactoring the attention mechanisms and making the codebase more modular by introducing MMEncoderAttention as a CustomOp. The unification of the vision attention backend logic is also a welcome improvement.
I've found a critical bug in vllm/attention/layer.py where a variable was renamed but not all its usages were updated, which would cause a runtime error. I've also pointed out an instance of code duplication in the new mm_encoder_attention.py file that should be addressed to improve maintainability.
Once these issues are resolved, this PR will be a solid contribution to the project's architecture.
```python
self._flash_attn_varlen_func = maybe_get_vit_flash_attn_backend(
    self.attn_backend,
)
```
There seems to be a typo here. The __init__ method now sets self.backend, but self.attn_backend is used here, which is no longer defined. This will lead to an AttributeError at runtime. You should probably use self.backend instead.
Suggested change:

```diff
-self._flash_attn_varlen_func = maybe_get_vit_flash_attn_backend(
-    self.attn_backend,
-)
+self._flash_attn_varlen_func = maybe_get_vit_flash_attn_backend(
+    self.backend,
+)
```
```python
def maybe_get_vit_flash_attn_backend(
    attn_backend: AttentionBackendEnum | None,
) -> Callable | None:
    # At this point we already have the attn_backend; overriding logic is done
    # in the platform-specific implementation, so we don't need to override the
    # backend here. Just return flash_attn_varlen_func.
    if (
        attn_backend == AttentionBackendEnum.FLASH_ATTN
        and current_platform.is_cuda_alike()
    ):
        from flash_attn import flash_attn_varlen_func
    elif attn_backend == AttentionBackendEnum.FLASH_ATTN and current_platform.is_xpu():
        from vllm.attention.utils.fa_utils import flash_attn_varlen_func
    elif attn_backend == AttentionBackendEnum.ROCM_AITER_FA:
        from aiter import flash_attn_varlen_func
    else:
        flash_attn_varlen_func = None

    # If attn_backend is TORCH_SDPA, it will reach here and
    # flash_attn_varlen_func will be None.
    return flash_attn_varlen_func
```
This function maybe_get_vit_flash_attn_backend is a duplicate of the one defined in vllm/attention/layer.py. To avoid code duplication and improve maintainability, it would be better to import it from vllm.attention.layer instead of redefining it here. This would also ensure that any future changes to this utility function are automatically reflected here.
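For instance, assuming the canonical definition stays in vllm/attention/layer.py, the duplicated helper could simply be imported (a sketch, not necessarily the final import path):

```python
# Reuse the existing helper instead of redefining it here
# (assumes it remains exported from vllm.attention.layer).
from vllm.attention.layer import maybe_get_vit_flash_attn_backend
```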
💡 Codex Review
vllm/vllm/model_executor/models/paddleocr_vl.py, lines 612 to 616 in a995c14:

```python
self.attn_backend = attn_backend
self.attn_backend, self.flash_attn_varlen_func = (
    maybe_get_vit_flash_attn_backend(
        self.attn_backend,
        attn_backend_override=attn_backend_override,
```
maybe_get_vit_flash_attn_backend now only accepts the backend and returns a single function, but the PaddleOCR vision attention still unpacks two return values and passes attn_backend_override. Instantiating this module will now raise TypeError: maybe_get_vit_flash_attn_backend() got an unexpected keyword argument 'attn_backend_override', preventing the model from loading.
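A possible fix, sketched under the assumption that the new helper takes only the backend and returns only the varlen function (as in the snippet quoted above), is to update the PaddleOCR call site accordingly:

```python
# Sketch of an updated call site in paddleocr_vl.py (illustrative only).
# Old call: unpacked (backend, func) and passed attn_backend_override.
# New call: pass the already-resolved backend, receive only the function.
self.attn_backend = attn_backend
self.flash_attn_varlen_func = maybe_get_vit_flash_attn_backend(self.attn_backend)
```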
```python
self._flash_attn_varlen_func = maybe_get_vit_flash_attn_backend(
    self.attn_backend,
```
MultiHeadAttention uses undefined attn_backend
The multimodal MultiHeadAttention now stores the selected backend in self.backend, but immediately passes self.attn_backend to maybe_get_vit_flash_attn_backend and the subsequent checks. Because self.attn_backend is never initialized, constructing this module raises AttributeError before any forward call, breaking all multimodal encoder attention layers.
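A minimal sketch of the implied fix, assuming the attribute set in `__init__` is `self.backend`; the real layer.py code may differ:

```python
# Sketch (illustrative): read the attribute that __init__ actually sets, and
# make sure any subsequent backend checks also use self.backend rather than
# the removed self.attn_backend.
self._flash_attn_varlen_func = maybe_get_vit_flash_attn_backend(self.backend)
```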
Purpose
To avoid maintaining a variety of modeling files in vllm-ascend, we propose to remove all files in the `models` dir in vllm-ascend. After that, the only thing a vLLM plugin needs to do when adding a new model is register its custom device-specific OOT ops with vLLM. To achieve this, some refactoring needs to be done in both vllm and vllm-ascend, such as extracting some general layers as `CustomOp`; see vllm-project/vllm-ascend#4084 for details.

Following #27919 and #27147, this PR unifies the logic for selecting the `vit_attn_backend` and extracts `MMEncoderAttention` as a `CustomOp`. Specifically, the vision attention backend should only be checked and overridden in the platform-specific implementation; we should not override this logic anywhere else, such as in `model_executor/models/<model_name>.py`. In addition, I have moved the scattered forward dispatch logic into this `CustomOp`, so that no other place needs to check `current_platform`.
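For context, here is a minimal sketch of what such a CustomOp could look like, assuming vLLM's usual `CustomOp` registration and per-platform `forward_*` dispatch hooks; the registration name, constructor arguments, and method bodies are illustrative rather than the exact code in this PR:

```python
# Illustrative sketch only -- not the exact implementation in this PR.
import torch

from vllm.model_executor.custom_op import CustomOp


@CustomOp.register("mm_encoder_attention")  # hypothetical registration name
class MMEncoderAttention(CustomOp):
    """Multimodal (ViT) encoder attention with per-platform dispatch."""

    def __init__(self, num_heads: int, head_size: int, scale: float) -> None:
        super().__init__()
        self.num_heads = num_heads
        self.head_size = head_size
        self.scale = scale
        # The attention backend is resolved once by the platform-specific
        # logic; modeling files never check current_platform themselves.

    def forward_native(self, q: torch.Tensor, k: torch.Tensor,
                       v: torch.Tensor) -> torch.Tensor:
        # Portable fallback path, e.g. torch SDPA.
        return torch.nn.functional.scaled_dot_product_attention(
            q, k, v, scale=self.scale
        )

    def forward_cuda(self, q: torch.Tensor, k: torch.Tensor,
                     v: torch.Tensor) -> torch.Tensor:
        # CUDA/ROCm path, e.g. a flash-attention varlen kernel when available;
        # shown here as delegating to the native path for brevity.
        return self.forward_native(q, k, v)

    # Out-of-tree platforms (e.g. vllm-ascend) plug in their device-specific
    # implementation via forward_oot.
```

The key point is that the choice between `forward_native`, `forward_cuda`, and `forward_oot` is handled by the CustomOp machinery, so device plugins only need to register their own implementation.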
To minimize the impact, I have only replaced the backend of `QwenVisionAttention` with this CustomOp, and have tested this PR both on an Ascend A2 NPU and an NVIDIA A100 GPU (TODO). I will update the other modeling files and delete the old `MultiHeadAttention` in a follow-up once this PR is merged.

Test Plan
Test Result
✅ Ascend A2 NPU
Run Qwen2.5-VL:
Output:
Run Qwen3-VL:
Output:
NVIDIA A100 GPU
TO BE DONE...
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.
Signed-off-by: shen-shanshan [email protected]
Co-authored-by: Isotr0py [email protected]
Co-authored-by: tjtanaa [email protected]