
Conversation

@shen-shanshan (Contributor) commented Dec 5, 2025

Purpose

To avoid maintaining a variety of modeling files in vllm-ascend, we propose removing all files in the models dir of vllm-ascend. After that, the only thing a vLLM plugin needs to do when adding a new model is register its custom device-specific OOT ops with vLLM. To achieve this, some refactors are needed in both vllm and vllm-ascend, such as extracting some general layers as CustomOp; see vllm-project/vllm-ascend#4084 for details.

Following #27919 and #27147, this PR unifies the logic for resolving vit_attn_backend and extracts MMEncoderAttention as a CustomOp.

Specifically, the vision attention backend should only be checked and overridden in the platform-specific implementation; we should not override this logic anywhere else, such as in model_executor/models/<model_name>.py. In addition, I have moved the scattered forward dispatch logic into this CustomOp so that current_platform no longer needs to be checked anywhere else.
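A rough sketch of the dispatch shape this enables is below; the op name, class name, and signatures are illustrative (not taken from this diff), and only the CustomOp registration/dispatch hooks (forward_native, forward_cuda, forward_oot) are vLLM's existing machinery:

import torch
import torch.nn.functional as F

from vllm.model_executor.custom_op import CustomOp


@CustomOp.register("mm_encoder_attention_sketch")  # illustrative op name
class MMEncoderAttentionSketch(CustomOp):
    """Vision/MM encoder attention whose forward is dispatched per platform."""

    def __init__(self, scale: float) -> None:
        super().__init__()
        self.scale = scale
        # The vision attention backend is resolved once by the platform,
        # so no per-model file needs to branch on current_platform.

    def forward_native(self, q: torch.Tensor, k: torch.Tensor,
                       v: torch.Tensor) -> torch.Tensor:
        # Portable fallback: torch SDPA.
        return F.scaled_dot_product_attention(q, k, v, scale=self.scale)

    def forward_cuda(self, q: torch.Tensor, k: torch.Tensor,
                     v: torch.Tensor) -> torch.Tensor:
        # CUDA/ROCm path: the real op would call the flash-attn varlen kernel
        # selected by the platform; SDPA stands in for brevity here.
        return F.scaled_dot_product_attention(q, k, v, scale=self.scale)

    # Out-of-tree platforms (e.g. vllm-ascend) provide forward_oot by
    # registering their own implementation of this op, instead of shipping
    # modified model files.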

To minimize the impact, I have only replaced the backend of QwenVisionAttention with this CustomOp. I have tested this PR on Ascend A2 NPU, with NVIDIA A100 GPU testing still to be done (TODO). I will migrate the other modeling files and delete the old MultiHeadAttention in a follow-up if this PR is merged.

Test Plan

Test Result

✅ Ascend A2 NPU

Run Qwen2.5-VL:

vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \
--max_model_len 16384 \
--max-num-batched-tokens 16384 \
--tensor-parallel-size 2 \
--enforce-eager
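The exact prompt is not recorded here; the response below came from an OpenAI-compatible chat completion request with an image input, and a request of roughly this shape (the image URL is a placeholder) exercises the vision path:

import requests

# Placeholder image URL; the actual test used an image of the Tongyi Qwen logo.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/tongyi_qwen_logo.png"}},
                {"type": "text", "text": "What is the text in the illustration?"},
            ],
        }],
        "max_tokens": 100,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])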

Output:

{"id":"chatcmpl-b4e3053f30ab2442","object":"chat.completion","created":1764922950,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the image is \"TONGYI Qwen.\" The word \"TONGYI\" is written in blue, and \"Qwen\" is written in gray. The font appears to be modern and clean, with \"TONGYI\" being slightly larger than \"Qwen.\" The design includes a geometric, abstract shape on the left side of the logo, which complements the text.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":162,"completion_tokens":84,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

Run Qwen3-VL:

vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--max_model_len 16384 \
--tensor-parallel-size 2 \
--enforce-eager

Output:

{"id":"chatcmpl-97571fbda8267bd1","object":"chat.completion","created":1764923306,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is **“TONGYI Qwen”**.\n\n### How it looks:\n- **“TONGYI”** is written in **uppercase letters** in a **bold, modern sans-serif font**, colored **blue**.\n- **“Qwen”** is written in **lowercase letters** in a **slightly thinner, elegant sans-serif font**, colored **dark gray**.\n- The two lines of text are stacked vertically, with “TONG","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":112,"total_tokens":212,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

NVIDIA A100 GPU

TO BE DONE...


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: shen-shanshan <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: tjtanaa <[email protected]>
@mergify bot added labels on Dec 5, 2025: qwen (Related to Qwen models), nvidia, rocm (Related to AMD ROCm), tpu (Related to Google TPUs)
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request is a good step towards refactoring the attention mechanisms and making the codebase more modular by introducing MMEncoderAttention as a CustomOp. The unification of the vision attention backend logic is also a welcome improvement.

I've found a critical bug in vllm/attention/layer.py where a variable was renamed but not all its usages were updated, which would cause a runtime error. I've also pointed out an instance of code duplication in the new mm_encoder_attention.py file that should be addressed to improve maintainability.

Once these issues are resolved, this PR will be a solid contribution to the project's architecture.

Comment on lines +464 to 466
self._flash_attn_varlen_func = maybe_get_vit_flash_attn_backend(
self.attn_backend,
)

critical

There seems to be a typo here. The __init__ method now sets self.backend, but self.attn_backend is used here, which is no longer defined. This will lead to an AttributeError at runtime. You should probably use self.backend instead.

Suggested change
self._flash_attn_varlen_func = maybe_get_vit_flash_attn_backend(
self.attn_backend,
)
self._flash_attn_varlen_func = maybe_get_vit_flash_attn_backend(
self.backend,
)

Comment on lines +20 to +43
def maybe_get_vit_flash_attn_backend(
attn_backend: AttentionBackendEnum | None,
) -> Callable | None:
# At this point,
# we already have the attn_backend,
# overriding logic is done in the platform-specific implementation.
# so we don't need to override backend here.
# Just return the attn_backend and flash_attn_varlen_func.

if (
attn_backend == AttentionBackendEnum.FLASH_ATTN
and current_platform.is_cuda_alike()
):
from flash_attn import flash_attn_varlen_func
elif attn_backend == AttentionBackendEnum.FLASH_ATTN and current_platform.is_xpu():
from vllm.attention.utils.fa_utils import flash_attn_varlen_func
elif attn_backend == AttentionBackendEnum.ROCM_AITER_FA:
from aiter import flash_attn_varlen_func
else:
flash_attn_varlen_func = None

# if attn_backend is TORCH_SDPA,
# it will reach here and the flash_attn_varlen_func will be None.
return flash_attn_varlen_func

high

This function maybe_get_vit_flash_attn_backend is a duplicate of the one defined in vllm/attention/layer.py. To avoid code duplication and improve maintainability, it would be better to import it from vllm.attention.layer instead of redefining it here. This would also ensure that any future changes to this utility function are automatically reflected here.
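A minimal sketch of that suggestion, assuming importing from vllm.attention.layer introduces no circular import here (otherwise the single definition could move to a shared utils module instead):

# Reuse the existing helper rather than redefining it in mm_encoder_attention.py.
from vllm.attention.layer import maybe_get_vit_flash_attn_backend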

@chatgpt-codex-connector bot left a comment

💡 Codex Review

self.attn_backend = attn_backend
self.attn_backend, self.flash_attn_varlen_func = (
maybe_get_vit_flash_attn_backend(
self.attn_backend,
attn_backend_override=attn_backend_override,

P1: PaddleOCR vision attention calls helper with outdated signature

maybe_get_vit_flash_attn_backend now only accepts the backend and returns a single function, but the PaddleOCR vision attention still unpacks two return values and passes attn_backend_override. Instantiating this module will now raise TypeError: maybe_get_vit_flash_attn_backend() got an unexpected keyword argument 'attn_backend_override', preventing the model from loading.
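A sketch of how the call site could be adapted to the new helper signature; the attribute names follow the quoted snippet above and may differ in the eventual fix:

# New signature: single backend argument, single return value
# (the varlen flash-attn function, or None for e.g. TORCH_SDPA).
self.flash_attn_varlen_func = maybe_get_vit_flash_attn_backend(
    self.attn_backend,
)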


Comment on lines +464 to +465
self._flash_attn_varlen_func = maybe_get_vit_flash_attn_backend(
self.attn_backend,


P1: MultiHeadAttention uses undefined attn_backend

The multimodal MultiHeadAttention now stores the selected backend in self.backend, but immediately passes self.attn_backend to maybe_get_vit_flash_attn_backend and the subsequent checks. Because self.attn_backend is never initialized, constructing this module raises AttributeError before any forward call, breaking all multimodal encoder attention layers.


@shen-shanshan marked this pull request as draft on December 5, 2025, 09:51