
Conversation


@shen-shanshan shen-shanshan commented Nov 21, 2025

What this PR does / why we need it?

  • Patch Qwen2_5_VisionAttention with AscendQwen2_5_VisionAttention.
  • Replace AscendQwen2_5_VisionTransformer with Qwen2_5_VisionTransformer from vllm.
  • Move the padding logic (q/k/v and cos/sin) applied before FA into forward() of Qwen2_5_VisionAttention.
  • Convert cu_seqlens in Qwen2_5_VisionAttention from cumulative form to intervals and move it to CPU (compatible with NPU FA); see the sketch after this list.
  • Remove Qwen2.5-VL modeling files.
  • Remove Qwen2.5-VL (without padding) modeling files.
  • Remove related UTs.
  • Make set_forward_context pluggable when getting MM embeddings. See [Platform] Make forward context manager pluggable for other device (vllm#29388) for details.
  • Simplify the padding logic for FA.
  • Add a patch for [Model][Perf] Use cos and sin cache in QwenVL (vllm#28798).
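
For illustration, a minimal sketch of the cu_seqlens conversion mentioned above. The function name and shapes here are assumptions for this example, not the actual vllm-ascend implementation; the point is that the NPU FA path consumes per-sequence lengths on CPU rather than cumulative offsets on the device.

import torch

def cu_seqlens_to_intervals(cu_seqlens: torch.Tensor) -> torch.Tensor:
    # Cumulative form [0, s1, s1 + s2, ...] -> interval form [s1, s2, ...],
    # moved to CPU so the NPU flash-attention op can consume it directly.
    return torch.diff(cu_seqlens).cpu()

print(cu_seqlens_to_intervals(torch.tensor([0, 4, 10, 16])))  # tensor([4, 6, 6])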

Does this PR introduce any user-facing change?

No.

How was this patch tested?

  • Functional test (eager mode)
  • Functional test (graph mode)
  • Benchmark

✅ Functional Test - Eager Mode

Run:

vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \
--max_model_len 16384 \
--max-num-batched-tokens 16384 \
--tensor-parallel-size 2 \
--enforce-eager

Output:

{"id":"chatcmpl-dcec377ffdda48f99252e6c45a418721","object":"chat.completion","created":1763971067,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the image appears to be a series of characters that are not standard letters or symbols. It looks like a pattern or a code rather than readable text. The characters seem to be a mix of shapes and lines, possibly representing a specific language or a stylized form of writing, but without additional context, it's difficult to determine its meaning or purpose. If you have any specific questions about the pattern or need help with a particular aspect of the image, please let me know!","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":177,"completion_tokens":99,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

✅ Functional Test - Graph Mode

Run:

vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \
--max_model_len 16384 \
--max-num-batched-tokens 16384 \
--tensor-parallel-size 2

Output:

{"id":"chatcmpl-dcec377ffdda48f99252e6c45a418721","object":"chat.completion","created":1763971067,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the image appears to be a series of characters that are not standard letters or symbols. It looks like a pattern or a code rather than readable text. The characters seem to be a mix of shapes and lines, possibly representing a specific language or a stylized form of writing, but without additional context, it's difficult to determine its meaning or purpose. If you have any specific questions about the pattern or need help with a particular aspect of the image, please let me know!","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":177,"completion_tokens":99,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

✅ Benchmark

After replacing the whole modeling file with the one in the latest vllm, we observed a considerable performance improvement.

  • Mean TTFT has been reduced by 24.76%.
  • Mean TPOT has been reduced by 18.03%.

Before removing:

============ Serving Benchmark Result ============
Successful requests:                     200       
Failed requests:                         0         
Request rate configured (RPS):           16.00     
Benchmark duration (s):                  45.35     
Total input tokens:                      20026     
Total generated tokens:                  20430     
Request throughput (req/s):              4.41      
Output token throughput (tok/s):         450.48    
Peak output token throughput (tok/s):    2055.00   
Peak concurrent requests:                194.00    
Total Token throughput (tok/s):          892.06    
---------------Time to First Token----------------
Mean TTFT (ms):                          11300.73  
Median TTFT (ms):                        11307.59  
P99 TTFT (ms):                           23844.70  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          243.95    
Median TPOT (ms):                        235.41    
P99 TPOT (ms):                           454.32    
---------------Inter-token Latency----------------
Mean ITL (ms):                           219.73    
Median ITL (ms):                         79.90     
P99 ITL (ms):                            666.86    
==================================================

After removing:

============ Serving Benchmark Result ============
Successful requests:                     200       
Failed requests:                         0         
Request rate configured (RPS):           16.00     
Benchmark duration (s):                  37.55     
Total input tokens:                      20026     
Total generated tokens:                  19561     
Request throughput (req/s):              5.33      
Output token throughput (tok/s):         520.90    
Peak output token throughput (tok/s):    1950.00   
Peak concurrent requests:                191.00    
Total Token throughput (tok/s):          1054.19   
---------------Time to First Token----------------
Mean TTFT (ms):                          8501.95   
Median TTFT (ms):                        8653.75   
P99 TTFT (ms):                           16959.75  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          200.08    
Median TPOT (ms):                        193.59    
P99 TPOT (ms):                           358.28    
---------------Inter-token Latency----------------
Mean ITL (ms):                           173.30    
Median ITL (ms):                         81.29     
P99 ITL (ms):                            574.16    
==================================================

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the Ascend-specific implementation for Qwen2_5_VisionAttention. The changes move the custom attention logic from subclassing Qwen2_5_VisionAttention to monkey-patching it, and switch from weight padding to activation padding. This simplifies vllm_ascend/models/qwen2_5_vl.py considerably by removing redundant code.

However, the new implementation in vllm_ascend/patch/worker/patch_qwen2_5_vl.py introduces a critical bug in the forward method due to incorrect state management. Instance attributes are modified in a way that will cause incorrect behavior on subsequent calls. I've provided a detailed comment and a suggested fix for this issue.
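
To illustrate the class-level monkey-patching approach discussed here, a minimal, self-contained sketch. UpstreamAttention and the pad-to-128 step are stand-ins invented for this example; the real patch targets vllm's Qwen2_5_VisionAttention in vllm_ascend/patch/worker/patch_qwen2_5_vl.py.

import torch

class UpstreamAttention:  # stand-in for Qwen2_5_VisionAttention
    def forward(self, q: torch.Tensor) -> torch.Tensor:
        return q  # upstream (GPU-oriented) attention path

def ascend_forward(self, q: torch.Tensor) -> torch.Tensor:
    # Ascend-specific path: e.g. pad activations before calling the NPU FA kernel.
    pad = (-q.shape[-1]) % 128
    return torch.nn.functional.pad(q, (0, pad))

# Patch at class level so every existing and future instance picks up the
# Ascend path, without keeping a parallel copy of the whole modeling file.
UpstreamAttention.forward = ascend_forward

print(UpstreamAttention().forward(torch.zeros(2, 80)).shape)  # torch.Size([2, 128])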

@github-actions
Copy link

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@github-actions
Copy link

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@shen-shanshan shen-shanshan changed the title [MM][Patch] Patch AscendQwen2_5_VisionAttention and remove redundant code [MM][Model] Remove Qwen2.5-VL modeling file and patch AscendQwen2_5_VisionAttention Nov 25, 2025
@shen-shanshan shen-shanshan changed the title [MM][Model] Remove Qwen2.5-VL modeling file and patch AscendQwen2_5_VisionAttention [MM][Model] Remove Qwen2.5-VL modeling file Nov 25, 2025
@shen-shanshan shen-shanshan changed the title [MM][Model] Remove Qwen2.5-VL modeling file [MM][Model] Remove Qwen2.5-VL modeling files and add necessary patch Nov 25, 2025
@shen-shanshan shen-shanshan changed the title [MM][Model] Remove Qwen2.5-VL modeling files and add necessary patch [MM][Model] Remove Qwen2.5-VL modeling files and add unavoidable patch Nov 25, 2025
@shen-shanshan shen-shanshan changed the title [MM][Model] Remove Qwen2.5-VL modeling files and add unavoidable patch [MM][Model] Remove Qwen2.5-VL modeling files and add patch for VisionAttention Nov 25, 2025
@github-actions
Copy link

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@shen-shanshan shen-shanshan changed the title [MM][Model] Remove Qwen2.5-VL modeling files and add patch for VisionAttention [MM][Model][Perf] Remove Qwen2.5-VL modeling files and add patch for VisionAttention Nov 26, 2025
@shen-shanshan shen-shanshan added the ready (read for review) and ready-for-test (start test by label for PR) labels Nov 27, 2025
@shen-shanshan shen-shanshan added and then removed the ready and ready-for-test labels Nov 28, 2025
@wangxiyuan wangxiyuan merged commit e52ebf8 into vllm-project:main Nov 28, 2025
21 checks passed
wangxiyuan pushed a commit that referenced this pull request Nov 29, 2025
### What this PR does / why we need it?

Following #4349, remove
Qwen2-VL modeling files.


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
wangxiyuan pushed a commit that referenced this pull request Dec 1, 2025
### What this PR does / why we need it?
Following #4349, remove
Qwen3-VL modeling files.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Shanshan Shen <[email protected]>
ChenCangtao pushed a commit to ChenCangtao/vllm-ascend that referenced this pull request Dec 3, 2025
…VisionAttention (vllm-project#4349)

### What this PR does / why we need it?

- [x] Patch `Qwen2_5_VisionAttention` with
`AscendQwen2_5_VisionAttention`.
- [x] Replace `AscendQwen2_5_VisionTransformer` with
`Qwen2_5_VisionTransformer` in vllm.
- [x] Move padding logic (q/k/v and cos/sin) before FA to `forward()` of
`Qwen2_5_VisionAttention`.
- [x] Covert `cu_seqlens` in `Qwen2_5_VisionAttention` from cumulative
form to intervals and move it to cpu (compatible for npu FA).
- [x] Remove Qwen2.5-VL modeling files.
- [x] Remove Qwen2.5-VL (without padding) modeling files.
- [x] Remove related UT.
- [x] Make `set_forward_context` pluggable when getting MM embedding.
Find more details at vllm-project/vllm#29388.
- [x] Simplify padding logic for FA.
- [x] Add patch for vllm-project/vllm#28798.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [x] Functional test (eager mode)
- [x] Functional test (graph mode)
- [x] Benchmark

- vLLM version: v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
ChenCangtao pushed a commit to ChenCangtao/vllm-ascend that referenced this pull request Dec 3, 2025
### What this PR does / why we need it?

Following vllm-project#4349, remove
Qwen2-VL modeling files.


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
ChenCangtao pushed a commit to ChenCangtao/vllm-ascend that referenced this pull request Dec 3, 2025
### What this PR does / why we need it?
Following vllm-project#4349, remove
Qwen3-VL modeling files.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Shanshan Shen <[email protected]>
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
…VisionAttention (vllm-project#4349)

### What this PR does / why we need it?

- [x] Patch `Qwen2_5_VisionAttention` with
`AscendQwen2_5_VisionAttention`.
- [x] Replace `AscendQwen2_5_VisionTransformer` with
`Qwen2_5_VisionTransformer` in vllm.
- [x] Move padding logic (q/k/v and cos/sin) before FA to `forward()` of
`Qwen2_5_VisionAttention`.
- [x] Covert `cu_seqlens` in `Qwen2_5_VisionAttention` from cumulative
form to intervals and move it to cpu (compatible for npu FA).
- [x] Remove Qwen2.5-VL modeling files.
- [x] Remove Qwen2.5-VL (without padding) modeling files.
- [x] Remove related UT.
- [x] Make `set_forward_context` pluggable when getting MM embedding.
Find more details at vllm-project/vllm#29388.
- [x] Simplify padding logic for FA.
- [x] Add patch for vllm-project/vllm#28798.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [x] Functional test (eager mode)
- [x] Functional test (graph mode)
- [x] Benchmark

- vLLM version: v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Che Ruan <[email protected]>
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
### What this PR does / why we need it?

Following vllm-project#4349, remove
Qwen2-VL modeling files.

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Che Ruan <[email protected]>
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
### What this PR does / why we need it?
Following vllm-project#4349, remove
Qwen3-VL modeling files.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Shanshan Shen <[email protected]>
Signed-off-by: Che Ruan <[email protected]>
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
…VisionAttention (vllm-project#4349)

### What this PR does / why we need it?

- [x] Patch `Qwen2_5_VisionAttention` with
`AscendQwen2_5_VisionAttention`.
- [x] Replace `AscendQwen2_5_VisionTransformer` with
`Qwen2_5_VisionTransformer` in vllm.
- [x] Move padding logic (q/k/v and cos/sin) before FA to `forward()` of
`Qwen2_5_VisionAttention`.
- [x] Covert `cu_seqlens` in `Qwen2_5_VisionAttention` from cumulative
form to intervals and move it to cpu (compatible for npu FA).
- [x] Remove Qwen2.5-VL modeling files.
- [x] Remove Qwen2.5-VL (without padding) modeling files.
- [x] Remove related UT.
- [x] Make `set_forward_context` pluggable when getting MM embedding.
Find more details at vllm-project/vllm#29388.
- [x] Simplify padding logic for FA.
- [x] Add patch for vllm-project/vllm#28798.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [x] Functional test (eager mode)
- [x] Functional test (graph mode)
- [x] Benchmark

- vLLM version: v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Che Ruan <[email protected]>
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
### What this PR does / why we need it?

Following vllm-project#4349, remove
Qwen2-VL modeling files.

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Che Ruan <[email protected]>
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
### What this PR does / why we need it?
Following vllm-project#4349, remove
Qwen3-VL modeling files.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Shanshan Shen <[email protected]>
Signed-off-by: Che Ruan <[email protected]>