
Conversation


@shen-shanshan shen-shanshan commented Nov 21, 2025

What this PR does / why we need it?

  • Patch Qwen2_5_VisionAttention with AscendQwen2_5_VisionAttention.
  • Replace AscendQwen2_5_VisionTransformer with Qwen2_5_VisionTransformer from vllm.
  • Move the padding logic (q/k/v and cos/sin) applied before FA into forward() of Qwen2_5_VisionAttention.
  • Convert cu_seqlens in Qwen2_5_VisionAttention from cumulative form to intervals and move it to CPU (compatible with NPU FA); see the sketch after this list.
  • Remove Qwen2.5-VL modeling files.
  • Remove Qwen2.5-VL (without padding) modeling files.
  • Remove related UTs.
  • Make set_forward_context pluggable when getting MM embeddings. See [Platform] Make forward context manager pluggable for other device (vllm#29388) for details.
  • Simplify the padding logic for FA.
  • Add a patch for [Model][Perf] Use cos and sin cache in QwenVL (vllm#28798).
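
For illustration, a minimal sketch of the cu_seqlens conversion mentioned above. The function name and shapes here are assumptions for this example, not the actual vllm-ascend implementation; the point is that the NPU FA path consumes per-sequence lengths on CPU rather than cumulative offsets on the device.

import torch

def cu_seqlens_to_intervals(cu_seqlens: torch.Tensor) -> torch.Tensor:
    # Cumulative form [0, s1, s1 + s2, ...] -> interval form [s1, s2, ...],
    # moved to CPU so the NPU flash-attention op can consume it directly.
    return torch.diff(cu_seqlens).cpu()

print(cu_seqlens_to_intervals(torch.tensor([0, 4, 10, 16])))  # tensor([4, 6, 6])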

Does this PR introduce any user-facing change?

No.

How was this patch tested?

  • Functional test (eager mode)
  • Functional test (graph mode)
  • Benchmark

✅ Functional Test - Eager Mode

Run:

vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \
--max_model_len 16384 \
--max-num-batched-tokens 16384 \
--tensor-parallel-size 2 \
--enforce-eager

Output:

{"id":"chatcmpl-dcec377ffdda48f99252e6c45a418721","object":"chat.completion","created":1763971067,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the image appears to be a series of characters that are not standard letters or symbols. It looks like a pattern or a code rather than readable text. The characters seem to be a mix of shapes and lines, possibly representing a specific language or a stylized form of writing, but without additional context, it's difficult to determine its meaning or purpose. If you have any specific questions about the pattern or need help with a particular aspect of the image, please let me know!","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":177,"completion_tokens":99,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

✅ Functional Test - Graph Mode

Run:

vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \
--max_model_len 16384 \
--max-num-batched-tokens 16384 \
--tensor-parallel-size 2

Output:

{"id":"chatcmpl-dcec377ffdda48f99252e6c45a418721","object":"chat.completion","created":1763971067,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the image appears to be a series of characters that are not standard letters or symbols. It looks like a pattern or a code rather than readable text. The characters seem to be a mix of shapes and lines, possibly representing a specific language or a stylized form of writing, but without additional context, it's difficult to determine its meaning or purpose. If you have any specific questions about the pattern or need help with a particular aspect of the image, please let me know!","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":177,"completion_tokens":99,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

✅ Benchmark

After replacing the whole modeling file with the one in the latest vllm, we observed a considerable performance improvement.

  • Mean TTFT has been reduced by 24.76%.
  • Mean TPOT has been reduced by 18.03%.

Before removing:

============ Serving Benchmark Result ============
Successful requests:                     200       
Failed requests:                         0         
Request rate configured (RPS):           16.00     
Benchmark duration (s):                  45.35     
Total input tokens:                      20026     
Total generated tokens:                  20430     
Request throughput (req/s):              4.41      
Output token throughput (tok/s):         450.48    
Peak output token throughput (tok/s):    2055.00   
Peak concurrent requests:                194.00    
Total Token throughput (tok/s):          892.06    
---------------Time to First Token----------------
Mean TTFT (ms):                          11300.73  
Median TTFT (ms):                        11307.59  
P99 TTFT (ms):                           23844.70  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          243.95    
Median TPOT (ms):                        235.41    
P99 TPOT (ms):                           454.32    
---------------Inter-token Latency----------------
Mean ITL (ms):                           219.73    
Median ITL (ms):                         79.90     
P99 ITL (ms):                            666.86    
==================================================

After removing:

============ Serving Benchmark Result ============
Successful requests:                     200       
Failed requests:                         0         
Request rate configured (RPS):           16.00     
Benchmark duration (s):                  37.55     
Total input tokens:                      20026     
Total generated tokens:                  19561     
Request throughput (req/s):              5.33      
Output token throughput (tok/s):         520.90    
Peak output token throughput (tok/s):    1950.00   
Peak concurrent requests:                191.00    
Total Token throughput (tok/s):          1054.19   
---------------Time to First Token----------------
Mean TTFT (ms):                          8501.95   
Median TTFT (ms):                        8653.75   
P99 TTFT (ms):                           16959.75  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          200.08    
Median TPOT (ms):                        193.59    
P99 TPOT (ms):                           358.28    
---------------Inter-token Latency----------------
Mean ITL (ms):                           173.30    
Median ITL (ms):                         81.29     
P99 ITL (ms):                            574.16    
==================================================

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the Ascend-specific implementation for Qwen2_5_VisionAttention. The changes move the custom attention logic from subclassing Qwen2_5_VisionAttention to monkey-patching it, and switch from weight padding to activation padding. This simplifies vllm_ascend/models/qwen2_5_vl.py considerably by removing redundant code.

However, the new implementation in vllm_ascend/patch/worker/patch_qwen2_5_vl.py introduces a critical bug in the forward method due to incorrect state management. Instance attributes are modified in a way that will cause incorrect behavior on subsequent calls. I've provided a detailed comment and a suggested fix for this issue.
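
To illustrate the class-level monkey-patching approach discussed here, a minimal, self-contained sketch. UpstreamAttention and the pad-to-128 step are stand-ins invented for this example; the real patch targets vllm's Qwen2_5_VisionAttention in vllm_ascend/patch/worker/patch_qwen2_5_vl.py.

import torch

class UpstreamAttention:  # stand-in for Qwen2_5_VisionAttention
    def forward(self, q: torch.Tensor) -> torch.Tensor:
        return q  # upstream (GPU-oriented) attention path

def ascend_forward(self, q: torch.Tensor) -> torch.Tensor:
    # Ascend-specific path: e.g. pad activations before calling the NPU FA kernel.
    pad = (-q.shape[-1]) % 128
    return torch.nn.functional.pad(q, (0, pad))

# Patch at class level so every existing and future instance picks up the
# Ascend path, without keeping a parallel copy of the whole modeling file.
UpstreamAttention.forward = ascend_forward

print(UpstreamAttention().forward(torch.zeros(2, 80)).shape)  # torch.Size([2, 128])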

@github-actions
Copy link

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@github-actions
Copy link

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@shen-shanshan shen-shanshan changed the title [MM][Patch] Patch AscendQwen2_5_VisionAttention and remove redundant code [MM][Model] Remove Qwen2.5-VL modeling file and patch AscendQwen2_5_VisionAttention Nov 25, 2025
@shen-shanshan shen-shanshan changed the title [MM][Model] Remove Qwen2.5-VL modeling file and patch AscendQwen2_5_VisionAttention [MM][Model] Remove Qwen2.5-VL modeling file Nov 25, 2025
@shen-shanshan shen-shanshan changed the title [MM][Model] Remove Qwen2.5-VL modeling file [MM][Model] Remove Qwen2.5-VL modeling files and add necessary patch Nov 25, 2025
@shen-shanshan shen-shanshan changed the title [MM][Model] Remove Qwen2.5-VL modeling files and add necessary patch [MM][Model] Remove Qwen2.5-VL modeling files and add unavoidable patch Nov 25, 2025
@shen-shanshan shen-shanshan changed the title [MM][Model] Remove Qwen2.5-VL modeling files and add unavoidable patch [MM][Model] Remove Qwen2.5-VL modeling files and add patch for VisionAttention Nov 25, 2025
@github-actions
Copy link

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@shen-shanshan shen-shanshan changed the title [MM][Model] Remove Qwen2.5-VL modeling files and add patch for VisionAttention [MM][Model][Perf] Remove Qwen2.5-VL modeling files and add patch for VisionAttention Nov 26, 2025
@shen-shanshan shen-shanshan added the ready (read for review) and ready-for-test (start test by label for PR) labels Nov 27, 2025
@shen-shanshan shen-shanshan added and then removed the ready and ready-for-test labels Nov 28, 2025
@wangxiyuan wangxiyuan merged commit e52ebf8 into vllm-project:main Nov 28, 2025
21 checks passed
wangxiyuan pushed a commit that referenced this pull request Nov 29, 2025
### What this PR does / why we need it?

Following #4349, remove
Qwen2-VL modeling files.


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
wangxiyuan pushed a commit that referenced this pull request Dec 1, 2025
### What this PR does / why we need it?
Following #4349, remove
Qwen3-VL modeling files.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Shanshan Shen <[email protected]>
ChenCangtao pushed a commit to ChenCangtao/vllm-ascend that referenced this pull request Dec 3, 2025
…VisionAttention (vllm-project#4349)

### What this PR does / why we need it?

- [x] Patch `Qwen2_5_VisionAttention` with
`AscendQwen2_5_VisionAttention`.
- [x] Replace `AscendQwen2_5_VisionTransformer` with
`Qwen2_5_VisionTransformer` in vllm.
- [x] Move padding logic (q/k/v and cos/sin) before FA to `forward()` of
`Qwen2_5_VisionAttention`.
- [x] Covert `cu_seqlens` in `Qwen2_5_VisionAttention` from cumulative
form to intervals and move it to cpu (compatible for npu FA).
- [x] Remove Qwen2.5-VL modeling files.
- [x] Remove Qwen2.5-VL (without padding) modeling files.
- [x] Remove related UT.
- [x] Make `set_forward_context` pluggable when getting MM embedding.
Find more details at vllm-project/vllm#29388.
- [x] Simplify padding logic for FA.
- [x] Add patch for vllm-project/vllm#28798.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [x] Functional test (eager mode)
- [x] Functional test (graph mode)
- [x] Benchmark

- vLLM version: v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
ChenCangtao pushed a commit to ChenCangtao/vllm-ascend that referenced this pull request Dec 3, 2025
### What this PR does / why we need it?

Following vllm-project#4349, remove
Qwen2-VL modeling files.


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
ChenCangtao pushed a commit to ChenCangtao/vllm-ascend that referenced this pull request Dec 3, 2025
### What this PR does / why we need it?
Following vllm-project#4349, remove
Qwen3-VL modeling files.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Shanshan Shen <[email protected]>
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
…VisionAttention (vllm-project#4349)

### What this PR does / why we need it?

- [x] Patch `Qwen2_5_VisionAttention` with
`AscendQwen2_5_VisionAttention`.
- [x] Replace `AscendQwen2_5_VisionTransformer` with
`Qwen2_5_VisionTransformer` in vllm.
- [x] Move padding logic (q/k/v and cos/sin) before FA to `forward()` of
`Qwen2_5_VisionAttention`.
- [x] Covert `cu_seqlens` in `Qwen2_5_VisionAttention` from cumulative
form to intervals and move it to cpu (compatible for npu FA).
- [x] Remove Qwen2.5-VL modeling files.
- [x] Remove Qwen2.5-VL (without padding) modeling files.
- [x] Remove related UT.
- [x] Make `set_forward_context` pluggable when getting MM embedding.
Find more details at vllm-project/vllm#29388.
- [x] Simplify padding logic for FA.
- [x] Add patch for vllm-project/vllm#28798.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [x] Functional test (eager mode)
- [x] Functional test (graph mode)
- [x] Benchmark

- vLLM version: v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Che Ruan <[email protected]>
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
### What this PR does / why we need it?

Following vllm-project#4349, remove
Qwen2-VL modeling files.

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Che Ruan <[email protected]>
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
### What this PR does / why we need it?
Following vllm-project#4349, remove
Qwen3-VL modeling files.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Shanshan Shen <[email protected]>
Signed-off-by: Che Ruan <[email protected]>
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
…VisionAttention (vllm-project#4349)

### What this PR does / why we need it?

- [x] Patch `Qwen2_5_VisionAttention` with
`AscendQwen2_5_VisionAttention`.
- [x] Replace `AscendQwen2_5_VisionTransformer` with
`Qwen2_5_VisionTransformer` in vllm.
- [x] Move padding logic (q/k/v and cos/sin) before FA to `forward()` of
`Qwen2_5_VisionAttention`.
- [x] Covert `cu_seqlens` in `Qwen2_5_VisionAttention` from cumulative
form to intervals and move it to cpu (compatible for npu FA).
- [x] Remove Qwen2.5-VL modeling files.
- [x] Remove Qwen2.5-VL (without padding) modeling files.
- [x] Remove related UT.
- [x] Make `set_forward_context` pluggable when getting MM embedding.
Find more details at vllm-project/vllm#29388.
- [x] Simplify padding logic for FA.
- [x] Add patch for vllm-project/vllm#28798.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- [x] Functional test (eager mode)
- [x] Functional test (graph mode)
- [x] Benchmark

- vLLM version: v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Che Ruan <[email protected]>
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
### What this PR does / why we need it?

Following vllm-project#4349, remove
Qwen2-VL modeling files.

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Che Ruan <[email protected]>
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
### What this PR does / why we need it?
Following vllm-project#4349, remove
Qwen3-VL modeling files.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: shen-shanshan <[email protected]>
Signed-off-by: Shanshan Shen <[email protected]>
Signed-off-by: Che Ruan <[email protected]>