[MM][Model][Perf] Remove Qwen2.5-VL modeling files and add patch for VisionAttention #4349
Conversation
Code Review
This pull request refactors the Ascend-specific implementation of Qwen2_5_VisionAttention. The changes move the custom attention logic from subclassing Qwen2_5_VisionAttention to monkey-patching it, and switch from weight padding to activation padding. This simplifies vllm_ascend/models/qwen2_5_vl.py by removing a large amount of redundant code.
However, the new implementation in vllm_ascend/patch/worker/patch_qwen2_5_vl.py introduces a critical bug in the forward method due to incorrect state management. Instance attributes are modified in a way that will cause incorrect behavior on subsequent calls. I've provided a detailed comment and a suggested fix for this issue.
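For illustration, a minimal sketch of the monkey-patching approach this review refers to, assuming the vLLM module path below; the replacement body is a placeholder, not the actual Ascend implementation:

```python
import torch
from vllm.model_executor.models import qwen2_5_vl


def ascend_vision_attention_forward(self, x: torch.Tensor, *args, **kwargs) -> torch.Tensor:
    # Placeholder body: the real patch pads activations (q/k/v and cos/sin)
    # just before the fused attention call. Per-call values such as padded
    # head sizes should stay in locals rather than being written back onto
    # `self`, so that repeated forward calls remain correct.
    hidden_states = x
    return hidden_states


# Applied once at import time by the plugin's patch module: every
# Qwen2_5_VisionAttention instance then uses the Ascend-specific forward.
qwen2_5_vl.Qwen2_5_VisionAttention.forward = ascend_vision_attention_forward
```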
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Force-pushed from ced4ab1 to 3ed998e.
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Force-pushed from 6ee38c9 to 599d3e0.
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Force-pushed from 3a9acfc to 3e16d66.
Force-pushed from ffdf103 to f27ac29.
Referenced in follow-up PRs:
- Following #4349, remove Qwen2-VL modeling files. (vLLM version: v0.11.2)
- Following #4349, remove Qwen3-VL modeling files. (vLLM version: v0.11.2)
What this PR does / why we need it?
- [x] Patch `Qwen2_5_VisionAttention` with `AscendQwen2_5_VisionAttention`.
- [x] Replace `AscendQwen2_5_VisionTransformer` with `Qwen2_5_VisionTransformer` in vLLM.
- [x] Move the padding logic (q/k/v and cos/sin) before FA into `forward()` of `Qwen2_5_VisionAttention`.
- [x] Convert `cu_seqlens` in `Qwen2_5_VisionAttention` from cumulative form to intervals and move it to CPU (compatible with NPU FA), as sketched below.
- [x] Remove Qwen2.5-VL modeling files.
- [x] Remove Qwen2.5-VL (without padding) modeling files.
- [x] Remove related UT.
- [x] Make `set_forward_context` pluggable when getting MM embeddings. Find more details at vllm-project/vllm#29388 ([Platform] Make forward context manager pluggable for other device).
- [x] Simplify padding logic for FA.
- [x] Add patch for vllm-project/vllm#28798.
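A minimal sketch of the `cu_seqlens` conversion mentioned above; the helper name is hypothetical, and the exact layout expected by the NPU flash-attention kernel is an assumption here:

```python
import torch


def cu_seqlens_to_intervals(cu_seqlens: torch.Tensor) -> torch.Tensor:
    # Cumulative offsets, e.g. [0, 3, 8, 12], become per-sequence lengths
    # [3, 5, 4]; the result is moved to CPU, which is what the NPU
    # flash-attention interface is assumed to consume here.
    return torch.diff(cu_seqlens).cpu()


if __name__ == "__main__":
    cu = torch.tensor([0, 3, 8, 12], dtype=torch.int32)
    print(cu_seqlens_to_intervals(cu))  # tensor([3, 5, 4], dtype=torch.int32)
```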
Does this PR introduce any user-facing change?
No.
How was this patch tested?
✅ Functional Test - Eager Mode
Run:
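The exact command was not preserved on this page; below is a hedged reconstruction of this kind of check, assuming a local vLLM server already serving Qwen2.5-VL-7B-Instruct on port 8000 and a placeholder image URL:

```python
from openai import OpenAI

# Point the client at the locally running vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    # Must match the model name/path the server was started with.
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
            {"type": "text", "text": "What does the text in this image say?"},
        ],
    }],
)
print(response.choices[0].message.content)
```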
Output:
{"id":"chatcmpl-dcec377ffdda48f99252e6c45a418721","object":"chat.completion","created":1763971067,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the image appears to be a series of characters that are not standard letters or symbols. It looks like a pattern or a code rather than readable text. The characters seem to be a mix of shapes and lines, possibly representing a specific language or a stylized form of writing, but without additional context, it's difficult to determine its meaning or purpose. If you have any specific questions about the pattern or need help with a particular aspect of the image, please let me know!","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":177,"completion_tokens":99,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}✅ Functional Test - Graph Mode
Run:
Output:
{"id":"chatcmpl-dcec377ffdda48f99252e6c45a418721","object":"chat.completion","created":1763971067,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the image appears to be a series of characters that are not standard letters or symbols. It looks like a pattern or a code rather than readable text. The characters seem to be a mix of shapes and lines, possibly representing a specific language or a stylized form of writing, but without additional context, it's difficult to determine its meaning or purpose. If you have any specific questions about the pattern or need help with a particular aspect of the image, please let me know!","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":177,"completion_tokens":99,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}✅ Benchmark
After replacing the whole modeling file with the one in the latest vLLM, we observed a considerable performance improvement.
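The benchmark figures themselves are not reproduced on this page; as a rough sketch, a comparison like this could be measured with a simple latency loop against the running server (model name, prompt, image URL, and request count are placeholders), run once on each build:

```python
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


def mean_latency(num_requests: int = 10) -> float:
    # Send identical multimodal requests and average end-to-end latency;
    # run once before and once after the change to compare the two builds.
    latencies = []
    for _ in range(num_requests):
        start = time.perf_counter()
        client.chat.completions.create(
            model="Qwen/Qwen2.5-VL-7B-Instruct",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
                    {"type": "text", "text": "Describe this image."},
                ],
            }],
        )
        latencies.append(time.perf_counter() - start)
    return sum(latencies) / len(latencies)


if __name__ == "__main__":
    print(f"mean latency: {mean_latency():.3f}s")
```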
Before removing:
After removing: