
Conversation


@ECMGit commented Nov 27, 2025

Purpose

Fixes the issue reported in #29417: the Flash Attention backend requires head_dim to be a multiple of 32 in vllm/model_executor/models/qwen2_5_vl.py. As a workaround, fall back to the TORCH_SDPA backend when the head dimension is incompatible.
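
For context, a minimal sketch of the workaround, based on the diff hunks quoted later in this thread (the exact warning message and the fallback assignment are assumptions, not the verbatim merged change):

```python
# In Qwen2_5_VisionAttention.__init__ (sketch, not the exact diff)
self.use_upstream_fa = False
# Flash attention requires head_dim to be a multiple of 32.
# Fall back to TORCH_SDPA if the head dimension is incompatible.
if self.attn_backend in {
    AttentionBackendEnum.FLASH_ATTN,
    AttentionBackendEnum.ROCM_AITER_FA,
} and self.hidden_size_per_attention_head % 32 != 0:
    logger.warning(
        "Flash attention backend requires head_dim to be a multiple of 32, "
        "but got %d; falling back to the TORCH_SDPA backend.",
        self.hidden_size_per_attention_head,
    )
    self.attn_backend = AttentionBackendEnum.TORCH_SDPA
```

Cosmos-Reason1-7B's vision encoder uses head_dim=80, which trips this check.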

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@ECMGit requested a review from sighingnow as a code owner on November 27, 2025 at 15:13
@mergify (bot) added the qwen (Related to Qwen models) label on Nov 27, 2025
@ECMGit changed the title from "[BugFix] Cosmos-Reason1-7B Model Flash Attention head dim bug, fall back to TORCH_SDPA backend" to "[BugFix] Cosmos-Reason1-7B Model Flash Attention requires head dim to be a multiple of 32" on Nov 27, 2025
@gemini-code-assist bot (Contributor) left a comment:

Code Review

This pull request introduces a fix for a crash that occurs when using FlashAttention with a head dimension that is not a multiple of 32 in the Qwen2.5-VL vision encoder. The change correctly detects this condition and falls back to the TORCH_SDPA backend, which is a robust solution to prevent the runtime error. The implementation is clear and correctly placed within the Qwen2_5_VisionAttention module's initialization. The fix is well-contained and effectively resolves the reported issue.

```python
    AttentionBackendEnum.ROCM_AITER_FA,
} and self.hidden_size_per_attention_head % 32 != 0:
    logger.warning(
        f"Flash attention backend requires head_dim to be a multiple of 32, "
```
A Member commented on this snippet:
Please fix pre-commit

@DarkLight1337 (Member) commented:
cc @tjtanaa @Isotr0py @ywang96 so you're aware of this

```python
if self.attn_backend in {
    AttentionBackendEnum.FLASH_ATTN,
    AttentionBackendEnum.ROCM_AITER_FA,
} and self.hidden_size_per_attention_head % 32 != 0:
```
A Member commented on this snippet:
I think #28763 has added head_size=80 (used by cosmos-7b's ViT) support to FA. Can you try to install the latest nightly wheel?

@ECMGit (Author) replied:
Thanks! I have verified that this bug does not occur in the latest nightly build; closing this PR and the issue.

```python
self.use_upstream_fa = False
# Flash attention requires head_dim to be a multiple of 32
# Fall back to TORCH_SDPA if the head dimension is incompatible
if self.attn_backend in {
```
A Collaborator commented on this snippet:
@ECMGit can you skip this check on rocm for now? We have different conditions.
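
A minimal sketch of what skipping the check on ROCm could look like (this assumes vLLM's current_platform.is_rocm() helper and is only an illustration, not the change that landed in the PR):

```python
from vllm.platforms import current_platform

# Sketch: apply the head_dim fallback only off ROCm, since the ROCm
# AITER flash-attention path has different head_dim constraints.
if (
    not current_platform.is_rocm()
    and self.attn_backend == AttentionBackendEnum.FLASH_ATTN
    and self.hidden_size_per_attention_head % 32 != 0
):
    self.attn_backend = AttentionBackendEnum.TORCH_SDPA
```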

@LucasWilkinson (Collaborator) left a comment.

@mergify bot commented Nov 28, 2025:

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ECMGit.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


Labels: needs-rebase, qwen (Related to Qwen models)