
fix: guard flash-attn rotary import #42679

Merged
yewentao256 merged 3 commits into vllm-project:main from he-yufeng:fix/flash-attn-rotary-spec on Jun 11, 2026

@he-yufeng

@he-yufeng he-yufeng commented May 14, 2026

Contributor

Fixes #42675.

Summary

  • import flash_attn.ops.triton.rotary directly when vLLM is not on CPU
  • treat a missing rotary module as unavailable instead of crashing during ApplyRotaryEmb construction
  • keep the existing fallback path when FA4 no longer provides the old rotary module

FA4 can leave the flash_attn root package importable while moving or removing flash_attn.ops.triton.rotary. Checking only the root package therefore selects a fast path that fails during import. The branch now follows the maintainer-requested direct import shape and catches ModuleNotFoundError around that exact import.
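To make the failure mode concrete, here is a minimal standalone sketch. It assumes a hypothetical stand-in package named `fake_flash_attn` that, like an FA4 install, is importable at the root but has no `ops.triton.rotary` submodule. The two functions are also hypothetical illustrations, not vLLM code: one mirrors the old root-only probe, the other the direct, guarded import.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path


def make_root_only_package(name: str) -> Path:
    """Create an importable package that has no ops/triton/rotary submodule."""
    root = Path(tempfile.mkdtemp())
    pkg = root / name
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    sys.path.insert(0, str(root))
    importlib.invalidate_caches()  # let the import system see the new directory
    return root


def root_probe_then_import(name: str) -> str:
    """Old shape: probe only the root package, then import the deep submodule."""
    if importlib.util.find_spec(name) is None:
        return "fallback"
    try:
        importlib.import_module(f"{name}.ops.triton.rotary")
    except ModuleNotFoundError:
        return "crash"  # on main, this error escapes from ApplyRotaryEmb.__init__
    return "fast path"


def direct_import_guarded(name: str) -> str:
    """Merged shape: import the exact submodule, treat its absence as unavailable."""
    try:
        importlib.import_module(f"{name}.ops.triton.rotary")
    except ModuleNotFoundError:
        return "fallback"
    return "fast path"


make_root_only_package("fake_flash_attn")
print(root_probe_then_import("fake_flash_attn"))  # crash
print(direct_import_guarded("fake_flash_attn"))   # fallback
```

The guarded version treats only `ModuleNotFoundError` as "unavailable", so other errors raised while importing a module that does exist would still surface.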

Duplicate work check

I searched open, closed, and recently merged vLLM PRs for #42675, flash_attn.ops.triton.rotary, ApplyRotaryEmb, and the FA4 import failure before opening and updating this PR. I did not find another active fix for this path.

AI assistance disclosure

I used OpenAI Codex to help inspect the repository, update the implementation after review, and run local validation. I reviewed the final diff and can explain the failure mode and fallback behavior. The commits include the required Assisted-by attribution trailers.

Reproduce on main

With an FA4 environment where flash_attn exists but flash_attn.ops.triton.rotary does not, run:

vllm bench latency \
  --model=Qwen/Qwen3-1.7B \
  --quantization=fp8 \
  --attention-backend=TRITON_ATTN \
  --input-len=1024 \
  --output-len=128 \
  --batch-size=8

main imports the missing rotary module after probing only the root flash_attn package, which raises ModuleNotFoundError. On this branch, the direct rotary import is guarded and the existing fallback remains active.

Validation

Passed locally on Windows:

  • python -m py_compile vllm/model_executor/layers/rotary_embedding/common.py
  • python -m ruff check vllm/model_executor/layers/rotary_embedding/common.py
  • python -m ruff format --check vllm/model_executor/layers/rotary_embedding/common.py
  • git diff --check

The previous branch head also passed the vLLM Buildkite PR suite. I do not have the required FA4/GPU environment to rerun the full latency reproduction locally.

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@he-yufeng he-yufeng force-pushed the fix/flash-attn-rotary-spec branch from eff8da3 to d53627d on May 14, 2026 20:58

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request updates the ApplyRotaryEmb class to check for the specific submodule flash_attn.ops.triton.rotary instead of just the base flash_attn package before attempting to import it. This ensures that the rotary embedding logic correctly handles environments where the base package is present but the required Triton operations are missing. Additionally, a unit test was added to verify this behavior using mocks. I have no feedback to provide as there were no review comments.

Comment on lines +32 to +55
def test_apply_rotary_emb_skips_flash_attn_without_rotary_module(monkeypatch):
    monkeypatch.setattr(rotary_common.current_platform, "is_cpu", lambda: False)

    def fake_find_spec(name):
        if name == "flash_attn":
            return object()
        if name == "flash_attn.ops.triton.rotary":
            return None
        raise AssertionError(f"unexpected import probe: {name}")

    monkeypatch.setattr(rotary_common, "find_spec", fake_find_spec)

    compilation_config = CompilationConfig(custom_ops=["all"])
    monkeypatch.setattr(
        custom_op_module,
        "get_cached_compilation_config",
        lambda: compilation_config,
    )

    op = rotary_common.ApplyRotaryEmb()

    assert op.apply_rotary_emb_flash_attn is None


Member


Suggested change: delete the test above (lines +32 to +55).

Thanks for the work! For this small update, we don't need a dedicated unit test.

@joonyoo181


@he-yufeng, are you able to finish this? Otherwise, I can take over.

@he-yufeng he-yufeng force-pushed the fix/flash-attn-rotary-spec branch from d53627d to d42c6dc on June 1, 2026 22:14
@he-yufeng

Contributor Author

I pushed d42c6dc and removed the dedicated unit test per review, keeping the import guard focused on the production change. Local validation: py_compile, ruff check, ruff format --check, and git diff --check. CI is running now.

yewentao256 previously approved these changes Jun 3, 2026

@yewentao256 yewentao256 left a comment

Member


LGTM, thanks for the work!

Please also add the full reproduction command on main and the resulting error report to the PR description, then confirm that the unit tests pass on this branch.

@yewentao256 yewentao256 added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 3, 2026
@he-yufeng he-yufeng force-pushed the fix/flash-attn-rotary-spec branch from d42c6dc to 1114322 on June 4, 2026 03:33
@he-yufeng

Contributor Author

Rebased the branch onto the latest upstream/main and kept the PR scoped to the production import guard only.

Validation run locally:

  • python -m py_compile vllm/model_executor/layers/rotary_embedding/common.py: passed
  • python -m ruff check vllm/model_executor/layers/rotary_embedding/common.py: passed
  • python -m ruff format --check vllm/model_executor/layers/rotary_embedding/common.py: passed
  • git diff --check upstream/main..HEAD: passed

@he-yufeng he-yufeng force-pushed the fix/flash-attn-rotary-spec branch from 1114322 to ed20d62 on June 4, 2026 06:51
@he-yufeng

Contributor Author

Pushed ed20d62ee to fix the Buildkite failure on this branch.

Root cause from the failed Buildkite logs:

ModuleNotFoundError: No module named 'flash_attn'
...
find_spec("flash_attn.ops.triton.rotary")

importlib.util.find_spec() can raise ModuleNotFoundError when the parent package is absent. The branch now wraps that spec probe in _has_flash_attn_rotary() and treats a missing parent package as "rotary module unavailable", while still leaving the actual apply_rotary import explicit when the spec exists.
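A minimal sketch of such a helper, generalized to take any module name. The PR's actual `_has_flash_attn_rotary()` body is not shown here, so `has_importable_module` is a hypothetical name and the stdlib `json` package stands in for `flash_attn`:

```python
from importlib.util import find_spec


def has_importable_module(name: str) -> bool:
    """Return True if `name` can be located, False if it or a parent is missing.

    find_spec() imports the parent packages of a dotted name first, so a
    missing parent raises ModuleNotFoundError instead of returning None.
    """
    try:
        return find_spec(name) is not None
    except ModuleNotFoundError:
        return False


print(has_importable_module("json.decoder"))                      # True
print(has_importable_module("json.no_such_module"))               # False: find_spec returns None
print(has_importable_module("no_such_pkg_zz.ops.triton.rotary"))  # False: raised and caught
```

Note that the probe imports the parent package (`json` here) as a side effect; `find_spec()` does the same for any dotted name.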

Validation run locally:

python -m py_compile vllm/model_executor/layers/rotary_embedding/common.py
python -m ruff check vllm/model_executor/layers/rotary_embedding/common.py
python -m ruff format --check vllm/model_executor/layers/rotary_embedding/common.py
git diff --check

I also reproduced the failing condition with the same helper logic in a local Python snippet: missing flash_attn now returns False instead of raising.


self.apply_rotary_emb_flash_attn = None
-if not current_platform.is_cpu() and find_spec("flash_attn") is not None:
+if not current_platform.is_cpu() and _has_flash_attn_rotary():

Member


Suggested change
-if not current_platform.is_cpu() and _has_flash_attn_rotary():
+if not current_platform.is_cpu() and find_spec("flash_attn.ops.triton.rotary") is not None:

Why can't we do it like this?

@he-yufeng he-yufeng Jun 4, 2026

Contributor Author


find_spec("flash_attn.ops.triton.rotary") is not None was the first shape I tried, but it still raises when the parent package is absent.

The Buildkite failure on the previous head was exactly that case:

ModuleNotFoundError: No module named 'flash_attn'
...
find_spec("flash_attn.ops.triton.rotary")

find_spec() only returns None when the searched module is absent under an importable parent package. If flash_attn itself is not installed, it raises ModuleNotFoundError while resolving the parent. The helper keeps the production guard to the same check, but catches that missing-parent case and treats it as unavailable.
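The behavior described above can be checked in isolation. This sketch builds a hypothetical root-only package, `fa4_like_pkg`, in a temporary directory and classifies what `importlib.util.find_spec()` does for several names. It also shows that a missing intermediate package such as `ops` raises just like a missing root, which matches the `No module named 'flash_attn.ops'` wording in the linked issue's title.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path


def probe(name: str) -> str:
    """Classify what importlib.util.find_spec() does for a (dotted) module name."""
    try:
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        return "raises"
    return "spec" if spec is not None else "None"


# Hypothetical stand-in for an FA4-style install: a root package with no ops/ dir.
root = Path(tempfile.mkdtemp())
(root / "fa4_like_pkg").mkdir()
(root / "fa4_like_pkg" / "__init__.py").write_text("")
sys.path.insert(0, str(root))
importlib.invalidate_caches()

print(probe("fa4_like_pkg"))                       # spec
print(probe("fa4_like_pkg.rotary"))                # None: parent importable, leaf missing
print(probe("fa4_like_pkg.ops.triton.rotary"))     # raises: intermediate 'ops' missing
print(probe("absent_root_pkg.ops.triton.rotary"))  # raises: root package missing
```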

Member


I see. Please also add the full reproduction command on main and the error report to the PR description, and show that the same command passes on this branch. Then I will take a look.

@he-yufeng

Contributor Author

Updated the PR description with the requested main-branch repro command, error text, branch behavior, and current validation/CI status.

@yewentao256 yewentao256 left a comment

Member


Thanks @he-yufeng.
It seems your issue has already been fixed in main, or there is a problem with your current environment:

[yewentao256@nm-frk-h200-03-preserve vllm-source]$ python - <<'PY'
from vllm.model_executor.layers.rotary_embedding.common import ApplyRotaryEmb

ApplyRotaryEmb(
    rotary_dim=128,
    neox_style=True,
    head_size=128,
    is_segmented=False,
)
PY
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
TypeError: ApplyRotaryEmb.__init__() got an unexpected keyword argument 'rotary_dim'

The error comes from your reproduction command itself (an unexpected keyword argument), not from a missing import.

@yewentao256

yewentao256 commented Jun 7, 2026

Member

Closing this PR as not scheduled. Feel free to reopen if I am wrong.

@yewentao256 yewentao256 closed this Jun 7, 2026
@joonyoo181


@yewentao256 I don't think main fixes the issue. The problem is that the code calls find_spec("flash_attn") to check whether the flash_attn.ops.triton.rotary module exists. With FA4, flash_attn exists but flash_attn.ops.triton.rotary was moved elsewhere, so the code raises an import error. You need FA4 installed in your environment for the command to trigger the import error.

See #42675.

@yewentao256

Member

@joonyoo181 Please update FA4 to a newer version, for example:

pip show flash-attn-4
Name: flash-attn-4
Version: 4.0.0b16
Home-page: https://github.com/Dao-AILab/flash-attention
Author: Tri Dao
Author-email: 
License: BSD 3-Clause License
Location: /home/yewentao256/.venv/lib/python3.12/site-packages
Requires: apache-tvm-ffi, einops, nvidia-cutlass-dsl, quack-kernels, torch, torch-c-dlpack-ext, typing_extensions
Required-by:

Or uninstall it; vLLM has its own FA compilation.

@yewentao256

Member

Oops, this can be reproduced with vllm bench latency --model=Qwen/Qwen3-1.7B --quantization=fp8 --attention-backend=TRITON_ATTN --input-len=1024 --output-len=128 --batch-size=8

@yewentao256 yewentao256 reopened this Jun 8, 2026
@@ -135,7 +142,7 @@ def __init__(
self.enable_fp32_compute = enable_fp32_compute

self.apply_rotary_emb_flash_attn = None
-if not current_platform.is_cpu() and find_spec("flash_attn") is not None:
+if not current_platform.is_cpu() and _has_flash_attn_rotary():

Member


Suggested change
-if not current_platform.is_cpu() and _has_flash_attn_rotary():
+if not current_platform.is_cpu():
+    try:
+        apply_rotary = import_module(
+            "flash_attn.ops.triton.rotary"
+        ).apply_rotary
+    except ModuleNotFoundError:
+        apply_rotary = None

I'd prefer this fix

he-yufeng added 2 commits June 9, 2026 18:23
Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Assisted-by: OpenAI Codex
Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Assisted-by: OpenAI Codex
@he-yufeng he-yufeng force-pushed the fix/flash-attn-rotary-spec branch from ed20d62 to f53d644 on June 9, 2026 10:23
@he-yufeng

Contributor Author

Implemented the direct-import approach in f53d644. The code now attempts import_module("flash_attn.ops.triton.rotary").apply_rotary only on non-CPU platforms and leaves the existing fallback selected when that exact module is missing. I used contextlib.suppress(ModuleNotFoundError) because the repository's ruff configuration rejects a bare try/except/pass with SIM105.
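A minimal sketch of that shape as a standalone function, assuming the platform check is passed in as a boolean. `load_apply_rotary` and its parameters are hypothetical names for illustration, not the merged vLLM code:

```python
import contextlib
from importlib import import_module


def load_apply_rotary(is_cpu: bool, module_name: str = "flash_attn.ops.triton.rotary"):
    """Return the module's apply_rotary, or None so the caller keeps its fallback."""
    apply_rotary = None
    if not is_cpu:
        # Same effect as try/except ModuleNotFoundError: pass, in the form SIM105 prefers.
        with contextlib.suppress(ModuleNotFoundError):
            apply_rotary = import_module(module_name).apply_rotary
    return apply_rotary


print(load_apply_rotary(is_cpu=True))  # None: the import is never attempted on CPU
print(load_apply_rotary(is_cpu=False, module_name="no_such_pkg_zz.ops.triton.rotary"))  # None
```

`contextlib.suppress` swallows only the listed exception, so a module that imports but lacks `apply_rotary` would still raise `AttributeError`, the same as in the reviewer's try/except suggestion.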

I also added the repository-required AI assistance disclosure and Assisted-by commit trailers. Local py_compile, ruff check, ruff format check, and diff check pass.

@yewentao256 yewentao256 left a comment

Member


LGTM, thanks for the work and iterations!

@he-yufeng

Contributor Author

Checked the remaining red Buildkite shard. The failing job is multi-modal-models-standard-1-qwen2, and the root error is in Qwen/Qwen2.5-VL-3B-Instruct processor loading:

OSError: Can't load image processor for 'Qwen/Qwen2.5-VL-3B-Instruct'

The failure is in tests/models/multimodal/generation/test_common.py::test_multi_image_models[qwen2_5_vl-test_case47], with repeated Insufficient Permissions warnings while detecting available memory. The CPU/basic/model-executor/rotary-adjacent shards are green, and this does not look related to the rotary_embedding/common.py fallback change in this PR.

Could someone rerun that Buildkite shard or the PR build?

@yewentao256 yewentao256 left a comment

Member


Let's retry once before force-merging.

@yewentao256 yewentao256 merged commit 1f60771 into vllm-project:main Jun 11, 2026
72 checks passed
ryttry pushed a commit to ryttry/vllm that referenced this pull request Jun 11, 2026
Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026
Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026
Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: divineearthly <divineearthly@gmail.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>

Labels

ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: FA4 causes no module named 'flash_attn.ops'

3 participants