
[NPU] Align with GPUModelRunner #1114

Merged
david6666666 merged 4 commits into vllm-project:main from gcanlin:align
Jan 31, 2026

Conversation

@gcanlin
Contributor

@gcanlin gcanlin commented Jan 30, 2026


Purpose

Align the NPU model runner with the newest commit in GPUModelRunner.

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.


@gcanlin gcanlin changed the title [NPU] align with GPUModelRunner [NPU] Align with GPUModelRunner Jan 30, 2026
Signed-off-by: gcanlin <[email protected]>
@gcanlin
Contributor Author

gcanlin commented Jan 30, 2026

Qwen3-Omni ACL graph breaks:

Details
WorkerProc hit an exception.
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822] Traceback (most recent call last):
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   File "/root/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 817, in worker_busy_loop
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     output = func(*args, **kwargs)
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]              ^^^^^^^^^^^^^^^^^^^^^
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   File "/root/vllm-workspace/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     return func(*args, **kwargs)
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]            ^^^^^^^^^^^^^^^^^^^^^
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   File "/root/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker.py", line 262, in determine_available_memory
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     self.model_runner.profile_run()
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   File "/root/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2176, in profile_run
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     super().profile_run()
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   File "/root/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py", line 4743, in profile_run
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     hidden_states, last_hidden_states = self._dummy_run(
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]                                         ^^^^^^^^^^^^^^^^
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   File "/root/vllm-workspace/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     return func(*args, **kwargs)
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]            ^^^^^^^^^^^^^^^^^^^^^
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   File "/root/vllm-workspace/vllm-omni/vllm_omni/platforms/npu/worker/npu_model_runner.py", line 597, in _dummy_run
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     hidden_states = self.talker_mtp(
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]                     ^^^^^^^^^^^^^^^^
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   File "/root/vllm-workspace/vllm-omni/vllm_omni/model_executor/models/qwen3_omni/qwen3_omni.py", line 626, in talker_mtp
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     code_predictor_codes, summed_embeddings = self.talker.code_predictor_forward(
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   File "/root/vllm-workspace/vllm-omni/vllm_omni/model_executor/models/qwen3_omni/qwen3_omni_moe_talker.py", line 203, in code_predictor_forward
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     pos_all_layers, current_input = self.code_predictor(
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]                                     ^^^^^^^^^^^^^^^^^^^^
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   File "/root/vllm-workspace/vllm/vllm/compilation/decorators.py", line 558, in __call__
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     output = TorchCompileWithNoGuardsWrapper.__call__(self, *args, **kwargs)  # type: ignore[arg-type]
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   File "/root/vllm-workspace/vllm/vllm/compilation/wrapper.py", line 228, in __call__
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     return self._call_with_optional_nvtx_range(
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   File "/root/vllm-workspace/vllm/vllm/compilation/wrapper.py", line 119, in _call_with_optional_nvtx_range
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     return callable_fn(*args, **kwargs)
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   File "/root/vllm-workspace/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 841, in compile_wrapper
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     raise e.with_traceback(None) from e.__cause__  # User compiler error
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822] torch._dynamo.exc.Unsupported: Attempted to call function marked as skipped
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   Explanation: Dynamo developers have intentionally marked that the function `find_spec` in file `<frozen importlib.util>` should not be traced.
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   Hint: Avoid calling the function `find_spec`.
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   Hint: Apply `@torch._dynamo.dont_skip_tracing` to the function `find_spec` to force tracing into the function. More graph breaks may occur as a result of attempting to trace into the function.
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   Hint: Please file an issue to PyTorch.
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   Developer debug context: module: importlib.util, qualname: find_spec, skip reason: <missing reason>
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]  For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0007.html
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822] from user code:
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]    File "/root/vllm-workspace/vllm-omni/vllm_omni/model_executor/models/qwen3_omni/qwen3_omni_moe_code_predictor_mtp.py", line 522, in forward
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     outputs = self.model(
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   File "/root/vllm-workspace/vllm-omni/vllm_omni/model_executor/models/qwen3_omni/qwen3_omni_moe_code_predictor_mtp.py", line 390, in forward
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     hidden_states = layer.mtp_block(hidden_states, past_key_values, cache_position, use_cache, position_ids)
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   File "/root/vllm-workspace/vllm-omni/vllm_omni/model_executor/models/qwen3_omni/qwen3_omni_moe_code_predictor_mtp.py", line 306, in mtp_block
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     hidden_states = self.self_attn(hidden_states, past_key_values, cache_position, use_cache, position_ids)
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   File "/root/vllm-workspace/vllm-omni/vllm_omni/model_executor/models/qwen3_omni/qwen3_omni_moe_code_predictor_mtp.py", line 173, in forward
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     attn_output, _ = attention_interface(
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   File "/root/vllm-workspace/.venv/lib/python3.11/site-packages/transformers/integrations/flash_attention.py", line 66, in flash_attention_forward
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     attn_output = _flash_attention_forward(
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   File "/root/vllm-workspace/.venv/lib/python3.11/site-packages/transformers/modeling_flash_attention_utils.py", line 570, in _flash_attention_forward
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     (flash_fn, flash_varlen_fn, pad_fn, unpad_fn), process_flash_kwargs_fn = lazy_import_flash_attention(
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   File "/root/vllm-workspace/.venv/lib/python3.11/site-packages/transformers/modeling_flash_attention_utils.py", line 136, in lazy_import_flash_attention
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     _flash_fn, _flash_varlen_fn, _pad_fn, _unpad_fn = _lazy_imports(implementation)
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   File "/root/vllm-workspace/.venv/lib/python3.11/site-packages/transformers/modeling_flash_attention_utils.py", line 77, in _lazy_imports
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     is_fa2 = is_flash_attn_2_available()
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   File "/root/vllm-workspace/.venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 1193, in is_flash_attn_2_available
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     if not _is_package_available("flash_attn"):
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]   File "/root/vllm-workspace/.venv/lib/python3.11/site-packages/transformers/utils/import_utils.py", line 47, in _is_package_available
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]     package_exists = importlib.util.find_spec(pkg_name) is not None
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822] Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
(Worker pid=241259) [Stage-1] ERROR 01-30 16:09:21 [multiproc_executor.py:822]
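The graph break occurs because transformers' lazy flash-attention import calls `importlib.util.find_spec` inside the compiled region, and Dynamo refuses to trace that function. A minimal sketch of the usual workaround is to hoist the availability probe to import time so the compiled forward path only reads a plain boolean (names below are illustrative, not from this PR):

```python
import importlib.util

# Probe package availability once at import time, outside any compiled region.
# importlib.util.find_spec is on Dynamo's skip list, so calling it inside a
# torch.compile'd function triggers the graph break seen in the trace above.
HAS_FLASH_ATTN = importlib.util.find_spec("flash_attn") is not None


def pick_attention_backend() -> str:
    # Inside a compiled region this is just a constant-bool branch,
    # not a call into a skipped function, so no graph break occurs.
    return "flash_attn" if HAS_FLASH_ATTN else "eager"
```

This mirrors what a cached `_is_package_available` check achieves; the actual fix in vLLM/transformers may differ.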

@david6666666 david6666666 added this to the v0.14.0 milestone Jan 31, 2026
Signed-off-by: gcanlin <[email protected]>
@gcanlin gcanlin marked this pull request as ready for review January 31, 2026 03:41
@gcanlin
Contributor Author

gcanlin commented Jan 31, 2026

Qwen3-Omni:

python end2end.py --output-wav output_audio --query-type use_audio
INFO 01-31 03:40:32 [log_utils.py:550] {'type': 'request_level_metrics',
INFO 01-31 03:40:32 [log_utils.py:550]  'request_id': '0_a71677b8-8697-40c3-a214-92e125b19622',
INFO 01-31 03:40:32 [log_utils.py:550]  'e2e_time_ms': 74595.84975242615,
INFO 01-31 03:40:32 [log_utils.py:550]  'e2e_tpt': 132.9694291487097,
INFO 01-31 03:40:32 [log_utils.py:550]  'e2e_total_tokens': 561,
INFO 01-31 03:40:32 [log_utils.py:550]  'transfers_total_time_ms': 19.654035568237305,
INFO 01-31 03:40:32 [log_utils.py:550]  'transfers_total_bytes': 9330587,
INFO 01-31 03:40:32 [log_utils.py:550]  'stages': {0: {'stage_gen_time_ms': 23235.073566436768,
INFO 01-31 03:40:32 [log_utils.py:550]                 'num_tokens_out': 72,
INFO 01-31 03:40:32 [log_utils.py:550]                 'num_tokens_in': 265},
INFO 01-31 03:40:32 [log_utils.py:550]             1: {'stage_gen_time_ms': 32800.30679702759, 'num_tokens_out': 224},
INFO 01-31 03:40:32 [log_utils.py:550]             2: {'stage_gen_time_ms': 17407.96208381653, 'num_tokens_out': 0}}}
Processed prompts: 100%|███████████████████████████████████| 1/1 [01:14<00:00, 74.60s/req, est. speed stage-2 tok/s: 7.52, avg e2e_lat: 0.0ms]
INFO 01-31 03:40:32 [omni.py:860] [Summary] {'e2e_requests': 1,
INFO 01-31 03:40:32 [omni.py:860]  'e2e_total_time_ms': 74596.72665596008,
INFO 01-31 03:40:32 [omni.py:860]  'e2e_sum_time_ms': 74595.84975242615,
INFO 01-31 03:40:32 [omni.py:860]  'e2e_total_tokens': 561,
INFO 01-31 03:40:32 [omni.py:860]  'e2e_avg_time_per_request_ms': 74595.84975242615,
INFO 01-31 03:40:32 [omni.py:860]  'e2e_avg_tokens_per_s': 7.520525630606602,
INFO 01-31 03:40:32 [omni.py:860]  'wall_time_ms': 74596.72665596008,
INFO 01-31 03:40:32 [omni.py:860]  'final_stage_id': {'0_a71677b8-8697-40c3-a214-92e125b19622': 2},
INFO 01-31 03:40:32 [omni.py:860]  'stages': [{'stage_id': 0,
INFO 01-31 03:40:32 [omni.py:860]              'requests': 1,
INFO 01-31 03:40:32 [omni.py:860]              'tokens': 337,
INFO 01-31 03:40:32 [omni.py:860]              'total_time_ms': 23259.597301483154,
INFO 01-31 03:40:32 [omni.py:860]              'avg_time_per_request_ms': 23259.597301483154,
INFO 01-31 03:40:32 [omni.py:860]              'avg_tokens_per_s': 14.488642930138395},
INFO 01-31 03:40:32 [omni.py:860]             {'stage_id': 1,
INFO 01-31 03:40:32 [omni.py:860]              'requests': 1,
INFO 01-31 03:40:32 [omni.py:860]              'tokens': 224,
INFO 01-31 03:40:32 [omni.py:860]              'total_time_ms': 32812.37721443176,
INFO 01-31 03:40:32 [omni.py:860]              'avg_time_per_request_ms': 32812.37721443176,
INFO 01-31 03:40:32 [omni.py:860]              'avg_tokens_per_s': 6.82669221239718},
INFO 01-31 03:40:32 [omni.py:860]             {'stage_id': 2,
INFO 01-31 03:40:32 [omni.py:860]              'requests': 1,
INFO 01-31 03:40:32 [omni.py:860]              'tokens': 0,
INFO 01-31 03:40:32 [omni.py:860]              'total_time_ms': 17416.627168655396,
INFO 01-31 03:40:32 [omni.py:860]              'avg_time_per_request_ms': 17416.627168655396,
INFO 01-31 03:40:32 [omni.py:860]              'avg_tokens_per_s': 0.0}],
INFO 01-31 03:40:32 [omni.py:860]  'transfers': [{'from_stage': 0,
INFO 01-31 03:40:32 [omni.py:860]                 'to_stage': 1,
INFO 01-31 03:40:32 [omni.py:860]                 'samples': 1,
INFO 01-31 03:40:32 [omni.py:860]                 'total_bytes': 8299244,
INFO 01-31 03:40:32 [omni.py:860]                 'total_time_ms': 7.549047470092773,
INFO 01-31 03:40:32 [omni.py:860]                 'tx_mbps': 8795.010531200707,
INFO 01-31 03:40:32 [omni.py:860]                 'rx_samples': 1,
INFO 01-31 03:40:32 [omni.py:860]                 'rx_total_bytes': 8299244,
INFO 01-31 03:40:32 [omni.py:860]                 'rx_total_time_ms': 7.003545761108398,
INFO 01-31 03:40:32 [omni.py:860]                 'rx_mbps': 9480.048287639422,
INFO 01-31 03:40:32 [omni.py:860]                 'total_samples': 1,
INFO 01-31 03:40:32 [omni.py:860]                 'total_transfer_time_ms': 15.373706817626953,
INFO 01-31 03:40:32 [omni.py:860]                 'total_mbps': 4318.669061899569},
INFO 01-31 03:40:32 [omni.py:860]                {'from_stage': 1,
INFO 01-31 03:40:32 [omni.py:860]                 'to_stage': 2,
INFO 01-31 03:40:32 [omni.py:860]                 'samples': 1,
INFO 01-31 03:40:32 [omni.py:860]                 'total_bytes': 1031343,
INFO 01-31 03:40:32 [omni.py:860]                 'total_time_ms': 1.018524169921875,
INFO 01-31 03:40:32 [omni.py:860]                 'tx_mbps': 8100.685524853932,
INFO 01-31 03:40:32 [omni.py:860]                 'rx_samples': 1,
INFO 01-31 03:40:32 [omni.py:860]                 'rx_total_bytes': 1031343,
INFO 01-31 03:40:32 [omni.py:860]                 'rx_total_time_ms': 2.595186233520508,
INFO 01-31 03:40:32 [omni.py:860]                 'rx_mbps': 3179.249293723105,
INFO 01-31 03:40:32 [omni.py:860]                 'total_samples': 1,
INFO 01-31 03:40:32 [omni.py:860]                 'total_transfer_time_ms': 4.280328750610352,
INFO 01-31 03:40:32 [omni.py:860]                 'total_mbps': 1927.5958648791845}]}
Adding requests:   0%|                                                                                                  | 0/1 [01:14<?, ?it/s]
query type: use_audio
Request ID: 0_a71677b8-8697-40c3-a214-92e125b19622, Text saved to output_audio/0_a71677b8-8697-40c3-a214-92e125b19622.txt
Request ID: 0_a71677b8-8697-40c3-a214-92e125b19622, Saved audio to output_audio/output_0_a71677b8-8697-40c3-a214-92e125b19622.wav
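As a sanity check, the summary fields in the log above are mutually consistent. A quick sketch of the relationships (inferred from the numbers, not documented in the source):

```python
# Values copied from the Qwen3-Omni request-level metrics above.
e2e_time_ms = 74595.84975242615
e2e_total_tokens = 561

# 'e2e_tpt' appears to be milliseconds per token.
ms_per_token = e2e_time_ms / e2e_total_tokens

# 'e2e_avg_tokens_per_s' is tokens divided by wall time in seconds.
tokens_per_s = e2e_total_tokens / (e2e_time_ms / 1000.0)
```

These reproduce the logged `e2e_tpt` of ~132.97 ms/token and the ~7.52 tok/s shown in the progress bar.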

Qwen2.5-Omni:

python end2end.py --output-wav output_audio --query-type use_mixed_modalities
INFO 01-31 03:50:54 [log_utils.py:550] {'type': 'request_level_metrics',
INFO 01-31 03:50:54 [log_utils.py:550]  'request_id': '0_e466e0b1-b084-4b71-88ab-45be529b0fd5',
INFO 01-31 03:50:54 [log_utils.py:550]  'e2e_time_ms': 161106.65488243103,
INFO 01-31 03:50:54 [log_utils.py:550]  'e2e_tpt': 25.168982171915488,
INFO 01-31 03:50:54 [log_utils.py:550]  'e2e_total_tokens': 6401,
INFO 01-31 03:50:54 [log_utils.py:550]  'transfers_total_time_ms': 135.81585884094238,
INFO 01-31 03:50:54 [log_utils.py:550]  'transfers_total_bytes': 116649082,
INFO 01-31 03:50:54 [log_utils.py:550]  'stages': {0: {'stage_gen_time_ms': 20946.707487106323,
INFO 01-31 03:50:54 [log_utils.py:550]                 'num_tokens_out': 62,
INFO 01-31 03:50:54 [log_utils.py:550]                 'num_tokens_in': 5513},
INFO 01-31 03:50:54 [log_utils.py:550]             1: {'stage_gen_time_ms': 28253.693342208862, 'num_tokens_out': 826},
INFO 01-31 03:50:54 [log_utils.py:550]             2: {'stage_gen_time_ms': 111519.29759979248, 'num_tokens_out': 0}}}
Processed prompts: 100%|█████████████████████████████████| 1/1 [02:41<00:00, 161.11s/req, est. speed stage-2 tok/s: 39.73, avg e2e_lat: 0.0ms]
INFO 01-31 03:50:54 [omni.py:860] [Summary] {'e2e_requests': 1,
INFO 01-31 03:50:54 [omni.py:860]  'e2e_total_time_ms': 161107.51581192017,
INFO 01-31 03:50:54 [omni.py:860]  'e2e_sum_time_ms': 161106.65488243103,
INFO 01-31 03:50:54 [omni.py:860]  'e2e_total_tokens': 6401,
INFO 01-31 03:50:54 [omni.py:860]  'e2e_avg_time_per_request_ms': 161106.65488243103,
INFO 01-31 03:50:54 [omni.py:860]  'e2e_avg_tokens_per_s': 39.73144377351255,
INFO 01-31 03:50:54 [omni.py:860]  'wall_time_ms': 161107.51581192017,
INFO 01-31 03:50:54 [omni.py:860]  'final_stage_id': {'0_e466e0b1-b084-4b71-88ab-45be529b0fd5': 2},
INFO 01-31 03:50:54 [omni.py:860]  'stages': [{'stage_id': 0,
INFO 01-31 03:50:54 [omni.py:860]              'requests': 1,
INFO 01-31 03:50:54 [omni.py:860]              'tokens': 5575,
INFO 01-31 03:50:54 [omni.py:860]              'total_time_ms': 21126.060485839844,
INFO 01-31 03:50:54 [omni.py:860]              'avg_time_per_request_ms': 21126.060485839844,
INFO 01-31 03:50:54 [omni.py:860]              'avg_tokens_per_s': 263.8920779260645},
INFO 01-31 03:50:54 [omni.py:860]             {'stage_id': 1,
INFO 01-31 03:50:54 [omni.py:860]              'requests': 1,
INFO 01-31 03:50:54 [omni.py:860]              'tokens': 826,
INFO 01-31 03:50:54 [omni.py:860]              'total_time_ms': 28333.07433128357,
INFO 01-31 03:50:54 [omni.py:860]              'avg_time_per_request_ms': 28333.07433128357,
INFO 01-31 03:50:54 [omni.py:860]              'avg_tokens_per_s': 29.15320767319569},
INFO 01-31 03:50:54 [omni.py:860]             {'stage_id': 2,
INFO 01-31 03:50:54 [omni.py:860]              'requests': 1,
INFO 01-31 03:50:54 [omni.py:860]              'tokens': 0,
INFO 01-31 03:50:54 [omni.py:860]              'total_time_ms': 111535.30430793762,
INFO 01-31 03:50:54 [omni.py:860]              'avg_time_per_request_ms': 111535.30430793762,
INFO 01-31 03:50:54 [omni.py:860]              'avg_tokens_per_s': 0.0}],
INFO 01-31 03:50:54 [omni.py:860]  'transfers': [{'from_stage': 0,
INFO 01-31 03:50:54 [omni.py:860]                 'to_stage': 1,
INFO 01-31 03:50:54 [omni.py:860]                 'samples': 1,
INFO 01-31 03:50:54 [omni.py:860]                 'total_bytes': 98299988,
INFO 01-31 03:50:54 [omni.py:860]                 'total_time_ms': 64.20040130615234,
INFO 01-31 03:50:54 [omni.py:860]                 'tx_mbps': 12249.14312061534,
INFO 01-31 03:50:54 [omni.py:860]                 'rx_samples': 1,
INFO 01-31 03:50:54 [omni.py:860]                 'rx_total_bytes': 98299988,
INFO 01-31 03:50:54 [omni.py:860]                 'rx_total_time_ms': 46.864986419677734,
INFO 01-31 03:50:54 [omni.py:860]                 'rx_mbps': 16780.115904819835,
INFO 01-31 03:50:54 [omni.py:860]                 'total_samples': 1,
INFO 01-31 03:50:54 [omni.py:860]                 'total_transfer_time_ms': 111.72366142272949,
INFO 01-31 03:50:54 [omni.py:860]                 'total_mbps': 7038.794593604429},
INFO 01-31 03:50:54 [omni.py:860]                {'from_stage': 1,
INFO 01-31 03:50:54 [omni.py:860]                 'to_stage': 2,
INFO 01-31 03:50:54 [omni.py:860]                 'samples': 1,
INFO 01-31 03:50:54 [omni.py:860]                 'total_bytes': 18349094,
INFO 01-31 03:50:54 [omni.py:860]                 'total_time_ms': 13.662576675415039,
INFO 01-31 03:50:54 [omni.py:860]                 'tx_mbps': 10744.148449255876,
INFO 01-31 03:50:54 [omni.py:860]                 'rx_samples': 1,
INFO 01-31 03:50:54 [omni.py:860]                 'rx_total_bytes': 18349094,
INFO 01-31 03:50:54 [omni.py:860]                 'rx_total_time_ms': 9.711980819702148,
INFO 01-31 03:50:54 [omni.py:860]                 'rx_mbps': 15114.604808754339,
INFO 01-31 03:50:54 [omni.py:860]                 'total_samples': 1,
INFO 01-31 03:50:54 [omni.py:860]                 'total_transfer_time_ms': 24.09219741821289,
INFO 01-31 03:50:54 [omni.py:860]                 'total_mbps': 6092.9582076655915}]}
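As a sanity check on the summary above, the logged throughput fields are consistent with simple derivations from the raw counters: `tx_mbps`/`rx_mbps` match megabits per second computed from `total_bytes` and `total_time_ms`, and `avg_tokens_per_s` matches tokens divided by wall time in seconds. A minimal sketch (the helper names `mbps` and `tokens_per_s` are hypothetical, not part of vllm-omni's API):

```python
# Back-compute the throughput fields from the raw counters in the log above.
# These formulas are inferred from the logged values, not taken from the source.

def mbps(total_bytes: int, total_time_ms: float) -> float:
    """Megabits per second: bytes -> bits, milliseconds -> seconds."""
    return (total_bytes * 8 / 1e6) / (total_time_ms / 1e3)

def tokens_per_s(tokens: int, total_time_ms: float) -> float:
    """Decode throughput in tokens per second."""
    return tokens / (total_time_ms / 1e3)

# Stage 0 -> 1 transfer from the log: 98299988 bytes in ~64.2 ms
print(round(mbps(98_299_988, 64.20040130615234), 3))      # ~12249.143, matching tx_mbps

# Stage 0 generation from the log: 5575 tokens in ~21.1 s
print(round(tokens_per_s(5575, 21126.060485839844), 4))   # ~263.8921, matching avg_tokens_per_s
```

The same formula reproduces the end-to-end figure: 6401 tokens over 161106.65 ms gives the reported 39.73 tok/s.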

Signed-off-by: gcanlin <[email protected]>
@gcanlin
Contributor Author

gcanlin commented Jan 31, 2026

@david6666666 It's safe to merge directly without CI.

@david6666666
Collaborator

LGTM

@david6666666 david6666666 added the ready label to trigger buildkite CI label Jan 31, 2026
@david6666666 david6666666 enabled auto-merge (squash) January 31, 2026 04:00
@david6666666 david6666666 merged commit 722a69e into vllm-project:main Jan 31, 2026
6 of 7 checks passed
dongbo910220 pushed a commit to dongbo910220/vllm-omni that referenced this pull request Feb 1, 2026