
[Bugfix] yield when stage response #287

Closed
Bounty-hunter wants to merge 1 commit into vllm-project:main from Bounty-hunter:async_sleep

Conversation

@Bounty-hunter
Contributor

Bounty-hunter commented Dec 11, 2025


Purpose

As described in #286
, the execution becomes blocked while waiting for the stage response. We need to yield control so that other tasks can proceed.
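The fix pattern suggested by the branch name (`async_sleep`) can be sketched as follows. This is a hypothetical illustration, not the actual vllm-omni code: the function name, queue type, and structure are assumptions. The key line is `await asyncio.sleep(0)`, which yields control back to the event loop so other coroutines (e.g. other requests' stages) can make progress while this one waits.

```python
import asyncio


async def wait_for_stage_response(q: asyncio.Queue):
    """Poll a stage's response queue without blocking the event loop.

    Hypothetical sketch: a busy-wait loop that checks for a response
    in non-blocking mode, yielding on every empty poll.
    """
    while True:
        try:
            # Non-blocking check for a completed stage response.
            return q.get_nowait()
        except asyncio.QueueEmpty:
            # Without this await, the loop spins and starves every
            # other task on the event loop; with it, control returns
            # to the scheduler so other requests can proceed.
            await asyncio.sleep(0)
```

When the response source is itself an `asyncio.Queue`, `await q.get()` is preferable; the polling pattern above applies when the source exposes only a non-blocking, non-awaitable interface.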

Test Plan

Result: (screenshot attached to the PR)

The log is as follows:

[Stage-0] Received batch size=1, request_ids=chatcmpl-23ee5fcf98294c3f88087feae38f9f94
--------------------------------
(APIServer pid=106638) INFO:vllm_omni.entrypoints.openai.serving_chat:dyyyyyy come to generate  1
(APIServer pid=106638) INFO:vllm_omni.entrypoints.openai.serving_chat:dyyyyyy come to generate chatcmpl-f55e61b8fa1a4305aec7d2aa3b81eefc
(APIServer pid=106638) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] generate() called
(APIServer pid=106638) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Seeding request into stage-0
(APIServer pid=106638) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Enqueued request chatcmpl-f55e61b8fa1a4305aec7d2aa3b81eefc to stage-0
(APIServer pid=106638) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Entering scheduling loop: stages=3
('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
--------------------------------
[Stage-0] Received batch size=1, request_ids=chatcmpl-f55e61b8fa1a4305aec7d2aa3b81eefc
--------------------------------
(APIServer pid=106638) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Stage-0 completed request chatcmpl-23ee5fcf98294c3f88087feae38f9f94;                         forwarding or finalizing
(APIServer pid=106638) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Request chatcmpl-23ee5fcf98294c3f88087feae38f9f94 finalized at stage-0
(APIServer pid=106638) ('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
--------------------------------
[Stage-1] Received batch size=1, request_ids=chatcmpl-23ee5fcf98294c3f88087feae38f9f94
--------------------------------
(EngineCore_DP0 pid=107787) /workspace/d00806799/code/epd/vllm-omni/vllm_omni/worker/npu/npu_model_runner.py:190: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:203.)
(EngineCore_DP0 pid=107787)   info_dict[k] = torch.from_numpy(arr)
(APIServer pid=106638) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Stage-0 completed request chatcmpl-f55e61b8fa1a4305aec7d2aa3b81eefc;                         forwarding or finalizing
(APIServer pid=106638) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Request chatcmpl-f55e61b8fa1a4305aec7d2aa3b81eefc finalized at stage-0
('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
--------------------------------
[Stage-1] Received batch size=1, request_ids=chatcmpl-f55e61b8fa1a4305aec7d2aa3b81eefc
--------------------------------
(APIServer pid=106638) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Stage-1 completed request chatcmpl-23ee5fcf98294c3f88087feae38f9f94;                         forwarding or finalizing
--------------------------------
[Stage-2] Received batch size=1, request_ids=chatcmpl-23ee5fcf98294c3f88087feae38f9f94
--------------------------------
(EngineCore_DP0 pid=108698) INFO:vllm_omni.model_executor.models.qwen2_5_omni.qwen2_5_omni:Currently, we do not use the chunked process, we only use the token2wav.process_chunk for the whole sequence. The stream mode will be implemented in the future.
INFO 12-11 20:43:20 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 12-11 20:43:20 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 12-11 20:43:20 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 12-11 20:43:20 [__init__.py:207] Platform plugin ascend is activated
INFO 12-11 20:43:24 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 12-11 20:43:26 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO:datasets:PyTorch version 2.7.1 available.
(APIServer pid=106638) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Stage-1 completed request chatcmpl-f55e61b8fa1a4305aec7d2aa3b81eefc;                         forwarding or finalizing
......('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
--------------------------------
[Stage-2] Received batch size=1, request_ids=chatcmpl-f55e61b8fa1a4305aec7d2aa3b81eefc
--------------------------------
(APIServer pid=106638) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Stage-2 completed request chatcmpl-23ee5fcf98294c3f88087feae38f9f94;                         forwarding or finalizing
(APIServer pid=106638) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Request chatcmpl-23ee5fcf98294c3f88087feae38f9f94 finalized at stage-2
(APIServer pid=106638) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] All requests completed
(APIServer pid=106638) INFO:vllm_omni.entrypoints.async_omni:[Summary] {'e2e_requests': 1, 'e2e_total_time_ms': 261757.46083259583, 'e2e_sum_time_ms': 261757.03525543213, 'e2e_total_tokens': 0, 'e2e_avg_time_per_request_ms': 261757.03525543213, 'e2e_avg_tokens_per_s': 0.0, 'wall_time_ms': 261757.46083259583, 'final_stage_id': 2, 'stages': [{'stage_id': 0, 'requests': 2, 'tokens': 170, 'total_time_ms': 9340.158939361572, 'avg_time_per_request_ms': 4670.079469680786, 'avg_tokens_per_s': 18.2009750694478}, {'stage_id': 1, 'requests': 1, 'tokens': 1005, 'total_time_ms': 48908.53309631348, 'avg_time_per_request_ms': 48908.53309631348, 'avg_tokens_per_s': 20.54856149582112}, {'stage_id': 2, 'requests': 1, 'tokens': 0, 'total_time_ms': 209500.1938343048, 'avg_time_per_request_ms': 209500.1938343048, 'avg_tokens_per_s': 0.0}], 'transfers': [{'from_stage': 0, 'to_stage': 1, 'samples': 2, 'total_bytes': 4109230, 'total_time_ms': 18.233299255371094, 'tx_mbps': 1802.9562033495476, 'rx_samples': 1, 'rx_total_bytes': 1681800, 'rx_total_time_ms': 8.18181037902832, 'rx_mbps': 1644.4282349156395, 'total_samples': 1, 'total_transfer_t

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.


@Bounty-hunter Bounty-hunter changed the title [Bugfix]yield when no respone [Bugfix] yield when stage response Dec 11, 2025
@hsliuustc0106
Collaborator

please update the execution time comparison

@Gaohan123
Collaborator

Gaohan123 left a comment

Thanks for the nice catch! Please use `git commit -s --amend` to pass the DCO check.

@fake0fan
Contributor

fake0fan left a comment

LGTM. But this is just the first step: it ensures that different requests can be switched between processes during execution. We will continue to work on parallelism issues and out-of-order results.

@Bounty-hunter
Contributor Author

Thanks for the nice catch! Please use `git commit -s --amend` to pass the DCO check.

Okay, but I will try to address it together with #293, which involves a response mismatch issue among concurrent requests.

@Bounty-hunter
Contributor Author

LGTM. But this is just the first step: it ensures that different requests can be switched between processes during execution. We will continue to work on parallelism issues and out-of-order results.

#293 tracks this problem.

@david6666666
Collaborator

nice catch

@david6666666
Collaborator

@Bounty-hunter fix DCO

@Bounty-hunter
Contributor Author

@Bounty-hunter fix DCO

I will close this PR. While resolving it, a new problem emerged where responses are mismatched across concurrent requests, as described in #293. I address both of them in #301; please help to review.



Development

Successfully merging this pull request may close these issues.

[Bug]: Concurrent requests fail to execute in a pipelined manner in online mode

5 participants