Your current environment
The output of `python collect_env.py`
```
vllm: 0.11.0
vllm-ascend: 0.11.0rc2
Collecting environment information...
==============================
System Info
==============================
OS : openEuler 24.03 (LTS-SP2) (aarch64)
GCC version : (GCC) 10.3.1
Clang version : Could not collect
CMake version : version 4.1.2
Libc version : glibc-2.38
==============================
PyTorch Info
==============================
PyTorch version : 2.7.1+cpu
Is debug build : False
CUDA used to build PyTorch : None
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.11.13 (main, Nov 2 2025, 08:49:25) [GCC 12.3.1 (openEuler 12.3.1-98.oe2403sp2)] (64-bit runtime)
Python platform : Linux-4.19.90-vhulk2211.3.0.h1912.eulerosv2r10.aarch64-aarch64-with-glibc2.38
==============================
CUDA / GPU Info
==============================
Is CUDA available : False
CUDA runtime version : No CUDA
CUDA_MODULE_LOADING set to : N/A
GPU models and configuration : No CUDA
Nvidia driver version : No CUDA
cuDNN version : No CUDA
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: HiSilicon
BIOS Vendor ID: HiSilicon
Model name: Kunpeng-920
BIOS Model name: HUAWEI Kunpeng 920 5250 To be filled by O.E.M. CPU @ 2.6GHz
BIOS CPU family: 280
Model: 0
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 4
Stepping: 0x1
BogoMIPS: 200.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache: 12 MiB (192 instances)
L1i cache: 12 MiB (192 instances)
L2 cache: 96 MiB (192 instances)
L3 cache: 192 MiB (8 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
NUMA node4 CPU(s): 96-119
NUMA node5 CPU(s): 120-143
NUMA node6 CPU(s): 144-167
NUMA node7 CPU(s): 168-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
==============================
Versions of relevant libraries
==============================
[pip3] numpy==1.26.4
[pip3] pyzmq==27.1.0
[pip3] torch==2.7.1
[pip3] torch_npu==2.7.1
[pip3] torchaudio==2.8.0
[pip3] torchvision==0.22.1
[pip3] transformers==4.57.1
[conda] Could not collect
==============================
vLLM Info
==============================
ROCM Version : Could not collect
vLLM Version : 0.11.0
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
Could not collect
==============================
Environment Variables
==============================
PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/openssl-3.2.6/lib:/usr/local/lib64:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
```
🐛 Describe the bug
When multiple requests are sent concurrently, the logs indicate that they are still executed sequentially rather than in a pipelined manner.
Start server:
```
vllm serve /workspace/models/Qwen2.5-Omni-7B --omni --port 8091
```
Send request:
```
python openai_chat_completion_client_for_multimodal_generation.py \
    --query-type text \
    --prompt "Generate a 100-character introduction about Huawei." > out1.log 2>&1 &
python openai_chat_completion_client_for_multimodal_generation.py \
    --query-type text \
    --prompt "Generate a 300-character introduction about Beijing." > out2.log 2>&1 &
```
The server log shows:
```
(APIServer pid=65949) INFO: Application startup complete.
(APIServer pid=65949) WARNING 12-11 19:14:40 [protocol.py:93] The following fields were present in the request but ignored: {'sampling_params_list'}
(APIServer pid=65949) dyyyyy recieve a request.
(APIServer pid=65949) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(APIServer pid=65949) INFO 12-11 19:14:42 [chat_utils.py:560] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=65949) WARNING 12-11 19:14:42 [protocol.py:93] The following fields were present in the request but ignored: {'sampling_params_list'}
(APIServer pid=65949) dyyyyy recieve a request.
(APIServer pid=65949) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] generate() called
(APIServer pid=65949) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Seeding request into stage-0
(APIServer pid=65949) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Enqueued request chatcmpl-de4a7f2737964365b6cd10e6159e56d3 to stage-0
(APIServer pid=65949) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Entering scheduling loop: stages=3
--------------------------------
[Stage-0] Received batch size=1, request_ids=chatcmpl-de4a7f2737964365b6cd10e6159e56d3
--------------------------------
('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
(APIServer pid=65949) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Stage-0 completed request chatcmpl-de4a7f2737964365b6cd10e6159e56d3; forwarding or finalizing
(APIServer pid=65949) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Request chatcmpl-de4a7f2737964365b6cd10e6159e56d3 finalized at stage-0
(APIServer pid=65949) ('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
--------------------------------
[Stage-1] Received batch size=1, request_ids=chatcmpl-de4a7f2737964365b6cd10e6159e56d3
--------------------------------
(EngineCore_DP0 pid=67098) /workspace/d00806799/code/epd/vllm-omni/vllm_omni/worker/npu/npu_model_runner.py:190: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:203.)
(EngineCore_DP0 pid=67098) info_dict[k] = torch.from_numpy(arr)
('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
(APIServer pid=65949) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Stage-1 completed request chatcmpl-de4a7f2737964365b6cd10e6159e56d3; forwarding or finalizing
--------------------------------
[Stage-2] Received batch size=1, request_ids=chatcmpl-de4a7f2737964365b6cd10e6159e56d3
--------------------------------
(EngineCore_DP0 pid=68009) INFO:vllm_omni.model_executor.models.qwen2_5_omni.qwen2_5_omni:Currently, we do not use the chunked process, we only use the token2wav.process_chunk for the whole sequence. The stream mode will be implemented in the future.
INFO 12-11 19:15:46 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 12-11 19:15:46 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 12-11 19:15:46 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 12-11 19:15:46 [__init__.py:207] Platform plugin ascend is activated
INFO 12-11 19:15:49 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 12-11 19:15:51 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO:datasets:PyTorch version 2.7.1 available.
......('Warning: torch.save with "_use_new_zipfile_serialization = False" is not recommended for npu tensor, which may bring unexpected errors and hopefully set "_use_new_zipfile_serialization = True"', 'if it is necessary to use this, please convert the npu tensor to cpu tensor for saving')
(APIServer pid=65949) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Stage-2 completed request chatcmpl-de4a7f2737964365b6cd10e6159e56d3; forwarding or finalizing
(APIServer pid=65949) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Request chatcmpl-de4a7f2737964365b6cd10e6159e56d3 finalized at stage-2
(APIServer pid=65949) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] All requests completed
(APIServer pid=65949) INFO:vllm_omni.entrypoints.async_omni:[Summary] {'e2e_requests': 1, 'e2e_total_time_ms': 245017.2679424286, 'e2e_sum_time_ms': 245016.72983169556, 'e2e_total_tokens': 0, 'e2e_avg_time_per_request_ms': 245016.72983169556, 'e2e_avg_tokens_per_s': 0.0, 'wall_time_ms': 245017.2679424286, 'final_stage_id': 2, 'stages': [{'stage_id': 0, 'requests': 1, 'tokens': 111, 'total_time_ms': 6128.433704376221, 'avg_time_per_request_ms': 6128.433704376221, 'avg_tokens_per_s': 18.112295140067616}, {'stage_id': 1, 'requests': 1, 'tokens': 936, 'total_time_ms': 47577.05330848694, 'avg_time_per_request_ms': 47577.05330848694, 'avg_tokens_per_s': 19.6733495437607}, {'stage_id': 2, 'requests': 1, 'tokens': 0, 'total_time_ms': 191286.8161201477, 'avg_time_per_request_ms': 191286.8161201477, 'avg_tokens_per_s': 0.0}], 'transfers': [{'from_stage': 0, 'to_stage': 1, 'samples': 1, 'total_bytes': 2427430, 'total_time_ms': 8.36038589477539, 'tx_mbps': 2322.792302223236, 'rx_samples': 1, 'rx_total_bytes': 2427430, 'rx_total_time_ms': 10.643243789672852, 'rx_mbps': 1824.5790835725006, 'total_samples': 1, 'total_transfer_time_ms': 19.79207992553711, 'total_mbps': 981.1722705779748}, {'from_stage': 1, 'to_stage': 2, 'samples': 1, 'total_bytes': 3307, 'total_time_ms': 0.47707557678222656, 'tx_mbps': 55.45452604897551, 'rx_samples': 1, 'rx_total_bytes': 3307, 'rx_total_time_ms': 0.9031295776367188, 'rx_mbps': 29.29369235058078, 'total_samples': 1, 'total_transfer_time_ms': 2.3987293243408203, 'total_mbps': 11.029172708875857}]}
(APIServer pid=65949) INFO: 127.0.0.1:44424 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=65949) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] generate() called
(APIServer pid=65949) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Seeding request into stage-0
(APIServer pid=65949) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Enqueued request chatcmpl-f72d5e20d3744490b07350cdb06fe334 to stage-0
(APIServer pid=65949) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Entering scheduling loop: stages=3
--------------------------------
[Stage-0] Received batch size=1, request_ids=chatcmpl-f72d5e20d3744490b07350cdb06fe334
--------------------------------
(APIServer pid=65949) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Stage-0 completed request chatcmpl-f72d5e20d3744490b07350cdb06fe334; forwarding or finalizing
(APIServer pid=65949) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Request chatcmpl-f72d5e20d3744490b07350cdb06fe334 finalized at stage-0
--------------------------------
[Stage-1] Received batch size=1, request_ids=chatcmpl-f72d5e20d3744490b07350cdb06fe334
--------------------------------
(APIServer pid=65949) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Stage-1 completed request chatcmpl-f72d5e20d3744490b07350cdb06fe334; forwarding or finalizing
--------------------------------
[Stage-2] Received batch size=1, request_ids=chatcmpl-f72d5e20d3744490b07350cdb06fe334
--------------------------------
(EngineCore_DP0 pid=68009) INFO:vllm_omni.model_executor.models.qwen2_5_omni.qwen2_5_omni:Currently, we do not use the chunked process, we only use the token2wav.process_chunk for the whole sequence. The stream mode will be implemented in the future.
(APIServer pid=65949) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Stage-2 completed request chatcmpl-f72d5e20d3744490b07350cdb06fe334; forwarding or finalizing
(APIServer pid=65949) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] Request chatcmpl-f72d5e20d3744490b07350cdb06fe334 finalized at stage-2
(APIServer pid=65949) INFO:vllm_omni.entrypoints.async_omni:[Orchestrator] All requests completed
(APIServer pid=65949) INFO:vllm_omni.entrypoints.async_omni:[Summary] {'e2e_requests': 1, 'e2e_total_time_ms': 183361.48071289062, 'e2e_sum_time_ms': 183361.37342453003, 'e2e_total_tokens': 0, 'e2e_avg_time_per_request_ms': 183361.37342453003, 'e2e_avg_tokens_per_s': 0.0, 'wall_time_ms': 183361.48071289062, 'final_stage_id': 2, 'stages': [{'stage_id': 0, 'requests': 1, 'tokens': 59, 'total_time_ms': 3272.2251415252686, 'avg_time_per_request_ms': 3272.2251415252686, 'avg_tokens_per_s': 18.030544185752017}, {'stage_id': 1, 'requests': 1, 'tokens': 1005, 'total_time_ms': 50239.848375320435, 'avg_time_per_request_ms': 50239.848375320435, 'avg_tokens_per_s': 20.00404126405945}, {'stage_id': 2, 'requests': 1, 'tokens': 0, 'total_time_ms': 129831.34651184082, 'avg_time_per_request_ms': 129831.34651184082, 'avg_tokens_per_s': 0.0}], 'transfers': [{'from_stage': 0, 'to_stage': 1, 'samples': 1, 'total_bytes': 1681800, 'total_time_ms': 8.469581604003906, 'tx_mbps': 1588.5554480801711, 'rx_samples': 1, 'rx_total_bytes': 1681800, 'rx_total_time_ms': 4.726886749267578, 'rx_mbps': 2846.355479552103, 'total_samples': 1, 'total_transfer_time_ms': 14.164924621582031, 'total_mbps': 949.8391526560291}, {'from_stage': 1, 'to_stage': 2, 'samples': 1, 'total_bytes': 3515, 'total_time_ms': 0.19288063049316406, 'tx_mbps': 145.78965201483314, 'rx_samples': 1, 'rx_total_bytes': 3515, 'rx_total_time_ms': 0.2639293670654297, 'rx_mbps': 106.54365716350497, 'total_samples': 1, 'total_transfer_time_ms': 1.2195110321044922, 'total_mbps': 23.05842199022483}]}
(APIServer pid=65949) INFO: 127.0.0.1:44426 - "POST /v1/chat/completions HTTP/1.1" 200 OK
```
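Both `[Summary]` blocks above report a single request each, and the second request's `generate()` is only logged after the first request's 200 OK, so the end-to-end times add up instead of overlapping. A rough sanity check (a sketch assuming the server output was captured to a hypothetical `server.log`) is:
```python
import ast
import re

# Pull the [Summary] dicts (printed as Python literals) out of the log.
summaries = []
with open("server.log") as f:  # hypothetical capture of the server output
    for line in f:
        m = re.search(r"\[Summary\] (\{.*\})", line)
        if m:
            summaries.append(ast.literal_eval(m.group(1)))

total_s = sum(s["wall_time_ms"] for s in summaries) / 1000
print(f"summed per-request wall time: {total_s:.1f}s")
# With the two summaries above: 245.0s + 183.4s = 428.4s, i.e. the second
# request started only after the first finished instead of overlapping it.
```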
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.