[Bug]: Text-to-image model parallel inference failed: concurrent /v1/images/generations calls – one request never gets HTTP response (generation succeeds twice, but only one 200 OK) #1080

### Your current environment

<details>
<summary>The output of <code>python collect_env.py</code></summary>

Collecting environment information...
WARNING 01-29 18:23:47 [mooncake_connector.py:18] Mooncake not available, MooncakeOmniConnector will not work
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.4 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version                : Could not collect
CMake version                : version 3.22.1
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.12 | packaged by Anaconda, Inc. | (main, Oct 21 2025, 20:16:04) [GCC 11.2.0] (64-bit runtime)
Python platform              : Linux-5.15.0-86-generic-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.4.131
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : GPU 0: NVIDIA A100-PCIE-40GB
Nvidia driver version        : 550.90.07
cuDNN version                : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.1.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.1.0
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             80
On-line CPU(s) list:                0-79
Vendor ID:                          GenuineIntel
Model name:                         Intel Xeon Processor (Cascadelake)
CPU family:                         6
Model:                              85
Thread(s) per core:                 2
Core(s) per socket:                 20
Socket(s):                          2
Stepping:                           6
BogoMIPS:                           5986.16
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat pku ospke avx512_vnni
L1d cache:                          2.5 MiB (80 instances)
L1i cache:                          2.5 MiB (80 instances)
L2 cache:                           160 MiB (40 instances)
L3 cache:                           32 MiB (2 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-39
NUMA node1 CPU(s):                  40-79
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:        KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                 Mitigation; PTE Inversion
Vulnerability Mds:                  Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:             Vulnerable
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:             Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; IBRS, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.5.3
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.17.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.3.5
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.3.20
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.0
[pip3] torchaudio==2.9.0
[pip3] torchsde==0.2.6
[pip3] torchvision==0.24.0
[pip3] transformers==4.57.6
[pip3] triton==3.5.0
[conda] flashinfer-python         0.5.3                    pypi_0    pypi
[conda] numpy                     2.2.6                    pypi_0    pypi
[conda] nvidia-cublas-cu12        12.8.4.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.8.90                  pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.8.93                  pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.8.90                  pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.10.2.21                pypi_0    pypi
[conda] nvidia-cudnn-frontend     1.17.0                   pypi_0    pypi
[conda] nvidia-cufft-cu12         11.3.3.83                pypi_0    pypi
[conda] nvidia-cufile-cu12        1.13.1.3                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.9.90                pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.7.3.90                pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.5.8.93                pypi_0    pypi
[conda] nvidia-cusparselt-cu12    0.7.1                    pypi_0    pypi
[conda] nvidia-cutlass-dsl        4.3.5                    pypi_0    pypi
[conda] nvidia-ml-py              13.590.48                pypi_0    pypi
[conda] nvidia-nccl-cu12          2.27.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.8.93                  pypi_0    pypi
[conda] nvidia-nvshmem-cu12       3.3.20                   pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.8.90                  pypi_0    pypi
[conda] pyzmq                     27.1.0                   pypi_0    pypi
[conda] torch                     2.9.0                    pypi_0    pypi
[conda] torchaudio                2.9.0                    pypi_0    pypi
[conda] torchsde                  0.2.6                    pypi_0    pypi
[conda] torchvision               0.24.0                   pypi_0    pypi
[conda] transformers              4.57.6                   pypi_0    pypi
[conda] triton                    3.5.0                    pypi_0    pypi

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.12.0
vLLM-Omni Version            : 0.12.0rc1
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-79    0-1             N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=GPU-048d485d-0791-39d9-394c-17bff1ef670c
NVIDIA_REQUIRE_CUDA=cuda>=12.4 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536
NCCL_VERSION=2.21.5-1
NVIDIA_DRIVER_CAPABILITIES=compute,utility,graphics,video
NVIDIA_PRODUCT_NAME=CUDA
CUDA_VERSION=12.4.1
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
OMP_NUM_THREADS=10
MKL_NUM_THREADS=10
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root

</details>


### Your code version

<details>
<summary>The commit id or version of vllm</summary>

0.12.0

</details>
<details>
<summary>The commit id or version of vllm-omni</summary>

0.12.0rc1

</details>


### 🐛 Describe the bug



Server launch command:

```bash
vllm serve /root/autodl-tmp/models/Z-Image-Turbo \
  --omni \
  --port 8091 \
  --gpu_memory_utilization 0.9
```

- Mode: pure diffusion (single diffusion stage)
- Hardware:
  - GPU: NVIDIA A100-PCIE-40GB (as reported by `collect_env.py` above)
  - Number of GPUs: 1
- OS / Python:
  - OS: Ubuntu 22.04.4 LTS (x86_64)
  - Python: 3.12 (from container)
- Client: `requests` (Python)

Reproduction script:

```python
import threading
import time

import requests

URL = "http://localhost:8091/v1/images/generations"
PAYLOAD = {
    # Prompt in English: "A cute cartoon orange cat, white background"
    "prompt": "可爱的卡通橘猫，白色背景",
    "n": 1,
    "size": "512x512",
    "response_format": "b64_json",
    "num_inference_steps": 8,
}


def run(i):
    """Send one request and print its status code or error with the elapsed time."""
    start = time.time()
    try:
        resp = requests.post(URL, json=PAYLOAD, timeout=60)
        cost = time.time() - start
        print(f"[req {i}] status={resp.status_code} cost={cost:.2f}s")
    except Exception as e:
        cost = time.time() - start
        print(f"[req {i}] ERROR cost={cost:.2f}s err={repr(e)}")


# Start two requests at the same time, then wait for both to finish.
threads = [threading.Thread(target=run, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```


Output (representative):

```text
[req 1] status=200 cost=1.14s
[req 0] ERROR cost=60.06s err=ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='localhost', port=8091): Read timed out. (read timeout=60)"))
```
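
To check whether the pattern holds beyond two requests, a slightly more general harness may help. The following is a minimal sketch, not part of the original repro: `run_concurrent` and `summarize` are hypothetical helper names, and `send` is any zero-argument callable that performs one request and returns its status code.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def run_concurrent(send, n):
    """Call send() n times in parallel; return (index, outcome, seconds) per call.

    Exceptions are captured so that a hung or failed request is reported
    by its exception class name instead of aborting the whole run.
    """
    def one(i):
        start = time.monotonic()
        try:
            outcome = send()
        except Exception as exc:  # e.g. requests.exceptions.ReadTimeout
            outcome = type(exc).__name__
        return i, outcome, time.monotonic() - start

    # One worker per request so that all calls are in flight together.
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(one, range(n)))


def summarize(results):
    """Count how many calls returned HTTP 200 and how many did not."""
    ok = sum(1 for _, outcome, _ in results if outcome == 200)
    return {"ok": ok, "failed": len(results) - ok}
```

With the `URL` and `PAYLOAD` from the script above, `send` could be `lambda: requests.post(URL, json=PAYLOAD, timeout=60).status_code`. Running it with `n` set to 2, 4, and 8 would show whether exactly one request per batch receives a response or whether the ratio changes with load.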

---

### Server Logs During This Repro

For the above two concurrent requests, the server logs show:

```text
(APIServer pid=...) INFO ... Generating 1 image(s) 512x512    # request A
(APIServer pid=...) INFO ... Generating 1 image(s) 512x512    # request B

[Stage-0] INFO ... Generation completed successfully.         # generation A done
[Stage-0] INFO ... Post-processing completed ...

(APIServer pid=...) INFO [log_utils.py:549] {'type': 'request_level_metrics',
(APIServer pid=...) INFO [log_utils.py:549]  'request_id': 'img_gen_1769674978',
(APIServer pid=...) INFO [log_utils.py:549]  'e2e_time_ms': 1033.39,
(APIServer pid=...) INFO [log_utils.py:549]  'stages': {0: {...}}}

(APIServer pid=...) INFO [async_omni.py:468] [Summary] {'e2e_requests': 1, ...}
(APIServer pid=...) INFO [api_server.py:696] Successfully generated 1 image(s)
(APIServer pid=...) INFO: 127.0.0.1:44958 - "POST /v1/images/generations HTTP/1.1" 200 OK

[Stage-0] INFO ... Generation completed successfully.         # generation B done
[Stage-0] INFO ... Post-processing completed ...
# (no second request_level_metrics, no second 200 OK log)
```

So from the logs:

- Both requests are **accepted** (`Generating 1 image(s) 512x512` printed twice).
- Both generations **complete successfully** in Stage-0.
- But only **one** request is counted in `e2e_requests: 1` and only one `200 OK` is written to the HTTP log.
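
One possible explanation, offered as a hypothesis rather than a confirmed diagnosis: the logged `request_id` (`img_gen_1769674978`) looks like a Unix timestamp in whole seconds. If IDs are generated that way and results are routed back to waiting handlers by ID, two requests arriving within the same second would share an ID and one waiter would be lost. The sketch below simulates this with plain `asyncio`. The `pending` dict, `handle`, and `worker` are illustrative names and are not taken from vLLM-Omni's code.

```python
import asyncio


async def simulate(ids, timeout=0.2):
    """Route results to waiting handlers through a dict keyed by request_id."""
    pending = {}  # request_id -> Future; a duplicate id overwrites the earlier waiter
    loop = asyncio.get_running_loop()

    async def handle(request_id):
        # Stand-in for an HTTP handler waiting for its generation result.
        fut = loop.create_future()
        pending[request_id] = fut
        try:
            return await asyncio.wait_for(fut, timeout)
        except asyncio.TimeoutError:
            return "timeout"

    async def worker():
        # Every generation finishes; each result is delivered by id lookup.
        await asyncio.sleep(0.01)
        for request_id in ids:
            fut = pending.get(request_id)
            if fut is not None and not fut.done():
                fut.set_result("200 OK")

    results = await asyncio.gather(*(handle(i) for i in ids), worker())
    return results[:-1]  # drop the worker's return value


# Two requests that share an id: only one waiter is ever resolved.
same_second = asyncio.run(simulate(["img_gen_1769674978", "img_gen_1769674978"]))
# Two requests with distinct ids: both waiters are resolved.
distinct = asyncio.run(simulate(["img_gen_a", "img_gen_b"]))
print(same_second)
print(distinct)
```

In this simulation both "generations" complete, yet the handler whose future was overwritten never receives a result, which matches the symptom in the logs. Whether the server actually behaves this way would need to be confirmed against the source.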

---

### Expected Behavior

- For two concurrent `/v1/images/generations` requests with the same payload:
  - Both should either succeed with HTTP 200 in ~1–2 seconds, or
  - If there is some kind of queueing, the second one should still get an HTTP response well within the client timeout (60s in this repro).

Most importantly, **for every HTTP request, there should be a corresponding HTTP response**.
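
If responses are matched to requests by ID, this invariant depends on every in-flight request having a distinct ID. The sketch below contrasts a whole-second timestamp ID (an assumed scheme, suggested only by the shape of `img_gen_1769674978` in the logs) with a UUID-based one. Both function names are hypothetical and do not come from the vLLM-Omni codebase.

```python
import uuid


def timestamp_id(now):
    """Assumed scheme: whole-second Unix timestamp, which collides within a second."""
    return f"img_gen_{int(now)}"


def unique_id():
    """Alternative: a random UUID4 suffix, distinct per call for practical purposes."""
    return f"img_gen_{uuid.uuid4().hex}"


# Two requests arriving 0.7 s apart within the same second share an id...
a = timestamp_id(1769674978.2)
b = timestamp_id(1769674978.9)
print(a == b)  # True

# ...while UUID-based ids differ.
print(unique_id() != unique_id())  # True
```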

---

### Actual Behavior

- With 2 concurrent requests:
  - One request returns `200 OK` in ~1.1s.
  - The other request hangs until the client-side `requests.post(..., timeout=60)` hits a **ReadTimeout**:

    ```text
    ReadTimeoutError("HTTPConnectionPool(host='localhost', port=8091): Read timed out. (read timeout=60)")
    ```

- On the server side:
  - Both diffusion generations complete successfully (Stage-0 logs).
  - Only one of the two requests is reflected in `request_level_metrics` and `e2e_requests: 1`.
  - Only one `200 OK` is logged by the API server.
  - The other request appears to **never get an HTTP response written**, even though the diffusion stage finished.

This reproduces consistently with 2 concurrent requests (and also when vLLM is called behind a FastAPI gateway).




### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://vllm-omni.readthedocs.io), which can answer lots of frequently asked questions.
