[BugFix] Fix Qwen3 TTS 0.6B profile run hang (#995) by marksverdhei · Pull Request #1082 · vllm-project/vllm-omni

marksverdhei · 2026-01-29T10:53:43Z

Purpose

Fix Qwen3-TTS-12Hz-0.6B-Base hanging during server startup (profile/warmup run).

Resolves #995

Root Cause

During vLLM's profile/warmup run, forward() is called with dummy token IDs and empty runtime_additional_information. For the Base task type, this triggers generate_voice_clone() in ICL mode with degenerate inputs (1-second silent audio clip, placeholder ref_text). The 0.6B model cannot converge from this input and generates indefinitely, never producing an EOS token. The EngineCore times out waiting for the worker to finish, emitting:

No available shared memory broadcast block found in 60 seconds.

The 1.7B models are robust enough to produce EOS from the same degenerate input, which is why only 0.6B is affected.

Fix

Cap max_new_tokens to 2 when text is empty (profile run detection), so the full generation pipeline executes — preserving KV-cache profiling behaviour — but exits quickly even when the model cannot converge from degenerate dummy inputs.

Test Plan

# 1. Build Docker image
docker build -t vllm-omni-q3tts:latest -f docker/Dockerfile.ci .

# 2. Start server with 0.6B Base model
docker run -d \
  --name qwen3-tts-server \
  --gpus '"device=0"' \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm-omni-q3tts:latest \
  vllm serve Qwen/Qwen3-TTS-12Hz-0.6B-Base \
    --omni --host 0.0.0.0 --port 8000 \
    --gpu-memory-utilization 0.9 --enforce-eager

# 3. Verify server starts (previously hung here)
curl http://localhost:8000/health

# 4. Send a TTS request
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, this is a test.",
    "model": "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
    "task_type": "Base",
    "language": "English",
    "ref_audio": "data:audio/wav;base64,<base64_encoded_wav>",
    "ref_text": "Reference text for the audio.",
    "x_vector_only_mode": true,
    "max_new_tokens": 256
  }' -o output.wav

Test Result

Tested on RTX 3090 (Ampere, sm_86), Docker, vllm-omni v0.14.0rc1:

Before fix: Server hangs at No available shared memory broadcast block found in 60 seconds — never starts
After fix: Profile run completes with capped generation (max_new_tokens=2), full pipeline executes, server starts successfully
0.6B Base inference: HTTP 200, produces valid 3.34s non-silent 24kHz audio in 2.9s (x_vector_only voice cloning mode)
Approach: Generation step cap preserves full pipeline execution during warmup (KV-cache profiling compatible) while bounding degenerate generation

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6eba044eea

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

When flash-attn is not installed, explicitly request PyTorch SDPA attention instead of falling back to eager mode. Both Qwen3TTS pretrained model classes declare _supports_sdpa = True, so this is safe and lets PyTorch auto-select the fastest available kernel. Also adds regression tests for the SDPA fallback and the empty-text profile run short-circuit (PR vllm-project#1082). Co-Authored-By: Claude Opus 4.5 <[email protected]>

hsliuustc0106 · 2026-01-29T12:07:29Z

@linyueqian PTAL

tzhouam · 2026-01-29T12:15:47Z

Hello, thank you for your effort and contribution.

I reviewed your code and design, and I’m concerned that the “shortcut forward” path could lead to unexpected behavior in KV-cache management, especially if we adopt vLLM’s paged attention in the future.

Would using a dummy input text help avoid this issue?

Copilot

Pull request overview

This PR fixes a critical hang issue where the Qwen3-TTS-12Hz-0.6B-Base model would indefinitely generate during vLLM's profile/warmup run at server startup. The root cause is that during profile runs, the forward() method receives empty text input and degenerate runtime information, causing the 0.6B model to fail to converge and never produce an EOS token, leading to a timeout.

Changes:

Short-circuits the forward() method in Qwen3TTSModelForGeneration when text input is empty
Returns a dummy 1-second silent audio tensor immediately during profile runs
Prevents the 0.6B model from hanging during server startup while maintaining compatibility with larger models

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

marksverdhei · 2026-01-29T12:39:55Z

Thanks for the feedback @tzhouam — that's a valid concern. I tested the dummy text approach: substituting text = "Hello." when empty so the full pipeline executes. Unfortunately it still hangs, because during profile runs it's not only text that's missing — _dummy_run() calls the model without runtime_additional_information in kwargs at all, so ref_audio, ref_text, and all other fields also fall back to degenerate defaults (e.g. a 1-second silent audio clip for ref_audio). The 0.6B model can't converge from that combination.

I'm looking into supplying valid dummy values for all the missing inputs so the full pipeline can execute during warmup, but that may be a larger change. What approach would you prefer here? Some options:

Construct realistic dummy runtime_additional_information at the runner level so the model gets valid inputs during profile runs
Add a generation step cap (e.g. max_new_tokens) during profile runs to bound execution even if the model doesn't converge
Keep the current short-circuit as a stopgap while working on a more complete fix

Happy to go whichever direction works best for the project.

Gaohan123 · 2026-01-29T13:05:38Z

Thanks for the feedback @tzhouam — that's a valid concern. I tested the dummy text approach: substituting text = "Hello." when empty so the full pipeline executes. Unfortunately it still hangs, because during profile runs it's not only text that's missing — _dummy_run() calls the model without runtime_additional_information in kwargs at all, so ref_audio, ref_text, and all other fields also fall back to degenerate defaults (e.g. a 1-second silent audio clip for ref_audio). The 0.6B model can't converge from that combination.

I'm looking into supplying valid dummy values for all the missing inputs so the full pipeline can execute during warmup, but that may be a larger change. What approach would you prefer here? Some options:

Construct realistic dummy runtime_additional_information at the runner level so the model gets valid inputs during profile runs

Add a generation step cap (e.g. max_new_tokens) during profile runs to bound execution even if the model doesn't converge

Keep the current short-circuit as a stopgap while working on a more complete fix

Happy to go whichever direction works best for the project.

Currently, I think choice 2 is better. Thanks.

marksverdhei · 2026-01-29T13:54:38Z

Applied and tested the approach.

Gaohan123 · 2026-01-30T02:50:35Z

Could you please check the CI error? Thanks

marksverdhei · 2026-01-30T16:24:35Z

Could you please check the CI error? Thanks

Not obvious to me that the fail was triggered by this patch. I'm not entirely sure...

During vLLM's profile/warmup run, forward() is called with dummy token IDs and empty runtime_additional_information. For the Base task type, this triggers generate_voice_clone() in ICL mode with degenerate inputs (1s silent clip, placeholder text). The 0.6B model cannot converge from this input and generates indefinitely, causing a shared memory broadcast timeout that blocks server startup. Short-circuit forward() when text is empty (profile run) by returning a dummy 1s silent audio tensor immediately. Co-Authored-By: Claude Opus 4.5 <[email protected]> Signed-off-by: marksverdhei <[email protected]>

@tzhouam

Replace the forward() short-circuit with a max_new_tokens=2 cap during profile/warmup runs. This lets the full generation pipeline execute (preserving KV-cache profiling behaviour for future paged attention adoption) while bounding execution so the 0.6B model cannot hang. Addresses review feedback from @tzhouam and @Gaohan123. Signed-off-by: Mark Sverdlov Heisey <[email protected]> Signed-off-by: marksverdhei <[email protected]>

marksverdhei · 2026-01-30T17:19:00Z

Could you please check the CI error? Thanks

Not obvious to me that the fail was triggered by this patch. I'm not entirely sure...

[Automated analysis] I reproduced the CI environment and investigated both failures. Neither appears to be caused by this PR's changes:

test_qwen3_omni_talker — Fails with torch._dynamo.exc.BackendCompilerFailed during torch.compile. This is a pre-existing issue with torch.compile + Qwen3 Omni talker that is already tracked and was fixed independently in PR [BugFix] Fix Qwen3 Omni talker mtp torch.compile startup error #1104 ([BugFix] Fix Qwen3 Omni talker mtp torch.compile startup error). The failing code path is in torch._inductor and unrelated to the profile-run changes in this PR.
test_qwen3_tts — Fails with an assertion on audio duration: assert 2 <= duration <= 20 where duration ≈ 0.02s. This is an inference quality issue — the model produces near-silent output for the test prompt. This test also fails on main (confirmed by checking the base branch CI), so it is a flaky/pre-existing failure unrelated to this PR.

In summary, both CI failures are reproducible on main without this PR's changes.

This is an automated message generated by a CI analysis tool.

hsliuustc0106

lgtm

hsliuustc0106 · 2026-01-30T21:40:35Z

@Gaohan123 PTAL

Gaohan123

LGTM. Thanks!

…-project#1082)

* [Frontend][Model] Support batch request with refined OmniDiffusionReq… (#797) Signed-off-by: Huang, Zeyu <[email protected]> * [Model]: add FLUX.1-dev model (#853) * [BugFix] ignore mm data from stages to async omni (#954) Signed-off-by: dengyunyang <[email protected]> * Revert "[BugFix] ignore mm data from stages to async omni" (#1023) * [Bugfix] Modify output to model_runner_output (#1026) Signed-off-by: gcanlin <[email protected]> * [Feature] Support cache-dit for Wan 2.2 inference (#1021) Signed-off-by: samithuang <[email protected]> Signed-off-by: Samit <[email protected]> * [Doc]Format profiling doc (#993) Signed-off-by: lishunyang <[email protected]> Signed-off-by: Hongsheng Liu <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> Co-authored-by: Copilot <[email protected]> * [Hardware] Support platforms and plugin system (#774) Signed-off-by: gcanlin <[email protected]> * [Core]: KV Cache Transfer Encapsulation (#979) Signed-off-by: princepride <[email protected]> Signed-off-by: 汪志鹏 <[email protected]> * [Test]Delete skip mark for amd ci test and fix CI failure (#927) Signed-off-by: wangyu31577 <[email protected]> Signed-off-by: Hongsheng Liu <[email protected]> Co-authored-by: wangyu31577 <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Bugfix][Doc]Specify Qwen3-TTS model name for each task type (#1036) Signed-off-by: Kyle Huang <[email protected]> * [Misc] pin version of fa3-fwd (#1051) Signed-off-by: zjy0516 <[email protected]> * [CI] [ROCm] Add more AMD CI tests (#1039) Signed-off-by: tjtanaa <[email protected]> * [Bugfix] fix qwen image layerd in dummy run (#1027) Signed-off-by: zjy0516 <[email protected]> * [BugFix] Fix noisy output without setting a seed in Qwen Image (#1043) Signed-off-by: natureofnature <[email protected]> * [bugfix] remove vllm speech route (#1060) Signed-off-by: linyueqian <[email protected]> * [Debug] Update GLM-Image Pipeline (#1049) Co-authored-by: root <[email protected]> * [Diffusion][Bugfix] Fix the flash_attn backends selection logic (#983) Signed-off-by: mxuax <[email protected]> Signed-off-by: XU Mingshi <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [BugFix] Fix the accuracy issue of multimodal input. (#1020) Signed-off-by: amy-why-3459 <[email protected]> Co-authored-by: Rein Yang <[email protected]> * [Bugfix] Set VaeImageProcessor `do_convert_rgb` True (#1032) Signed-off-by: gcanlin <[email protected]> * [feat]: adapt batch request for flux (#1028) Signed-off-by: wuzhongjian [email protected] * [CI] Change Qwen3 Omni stage placement strategy (#1072) Signed-off-by: ZeldaHuang <[email protected]> * [BugFix] Fix to use correct attn backend (#1038) Signed-off-by: Divyansh Singhvi <[email protected]> * [Perf] Qwen3 Omni talker mtp optimization (#1005) Signed-off-by: ZeldaHuang <[email protected]> Signed-off-by: Hongsheng Liu <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Wan2.2] Optimize memory usage with conditional transformer loading (#980) Signed-off-by: Lin, Fanli <[email protected]> Signed-off-by: Samit <[email protected]> Co-authored-by: Samit <[email protected]> * [Feat] Support XPU Backend in vLLM-Omni (#191) Signed-off-by: Fanli Lin <[email protected]> Signed-off-by: Fanli Lin <[email protected]> Co-authored-by: Copilot <[email protected]> * [Fix] stabilize diffusion images LoRA E2E across CI drift (#1075) Signed-off-by: dongbo910220 <[email protected]> * [Bugfix][Test] Re-enable the log simple tests (#1065) Signed-off-by: gcanlin <[email protected]> * [Bugfix] pr conflict fix, bugfix ignore mm data from stages to async omni (#1025) Signed-off-by: dengyunyang <[email protected]> * [Doc][Bagel] Add BAGEL-7B-MoT documentation and edit the default stage configuration (#987) Signed-off-by: Ding Zuhao <[email protected]> Signed-off-by: jzz <[email protected]> * [Fix] Increase max wait time for server readiness to accommodate model loading (#1089) Signed-off-by: Andy Zhou <[email protected]> * [Benchmark] Add vLLM-Omni Omni model online benchmark (#780) Signed-off-by: wangyu31577 <[email protected]> Signed-off-by: wangyu <[email protected]> Co-authored-by: wangyu31577 <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Bugfix] Remove Mooncake/Yuanrong connector import warning (#1091) Signed-off-by: natureofnature <[email protected]> * fix: UnboundLocalError for role in streaming audio/image responses (#784) Signed-off-by: Pierre Le Guen <[email protected]> * [Misc] update wechat image (#1096) * [Feature] Support DiT Layerwise (Blockwise) CPU Offloading (#858) Signed-off-by: yuanheng <[email protected]> Signed-off-by: Hongsheng Liu <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> Co-authored-by: Copilot <[email protected]> * [BugFix] Modify max_tokens and modify the log and fix #1103 (#1097) Signed-off-by: amy-why-3459 <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [BugFix] Fix modulate_index shape error in Qwen-Image-Edit Task (#1100) Signed-off-by: mxuax <[email protected]> Signed-off-by: XU Mingshi <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Platform] Add supports_torch_inductor interface (#1108) Signed-off-by: gcanlin <[email protected]> * [BugFix] Fix Qwen3 Omni talker mtp torch.compile startup error (#1104) Signed-off-by: ram16g <[email protected]> Signed-off-by: ZeldaHuang <[email protected]> Co-authored-by: ram16g <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Bugfix] fix request_id of image generation in api server (#1112) Signed-off-by: zjy0516 <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Perf]: CFG parallel abstraction (#851) Signed-off-by: Didan Deng <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [BugFix] Fix Qwen3 TTS 0.6B profile run hang (#995) (#1082) * [CI] [ROCm] Quick fix amd ci (#1116) Signed-off-by: tjtanaa <[email protected]> * [Bugfix] fix benchmark audio timing error and add benchmark test (#1109) Signed-off-by: wangyu31577 <[email protected]> Co-authored-by: wangyu31577 <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Bugfix][Qwen3TTS] Load speaker_id/voices from model configuration (#1079) Signed-off-by: pablo <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> Co-authored-by: WeiQing Chen <[email protected]> * [NPU] Align with GPUModelRunner (#1114) Signed-off-by: gcanlin <[email protected]> * [FEATURE] /v1/images/edit interface (#1101) Signed-off-by: dengyunyang <[email protected]> * [Bugfix] Fix NPU SDPA attention mask shape and semantics (#1031) Signed-off-by: gcanlin <[email protected]> Co-authored-by: muziyuhui666 <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [TeaCache]: Add Coefficient Estimation (#940) Signed-off-by: princepride <[email protected]> Signed-off-by: 汪志鹏 <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [CI]: Bagel E2E Smoked Test (#1074) Signed-off-by: princepride <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Misc] Bump version to 0.14.0 (#1128) Signed-off-by: Roger Wang <[email protected]> * [Doc] First stable release of vLLM-Omni (#1129) Signed-off-by: Roger Wang <[email protected]> * [Misc] Align error handling with upstream vLLM v0.14.0 (#1122) Signed-off-by: anna <[email protected]> Co-authored-by: anna <[email protected]> * [Feature] add Tensor Parallelism to LongCat-Image(-Edit) (#926) Signed-off-by: Rustam Khadipash <[email protected]> * [CI] Temporarily remove slow tests. (#1143) Signed-off-by: Alicia <[email protected]> Signed-off-by: princepride <[email protected]> Co-authored-by: princepride <[email protected]> * [CI] Refactor test_sequence_parallel.py and add a warmup run for more accurate performance stat (#1165) Signed-off-by: mxuax <[email protected]> Signed-off-by: XU Mingshi <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * Dev/rebase v0.15.0 (#1159) Signed-off-by: Taichang Zhou <[email protected]> Signed-off-by: tzhouam <[email protected]> Signed-off-by: princepride <[email protected]> Co-authored-by: 汪志鹏 <[email protected]> * Docs update paper link (#1169) Signed-off-by: hsliu <[email protected]> Signed-off-by: hsliu_ustc <[email protected]> Co-authored-by: hsliu_ustc <[email protected]> * [Debug] Clear Dockerfile.ci to accelerate build image (#1172) Signed-off-by: tzhouam <[email protected]> * [Debug] Correct Unreasonable Long Timeout (#1175) Signed-off-by: tzhouam <[email protected]> * [Doc]Fix - Align with repo. (#1176) Signed-off-by: Alicia <[email protected]> * [Bugfix][Qwen-Image-Edit] Add a warning log for none negative_prompt (#1170) Signed-off-by: gcanlin <[email protected]> * [Bugfix] fix qwen image oom (#1168) Signed-off-by: zjy0516 <[email protected]> * [Hardware] Disable compile of diffusion on XPU (#1148) Signed-off-by: zhenwei-intel <[email protected]> * [Doc] Fix vLLM version in user docs (#1179) Signed-off-by: Yuanheng Zhao <[email protected]> * [Refactor] Refactor async chunk and fix the shape mismatch issue (#1151) Signed-off-by: amy-why-3459 <[email protected]> * bugfix: /images/edits endpoint fails pipeline data format check (#1141) Signed-off-by: Huang, Zeyu <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Perf] resolving prolonged `cudastreamsynchronize` execution in z image processing (#1105) Signed-off-by: erfgss <[email protected]> Co-authored-by: Copilot <[email protected]> * [Bugfix] modify RTF use audio_e2e/audio_duration (#1157) Signed-off-by: wangyu31577 <[email protected]> Co-authored-by: wangyu31577 <[email protected]> * [Doc] Highlight paper & slides. (#1186) Signed-off-by: Alicia <[email protected]> * [chore] Remove zmq context initialize (#1187) Signed-off-by: xiedeyantu <[email protected]> * [NPU] Update Dockerfile and docs for v0.14.0 (#671) Signed-off-by: gcanlin <[email protected]> * [Bugfix] E2E metric incorrect qwen3-omni with async chunk feature (#1018) Signed-off-by: Junhong Liu <[email protected]> Signed-off-by: Junhong Liu <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Doc] opt doc (#1118) Signed-off-by: David Chen <[email protected]> * [Bugfix] Fix tp+sp accuracy, incorrect process group mapping (#1178) Signed-off-by: David Chen <[email protected]> * [Feature] Enable use_audio_in_video for Qwen 3 Omni Online (#1198) Signed-off-by: tzhouam <[email protected]> * [Bugfix] async_chunk rebase v0.15.0 (#1195) Signed-off-by: amy-why-3459 <[email protected]> * [feature]: support flux cache_dit (#1145) Co-authored-by: Jiangyun Zhu <[email protected]> * [CI] Add CI branch coverage calculation, fix statement coverage results and add log before test for buildkite log group (#1120) Signed-off-by: wangyu31577 <[email protected]> Co-authored-by: wangyu31577 <[email protected]> * [Wan 2.2][Diffusion] Add TP Support (#964) Signed-off-by: weichen <[email protected]> * [Hardware] [Feat] Setup platform dependent package installation (#1046) Signed-off-by: tjtanaa <[email protected]> Co-authored-by: PopSoda2002 <[email protected]> Co-authored-by: gcanlin <[email protected]> * [XPU] Fix XPU UTs for basic coverage (#1164) Signed-off-by: Yan Ma <[email protected]> * [Test] Add BuildKite test-full script for full CI. (#867) Signed-off-by: wangyu31577 <[email protected]> Co-authored-by: wangyu31577 <[email protected]> * [Refactor] Reuse upstream Qwen3MoeSparseMoeBlock (#1202) Signed-off-by: gcanlin <[email protected]> * [Bugfix] Fix wan2.2 ti2v (#1221) Signed-off-by: mxuax <[email protected]> Signed-off-by: XU Mingshi <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Bugfix] Fix '--max-generated-image-size' cli args type (#1249) Signed-off-by: ApsarasX <[email protected]> * [Bugfix] Ensure seed=0 is correctly handled in image edit (#1248) Signed-off-by: ApsarasX <[email protected]> * [Docs] Add example image download step to Image-To-Video examples (#1258) Signed-off-by: lishunyang <[email protected]> * [Bugfix] Fix padding bug in 12Hz tokenizer ConvTranspose1d decode (#1241) Signed-off-by: linyueqian <[email protected]> * [bugfix] Fix multimodal_output property to check completion outputs where audio data is attached (#1203) Signed-off-by: linyueqian <[email protected]> * [Doc] Update QA relevant to quantization (#1257) Signed-off-by: lishunyang <[email protected]> * [Bugfix] Fix Doc link Rrror (#1263) Signed-off-by: lishunyang <[email protected]> * Process-Scoped GPU Memory Accounting (#1204) Signed-off-by: Divyansh Singhvi <[email protected]> * [ComfyUI]: ComfyUI integration (#1113) Signed-off-by: Huang, Zeyu <[email protected]> * fix: add diffusion offload args to OmniConfig group instead of serve_parser (#1271) Signed-off-by: Chenguang ZHENG <[email protected]> * [Doc] Adding models/pipelines/features Tutorial (#1196) Signed-off-by: Didan Deng <[email protected]> Co-authored-by: Copilot <[email protected]> Co-authored-by: dongbo910220 <[email protected]> * [CI] Add env variable check for nightly CI (#1281) Signed-off-by: Alicia <[email protected]> * [CI] Add pytest markers to current tests and update the doc. (#577) Signed-off-by: Alicia <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Diffusion][Perf] Remove Redundant Communication Cost by Refining SP Hook Design (#1275) Signed-off-by: mxuax <[email protected]> Signed-off-by: XU Mingshi <[email protected]> * [Feature] Opt metrics structure (#891) Signed-off-by: Junhong Liu <[email protected]> Signed-off-by: Junhong Liu <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Test] Add example test cases for omni online (#1086) Signed-off-by: wangyu31577 <[email protected]> Signed-off-by: Hongsheng Liu <[email protected]> Signed-off-by: yenuo26 <[email protected]> Co-authored-by: wangyu31577 <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> Co-authored-by: Copilot <[email protected]> * [CI] Reduce the time for Diffusion Sequence Parallelism Test (#1283) Signed-off-by: Alicia <[email protected]> * [Model] SupportHunyuanImage3 Diffusion Model in vllm-omni (#1085) Signed-off-by: Semmer2 <[email protected]> * [Chore] Update copyright year. (#1256) Signed-off-by: lishunyang <[email protected]> * [feature]: support Flux.1-dev CFG-Parallel (#1269) * [Bugfix] Fix 'NoneType' AttributeError in stable-diffusion model detect (#1254) Signed-off-by: Yan Ma <[email protected]> * [Doc] Update Qwen3-TTS docs for consistency with Omni examples (#1226) Signed-off-by: linyueqian <[email protected]> Signed-off-by: Yueqian Lin <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Fix]Ensure HuggingFace downloads complete before initialization. (#1213) Signed-off-by: zhou zhuoxin <[email protected]> Co-authored-by: Copilot <[email protected]> * [BugFix] Fixed the issue where ignore_eos was not working. (#1286) Signed-off-by: amy-why-3459 <[email protected]> * [Test] Add e2e tests for Qwen3-TTS speech endpoint (#1206) Signed-off-by: linyueqian <[email protected]> Signed-off-by: Yueqian Lin <[email protected]> * [Feat]: support VAE patch parallelism (#756) Signed-off-by: dongbo910220 <[email protected]> Co-authored-by: hsliuustc0106 <[email protected]> * [CI] Disable Qwen3-TTS E2E Test in pipeline.yml (#1306) Signed-off-by: Gao Han <[email protected]> * [Misc] Add per-request generator_device to online image gen and edit (#1183) Signed-off-by: gcanlin <[email protected]> * [Bagel]: Support TP (#1293) Signed-off-by: princepride <[email protected]> * [Bugfix] Fix image edit RoPE crash when explicit height/width are provided (#1265) Signed-off-by: lishunyang <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Doc] Sync (#1216) Signed-off-by: Alicia <[email protected]> * [Bugfix] fix precision issues of qwen3-omni when enable async_chunk without system prompt (#1288) Signed-off-by: Rein Yang <[email protected]> * [Debug] Add trigger to concurrent stage init (#1274) Signed-off-by: tzhouam <[email protected]> * [Bugfix][Qwen3-TTS] Fix task type (#1317) Signed-off-by: Ekagra Ranjan <[email protected]> * Unifying CLI Argument Naming Style (#1309) Signed-off-by: Didan Deng <[email protected]> * [Bugfix][Qwen3-TTS] Preserve original model ID in omni_snapshot_download (#1318) * [CI] Run nightly tests. (#1333) Signed-off-by: Alicia <[email protected]> * [Feature]: FP8 Quantization Support for DiT (#1034) Signed-off-by: lishunyang <[email protected]> Signed-off-by: SYLAR <[email protected]> * Fix yield token metrics and opt metrics record stats (#1292) * [Test] L2 & L3 Test Case Stratification Design for Omni Model (#1272) Signed-off-by: wangyu31577 <[email protected]> Signed-off-by: yenuo26 <[email protected]> Signed-off-by: wangyu <[email protected]> Signed-off-by: Hongsheng Liu <[email protected]> Co-authored-by: wangyu31577 <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Pref] Support Qwen3 Omni code2wav batch infernce with async chunk (#1246) Signed-off-by: ZeldaHuang <[email protected]> Signed-off-by: Hongsheng Liu <[email protected]> Signed-off-by: Ziming Huang <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> Co-authored-by: Copilot <[email protected]> * update qwen3-omni & qwen2.5-onmi openai client (#1304) Signed-off-by: Rein Yang <[email protected]> * [Feature] Support Wan2.2 T2V and I2V Online Serving with OpenAI /v1/videos API (#1073) Signed-off-by: samithuang <[email protected]> Signed-off-by: Samit <[email protected]> Signed-off-by: SamitHuang <[email protected]> Co-authored-by: Flora Feng <[email protected]> * [Feature] add Tensor Parallelism to SD_3.5 (#1336) Signed-off-by: GG-li <[email protected]> * [Feature]async scheduling to overlap chunk IO and compute (#951) Signed-off-by: CHEN <[email protected]> Co-authored-by: Bhanu068 <[email protected]> Co-authored-by: Gao Han <[email protected]> * [Bugfix] reused metrics to modify the API Server token statistics in Stream Response (#1301) Signed-off-by: John Liu BUAA <[email protected]> * Refactor CPU Offloading Backend Pattern (#1223) Signed-off-by: yuanheng <[email protected]> Signed-off-by: Yuanheng Zhao <[email protected]> Signed-off-by: Samit <[email protected]> Co-authored-by: Samit <[email protected]> * [DOC] Doc for CI test - Details about five level stucture and some other files. (#1167) Signed-off-by: Alicia <[email protected]> Co-authored-by: yenuo26 <[email protected]> * [Bugfix] remove Tongyi-MAI/Z-Image-Turbo related test from L2 ci (#1348) Signed-off-by: dengyunyang <[email protected]> * [Misc] wechat image update (#1354) Signed-off-by: David Chen <[email protected]> * [Misc] Support WorkerWrapperBase and CustomPipeline for Diffusion Worker (#764) Signed-off-by: knlnguyen1802 <[email protected]> * [Feature][Bugfix] Add CFG feature to Bagel (#1310) Signed-off-by: Ding Zuhao <[email protected]> Signed-off-by: 汪志鹏 <[email protected]> Co-authored-by: 汪志鹏 <[email protected]> * [Feature]: Diffusion sleep to use process level memory calculation (#1276) Signed-off-by: Divyansh Singhvi <[email protected]> Signed-off-by: 汪志鹏 <[email protected]> Signed-off-by: dsinghvi <[email protected]> Co-authored-by: 汪志鹏 <[email protected]> * change qwen3-omni open cudagraph by default (#1352) Signed-off-by: Rein Yang <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [XPU] Update Bagel's flash_attn_varlen_func to fa utils (#1295) Signed-off-by: zhenwei-intel <[email protected]> * [Test] Add Omni Model Performance Benchmark Test (#1321) Signed-off-by: yenuo26 <[email protected]> Signed-off-by: wangyu <[email protected]> * [BugFix]: Revert utils change (#1369) Signed-off-by: princepride <[email protected]> * [Rebase] Rebase to vllm v0.16.0 (#1357) Signed-off-by: Taichang Zhou <[email protected]> Signed-off-by: tzhouam <[email protected]> Signed-off-by: princepride <[email protected]> Co-authored-by: 汪志鹏 <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: ZJY0516 <[email protected]> Co-authored-by: youkaichao <[email protected]> * [Test] Fix expansion and example test case for qwen3-omni (#1358) Signed-off-by: yenuo26 <[email protected]> * [v0.16.0][BUG FIX]Fix hunyuan MOE after update to 0.16.0 (#1401) Signed-off-by: Chendi Xue <[email protected]> * [0.16.0] remove cuda hard-code for Hunyuan Image3 (#1402) Signed-off-by: Chendi Xue <[email protected]> * [XPU] Add XPU Dockerfile and related docs (#1162) Signed-off-by: Yan Ma <[email protected]> Signed-off-by: Daniel Huang <[email protected]> Co-authored-by: Daniel Huang <[email protected]> * [Bugfix] Fix Hardcoded Datatypes in Z-image (#1393) Signed-off-by: Alex Brooks <[email protected]> * [Feature] : Support disaggregated inference pipeline for Qwen3_TTS (#1161) Signed-off-by: Sy03 <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Feature] Add automated PR reviewer bot with GLM integration (#1424) Signed-off-by: hsliu <[email protected]> Signed-off-by: Claude Opus 4.6 <[email protected]> Co-authored-by: Claude Opus 4.6 <[email protected]> * [Misc] Add Qwen2.5-Omni-3B model support to Gradio demo (#1382) Signed-off-by: UsamaKenway <[email protected]> * [misc] Feature/pr reviewer auto trigger&update model (#1431) Signed-off-by: hsliu <[email protected]> Signed-off-by: Claude Opus 4.6 <[email protected]> Signed-off-by: Hunter Liu <[email protected]> Co-authored-by: Claude Opus 4.6 <[email protected]> * Revert "[misc] Feature/pr reviewer auto trigger&update model" (#1432) * [Doc] Update GPU installation commands (#1434) * [ROCM] [CI] fix dockerfile.rocm to support nightly build and also fix amd ci v0.16.0rc1 (#1380) Signed-off-by: tjtanaa <[email protected]> * [Feature][BAGEL] Combine multi-branch cfg into a single batch to accelerate inference. (#1429) Signed-off-by: Ding Zuhao <[email protected]> Co-authored-by: 汪志鹏 <[email protected]> * [Feat]: add ASCII art logo for vLLM-Omni (#1430) * [Bug] [Bagel] Fix kv transfer bug (#1437) Signed-off-by: Ding Zuhao <[email protected]> Co-authored-by: Wang Zhipeng: princepride <[email protected]> * [CI] Set L2 & L3 tests running conditions. (#1344) Signed-off-by: Alicia <[email protected]> * [Feature] vLLM-Omni RDMA connector (#1019) Signed-off-by: natureofnature <[email protected]> * [Minor][Refactor] Pass seq_token_counts explicitly (#1425) Signed-off-by: gcanlin <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Misc] Extend Diffusion Benchmark script to other backends (#875) Signed-off-by: NickLucche <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Feature] Support Stage Based Deployment CLI (#939) Signed-off-by: wuhang <[email protected]> Signed-off-by: princepride <[email protected]> Signed-off-by: wuhang <[email protected]> Co-authored-by: 汪志鹏 <[email protected]> Co-authored-by: Copilot <[email protected]> * [Doc] Optimize vLLM-Omni metrics documentation (#1311) Signed-off-by: Junhong Liu <[email protected]> Signed-off-by: Junhong Liu <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Bugfix] Forward all vllm-omni serve command parameters to model (#985) Signed-off-by: Junhong Liu <[email protected]> Signed-off-by: Junhong Liu <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Doc]: Add bagel single/multi node usage with mooncake document (#1450) * [Qwen3TTS][Feat] Code2Wav batched decoding (#1426) Signed-off-by: pablo <[email protected]> Co-authored-by: pablo <[email protected]> * [CI] Remove overwhelming debug log (#1463) Signed-off-by: tzhouam <[email protected]> * [Misc] update wechat image (#1464) Signed-off-by: David Chen <[email protected]> * [Doc] Refine Diffusion Tutorial Documents (#1305) Signed-off-by: Didan Deng <[email protected]> * [Bugfix] Robust Audio Data Handling in _create_audio_choice (#1222) Signed-off-by: Junhong Liu <[email protected]> * [Bugfix]: Fix merging updated additional information to ensure dict type (#1296) Signed-off-by: Shijin Zhang <[email protected]> * [Model]Add new nextstep_1(Diffusion) model(only T2I) (#612) Signed-off-by: Dong Wang <[email protected]> Signed-off-by: sniper35 <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Bugfix] Add TTS configuration options (#1177) Signed-off-by: Yanick Schraner <[email protected]> * [Debug] Multi-Request for Qwen 3 Omni use_audio_in_video (#1433) Signed-off-by: tzhouam <[email protected]> * [Bugfix] Fix case-sensitive task_type matching in Qwen3TTSModelForGeneration (#1455) Signed-off-by: Sangchun Ha <[email protected]> * [BugFix] process request.num_cached_tokens if it equals to the initial value (#1468) Signed-off-by: Junhong Liu <[email protected]> Co-authored-by: Gao Han <[email protected]> * [Bugfix] Fix SDPA attention mask dtype and shape (Fix #857) (#1349) Signed-off-by: jader <[email protected]> * [Test] Reduce Perf test case and fix modify stage config (#1449) Signed-off-by: yenuo26 <[email protected]> * [NPU] Upgrade to v0.16.0 (#1375) Signed-off-by: gcanlin <[email protected]> * [CI] Update Dockerfile for vllm-omni CI image and remove obsolete dep… (#1491) Signed-off-by: tzhouam <[email protected]> * [Fix][Chore] Qwen3-TTS Modeling Minor Code Sanity Improvements (#1482) Signed-off-by: yuanheng <[email protected]> * [Bugfix] Fix tuple/list KV cache extraction crash (#1405) Signed-off-by: junuxyz <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Doc] format lora related docs for the user's end (#1009) Signed-off-by: AndyZhou952 <[email protected]> Signed-off-by: Andy Zhou <[email protected]> * [Feature] Support Wan2.2 output with irregular shapes (#1279) Signed-off-by: gcanlin <[email protected]> * [Misc] Migrate L1 tests to use pytest-mock (#1315) Signed-off-by: Yuanheng Zhao <[email protected]> Signed-off-by: yuanheng <[email protected]> * [Bugfix] Fix LoRA Scaling on Active Adapters (#1421) Signed-off-by: Alex Brooks <[email protected]> * [Bugfix] fix record audio generated frame in offline infer (#1312) Signed-off-by: Junhong Liu <[email protected]> Signed-off-by: Junhong Liu <[email protected]> * [Model] Support OmniGen2 (#513) Signed-off-by: Yupu <[email protected]> * [Bugfix][Qwen3TTS] (#1289) Signed-off-by: pablo <[email protected]> Co-authored-by: Gao Han <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * Use pull through cache image for H100 pool (#1518) Signed-off-by: Kevin H. Luu <[email protected]> * [ROCm] [CI] [Docker] Point to use the latest vLLM v0.16.0 stable version (#1500) Signed-off-by: tjtanaa <[email protected]> * [Bugfix] fix offline text_to_image error from #1009 (#1515) Signed-off-by: David Chen <[email protected]> * [XPU] Enable FLASH_ATTN on XPU (#1332) Signed-off-by: Yan Ma <[email protected]> * Revert gpu_1 job to use regular image (#1521) Signed-off-by: Kevin H. Luu <[email protected]> * [Chore] remove unused logger in omni_diffusion (#531) (#1509) Signed-off-by: Huang, Zeyu <[email protected]> Co-authored-by: Gao Han <[email protected]> * [Qwen3TTS][Feat] Streaming output (#1438) Signed-off-by: pablo <[email protected]> Signed-off-by: Hongsheng Liu <[email protected]> Co-authored-by: pablo <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Bugfix] Race condition in MultiprocExecutor when concurent access to Scheduler (#1448) Signed-off-by: knlnguyen1802 <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Doc][Test][Misc] ComfyUI test, more screenshot, and code cleaning (#1435) Signed-off-by: Huang, Zeyu <[email protected]> Signed-off-by: Samit <[email protected]> Co-authored-by: Samit <[email protected]> * [Performance]Qwen3-Omni performance optimization (#1378) Signed-off-by: amy-why-3459 <[email protected]> * [Feature] Support HSDP for diffusion models (#1339) Signed-off-by: gcanlin <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [CI] fixed CI timeout (#1460) Signed-off-by: zhumingjue <[email protected]> Signed-off-by: zhumingjue138 <[email protected]> * [Bugfix] Use uds for zmq address if not set --stage-id (#1522) Signed-off-by: wuhang <[email protected]> * [BugFix] Restore talker's config (#1524) Signed-off-by: amy-why-3459 <[email protected]> Signed-off-by: Hongsheng Liu <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> Co-authored-by: Canlin Guo <[email protected]> * [XPU] fix qwen_omni after rebase to v0.16.0 (#1416) Signed-off-by: Chendi Xue <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Platform] Enable layerwise offload on all hardware (#1492) Signed-off-by: gcanlin <[email protected]> * diffusion: enable VAE patch parallel for SD3.5 (#1428) Signed-off-by: dongbo910220 <[email protected]> * [Perf] GLM Image (#920) Signed-off-by: JaredforReal <[email protected]> Signed-off-by: Jared Wen <[email protected]> Signed-off-by: Hongsheng Liu <[email protected]> Co-authored-by: 汪志鹏 <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [skip ci][Doc] add design docs for async chunk in qwen3-omni (#962) Signed-off-by: Rein Yang <[email protected]> * feat(qwen3-tts): Add CUDA Graph support for speech tokenizer decoder (#1205) Signed-off-by: xulusjb <[email protected]> Co-authored-by: xulusjb <[email protected]> * [New Model]: XiaomiMiMo/MiMo-Audio-7B-Instruct support (#750) Signed-off-by: wangyu31577 <[email protected]> Signed-off-by: 齐保元 <[email protected]> Signed-off-by: hsliu <[email protected]> Signed-off-by: Hongsheng Liu <[email protected]> Signed-off-by: GG-li <[email protected]> Signed-off-by: Sihao Li <[email protected]> Signed-off-by: XU Mingshi <[email protected]> Signed-off-by: mxuax <[email protected]> Signed-off-by: Baoyuan Qi <[email protected]> Signed-off-by: gcanlin <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: wuzhongjian <[email protected]> Signed-off-by: dongbo910220 <[email protected]> Signed-off-by: dongbo910220 <[email protected]> Signed-off-by: Jiangyun Zhu <[email protected]> Signed-off-by: Junhong Liu <[email protected]> Signed-off-by: tjtanaa <[email protected]> Signed-off-by: baoyuan qi <[email protected]> Signed-off-by: tzhouam <[email protected]> Signed-off-by: zjy0516 <[email protected]> Signed-off-by: Prajwal A <[email protected]> Signed-off-by: Shijin Zhang <[email protected]> Signed-off-by: 丁宁 <[email protected]> Signed-off-by: SHIJIN ZHANG <[email protected]> Signed-off-by: dingning<[email protected]> Signed-off-by: dingning <[email protected]> Signed-off-by: dingning <[email protected]> Co-authored-by: wangyu <[email protected]> Co-authored-by: wangyu31577 <[email protected]> Co-authored-by: Zhang Shijin <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> Co-authored-by: Sihao Li <[email protected]> Co-authored-by: XU Mingshi <[email protected]> Co-authored-by: Canlin Guo <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: JohnJan <[email protected]> Co-authored-by: WeiQing Chen <[email protected]> Co-authored-by: dongbo910220 <[email protected]> Co-authored-by: Jiangyun Zhu <[email protected]> Co-authored-by: Junhong Liu <[email protected]> Co-authored-by: TJian <[email protected]> Co-authored-by: shijin zhang <[email protected]> Co-authored-by: Zhou Taichang <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: Prajwal A <[email protected]> Co-authored-by: Shijin Zhang <[email protected]> Co-authored-by: dingning <[email protected]> Co-authored-by: ning ding <[email protected]> Co-authored-by: Copilot <[email protected]> * [Feature]: Native GGUF Quantization Support for DiT (#1285) Signed-off-by: David Chen <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: WeiQing Chen <[email protected]> Signed-off-by: Hongsheng Liu <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * Add benchmark for `v1/audio/speech` non-streaming (#1408) Signed-off-by: Ekagra Ranjan <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> * [Version] Auto generate version using `setuptool_scm` (#1224) Signed-off-by: tjtanaa <[email protected]> * [Feat] : Support Async chunk cleanup (#1087) Signed-off-by: Sy03 <[email protected]> * [Profiler] Support online profiling (#1136) Signed-off-by: gcanlin <[email protected]> Signed-off-by: Canlin Guo <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> Co-authored-by: Nicolò Lucchesi <[email protected]> * [Bugfix] Fix redundant finished req status updating on OmniGenerationScheduler (#1510) Signed-off-by: shijin zhang <[email protected]> Co-authored-by: 齐保元 <[email protected]> * [XPU][NPU][ROCM] enable cpu_offloading flag for non_cuda (#1488) Signed-off-by: gcanlin <[email protected]> Signed-off-by: Chendi Xue <[email protected]> Co-authored-by: gcanlin <[email protected]> * [Chore] Cleanup dead code in GGUF DiT code path (#1533) Signed-off-by: Isotr0py <[email protected]> * [Doc] Update installation instructions for vllm 0.16.0 (#1505) Signed-off-by: tzhouam <[email protected]> * [Doc] [skip ci]Sync. (#1363) Signed-off-by: Alicia <[email protected]> Co-authored-by: Yueqian Lin <[email protected]> * [CI][skip ci]Update H100 image link based on #1518 (#1538) Signed-off-by: Alicia <[email protected]> * Fix no embed text spk tokens (#1540) Signed-off-by: Junhong Liu <[email protected]> * [Debug] Merge vllm pull 35368 (#1534) Signed-off-by: tzhouam <[email protected]> * [Docs] update async chunk docs diagram [skip ci] (#1530) Signed-off-by: Rein Yang <[email protected]> * fix(qwen3-tts): fix Base ICL voice clone producing corrupted audio (#1554) Signed-off-by: linyueqian <[email protected]> * [NPU][Bugfix] Align GPU side and recover qwen3-tts (#1564) Signed-off-by: gcanlin <[email protected]> * [BugFix] Fix unexpected crash when init OmniDiffusion (#1562) Signed-off-by: Semmer2 <[email protected]> * [CI] Modify some CI test cases to run on L4 environment to reduce H100 resource usage. (#1543) Signed-off-by: yenuo26 <[email protected]> Signed-off-by: wangyu <[email protected]> * [BugFix]: fix a lot of bug (#1565) Signed-off-by: princepride <[email protected]> * feat: add HyperCLOVAX-SEED-Omni-8B support Model files: - vllm_omni/diffusion/models/hyperclovax_vision/: vision decoder pipeline (HyperCLOVAXVisionPipeline) using flow matching diffusion + VisionTransformer - vllm_omni/diffusion/models/hyperclovax_audio/: audio decoder pipeline (HyperCLOVAXAudioPipeline) using Unit-BigVGAN codec - vllm_omni/model_executor/stage_input_processors/hyperclovax_seed_omni.py: thinker2vision_decoder and thinker2audio_decoder — extract discrete tokens from LLM output; truncate/pad vision codes to 729 (27x27) for decoder Registry: - vllm_omni/diffusion/registry.py: register HyperCLOVAXVisionPipeline and HyperCLOVAXAudioPipeline with post-process functions Stage config: - vllm_omni/model_executor/stage_configs/hcx_omni.yaml: 3-stage config Stage 0: LLM thinker (TP=4, GPUs 0-3), Stage 1: vision decoder (GPU 4), Stage 2: audio decoder (GPU 5) Bug fixes for HyperCLOVAX compatibility: - diffusion/request.py: add extra dict field to OmniDiffusionRequest so vision_tokens/audio_tokens from stage input processors reach the pipeline - entrypoints/async_omni_diffusion.py: extract OmniTokensPrompt.additional_information into OmniDiffusionRequest.extra before creating request - entrypoints/omni_stage.py: skip empty engine inputs (text-only requests where thinker2vision_decoder/thinker2audio_decoder return []) - entrypoints/async_omni.py: handle skipped sentinel in _process_single_result so text-only requests complete without crashing on Stage 1/2 * fix: correct decoder params and HCX porting fixes - hcx_omni.yaml: guidance_scale 3.5→0.75, num_inference_steps 30→50 (matches OmniServe production defaults; 3.5 caused over-amplified autoguidance → shrunken/degraded output images) - omni_stage.py: skip empty engine inputs for text-only requests - async_omni_diffusion.py: extract OmniTokensPrompt.additional_information into OmniDiffusionRequest.extra (audio_tokens/vision_tokens) - registry.py: HCX Omni diffusion model registration fix Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * feat: HyperCLOVAX-SEED-Omni-8B stage pipeline and entrypoint fixes * fix: change guidance_scale from 9.0 to 0.75 (autoguidance scale, OmniServe default) * feat: add audio decoder Stage 2 to hcx_omni pipeline - Wire HyperCLOVAXAudioPipeline as Stage 2 in hcx_omni.yaml - GPU 5 assigned for audio decoder (Unit-BigVGAN / NCCosybigvganDecoder) - Add runtime edge 0->2 (thinker -> audio decoder) - Implement post-generation PCM chunk streaming for audio output (4800 samples / 200ms per SSE event @ 24kHz, int16 base64-encoded) Refs: github.com/vllm-project/vllm-omni/pull/869 (already incorporated) Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * fix: vllm version compatibility for HyperCLOVAX audio decoder startup - config/model.py: try/except fallback for AttentionBackendEnum import (vllm.v1.attention.backends.registry absent in older vllm builds) - pipeline_hyperclovax_audio.py: return actual named_parameters() from load_weights() when using MAR checkpoint so diffusers_loader strict check passes (weights loaded eagerly in __init__ via MAR extraction) - qwen3_omni_moe_thinker.py, qwen2_5_omni_thinker.py: try/except stubs for check_interleaved_audio_video and merge_interleaved_embeddings which are absent in older vllm qwen2_5_omni_thinker; these symbols are only exercised by Qwen models, not HyperCLOVAX Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * fix: add edge 1→2 and correct model key in hcx_omni.yaml Stage 2 - Add runtime edge from:1 to:2 (required for Stage-2 connector init; without it AsyncOrchestrator cannot route to audio decoder at runtime) - Change model_subdir to model for Stage-2 engine_args to match total-poc working reference config Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * fix: audio S2S output - handle diffusion outputs in _create_audio_choice HyperCLOVAXAudioPipeline (diffusion) stores audio in multimodal_output directly (OmniRequestOutput.from_diffusion), not in outputs[0].multimodal_output like LLM pipelines. Fix three locations: 1. _create_audio_choice (non-streaming): use omni_outputs.multimodal_output when final_res.outputs is empty (diffusion path). 2. Streaming audio path: same fix for _final_res.outputs[0]. 3. Both loops (for output in final_res.outputs): fall back to single synthetic choice at index 0 when outputs list is empty. 4. Handle bytes audio output from HyperCLOVAXAudioPipeline post-process (returns WAV bytes, not tensors like Qwen3-Omni). Also fixes audio input (A2T) regression: skip diffusion prompt extraction when mm_data has audio content (added in previous session). Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * fix: parse WAV bytes with soundfile for uniform PCM chunk streaming HyperCLOVAXAudioPipeline returns WAV bytes including 44-byte header. The previous byte-offset splitting included the header in the first chunk, corrupting it. Fix: parse with soundfile to get float32 PCM, then convert to int16 chunks uniformly regardless of source type (bytes or tensor). Verified: 136 audio chunks x 200ms = 27.04s audio streamed correctly. Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * feat: zero-shot TTS with speaker embedding from input audio - serving_chat.py: extract last input_audio base64 from request messages and inject as ref_audio_b64 into engine_prompt dict - thinker2audio_decoder: read ref_audio_b64 from prompt and pass as ref_audio_tokens to Stage 2 (HyperCLOVAXAudioPipeline) - hcx_omni.yaml: switch Stage 2 to NCZSCosybigvganDecoder.mar (zero-shot) which uses ECAPA-TDNN speaker encoder instead of finetuned ID lookup Pipeline: input audio -> ECAPA-TDNN -> speaker embedding -> BigVGAN synthesis matching the voice characteristics of the original speaker. Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * feat: wire audio decoder Stage 2 to hcx_omni pipeline and fix S2S flow - Add Stage 2 (HyperCLOVAXAudioPipeline / NCZSCosybigvganDecoder) to hcx_omni.yaml with GPU 5, gpu_memory_utilization 0.4, edge 0->2 from thinker - Fix thinker2audio_decoder: correct audio token range (128606-135167), remap to [0, 6561) for BigVGAN input, handle empty token case gracefully - Fix pipeline_hyperclovax_audio.py post_process_func signature and incorporate PR#869 BUG FIX patches for stable audio generation * fix: use finetuned audio decoder and fix transformers_modules deserialization - hcx_omni.yaml: switch Stage 2 from NCZSCosybigvganDecoder (zero-shot, ECAPA-TDNN) to NCCosybigvganDecoder (finetuned, nn.Embedding speaker id). Zero-shot decoder required ref_audio (mel spectrogram) which is unavailable for text-only requests and incompatible with finetuned decoder path. - pipeline_hyperclovax_audio.py: guard ref_audio processing with 'not self.bigvgan.finetune' — finetuned decoder has no ECAPA-TDNN encoder, so passing ref_audio bytes would crash with 'expected 100 channels'. - omni_stage.py: add HuggingFace modules cache (~/.cache/huggingface/modules) to sys.path before queue.get_nowait() in try_collect(). Stage-0 pickles outputs containing custom classes from transformers_modules (trust_remote_code), but the API server process doesn't have this path, causing deserialization failures that silently drop Stage-0 outputs. Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * fix: restore zero-shot speaker cloning with fallback for text-only requests - hcx_omni.yaml: revert to NCZSCosybigvganDecoder.mar (zero-shot ECAPA-TDNN) for voice-preserving S2S synthesis. NCCosybigvganDecoder used a fixed integer speaker_id and lost the input speaker's voice. - pipeline_hyperclovax_audio.py: add zero-mel fallback branch for finetune=False + ref_audio=None case. When a text-only request arrives (no input audio → no ref_audio), ECAPA-TDNN receives a zero mel tensor [1, num_mels, 64] instead of crashing with 'expected 100 channels'. S2S requests always have ref_audio so the zero-shot cloning path is unchanged. Co-Authored-By: Claude Sonnet 4.6 <[email protected]> * feat: add stage config yaml for HCX audio decoder Signed-off-by: Hyunjoon Jeong <[email protected]> * feat: add HyperCLOVAX-SEED-Omni 8B model as vllm-omni executor Signed-off-by: Hyunjoon Jeong <[email protected]> * feat: add HCX audio decoder pipeline Signed-off-by: Hyunjoon Jeong <[email protected]> * fix: modify exception for HCX audio decoder (GAN) Signed-off-by: Hyunjoon Jeong <[email protected]> * fix: default temperature set to 0, and pipeline model evaluation mode Signed-off-by: Hyunjoon Jeong <[email protected]> --------- Signed-off-by: Huang, Zeyu <[email protected]> Signed-off-by: dengyunyang <[email protected]> Signed-off-by: gcanlin <[email protected]> Signed-off-by: samithuang <[email protected]> Signed-off-by: Samit <[email protected]> Signed-off-by: lishunyang <[email protected]> Signed-off-by: Hongsheng Liu <[email protected]> Signed-off-by: princepride <[email protected]> Signed-off-by: 汪志鹏 <[email protected]> Signed-off-by: wangyu31577 <[email protected]> Signed-off-by: Kyle Huang <[email protected]> Signed-off-by: zjy0516 <[email protected]> Signed-off-by: tjtanaa <[email protected]> Signed-off-by: natureofnature <[email protected]> Signed-off-by: linyueqian <[email protected]> Signed-off-by: mxuax <[email protected]> Signed-off-by: XU Mingshi <[email protected]> Signed-off-by: amy-why-3459 <[email protected]> Signed-off-by: wuzhongjian [email protected] Signed-off-by: ZeldaHuang <[email protected]> Signed-off-by: Divyansh Singhvi <[email protected]> Signed-off-by: Lin, Fanli <[email protected]> Signed-off-by: Fanli Lin <[email protected]> Signed-off-by: Fanli Lin <[email protected]> Signed-off-by: dongbo910220 <[email protected]> Signed-off-by: Ding Zuhao <[email protected]> Signed-off-by: jzz <[email protected]> Signed-off-by: Andy Zhou <[email protected]> Signed-off-by: wangyu <[email protected]> Signed-off-by: Pierre Le Guen <[email protected]> Signed-off-by: yuanheng <[email protected]> Signed-off-by: ram16g <[email protected]> Signed-off-by: Didan Deng <[email protected]> Signed-off-by: pablo <[email protected]> Signed-off-by: Roger Wang <[email protected]> Signed-off-by: anna <[email protected]> Signed-off-by: Rustam Khadipash <[email protected]> Signed-off-by: Alicia <[email protected]> Signed-off-by: Taichang Zhou <[email protected]> Signed-off-by: tzhouam <[email protected]> Signed-off-by: hsliu <[email protected]> Signed-off-by: hsliu_ustc <[email protected]> Signed-off-by: zhenwei-intel <[email protected]> Signed-off-by: Yuanheng Zhao <[email protected]> Signed-off-by: erfgss <[email protected]> Signed-off-by: xiedeyantu <[email protected]> Signed-off-by: Junhong Liu <[email protected]> Signed-off-by: Junhong Liu <[email protected]> Signed-off-by: David Chen <[email protected]> Signed-off-by: weichen <[email protected]> Signed-off-by: Yan Ma <[email protected]> Signed-off-by: ApsarasX <[email protected]> Signed-off-by: Chenguang ZHENG <[email protected]> Signed-off-by: yenuo26 <[email protected]> Signed-off-by: Semmer2 <[email protected]> Signed-off-by: Yueqian Lin <[email protected]> Signed-off-by: zhou zhuoxin <[email protected]> Signed-off-by: Gao Han <[email protected]> Signed-off-by: Rein Yang <[email protected]> Signed-off-by: Ekagra Ranjan <[email protected]> Signed-off-by: SYLAR <[email protected]> Signed-off-by: Ziming Huang <[email protected]> Signed-off-by: SamitHuang <[email protected]> Signed-off-by: GG-li <[email protected]> Signed-off-by: CHEN <[email protected]> Signed-off-by: John Liu BUAA <[email protected]> Signed-off-by: knlnguyen1802 <[email protected]> Signed-off-by: dsinghvi <[email protected]> Signed-off-by: Chendi Xue <[email protected]> Signed-off-by: Daniel Huang <[email protected]> Signed-off-by: Alex Brooks <[email protected]> Signed-off-by: Sy03 <[email protected]> Signed-off-by: Claude Opus 4.6 <[email protected]> Signed-off-by: UsamaKenway <[email protected]> Signed-off-by: Hunter Liu <[email protected]> Signed-off-by: NickLucche <[email protected]> Signed-off-by: wuhang <[email protected]> Signed-off-by: wuhang <[email protected]> Signed-off-by: pablo <[email protected]> Signed-off-by: Shijin Zhang <[email protected]> Signed-off-by: Dong Wang <[email protected]> Signed-off-by: sniper35 <[email protected]> Signed-off-by: Yanick Schraner <[email protected]> Signed-off-by: Sangchun Ha <[email protected]> Signed-off-by: jader <[email protected]> Signed-off-by: junuxyz <[email protected]> Signed-off-by: AndyZhou952 <[email protected]> Signed-off-by: Yupu <[email protected]> Signed-off-by: Kevin H. Luu <[email protected]> Signed-off-by: zhumingjue <[email protected]> Signed-off-by: zhumingjue138 <[email protected]> Signed-off-by: JaredforReal <[email protected]> Signed-off-by: Jared Wen <[email protected]> Signed-off-by: xulusjb <[email protected]> Signed-off-by: 齐保元 <[email protected]> Signed-off-by: Sihao Li <[email protected]> Signed-off-by: Baoyuan Qi <[email protected]> Signed-off-by: Isotr0py <[email protected]> Signed-off-by: wuzhongjian <[email protected]> Signed-off-by: dongbo910220 <[email protected]> Signed-off-by: Jiangyun Zhu <[email protected]> Signed-off-by: baoyuan qi <[email protected]> Signed-off-by: Prajwal A <[email protected]> Signed-off-by: 丁宁 <[email protected]> Signed-off-by: SHIJIN ZHANG <[email protected]> Signed-off-by: dingning<[email protected]> Signed-off-by: dingning <[email protected]> Signed-off-by: dingning <[email protected]> Signed-off-by: WeiQing Chen <[email protected]> Signed-off-by: Canlin Guo <[email protected]> Signed-off-by: shijin zhang <[email protected]> Signed-off-by: Hyunjoon Jeong <[email protected]> Signed-off-by: Hyunjoon Jeong <[email protected]> Co-authored-by: Zeyu Huang | 黃澤宇 <[email protected]> Co-authored-by: JohnJan <[email protected]> Co-authored-by: dengyunyang <[email protected]> Co-authored-by: Hongsheng Liu <[email protected]> Co-authored-by: Canlin Guo <[email protected]> Co-authored-by: Samit <[email protected]> Co-authored-by: SYLAR <[email protected]> Co-authored-by: Copilot <[email protected]> Co-authored-by: 汪志鹏 <[email protected]> Co-authored-by: wangyu <[email protected]> Co-authored-by: wangyu31577 <[email protected]> Co-authored-by: kYLe <[email protected]> Co-authored-by: Jiangyun Zhu <[email protected]> Co-authored-by: TJian <[email protected]> Co-authored-by: NATURE <[email protected]> Co-authored-by: Yueqian Lin <[email protected]> Co-authored-by: Zhou Taichang <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: XU Mingshi <[email protected]> Co-authored-by: amy-why-3459 <[email protected]> Co-authored-by: Rein Yang <[email protected]> Co-authored-by: Ziming Huang <[email protected]> Co-authored-by: dsinghvi <[email protected]> Co-authored-by: Fanli Lin <[email protected]> Co-authored-by: dongbo910220 <[email protected]> Co-authored-by: Ding Zuhao <[email protected]> Co-authored-by: Andy Zhou <[email protected]> Co-authored-by: Pierre LE GUEN <[email protected]> Co-authored-by: WeiQing Chen <[email protected]> Co-authored-by: Yuanheng Zhao <[email protected]> Co-authored-by: ram16g <[email protected]> Co-authored-by: Didan Deng <[email protected]> Co-authored-by: Markus / Mark <[email protected]> Co-authored-by: Juan Pablo Zuluaga <[email protected]> Co-authored-by: muziyuhui666 <[email protected]> Co-authored-by: Roger Wang <[email protected]> Co-authored-by: ceanna93 <[email protected]> Co-authored-by: anna <[email protected]> Co-authored-by: Rustam Khadipash <[email protected]> Co-authored-by: Alicia <[email protected]> Co-authored-by: hsliu_ustc <[email protected]> Co-authored-by: liuzhenwei <[email protected]> Co-authored-by: erfgss <[email protected]> Co-authored-by: Jensen <[email protected]> Co-authored-by: Junhong Liu <[email protected]> Co-authored-by: weichen <[email protected]> Co-authored-by: PopSoda2002 <[email protected]> Co-authored-by: Yan Ma <[email protected]> Co-authored-by: ApsarasX <[email protected]> Co-authored-by: Chenguang Zheng <[email protected]> Co-authored-by: Jiaping Wu <[email protected]> Co-authored-by: zhou zhuoxin <[email protected]> Co-authored-by: Gao Han <[email protected]> Co-authored-by: rein yang <[email protected]> Co-authored-by: Ekagra Ranjan <[email protected]> Co-authored-by: Flora Feng <[email protected]> Co-authored-by: Sihao Li <[email protected]> Co-authored-by: ChenWenjing <[email protected]> Co-authored-by: Bhanu068 <[email protected]> Co-authored-by: John Liu BUAA <[email protected]> Co-authored-by: yenuo26 <[email protected]> Co-authored-by: knlnguyen1802 <[email protected]> Co-authored-by: liuzhenwei <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: ZJY0516 <[email protected]> Co-authored-by: youkaichao <[email protected]> Co-authored-by: Chendi.Xue <[email protected]> Co-authored-by: Daniel Huang <[email protected]> Co-authored-by: Alex Brooks <[email protected]> Co-authored-by: Sy03 <[email protected]> Co-authored-by: Claude Opus 4.6 <[email protected]> Co-authored-by: UsamaKenway <[email protected]> Co-authored-by: Nicolò Lucchesi <[email protected]> Co-authored-by: wuhang <[email protected]> Co-authored-by: pablo <[email protected]> Co-authored-by: SHIJIN ZHANG <[email protected]> Co-authored-by: Dong W <[email protected]> Co-authored-by: Yanick Schraner <[email protected]> Co-authored-by: Sangchun Ha <[email protected]> Co-authored-by: 亦瑾 <[email protected]> Co-authored-by: junuxyz <[email protected]> Co-authored-by: Yupu <[email protected]> Co-authored-by: Kevin H. Luu <[email protected]> Co-authored-by: zhumingjue138 <[email protected]> Co-authored-by: Canlin Guo <[email protected]> Co-authored-by: Jared Wen <[email protected]> Co-authored-by: Xu Lu <[email protected]> Co-authored-by: xulusjb <[email protected]> Co-authored-by: Baoyuan Qi <[email protected]> Co-authored-by: Zhang Shijin <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: shijin zhang <[email protected]> Co-authored-by: Prajwal A <[email protected]> Co-authored-by: dingning <[email protected]> Co-authored-by: ning ding <[email protected]> Co-authored-by: Isotr0py <[email protected]> Co-authored-by: Nicolò Lucchesi <[email protected]> Co-authored-by: Ting FU <[email protected]> Co-authored-by: developer-account <[email protected]> Co-authored-by: Hyunjoon Jeong <[email protected]>

marksverdhei requested a review from hsliuustc0106 as a code owner January 29, 2026 10:53

marksverdhei marked this pull request as draft January 29, 2026 10:54

marksverdhei marked this pull request as ready for review January 29, 2026 11:08

marksverdhei force-pushed the q3-tts branch from 6eba044 to 71ca2f0 Compare January 29, 2026 11:12

david6666666 requested a review from Gaohan123 January 29, 2026 11:18

chatgpt-codex-connector Bot reviewed Jan 29, 2026

View reviewed changes

Comment thread vllm_omni/model_executor/models/qwen3_tts/qwen3_tts.py Outdated

marksverdhei mentioned this pull request Jan 29, 2026

Fall back to SDPA attention when flash-attn unavailable heiervang-technologies/ht-vllm-omni#3

Closed

3 tasks

hsliuustc0106 requested a review from Copilot January 29, 2026 12:05

Copilot started reviewing on behalf of hsliuustc0106 January 29, 2026 12:13 View session

Copilot AI reviewed Jan 29, 2026

View reviewed changes

Comment thread vllm_omni/model_executor/models/qwen3_tts/qwen3_tts.py Outdated

Comment thread vllm_omni/model_executor/models/qwen3_tts/qwen3_tts.py Outdated

marksverdhei changed the title ~~[BugFix] Fix Qwen3 TTS 0.6B profile run hang (#995) (#2)~~ [BugFix] Fix Qwen3 TTS 0.6B profile run hang (#995) Jan 29, 2026

Gaohan123 added the ready label to trigger buildkite CI label Jan 29, 2026

marksverdhei force-pushed the q3-tts branch from 347d6f4 to 94864c3 Compare January 30, 2026 16:23

marksverdhei and others added 2 commits January 30, 2026 18:14

marksverdhei force-pushed the q3-tts branch from 94864c3 to 83bb1e1 Compare January 30, 2026 17:14

hsliuustc0106 approved these changes Jan 30, 2026

View reviewed changes

Gaohan123 approved these changes Jan 30, 2026

View reviewed changes

Gaohan123 merged commit 425952e into vllm-project:main Jan 31, 2026
6 of 7 checks passed

dongbo910220 pushed a commit to dongbo910220/vllm-omni that referenced this pull request Feb 1, 2026

[BugFix] Fix Qwen3 TTS 0.6B profile run hang (vllm-project#995) (vllm…

831f043

…-project#1082)

This was referenced Feb 7, 2026

[RFC]: Qwen3-TTS Production Ready - February Milestone #938

Open

[Doc]: Contributing guidelines for agentic contribution #1260

Open

marksverdhei deleted the q3-tts branch February 12, 2026 21:25

Conversation

marksverdhei commented Jan 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Root Cause

Fix

Test Plan

Test Result

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

hsliuustc0106 commented Jan 29, 2026

Uh oh!

tzhouam commented Jan 29, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

marksverdhei commented Jan 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Gaohan123 commented Jan 29, 2026

Uh oh!

marksverdhei commented Jan 29, 2026

Uh oh!

Gaohan123 commented Jan 30, 2026

Uh oh!

marksverdhei commented Jan 30, 2026

Uh oh!

marksverdhei commented Jan 30, 2026

Uh oh!

hsliuustc0106 left a comment

Choose a reason for hiding this comment

Uh oh!

hsliuustc0106 commented Jan 30, 2026

Uh oh!

Gaohan123 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

marksverdhei commented Jan 29, 2026 •

edited

Loading

marksverdhei commented Jan 29, 2026 •

edited

Loading