
Feat/hyperclovax omni AD #5

Draft
with1015 wants to merge 24 commits into main from feat/hyperclovax-omni-AD

Conversation


@with1015 with1015 commented Apr 6, 2026

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link the existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts and test commands. If your code doesn't require additional test scripts, please state why. For test file guidelines, please check the test style doc.
  • The test results. Please paste a before/after comparison, or the e2e results.
  • (Optional) Any necessary documentation updates, such as updating supported_models.md and adding examples for a new model. Please run mkdocs serve to sync the documentation edits to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

with1015 and others added 24 commits January 20, 2026 07:38
Signed-off-by: Hyunjoon Jeong <with1015@unist.ac.kr>
Model files:
- vllm_omni/diffusion/models/hyperclovax_vision/: vision decoder pipeline
  (HyperCLOVAXVisionPipeline) using flow matching diffusion + VisionTransformer
- vllm_omni/diffusion/models/hyperclovax_audio/: audio decoder pipeline
  (HyperCLOVAXAudioPipeline) using Unit-BigVGAN codec
- vllm_omni/model_executor/stage_input_processors/hyperclovax_seed_omni.py:
  thinker2vision_decoder and thinker2audio_decoder — extract discrete tokens from
  LLM output; truncate/pad vision codes to 729 (27x27) for decoder
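The truncate/pad behavior described above can be sketched as follows (a hypothetical helper; padding with 0 is an assumed placeholder id, not taken from the actual processor code):

```python
def fit_vision_codes(codes: list[int], target: int = 729) -> list[int]:
    """Truncate or pad a sequence of discrete vision codes to exactly
    729 entries (a 27x27 latent grid) before handing them to the vision
    decoder. Pad id 0 is an assumption for illustration."""
    if len(codes) >= target:
        return codes[:target]
    return codes + [0] * (target - len(codes))
```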

Registry:
- vllm_omni/diffusion/registry.py: register HyperCLOVAXVisionPipeline and
  HyperCLOVAXAudioPipeline with post-process functions

Stage config:
- vllm_omni/model_executor/stage_configs/hcx_omni.yaml: 3-stage config
  Stage 0: LLM thinker (TP=4, GPUs 0-3), Stage 1: vision decoder (GPU 4),
  Stage 2: audio decoder (GPU 5)

Bug fixes for HyperCLOVAX compatibility:
- diffusion/request.py: add extra dict field to OmniDiffusionRequest so
  vision_tokens/audio_tokens from stage input processors reach the pipeline
- entrypoints/async_omni_diffusion.py: extract OmniTokensPrompt.additional_information
  into OmniDiffusionRequest.extra before creating request
- entrypoints/omni_stage.py: skip empty engine inputs (text-only requests where
  thinker2vision_decoder/thinker2audio_decoder return [])
- entrypoints/async_omni.py: handle skipped sentinel in _process_single_result
  so text-only requests complete without crashing on Stage 1/2
- hcx_omni.yaml: guidance_scale 3.5→0.75, num_inference_steps 30→50
  (matches OmniServe production defaults; 3.5 caused over-amplified
  autoguidance → shrunken/degraded output images)
- omni_stage.py: skip empty engine inputs for text-only requests
- async_omni_diffusion.py: extract OmniTokensPrompt.additional_information
  into OmniDiffusionRequest.extra (audio_tokens/vision_tokens)
- registry.py: HCX Omni diffusion model registration fix
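The `extra` passthrough on the request can be sketched as a minimal dataclass (field names other than `extra` are illustrative stand-ins, not the real OmniDiffusionRequest definition):

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class DiffusionRequestSketch:
    """Minimal stand-in for OmniDiffusionRequest showing the added `extra`
    dict that carries vision_tokens/audio_tokens from the stage input
    processors through to the pipeline."""
    request_id: str
    extra: dict[str, Any] = field(default_factory=dict)


# additional_information from OmniTokensPrompt is copied into `extra`
# before the request is created.
req = DiffusionRequestSketch("req-0", extra={"vision_tokens": [11, 22]})
```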

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Wire HyperCLOVAXAudioPipeline as Stage 2 in hcx_omni.yaml
- GPU 5 assigned for audio decoder (Unit-BigVGAN / NCCosybigvganDecoder)
- Add runtime edge 0->2 (thinker -> audio decoder)
- Implement post-generation PCM chunk streaming for audio output
  (4800 samples / 200ms per SSE event @ 24kHz, int16 base64-encoded)
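The chunk framing above can be sketched as follows (a sketch of the payload encoding, not the actual server code):

```python
import base64
import struct

CHUNK = 4_800  # 200 ms at 24 kHz


def pcm_sse_chunks(samples):
    """Split float PCM in [-1, 1] into 4800-sample frames and yield each
    frame as base64-encoded little-endian int16 bytes, mirroring the
    per-SSE-event framing described above."""
    for i in range(0, len(samples), CHUNK):
        frame = samples[i:i + CHUNK]
        ints = (max(-32768, min(32767, int(s * 32767))) for s in frame)
        yield base64.b64encode(struct.pack(f"<{len(frame)}h", *ints)).decode("ascii")
```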

Refs: github.com/vllm-project/pull/869 (already incorporated)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- config/model.py: try/except fallback for AttentionBackendEnum import
  (vllm.v1.attention.backends.registry absent in older vllm builds)
- pipeline_hyperclovax_audio.py: return actual named_parameters() from
  load_weights() when using MAR checkpoint so diffusers_loader strict
  check passes (weights loaded eagerly in __init__ via MAR extraction)
- qwen3_omni_moe_thinker.py, qwen2_5_omni_thinker.py: try/except stubs
  for check_interleaved_audio_video and merge_interleaved_embeddings
  which are absent in older vllm qwen2_5_omni_thinker; these symbols
  are only exercised by Qwen models, not HyperCLOVAX
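The compat pattern for the AttentionBackendEnum fallback is the usual guarded import, roughly:

```python
# Guarded import: the registry module only exists in newer vLLM builds,
# so older installations fall back to a sentinel value.
try:
    from vllm.v1.attention.backends.registry import AttentionBackendEnum
except ImportError:
    # Older vLLM: leave None and feature-gate callers on it.
    AttentionBackendEnum = None
```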

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add runtime edge from:1 to:2 (required for Stage-2 connector init;
  without it AsyncOrchestrator cannot route to audio decoder at runtime)
- Change model_subdir to model for Stage-2 engine_args to match
  total-poc working reference config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
HyperCLOVAXAudioPipeline (diffusion) stores audio in multimodal_output
directly (OmniRequestOutput.from_diffusion), not in outputs[0].multimodal_output
like LLM pipelines. Fix three locations:

1. _create_audio_choice (non-streaming): use omni_outputs.multimodal_output
   when final_res.outputs is empty (diffusion path).
2. Streaming audio path: same fix for _final_res.outputs[0].
3. Both loops (for output in final_res.outputs): fall back to single
   synthetic choice at index 0 when outputs list is empty.
4. Handle bytes audio output from HyperCLOVAXAudioPipeline post-process
   (returns WAV bytes, not tensors like Qwen3-Omni).
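The fallback logic in points 1–3 can be sketched like this (attribute names follow the description above; a sketch, not the serving code):

```python
def extract_audio_output(final_res):
    """Prefer the per-output multimodal_output used by LLM pipelines;
    fall back to the top-level field, where HyperCLOVAXAudioPipeline
    (diffusion path) stores audio when `outputs` is empty."""
    if getattr(final_res, "outputs", None):
        return final_res.outputs[0].multimodal_output
    return final_res.multimodal_output
```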

Also fixes audio input (A2T) regression: skip diffusion prompt extraction
when mm_data has audio content (added in previous session).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
HyperCLOVAXAudioPipeline returns WAV bytes including 44-byte header.
The previous byte-offset splitting included the header in the first
chunk, corrupting it. Fix: parse with soundfile to get float32 PCM,
then convert to int16 chunks uniformly regardless of source type
(bytes or tensor).

Verified: 136 audio chunks x 200ms = 27.04s audio streamed correctly.
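The header-aware parse can be illustrated with the stdlib `wave` module (the actual fix uses soundfile, which also yields float32 PCM for arbitrary encodings):

```python
import io
import struct
import wave


def wav_bytes_to_int16(wav_bytes: bytes) -> list[int]:
    """Parse the WAV container (44-byte header + data chunk) instead of
    slicing raw bytes, so the header never leaks into the first audio
    chunk."""
    with wave.open(io.BytesIO(wav_bytes)) as wf:
        raw = wf.readframes(wf.getnframes())
    return list(struct.unpack(f"<{len(raw) // 2}h", raw))
```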

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- serving_chat.py: extract last input_audio base64 from request messages
  and inject as ref_audio_b64 into engine_prompt dict
- thinker2audio_decoder: read ref_audio_b64 from prompt and pass as
  ref_audio_tokens to Stage 2 (HyperCLOVAXAudioPipeline)
- hcx_omni.yaml: switch Stage 2 to NCZSCosybigvganDecoder.mar (zero-shot)
  which uses ECAPA-TDNN speaker encoder instead of finetuned ID lookup

Pipeline: input audio -> ECAPA-TDNN -> speaker embedding -> BigVGAN synthesis
matching the voice characteristics of the original speaker.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add Stage 2 (HyperCLOVAXAudioPipeline / NCZSCosybigvganDecoder) to hcx_omni.yaml
  with GPU 5, gpu_memory_utilization 0.4, edge 0->2 from thinker
- Fix thinker2audio_decoder: correct audio token range (128606-135167),
  remap to [0, 6561) for BigVGAN input, handle empty token case gracefully
- Fix pipeline_hyperclovax_audio.py post_process_func signature and
  incorporate PR#869 BUG FIX patches for stable audio generation
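The token-range remap can be sketched as follows (the half-open upper bound is inferred from 135167 − 128606 = 6561, matching the [0, 6561) target):

```python
AUDIO_TOKEN_LO = 128_606
AUDIO_TOKEN_HI = 135_167  # exclusive: 135_167 - 128_606 == 6_561 codes


def remap_audio_tokens(token_ids):
    """Keep only audio tokens from the thinker output and shift them into
    BigVGAN's [0, 6561) code space; returns [] for text-only outputs,
    which the caller handles gracefully."""
    return [t - AUDIO_TOKEN_LO for t in token_ids
            if AUDIO_TOKEN_LO <= t < AUDIO_TOKEN_HI]
```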
…lization

- hcx_omni.yaml: switch Stage 2 from NCZSCosybigvganDecoder (zero-shot,
  ECAPA-TDNN) to NCCosybigvganDecoder (finetuned, nn.Embedding speaker id).
  Zero-shot decoder required ref_audio (mel spectrogram) which is unavailable
  for text-only requests and incompatible with finetuned decoder path.

- pipeline_hyperclovax_audio.py: guard ref_audio processing with
  'not self.bigvgan.finetune' — finetuned decoder has no ECAPA-TDNN encoder,
  so passing ref_audio bytes would crash with 'expected 100 channels'.

- omni_stage.py: add HuggingFace modules cache (~/.cache/huggingface/modules)
  to sys.path before queue.get_nowait() in try_collect(). Stage-0 pickles
  outputs containing custom classes from transformers_modules (trust_remote_code),
  but the API server process doesn't have this path, causing deserialization
  failures that silently drop Stage-0 outputs.
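The sys.path fix described above amounts to something like this (a sketch of the deserialization guard, run before unpickling Stage-0 outputs):

```python
import os
import sys


def ensure_hf_modules_on_path():
    """Make transformers_modules (trust_remote_code classes cached by
    HuggingFace) importable in the API server process, so unpickling
    Stage-0 outputs that reference those classes does not fail."""
    modules_dir = os.path.expanduser("~/.cache/huggingface/modules")
    if modules_dir not in sys.path:
        sys.path.insert(0, modules_dir)
```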

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…quests

- hcx_omni.yaml: revert to NCZSCosybigvganDecoder.mar (zero-shot ECAPA-TDNN)
  for voice-preserving S2S synthesis. NCCosybigvganDecoder used a fixed
  integer speaker_id and lost the input speaker's voice.

- pipeline_hyperclovax_audio.py: add zero-mel fallback branch for
  finetune=False + ref_audio=None case. When a text-only request arrives
  (no input audio → no ref_audio), ECAPA-TDNN receives a zero mel tensor
  [1, num_mels, 64] instead of crashing with 'expected 100 channels'.
  S2S requests always have ref_audio so the zero-shot cloning path is
  unchanged.
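The zero-mel fallback can be sketched as follows (a plain-list sketch of the [1, num_mels, 64] shape; the real code builds a torch tensor, and num_mels=80 is an assumed default):

```python
def ref_mel_or_zero(ref_audio_mel, num_mels: int = 80):
    """Fallback for text-only requests: when no input audio (and hence no
    ref_audio) is available, feed ECAPA-TDNN a zero mel of shape
    [1, num_mels, 64] instead of crashing; S2S requests pass their real
    ref_audio through unchanged."""
    if ref_audio_mel is not None:
        return ref_audio_mel
    return [[[0.0] * 64 for _ in range(num_mels)]]
```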

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>
…V E2E) (#1)

* feat: add HyperCLOVAX-SEED-Omni-8B vision pipeline, thinker, and stage config

- diffusion/models/hyperclovax_vision/: HyperCLOVAX vision diffusion pipeline
  (transformer, layers, vision_token_embedder, pipeline)
- model_executor/models/hcx_omni/: HCX Omni thinker model
- model_executor/stage_configs/hcx_omni.yaml: 3-stage pipeline config
  (Stage-0 LLM thinker, Stage-1 vision decoder, Stage-2 audio decoder)
- model_executor/stage_input_processors/hyperclovax_seed_omni.py:
  thinker→vision/audio token routing
- engine/, entrypoints/: arg_utils, input_processor, omni_llm, zmq_utils,
  stage_utils, cli/main integration
- examples/online_serving/hcx_omni/: client demo and run script
- tests/: e2e and unit tests for HCX Omni

Co-Authored-By: Hyunjoon Jeong <with1015@unist.ac.kr>

* fix: async fan-out topology, serving pipeline, and vLLM 0.18.0 compat

- async_omni.py: redesign _process_sequential_results for fan-out topology
  — Stage-0 forwards to Stage-1 (vision) AND Stage-2 (audio) independently
  based on engine_input_source; add skipped_stages for conditional routing
- serving_chat.py: add _stage0_is_llm guard so GLM-Image bare-text
  replacement does not clobber HCX Omni Stage-0 multimodal inputs;
  handle audio output in _create_chat_completion_response
- async_omni_diffusion.py, omni_stage.py: vLLM 0.18.0 API alignment
- worker/gpu_ar_model_runner.py, async_omni_llm.py: compatibility fixes

Co-Authored-By: 길재은 <jaeeun.kil@navercorp.com>
Co-Authored-By: Hyunjoon Jeong <with1015@unist.ac.kr>

* fix: diffusion IPC, audio/vision decoder E2E pipeline fixes

- diffusion/ipc.py, diffusion_engine.py, diffusion_worker.py:
  IPC stability and worker lifecycle fixes for HCX audio+vision stages
- diffusion/models/hyperclovax_audio/pipeline_hyperclovax_audio.py:
  finetuned audio decoder path, transformers_modules deserialization,
  zero-shot speaker embedding fallback
- diffusion/registry.py, request.py: HCX Omni diffusion model registration
  and request type handling

Validated E2E with HyperCLOVAX-SEED-Omni-8B:
  Speech-to-Speech → 11.84s / 568KB WAV (BigVGAN, 24kHz)
  Text-to-Vision   → 768×768 PNG (diffusion, 50 steps)

Co-Authored-By: 길재은 <jaeeun.kil@navercorp.com>
Co-Authored-By: Hyunjoon Jeong <with1015@unist.ac.kr>

* fix: vLLM 0.18.0 compatibility for unit tests and config imports

- tests/unit/conftest.py: stub vllm_omni heavy init so unit tests can
  import stage_input_processors without a full vLLM installation
- vllm_omni/config/model.py: guard _RUNNER_TASKS / TaskOption imports
  with try/except fallback for vLLM 0.18.0 where these were removed

Co-Authored-By: 길재은 <jaeeun.kil@navercorp.com>
Co-Authored-By: Hyunjoon Jeong <with1015@unist.ac.kr>

---------

Co-authored-by: 길재은 <jaeeun.kil@navercorp.com>
Co-authored-by: Hyunjoon Jeong <with1015@unist.ac.kr>
Co-authored-by: kje <kje@navercorp.com>
Signed-off-by: Hyunjoon Jeong <with1015@unist.ac.kr>