
Feat/hyperclovax omni AD #5

Draft
with1015 wants to merge 24 commits into main from feat/hyperclovax-omni-AD

Conversation


@with1015 with1015 commented Apr 6, 2026

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link the existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts and test commands. If your code doesn't require additional test scripts, please state why. For test file guidelines, please check the test style doc.
  • The test results. Please paste a before/after comparison, or the e2e results.
  • (Optional) Any necessary documentation updates, such as updating supported_models.md and adding examples for a new model. Please run mkdocs serve to sync the documentation edits to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

with1015 and others added 24 commits January 20, 2026 07:38
Signed-off-by: Hyunjoon Jeong <with1015@unist.ac.kr>
Model files:
- vllm_omni/diffusion/models/hyperclovax_vision/: vision decoder pipeline
  (HyperCLOVAXVisionPipeline) using flow matching diffusion + VisionTransformer
- vllm_omni/diffusion/models/hyperclovax_audio/: audio decoder pipeline
  (HyperCLOVAXAudioPipeline) using Unit-BigVGAN codec
- vllm_omni/model_executor/stage_input_processors/hyperclovax_seed_omni.py:
  thinker2vision_decoder and thinker2audio_decoder — extract discrete tokens from
  LLM output; truncate/pad vision codes to 729 (27x27) for decoder
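The truncate/pad behavior described above can be sketched as follows (a hypothetical helper; padding with 0 is an assumed placeholder id, not taken from the actual processor code):

```python
def fit_vision_codes(codes: list[int], target: int = 729) -> list[int]:
    """Truncate or pad a sequence of discrete vision codes to exactly
    729 entries (a 27x27 latent grid) before handing them to the vision
    decoder. Pad id 0 is an assumption for illustration."""
    if len(codes) >= target:
        return codes[:target]
    return codes + [0] * (target - len(codes))
```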

Registry:
- vllm_omni/diffusion/registry.py: register HyperCLOVAXVisionPipeline and
  HyperCLOVAXAudioPipeline with post-process functions

Stage config:
- vllm_omni/model_executor/stage_configs/hcx_omni.yaml: 3-stage config
  Stage 0: LLM thinker (TP=4, GPUs 0-3), Stage 1: vision decoder (GPU 4),
  Stage 2: audio decoder (GPU 5)

Bug fixes for HyperCLOVAX compatibility:
- diffusion/request.py: add extra dict field to OmniDiffusionRequest so
  vision_tokens/audio_tokens from stage input processors reach the pipeline
- entrypoints/async_omni_diffusion.py: extract OmniTokensPrompt.additional_information
  into OmniDiffusionRequest.extra before creating request
- entrypoints/omni_stage.py: skip empty engine inputs (text-only requests where
  thinker2vision_decoder/thinker2audio_decoder return [])
- entrypoints/async_omni.py: handle skipped sentinel in _process_single_result
  so text-only requests complete without crashing on Stage 1/2
- hcx_omni.yaml: guidance_scale 3.5→0.75, num_inference_steps 30→50
  (matches OmniServe production defaults; 3.5 caused over-amplified
  autoguidance → shrunken/degraded output images)
- omni_stage.py: skip empty engine inputs for text-only requests
- async_omni_diffusion.py: extract OmniTokensPrompt.additional_information
  into OmniDiffusionRequest.extra (audio_tokens/vision_tokens)
- registry.py: HCX Omni diffusion model registration fix
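The `extra` passthrough on the request can be sketched as a minimal dataclass (field names other than `extra` are illustrative stand-ins, not the real OmniDiffusionRequest definition):

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class DiffusionRequestSketch:
    """Minimal stand-in for OmniDiffusionRequest showing the added `extra`
    dict that carries vision_tokens/audio_tokens from the stage input
    processors through to the pipeline."""
    request_id: str
    extra: dict[str, Any] = field(default_factory=dict)


# additional_information from OmniTokensPrompt is copied into `extra`
# before the request is created.
req = DiffusionRequestSketch("req-0", extra={"vision_tokens": [11, 22]})
```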

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Wire HyperCLOVAXAudioPipeline as Stage 2 in hcx_omni.yaml
- GPU 5 assigned for audio decoder (Unit-BigVGAN / NCCosybigvganDecoder)
- Add runtime edge 0->2 (thinker -> audio decoder)
- Implement post-generation PCM chunk streaming for audio output
  (4800 samples / 200ms per SSE event @ 24kHz, int16 base64-encoded)
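The chunk framing above can be sketched as follows (a sketch of the payload encoding, not the actual server code):

```python
import base64
import struct

CHUNK = 4_800  # 200 ms at 24 kHz


def pcm_sse_chunks(samples):
    """Split float PCM in [-1, 1] into 4800-sample frames and yield each
    frame as base64-encoded little-endian int16 bytes, mirroring the
    per-SSE-event framing described above."""
    for i in range(0, len(samples), CHUNK):
        frame = samples[i:i + CHUNK]
        ints = (max(-32768, min(32767, int(s * 32767))) for s in frame)
        yield base64.b64encode(struct.pack(f"<{len(frame)}h", *ints)).decode("ascii")
```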

Refs: github.com/vllm-project/pull/869 (already incorporated)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- config/model.py: try/except fallback for AttentionBackendEnum import
  (vllm.v1.attention.backends.registry absent in older vllm builds)
- pipeline_hyperclovax_audio.py: return actual named_parameters() from
  load_weights() when using MAR checkpoint so diffusers_loader strict
  check passes (weights loaded eagerly in __init__ via MAR extraction)
- qwen3_omni_moe_thinker.py, qwen2_5_omni_thinker.py: try/except stubs
  for check_interleaved_audio_video and merge_interleaved_embeddings
  which are absent in older vllm qwen2_5_omni_thinker; these symbols
  are only exercised by Qwen models, not HyperCLOVAX
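The compat pattern for the AttentionBackendEnum fallback is the usual guarded import, roughly:

```python
# Guarded import: the registry module only exists in newer vLLM builds,
# so older installations fall back to a sentinel value.
try:
    from vllm.v1.attention.backends.registry import AttentionBackendEnum
except ImportError:
    # Older vLLM: leave None and feature-gate callers on it.
    AttentionBackendEnum = None
```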

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add runtime edge from:1 to:2 (required for Stage-2 connector init;
  without it AsyncOrchestrator cannot route to audio decoder at runtime)
- Change model_subdir to model for Stage-2 engine_args to match
  total-poc working reference config

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
HyperCLOVAXAudioPipeline (diffusion) stores audio in multimodal_output
directly (OmniRequestOutput.from_diffusion), not in outputs[0].multimodal_output
like LLM pipelines. Fix three locations:

1. _create_audio_choice (non-streaming): use omni_outputs.multimodal_output
   when final_res.outputs is empty (diffusion path).
2. Streaming audio path: same fix for _final_res.outputs[0].
3. Both loops (for output in final_res.outputs): fall back to single
   synthetic choice at index 0 when outputs list is empty.
4. Handle bytes audio output from HyperCLOVAXAudioPipeline post-process
   (returns WAV bytes, not tensors like Qwen3-Omni).
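The fallback logic in points 1–3 can be sketched like this (attribute names follow the description above; a sketch, not the serving code):

```python
def extract_audio_output(final_res):
    """Prefer the per-output multimodal_output used by LLM pipelines;
    fall back to the top-level field, where HyperCLOVAXAudioPipeline
    (diffusion path) stores audio when `outputs` is empty."""
    if getattr(final_res, "outputs", None):
        return final_res.outputs[0].multimodal_output
    return final_res.multimodal_output
```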

Also fixes audio input (A2T) regression: skip diffusion prompt extraction
when mm_data has audio content (added in previous session).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
HyperCLOVAXAudioPipeline returns WAV bytes including 44-byte header.
The previous byte-offset splitting included the header in the first
chunk, corrupting it. Fix: parse with soundfile to get float32 PCM,
then convert to int16 chunks uniformly regardless of source type
(bytes or tensor).

Verified: 136 audio chunks x 200ms = 27.04s audio streamed correctly.
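The header-aware parse can be illustrated with the stdlib `wave` module (the actual fix uses soundfile, which also yields float32 PCM for arbitrary encodings):

```python
import io
import struct
import wave


def wav_bytes_to_int16(wav_bytes: bytes) -> list[int]:
    """Parse the WAV container (44-byte header + data chunk) instead of
    slicing raw bytes, so the header never leaks into the first audio
    chunk."""
    with wave.open(io.BytesIO(wav_bytes)) as wf:
        raw = wf.readframes(wf.getnframes())
    return list(struct.unpack(f"<{len(raw) // 2}h", raw))
```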

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- serving_chat.py: extract last input_audio base64 from request messages
  and inject as ref_audio_b64 into engine_prompt dict
- thinker2audio_decoder: read ref_audio_b64 from prompt and pass as
  ref_audio_tokens to Stage 2 (HyperCLOVAXAudioPipeline)
- hcx_omni.yaml: switch Stage 2 to NCZSCosybigvganDecoder.mar (zero-shot)
  which uses ECAPA-TDNN speaker encoder instead of finetuned ID lookup

Pipeline: input audio -> ECAPA-TDNN -> speaker embedding -> BigVGAN synthesis
matching the voice characteristics of the original speaker.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add Stage 2 (HyperCLOVAXAudioPipeline / NCZSCosybigvganDecoder) to hcx_omni.yaml
  with GPU 5, gpu_memory_utilization 0.4, edge 0->2 from thinker
- Fix thinker2audio_decoder: correct audio token range (128606-135167),
  remap to [0, 6561) for BigVGAN input, handle empty token case gracefully
- Fix pipeline_hyperclovax_audio.py post_process_func signature and
  incorporate PR#869 BUG FIX patches for stable audio generation
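The token-range remap can be sketched as follows (the half-open upper bound is inferred from 135167 − 128606 = 6561, matching the [0, 6561) target):

```python
AUDIO_TOKEN_LO = 128_606
AUDIO_TOKEN_HI = 135_167  # exclusive: 135_167 - 128_606 == 6_561 codes


def remap_audio_tokens(token_ids):
    """Keep only audio tokens from the thinker output and shift them into
    BigVGAN's [0, 6561) code space; returns [] for text-only outputs,
    which the caller handles gracefully."""
    return [t - AUDIO_TOKEN_LO for t in token_ids
            if AUDIO_TOKEN_LO <= t < AUDIO_TOKEN_HI]
```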
…lization

- hcx_omni.yaml: switch Stage 2 from NCZSCosybigvganDecoder (zero-shot,
  ECAPA-TDNN) to NCCosybigvganDecoder (finetuned, nn.Embedding speaker id).
  Zero-shot decoder required ref_audio (mel spectrogram) which is unavailable
  for text-only requests and incompatible with finetuned decoder path.

- pipeline_hyperclovax_audio.py: guard ref_audio processing with
  'not self.bigvgan.finetune' — finetuned decoder has no ECAPA-TDNN encoder,
  so passing ref_audio bytes would crash with 'expected 100 channels'.

- omni_stage.py: add HuggingFace modules cache (~/.cache/huggingface/modules)
  to sys.path before queue.get_nowait() in try_collect(). Stage-0 pickles
  outputs containing custom classes from transformers_modules (trust_remote_code),
  but the API server process doesn't have this path, causing deserialization
  failures that silently drop Stage-0 outputs.
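The sys.path fix described above amounts to something like this (a sketch of the deserialization guard, run before unpickling Stage-0 outputs):

```python
import os
import sys


def ensure_hf_modules_on_path():
    """Make transformers_modules (trust_remote_code classes cached by
    HuggingFace) importable in the API server process, so unpickling
    Stage-0 outputs that reference those classes does not fail."""
    modules_dir = os.path.expanduser("~/.cache/huggingface/modules")
    if modules_dir not in sys.path:
        sys.path.insert(0, modules_dir)
```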

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…quests

- hcx_omni.yaml: revert to NCZSCosybigvganDecoder.mar (zero-shot ECAPA-TDNN)
  for voice-preserving S2S synthesis. NCCosybigvganDecoder used a fixed
  integer speaker_id and lost the input speaker's voice.

- pipeline_hyperclovax_audio.py: add zero-mel fallback branch for
  finetune=False + ref_audio=None case. When a text-only request arrives
  (no input audio → no ref_audio), ECAPA-TDNN receives a zero mel tensor
  [1, num_mels, 64] instead of crashing with 'expected 100 channels'.
  S2S requests always have ref_audio so the zero-shot cloning path is
  unchanged.
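The zero-mel fallback can be sketched as follows (a plain-list sketch of the [1, num_mels, 64] shape; the real code builds a torch tensor, and num_mels=80 is an assumed default):

```python
def ref_mel_or_zero(ref_audio_mel, num_mels: int = 80):
    """Fallback for text-only requests: when no input audio (and hence no
    ref_audio) is available, feed ECAPA-TDNN a zero mel of shape
    [1, num_mels, 64] instead of crashing; S2S requests pass their real
    ref_audio through unchanged."""
    if ref_audio_mel is not None:
        return ref_audio_mel
    return [[[0.0] * 64 for _ in range(num_mels)]]
```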

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Hyunjoon Jeong <hyunjoon.jeong@navercorp.com>
…V E2E) (#1)

* feat: add HyperCLOVAX-SEED-Omni-8B vision pipeline, thinker, and stage config

- diffusion/models/hyperclovax_vision/: HyperCLOVAX vision diffusion pipeline
  (transformer, layers, vision_token_embedder, pipeline)
- model_executor/models/hcx_omni/: HCX Omni thinker model
- model_executor/stage_configs/hcx_omni.yaml: 3-stage pipeline config
  (Stage-0 LLM thinker, Stage-1 vision decoder, Stage-2 audio decoder)
- model_executor/stage_input_processors/hyperclovax_seed_omni.py:
  thinker→vision/audio token routing
- engine/, entrypoints/: arg_utils, input_processor, omni_llm, zmq_utils,
  stage_utils, cli/main integration
- examples/online_serving/hcx_omni/: client demo and run script
- tests/: e2e and unit tests for HCX Omni

Co-Authored-By: Hyunjoon Jeong <with1015@unist.ac.kr>

* fix: async fan-out topology, serving pipeline, and vLLM 0.18.0 compat

- async_omni.py: redesign _process_sequential_results for fan-out topology
  — Stage-0 forwards to Stage-1 (vision) AND Stage-2 (audio) independently
  based on engine_input_source; add skipped_stages for conditional routing
- serving_chat.py: add _stage0_is_llm guard so GLM-Image bare-text
  replacement does not clobber HCX Omni Stage-0 multimodal inputs;
  handle audio output in _create_chat_completion_response
- async_omni_diffusion.py, omni_stage.py: vLLM 0.18.0 API alignment
- worker/gpu_ar_model_runner.py, async_omni_llm.py: compatibility fixes

Co-Authored-By: 길재은 <jaeeun.kil@navercorp.com>
Co-Authored-By: Hyunjoon Jeong <with1015@unist.ac.kr>

* fix: diffusion IPC, audio/vision decoder E2E pipeline fixes

- diffusion/ipc.py, diffusion_engine.py, diffusion_worker.py:
  IPC stability and worker lifecycle fixes for HCX audio+vision stages
- diffusion/models/hyperclovax_audio/pipeline_hyperclovax_audio.py:
  finetuned audio decoder path, transformers_modules deserialization,
  zero-shot speaker embedding fallback
- diffusion/registry.py, request.py: HCX Omni diffusion model registration
  and request type handling

Validated E2E with HyperCLOVAX-SEED-Omni-8B:
  Speech-to-Speech → 11.84s / 568KB WAV (BigVGAN, 24kHz)
  Text-to-Vision   → 768×768 PNG (diffusion, 50 steps)

Co-Authored-By: 길재은 <jaeeun.kil@navercorp.com>
Co-Authored-By: Hyunjoon Jeong <with1015@unist.ac.kr>

* fix: vLLM 0.18.0 compatibility for unit tests and config imports

- tests/unit/conftest.py: stub vllm_omni heavy init so unit tests can
  import stage_input_processors without a full vLLM installation
- vllm_omni/config/model.py: guard _RUNNER_TASKS / TaskOption imports
  with try/except fallback for vLLM 0.18.0 where these were removed

Co-Authored-By: 길재은 <jaeeun.kil@navercorp.com>
Co-Authored-By: Hyunjoon Jeong <with1015@unist.ac.kr>

---------

Co-authored-by: 길재은 <jaeeun.kil@navercorp.com>
Co-authored-by: Hyunjoon Jeong <with1015@unist.ac.kr>
Co-authored-by: kje <kje@navercorp.com>
Signed-off-by: Hyunjoon Jeong <with1015@unist.ac.kr>