Merged
18 commits
1bdef5f
[BugFix] Fix text_to_audio output handling and gated model note
LudovicoYIN Jan 19, 2026
ffc1537
[BugFix] Simplify text_to_audio output handling
LudovicoYIN Jan 19, 2026
b422593
[BugFix] Standardize StableAudio audio output and update example…
LudovicoYIN Jan 19, 2026
a586389
Merge branch 'main' into fix-text-to-audio-output-and-docs
LudovicoYIN Jan 20, 2026
c93c4b4
Add SupportOutputType for StableAudio output
LudovicoYIN Jan 20, 2026
d854f3c
Merge branch 'main' into fix-text-to-audio-output-and-docs
LudovicoYIN Jan 20, 2026
18da606
Merge branch 'main' into fix-text-to-audio-output-and-docs
LudovicoYIN Jan 21, 2026
3bba3dd
Merge branch 'vllm-project:main' into fix-text-to-audio-output-and-docs
LudovicoYIN Jan 21, 2026
808203b
Add output_type protocol and rename SupportImageInput
LudovicoYIN Jan 21, 2026
73e33c8
Merge branch 'main' into fix-text-to-audio-output-and-docs
LudovicoYIN Jan 21, 2026
5be10c9
Update stable audio test for multimodal_output
LudovicoYIN Jan 21, 2026
58e43a1
Add support_audio_output flag and rename SupportImageInput
LudovicoYIN Jan 21, 2026
c433b22
Merge branch 'main' into fix-text-to-audio-output-and-docs
LudovicoYIN Jan 21, 2026
1149d50
Fix lint
LudovicoYIN Jan 21, 2026
8fa6b5f
Merge branch 'main' into fix-text-to-audio-output-and-docs
LudovicoYIN Jan 22, 2026
60d96df
Replace output_type with support_audio_output helper
LudovicoYIN Jan 22, 2026
ad01f2b
Merge branch 'main' into fix-text-to-audio-output-and-docs
hsliuustc0106 Jan 22, 2026
8c6d35d
Merge branch 'main' into fix-text-to-audio-output-and-docs
hsliuustc0106 Jan 23, 2026
20 changes: 19 additions & 1 deletion examples/offline_inference/text_to_audio/text_to_audio.py
@@ -142,7 +142,7 @@ def main():
generation_start = time.perf_counter()

# Generate audio
-    audio = omni.generate(
+    outputs = omni.generate(
args.prompt,
negative_prompt=args.negative_prompt,
generator=generator,
@@ -166,6 +166,24 @@ def main():
suffix = output_path.suffix or ".wav"
stem = output_path.stem or "stable_audio_output"

# Extract audio from omni.generate() outputs
if isinstance(outputs, (torch.Tensor, np.ndarray)):
audio = outputs
elif isinstance(outputs, list) and outputs:
output = outputs[0]
if not hasattr(output, "request_output") or not output.request_output:
raise ValueError("No request_output found in OmniRequestOutput")
request_output = output.request_output[0]
if hasattr(request_output, "multimodal_output"):
multimodal_output = request_output.multimodal_output or {}
audio = multimodal_output.get("audio")
elif hasattr(request_output, "images") and request_output.images:
Collaborator

This file focuses on text-to-audio; there is no need to consider image output here.

Contributor Author

Thanks for your review. For StableAudio diffusion, however, the audio is returned via OmniRequestOutput.images, so without this branch the example fails with "No audio output found". I verified that the audio is extracted from request_output[0].images[0] on this path.

Contributor Author

I've seen Qwen Omni audio under request_output[0].multimodal_output["audio"], while StableAudio diffusion returns it in request_output[0].images[0]. Do we want to standardize on request_output[0].multimodal_output["audio"]?

Collaborator

Yes. It would be great if you want to fix it.

Contributor Author

Thanks for confirming; I'll fix it.

Contributor Author

I've updated the StableAudio diffusion outputs to use request_output[0].multimodal_output["audio"] and simplified the text_to_audio example accordingly. Please take another look.
audio = request_output.images[0]
else:
raise ValueError("No audio output found in request_output")
else:
raise ValueError("No output generated from omni.generate()")

# Handle different output formats
if isinstance(audio, torch.Tensor):
audio = audio.cpu().float().numpy()
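The extraction logic in the diff above can be sketched in isolation. This is a minimal, hypothetical illustration: the `OmniRequestOutput` and `RequestOutput` classes are mocked here with plain dataclasses (the real vLLM types carry more fields), and the torch.Tensor branch is omitted so the sketch stays self-contained.

```python
# Hypothetical sketch of the audio-extraction path from omni.generate()
# outputs, as standardized in this PR on multimodal_output["audio"].
# The vLLM output classes are stand-ins, not the real implementations.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class RequestOutput:
    multimodal_output: dict = field(default_factory=dict)


@dataclass
class OmniRequestOutput:
    request_output: list = field(default_factory=list)


def extract_audio(outputs):
    """Pull the audio array out of omni.generate() outputs."""
    if isinstance(outputs, np.ndarray):
        # Some backends may hand back the raw array directly.
        return outputs
    if not outputs:
        raise ValueError("No output generated from omni.generate()")
    first = outputs[0]
    if not first.request_output:
        raise ValueError("No request_output found in OmniRequestOutput")
    audio = first.request_output[0].multimodal_output.get("audio")
    if audio is None:
        raise ValueError("No audio output found in request_output")
    return audio


mock = [OmniRequestOutput([RequestOutput({"audio": np.zeros(4)})])]
print(extract_audio(mock).shape)  # (4,)
```

Because every model now places audio under `multimodal_output["audio"]`, the example no longer needs the model-specific `images`-based fallback branch.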