[BugFix] Standardize StableAudio audio output #842

Merged
hsliuustc0106 merged 18 commits into vllm-project:main from LudovicoYIN:fix-text-to-audio-output-and-docs
Jan 24, 2026

Conversation

@LudovicoYIN
Contributor

@LudovicoYIN LudovicoYIN commented Jan 19, 2026

Purpose

Standardize StableAudio diffusion output to request_output[0].multimodal_output["audio"] and update the text_to_audio example accordingly. Resolves #829.

Test Plan

  • python examples/offline_inference/text_to_audio/text_to_audio.py --prompt "A piano playing a gentle melody" --audio_length 10.0

Test Result

  • script completes and writes WAV output without AttributeError.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 14c9610c5b


Comment on lines +132 to +135

if candidate.request_output:
    if isinstance(candidate.request_output, list) and candidate.request_output:
        candidate = candidate.request_output[0]


P2: Handle RequestOutput multimodal audio directly

When omni.generate() returns an OmniRequestOutput from an LLM pipeline, request_output is a list of RequestOutput objects and the audio tensor lives on request_output[0].multimodal_output["audio"] (see tests/e2e/offline_inference/test_qwen2_5_omni.py:89-93). Here you overwrite candidate with request_output[0], but then only try candidate.request_output.multimodal_output, which a RequestOutput does not have, so the function raises ValueError even though audio exists. This breaks the example if a user passes a Qwen* Omni audio model via --model; consider checking candidate.multimodal_output (or avoiding the reassignment) so pipeline audio outputs are extracted correctly.
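The extraction order the review suggests can be sketched as follows. This is an illustrative sketch, not the example's actual code; `extract_audio` is a hypothetical name, and the attribute shapes are taken from the review comment above:

```python
# Illustrative sketch of the suggested fix: when request_output is a
# list of RequestOutput objects, read multimodal_output from the first
# element itself instead of re-reading candidate.request_output after
# the reassignment.
def extract_audio(candidate):
    request_output = getattr(candidate, "request_output", None)
    if isinstance(request_output, list) and request_output:
        # A RequestOutput carries the tensor on multimodal_output["audio"]
        candidate = request_output[0]
    multimodal_output = getattr(candidate, "multimodal_output", None) or {}
    audio = multimodal_output.get("audio")
    if audio is None:
        raise ValueError("No audio output found")
    return audio
```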


@LudovicoYIN LudovicoYIN force-pushed the fix-text-to-audio-output-and-docs branch from 14c9610 to 934d3a2 on January 19, 2026 at 08:03
@LudovicoYIN LudovicoYIN force-pushed the fix-text-to-audio-output-and-docs branch 2 times, most recently from 59bf1f3 to c50c875 on January 19, 2026 at 08:19
@LudovicoYIN LudovicoYIN force-pushed the fix-text-to-audio-output-and-docs branch from c50c875 to ffc1537 on January 19, 2026 at 08:29
if hasattr(request_output, "multimodal_output"):
    multimodal_output = request_output.multimodal_output or {}
    audio = multimodal_output.get("audio")
elif hasattr(request_output, "images") and request_output.images:
Collaborator


This file focuses on text to audio. No need to consider image output here.

Contributor Author


Thanks for your review. For StableAudio diffusion, however, the audio is returned via OmniRequestOutput.images, so without this branch the example fails with “No audio output found”. I verified that the audio is extracted from request_output[0].images[0] on this path.
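The two return shapes discussed in this thread can be sketched like so. `extract_audio_legacy` is a hypothetical name, and the attribute layout is taken from the discussion (pre-fix behavior), not from the final API:

```python
# Hedged sketch of the pre-fix dual extraction paths described here:
# Qwen Omni returns audio under multimodal_output["audio"], while
# StableAudio diffusion (before this PR) placed it under images[0].
def extract_audio_legacy(request_output):
    mm = getattr(request_output, "multimodal_output", None)
    if mm and mm.get("audio") is not None:
        return mm["audio"]
    images = getattr(request_output, "images", None)
    if images:
        # StableAudio placed the waveform here pre-fix
        return images[0]
    raise ValueError("No audio output found")
```

This PR removes the need for the second branch by standardizing on multimodal_output["audio"].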

Contributor Author


I’ve seen Qwen Omni audio under request_output[0].multimodal_output["audio"], while StableAudio diffusion returns it in request_output[0].images[0]—do we want to standardize on request_output[0].multimodal_output["audio"]?

Collaborator


> I’ve seen Qwen Omni audio under request_output[0].multimodal_output["audio"], while StableAudio diffusion returns it in request_output[0].images[0]—do we want to standardize on request_output[0].multimodal_output["audio"]?

Yes. It would be great if you could fix it.

Contributor Author


Thanks for confirming; I’ll fix it.

Contributor Author


> I’ve seen Qwen Omni audio under request_output[0].multimodal_output["audio"], while StableAudio diffusion returns it in request_output[0].images[0]—do we want to standardize on request_output[0].multimodal_output["audio"]?
>
> Yes. It would be great if you want to fix it

I’ve updated StableAudio diffusion outputs to use request_output[0].multimodal_output["audio"] and simplified the text_to_audio example accordingly. Please take another look.

@LudovicoYIN LudovicoYIN changed the title from “[BugFix] Fix text_to_audio output handling and gated model note” to “[BugFix] Standardize StableAudio audio output” on Jan 19, 2026
@hsliuustc0106
Collaborator

@linyueqian PTAL

@linyueqian
Collaborator

> @linyueqian PTAL

@hsliuustc0106 Should we move the output modality detection to the registry rather than hardcoding model names in the engine?

@LudovicoYIN
Contributor Author

> @hsliuustc0106 Should we move the output modality detection to the registry rather than hardcoding model names in the engine?

Maybe we can keep a small mapping in the diffusion registry and a helper, e.g.:

_DIFFUSION_OUTPUT_TYPES = {
    "StableAudioPipeline": "audio",
    # default for others: "image"
}

def get_diffusion_output_type(model_class_name: str) -> str:
    return _DIFFUSION_OUTPUT_TYPES.get(model_class_name, "image")

Then the engine can call this helper instead of hardcoding model names.

@ZJY0516
Collaborator

ZJY0516 commented Jan 20, 2026

FYI, we have a way to mark or detect whether a model accepts image inputs:

class SupportImageInput(Protocol):
    support_image_input: ClassVar[bool] = True

class QwenImageEditPlusPipeline(nn.Module, SupportImageInput):

@LudovicoYIN
Contributor Author

I’d like to propose an approach to avoid hard-coding model names for output handling.
The idea is to let model classes self-describe their output modality via a lightweight output_type attribute, using a small protocol for clarity.

Concretely:

  • interface.py: add

    @runtime_checkable
    class SupportOutputType(Protocol):
        output_type: ClassVar[str]
  • StableAudioPipeline: declare

    class StableAudioPipeline(nn.Module, SupportOutputType):
        output_type: ClassVar[str] = "audio"
  • diffusion_engine.py: replace model-name-based branching with

    model_cls = DiffusionModelRegistry._try_load_model_cls(self.od_config.model_class_name)
    output_type = getattr(model_cls, "output_type", "image")

    and branch on output_type == "audio" to consistently write audio outputs to multimodal_output["audio"].

I’m happy to implement this if the direction makes sense.
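Put together, the proposal above could look roughly like this self-contained sketch. The pipeline classes here are stand-ins (the real ones subclass nn.Module and live in the repo); only the protocol shape and the getattr default follow the proposal:

```python
from typing import ClassVar, Protocol, runtime_checkable

@runtime_checkable
class SupportOutputType(Protocol):
    """Model classes self-describe their output modality."""
    output_type: ClassVar[str]

# Stand-in pipeline classes for illustration:
class StableAudioPipeline:
    output_type: ClassVar[str] = "audio"

class SomeImagePipeline:
    pass  # no output_type declared

def get_output_type(model_cls: type) -> str:
    # Engine-side lookup: default to "image" when the class is silent,
    # replacing model-name-based branching.
    return getattr(model_cls, "output_type", "image")
```

The engine would then branch on get_output_type(model_cls) == "audio" to write audio to multimodal_output["audio"].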

@hsliuustc0106
Collaborator

> I’d like to propose an approach to avoid hard-coding model names for output handling. The idea is to let model classes self-describe their output modality via a lightweight output_type attribute, using a small protocol for clarity. […]

@ZJY0516 WDYT? This proposal LGTM.

@ZJY0516
Collaborator

ZJY0516 commented Jan 20, 2026

@LudovicoYIN LGTM. But Qwen3-Omni also supports audio output; we should align the output behavior.

@LudovicoYIN
Contributor Author

> @LudovicoYIN LGTM. But qwen3-omni also supports audio output. We'd better align behavior during output

@ZJY0516 Thanks for the review! I’ve verified that Qwen3‑Omni already outputs audio via request_output[0].multimodal_output["audio"]. This PR aligns StableAudio diffusion to the same output path so the behavior is consistent. Please let me know if you’d like any adjustments.

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


lgtm

@hsliuustc0106 hsliuustc0106 added the ready label (to trigger buildkite CI) on Jan 21, 2026
@LudovicoYIN
Contributor Author

I noticed that the outer OmniRequestOutput.final_output_type is still driven by the diffusion stage config (which defaults to image), so it doesn’t reflect audio even after we standardize StableAudio to multimodal_output["audio"]. The inner diffusion output already has final_output_type="audio".

models = ["linyueqian/stable_audio_random"]

@pytest.mark.parametrize("model_name", models)
def test_stable_audio_model(model_name: str):
    m = Omni(model=model_name)

    # Use minimal settings for testing
    # Generate a short 2-second audio clip with minimal inference steps
    audio_start_in_s = 0.0
    audio_end_in_s = 2.0  # Short duration for fast testing
    sample_rate = 44100  # Stable Audio uses 44100 Hz

    outputs = m.generate(
        "The sound of a dog barking",
        negative_prompt="Low quality.",
        num_inference_steps=4,  # Minimal steps for speed
        guidance_scale=7.0,
        generator=torch.Generator("cuda").manual_seed(42),
        num_outputs_per_prompt=1,
        extra={
            "audio_start_in_s": audio_start_in_s,
            "audio_end_in_s": audio_end_in_s,
        },
    )

    # Extract audio from OmniRequestOutput
    assert outputs is not None
    first_output = outputs[0]
    assert first_output.final_output_type == "image"

Do we want to follow up with a change to the diffusion stage config so that the outer final_output_type also reflects audio, or is it intentional to leave this as-is for now?

@LudovicoYIN
Contributor Author

@ZJY0516
I’ve addressed the requested change by switching to SupportAudioOutput and removing output_type.
Could you please re-review and let me know if it looks good now?

prefix: Weight prefix for loading (default: "")
"""

support_audio_output: bool = True
Collaborator


Perhaps there is no need to set this to True here.

Contributor Author


Thanks! I’ve removed the explicit support_audio_output = True here.

if not isinstance(outputs, list):
    outputs = [outputs] if outputs is not None else []

model_cls = DiffusionModelRegistry._try_load_model_cls(self.od_config.model_class_name)
Collaborator


Could you please define a util function for this?

Contributor Author


Done. I’ve added a small util function like def supports_image_input.
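Such a helper might look like the sketch below. The name `supports_audio_output` and its attribute check are assumptions modeled on the supports_image_input pattern mentioned in the thread; the merged helper may differ:

```python
# Hedged sketch of the utility helper: a pipeline class opts in to
# audio output via a truthy support_audio_output class attribute,
# mirroring the SupportImageInput marker pattern.
def supports_audio_output(model_cls: type) -> bool:
    return getattr(model_cls, "support_audio_output", False) is True
```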

@ZJY0516
Collaborator

ZJY0516 commented Jan 21, 2026

> I noticed that the outer OmniRequestOutput.final_output_type is still driven by the diffusion stage config (which defaults to image), so it doesn’t reflect audio even after we standardize StableAudio to multimodal_output["audio"]. The inner diffusion output already has final_output_type="audio". […] Do we want to follow up with a change to the diffusion stage config so that the outer final_output_type also reflects audio, or is it intentional to leave this as-is for now?

Adding a TODO will be fine. We'll refactor that soon.

@hsliuustc0106
Collaborator

How about video support?

@david6666666
Collaborator

Is this PR ready?

@ZJY0516
Collaborator

ZJY0516 commented Jan 22, 2026

@david6666666 @LudovicoYIN Let's wait for #797

@david6666666 david6666666 added this to the v0.14.0rc1 milestone Jan 22, 2026
@david6666666 david6666666 modified the milestones: v0.14.0rc1, v0.14.0 Jan 23, 2026
@@ -44,15 +44,14 @@ def test_stable_audio_model(model_name: str):
# Extract audio from OmniRequestOutput
Collaborator


@yenuo26 PTAL

@hsliuustc0106 hsliuustc0106 merged commit f9c69a8 into vllm-project:main Jan 24, 2026
7 checks passed
majiayu000 pushed a commit to majiayu000/vllm-omni that referenced this pull request Jan 26, 2026
Signed-off-by: LudovicoYIN <hankeyin@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: majiayu000 <1835304752@qq.com>
@LudovicoYIN LudovicoYIN deleted the fix-text-to-audio-output-and-docs branch February 2, 2026 03:09

Labels

ready label to trigger buildkite CI


Development

Successfully merging this pull request may close these issues.

[Bug]: AttributeError in text_to_audio.py & Request for User Guide

5 participants