[BugFix] Standardize StableAudio audio output #842

Merged
hsliuustc0106 merged 18 commits into vllm-project:main from LudovicoYIN:fix-text-to-audio-output-and-docs
Jan 24, 2026

Conversation

@LudovicoYIN
Contributor

@LudovicoYIN LudovicoYIN commented Jan 19, 2026

Purpose

Standardize StableAudio diffusion output to request_output[0].multimodal_output["audio"] and update the text_to_audio example accordingly. Resolves #829.

Test Plan

  • python examples/offline_inference/text_to_audio/text_to_audio.py --prompt "A piano playing a gentle melody" --audio_length 10.0

Test Result

  • script completes and writes WAV output without AttributeError.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 14c9610c5b


Comment on lines +132 to +135

if candidate.request_output:
    if isinstance(candidate.request_output, list) and candidate.request_output:
        candidate = candidate.request_output[0]


P2: Handle RequestOutput multimodal audio directly

When omni.generate() returns an OmniRequestOutput from an LLM pipeline, request_output is a list of RequestOutput objects and the audio tensor lives on request_output[0].multimodal_output["audio"] (see tests/e2e/offline_inference/test_qwen2_5_omni.py:89-93). Here you overwrite candidate with request_output[0], but then only try candidate.request_output.multimodal_output, which a RequestOutput does not have, so the function raises ValueError even though audio exists. This breaks the example if a user passes a Qwen* Omni audio model via --model; consider checking candidate.multimodal_output (or avoiding the reassignment) so pipeline audio outputs are extracted correctly.
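The extraction order the review suggests can be sketched as follows. This is an illustrative sketch, not the example's actual code; `extract_audio` is a hypothetical name, and the attribute shapes are taken from the review comment above:

```python
# Illustrative sketch of the suggested fix: when request_output is a
# list of RequestOutput objects, read multimodal_output from the first
# element itself instead of re-reading candidate.request_output after
# the reassignment.
def extract_audio(candidate):
    request_output = getattr(candidate, "request_output", None)
    if isinstance(request_output, list) and request_output:
        # A RequestOutput carries the tensor on multimodal_output["audio"]
        candidate = request_output[0]
    multimodal_output = getattr(candidate, "multimodal_output", None) or {}
    audio = multimodal_output.get("audio")
    if audio is None:
        raise ValueError("No audio output found")
    return audio
```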


@LudovicoYIN LudovicoYIN force-pushed the fix-text-to-audio-output-and-docs branch from 14c9610 to 934d3a2 on January 19, 2026 at 08:03
@LudovicoYIN LudovicoYIN force-pushed the fix-text-to-audio-output-and-docs branch 2 times, most recently from 59bf1f3 to c50c875 on January 19, 2026 at 08:19
@LudovicoYIN LudovicoYIN force-pushed the fix-text-to-audio-output-and-docs branch from c50c875 to ffc1537 on January 19, 2026 at 08:29
if hasattr(request_output, "multimodal_output"):
    multimodal_output = request_output.multimodal_output or {}
    audio = multimodal_output.get("audio")
elif hasattr(request_output, "images") and request_output.images:
Collaborator


This file focuses on text to audio. No need to consider image output here.

Contributor Author


Thanks for your review. For StableAudio diffusion, however, the audio is returned via OmniRequestOutput.images, so without this branch the example fails with “No audio output found”. I verified that the audio is extracted from request_output[0].images[0] on this path.
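The two return shapes discussed in this thread can be sketched like so. `extract_audio_legacy` is a hypothetical name, and the attribute layout is taken from the discussion (pre-fix behavior), not from the final API:

```python
# Hedged sketch of the pre-fix dual extraction paths described here:
# Qwen Omni returns audio under multimodal_output["audio"], while
# StableAudio diffusion (before this PR) placed it under images[0].
def extract_audio_legacy(request_output):
    mm = getattr(request_output, "multimodal_output", None)
    if mm and mm.get("audio") is not None:
        return mm["audio"]
    images = getattr(request_output, "images", None)
    if images:
        # StableAudio placed the waveform here pre-fix
        return images[0]
    raise ValueError("No audio output found")
```

This PR removes the need for the second branch by standardizing on multimodal_output["audio"].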

Contributor Author


I’ve seen Qwen Omni audio under request_output[0].multimodal_output["audio"], while StableAudio diffusion returns it in request_output[0].images[0]—do we want to standardize on request_output[0].multimodal_output["audio"]?

Collaborator


> I’ve seen Qwen Omni audio under request_output[0].multimodal_output["audio"], while StableAudio diffusion returns it in request_output[0].images[0]—do we want to standardize on request_output[0].multimodal_output["audio"]?

Yes. It would be great if you could fix it.

Contributor Author


Thanks for confirming; I’ll fix it.

Contributor Author


> I’ve seen Qwen Omni audio under request_output[0].multimodal_output["audio"], while StableAudio diffusion returns it in request_output[0].images[0]—do we want to standardize on request_output[0].multimodal_output["audio"]?
>
> Yes. It would be great if you want to fix it

I’ve updated StableAudio diffusion outputs to use request_output[0].multimodal_output["audio"] and simplified the text_to_audio example accordingly. Please take another look.

@LudovicoYIN LudovicoYIN changed the title from “[BugFix] Fix text_to_audio output handling and gated model note” to “[BugFix] Standardize StableAudio audio output” on Jan 19, 2026
@hsliuustc0106
Collaborator

@linyueqian PTAL

@linyueqian
Collaborator

> @linyueqian PTAL

@hsliuustc0106 Should we move the output modality detection to the registry rather than hardcoding model names in the engine?

@LudovicoYIN
Contributor Author

> @hsliuustc0106 Should we move the output modality detection to the registry rather than hardcoding model names in the engine?

Maybe we can keep a small mapping in the diffusion registry and a helper, e.g.:

_DIFFUSION_OUTPUT_TYPES = {
    "StableAudioPipeline": "audio",
    # default for others: "image"
}

def get_diffusion_output_type(model_class_name: str) -> str:
    return _DIFFUSION_OUTPUT_TYPES.get(model_class_name, "image")

Then the engine can call this helper instead of hardcoding model names.

@ZJY0516
Collaborator

ZJY0516 commented Jan 20, 2026

FYI, we have a way to mark or detect whether a model accepts image inputs:

class SupportImageInput(Protocol):
    support_image_input: ClassVar[bool] = True

class QwenImageEditPlusPipeline(nn.Module, SupportImageInput):

@LudovicoYIN
Contributor Author

I’d like to propose an approach to avoid hard-coding model names for output handling.
The idea is to let model classes self-describe their output modality via a lightweight output_type attribute, using a small protocol for clarity.

Concretely:

  • interface.py: add

    @runtime_checkable
    class SupportOutputType(Protocol):
        output_type: ClassVar[str]
  • StableAudioPipeline: declare

    class StableAudioPipeline(nn.Module, SupportOutputType):
        output_type: ClassVar[str] = "audio"
  • diffusion_engine.py: replace model-name-based branching with

    model_cls = DiffusionModelRegistry._try_load_model_cls(self.od_config.model_class_name)
    output_type = getattr(model_cls, "output_type", "image")

    and branch on output_type == "audio" to consistently write audio outputs to multimodal_output["audio"].

I’m happy to implement this if the direction makes sense.
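Put together, the proposal above could look roughly like this self-contained sketch. The pipeline classes here are stand-ins (the real ones subclass nn.Module and live in the repo); only the protocol shape and the getattr default follow the proposal:

```python
from typing import ClassVar, Protocol, runtime_checkable

@runtime_checkable
class SupportOutputType(Protocol):
    """Model classes self-describe their output modality."""
    output_type: ClassVar[str]

# Stand-in pipeline classes for illustration:
class StableAudioPipeline:
    output_type: ClassVar[str] = "audio"

class SomeImagePipeline:
    pass  # no output_type declared

def get_output_type(model_cls: type) -> str:
    # Engine-side lookup: default to "image" when the class is silent,
    # replacing model-name-based branching.
    return getattr(model_cls, "output_type", "image")
```

The engine would then branch on get_output_type(model_cls) == "audio" to write audio to multimodal_output["audio"].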

@hsliuustc0106
Collaborator

> I’d like to propose an approach to avoid hard-coding model names for output handling. The idea is to let model classes self-describe their output modality via a lightweight output_type attribute, using a small protocol for clarity. […]

@ZJY0516 WDYT? This proposal LGTM.

@ZJY0516
Collaborator

ZJY0516 commented Jan 20, 2026

@LudovicoYIN LGTM. But Qwen3-Omni also supports audio output; we should align the output behavior.

@LudovicoYIN
Contributor Author

> @LudovicoYIN LGTM. But qwen3-omni also supports audio output. We'd better align behavior during output

@ZJY0516 Thanks for the review! I’ve verified that Qwen3‑Omni already outputs audio via request_output[0].multimodal_output["audio"]. This PR aligns StableAudio diffusion to the same output path so the behavior is consistent. Please let me know if you’d like any adjustments.

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


lgtm

@hsliuustc0106 hsliuustc0106 added the ready label (to trigger buildkite CI) on Jan 21, 2026
@LudovicoYIN
Contributor Author

I noticed that the outer OmniRequestOutput.final_output_type is still driven by the diffusion stage config (which defaults to image), so it doesn’t reflect audio even after we standardize StableAudio to multimodal_output["audio"]. The inner diffusion output already has final_output_type="audio".

models = ["linyueqian/stable_audio_random"]

@pytest.mark.parametrize("model_name", models)
def test_stable_audio_model(model_name: str):
    m = Omni(model=model_name)

    # Use minimal settings for testing
    # Generate a short 2-second audio clip with minimal inference steps
    audio_start_in_s = 0.0
    audio_end_in_s = 2.0  # Short duration for fast testing
    sample_rate = 44100  # Stable Audio uses 44100 Hz

    outputs = m.generate(
        "The sound of a dog barking",
        negative_prompt="Low quality.",
        num_inference_steps=4,  # Minimal steps for speed
        guidance_scale=7.0,
        generator=torch.Generator("cuda").manual_seed(42),
        num_outputs_per_prompt=1,
        extra={
            "audio_start_in_s": audio_start_in_s,
            "audio_end_in_s": audio_end_in_s,
        },
    )

    # Extract audio from OmniRequestOutput
    assert outputs is not None
    first_output = outputs[0]
    assert first_output.final_output_type == "image"

Do we want to follow up with a change to the diffusion stage config so that the outer final_output_type also reflects audio, or is it intentional to leave this as-is for now?

@LudovicoYIN
Contributor Author

@ZJY0516
I’ve addressed the requested change by switching to SupportAudioOutput and removing output_type.
Could you please re-review and let me know if it looks good now?

prefix: Weight prefix for loading (default: "")
"""

support_audio_output: bool = True
Collaborator


Perhaps there is no need to set this to True here.

Contributor Author


Thanks! I’ve removed the explicit support_audio_output = True here.

if not isinstance(outputs, list):
    outputs = [outputs] if outputs is not None else []

model_cls = DiffusionModelRegistry._try_load_model_cls(self.od_config.model_class_name)
Collaborator


Could you please define a util function for this?

Contributor Author


Done. I’ve added a small util function like def supports_image_input.
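Such a helper might look like the sketch below. The name `supports_audio_output` and its attribute check are assumptions modeled on the supports_image_input pattern mentioned in the thread; the merged helper may differ:

```python
# Hedged sketch of the utility helper: a pipeline class opts in to
# audio output via a truthy support_audio_output class attribute,
# mirroring the SupportImageInput marker pattern.
def supports_audio_output(model_cls: type) -> bool:
    return getattr(model_cls, "support_audio_output", False) is True
```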

@ZJY0516
Collaborator

ZJY0516 commented Jan 21, 2026

> I noticed that the outer OmniRequestOutput.final_output_type is still driven by the diffusion stage config (which defaults to image), so it doesn’t reflect audio even after we standardize StableAudio to multimodal_output["audio"]. The inner diffusion output already has final_output_type="audio". […] Do we want to follow up with a change to the diffusion stage config so that the outer final_output_type also reflects audio, or is it intentional to leave this as-is for now?

Adding a TODO will be fine. We'll refactor that soon.

@hsliuustc0106
Collaborator

How about video support?

@david6666666
Collaborator

Is this PR ready?

@ZJY0516
Collaborator

ZJY0516 commented Jan 22, 2026

@david6666666 @LudovicoYIN Let's wait for #797

@david6666666 david6666666 added this to the v0.14.0rc1 milestone Jan 22, 2026
@david6666666 david6666666 modified the milestones: v0.14.0rc1, v0.14.0 Jan 23, 2026
@@ -44,15 +44,14 @@ def test_stable_audio_model(model_name: str):
# Extract audio from OmniRequestOutput
Collaborator


@yenuo26 PTAL

@hsliuustc0106 hsliuustc0106 merged commit f9c69a8 into vllm-project:main Jan 24, 2026
7 checks passed
majiayu000 pushed a commit to majiayu000/vllm-omni that referenced this pull request Jan 26, 2026
Signed-off-by: LudovicoYIN <hankeyin@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: majiayu000 <1835304752@qq.com>
@LudovicoYIN LudovicoYIN deleted the fix-text-to-audio-output-and-docs branch February 2, 2026 03:09

Labels

ready label to trigger buildkite CI


Development

Successfully merging this pull request may close these issues.

[Bug]: AttributeError in text_to_audio.py & Request for User Guide

5 participants