[BugFix] Standardize StableAudio audio output #842
hsliuustc0106 merged 18 commits into vllm-project:main from
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 14c9610c5b
```python
if candidate.request_output:
    if isinstance(candidate.request_output, list) and candidate.request_output:
        candidate = candidate.request_output[0]
```
Handle RequestOutput multimodal audio directly
When omni.generate() returns an OmniRequestOutput from an LLM pipeline, request_output is a list of RequestOutput objects and the audio tensor lives at request_output[0].multimodal_output["audio"] (see tests/e2e/offline_inference/test_qwen2_5_omni.py:89-93). Here you overwrite candidate with request_output[0], but then only check candidate.request_output.multimodal_output, which a RequestOutput does not have, so the function raises ValueError even though audio exists. This breaks the example when a user passes a Qwen* Omni audio model via --model; consider checking candidate.multimodal_output (or avoiding the reassignment) so pipeline audio outputs are extracted correctly.
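A minimal sketch of the suggested fix, assuming the shapes described above (the helper name and exact example code are illustrative, not the PR's actual code):

```python
def extract_audio(candidate):
    """Return the audio tensor from an Omni-style output, or None."""
    inner = getattr(candidate, "request_output", None)
    if isinstance(inner, list) and inner:
        # LLM pipelines return a list of RequestOutput objects;
        # the audio lives on the first element.
        candidate = inner[0]
    # Read multimodal_output on the (possibly unwrapped) object itself,
    # not on candidate.request_output, which RequestOutput lacks.
    multimodal = getattr(candidate, "multimodal_output", None) or {}
    return multimodal.get("audio")
```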
Force-pushed 14c9610 to 934d3a2
Signed-off-by: LudovicoYIN <hankeyin@gmail.com>
Force-pushed 59bf1f3 to c50c875
Force-pushed c50c875 to ffc1537
```python
if hasattr(request_output, "multimodal_output"):
    multimodal_output = request_output.multimodal_output or {}
    audio = multimodal_output.get("audio")
elif hasattr(request_output, "images") and request_output.images:
```
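Once the standardization this PR proposes lands, the extraction could collapse to a single path; a hedged sketch (helper name is illustrative):

```python
def get_audio(request_output):
    # Assumes every audio model exposes multimodal_output["audio"]
    # after this PR, so no images fallback is needed.
    multimodal = getattr(request_output, "multimodal_output", None) or {}
    audio = multimodal.get("audio")
    if audio is None:
        raise ValueError("No audio output found")
    return audio
```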
This file focuses on text to audio. No need to consider image output here.
Thanks for your review. But for StableAudio diffusion, the audio is returned via OmniRequestOutput.images, so without this branch the example fails with "No audio output found". I verified that the audio is extracted from request_output[0].images[0] on this path.
I’ve seen Qwen Omni audio under request_output[0].multimodal_output["audio"], while StableAudio diffusion returns it in request_output[0].images[0]—do we want to standardize on request_output[0].multimodal_output["audio"]?
Yes. It would be great if you want to fix it
Thanks for confirming; I'll fix it.
I’ve updated StableAudio diffusion outputs to use request_output[0].multimodal_output["audio"] and simplified the text_to_audio example accordingly. Please take another look.
@linyueqian PTAL

@hsliuustc0106 Should we move the output modality detection to the registry rather than hardcoding model names in the engine?
Maybe we can keep a small mapping in the diffusion registry and a helper, e.g.:

```python
_DIFFUSION_OUTPUT_TYPES = {
    "StableAudioPipeline": "audio",
    # default for others: "image"
}

def get_diffusion_output_type(model_class_name: str) -> str:
    return _DIFFUSION_OUTPUT_TYPES.get(model_class_name, "image")
```

Then the engine can call this helper instead of hardcoding model names.
FYI, we have a way to mark or detect whether a model accepts image inputs: vllm-omni/vllm_omni/diffusion/models/interface.py, lines 11-12 at 3fc4f98.
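Without having the interface.py contents in this thread, a marker-attribute approach along those lines might look roughly like this (all names here are hypothetical, not the actual interface):

```python
class SupportsAudioOutput:
    """Marker mixin: the diffusion model produces audio output."""
    supports_audio_output: bool = True

def model_supports_audio_output(model_cls: type) -> bool:
    # Inspect a class attribute instead of matching model names.
    return bool(getattr(model_cls, "supports_audio_output", False))

# Illustrative model classes carrying (or lacking) the marker:
class AudioPipeline(SupportsAudioOutput):
    pass

class ImagePipeline:
    pass
```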
I’d like to propose an approach to avoid hard-coding model names for output handling. Concretely:
I'm happy to implement this if the direction makes sense.
@ZJY0516 WDYT? This proposal LGTM.
@LudovicoYIN LGTM. But Qwen3-Omni also supports audio output; we'd better align output behavior across models.
@ZJY0516 Thanks for the review! I've verified that Qwen3-Omni already outputs audio via request_output[0].multimodal_output["audio"]. This PR aligns StableAudio diffusion to the same output path so the behavior is consistent. Please let me know if you'd like any adjustments.
I noticed that the outer OmniRequestOutput.final_output_type is still driven by the diffusion stage config (which defaults to image), so it doesn't reflect audio even after we standardize StableAudio to multimodal_output["audio"]. The inner diffusion output already has final_output_type="audio".

```python
models = ["linyueqian/stable_audio_random"]

@pytest.mark.parametrize("model_name", models)
def test_stable_audio_model(model_name: str):
    m = Omni(model=model_name)
    # Use minimal settings for testing:
    # generate a short 2-second audio clip with minimal inference steps.
    audio_start_in_s = 0.0
    audio_end_in_s = 2.0  # Short duration for fast testing
    sample_rate = 44100  # Stable Audio uses 44100 Hz
    outputs = m.generate(
        "The sound of a dog barking",
        negative_prompt="Low quality.",
        num_inference_steps=4,  # Minimal steps for speed
        guidance_scale=7.0,
        generator=torch.Generator("cuda").manual_seed(42),
        num_outputs_per_prompt=1,
        extra={
            "audio_start_in_s": audio_start_in_s,
            "audio_end_in_s": audio_end_in_s,
        },
    )
    # Extract audio from OmniRequestOutput
    assert outputs is not None
    first_output = outputs[0]
    assert first_output.final_output_type == "image"
```

Do we want to follow up with a change to the diffusion stage config so that the outer final_output_type also reflects audio, or is it intentional to leave this as-is for now?
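A possible follow-up, reusing the per-model mapping proposed earlier in this thread (function and field names here are illustrative assumptions, not existing APIs), would derive the outer final_output_type from the model class rather than from the stage default:

```python
_DIFFUSION_OUTPUT_TYPES = {
    "StableAudioPipeline": "audio",
}

def resolve_final_output_type(model_class_name: str,
                              stage_default: str = "image") -> str:
    # Prefer the per-model output type; fall back to the stage config
    # default, which is "image" today.
    return _DIFFUSION_OUTPUT_TYPES.get(model_class_name, stage_default)
```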
@ZJY0516 |
```python
        prefix: Weight prefix for loading (default: "")
    """
```

```python
support_audio_output: bool = True
```
Perhaps there's no need to set this to True here.
Thanks! I've removed the explicit support_audio_output = True here.
```python
if not isinstance(outputs, list):
    outputs = [outputs] if outputs is not None else []
```

```python
model_cls = DiffusionModelRegistry._try_load_model_cls(self.od_config.model_class_name)
```
Could you please define a util function for this?
Done. I've added a small util function, e.g. def supports_image_input.
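For reference, a minimal version of such a util could look like this (a sketch only; the helper actually added in the PR may differ):

```python
def supports_image_input(model_cls: type) -> bool:
    # Assumes models advertise the capability via a class attribute,
    # following the marker pattern referenced from
    # vllm_omni/diffusion/models/interface.py.
    return bool(getattr(model_cls, "supports_image_input", False))

# Illustrative classes to show the check in use:
class ImageEditPipeline:
    supports_image_input = True

class TextToAudioPipeline:
    pass
```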
Adding a TODO will be fine. We'll refactor this soon.
How about video support?
Is this PR ready?
@david6666666 @LudovicoYIN Let's wait for #797
```diff
@@ -44,15 +44,14 @@ def test_stable_audio_model(model_name: str):
     # Extract audio from OmniRequestOutput
```
Signed-off-by: LudovicoYIN <hankeyin@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: majiayu000 <1835304752@qq.com>
Purpose

Standardize StableAudio diffusion output to request_output[0].multimodal_output["audio"] and update the text_to_audio example accordingly. Resolves #829.

Test Plan

Test Result