Add online serving to Stable Audio Diffusion and introduce v1/audio/generate endpoint #1255
ekagra-ranjan wants to merge 26 commits into vllm-project:main

Conversation
Signed-off-by: Ekagra Ranjan <[email protected]>
Signed-off-by: Ekagra Ranjan <[email protected]>
… er-stable-audio-online
Signed-off-by: Ekagra Ranjan <[email protected]>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ea38d5aeb8
I just saw this comment. Let me know which examples I should delete in this PR.

I would not call it TTS since it is an audio generation model, not capable of generating speech.
Tested locally with a local checkpoint path. The Stable Audio specific params (

We should detect the model type from the config/architecture instead of a name substring. Separately, since Stable Audio is an audio generation model (not speech/TTS), should we serve it under a different endpoint like
```python
elif self._is_stable_audio_model():
    # Handle Stable Audio models
    # Stable Audio uses diffusion, needs different parameters
    default_sr = 44100  # Default sample rate for Stable Audio
```
I am not 100% sure how to merge this block with is_tts_model(). As of now the is_tts_model() block is very Qwen3-specific, with its prompt template and "additional_information", so I think there would be some model-specific if-else. I don't know of an existing standardization across the parameters; maybe that can happen later when standardization lands in vllm-omni?
I'd suggest moving the Stable Audio logic to a separate code path for now, like a diffusion-specific branch, rather than mixing it into the TTS flow. They're fundamentally different (autoregressive TTS vs diffusion generation), and trying to unify them now would be forced. We can revisit standardization later when we have a clearer picture.
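The separation suggested above can be sketched as a small dispatcher: diffusion audio generation and autoregressive TTS get distinct branches instead of one merged flow. Handler and model-type names here are illustrative assumptions, not the PR's actual identifiers.

```python
def handle_diffusion_generation(prompt: str) -> str:
    # Stand-in for the diffusion-specific serving path.
    return f"diffusion:{prompt}"

def handle_tts(prompt: str) -> str:
    # Stand-in for the autoregressive TTS serving path.
    return f"tts:{prompt}"

def route_audio_request(model_type: str, prompt: str) -> str:
    if model_type == "StableAudioPipeline":            # diffusion branch
        return handle_diffusion_generation(prompt)
    if model_type in ("qwen3_omni_moe", "qwen3_tts"):  # autoregressive TTS branch
        return handle_tts(prompt)
    raise ValueError(f"unsupported model type: {model_type}")

print(route_audio_request("StableAudioPipeline", "rain"))  # diffusion:rain
```

Keeping the branches separate means each path can evolve its own parameter handling until a shared abstraction is worth it.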
I didn't get this part. Are you suggesting that we keep the current PR code as is, i.e., not try to merge this block with is_tts_model()?
Got it, that makes sense given what you are observing.
…_type code across stage config loading. Avoid inplace change in default sampling arg Signed-off-by: Ekagra Ranjan <[email protected]>
Signed-off-by: Ekagra Ranjan <[email protected]>
vllm_omni/entrypoints/omni.py
Outdated
```python
tokenizer = kwargs.get("tokenizer", None)

base_engine_args = {"tokenizer": tokenizer} if tokenizer is not None else None
self.model_type = resolve_model_type(model)
```
I added this to get an identifier that can be relied on when a local path to the model is used. After adding this, I realised that resolve_model_config_path() and load_stage_configs_from_model() share some operations, so I refactored them to reuse the intermediate variables.

@linyueqian - Please let me know if there was a better way to use an existing identifier, in case I missed it.
For Qwen3-Omni and Qwen3-TTS, we get model type through engine_client.model_config.hf_config which vLLM populates from config.json at init time, so it works with local paths out of the box. Could you check if the same approach works here instead of adding a separate resolution step?
I gave the engine_client.model_config route a shot but it doesn't work for StableAudio. I believe this is the reason, i.e., a diffusion model may have model_config as None.
This is an interesting point. My understanding is that StableAudio is similar to any other non-streaming TTS model, where the input is text and the output is audio. It is correct that the audio is not speech, but it's still audio. If we go with

The primary objective of this PR was to support pure diffusion text-to-audio models with the speech endpoint, but I am happy to introduce a new endpoint if you think that is the right approach.
Pull request overview
Adds OpenAI-compatible online serving support for diffusion-based text-to-audio models (specifically Stable Audio), extending the existing /v1/audio/speech endpoint beyond Qwen3-TTS. This includes request schema extensions, serving-path routing for Stable Audio parameters (including 44.1kHz defaults), and new end-to-end usage examples.
Changes:
- Extend `OpenAICreateSpeechRequest` with Stable Audio/diffusion-specific parameters (negative prompt, guidance scale, inference steps, seed, audio length/start).
- Add Stable Audio handling in `OmniOpenAIServingSpeech.create_speech`, plus diffusion-mode server initialization wiring.
- Refactor stage-config discovery to resolve `model_type` separately, and add Stable Audio online serving docs + client examples.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| vllm_omni/entrypoints/utils.py | Splits model type resolution from stage-config path resolution; adjusts stage config loading signature. |
| vllm_omni/entrypoints/openai/serving_speech.py | Adds Stable Audio diffusion parameter handling and default sample rate selection. |
| vllm_omni/entrypoints/openai/protocol/audio.py | Adds Stable Audio-specific request fields to the OpenAI speech request model. |
| vllm_omni/entrypoints/openai/audio_utils_mixin.py | Adds handling for stereo tensors shaped as [channels, samples]. |
| vllm_omni/entrypoints/openai/api_server.py | Adds diffusion-only openai_serving_models fallback and initializes speech serving for pure diffusion mode. |
| vllm_omni/entrypoints/omni_llm.py | Updates stage-config loading to use resolve_model_type + config path. |
| vllm_omni/entrypoints/omni.py | Updates stage initialization to use resolve_model_type + config path. |
| tests/entrypoints/test_omni_llm.py | Updates mocks to match the new load_stage_configs_from_model(config_path=...) signature. |
| tests/entrypoints/test_omni_diffusion.py | Updates mocks to match the new load_stage_configs_from_model(config_path=...) signature. |
| examples/online_serving/stable_audio/stable_audio_client.py | Adds a Python client example for /v1/audio/speech Stable Audio usage. |
| examples/online_serving/stable_audio/curl_examples.sh | Adds curl examples for Stable Audio online serving. |
| examples/online_serving/stable_audio/README.md | Adds Stable Audio online serving documentation and usage guide. |
Agreed, I think
Co-authored-by: Copilot <[email protected]> Signed-off-by: Ekagra Ranjan <[email protected]>
Signed-off-by: Ekagra Ranjan <[email protected]>
@Gaohan123 @linyueqian - I've added a new endpoint. I plan to add tests similar to
Signed-off-by: Ekagra Ranjan <[email protected]>
Thanks for the updates @ekagra-ranjan, the separation looks much cleaner now. Before merging, could you add tests for

Also, the PR title still says "TTS"; worth updating since we agreed Stable Audio is audio generation. And it would be good to document the default behavior when
v1/audio/generate endpoint
Signed-off-by: Ekagra Ranjan <[email protected]>
```python
negative_prompt: str | list[str] | None = None,
audio_end_in_s: float | None = None,
audio_start_in_s: float = 0.0,
num_inference_steps: int = 100,
```
Signed-off-by: Ekagra Ranjan <[email protected]>
@linyueqian - I've added the tests, an Audio Generate API doc similar to this, and an online serving doc similar to this. Please have a look when you can!
linyueqian
left a comment
LGTM, tested locally with stable-audio-open-1.0 and the generated audio sounds reasonable.
@vllm-omni-reviewer
Signed-off-by: Ekagra Ranjan <[email protected]>
Signed-off-by: Ekagra Ranjan <[email protected]>
I have updated the PR and resolved the merge conflict after the recent changes in #939. Please have a look. cc: @hsliuustc0106
Signed-off-by: Ekagra Ranjan <[email protected]>
@hsliuustc0106 @linyueqian - wondering if we can merge this PR?
linyueqian
left a comment
Reviewed the PR. Found a few issues inline -- please take a look before merging.
```python
    Any attribute OpenAIServing tries to access but we don't explicitly define
    will safely resolve to None.
    """
    return self._Unsupported(name)
```
The docstring above says unknown attributes "will safely resolve to None", but this actually returns an _Unsupported object that is truthy and raises NotImplementedError on call or attribute access. If any caller checks if self.models.some_attr: it will pass, then explode when the object is used. If the intent is intentional fail-loudly behavior, the docstring should say so. If you want true safe null behavior, return None here instead.
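The mismatch described above can be demonstrated with simplified stand-ins (these are not the PR's real classes): the truthy sentinel passes an `if` check and then explodes, while a `__getattr__` returning `None` matches the "safely resolve to None" docstring.

```python
class _Unsupported:
    """Truthy sentinel: survives `if` checks, raises when actually used."""
    def __init__(self, name: str):
        self._name = name
    def __call__(self, *args, **kwargs):
        raise NotImplementedError(f"{self._name} is not supported in diffusion mode")

class _FailLoudModels:
    def __getattr__(self, name: str):
        return _Unsupported(name)  # what the code above actually does

class _SafeNullModels:
    def __getattr__(self, name: str):
        return None  # what the docstring promises

loud = _FailLoudModels()
print(bool(loud.renderer))  # True -- a truthiness check does not protect callers
quiet = _SafeNullModels()
print(quiet.renderer)       # None -- `if self.models.renderer:` is now safe
```

Either behavior can be reasonable; the fix is making the docstring and the implementation agree.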
True, the comment was not updated. I'll update the comment.
vllm_omni/entrypoints/utils.py
Outdated
```python
        model: Model name or path

    Returns:
        Model type string if found, None otherwise
```
The return annotation is -> str but the docstring says "None otherwise". This function never returns None -- it raises ValueError when the model type cannot be resolved. Please fix the docstring to say "Raises ValueError if model type cannot be determined" to avoid misleading callers.
vllm_omni/entrypoints/utils.py
Outdated
```diff
@@ -308,7 +315,7 @@ def load_and_resolve_stage_configs(
     """
     if stage_configs_path is None:
         config_path = resolve_model_config_path(model)
```
This passes the raw HuggingFace model path (e.g. stabilityai/stable-audio-open-1.0) to resolve_model_config_path, but that function was refactored to expect a model_type string (e.g. StableAudioPipeline). The sibling change in omni_llm.py calls resolve_model_type(model) first and passes the result -- this call site in load_and_resolve_stage_configs was not updated the same way. As a result config_path will silently be None for any model specified by HF path, falling through to the default_stage_cfg_factory. For omni.py the factory saves you, but the returned config_path=None stored in self.config_path could cause silent failures downstream. Should be resolve_model_config_path(resolve_model_type(model)).
You are right about it. Making the change.
```python
request_id = f"audiogen-{random_uuid()}"

try:
    sampling_params_list = self.engine_client.default_sampling_params_list
```
This assignment is dead code -- sampling_params_list is unconditionally overwritten on line 74 with a fresh [OmniDiffusionSamplingParams(...)]. Please remove this line.
Signed-off-by: Ekagra Ranjan <[email protected]>
… er-stable-audio-online
@linyueqian - thank you for the recent comments! These have been addressed now.

LGTM!
Purpose
Add the `/v1/audio/generate` endpoint as per this. As of now, only Qwen3 TTS was supported for online serving. This PR adds support for pure diffusion text-to-audio models like Stable Audio to online serving.
1. Added Stable Audio-specific parameters to `OpenAICreateAudioGenerateRequest`
   - Extend the protocol with Stable Audio-specific params
   - File: `vllm_omni/entrypoints/openai/protocol/audio.py`

2. Serving Logic
   - File: `vllm_omni/entrypoints/openai/serving_audio_generate.py` - relevant logic to enable `/v1/audio/generate`
   - File: `vllm_omni/entrypoints/openai/api_server.py` - register `/v1/audio/generate`
   - `_DiffusionServingModels` now mocks missing attributes like `input_processor`, `model_config`, and `renderer` from `OpenAIServingModels`, which is needed since `OmniOpenAIServingSpeech` inherits `OpenAIServing`. The mock means `_DiffusionServingModels` does not have to be updated every time we upgrade the vLLM version, since upgrades can add new attributes to `OpenAIServingModels` (e.g. `renderer` was newly added in vLLM 0.15).

3. Documentation and Examples
Created a complete example suite:
- `examples/online_serving/stable_audio/README.md` - Full documentation
- `examples/online_serving/stable_audio/curl_examples.sh` - Shell script examples
- `examples/online_serving/stable_audio/stable_audio_client.py` - Python client
- `docs/serving/audio_generate_api.md` - `v1/audio/generate` API doc
- `docs/user_guide/examples/online_serving/text_to_audio.md` - Online serving user guide doc

4. Add test
File: `tests/entrypoints/openai_api/test_serving_audio_generate.py`

Test Plan
- Start the server and run `curl_examples.sh`
- Unit test: `pytest tests/entrypoints/openai_api/test_serving_audio_generate.py` passes

Test Result
dog_5s.wav
ocean.wav
thunder_rain.wav
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.