# Audio Generate API

vLLM-Omni provides an API for text-to-audio generation using diffusion-based models such as Stable Audio.

Unlike the [Speech API](speech_api.md) which targets text-to-speech synthesis, the Audio Generate API is designed for general-purpose audio generation from text descriptions (sound effects, music, ambient soundscapes, etc.).

Each server instance runs a single model (specified at startup via `vllm-omni serve <model> --omni`).

## Quick Start

### Start the Server

```bash
vllm-omni serve stabilityai/stable-audio-open-1.0 \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --enforce-eager \
    --omni
```

### Generate Audio

**Using curl:**

```bash
curl -X POST http://localhost:8000/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "The sound of a cat purring",
        "audio_length": 10.0
    }' --output cat.wav
```

**Using Python:**

```python
import httpx

response = httpx.post(
    "http://localhost:8000/v1/audio/generate",
    json={
        "input": "The sound of a cat purring",
        "audio_length": 10.0,
    },
    timeout=300.0,
)

with open("cat.wav", "wb") as f:
    f.write(response.content)
```

## API Reference

### Endpoint

```
POST /v1/audio/generate
Content-Type: application/json
```

### Request Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `input` | string | **required** | Text prompt describing the audio to generate |
| `model` | string | server's model | Model name (optional; if specified, it must match the model the server was launched with) |
| `response_format` | string | "wav" | Audio format: wav, mp3, flac, pcm, aac, opus |
| `speed` | float | 1.0 | Playback speed (0.25 - 4.0) |

#### Diffusion Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `audio_length` | float | null | Audio duration in seconds (default value is the max ~47s for `stable-audio-open-1.0`) |
| `audio_start` | float | 0.0 | Audio start time in seconds |
| `negative_prompt` | string | null | Text describing what to avoid in generation |
| `guidance_scale` | float | model default | Classifier-free guidance scale (higher = more adherence to prompt) |
| `num_inference_steps` | int | model default | Number of denoising steps (higher = better quality, slower) |
| `seed` | int | null | Random seed for reproducible generation |

### Response Format

Returns binary audio data with the appropriate `Content-Type` header:

| `response_format` | Content-Type |
|--------------------|--------------|
| `wav` | `audio/wav` |
| `mp3` | `audio/mpeg` |
| `flac` | `audio/flac` |
| `pcm` | `audio/pcm` |
| `aac` | `audio/aac` |
| `opus` | `audio/opus` |
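The table above can double as a client-side lookup for choosing an output filename and sanity-checking the returned header. The helper names below are assumptions for illustration:

```python
# Lookup mirroring the Content-Type table above (assumed helper, not part
# of vLLM-Omni).

CONTENT_TYPES = {
    "wav": "audio/wav",
    "mp3": "audio/mpeg",
    "flac": "audio/flac",
    "pcm": "audio/pcm",
    "aac": "audio/aac",
    "opus": "audio/opus",
}


def output_filename(stem: str, response_format: str) -> str:
    """Pick a filename extension matching the requested format."""
    if response_format not in CONTENT_TYPES:
        raise ValueError(f"unknown format: {response_format!r}")
    return f"{stem}.{response_format}"


def check_content_type(header: str, response_format: str) -> bool:
    """Compare only the media type, ignoring any ;charset=... parameters."""
    return header.split(";")[0].strip() == CONTENT_TYPES[response_format]
```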

## Examples

### Basic Generation

Generate audio with only a text prompt (model defaults for all other parameters):

```bash
curl -X POST http://localhost:8000/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "The sound of ocean waves crashing on a beach"
    }' --output ocean.wav
```

### Custom Duration

Specify an explicit audio length in seconds:

```bash
curl -X POST http://localhost:8000/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "A dog barking",
        "audio_length": 5.0
    }' --output dog_5s.wav
```

### High Quality with Negative Prompt

Use a negative prompt to steer generation away from undesired characteristics, and increase inference steps for higher quality:

```bash
curl -X POST http://localhost:8000/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "A piano playing a gentle melody",
        "audio_length": 10.0,
        "negative_prompt": "Low quality, distorted, noisy",
        "guidance_scale": 8.0,
        "num_inference_steps": 150
    }' --output piano_hq.wav
```

### Reproducible Generation

Set a `seed` to get deterministic results across runs:

```bash
curl -X POST http://localhost:8000/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Thunder and rain sounds",
        "audio_length": 15.0,
        "seed": 42
    }' --output thunder.wav
```

### Full Control

Combine all parameters for precise control over generation:

```bash
curl -X POST http://localhost:8000/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Thunder and rain sounds",
        "audio_length": 15.0,
        "negative_prompt": "Low quality",
        "guidance_scale": 7.0,
        "num_inference_steps": 100,
        "seed": 42
    }' --output thunder_rain.wav
```

### Quick Generation (Fewer Steps)

For faster generation with slightly lower quality:

```bash
curl -X POST http://localhost:8000/v1/audio/generate \
    -H "Content-Type: application/json" \
    -d '{
        "input": "Birds chirping in a forest",
        "audio_length": 8.0,
        "num_inference_steps": 50
    }' --output birds_quick.wav
```

### Python Client

```python
import httpx

response = httpx.post(
    "http://localhost:8000/v1/audio/generate",
    json={
        "input": "Thunder and rain",
        "audio_length": 15.0,
        "negative_prompt": "Low quality",
        "guidance_scale": 7.0,
        "num_inference_steps": 100,
        "seed": 42,
        "response_format": "wav",
    },
    timeout=300.0,
)

with open("thunder.wav", "wb") as f:
    f.write(response.content)
```

## Parameter Tuning Guide

### `guidance_scale`

Controls how closely the generated audio follows the text prompt.

| Range | Behaviour |
|-------|-----------|
| 3 - 5 | More creative / varied output |
| ~7 (typical default) | Balanced adherence |
| 10+ | Strict adherence to the prompt |

### `num_inference_steps`

Controls the number of denoising steps in the diffusion process.

| Steps | Quality | Speed | Use Case |
|-------|---------|-------|----------|
| 50 | Good | Fast | Quick previews |
| 100 | Very Good | Medium | General purpose |
| 150+ | Excellent | Slow | Final / critical audio |
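The two tuning tables above suggest a few named presets. The preset names and the `tuned_payload` helper below are illustrative assumptions, with values taken from the tables and the high-quality example earlier in this page:

```python
# Illustrative quality presets (assumed names) mirroring the tuning tables:
# fewer steps for quick previews, more steps (plus stronger guidance) for
# final renders.

PRESETS = {
    "preview": {"num_inference_steps": 50},
    "general": {"num_inference_steps": 100},
    "final": {"num_inference_steps": 150, "guidance_scale": 8.0},
}


def tuned_payload(prompt: str, preset: str = "general") -> dict:
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    return {"input": prompt, **PRESETS[preset]}
```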

### `audio_length`

Duration of the generated audio clip. For `stable-audio-open-1.0`, the maximum is approximately 47 seconds. If omitted, the model uses its own default length.

### `negative_prompt`

Describes characteristics to avoid. Common negative prompts include:

- `"Low quality, distorted, noisy"`
- `"Silence, static"`
- `"Music"` (when generating sound effects only)

## Supported Models

| Model | Description |
|-------|-------------|
| `stabilityai/stable-audio-open-1.0` | Open-source audio generation model, up to ~47 seconds, 44.1 kHz stereo |

## Error Responses

### 400 Bad Request

Invalid or missing parameters:

```json
{
"error": {
"message": "Audio generation model did not produce audio output.",
"type": "BadRequestError",
"param": null,
"code": 400
}
}
```

### 404 Not Found

Model mismatch:

```json
{
"error": {
"message": "The model `xxx` does not exist.",
"type": "NotFoundError",
"param": "model",
"code": 404
}
}
```

### 422 Unprocessable Entity

Pydantic validation failure (e.g. invalid `response_format`, `speed` out of range):

```json
{
"detail": [
{
"type": "literal_error",
"msg": "Input should be 'wav', 'pcm', 'flac', 'mp3', 'aac' or 'opus'",
...
}
]
}
```
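A client has to handle both error shapes shown above: OpenAI-style bodies with a top-level `error` object (400/404) and FastAPI/Pydantic validation bodies with a top-level `detail` list (422). A minimal sketch (the `error_summary` helper is an assumption, not part of vLLM-Omni):

```python
# Sketch of client-side handling for the two error body shapes documented
# above (assumed helper).

def error_summary(body: dict) -> str:
    if "error" in body:
        # OpenAI-style: {"error": {"message": ..., "type": ..., "code": ...}}
        err = body["error"]
        return f"{err.get('type', 'Error')}: {err.get('message', '')}"
    if "detail" in body:
        # FastAPI/Pydantic: {"detail": [{"type": ..., "msg": ...}, ...]}
        msgs = [item.get("msg", "") for item in body["detail"]]
        return "ValidationError: " + "; ".join(msgs)
    return "Unknown error"
```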

## Troubleshooting

### "Audio generation model did not produce audio output"

The model finished but returned no audio data. Verify the server started successfully and the model loaded without errors.

### Server Not Responding

```bash
# Check if the server is healthy
curl http://localhost:8000/health
```
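Scripts that start the server and immediately send requests may want to poll `/health` until it answers. A minimal readiness loop, sketched with an injectable probe so the logic does not depend on a running server (the `wait_until_healthy` name is an assumption; in practice the probe would be something like `lambda: httpx.get("http://localhost:8000/health").status_code == 200`):

```python
import time
from typing import Callable


# Minimal readiness poll (a sketch). `probe` should return True once the
# server answers GET /health; connection errors are treated as "not up yet".
def wait_until_healthy(
    probe: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 1.0,
) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if probe():
                return True
        except Exception:
            pass  # server not reachable yet; keep polling
        time.sleep(interval)
    return False
```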

### Audio Quality Issues

- Increase `num_inference_steps` (e.g. 150).
- Add a negative prompt: `"Low quality, distorted, noisy"`.
- Increase `guidance_scale` for stronger prompt adherence.

### Generation Timeout

- Reduce `num_inference_steps`.
- Reduce `audio_length`.
- Check GPU memory with `nvidia-smi`.

### Out of Memory

- Lower `--gpu-memory-utilization` (e.g. 0.8).
- Reduce `audio_length`.

## Development

Enable debug logging:

```bash
vllm-omni serve stabilityai/stable-audio-open-1.0 \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --enforce-eager \
    --omni \
    --uvicorn-log-level debug
```