Merged
121 changes: 25 additions & 96 deletions README.md
@@ -8,7 +8,7 @@ Traditional vLLM systems are limited to text-based, autoregressive generation. v

- **Multi-modal Models**: Text, image, video, audio, and sensor data processing
- **Non-autoregressive Architectures**: Diffusion Transformers (DiT) and other parallel generation models
- **Heterogeneous Outputs**: Beyond traditional text generation to structured, binary, and streaming outputs
- **Heterogeneous Outputs**: Beyond traditional text generation to multimodal outputs

## 🏗️ Architecture

@@ -28,119 +28,48 @@ vLLM-omni is built on a modular architecture that extends vLLM's core functional
- **Text**: Advanced tokenization and embedding generation
- **Image**: Vision encoder integration (CLIP, etc.)
- **Audio**: Speech processing and audio embedding
- **Video**: Frame-by-frame and temporal processing
- **Sensor**: IoT and sensor data interpretation

### Output Formats

- **Structured Data**: JSON, XML, and custom formats
- **Binary Outputs**: Images, audio, and video generation
- **Streaming**: Real-time progressive generation
- **Multipart**: Combined multi-modal responses

## 📋 Supported Models

### AR + Diffusion Transformer (DiT) Models
- Qwen-Image (Image generation and editing)
- Qwen-omni (Thinker-Talker-Codec structure)
- Custom DiT and hybrid architectures
- HunyuanImage 3.0 (Ongoing)
- Qwen-Image (Ongoing)

## 🛠️ Installation

### Quick Start

#### Option 1: Docker (Recommended for macOS)

```bash
# Clone the repository
git clone https://github.com/hsliuustc0106/vllm-omni.git
cd vllm-omni

# Run the automated Docker setup
./scripts/docker-setup-macos.sh
```

#### Option 2: Local Installation

```bash
# Clone the repository
git clone https://github.com/hsliuustc0106/vllm-omni.git
cd vllm-omni

# Run the installation script
./install.sh
```

### Prerequisites

- Python 3.11+ (recommended)
- Conda or Miniconda
- Git
- CUDA 11.8+ (for GPU acceleration) or CPU-only installation

### Installation Methods

#### Method 1: Automated Installation (Recommended)
Set up the basic environment:
```bash
# Using shell script
./install.sh

# Or using Python script
python install.py
uv venv --python 3.12 --seed
source .venv/bin/activate
```
Install the pinned version of vLLM at commit `808a7b69df479b6b3a16181711cac7ca28a9b941`:

#### Method 2: Manual Installation
```bash
# Create conda environment
conda create -n vllm_omni python=3.11 -y
conda activate vllm_omni

# Install PyTorch (CPU or GPU)
pip install torch>=2.7 --index-url https://download.pytorch.org/whl/cpu # CPU
# pip install torch>=2.7 --index-url https://download.pytorch.org/whl/cu121 # GPU

# Install dependencies
pip install -r requirements.txt
pip install "vllm>=0.10.2"

# Install vLLM-omni
pip install -e .
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 808a7b69df479b6b3a16181711cac7ca28a9b941
VLLM_USE_PRECOMPILED=1 uv pip install --editable .
```

### Verify Installation
## Run examples (Qwen2.5-omni)

Change into the example folder:
```bash
# Test the installation
python test_installation.py

# Test basic functionality
python -c "import vllm_omni; print('Ready!')"

# Test CLI
vllm --help
cd vllm_omni
cd examples/offline_inference/qwen2_5_omni
```

For detailed installation instructions, see [INSTALL.md](INSTALL.md).

## 📥 Model Download

Models are automatically downloaded when first used, or you can pre-download them:

Update `PYTHONPATH` in `run.sh` to point to your `vllm_omni` checkout, then run:
```bash
# Check downloaded models
python scripts/download_models.py --check-cache

# Download all default models
python scripts/download_models.py --all

# Download specific models
python scripts/download_models.py --ar-models Qwen/Qwen3-0.6B
python scripts/download_models.py --dit-models stabilityai/stable-diffusion-2-1
bash run.sh
```
The output audio is saved in `./output_audio`.
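To spot-check a generated file, the standard-library `wave` module is enough; the 24 kHz sample rate matches the `sf.write(..., samplerate=24000)` call in `end2end.py`. The sketch below writes a dummy file so it is self-contained — in practice you would open `./output_audio/output_<request_id>.wav` instead:

```python
import wave

# Write a dummy 24 kHz mono 16-bit file so this check is self-contained;
# replace "demo.wav" with a real output_<request_id>.wav to inspect it.
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)                    # 16-bit samples
    w.setframerate(24000)
    w.writeframes(b"\x00\x00" * 24000)   # one second of silence

with wave.open("demo.wav", "rb") as w:
    rate = w.getframerate()
    seconds = w.getnframes() / rate
    print(f"{rate} Hz, {seconds:.2f} s")  # → 24000 Hz, 1.00 s
```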

**Model Storage Location:**
- Default: `~/.cache/huggingface/hub/`
- AR models: 100MB - 1GB each
- DiT models: 2GB - 7GB each
## To-do list
- [x] Offline inference example for Qwen2.5-omni with single request
- [ ] Adaptation from current vllm branch to stable vllm v0.11.0
- [ ] Offline inference example for Qwen2.5-omni with streaming multiple requests
- [ ] Online inference support
- [ ] Support for other models

For detailed model management, see [MODEL_DOWNLOAD_GUIDE.md](docs/MODEL_DOWNLOAD_GUIDE.md).
For detailed model management, see [vllm_omni_design.md](docs/architecture/vllm_omni_design.md) and [high_level_arch_design.md](docs/architecture/high_level_arch_design.md).
37 changes: 37 additions & 0 deletions examples/offline_inference/qwen_2_5_omni/README.md
@@ -0,0 +1,37 @@
# Offline Example of vLLM-omni for Qwen2.5-omni

## Installation

Set up the basic environment:
```bash
uv venv --python 3.12 --seed
source .venv/bin/activate
```
Install the pinned version of vLLM at commit `808a7b69df479b6b3a16181711cac7ca28a9b941`:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 808a7b69df479b6b3a16181711cac7ca28a9b941
VLLM_USE_PRECOMPILED=1 uv pip install --editable .
```

## Run examples

Change into the example folder:
```bash
cd vllm_omni
cd examples/offline_inference/qwen2_5_omni
```
Update `PYTHONPATH` in `run.sh` to point to your `vllm_omni` checkout, then run:
```bash
bash run.sh
```
The output audio is saved in `./output_audio`.
**Collaborator:** do we have a test result?

**Collaborator Author:** The output is a .wav audio file. Do we need to add it to the example folder?

**Collaborator Author:** Already updated the test plan and result. Fixed.

## To-do list
- [x] Offline inference example for Qwen2.5-omni with single request
- [ ] Adaptation from current vllm branch to stable vllm v0.11.0
- [ ] Offline inference example for Qwen2.5-omni with streaming multiple requests
- [ ] Online inference support
- [ ] Support for other models
130 changes: 130 additions & 0 deletions examples/offline_inference/qwen_2_5_omni/end2end.py
@@ -0,0 +1,130 @@
import argparse
import os
import soundfile as sf
import random
import numpy as np
import torch

from vllm.sampling_params import SamplingParams

# Enable the vLLM v1 engine before vllm_omni is imported.
os.environ["VLLM_USE_V1"] = "1"

from vllm_omni.entrypoints.omni_llm import OmniLLM
from utils import make_omni_prompt


SEED = 42
# Set all random seeds
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

# Make PyTorch deterministic
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Set environment variables for deterministic behavior
os.environ["PYTHONHASHSEED"] = str(SEED)
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"


def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('--model', required=True, help='Path to merged model directory (will be created if downloading).')
parser.add_argument('--thinker-model', type=str, default=None)
parser.add_argument('--talker-model', type=str, default=None)
parser.add_argument('--code2wav-model', type=str, default=None)
parser.add_argument('--hf-hub-id', default='Qwen/Qwen2.5-Omni-7B', help='Hugging Face repo id to download if needed.')
parser.add_argument('--hf-revision', default=None, help='Optional HF revision (branch/tag/commit).')
parser.add_argument('--prompts', required=True, nargs='+', help='Input text prompts.')
parser.add_argument('--voice-type', default='default', help='Voice type, e.g., m02, f030, default.')
parser.add_argument('--code2wav-dir', default=None, help='Path to code2wav folder (contains spk_dict.pt).')
parser.add_argument('--dit-ckpt', default=None, help='Path to DiT checkpoint file (e.g., dit.pt).')
parser.add_argument('--bigvgan-ckpt', default=None, help='Path to BigVGAN checkpoint file.')
parser.add_argument('--dtype', default='bfloat16', choices=['float16', 'bfloat16', 'float32'])
parser.add_argument('--max-model-len', type=int, default=32768)

parser.add_argument("--thinker-only", action="store_true")
parser.add_argument("--text-only", action="store_true")
parser.add_argument("--do-wave", action="store_true")
parser.add_argument('--prompt_type',
choices=[
'text', 'audio', 'audio-long', 'audio-long-chunks',
'audio-long-expand-chunks', 'image', 'video',
'video-frames', 'audio-in-video', 'audio-in-video-v2',
"audio-multi-round", "badcase-vl", "badcase-text",
"badcase-image-early-stop", "badcase-two-audios",
"badcase-two-videos", "badcase-multi-round",
"badcase-voice-type", "badcase-voice-type-v2",
"badcase-audio-tower-1", "badcase-audio-only"
],
default='text')
parser.add_argument('--use-torchvision', action='store_true')
parser.add_argument('--tokenize', action='store_true')
parser.add_argument('--output-wav', default="output_audio", help='Output directory for generated wav files.')
parser.add_argument('--thinker-hidden-states-dir', default="thinker_hidden_states", help='Path to thinker hidden states directory.')
args = parser.parse_args()
return args


def main():
args = parse_args()
model_name = args.model
omni_llm = OmniLLM(model=model_name)
thinker_sampling_params = SamplingParams(
temperature=0.0, # Deterministic - no randomness
top_p=1.0, # Disable nucleus sampling
top_k=-1, # Disable top-k sampling
max_tokens=2048,
seed=SEED, # Fixed seed for sampling
detokenize=True,
repetition_penalty=1.1,
)
talker_sampling_params = SamplingParams(
temperature=0.0, # Deterministic - no randomness
top_p=1.0, # Disable nucleus sampling
top_k=-1, # Disable top-k sampling
max_tokens=2048,
seed=SEED, # Fixed seed for sampling
detokenize=True,
repetition_penalty=1.1,
stop_token_ids=[8294]
)
code2wav_sampling_params = SamplingParams(
temperature=0.0, # Deterministic - no randomness
top_p=1.0, # Disable nucleus sampling
top_k=-1, # Disable top-k sampling
max_tokens=2048,
seed=SEED, # Fixed seed for sampling
detokenize=True,
repetition_penalty=1.1,
)

sampling_params_list = [thinker_sampling_params,
talker_sampling_params,
code2wav_sampling_params]

prompts = [make_omni_prompt(args, prompt) for prompt in args.prompts]
omni_outputs = omni_llm.generate(prompts, sampling_params_list)

os.makedirs(args.output_wav, exist_ok=True)
for stage_outputs in omni_outputs:
if stage_outputs.final_output_type == "text":
for output in stage_outputs.request_output:
request_id = output.request_id
text_output = output.outputs[0].text
print(f"Request ID: {request_id}, Text Output: {text_output}")
elif stage_outputs.final_output_type == "audio":
for output in stage_outputs.request_output:
request_id = output.request_id
audio_tensor = output.multimodal_output["audio"]
output_wav = os.path.join(args.output_wav, f"output_{output.request_id}.wav")
sf.write(output_wav, audio_tensor.detach().cpu().numpy(), samplerate=24000)
print(f"Request ID: {request_id}, Saved audio to {output_wav}")


if __name__ == "__main__":
main()
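`run.sh` itself is not shown in this diff; a minimal invocation of `end2end.py`, with flags taken from `parse_args()` above (the model id, prompt, and `PYTHONPATH` value are placeholder assumptions), might look like:

```shell
# Assumed invocation sketch: adjust PYTHONPATH to your vllm_omni checkout,
# as the README notes. Flags mirror parse_args() in end2end.py.
export PYTHONPATH=/path/to/vllm_omni
CMD="python end2end.py --model Qwen/Qwen2.5-Omni-7B --prompts 'Say hello in one sentence.' --prompt_type text --output-wav output_audio"
echo "$CMD"
```

Generated wav files then land in `output_audio/output_<request_id>.wav`, one per request.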
Binary file not shown.