Merged
121 changes: 25 additions & 96 deletions README.md
@@ -8,7 +8,7 @@ Traditional vLLM systems are limited to text-based, autoregressive generation. v

- **Multi-modal Models**: Text, image, video, audio, and sensor data processing
- **Non-autoregressive Architectures**: Diffusion Transformers (DiT) and other parallel generation models
- **Heterogeneous Outputs**: Beyond traditional text generation to structured, binary, and streaming outputs
- **Heterogeneous Outputs**: Beyond traditional text generation to multimodal outputs

## 🏗️ Architecture

@@ -28,119 +28,48 @@ vLLM-omni is built on a modular architecture that extends vLLM's core functional
- **Text**: Advanced tokenization and embedding generation
- **Image**: Vision encoder integration (CLIP, etc.)
- **Audio**: Speech processing and audio embedding
- **Video**: Frame-by-frame and temporal processing
- **Sensor**: IoT and sensor data interpretation

### Output Formats

- **Structured Data**: JSON, XML, and custom formats
- **Binary Outputs**: Images, audio, and video generation
- **Streaming**: Real-time progressive generation
- **Multipart**: Combined multi-modal responses

## 📋 Supported Models

### AR + Diffusion Transformer (DiT) Models
- Qwen-Image (Image generation and editing)
- Qwen-omni (Thinker-Talker-Codec structure)
- Custom DiT and hybrid architectures
- HunyuanImage 3.0 (Ongoing)
- Qwen-Image (Ongoing)

## 🛠️ Installation

### Quick Start

#### Option 1: Docker (Recommended for macOS)

```bash
# Clone the repository
git clone https://github.com/hsliuustc0106/vllm-omni.git
cd vllm-omni

# Run the automated Docker setup
./scripts/docker-setup-macos.sh
```

#### Option 2: Local Installation

```bash
# Clone the repository
git clone https://github.com/hsliuustc0106/vllm-omni.git
cd vllm-omni

# Run the installation script
./install.sh
```

### Prerequisites

- Python 3.11+ (recommended)
- Conda or Miniconda
- Git
- CUDA 11.8+ (for GPU acceleration) or CPU-only installation

### Installation Methods

#### Method 1: Automated Installation (Recommended)
Set up the basic environment:
```bash
# Using shell script
./install.sh

# Or using Python script
python install.py
uv venv --python 3.12 --seed
source .venv/bin/activate
```
Install the pinned version of vLLM at commit `808a7b69df479b6b3a16181711cac7ca28a9b941`:

#### Method 2: Manual Installation
```bash
# Create conda environment
conda create -n vllm_omni python=3.11 -y
conda activate vllm_omni

# Install PyTorch (CPU or GPU)
pip install torch>=2.7 --index-url https://download.pytorch.org/whl/cpu # CPU
# pip install torch>=2.7 --index-url https://download.pytorch.org/whl/cu121 # GPU

# Install dependencies
pip install -r requirements.txt
pip install "vllm>=0.10.2"

# Install vLLM-omni
pip install -e .
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 808a7b69df479b6b3a16181711cac7ca28a9b941
VLLM_USE_PRECOMPILED=1 uv pip install --editable .
```

### Verify Installation
## Run examples (Qwen2.5-omni)

Change into the example folder:
```bash
# Test the installation
python test_installation.py

# Test basic functionality
python -c "import vllm_omni; print('Ready!')"

# Test CLI
vllm --help
cd vllm_omni
cd examples/offline_inference/qwen2_5_omni
```

For detailed installation instructions, see [INSTALL.md](INSTALL.md).

## 📥 Model Download

Models are automatically downloaded when first used, or you can pre-download them:

Update `PYTHONPATH` in `run.sh` to point to your `vllm_omni` checkout, then run:
```bash
# Check downloaded models
python scripts/download_models.py --check-cache

# Download all default models
python scripts/download_models.py --all

# Download specific models
python scripts/download_models.py --ar-models Qwen/Qwen3-0.6B
python scripts/download_models.py --dit-models stabilityai/stable-diffusion-2-1
bash run.sh
```
The output audio is saved in `./output_audio`.
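To spot-check a generated file, the standard-library `wave` module is enough; the 24 kHz sample rate matches the `sf.write(..., samplerate=24000)` call in `end2end.py`. The sketch below writes a dummy file so it is self-contained — in practice you would open `./output_audio/output_<request_id>.wav` instead:

```python
import wave

# Write a dummy 24 kHz mono 16-bit file so this check is self-contained;
# replace "demo.wav" with a real output_<request_id>.wav to inspect it.
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)                    # 16-bit samples
    w.setframerate(24000)
    w.writeframes(b"\x00\x00" * 24000)   # one second of silence

with wave.open("demo.wav", "rb") as w:
    rate = w.getframerate()
    seconds = w.getnframes() / rate
    print(f"{rate} Hz, {seconds:.2f} s")  # → 24000 Hz, 1.00 s
```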

**Model Storage Location:**
- Default: `~/.cache/huggingface/hub/`
- AR models: 100MB - 1GB each
- DiT models: 2GB - 7GB each
## To-do list
- [x] Offline inference example for Qwen2.5-omni with single request
- [ ] Adaptation from current vllm branch to stable vllm v0.11.0
- [ ] Offline inference example for Qwen2.5-omni with streaming multiple requests
- [ ] Online inference support
- [ ] Support for other models

For detailed model management, see [MODEL_DOWNLOAD_GUIDE.md](docs/MODEL_DOWNLOAD_GUIDE.md).
For detailed model management, see [vllm_omni_design.md](docs/architecture/vllm_omni_design.md) and [high_level_arch_design.md](docs/architecture/high_level_arch_design.md).
37 changes: 37 additions & 0 deletions examples/offline_inference/qwen_2_5_omni/README.md
@@ -0,0 +1,37 @@
# Offline Example of vLLM-omni for Qwen2.5-omni

## Installation

Set up the basic environment:
```bash
uv venv --python 3.12 --seed
source .venv/bin/activate
```
Install the pinned version of vLLM at commit `808a7b69df479b6b3a16181711cac7ca28a9b941`:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 808a7b69df479b6b3a16181711cac7ca28a9b941
VLLM_USE_PRECOMPILED=1 uv pip install --editable .
```

## Run examples

Change into the example folder:
```bash
cd vllm_omni
cd examples/offline_inference/qwen2_5_omni
```
Update `PYTHONPATH` in `run.sh` to point to your `vllm_omni` checkout, then run:
```bash
bash run.sh
```
The output audio is saved in `./output_audio`.
**Collaborator:** do we have a test result?

**Collaborator Author:** The output is a .wav audio file. Do we need to add it to the example folder?

**Collaborator Author:** Already updated the test plan and result. Fixed.

## To-do list
- [x] Offline inference example for Qwen2.5-omni with single request
- [ ] Adaptation from current vllm branch to stable vllm v0.11.0
- [ ] Offline inference example for Qwen2.5-omni with streaming multiple requests
- [ ] Online inference support
- [ ] Support for other models
130 changes: 130 additions & 0 deletions examples/offline_inference/qwen_2_5_omni/end2end.py
@@ -0,0 +1,130 @@
import argparse
import os
import soundfile as sf
import random
import numpy as np
import torch

from vllm.sampling_params import SamplingParams

# Enable the vLLM v1 engine before vllm_omni is imported.
os.environ["VLLM_USE_V1"] = "1"

from vllm_omni.entrypoints.omni_llm import OmniLLM
from utils import make_omni_prompt


SEED = 42
# Set all random seeds
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

# Make PyTorch deterministic
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Set environment variables for deterministic behavior
os.environ["PYTHONHASHSEED"] = str(SEED)
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"


def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('--model', required=True, help='Path to merged model directory (will be created if downloading).')
parser.add_argument('--thinker-model', type=str, default=None)
parser.add_argument('--talker-model', type=str, default=None)
parser.add_argument('--code2wav-model', type=str, default=None)
parser.add_argument('--hf-hub-id', default='Qwen/Qwen2.5-Omni-7B', help='Hugging Face repo id to download if needed.')
parser.add_argument('--hf-revision', default=None, help='Optional HF revision (branch/tag/commit).')
parser.add_argument('--prompts', required=True, nargs='+', help='Input text prompts.')
parser.add_argument('--voice-type', default='default', help='Voice type, e.g., m02, f030, default.')
parser.add_argument('--code2wav-dir', default=None, help='Path to code2wav folder (contains spk_dict.pt).')
parser.add_argument('--dit-ckpt', default=None, help='Path to DiT checkpoint file (e.g., dit.pt).')
parser.add_argument('--bigvgan-ckpt', default=None, help='Path to BigVGAN checkpoint file.')
parser.add_argument('--dtype', default='bfloat16', choices=['float16', 'bfloat16', 'float32'])
parser.add_argument('--max-model-len', type=int, default=32768)

parser.add_argument("--thinker-only", action="store_true")
parser.add_argument("--text-only", action="store_true")
parser.add_argument("--do-wave", action="store_true")
parser.add_argument('--prompt_type',
choices=[
'text', 'audio', 'audio-long', 'audio-long-chunks',
'audio-long-expand-chunks', 'image', 'video',
'video-frames', 'audio-in-video', 'audio-in-video-v2',
"audio-multi-round", "badcase-vl", "badcase-text",
"badcase-image-early-stop", "badcase-two-audios",
"badcase-two-videos", "badcase-multi-round",
"badcase-voice-type", "badcase-voice-type-v2",
"badcase-audio-tower-1", "badcase-audio-only"
],
default='text')
parser.add_argument('--use-torchvision', action='store_true')
parser.add_argument('--tokenize', action='store_true')
parser.add_argument('--output-wav', default="output_audio", help='Output directory for generated wav files.')
parser.add_argument('--thinker-hidden-states-dir', default="thinker_hidden_states", help='Path to thinker hidden states directory.')
args = parser.parse_args()
return args


def main():
args = parse_args()
model_name = args.model
omni_llm = OmniLLM(model=model_name)
thinker_sampling_params = SamplingParams(
temperature=0.0, # Deterministic - no randomness
top_p=1.0, # Disable nucleus sampling
top_k=-1, # Disable top-k sampling
max_tokens=2048,
seed=SEED, # Fixed seed for sampling
detokenize=True,
repetition_penalty=1.1,
)
talker_sampling_params = SamplingParams(
temperature=0.0, # Deterministic - no randomness
top_p=1.0, # Disable nucleus sampling
top_k=-1, # Disable top-k sampling
max_tokens=2048,
seed=SEED, # Fixed seed for sampling
detokenize=True,
repetition_penalty=1.1,
stop_token_ids=[8294]
)
code2wav_sampling_params = SamplingParams(
temperature=0.0, # Deterministic - no randomness
top_p=1.0, # Disable nucleus sampling
top_k=-1, # Disable top-k sampling
max_tokens=2048,
seed=SEED, # Fixed seed for sampling
detokenize=True,
repetition_penalty=1.1,
)

sampling_params_list = [thinker_sampling_params,
talker_sampling_params,
code2wav_sampling_params]

prompts = [make_omni_prompt(args, prompt) for prompt in args.prompts]
omni_outputs = omni_llm.generate(prompts, sampling_params_list)

os.makedirs(args.output_wav, exist_ok=True)
for stage_outputs in omni_outputs:
if stage_outputs.final_output_type == "text":
for output in stage_outputs.request_output:
request_id = output.request_id
text_output = output.outputs[0].text
print(f"Request ID: {request_id}, Text Output: {text_output}")
elif stage_outputs.final_output_type == "audio":
for output in stage_outputs.request_output:
request_id = output.request_id
audio_tensor = output.multimodal_output["audio"]
output_wav = os.path.join(args.output_wav, f"output_{output.request_id}.wav")
sf.write(output_wav, audio_tensor.detach().cpu().numpy(), samplerate=24000)
print(f"Request ID: {request_id}, Saved audio to {output_wav}")


if __name__ == "__main__":
main()
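`run.sh` itself is not shown in this diff; a minimal invocation of `end2end.py`, with flags taken from `parse_args()` above (the model id, prompt, and `PYTHONPATH` value are placeholder assumptions), might look like:

```shell
# Assumed invocation sketch: adjust PYTHONPATH to your vllm_omni checkout,
# as the README notes. Flags mirror parse_args() in end2end.py.
export PYTHONPATH=/path/to/vllm_omni
CMD="python end2end.py --model Qwen/Qwen2.5-Omni-7B --prompts 'Say hello in one sentence.' --prompt_type text --output-wav output_audio"
echo "$CMD"
```

Generated wav files then land in `output_audio/output_<request_id>.wav`, one per request.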
Binary file not shown.