🌐Project Page |🤗 Hugging Face| 🤖 ModelScope | 🎮 Gradio Demo-zh | 🎮 Gradio Demo-en | 💬 DingTalk(钉钉)
- Introduction
- Demo
- Updates
- Key Features
- Evaluation
- Model & Benchmark Downloads
- Environment Preparation
- Example Usage
- Citation
Ming-omni-tts is a high-performance unified audio generation model that achieves precise control over speech attributes and enables single-channel synthesis of speech, environmental sounds, and music. Powered by a custom 12.5Hz continuous tokenizer and Patch-by-Patch compression, it delivers competitive inference efficiency (3.1Hz). Additionally, the model features robust text normalization capabilities for the accurate and natural narration of complex mathematical and chemical expressions.
🚀 Core Capabilities
- 🔊 Fine-grained Vocal Control: The model supports precise control over speech rate, pitch, volume, emotion, and dialect through simple commands. Notably, its accuracy for Cantonese dialect control is as high as 93%, and its emotion control accuracy reaches 46.7%, surpassing CosyVoice3.
- 🌌 Intelligent Voice Design: Features 100+ premium built-in voices and supports zero-shot voice design through natural language descriptions. Its performance on the Instruct-TTS-Eval-zh benchmark is on par with Qwen3-TTS.
- 🎶 Immersive Unified Generation: The industry’s first autoregressive model to jointly generate speech, ambient sound, and music in a single channel. Built on a custom 12.5Hz continuous tokenizer and a DiT head architecture, it delivers a seamless, "in-the-scene" auditory experience.
- ⚡ High-efficiency Inference: Introduces a "Patch-by-Patch" compression strategy that reduces the LLM inference frame rate to 3.1Hz. This significantly cuts latency and enables podcast-style audio generation while preserving naturalness and audio detail.
- 🧪 Professional Text Normalization: The model accurately parses and narrates complex formats, including mathematical expressions and chemical equations, ensuring natural-sounding output for specialized applications.
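As a back-of-the-envelope check, the 3.1Hz inference rate above follows directly from grouping 12.5Hz tokenizer frames into patches of 4 (a sketch of the arithmetic only; the actual patching is internal to the model):

```python
# Effective LLM frame rate when 12.5 Hz tokenizer frames are grouped
# into patches of 4 (both values taken from the model description above).
TOKENIZER_HZ = 12.5
PATCH_SIZE = 4

effective_rate = TOKENIZER_HZ / PATCH_SIZE
print(f"{effective_rate:.3f} Hz")  # 3.125 Hz, reported as ~3.1 Hz
```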
demo.mp4
- Support VLLM Inference
- Technical Report
- Ming-omni-tts Blog
Compared to other audio-oriented LLMs, Ming-omni-tts features the following key optimizations:
- Unified Continuous Audio Tokenizer: We propose a continuous VAE-based tokenizer that integrates speech, music, and general audio into a unified latent space with 12.5 Hz frame rate, yielding competitive results across audio reconstruction and various downstream synthesis benchmarks.
- Unified Audio Language Model for Speech, Music and Sound Generation: We present a unified, end-to-end audio language model that employs a single LLM backbone to perform joint generation of speech, music, and general sound. To enhance audio quality, the architecture is augmented with a Diffusion Head. Furthermore, we employ a patch-based generation strategy with a patch size of 4 and a look-back history of 32, enabling an optimal balance between local acoustic detail and long-range structural coherence.
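To illustrate the patch-based strategy described above (patch size 4, look-back history 32), here is a toy schedule of which latent frames each LLM step emits and which earlier patches it may attend to. All names are hypothetical; the real conditioning is learned inside the model:

```python
from typing import List, Tuple

PATCH_SIZE = 4   # latent frames generated per LLM step (from the text above)
LOOK_BACK = 32   # history patches visible to each step (from the text above)

def patch_schedule(num_frames: int) -> List[Tuple[Tuple[int, int], Tuple[int, int]]]:
    """Toy schedule: for each patch, the frame span it emits and the
    range of earlier patches it may attend to (bounded look-back)."""
    steps = []
    num_patches = (num_frames + PATCH_SIZE - 1) // PATCH_SIZE
    for p in range(num_patches):
        start, end = p * PATCH_SIZE, min((p + 1) * PATCH_SIZE, num_frames)
        history_from = max(0, p - LOOK_BACK)  # oldest visible patch index
        steps.append(((start, end), (history_from, p)))
    return steps

# A 12.5 Hz latent sequence for 10 s of audio has 125 frames -> 32 patches.
schedule = patch_schedule(125)
print(len(schedule))  # 32 LLM steps instead of 125 per-frame steps
```

The patching is what turns a 12.5Hz frame sequence into a ~3.1Hz LLM decoding loop, while the look-back window keeps long-range structure coherent without unbounded attention cost.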
- Reconstruction: The 12.5Hz tokenizer supports high-quality reconstruction across speech, music, and sound. Its performance is comparable to existing state-of-the-art methods across key fidelity metrics.
- Dialect Generation: Achieves 96% accuracy on WSYue-TTS-Eval and 86% on WSC-TTS-Eval, outperforming CosyVoice3.
- Emotional Expressiveness: Delivers an average accuracy of 76.7% on CV3-Eval emotional sets and 46.7% on neutral emotion sets, significantly surpassing CosyVoice3-Base (40%) to reach SOTA levels.
- Instruction-based Voice Design: Scores 76.20% on InstructTTS-Eval-ZH. Its instruction-following capability is on par with Qwen3-TTS-VoiceDesign.
- Zero-shot Voice Clone: Exhibits exceptional stability on Seed-tts-eval (Chinese) with a WER of 0.83%, outperforming SeedTTS and GLM-TTS.
- Text Normalization (TN): On internal technical test sets, the model achieves a CER of 1.97% in normalized regions, delivering performance comparable to Gemini-2.5 Pro.
Speech metrics are evaluated on AISHELL-3 (44.1 kHz, Chinese) and VCTK (44.1 kHz, English).
Music metrics are evaluated on MUSDB18 (44.1 kHz) and MUSDB18-HQ (44.1 kHz).
Audio metrics are evaluated on AudioCaps.
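The WER/CER numbers in the tables below are edit-distance based. As a reference for how such scores are computed (a minimal sketch, not the exact scoring scripts these benchmarks use):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit operations normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)

print(f"{cer('hello world', 'helo world'):.3f}")  # 1 deletion / 11 chars -> 0.091
```

WER is the same computation applied over word tokens instead of characters; CER is the usual metric for Chinese test sets.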
| Model | Institution | seed-tts-eval-zh WER ↓ | seed-tts-eval-zh SIM ↑ | seed-tts-eval-en WER ↓ | seed-tts-eval-en SIM ↑ |
|---|---|---|---|---|---|
| Seed-TTS | BytedanceSpeech | 1.11 | 0.796 | 2.24 | 0.762 |
| MaskGCT | College | 2.27 | 0.774 | 2.62 | 0.714 |
| E2 TTS | Microsoft | 1.97 | 0.730 | 2.19 | 0.710 |
| F5-TTS | College | 1.56 | 0.741 | 1.83 | 0.647 |
| CosyVoice 2 | Alibaba | 1.45 | 0.748 | 2.57 | 0.652 |
| Qwen3-Omni-30B-A3B | Alibaba | 1.07 | – | 1.39 | – |
| CosyVoice 3-0.5B | Alibaba | 1.16 | 0.780 | 2.02 | 0.718 |
| CosyVoice 3-1.5B | Alibaba | 0.71 | 0.775 | 1.45 | 0.695 |
| Qwen3-TTS-25Hz-0.6B-Base | Alibaba | 1.18 | – | 1.64 | – |
| Qwen3-TTS-25Hz-1.7B-Base | Alibaba | 1.10 | – | 1.49 | – |
| Qwen3-TTS-12Hz-0.6B-Base | Alibaba | 0.92 | – | 1.32 | – |
| Qwen3-TTS-12Hz-1.7B-Base | Alibaba | 0.77 | – | 1.24 | – |
| GLM-TTS | Zhipu AI | 1.03 | 0.761 | 2.23 | 0.672 |
| Ming-Flash-Omni-preview | Ant Group | 0.99 | 0.740 | 1.59 | 0.680 |
| Ming-omni-tts-0.5B(ours) | Ant Group | 0.87 | 0.72 | 2.19 | 0.61 |
| Ming-omni-tts-16.8B-A3B(ours) | Ant Group | 0.83 | 0.75 | 2.02 | 0.62 |
| Model | Institution | Speech rate acc. ↑ | Speech volume acc. ↑ | Speech F0 acc. ↑ | Avg. acc. ↑ | WER ↓ | SIM ↑ |
|---|---|---|---|---|---|---|---|
| CosyVoice3 | Alibaba | 100% | 97.67% | 65.33% | 87.67% | 1.21% | 0.58 |
| Ming-omni-tts-0.5B(ours) | Ant Group | 97.67% | 95.00% | 91.33% | 94.67% | 0.27% | 0.712 |
| Ming-omni-tts-16.8B-A3B(ours) | Ant Group | 96.33% | 97.00% | 83.67% | 92.33% | 0.347% | 0.776 |
Below is a comparison between Ming-omni-tts and other state-of-the-art (SOTA) models on the emotion control task.
| Model | Institution | Average | Text-Related happy | Text-Related sad | Text-Related angry | Text-Unrelated happy | Text-Unrelated sad | Text-Unrelated angry |
|---|---|---|---|---|---|---|---|---|
| F5-TTS | SJTU | 0.647 | 0.92 | 0.52 | 0.72 | 0.80 | 0.28 | 0.64 |
| Sparks-TTS | HKST | 0.553 | 0.80 | 0.56 | 0.50 | 0.50 | 0.60 | 0.36 |
| GPT-SoVits | – | 0.517 | 0.88 | 0.54 | 0.50 | 0.48 | 0.40 | 0.30 |
| CosyVoice2 | Alibaba | 0.587 | 0.84 | 0.72 | 0.58 | 0.56 | 0.44 | 0.38 |
| CosyVoice3-0.5B | Alibaba | 0.663 | 0.92 | 0.70 | 0.72 | 0.64 | 0.42 | 0.58 |
| CosyVoice3-1.5B | Alibaba | 0.630 | 0.86 | 0.64 | 0.72 | 0.64 | 0.44 | 0.48 |
| + DiffRO-EMO | Alibaba | 0.777 | 0.98 | 0.68 | 0.84 | 0.98 | 0.50 | 0.68 |
| Ming-omni-tts-0.5B(ours) | Ant Group | 0.700 | 0.94 | 0.80 | 0.84 | 0.58 | 0.42 | 0.62 |
| Ming-omni-tts-16.8B-A3B(ours) | Ant Group | 0.767 | 0.96 | 0.86 | 0.90 | 0.66 | 0.40 | 0.82 |
| Model | Institution | Average | Text-Related happy | Text-Related sad | Text-Related angry | Text-Unrelated happy | Text-Unrelated sad | Text-Unrelated angry |
|---|---|---|---|---|---|---|---|---|
| CosyVoice3-0.5B | Alibaba | 0.400 | 0.68 | 0.30 | 0.78 | 0.14 | 0.04 | 0.46 |
| Ming-omni-tts-0.5B(ours) | Ant Group | 0.343 | 0.68 | 0.26 | 0.74 | 0.14 | 0.00 | 0.24 |
| Ming-omni-tts-16.8B-A3B(ours) | Ant Group | 0.450 | 0.78 | 0.38 | 0.76 | 0.30 | 0.02 | 0.46 |
Results on WSC-Eval-TTS (easy/hard) and WSYue-TTS-eval (Base/Coverage):

| Model | Institution | WSC-easy CER(%)↓ | WSC-easy SIM↑ | WSC-easy ACC(%)↑ | WSC-hard CER(%)↓ | WSC-hard SIM↑ | WSC-hard ACC(%)↑ | WSYue-Base CER(%)↓ | WSYue-Base SIM↑ | WSYue-Base ACC(%)↑ | WSYue-Coverage CER(%)↓ | WSYue-Coverage SIM↑ | WSYue-Coverage ACC(%)↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Step-Audio-TTS | Step | 10.83 | 67.66 | – | 12.52 | 54.52 | – | 27.79 | 0.762 | – | 24.25 | 0.781 | – |
| CosyVoice 2.0 | Alibaba | 7.14 | 70.27 | – | 9.06 | 60.10 | – | 14.38 | 0.812 | – | 13.74 | 0.826 | – |
| Qwen-TTS | Alibaba | 4.13 | – | – | 7.35 | – | – | – | – | – | – | – | – |
| CosyVoice2-WSC | Alibaba | 4.28 | 72.78 | – | 8.78 | 62.59 | – | – | – | – | – | – | – |
| CosyVoice2-WSC-SFT | Alibaba | 4.08 | 78.84 | – | 7.22 | 67.96 | – | – | – | – | – | – | – |
| Llasa-1B | – | – | – | – | – | – | – | 53.31 | 0.732 | – | 43.68 | 0.754 | – |
| Llasa-1B-Yue | – | – | – | – | – | – | – | 10.89 | 0.762 | – | 12.78 | 0.772 | – |
| Edge-TTS | – | – | – | – | – | – | – | 8.30 | – | – | 9.27 | – | – |
| Cosyvoice2-Yue | – | – | – | – | – | – | – | 10.33 | 0.821 | – | 9.49 | 0.834 | – |
| CosyVoice3 | Alibaba | 3.17 | 0.696 | 68.06 | 4.07 | 0.723 | 80.90 | 8.36 | 0.611 | 91.70 | 8.95 | 0.658 | 95.80 |
| Ming-omni-tts-0.5B(ours) | Ant Group | 2.25 | 0.695 | 82.08 | 3.18 | 0.717 | 84.42 | 9.70 | 0.598 | 96.00 | 11.62 | 0.644 | 95.80 |
| Ming-omni-tts-16.8B-A3B(ours) | Ant Group | 2.35 | 0.730 | 83.48 | 3.19 | 0.750 | 88.44 | 6.47 | 0.622 | 96.30 | 7.87 | 0.667 | 95.81 |
Results on the ZipVoice-Dia-zh benchmark:

| Model | Institution | CER ↓ | cpSIM ↑ | UTMOS ↑ |
|---|---|---|---|---|
| ZipVoice-Dia | Xiaomi | 3.39% | 0.553 | 2.24 |
| MoonCast | Kimi | 27.43% | 0.441 | 1.76 |
| MOSS-TTSD | Fudan | 8.62% | 0.421 | 1.70 |
| Vibevoice-1.5B | Microsoft | 12.87% | 0.455 | 1.74 |
| FireRedTTS2 | Xiaohongshu | 3.34% | 0.512 | 1.90 |
| SoulX-Podcast | Soul | 2.20% | 0.599 | 2.09 |
| Ming-omni-tts-0.5B(ours) | Ant Group | 2.12% | 0.457 | 2.25 |
| Ming-omni-tts-16.8B-A3B(ours) | Ant Group | 1.84% | 0.470 | 2.19 |
Results on InstructTTSEval-ZH:

| Model | Institution | APS ↑ | DSD ↑ | RP ↑ | Average ↑ |
|---|---|---|---|---|---|
| Qwen3TTS-12Hz-1.7B-VD | Alibaba | 85.2 | 81.1 | 65.1 | 77.13 |
| Mimo-Audio-7B-Instruct | Xiaomi | 75.7 | 74.3 | 61.5 | 70.50 |
| VoiceSculptor | NPU | 75.7 | 64.7 | 61.5 | 67.30 |
| VoxInstruct | Tsinghua | 47.5 | 52.3 | 42.6 | 47.47 |
| Ming-omni-tts-0.5B(ours) | Ant Group | 83.85 | 75.10 | 61.50 | 73.48 |
| Ming-omni-tts-16.8B-A3B(ours) | Ant Group | 87.30 | 79.80 | 61.50 | 76.20 |
Results on Ming-BGM-Eval (Audiobox = Audiobox-Aesthetics):

| Model | Institution | mulan_t | Audiobox CE | Audiobox CU | Audiobox PC | Audiobox PQ | Audiobox Avg. | SongEval CO | SongEval MU | SongEval ME | SongEval CL | SongEval NA | SongEval Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doubao | Bytedance | 0.268 | 7.55 | 8.21 | 4.97 | 8.25 | 7.24 | 3.30 | 3.02 | 3.00 | 3.02 | 2.92 | 3.05 |
| Ming-omni-tts-0.5B(ours) | Ant Group | 0.230 | 7.18 | 8.16 | 4.80 | 8.20 | 7.08 | 3.11 | 2.86 | 2.86 | 2.81 | 2.73 | 2.87 |
| Ming-omni-tts-16.8B-A3B(ours) | Ant Group | 0.250 | 7.19 | 8.14 | 4.69 | 8.18 | 7.05 | 3.08 | 2.84 | 2.82 | 2.78 | 2.74 | 2.85 |
Results on AudioCaps:

| Model | Institution | FDopenl3 ↓ | KLpasst ↓ | CLAPscore ↑ |
|---|---|---|---|---|
| AudioLDM-large | University of Surrey | 108.300 | 1.810 | 0.419 |
| Stable Audio Open | Stability AI | 96.133 | 2.148 | 0.306 |
| TangoFlux | Singapore University of Technology and Design | 137.700 | 1.041 | 0.547 |
| TangoFlux_base | Singapore University of Technology and Design | 149.270 | 1.125 | 0.523 |
| Ming-omni-tts-0.5B(ours) | Ant Group | 74.292 | 2.257 | 0.347 |
| Ming-omni-tts-16.8B-A3B(ours) | Ant Group | 65.918 | 1.640 | 0.424 |
Results on an internally constructed text-normalization test set:

| Model | Institution | TN-area WER ↓ | Non-TN-area WER ↓ |
|---|---|---|---|
| Gemini-2.5 Pro | Google | 2.00% | 0.97% |
| Ming-omni-tts-0.5B(ours) | Ant Group | 1.97% | 0.85% |
You can download our latest models and benchmarks from both Hugging Face and ModelScope.
| Model | Download |
|---|---|
| Ming-omni-tts-tokenizer-12Hz | 🤗 HuggingFace · 🤖 ModelScope |
| Ming-omni-tts-0.5B | 🤗 HuggingFace · 🤖 ModelScope |
| Ming-omni-tts-16.8B-A3B | 🤗 HuggingFace · 🤖 ModelScope |
If you're in mainland China, we strongly recommend downloading our model from 🤖 ModelScope:

```shell
pip install modelscope
modelscope download --model inclusionAI/Ming-omni-tts-0.5B --local_dir inclusionAI/Ming-omni-tts-0.5B --revision master
```

Note: The download may take several minutes to several hours, depending on your network conditions.
```shell
pip install -r requirements.txt
```

You can set up the environment using Docker in two ways.
- Option 1: Pull from Docker Hub (Recommended)
```shell
# 1. Pull the pre-built image
docker pull yongjielv/ming_uniaudio:v1.1

# 2. Run the container
docker run -it --gpus all yongjielv/ming_uniaudio:v1.1 /bin/bash
```

- Option 2: Build from Source
```shell
# 1. Build the image (Docker repository names must be lowercase)
docker build -t ming-omni-tts:v1.1 -f ./docker/ming_uniaudio.dockerfile .

# 2. Run the container
docker run -it --gpus all ming-omni-tts:v1.1 /bin/bash
```

```shell
git clone https://github.com/inclusionAI/Ming-omni-tts.git
cd Ming-omni-tts
python3 cookbooks/test.py
```

For detailed usage, please refer to demo.ipynb.
Note: We tested the examples on NVIDIA H800-80GB / H20-96GB GPUs with CUDA 12.4.
If you find our work helpful, please consider citing it.



