
Ming-omni-tts: Simple and Efficient Unified Generation of Speech, Music, and Sound with Precise Control

🌐 Project Page | 🤗 Hugging Face | 🤖 ModelScope | 🎮 Gradio Demo-zh | 🎮 Gradio Demo-en | 💬 DingTalk(钉钉)


Introduction

Ming-omni-tts is a high-performance unified audio generation model that achieves precise control over speech attributes and enables single-channel synthesis of speech, environmental sounds, and music. Powered by a custom 12.5 Hz continuous tokenizer and Patch-by-Patch compression, it delivers competitive inference efficiency (an LLM frame rate of 3.1 Hz). The model also features robust text normalization for accurate and natural narration of complex mathematical and chemical expressions.

🚀 Core Capabilities

  • 🔊 Fine-grained Vocal Control: The model supports precise control over speech rate, pitch, volume, emotion, and dialect through simple commands. Notably, its accuracy for Cantonese dialect control is as high as 93%, and its emotion control accuracy reaches 46.7%, surpassing CosyVoice3.
  • 🌌 Intelligent Voice Design: Features 100+ premium built-in voices and supports zero-shot voice design through natural language descriptions. Its performance on the InstructTTSEval-ZH benchmark is on par with Qwen3-TTS.
  • 🎶 Immersive Unified Generation: The industry’s first autoregressive model to jointly generate speech, ambient sound, and music in a single channel. Built on a custom 12.5Hz continuous tokenizer and a DiT head architecture, it delivers a seamless, "in-the-scene" auditory experience.
  • High-efficiency Inference: Introduces a "Patch-by-Patch" compression strategy that reduces the LLM inference frame rate to 3.1Hz. This significantly cuts latency and enables podcast-style audio generation while preserving naturalness and audio detail.
  • 🧪 Professional Text Normalization: The model accurately parses and narrates complex formats, including mathematical expressions and chemical equations, ensuring natural-sounding output for specialized applications.

Demo

demo.mp4

Updates

🚀 Key Features

Compared with other audio-capable LLMs, Ming-omni-tts features the following key optimizations:

  • Unified Continuous Audio Tokenizer: We propose a continuous VAE-based tokenizer that integrates speech, music, and general audio into a unified latent space at a 12.5 Hz frame rate, yielding competitive results on audio reconstruction and various downstream synthesis benchmarks.

  • Unified Audio Language Model for Speech, Music and Sound Generation: We present a unified, end-to-end audio language model that employs a single LLM backbone to perform joint generation of speech, music, and general sound. To enhance audio quality, the architecture is augmented with a Diffusion Head. Furthermore, we employ a patch-based generation strategy with a patch size of 4 and a look-back history of 32 frames, striking a balance between local acoustic detail and long-range structural coherence (a simplified sketch of this loop follows below).
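The patch-based strategy is also where the 3.1 Hz inference figure comes from: latent frames are produced at 12.5 Hz, and grouping them into patches of 4 means the LLM backbone only advances at 12.5 / 4 ≈ 3.1 steps per second. The snippet below is a minimal, illustrative sketch of such a patch-by-patch loop; it is not the actual Ming-omni-tts API, and lm_backbone and diffusion_head are hypothetical stand-ins for the LLM backbone and the Diffusion Head described above.

# Illustrative sketch of patch-by-patch generation (hypothetical interfaces, not the real API).
# Assumptions: lm_backbone(text_emb, history) returns a conditioning state, and
# diffusion_head(cond, n_frames) samples n_frames continuous latent frames from it.
import torch

PATCH_SIZE = 4        # latent frames generated per LLM step
LOOK_BACK = 32        # history frames fed back to the backbone
FRAME_RATE_HZ = 12.5  # tokenizer frame rate, so the LLM steps at 12.5 / 4 ≈ 3.1 Hz

def generate_latents(lm_backbone, diffusion_head, text_emb, num_frames, latent_dim=64):
    latents = torch.zeros(0, latent_dim)               # continuous latent frames generated so far
    while latents.shape[0] < num_frames:
        history = latents[-LOOK_BACK:]                 # bounded look-back window
        cond = lm_backbone(text_emb, history)          # one backbone step per patch
        patch = diffusion_head(cond, PATCH_SIZE)       # sample the next 4 latent frames
        latents = torch.cat([latents, patch], dim=0)
    return latents[:num_frames]                        # trim any overshoot from the last patch

Under these assumptions, a 10-second clip corresponds to 125 latent frames but only about 32 backbone steps, which is where most of the latency reduction comes from.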

Evaluation

  • Reconstruction: The 12.5 Hz tokenizer supports high-quality reconstruction across speech, music, and sound, with fidelity comparable to existing state-of-the-art methods on key metrics.
  • Dialect Generation: Achieves 96% accuracy on WSYue-TTS-Eval and 86% on WSC-TTS-Eval, outperforming CosyVoice3.
  • Emotional Expressiveness: Delivers an average accuracy of 76.7% on the CV3-Eval emotional sets and 46.7% on the neutral emotion sets, significantly surpassing CosyVoice3-Base (40%) and reaching SOTA levels.
  • Instruction-based Voice Design: Scores 76.20% on InstructTTSEval-ZH, with instruction-following on par with Qwen3-TTS-VoiceDesign.
  • Zero-shot Voice Cloning: Exhibits exceptional stability on seed-tts-eval (Chinese) with a WER of 0.83%, outperforming Seed-TTS and GLM-TTS.
  • Text Normalization (TN): On internal technical test sets, the model achieves a CER of 1.97% in normalized regions, delivering performance comparable to Gemini-2.5 Pro.
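The WER/CER figures throughout the tables below are typically obtained by transcribing the synthesized audio with an ASR model and comparing the transcript against the input text. Purely as an unofficial illustration of the metric (not the evaluation pipeline used for these benchmarks), a character error rate can be computed with the open-source jiwer package:

# Unofficial illustration of CER scoring: compare an ASR transcript of the
# synthesized audio against the text originally given to the TTS model.
# Requires: pip install jiwer
import jiwer

reference = "今天天气很好"    # text given to the TTS model
hypothesis = "今天天气很好"   # ASR transcript of the generated audio

# jiwer.cer scores at the character level, the usual choice for Chinese text.
print(f"CER: {jiwer.cer(reference, hypothesis):.2%}")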

Audio Tokenizer

Speech metrics are evaluated on AISHELL-3 (44.1 kHz, Chinese) and VCTK (44.1 kHz, English).
Music metrics are evaluated on MUSDB18 (44.1 kHz) and MUSDB18-HQ (44.1 kHz).
Audio metrics are evaluated on AudioCaps.

Speech Controllable Generative Tasks

Zero-shot TTS

Zero-shot speech generation performance comparison on the Seed-TTS testset.

| Model | Institution | seed-tts-eval-zh WER ↓ | seed-tts-eval-zh SIM ↑ | seed-tts-eval-en WER ↓ | seed-tts-eval-en SIM ↑ |
|---|---|---|---|---|---|
| Seed-TTS | BytedanceSpeech | 1.11 | 0.796 | 2.24 | 0.762 |
| MaskGCT | College | 2.27 | 0.774 | 2.62 | 0.714 |
| E2 TTS | Microsoft | 1.97 | 0.730 | 2.19 | 0.710 |
| F5-TTS | College | 1.56 | 0.741 | 1.83 | 0.647 |
| CosyVoice 2 | Alibaba | 1.45 | 0.748 | 2.57 | 0.652 |
| Qwen3-Omni-30B-A3B | Alibaba | 1.07 | - | 1.39 | - |
| CosyVoice 3-0.5B | Alibaba | 1.16 | 0.780 | 2.02 | 0.718 |
| CosyVoice 3-1.5B | Alibaba | 0.71 | 0.775 | 1.45 | 0.695 |
| Qwen3-TTS-25Hz-0.6B-Base | Alibaba | 1.18 | - | 1.64 | - |
| Qwen3-TTS-25Hz-1.7B-Base | Alibaba | 1.10 | - | 1.49 | - |
| Qwen3-TTS-12Hz-0.6B-Base | Alibaba | 0.92 | - | 1.32 | - |
| Qwen3-TTS-12Hz-1.7B-Base | Alibaba | 0.77 | - | 1.24 | - |
| GLM-TTS | Zhipu AI | 1.03 | 0.761 | 2.23 | 0.672 |
| Ming-Flash-Omni-preview | Ant Group | 0.99 | 0.740 | 1.59 | 0.680 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 0.87 | 0.72 | 2.19 | 0.61 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 0.83 | 0.75 | 2.02 | 0.62 |

Speech Attribute Control

Instruction success rate by attribute (speech rate / speech volume / speech F0 / avg.), together with WER and SIM.

| Model | Institution | Speech rate | Speech volume | Speech F0 | Avg. | WER | SIM |
|---|---|---|---|---|---|---|---|
| CosyVoice3 | Alibaba | 100% | 97.67% | 65.33% | 87.67% | 1.21% | 0.58 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 97.67% | 95.00% | 91.33% | 94.67% | 0.27% | 0.712 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 96.33% | 97.00% | 83.67% | 92.33% | 0.347% | 0.776 |

Emotional Control

Below is a comparison between Ming-omni-tts and other state-of-the-art (SOTA) models on the emotion control task.

Emotion accuracy on the Text-Related and Text-Unrelated subsets of the CV3-Eval Emotional test sets.

| Model | Institution | Average | Text-Related happy | Text-Related sad | Text-Related angry | Text-Unrelated happy | Text-Unrelated sad | Text-Unrelated angry |
|---|---|---|---|---|---|---|---|---|
| F5-TTS | SJTU | 0.647 | 0.92 | 0.52 | 0.72 | 0.80 | 0.28 | 0.64 |
| Sparks-TTS | HKST | 0.553 | 0.80 | 0.56 | 0.50 | 0.50 | 0.60 | 0.36 |
| GPT-SoVits | - | 0.517 | 0.88 | 0.54 | 0.50 | 0.48 | 0.40 | 0.30 |
| CosyVoice2 | Alibaba | 0.587 | 0.84 | 0.72 | 0.58 | 0.56 | 0.44 | 0.38 |
| CosyVoice3-0.5B | Alibaba | 0.663 | 0.92 | 0.70 | 0.72 | 0.64 | 0.42 | 0.58 |
| CosyVoice3-1.5B | Alibaba | 0.630 | 0.86 | 0.64 | 0.72 | 0.64 | 0.44 | 0.48 |
| + DiffRO-EMO | Alibaba | 0.777 | 0.98 | 0.68 | 0.84 | 0.98 | 0.50 | 0.68 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 0.700 | 0.94 | 0.80 | 0.84 | 0.58 | 0.42 | 0.62 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 0.767 | 0.96 | 0.86 | 0.90 | 0.66 | 0.40 | 0.82 |

Emotion accuracy on the Text-Related and Text-Unrelated subsets of the CV3-Eval Neutral test sets.

| Model | Institution | Average | Text-Related happy | Text-Related sad | Text-Related angry | Text-Unrelated happy | Text-Unrelated sad | Text-Unrelated angry |
|---|---|---|---|---|---|---|---|---|
| CosyVoice3-0.5B | Alibaba | 0.400 | 0.68 | 0.30 | 0.78 | 0.14 | 0.04 | 0.46 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 0.343 | 0.68 | 0.26 | 0.74 | 0.14 | 0.00 | 0.24 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 0.450 | 0.78 | 0.38 | 0.76 | 0.30 | 0.02 | 0.46 |

Dialect Control

Dialect performance comparison on WSC-Eval-TTS (easy/hard) and WSYue-TTS-eval (Base/Coverage); metrics per test set are CER (%) ↓, SIM ↑, and ACC (%) ↑.

| Model | Institution | WSC-easy CER | WSC-easy SIM | WSC-easy ACC | WSC-hard CER | WSC-hard SIM | WSC-hard ACC | WSYue-Base CER | WSYue-Base SIM | WSYue-Base ACC | WSYue-Coverage CER | WSYue-Coverage SIM | WSYue-Coverage ACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Step-Audio-TTS | Step | 10.83 | 67.66 | - | 12.52 | 54.52 | - | 27.79 | 0.762 | - | 24.25 | 0.781 | - |
| CosyVoice 2.0 | Alibaba | 7.14 | 70.27 | - | 9.06 | 60.10 | - | 14.38 | 0.812 | - | 13.74 | 0.826 | - |
| Qwen-TTS | Alibaba | 4.13 | - | - | 7.35 | - | - | - | - | - | - | - | - |
| CosyVoice2-WSC | Alibaba | 4.28 | 72.78 | - | 8.78 | 62.59 | - | - | - | - | - | - | - |
| CosyVoice2-WSC-SFT | Alibaba | 4.08 | 78.84 | - | 7.22 | 67.96 | - | - | - | - | - | - | - |
| Llasa-1B | - | - | - | - | - | - | - | 53.31 | 0.732 | - | 43.68 | 0.754 | - |
| Llasa-1B-Yue | - | - | - | - | - | - | - | 10.89 | 0.762 | - | 12.78 | 0.772 | - |
| Edge-TTS | - | - | - | - | - | - | - | 8.30 | - | - | 9.27 | - | - |
| Cosyvoice2-Yue | - | - | - | - | - | - | - | 10.33 | 0.821 | - | 9.49 | 0.834 | - |
| CosyVoice3 | Alibaba | 3.17 | 0.696 | 68.06 | 4.07 | 0.723 | 80.90 | 8.36 | 0.611 | 91.70 | 8.95 | 0.658 | 95.80 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 2.25 | 0.695 | 82.08 | 3.18 | 0.717 | 84.42 | 9.70 | 0.598 | 96.00 | 11.62 | 0.644 | 95.80 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 2.35 | 0.730 | 83.48 | 3.19 | 0.750 | 88.44 | 6.47 | 0.622 | 96.30 | 7.87 | 0.667 | 95.81 |

Podcast TTS

Podcast performance comparison on the ZipVoice-Dia-zh test set.

| Model | Institution | CER ↓ | cpSIM ↑ | UTMOS ↑ |
|---|---|---|---|---|
| ZipVoice-Dia | Xiaomi | 3.39% | 0.553 | 2.24 |
| MoonCast | Kimi | 27.43% | 0.441 | 1.76 |
| MOSS-TTSD | Fudan | 8.62% | 0.421 | 1.70 |
| Vibevoice-1.5B | Microsoft | 12.87% | 0.455 | 1.74 |
| FireRedTTS2 | Xiaohongshu | 3.34% | 0.512 | 1.90 |
| SoulX-Podcast | Soul | 2.20% | 0.599 | 2.09 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 2.12% | 0.457 | 2.25 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 1.84% | 0.470 | 2.19 |

Voice Design

Voice Design performance comparison on the InstructTTSEval-ZH test set.

| Model | Institution | APS ↑ | DSD ↑ | RP ↑ | Average |
|---|---|---|---|---|---|
| Qwen3TTS-12Hz-1.7B-VD | Alibaba | 85.2 | 81.1 | 65.1 | 77.13 |
| Mimo-Audio-7B-Instruct | Xiaomi | 75.7 | 74.3 | 61.5 | 70.50 |
| VoiceSculptor | NPU | 75.7 | 64.7 | 61.5 | 67.30 |
| VoxInstruct | Tsinghua | 47.5 | 52.3 | 42.6 | 47.47 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 83.85 | 75.10 | 61.50 | 73.48 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 87.30 | 79.80 | 61.50 | 76.20 |

Audio & BGM Generation

Text-To-BGM

Text-to-BGM performance comparison on the Ming-BGM-Eval test set (mulan_t, Audiobox-Aesthetics, and SongEval metrics).

| Model | Institution | mulan_t | Audiobox CE | Audiobox CU | Audiobox PC | Audiobox PQ | Audiobox Avg. | SongEval CO | SongEval MU | SongEval ME | SongEval CL | SongEval NA | SongEval Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doubao | Bytedance | 0.268 | 7.55 | 8.21 | 4.97 | 8.25 | 7.24 | 3.30 | 3.02 | 3.00 | 3.02 | 2.92 | 3.05 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 0.230 | 7.18 | 8.16 | 4.80 | 8.20 | 7.08 | 3.11 | 2.86 | 2.86 | 2.81 | 2.73 | 2.87 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 0.250 | 7.19 | 8.14 | 4.69 | 8.18 | 7.05 | 3.08 | 2.84 | 2.82 | 2.78 | 2.74 | 2.85 |

Text-To-Audio(TTA)

TTA performance comparison on the AudioCaps test set.

| Model | Institution | FDopenl3 | KLpasst | CLAPscore |
|---|---|---|---|---|
| AudioLDM-large | University of Surrey | 108.300 | 1.810 | 0.419 |
| Stable Audio Open | Stability AI | 96.133 | 2.148 | 0.306 |
| TangoFlux | Singapore University of Technology and Design | 137.700 | 1.041 | 0.547 |
| TangoFlux_base | Singapore University of Technology and Design | 149.270 | 1.125 | 0.523 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 74.292 | 2.257 | 0.347 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 65.918 | 1.640 | 0.424 |

Text Normalization

Text Normalization performance comparison on an internally constructed test set.

| Model | Institution | TN-area WER ↓ | Non-TN-area WER ↓ |
|---|---|---|---|
| Gemini-2.5 Pro | Google | 2.00% | 0.97% |
| Ming-omni-tts-0.5B (ours) | Ant Group | 1.97% | 0.85% |

Model & Benchmark Downloads

You can download our latest models and benchmarks from both Hugging Face and ModelScope.

| Model | Download |
|---|---|
| Ming-omni-tts-tokenizer-12Hz | 🤗 HuggingFace / 🤖 ModelScope |
| Ming-omni-tts-0.5B | 🤗 HuggingFace / 🤖 ModelScope |
| Ming-omni-tts-16.8B-A3B | 🤗 HuggingFace / 🤖 ModelScope |

If you are in mainland China, we strongly recommend downloading our models from 🤖 ModelScope.

pip install modelscope
modelscope download --model inclusionAI/Ming-omni-tts-0.5B --local_dir inclusionAI/Ming-omni-tts-0.5B  --revision master

Note: This download process will take several minutes to several hours, depending on your network conditions.
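Models hosted on Hugging Face can also be fetched programmatically with huggingface_hub. The snippet below is a minimal sketch and assumes the Hugging Face repository id mirrors the ModelScope id shown above:

# Minimal sketch: fetch the 0.5B checkpoint via huggingface_hub.
# Assumption: the Hugging Face repo id matches the ModelScope id above.
# Requires: pip install huggingface_hub
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="inclusionAI/Ming-omni-tts-0.5B",
    local_dir="inclusionAI/Ming-omni-tts-0.5B",
)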

Environment Preparation

Installation with pip

pip install -r requirements.txt

Installation with docker

You can set up the environment using Docker in two ways.

  • Option 1: Pull from Docker Hub (Recommended)
# 1. Pull the pre-built image
docker pull yongjielv/ming_uniaudio:v1.1

# 2. Run the container
docker run -it --gpus all yongjielv/ming_uniaudio:v1.1 /bin/bash
  • Option 2: Build from Source
# 1. Build the image
docker build -t ming-omni-tts:v1.1 -f ./docker/ming_uniaudio.dockerfile .

# 2. Run the container
docker run -it --gpus all ming-omni-tts:v1.1 /bin/bash

Example Usage

git clone https://github.com/inclusionAI/Ming-omni-tts.git
cd Ming-omni-tts
python3 cookbooks/test.py

For detailed usage, please refer to demo.ipynb.

Note: The examples were tested on NVIDIA H800-80GB / H20-96G GPUs with CUDA 12.4.

Citation

If you find our work helpful, please feel free to cite us.
