
Ming-omni-tts: Simple and Efficient Unified Generation of Speech, Music, and Sound with Precise Control

🌐 Project Page | 🤗 Hugging Face | 🤖 ModelScope | 🎮 Gradio Demo-zh | 🎮 Gradio Demo-en | 💬 DingTalk(钉钉)


Introduction

Ming-omni-tts is a high-performance unified audio generation model that achieves precise control over speech attributes and enables single-channel synthesis of speech, environmental sounds, and music. Powered by a custom 12.5 Hz continuous tokenizer and Patch-by-Patch compression, it delivers competitive inference efficiency (an LLM frame rate of 3.1 Hz). The model also features robust text normalization for accurate and natural narration of complex mathematical and chemical expressions.

🚀 Core Capabilities

  • 🔊 Fine-grained Vocal Control: The model supports precise control over speech rate, pitch, volume, emotion, and dialect through simple commands. Notably, its accuracy for Cantonese dialect control is as high as 93%, and its emotion control accuracy reaches 46.7%, surpassing CosyVoice3.
  • 🌌 Intelligent Voice Design: Features 100+ premium built-in voices and supports zero-shot voice design through natural language descriptions. Its performance on the InstructTTSEval-ZH benchmark is on par with Qwen3-TTS.
  • 🎶 Immersive Unified Generation: The industry’s first autoregressive model to jointly generate speech, ambient sound, and music in a single channel. Built on a custom 12.5Hz continuous tokenizer and a DiT head architecture, it delivers a seamless, "in-the-scene" auditory experience.
  • High-efficiency Inference: Introduces a "Patch-by-Patch" compression strategy that reduces the LLM inference frame rate to 3.1Hz. This significantly cuts latency and enables podcast-style audio generation while preserving naturalness and audio detail.
  • 🧪 Professional Text Normalization: The model accurately parses and narrates complex formats, including mathematical expressions and chemical equations, ensuring natural-sounding output for specialized applications.

Demo

demo.mp4

Updates

🚀 Key Features

Compared with other audio-capable LLMs, Ming-omni-tts features the following key optimizations:

  • Unified Continuous Audio Tokenizer: We propose a continuous VAE-based tokenizer that integrates speech, music, and general audio into a unified latent space at a 12.5 Hz frame rate, yielding competitive results on audio reconstruction and various downstream synthesis benchmarks.

  • Unified Audio Language Model for Speech, Music and Sound Generation: We present a unified, end-to-end audio language model that employs a single LLM backbone to perform joint generation of speech, music, and general sound. To enhance audio quality, the architecture is augmented with a Diffusion Head. Furthermore, we employ a patch-based generation strategy with a patch size of 4 and a look-back history of 32 frames, striking a balance between local acoustic detail and long-range structural coherence (a simplified sketch of this loop follows below).
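The patch-based strategy is also where the 3.1 Hz inference figure comes from: latent frames are produced at 12.5 Hz, and grouping them into patches of 4 means the LLM backbone only advances at 12.5 / 4 ≈ 3.1 steps per second. The snippet below is a minimal, illustrative sketch of such a patch-by-patch loop; it is not the actual Ming-omni-tts API, and lm_backbone and diffusion_head are hypothetical stand-ins for the LLM backbone and the Diffusion Head described above.

# Illustrative sketch of patch-by-patch generation (hypothetical interfaces, not the real API).
# Assumptions: lm_backbone(text_emb, history) returns a conditioning state, and
# diffusion_head(cond, n_frames) samples n_frames continuous latent frames from it.
import torch

PATCH_SIZE = 4        # latent frames generated per LLM step
LOOK_BACK = 32        # history frames fed back to the backbone
FRAME_RATE_HZ = 12.5  # tokenizer frame rate, so the LLM steps at 12.5 / 4 ≈ 3.1 Hz

def generate_latents(lm_backbone, diffusion_head, text_emb, num_frames, latent_dim=64):
    latents = torch.zeros(0, latent_dim)               # continuous latent frames generated so far
    while latents.shape[0] < num_frames:
        history = latents[-LOOK_BACK:]                 # bounded look-back window
        cond = lm_backbone(text_emb, history)          # one backbone step per patch
        patch = diffusion_head(cond, PATCH_SIZE)       # sample the next 4 latent frames
        latents = torch.cat([latents, patch], dim=0)
    return latents[:num_frames]                        # trim any overshoot from the last patch

Under these assumptions, a 10-second clip corresponds to 125 latent frames but only about 32 backbone steps, which is where most of the latency reduction comes from.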

Evaluation

  • Reconstruction: The 12.5 Hz tokenizer supports high-quality reconstruction across speech, music, and sound, with fidelity comparable to existing state-of-the-art methods on key metrics.
  • Dialect Generation: Achieves 96% accuracy on WSYue-TTS-Eval and 86% on WSC-TTS-Eval, outperforming CosyVoice3.
  • Emotional Expressiveness: Delivers an average accuracy of 76.7% on the CV3-Eval emotional sets and 46.7% on the neutral emotion sets, significantly surpassing CosyVoice3-Base (40%) and reaching SOTA levels.
  • Instruction-based Voice Design: Scores 76.20% on InstructTTSEval-ZH, with instruction-following on par with Qwen3-TTS-VoiceDesign.
  • Zero-shot Voice Cloning: Exhibits exceptional stability on seed-tts-eval (Chinese) with a WER of 0.83%, outperforming Seed-TTS and GLM-TTS.
  • Text Normalization (TN): On internal technical test sets, the model achieves a CER of 1.97% in normalized regions, delivering performance comparable to Gemini-2.5 Pro.
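The WER/CER figures throughout the tables below are typically obtained by transcribing the synthesized audio with an ASR model and comparing the transcript against the input text. Purely as an unofficial illustration of the metric (not the evaluation pipeline used for these benchmarks), a character error rate can be computed with the open-source jiwer package:

# Unofficial illustration of CER scoring: compare an ASR transcript of the
# synthesized audio against the text originally given to the TTS model.
# Requires: pip install jiwer
import jiwer

reference = "今天天气很好"    # text given to the TTS model
hypothesis = "今天天气很好"   # ASR transcript of the generated audio

# jiwer.cer scores at the character level, the usual choice for Chinese text.
print(f"CER: {jiwer.cer(reference, hypothesis):.2%}")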

Audio Tokenizer

Speech metrics are evaluated on AISHELL-3 (44.1 kHz, Chinese) and VCTK (44.1 kHz, English).
Music metrics are evaluated on MUSDB18 (44.1 kHz) and MUSDB18-HQ (44.1 kHz).
Audio metrics are evaluated on AudioCaps.

Speech Controllable Generative Tasks

Zero-shot TTS

Zero-shot speech generation performance comparison on the Seed-TTS testset.

| Model | Institution | seed-tts-eval-zh WER ↓ | seed-tts-eval-zh SIM ↑ | seed-tts-eval-en WER ↓ | seed-tts-eval-en SIM ↑ |
|---|---|---|---|---|---|
| Seed-TTS | BytedanceSpeech | 1.11 | 0.796 | 2.24 | 0.762 |
| MaskGCT | College | 2.27 | 0.774 | 2.62 | 0.714 |
| E2 TTS | Microsoft | 1.97 | 0.730 | 2.19 | 0.710 |
| F5-TTS | College | 1.56 | 0.741 | 1.83 | 0.647 |
| CosyVoice 2 | Alibaba | 1.45 | 0.748 | 2.57 | 0.652 |
| Qwen3-Omni-30B-A3B | Alibaba | 1.07 | - | 1.39 | - |
| CosyVoice 3-0.5B | Alibaba | 1.16 | 0.780 | 2.02 | 0.718 |
| CosyVoice 3-1.5B | Alibaba | 0.71 | 0.775 | 1.45 | 0.695 |
| Qwen3-TTS-25Hz-0.6B-Base | Alibaba | 1.18 | - | 1.64 | - |
| Qwen3-TTS-25Hz-1.7B-Base | Alibaba | 1.10 | - | 1.49 | - |
| Qwen3-TTS-12Hz-0.6B-Base | Alibaba | 0.92 | - | 1.32 | - |
| Qwen3-TTS-12Hz-1.7B-Base | Alibaba | 0.77 | - | 1.24 | - |
| GLM-TTS | Zhipu AI | 1.03 | 0.761 | 2.23 | 0.672 |
| Ming-Flash-Omni-preview | Ant Group | 0.99 | 0.740 | 1.59 | 0.680 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 0.87 | 0.72 | 2.19 | 0.61 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 0.83 | 0.75 | 2.02 | 0.62 |

Speech Attribute Control

Instruction success rate by attribute (speech rate / speech volume / speech F0 / avg.), together with WER and SIM.

| Model | Institution | Speech rate | Speech volume | Speech F0 | Avg. | WER | SIM |
|---|---|---|---|---|---|---|---|
| CosyVoice3 | Alibaba | 100% | 97.67% | 65.33% | 87.67% | 1.21% | 0.58 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 97.67% | 95.00% | 91.33% | 94.67% | 0.27% | 0.712 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 96.33% | 97.00% | 83.67% | 92.33% | 0.347% | 0.776 |

Emotional Control

Below is a comparison between Ming-omni-tts and other state-of-the-art (SOTA) models on the emotion control task.

Emotion accuracy on the Text-Related and Text-Unrelated subsets of the CV3-Eval Emotional test sets.

| Model | Institution | Average | Text-Related happy | Text-Related sad | Text-Related angry | Text-Unrelated happy | Text-Unrelated sad | Text-Unrelated angry |
|---|---|---|---|---|---|---|---|---|
| F5-TTS | SJTU | 0.647 | 0.92 | 0.52 | 0.72 | 0.80 | 0.28 | 0.64 |
| Sparks-TTS | HKST | 0.553 | 0.80 | 0.56 | 0.50 | 0.50 | 0.60 | 0.36 |
| GPT-SoVits | - | 0.517 | 0.88 | 0.54 | 0.50 | 0.48 | 0.40 | 0.30 |
| CosyVoice2 | Alibaba | 0.587 | 0.84 | 0.72 | 0.58 | 0.56 | 0.44 | 0.38 |
| CosyVoice3-0.5B | Alibaba | 0.663 | 0.92 | 0.70 | 0.72 | 0.64 | 0.42 | 0.58 |
| CosyVoice3-1.5B | Alibaba | 0.630 | 0.86 | 0.64 | 0.72 | 0.64 | 0.44 | 0.48 |
| + DiffRO-EMO | Alibaba | 0.777 | 0.98 | 0.68 | 0.84 | 0.98 | 0.50 | 0.68 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 0.700 | 0.94 | 0.80 | 0.84 | 0.58 | 0.42 | 0.62 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 0.767 | 0.96 | 0.86 | 0.90 | 0.66 | 0.40 | 0.82 |

Emotion accuracy on the Text-Related and Text-Unrelated subsets of the CV3-Eval Neutral test sets.

| Model | Institution | Average | Text-Related happy | Text-Related sad | Text-Related angry | Text-Unrelated happy | Text-Unrelated sad | Text-Unrelated angry |
|---|---|---|---|---|---|---|---|---|
| CosyVoice3-0.5B | Alibaba | 0.400 | 0.68 | 0.30 | 0.78 | 0.14 | 0.04 | 0.46 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 0.343 | 0.68 | 0.26 | 0.74 | 0.14 | 0.00 | 0.24 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 0.450 | 0.78 | 0.38 | 0.76 | 0.30 | 0.02 | 0.46 |

Dialect Control

Dialect performance comparison on WSC-Eval-TTS (easy/hard) and WSYue-TTS-eval (Base/Coverage); metrics per test set are CER (%) ↓, SIM ↑, and ACC (%) ↑.

| Model | Institution | WSC-easy CER | WSC-easy SIM | WSC-easy ACC | WSC-hard CER | WSC-hard SIM | WSC-hard ACC | WSYue-Base CER | WSYue-Base SIM | WSYue-Base ACC | WSYue-Coverage CER | WSYue-Coverage SIM | WSYue-Coverage ACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Step-Audio-TTS | Step | 10.83 | 67.66 | - | 12.52 | 54.52 | - | 27.79 | 0.762 | - | 24.25 | 0.781 | - |
| CosyVoice 2.0 | Alibaba | 7.14 | 70.27 | - | 9.06 | 60.10 | - | 14.38 | 0.812 | - | 13.74 | 0.826 | - |
| Qwen-TTS | Alibaba | 4.13 | - | - | 7.35 | - | - | - | - | - | - | - | - |
| CosyVoice2-WSC | Alibaba | 4.28 | 72.78 | - | 8.78 | 62.59 | - | - | - | - | - | - | - |
| CosyVoice2-WSC-SFT | Alibaba | 4.08 | 78.84 | - | 7.22 | 67.96 | - | - | - | - | - | - | - |
| Llasa-1B | - | - | - | - | - | - | - | 53.31 | 0.732 | - | 43.68 | 0.754 | - |
| Llasa-1B-Yue | - | - | - | - | - | - | - | 10.89 | 0.762 | - | 12.78 | 0.772 | - |
| Edge-TTS | - | - | - | - | - | - | - | 8.30 | - | - | 9.27 | - | - |
| Cosyvoice2-Yue | - | - | - | - | - | - | - | 10.33 | 0.821 | - | 9.49 | 0.834 | - |
| CosyVoice3 | Alibaba | 3.17 | 0.696 | 68.06 | 4.07 | 0.723 | 80.90 | 8.36 | 0.611 | 91.70 | 8.95 | 0.658 | 95.80 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 2.25 | 0.695 | 82.08 | 3.18 | 0.717 | 84.42 | 9.70 | 0.598 | 96.00 | 11.62 | 0.644 | 95.80 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 2.35 | 0.730 | 83.48 | 3.19 | 0.750 | 88.44 | 6.47 | 0.622 | 96.30 | 7.87 | 0.667 | 95.81 |

Podcast TTS

Podcast performance comparison on the ZipVoice-Dia-zh test set.

| Model | Institution | CER ↓ | cpSIM ↑ | UTMOS ↑ |
|---|---|---|---|---|
| ZipVoice-Dia | Xiaomi | 3.39% | 0.553 | 2.24 |
| MoonCast | Kimi | 27.43% | 0.441 | 1.76 |
| MOSS-TTSD | Fudan | 8.62% | 0.421 | 1.70 |
| Vibevoice-1.5B | Microsoft | 12.87% | 0.455 | 1.74 |
| FireRedTTS2 | Xiaohongshu | 3.34% | 0.512 | 1.90 |
| SoulX-Podcast | Soul | 2.20% | 0.599 | 2.09 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 2.12% | 0.457 | 2.25 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 1.84% | 0.470 | 2.19 |

Voice Design

Voice Design performance comparison on the InstructTTSEval-ZH test set.

| Model | Institution | APS ↑ | DSD ↑ | RP ↑ | Average |
|---|---|---|---|---|---|
| Qwen3TTS-12Hz-1.7B-VD | Alibaba | 85.2 | 81.1 | 65.1 | 77.13 |
| Mimo-Audio-7B-Instruct | Xiaomi | 75.7 | 74.3 | 61.5 | 70.50 |
| VoiceSculptor | NPU | 75.7 | 64.7 | 61.5 | 67.30 |
| VoxInstruct | Tsinghua | 47.5 | 52.3 | 42.6 | 47.47 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 83.85 | 75.10 | 61.50 | 73.48 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 87.30 | 79.80 | 61.50 | 76.20 |

Audio & BGM Generation

Text-To-BGM

Text-to-BGM performance comparison on the Ming-BGM-Eval test set (mulan_t, Audiobox-Aesthetics, and SongEval metrics).

| Model | Institution | mulan_t | Audiobox CE | Audiobox CU | Audiobox PC | Audiobox PQ | Audiobox Avg. | SongEval CO | SongEval MU | SongEval ME | SongEval CL | SongEval NA | SongEval Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doubao | Bytedance | 0.268 | 7.55 | 8.21 | 4.97 | 8.25 | 7.24 | 3.30 | 3.02 | 3.00 | 3.02 | 2.92 | 3.05 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 0.230 | 7.18 | 8.16 | 4.80 | 8.20 | 7.08 | 3.11 | 2.86 | 2.86 | 2.81 | 2.73 | 2.87 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 0.250 | 7.19 | 8.14 | 4.69 | 8.18 | 7.05 | 3.08 | 2.84 | 2.82 | 2.78 | 2.74 | 2.85 |

Text-To-Audio(TTA)

TTA performance comparison on the AudioCaps test set.

| Model | Institution | FDopenl3 | KLpasst | CLAPscore |
|---|---|---|---|---|
| AudioLDM-large | University of Surrey | 108.300 | 1.810 | 0.419 |
| Stable Audio Open | Stability AI | 96.133 | 2.148 | 0.306 |
| TangoFlux | Singapore University of Technology and Design | 137.700 | 1.041 | 0.547 |
| TangoFlux_base | Singapore University of Technology and Design | 149.270 | 1.125 | 0.523 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 74.292 | 2.257 | 0.347 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 65.918 | 1.640 | 0.424 |

Text Normalization

Text Normalization performance comparison on an internally constructed test set.

| Model | Institution | TN-area WER ↓ | Non-TN-area WER ↓ |
|---|---|---|---|
| Gemini-2.5 Pro | Google | 2.00% | 0.97% |
| Ming-omni-tts-0.5B (ours) | Ant Group | 1.97% | 0.85% |

Model & Benchmark Downloads

You can download our latest models and benchmarks from both Hugging Face and ModelScope.

| Model | Download |
|---|---|
| Ming-omni-tts-tokenizer-12Hz | 🤗 HuggingFace / 🤖 ModelScope |
| Ming-omni-tts-0.5B | 🤗 HuggingFace / 🤖 ModelScope |
| Ming-omni-tts-16.8B-A3B | 🤗 HuggingFace / 🤖 ModelScope |

If you are in mainland China, we strongly recommend downloading our models from 🤖 ModelScope.

pip install modelscope
modelscope download --model inclusionAI/Ming-omni-tts-0.5B --local_dir inclusionAI/Ming-omni-tts-0.5B  --revision master

Note: This download process will take several minutes to several hours, depending on your network conditions.
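Models hosted on Hugging Face can also be fetched programmatically with huggingface_hub. The snippet below is a minimal sketch and assumes the Hugging Face repository id mirrors the ModelScope id shown above:

# Minimal sketch: fetch the 0.5B checkpoint via huggingface_hub.
# Assumption: the Hugging Face repo id matches the ModelScope id above.
# Requires: pip install huggingface_hub
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="inclusionAI/Ming-omni-tts-0.5B",
    local_dir="inclusionAI/Ming-omni-tts-0.5B",
)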

Environment Preparation

Installation with pip

pip install -r requirements.txt

Installation with docker

You can set up the environment using Docker in two ways.

  • Option 1: Pull from Docker Hub (Recommended)
# 1. Pull the pre-built image
docker pull yongjielv/ming_uniaudio:v1.1

# 2. Run the container
docker run -it --gpus all yongjielv/ming_uniaudio:v1.1 /bin/bash
  • Option 2: Build from Source
# 1. Build the image
docker build -t ming-omni-tts:v1.1 -f ./docker/ming_uniaudio.dockerfile .

# 2. Run the container
docker run -it --gpus all ming-omni-tts:v1.1 /bin/bash

Example Usage

git clone https://github.com/inclusionAI/Ming-omni-tts.git
cd Ming-omni-tts
python3 cookbooks/test.py

For detailed usage, please refer to demo.ipynb.

Note: The examples were tested on NVIDIA H800-80GB / H20-96G GPUs with CUDA 12.4.

Citation

If you find our work helpful, please feel free to cite us.
