
Learnable-Speech Technical Implementation

An unofficial implementation of Learnable-Speech, improving on CosyVoice with a learnable encoder and a DAC-VAE; core components are adapted from CosyVoice2.

Overview

This repository provides an implementation of the Learnable-Speech model, featuring a two-stage training approach for high-quality 24kHz audio generation.

Key Features

  • 24kHz Audio Support: High-quality audio generation at a 24kHz sampling rate
  • Flow Matching AE: Flow-matching training for the autoencoder
  • Immiscible Assignment: Immiscible noise assignment during training (sketched after this list)
  • Contrastive Flow Matching: Contrastive flow-matching training objective (same sketch)
  • Checkpoint Release: Released LLM and contrastive flow-matching checkpoints
  • MeanFlow: MeanFlow objective for the flow-matching model
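
Both flow-matching variants fit in a few lines. The sketch below is illustrative only, not this repo's code: the `model(x_t, t, cond)` signature, the `(B, T, D)` latent shape, and the `lam` weight are assumptions. It pairs each training latent with its nearest noise sample (immiscible assignment) and adds a contrastive term that pushes the predicted velocity away from the velocity toward a mismatched target.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment  # batch-level noise-data matching


def immiscible_noise(x1: torch.Tensor) -> torch.Tensor:
    """Immiscible assignment: pair each data sample in the batch with the
    closest noise sample (linear assignment on pairwise L2 distances)."""
    noise = torch.randn_like(x1)
    dist = torch.cdist(x1.flatten(1), noise.flatten(1))   # (B, B) distance matrix
    _, cols = linear_sum_assignment(dist.detach().cpu().numpy())
    return noise[torch.as_tensor(cols, device=x1.device)]


def contrastive_fm_loss(model, x1, cond, lam=0.05):
    """Conditional flow matching plus a contrastive term: fit the velocity
    toward the true latent, push away from a shuffled (mismatched) one."""
    x0 = immiscible_noise(x1)                              # matched noise instead of i.i.d. pairing
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)     # training time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1                           # linear interpolation path
    v_pred = model(xt, t.view(-1), cond)                   # predicted velocity field
    v_pos = x1 - x0                                        # target velocity
    v_neg = x1[torch.randperm(x1.size(0))] - x0            # velocity toward a wrong target
    return F.mse_loss(v_pred, v_pos) - lam * F.mse_loss(v_pred, v_neg)
```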

Architecture

Stage 1: Audio to Discrete Tokens

Converts raw audio into discrete representations using the FSQ (S3Tokenizer) framework.
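
FSQ replaces a learned VQ codebook with per-channel rounding to a small fixed grid. The snippet below is a minimal sketch of that idea, not the S3Tokenizer implementation; the `levels` tuple is illustrative.

```python
import torch


def fsq_quantize(z: torch.Tensor, levels=(8, 5, 5, 5)) -> torch.Tensor:
    """Finite Scalar Quantization: bound each channel with tanh, snap it to a
    small integer grid, and pass gradients straight through the rounding."""
    half = (torch.tensor(levels, dtype=z.dtype, device=z.device) - 1) / 2
    bounded = torch.tanh(z) * half       # channel i lies in [-half_i, half_i]
    quantized = torch.round(bounded)     # nearest grid point = the discrete token
    # straight-through estimator: quantized values forward, identity gradient backward
    return bounded + (quantized - bounded).detach()
```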

Stage 2: Discrete Tokens to Continuous Latent Space

Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).

Note: This implementation uses standard DAC-VAE instead of Flow-VAE.
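
For orientation, the VAE bottleneck amounts to predicting a mean and log-variance per latent frame and sampling with the reparameterization trick. This is a generic sketch, not the DAC-VAE code; `proj` is a hypothetical projection layer.

```python
import torch


def vae_bottleneck(h: torch.Tensor, proj: torch.nn.Linear):
    """Generic VAE bottleneck: split projected features into mean/log-variance,
    sample a latent, and return the KL term used to regularize it."""
    mu, logvar = proj(h).chunk(2, dim=-1)                    # latent statistics
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterized sample
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).mean()
    return z, kl
```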

Implementation Pipeline

1. Model Training

BPE tokens to FSQ tokens

  • Based on FSQ (finite scalar quantization) tokens from the S3Tokenizer
  • An autoregressive transformer predicts the FSQ tokens from BPE text tokens, conditioned on a learnable speaker extractor (sketched below)
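
A rough sketch of this stage's objective, under assumed shapes and a hypothetical `lm(inputs, spk_embed)` signature (not this repo's API): concatenate the text and speech token streams and train with next-token prediction on the speech part.

```python
import torch
import torch.nn.functional as F


def ar_speech_token_loss(lm, bpe_ids, fsq_ids, spk_embed):
    """Next-token prediction over [BPE text tokens ; FSQ speech tokens],
    conditioned on a speaker embedding from the learnable extractor."""
    inputs = torch.cat([bpe_ids, fsq_ids[:, :-1]], dim=1)   # teacher forcing
    logits = lm(inputs, spk_embed)                          # (B, L, vocab), hypothetical call
    # positions from the last text token onward predict the FSQ tokens
    speech_logits = logits[:, bpe_ids.size(1) - 1 :, :]
    return F.cross_entropy(
        speech_logits.reshape(-1, speech_logits.size(-1)),
        fsq_ids.reshape(-1),
    )
```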

FSQ tokens to DAC-VAE latent

  • Based on the CosyVoice2 flow-matching decoder
  • Learns continuous latent representations from the discrete tokens (see the sampling sketch below)
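
At inference time the decoder turns FSQ tokens into DAC-VAE latents by integrating its velocity field from noise. A minimal Euler sampler, assuming a `flow_decoder(x, t, tokens)` signature, a 64-dim latent, and the 3-latent-frames-per-token ratio noted under Feature Extraction below:

```python
import torch


@torch.no_grad()
def decode_tokens(flow_decoder, fsq_tokens, latent_dim=64, steps=10):
    """Euler integration of the learned velocity field, from Gaussian noise to a
    DAC-VAE latent sequence, conditioned on the discrete FSQ tokens."""
    b, n = fsq_tokens.shape
    x = torch.randn(b, n * 3, latent_dim, device=fsq_tokens.device)  # 3 latent frames per token (assumed)
    ts = torch.linspace(0.0, 1.0, steps + 1, device=fsq_tokens.device)
    for i in range(steps):
        t = ts[i].expand(b)
        v = flow_decoder(x, t, fsq_tokens)   # predicted velocity at time t
        x = x + (ts[i + 1] - ts[i]) * v      # one Euler step toward the data end
    return x                                  # (B, 3 * N, latent_dim)
```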

2. Feature Extraction

Before training the main model:

  1. Extract discrete tokens using the trained FSQ S3Tokenizer
  2. Generate continuous latent representations using the trained DAC-VAE (a pretrained DAC-VAE checkpoint is provided)
  • Note: The released model is trained at a scale where one FSQ token corresponds to 3 DAC-VAE latent frames; a 2x frame-rate version will follow soon (see the check below)
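
A quick sanity check on the extracted features, assuming the saved tensors are shaped `(N,)` for FSQ tokens and `(T, D)` for latents (adjust to the actual layout your extraction produces):

```python
import torch

fsq = torch.load("audio_name_fsq.pt", map_location="cpu")        # (N,) token ids (assumed shape)
latent = torch.load("audio_name_latent.pt", map_location="cpu")  # (T, D) latent frames (assumed shape)

# with the current checkpoints, T should be roughly 3 * N
assert abs(latent.shape[0] - 3 * fsq.shape[0]) <= 3, (latent.shape, fsq.shape)

# one way to align the two rates: repeat each token id across its 3 latent frames
tokens_at_latent_rate = fsq.repeat_interleave(3, dim=0)
```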

3. Two-Stage Training

Train the models sequentially:

  • Stage 1: BPE tokens → discrete FSQ tokens
  • Stage 2: discrete FSQ tokens → DAC-VAE continuous latent space

Getting Started

Prerequisites

pip install -r requirements.txt

Training Pipeline

  1. Extracting FSQ

    pip install s3tokenizer
    s3tokenizer --wav_scp data.scp \
             --device "cuda" \
             --output_dir "./data" \
             --batch_size 32 \
             --model "speech_tokenizer_v2_25hz"

    Or install the tokenizer from this repo; it reads filelist.txt, where each line contains an audio file path (see files_test.txt for an example):

    cd speech/tools/S3Tokenizer
    pip3 install .
    # example cmd to run
    torchrun --nproc_per_node=4 --nnodes=1 --rdzv_id=2024 --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" `which s3tokenizer` --root_path /data/dataset/ \
                 --model speech_tokenizer_v2_25hz \
                 --device "cuda" \
                 --batch_size 64 \
                 --file_list /speech/files_test.txt \
                 --skip_existing
    
  2. Extracting DAC-VAE latent

    cd dac-vae
    python extract_dac_latents.py --checkpoint checkpoint.pt --config config.yml --root_path dataset --output_dir dataset/dac

After processing, the dataset root should contain the following files:

dataset_root/
├── audio_name.wav
├── audio_name.txt
├── audio_name_fsq.pt
├── audio_name_latent.pt
├── another_audio.wav
├── another_audio.txt
├── another_audio_fsq.pt
├── another_audio_latent.pt
└── ...
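
A small helper for iterating over this layout might look as follows (file naming per the tree above; what each tensor contains depends on the extraction scripts):

```python
from pathlib import Path

import torch


def iter_examples(root: str):
    """Yield text / FSQ-token / latent triples by pairing files that share a stem."""
    for fsq_path in sorted(Path(root).glob("*_fsq.pt")):
        stem = fsq_path.name[: -len("_fsq.pt")]
        latent_path = fsq_path.with_name(f"{stem}_latent.pt")
        text_path = fsq_path.with_name(f"{stem}.txt")
        if not (latent_path.exists() and text_path.exists()):
            continue  # skip incomplete examples
        yield {
            "text": text_path.read_text().strip(),
            "fsq": torch.load(fsq_path, map_location="cpu"),
            "latent": torch.load(latent_path, map_location="cpu"),
        }
```
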
  3. Stage 1: Autoregressive Transformer

    #!/bin/bash
    pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B
    
    export CUDA_VISIBLE_DEVICES="0"
    num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
    job_id=1986
    dist_backend="nccl"
    num_workers=2
    prefetch=100
    train_engine=torch_ddp
    model=llm
    
    torchrun --nnodes=1 --nproc_per_node=$num_gpus --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
    train.py \
    --train_engine $train_engine \
    --config config.yaml \
    --train_data data/data.list \
    --cv_data data/data.list \
    --qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
    --model $model \
    --model_dir /data/checkpoint/$model/ \
    --num_workers ${num_workers} \
    --prefetch ${prefetch} \
    --pin_memory \
    --use_amp \
    --comet_disabled
    
  4. Stage 2: Flow-matching decoder

    #!/bin/bash
    pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B
    export CUDA_VISIBLE_DEVICES="0"
    num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
    job_id=1986
    dist_backend="nccl"
    num_workers=2
    prefetch=100
    train_engine=torch_ddp
    model=flow
    
    torchrun --nnodes=1 --nproc_per_node=$num_gpus --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
    train.py \
    --train_engine $train_engine \
    --config config.yaml \
    --train_data data/data.list \
    --cv_data data/data.list \
    --qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
    --model $model \
    --model_dir /data/checkpoint/$model/ \
    --num_workers ${num_workers} \
    --prefetch ${prefetch} \
    --pin_memory \
    --use_amp \
    --comet_disabled
    

Project Structure

minimax-speech/
├── assets/
├── dac-vae/
├── flowae/
├── speech/
│   ├── llm/
│   └── flow/
└── README.md

Related Projects

This implementation builds upon several key projects:

  • CosyVoice2: Core model architectures and training pipelines
  • Descript Audio Codec: Audio tokenization framework
  • Learnable-Speech: Original technical report and methodology

Citation

If you use this code in your research, please cite:

@article{minimax-speech,
  title={Learnable-Speech},
  author={Learnable Team},
  year={2025},
  url={https://arxiv.org/pdf/2505.07916}
}

@misc{cosyvoice2,
  title={CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens},
  author={FunAudioLLM Team, SpeechLab@Tongyi, Alibaba Group},
  year={2024},
  url={https://github.com/FunAudioLLM/CosyVoice}
}

License

This project follows the licensing terms of its dependencies.

Acknowledgments

  • CosyVoice2: This implementation extensively uses code and architectures from CosyVoice2
  • FSQ: For the FSQ implementation
  • Learnable team: For the technical report and methodology
  • FunAudioLLM team: For the excellent CosyVoice2 codebase

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Disclaimer

The content provided above is for academic purposes only and is intended to demonstrate technical capabilities.