An unofficial implementation of improvements to CosyVoice, featuring a learnable encoder and DAC-VAE, with core components adapted from CosyVoice2.
This repository provides an implementation of the Learnable-Speech model, featuring a two-stage training approach for high-quality 24kHz audio generation.
- 24kHz Audio Support: High-quality audio generation at a 24kHz sampling rate
- Flow Matching AE: Flow-matching training for autoencoders
- Immiscible Assignment: Immiscible noise assignment during training (see the sketch after this list)
- Contrastive Flow Matching: Contrastive flow-matching training objective
- Checkpoint Release: Released LLM and contrastive FM checkpoints
- MeanFlow: MeanFlow objective for the FM model
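For intuition, here is a minimal PyTorch sketch of immiscible noise assignment and a contrastive flow-matching loss. It illustrates the general techniques, not this repository's implementation; the function names, the batch-wise linear assignment, and the `lam` weight are assumptions.

```python
import torch
from scipy.optimize import linear_sum_assignment

def immiscible_noise(x1: torch.Tensor) -> torch.Tensor:
    """Immiscible assignment: draw a batch of noise, then match each data
    sample to a nearby noise vector via linear assignment, so data/noise
    pairings do not 'mix' across the batch."""
    noise = torch.randn_like(x1)
    cost = torch.cdist(x1.flatten(1), noise.flatten(1))  # (B, B) pairwise L2
    _, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return noise[cols]

def contrastive_fm_loss(v_pred: torch.Tensor, v_target: torch.Tensor,
                        lam: float = 0.05) -> torch.Tensor:
    """Contrastive flow matching: regress onto the paired target velocity
    while pushing away from velocities of other samples in the batch."""
    pos = (v_pred - v_target).pow(2).mean()
    neg = (v_pred - v_target.roll(1, dims=0)).pow(2).mean()  # shifted batch as negatives
    return pos - lam * neg
```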
Converts raw audio into discrete representations using the FSQ (S3Tokenizer) framework.
Maps discrete tokens to a continuous latent space using a Variational Autoencoder (VAE).
Note: This implementation uses standard DAC-VAE instead of Flow-VAE.
- Based on FSQ (finite scalar quantization); a minimal FSQ sketch follows this list
- Uses an autoregressive transformer to predict FSQ tokens, with a learnable speaker extractor
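As a rough illustration of the FSQ rounding step (not the S3Tokenizer implementation; the level counts are placeholders):

```python
import torch

def fsq_quantize(z: torch.Tensor, levels=(8, 5, 5, 5)) -> torch.Tensor:
    """Finite scalar quantization: bound each channel to a fixed number of
    levels, round to the nearest level, and pass gradients straight through."""
    l = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (l - 1) / 2
    z = torch.tanh(z) * half          # bound each channel to [-half, half]
    z_q = torch.round(z)              # snap to the integer grid
    return z + (z_q - z).detach()     # straight-through estimator
```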
- Based on the CosyVoice2 flow-matching decoder
- Learns continuous latent representations from discrete tokens (a sketch of one training step follows)
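A minimal sketch of one conditional flow-matching training step under a linear interpolation path; the model signature and conditioning here are assumptions, not this repository's API:

```python
import torch

def flow_matching_step(model, tokens, latent):
    """One CFM step: sample a point on the straight path between noise and
    the target DAC-VAE latent, and regress the model onto the path velocity."""
    x1 = latent                                   # target latent, (B, T, D)
    x0 = torch.randn_like(x1)                     # prior sample
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)
    xt = (1 - t) * x0 + t * x1                    # linear interpolation path
    v_target = x1 - x0                            # constant path velocity
    v_pred = model(xt, t, tokens)                 # hypothetical signature
    return (v_pred - v_target).pow(2).mean()
```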
Before training the main model:
- Extract discrete tokens using the trained FSQ S3Tokenizer
- Generate continuous latent representations using the trained DAC-VAE (a pretrained DAC-VAE checkpoint is provided)
- Note: this model is trained at a scale where one FSQ token corresponds to 3 frames of the DAC-VAE latent (a 3× frame-rate factor); a 2× factor version will be released soon. The sketch below shows the alignment.
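A minimal sketch of the token-to-latent frame alignment implied by the 3× factor (the helper name and trimming behavior are assumptions):

```python
import torch

def align_tokens_to_latent(fsq_tokens: torch.Tensor,
                           latent: torch.Tensor,
                           factor: int = 3) -> torch.Tensor:
    """Upsample FSQ tokens (B, N) by `factor` so each token covers `factor`
    frames of the DAC-VAE latent (B, T, D), trimming any remainder."""
    upsampled = fsq_tokens.repeat_interleave(factor, dim=1)  # (B, N * factor)
    T = min(upsampled.size(1), latent.size(1))
    return upsampled[:, :T]
```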
Train the models sequentially:
- Stage 1: BPE tokens → discrete FSQ tokens
- Stage 2: discrete FSQ tokens → continuous DAC-VAE latent space
Install the dependencies:

```bash
pip install -r requirements.txt
```
Extracting FSQ tokens
```bash
pip install s3tokenizer

s3tokenizer --wav_scp data.scp \
    --device "cuda" \
    --output_dir "./data" \
    --batch_size 32 \
    --model "speech_tokenizer_v2_25hz"
```

Alternatively, install the version bundled in this repo. It extracts from filelist.txt, where each line contains one audio file path (see files_test.txt for an example):
```bash
cd speech/tools/S3Tokenizer
pip3 install .

# example command
torchrun --nproc_per_node=4 --nnodes=1 --rdzv_id=2024 \
    --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" \
    `which s3tokenizer` \
    --root_path /data/dataset/ \
    --model speech_tokenizer_v2_25hz \
    --device "cuda" \
    --batch_size 64 \
    --file_list /speech/files_test.txt \
    --skip_existing
```
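If you need to build filelist.txt yourself, a minimal sketch (the dataset root and extension filter are assumptions):

```python
from pathlib import Path

# Collect all .wav files under the dataset root, one absolute path per line.
root = Path("/data/dataset")
with open("filelist.txt", "w") as f:
    for wav in sorted(root.rglob("*.wav")):
        f.write(f"{wav.resolve()}\n")
```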
Extracting DAC-VAE latents
```bash
cd dac-vae
python extract_dac_latents.py \
    --checkpoint checkpoint.pt \
    --config config.yml \
    --root_path dataset \
    --output_dir dataset/dac
```
After processing, your dataset root folder should contain the following files:
```
dataset_root/
├── audio_name.wav
├── audio_name.txt
├── audio_name_fsq.pt
├── audio_name_latent.pt
├── another_audio.wav
├── another_audio.txt
├── another_audio_fsq.pt
├── another_audio_latent.pt
└── ...
```
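A minimal sketch for loading one example and checking the 3× token-to-latent frame factor (file naming follows the layout above; the tensor shapes are assumptions):

```python
import torch

stem = "dataset_root/audio_name"
fsq = torch.load(f"{stem}_fsq.pt")        # discrete FSQ tokens, e.g. (N,)
latent = torch.load(f"{stem}_latent.pt")  # DAC-VAE latent, e.g. (T, D)

# With the released scale, one FSQ token should cover ~3 latent frames.
print(fsq.shape, latent.shape, latent.shape[0] / fsq.shape[0])
```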
Stage 1: Autoregressive Transformer
```bash
#!/bin/bash
pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B
export CUDA_VISIBLE_DEVICES="0"
num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
job_id=1986
dist_backend="nccl"
num_workers=2
prefetch=100
train_engine=torch_ddp
model=llm

torchrun --nnodes=1 --nproc_per_node=$num_gpus --rdzv_id=$job_id \
    --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
    train.py \
    --train_engine $train_engine \
    --config config.yaml \
    --train_data data/data.list \
    --cv_data data/data.list \
    --qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
    --model $model \
    --model_dir /data/checkpoint/$model/ \
    --num_workers ${num_workers} \
    --prefetch ${prefetch} \
    --pin_memory \
    --use_amp \
    --comet_disabled
```
Stage 2: Flow Matching Decoder
```bash
#!/bin/bash
pretrained_model_dir=./pretrained_models/CosyVoice2-0.5B
export CUDA_VISIBLE_DEVICES="0"
num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
job_id=1986
dist_backend="nccl"
num_workers=2
prefetch=100
train_engine=torch_ddp
model=flow  # stage 2 trains the flow-matching decoder

torchrun --nnodes=1 --nproc_per_node=$num_gpus --rdzv_id=$job_id \
    --rdzv_backend="c10d" --rdzv_endpoint="localhost:1234" \
    train.py \
    --train_engine $train_engine \
    --config config.yaml \
    --train_data data/data.list \
    --cv_data data/data.list \
    --qwen_pretrain_path $pretrained_model_dir/CosyVoice-BlankEN \
    --model $model \
    --model_dir /data/checkpoint/$model/ \
    --num_workers ${num_workers} \
    --prefetch ${prefetch} \
    --pin_memory \
    --use_amp \
    --comet_disabled
```
```
minimax-speech/
├── assets/
├── dac-vae/
├── flowae/
├── speech/
│   ├── llm/
│   └── flow/
└── README.md
```
This implementation builds upon several key projects:
- CosyVoice2: Core model architectures and training pipelines
- Descript Audio Codec: Audio tokenization framework
- Learnable-Speech: Original technical report and methodology
If you use this code in your research, please cite:
```bibtex
@article{minimax-speech,
  title={Learnable-Speech},
  author={Learnable team},
  year={2025},
  url={https://arxiv.org/pdf/2505.07916}
}

@misc{cosyvoice2,
  title={CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens},
  author={FunAudioLLM Team, SpeechLab@Tongyi, Alibaba Group},
  year={2024},
  url={https://github.com/FunAudioLLM/CosyVoice}
}
```

This project follows the licensing terms of its dependencies:
- CosyVoice2 components: Check CosyVoice2 License
- FSQ components: Apache 2.0 License
- CosyVoice2: This implementation extensively uses code and architectures from CosyVoice2
- FSQ: For the FSQ implementation
- Learnable team: For the technical report and methodology
- FunAudioLLM team: For the excellent CosyVoice2 codebase
Contributions are welcome! Please feel free to submit a Pull Request.
The content provided above is for academic purposes only and is intended to demonstrate technical capabilities.
