🤗 Models | 🤗 Eval Data | 📑 Paper | 💜 BenchMark
Youri Xu², Haoqi Gu², Zhongqian Xie², Chuanjiang Luo²
¹Zhejiang University ²NetEase Cloud Music
Training and fine-tuning code for the MuFun model proposed in *Advancing the Foundation Model for Music Understanding*.
The released models are MuFun-Base and several fine-tuned variants: MuFun-Instruct, MuFun-ACEStep, and MuFun-ABC.
Demo:
- http://47.121.209.64/mufun_demo_chat for MuFun-Instruct
- http://47.121.209.64/mufun_demo_acestep for MuFun-ACEStep
Some related fine-tuning datasets: ACEStep-Songs, midi-audio-abc.
Our main training code is adapted from TinyLLaVA Factory to support audio input; for reinforcement learning we modified the Hugging Face TRL library. Data processing scripts for the open datasets will be uploaded soon.
For documentation and code analysis, https://deepwiki.com/laitselec/MuFun is a good starting point.
- Inference Code
- Installation
- Data Preparation
- Finetuning
- Train from Scratch
- Reinforcement Learning
- Custom Model Architecture
- Citation
## Inference Code

For inference it is not necessary to install this repo; only a few audio processing packages such as mutagen and torchaudio are needed.
(currently supported audio formats: '.wav', '.mp3', '.flac', '.opus', '.ogg')
This series of models has approximately 9B parameters. If you load the model for inference at its native BF16 precision, 24GB of VRAM will be sufficient for songs up to 5 minutes long. (VRAM consumption increases with audio duration, as 1 second of audio corresponds to roughly 10 tokens. For instance, a 3-minute song will occupy 1800 tokens in the model's context window.)
As for quantization, there is no separate pre-quantized version available at the moment. However, you can use the bitsandbytes library to quantize the model to 4-bit or 8-bit on the fly during loading. The quality loss at 4-bit is acceptable, and with this approach, 12GB of VRAM should be enough for songs of typical length.
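For a quick sanity check before loading a long file, the clip's duration can be read with mutagen (one of the audio packages mentioned above) and converted into an approximate token count using the ~10 tokens per second figure quoted above. The helper below is only an illustrative sketch, not part of the released code.

```python
from mutagen import File  # mutagen is only used here to read the duration

def estimate_audio_tokens(path, tokens_per_second=10):
    """Rough estimate of how many context tokens an audio file will occupy.

    The ~10 tokens/second figure is the approximation quoted above; the exact
    number depends on the model's audio front end.
    """
    duration_s = File(path).info.length  # duration in seconds
    return int(duration_s * tokens_per_second)

# e.g. a 3-minute song: ~180 s * 10 ≈ 1800 tokens
# print(estimate_audio_tokens("/path/to/your/song.mp3"))
```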
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
hf_path = 'Yi3852/MuFun-Instruct' # or 'Yi3852/MuFun-Base' 'Yi3852/MuFun-ACEStep' 'Yi3852/MuFun-ABC'
tokenizer = AutoTokenizer.from_pretrained(hf_path, use_fast=False)
device='cuda'
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True, torch_dtype="bfloat16")
model.to(device)
# single audio
# during inference the audio (converted to a sequence of embeddings) is placed at the position of the <audio> tag in the prompt
aud="/path/to/your/song.mp3"
inp="\n<audio>Can you listen to this song and tell me its lyrics?"
res=model.chat(prompt=inp, audio_files=aud, tokenizer=tokenizer)
print(res)
# multiple audios
# for multiple songs, each will be placed in the corresponding <audio> tag in the prompt
aud=["/path/to/your/song1.mp3", '/path/to/your/song2.mp3']
inp="\n<audio> This is song1. <audio> This is song2. Which song do you like more? Tell me the reason."
res=model.chat(prompt=inp, audio_files=aud, tokenizer=tokenizer)
print(res)
# analyze only a specific segment of the audio using the segs parameter
# the format is [start_time, end_time] (in seconds); for multiple audios, segs can be passed like [[0,30],[60,90]] or [None,[0,30.0]]
aud="/path/to/your/song.mp3"
inp="\n<audio>How is the rhythm of this music clip?"
res=model.chat(prompt=inp, audio_files=aud, segs=[0,30.0], tokenizer=tokenizer)
print(res)
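# (illustrative, combining the two examples above) segments for multiple audios:
# pass one [start_time, end_time] pair per file; use None to keep a clip whole
aud=["/path/to/your/song1.mp3", '/path/to/your/song2.mp3']
inp="\n<audio> This is clip1. <audio> This is clip2. Which clip has a faster rhythm?"
res=model.chat(prompt=inp, audio_files=aud, segs=[[0,30],[60,90]], tokenizer=tokenizer)
print(res)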
# setting audio_files=None also works, but using it as a text-only model is not recommended
```

Quantization using bitsandbytes:

```python
from transformers import BitsAndBytesConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
hf_path = 'Yi3852/MuFun-Instruct' # or 'Yi3852/MuFun-Base' 'Yi3852/MuFun-ACEStep' 'Yi3852/MuFun-ABC'
tokenizer = AutoTokenizer.from_pretrained(hf_path, use_fast=False)
quantization_config = BitsAndBytesConfig(
load_in_4bit=True, # or load_in_8bit=True for 8-bit quantization
llm_int8_skip_modules=["lm_head", 'vision_tower', 'connector']
)
model = AutoModelForCausalLM.from_pretrained(
hf_path,
trust_remote_code=True,
torch_dtype="bfloat16",
device_map="auto",
quantization_config=quantization_config
)
```

## Installation

```bash
git clone https://github.com/laitselec/MuFun.git
cd MuFun
conda create -n mufun python=3.10 -y
conda activate mufun
pip install --upgrade pip
pip install -e .
pip install flash-attn==2.7.4.post1 --no-build-isolation # optional, otherwise change --attn_implementation to sdpa in the training scripts
```

## Data Preparation

See data_preparation_example.ipynb.
The dataset is stored in a JSON file, and each sample has the following format:
```json
{
"id": "LCeUo2tfY4LFpc5r3jiZid",
"audio": "~/gtzan/genres/blues/blues.00000.wav",
"conversations": [
{
"from": "human",
"value": "<audio>\nWhat category of music does this track fall under? (choose the genre from: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, and rock.)"
},
{
"from": "gpt",
"value": "blues"
}
]
},
```
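A dataset file in this format can be assembled with a few lines of Python. The sketch below only illustrates the JSON layout shown above, assuming (as the trailing comma suggests) that the file holds a list of such samples; the GTZAN-style path, prompt, and output filename are placeholders.

```python
import json
import uuid

# (path, label) pairs for your own audio files -- placeholder example
items = [("~/gtzan/genres/blues/blues.00000.wav", "blues")]

samples = []
for audio_path, genre in items:
    samples.append({
        "id": uuid.uuid4().hex,  # any unique string works
        "audio": audio_path,
        "conversations": [
            {"from": "human",
             "value": "<audio>\nWhat category of music does this track fall under?"},
            {"from": "gpt", "value": genre},
        ],
    })

with open("train_data.json", "w") as f:
    json.dump(samples, f, indent=2, ensure_ascii=False)
```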
## Finetuning

After modifying parameters such as the data path in scripts/finetune.sh, run:

```bash
sh scripts/finetune.sh
```

Training script parameters (many of these are inherited from transformers.TrainingArguments):
| Parameter Name | Description |
|---|---|
| deepspeed | DeepSpeed configuration file |
| data_path | Path to the training data JSON file |
| eval_data_path | Path to the evaluation data JSON file; the eval_steps parameter determines the evaluation interval |
| pretrained_model_path | Path to the initial model weights |
| per_device_train_batch_size, per_device_eval_batch_size, gradient_accumulation_steps | Adjust based on available GPU memory |
| tune_type_llm, tune_type_vision_tower, tune_type_connector | Typically full or frozen; determines whether the component is trained or frozen |
| learning_rate, warmup_ratio, lr_scheduler_type | Learning rate related parameters |
| save_steps, save_total_limit, output_dir | Checkpoint saving parameters |
## Train from Scratch

Modify the parameters in scripts/train_scratch.sh, scripts/warmup_qwen.sh and scripts/trainfull_qwen.sh.

```bash
# this will start the warmup training, where --tune_type_llm and --tune_type_vision_tower are set to frozen
sh scripts/train_scratch.sh
# after the above is done, set pretrained_model_path in trainfull_qwen.sh to the resulting checkpoint (checkpoint-xxx)
# then in train_scratch.sh comment out the bash scripts/warmup_qwen.sh line and uncomment the bash scripts/trainfull_qwen.sh line
# this will start full training where parameters of all modules are tuned
sh scripts/train_scratch.sh
```

## Reinforcement Learning

Currently GRPO is supported; for more details see tinyllava/train/train_grpo.py and trl-main/trl/trainer/grpo_trainer.py.
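For orientation, upstream TRL's GRPOTrainer accepts reward functions as plain Python callables that receive the generated completions (plus any extra dataset columns as keyword arguments) and return one score per completion. The sketch below shows that shape with a toy exact-match reward; whether train_grpo.py and scripts/grpo.sh in this repo wire rewards the same way is an assumption.

```python
def genre_exact_match_reward(completions, genre=None, **kwargs):
    """Toy reward: 1.0 if the generated text equals the reference genre, else 0.0.

    Assumes plain-text completions; `genre` stands in for an illustrative
    dataset column that the trainer would pass through as a keyword argument.
    """
    return [
        1.0 if completion.strip().lower() == reference.lower() else 0.0
        for completion, reference in zip(completions, genre)
    ]
```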
```bash
# install our modified trl library first
cd trl-main/
conda install -c conda-forge pyarrow
pip install .
cd ..
sh scripts/grpo.sh
```

## Custom Model Architecture

Our framework is suitable for training general LLaVA-style audio language models. If you want to use a different LLM, audio tower, or connector, go to tinyllava/model and tinyllava/data/template and add or modify the code accordingly.
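As a concrete starting point, the connector in a LLaVA-style model is just a small module that projects audio-tower features into the LLM's embedding space. The sketch below is a generic two-layer MLP projector; the class name, dimensions, and the way it would be registered under tinyllava/model are illustrative assumptions, not code from this repo.

```python
import torch.nn as nn

class MLPAudioConnector(nn.Module):
    """Illustrative connector: projects audio-tower features to the LLM hidden size."""

    def __init__(self, audio_hidden_size=1280, llm_hidden_size=4096):
        super().__init__()
        # dimensions are placeholders; match them to your audio tower and LLM
        self.proj = nn.Sequential(
            nn.Linear(audio_hidden_size, llm_hidden_size),
            nn.GELU(),
            nn.Linear(llm_hidden_size, llm_hidden_size),
        )

    def forward(self, audio_features):
        # audio_features: (batch, num_audio_tokens, audio_hidden_size)
        return self.proj(audio_features)
```

Whatever module you plug in, its output dimension has to match the LLM's hidden size so the projected audio embeddings can be placed at the <audio> positions in the prompt.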
## Citation

```bibtex
@misc{jiang2025advancingfoundationmodelmusic,
title={Advancing the Foundation Model for Music Understanding},
author={Yi Jiang and Wei Wang and Xianwen Guo and Huiyun Liu and Hanrui Wang and Youri Xu and Haoqi Gu and Zhongqian Xie and Chuanjiang Luo},
year={2025},
eprint={2508.01178},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2508.01178},
}
```