🤗 Models | 🤗 Eval Data | 📑 Paper | 💜 BenchMark
Youri Xu², Haoqi Gu², Zhongqian Xie², Chuanjiang Luo²
¹Zhejiang University ²NetEase Cloud Music
Training and fine-tuning code for the MuFun model proposed in *Advancing the Foundation Model for Music Understanding*.
The released models are MuFun-Base and several fine-tuned variants: MuFun-Instruct, MuFun-ACEStep, and MuFun-ABC.
Demo:
- http://47.121.209.64/mufun_demo_chat for MuFun-Instruct
- http://47.121.209.64/mufun_demo_acestep for MuFun-ACEStep
Some related fine-tuning datasets: ACEStep-Songs, midi-audio-abc.
Our main training code is adapted from TinyLLaVA Factory to support audio input; for reinforcement learning we modified the Hugging Face TRL library. Data processing scripts for the open datasets will be uploaded soon.
For documentation and code analysis, https://deepwiki.com/laitselec/MuFun is a good starting point.
- Inference Code
- Installation
- Data Preparation
- Finetuning
- Train from Scratch
- Reinforcement Learning
- Custom Model Architecture
- Citation
## Inference Code

For inference it is not necessary to install this repo; only a few audio processing packages such as mutagen and torchaudio are needed.
(currently supported audio formats: '.wav', '.mp3', '.flac', '.opus', '.ogg')
This series of models has approximately 9B parameters. If you load the model for inference at its native BF16 precision, 24GB of VRAM will be sufficient for songs up to 5 minutes long. (VRAM consumption increases with audio duration, as 1 second of audio corresponds to roughly 10 tokens. For instance, a 3-minute song will occupy 1800 tokens in the model's context window.)
As for quantization, there is no separate pre-quantized version available at the moment. However, you can use the bitsandbytes library to quantize the model to 4-bit or 8-bit on the fly during loading. The quality loss at 4-bit is acceptable, and with this approach, 12GB of VRAM should be enough for songs of typical length.
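For a quick sanity check before loading a long file, the clip's duration can be read with mutagen (one of the audio packages mentioned above) and converted into an approximate token count using the ~10 tokens per second figure quoted above. The helper below is only an illustrative sketch, not part of the released code.

```python
from mutagen import File  # mutagen is only used here to read the duration

def estimate_audio_tokens(path, tokens_per_second=10):
    """Rough estimate of how many context tokens an audio file will occupy.

    The ~10 tokens/second figure is the approximation quoted above; the exact
    number depends on the model's audio front end.
    """
    duration_s = File(path).info.length  # duration in seconds
    return int(duration_s * tokens_per_second)

# e.g. a 3-minute song: ~180 s * 10 ≈ 1800 tokens
# print(estimate_audio_tokens("/path/to/your/song.mp3"))
```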
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
hf_path = 'Yi3852/MuFun-Instruct' # or 'Yi3852/MuFun-Base' 'Yi3852/MuFun-ACEStep' 'Yi3852/MuFun-ABC'
tokenizer = AutoTokenizer.from_pretrained(hf_path, use_fast=False)
device='cuda'
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True, torch_dtype="bfloat16")
model.to(device)
# single audio
# during inference the audio (converted to a sequence of embeddings) is placed at the position of the <audio> tag in the prompt
aud="/path/to/your/song.mp3"
inp="\n<audio>Can you listen to this song and tell me its lyrics?"
res=model.chat(prompt=inp, audio_files=aud, tokenizer=tokenizer)
print(res)
# multiple audios
# for multiple songs, each will be placed in the corresponding <audio> tag in the prompt
aud=["/path/to/your/song1.mp3", '/path/to/your/song2.mp3']
inp="\n<audio> This is song1. <audio> This is song2. Which song do you like more? Tell me the reason."
res=model.chat(prompt=inp, audio_files=aud, tokenizer=tokenizer)
print(res)
# analyze only a specific segment of the audio using the segs parameter
# the format is [start_time, end_time] (in seconds); for multiple audios, segs can be passed like [[0,30],[60,90]] or [None,[0,30.0]]
aud="/path/to/your/song.mp3"
inp="\n<audio>How is the rhythm of this music clip?"
res=model.chat(prompt=inp, audio_files=aud, segs=[0,30.0], tokenizer=tokenizer)
print(res)
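# (illustrative, combining the two examples above) segments for multiple audios:
# pass one [start_time, end_time] pair per file; use None to keep a clip whole
aud=["/path/to/your/song1.mp3", '/path/to/your/song2.mp3']
inp="\n<audio> This is clip1. <audio> This is clip2. Which clip has a faster rhythm?"
res=model.chat(prompt=inp, audio_files=aud, segs=[[0,30],[60,90]], tokenizer=tokenizer)
print(res)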
# setting audio_files=None also works, but using it as a text-only model is not recommended
```

Quantization using bitsandbytes:

```python
from transformers import BitsAndBytesConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
hf_path = 'Yi3852/MuFun-Instruct' # or 'Yi3852/MuFun-Base' 'Yi3852/MuFun-ACEStep' 'Yi3852/MuFun-ABC'
tokenizer = AutoTokenizer.from_pretrained(hf_path, use_fast=False)
quantization_config = BitsAndBytesConfig(
load_in_4bit=True, # or load_in_8bit=True for 8-bit quantization
llm_int8_skip_modules=["lm_head", 'vision_tower', 'connector']
)
model = AutoModelForCausalLM.from_pretrained(
hf_path,
trust_remote_code=True,
torch_dtype="bfloat16",
device_map="auto",
quantization_config=quantization_config
)
```

## Installation

```bash
git clone https://github.com/laitselec/MuFun.git
cd MuFun
conda create -n mufun python=3.10 -y
conda activate mufun
pip install --upgrade pip
pip install -e .
pip install flash-attn==2.7.4.post1 --no-build-isolation # optional, otherwise change --attn_implementation to sdpa in the training scripts
```

## Data Preparation

See data_preparation_example.ipynb.
The dataset is stored in a JSON file, and each sample has the following format:
```json
{
"id": "LCeUo2tfY4LFpc5r3jiZid",
"audio": "~/gtzan/genres/blues/blues.00000.wav",
"conversations": [
{
"from": "human",
"value": "<audio>\nWhat category of music does this track fall under? (choose the genre from: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, and rock.)"
},
{
"from": "gpt",
"value": "blues"
}
]
},
```
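A dataset file in this format can be assembled with a few lines of Python. The sketch below only illustrates the JSON layout shown above, assuming (as the trailing comma suggests) that the file holds a list of such samples; the GTZAN-style path, prompt, and output filename are placeholders.

```python
import json
import uuid

# (path, label) pairs for your own audio files -- placeholder example
items = [("~/gtzan/genres/blues/blues.00000.wav", "blues")]

samples = []
for audio_path, genre in items:
    samples.append({
        "id": uuid.uuid4().hex,  # any unique string works
        "audio": audio_path,
        "conversations": [
            {"from": "human",
             "value": "<audio>\nWhat category of music does this track fall under?"},
            {"from": "gpt", "value": genre},
        ],
    })

with open("train_data.json", "w") as f:
    json.dump(samples, f, indent=2, ensure_ascii=False)
```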
## Finetuning

After modifying parameters such as the data path in scripts/finetune.sh, run:

```bash
sh scripts/finetune.sh
```

Training script parameters (many of these are inherited from transformers.TrainingArguments):
| Parameter Name | Description |
|---|---|
| deepspeed | DeepSpeed configuration file |
| data_path | Path to the training data JSON file |
| eval_data_path | Path to the evaluation data JSON file; the eval_steps parameter determines the evaluation interval |
| pretrained_model_path | Path to the initial model weights |
| per_device_train_batch_size, per_device_eval_batch_size, gradient_accumulation_steps | Adjust based on available GPU memory |
| tune_type_llm, tune_type_vision_tower, tune_type_connector | Typically full or frozen; determines whether the component is trained or frozen |
| learning_rate, warmup_ratio, lr_scheduler_type | Learning rate related parameters |
| save_steps, save_total_limit, output_dir | Checkpoint saving parameters |
## Train from Scratch

Modify the parameters in scripts/train_scratch.sh, scripts/warmup_qwen.sh and scripts/trainfull_qwen.sh.

```bash
# this will start the warmup training, where --tune_type_llm and --tune_type_vision_tower are set to frozen
sh scripts/train_scratch.sh
# after the above is done, set pretrained_model_path in trainfull_qwen.sh to the resulting checkpoint (checkpoint-xxx)
# then in train_scratch.sh comment out the bash scripts/warmup_qwen.sh line and uncomment the bash scripts/trainfull_qwen.sh line
# this will start full training where parameters of all modules are tuned
sh scripts/train_scratch.sh
```

## Reinforcement Learning

Currently GRPO is supported; for more details see tinyllava/train/train_grpo.py and trl-main/trl/trainer/grpo_trainer.py.
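For orientation, upstream TRL's GRPOTrainer accepts reward functions as plain Python callables that receive the generated completions (plus any extra dataset columns as keyword arguments) and return one score per completion. The sketch below shows that shape with a toy exact-match reward; whether train_grpo.py and scripts/grpo.sh in this repo wire rewards the same way is an assumption.

```python
def genre_exact_match_reward(completions, genre=None, **kwargs):
    """Toy reward: 1.0 if the generated text equals the reference genre, else 0.0.

    Assumes plain-text completions; `genre` stands in for an illustrative
    dataset column that the trainer would pass through as a keyword argument.
    """
    return [
        1.0 if completion.strip().lower() == reference.lower() else 0.0
        for completion, reference in zip(completions, genre)
    ]
```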
```bash
# install our modified trl library first
cd trl-main/
conda install -c conda-forge pyarrow
pip install .
cd ..
sh scripts/grpo.sh
```

## Custom Model Architecture

Our framework is suitable for training general LLaVA-style audio language models. If you want to use a different LLM, audio tower, or connector, go to tinyllava/model and tinyllava/data/template and add or modify the code accordingly.
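As a concrete starting point, the connector in a LLaVA-style model is just a small module that projects audio-tower features into the LLM's embedding space. The sketch below is a generic two-layer MLP projector; the class name, dimensions, and the way it would be registered under tinyllava/model are illustrative assumptions, not code from this repo.

```python
import torch.nn as nn

class MLPAudioConnector(nn.Module):
    """Illustrative connector: projects audio-tower features to the LLM hidden size."""

    def __init__(self, audio_hidden_size=1280, llm_hidden_size=4096):
        super().__init__()
        # dimensions are placeholders; match them to your audio tower and LLM
        self.proj = nn.Sequential(
            nn.Linear(audio_hidden_size, llm_hidden_size),
            nn.GELU(),
            nn.Linear(llm_hidden_size, llm_hidden_size),
        )

    def forward(self, audio_features):
        # audio_features: (batch, num_audio_tokens, audio_hidden_size)
        return self.proj(audio_features)
```

Whatever module you plug in, its output dimension has to match the LLM's hidden size so the projected audio embeddings can be placed at the <audio> positions in the prompt.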
## Citation

```bibtex
@misc{jiang2025advancingfoundationmodelmusic,
title={Advancing the Foundation Model for Music Understanding},
author={Yi Jiang and Wei Wang and Xianwen Guo and Huiyun Liu and Hanrui Wang and Youri Xu and Haoqi Gu and Zhongqian Xie and Chuanjiang Luo},
year={2025},
eprint={2508.01178},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2508.01178},
}
```