
MuFun


🤗 Models   |   🤗 Eval Data   |    📑 Paper    |   💜 BenchMark  

Yi Jiang1, Wei Wang2, Xianwen Guo2, Huiyun Liu2, Hanrui Wang2,

Youri Xu2, Haoqi Gu2, Zhongqian Xie2, Chuanjiang Luo2

1Zhejiang University      2NetEase Cloud Music

Training and fine-tuning code for the MuFun model proposed in Advancing the Foundation Model for Music Understanding.

Released models include MuFun-Base and several fine-tuned variants: MuFun-Instruct, MuFun-ACEStep, and MuFun-ABC.

Demo:

Some related fine-tuning datasets: ACEStep-Songs, midi-audio-abc

Our main training code is adapted from TinyLLaVA Factory to support audio input; for reinforcement learning we modify the Hugging Face TRL library. Data processing scripts for the open datasets will be uploaded soon.

For documentation and code analysis, https://deepwiki.com/laitselec/MuFun is a good resource.

Contents

Inference Code

For inference it is not necessary to install this repo; only some audio processing packages such as mutagen and torchaudio are needed.

(currently supported audio formats: '.wav', '.mp3', '.flac', '.opus', '.ogg')

This series of models has approximately 9B parameters. If you load the model for inference at its native BF16 precision, 24GB of VRAM will be sufficient for songs up to 5 minutes long. (VRAM consumption increases with audio duration, as 1 second of audio corresponds to roughly 10 tokens. For instance, a 3-minute song will occupy 1800 tokens in the model's context window.)
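As a rough sanity check, the rule of thumb above can be turned into a quick estimate (an illustrative sketch only; the ~10 tokens per second figure is the approximation quoted above):

# rough estimate of how many context tokens a song will occupy,
# based on the ~10 tokens per second of audio approximation above
TOKENS_PER_SECOND = 10

def estimate_audio_tokens(duration_seconds: float) -> int:
    return int(duration_seconds * TOKENS_PER_SECOND)

print(estimate_audio_tokens(3 * 60))  # 3-minute song -> ~1800 tokens
print(estimate_audio_tokens(5 * 60))  # 5-minute song -> ~3000 tokens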

As for quantization, there is no separate pre-quantized version available at the moment. However, you can use the bitsandbytes library to quantize the model to 4-bit or 8-bit on the fly during loading. The quality loss at 4-bit is acceptable, and with this approach, 12GB of VRAM should be enough for songs of typical length.

from transformers import AutoTokenizer, AutoModelForCausalLM
hf_path = 'Yi3852/MuFun-Instruct' # or 'Yi3852/MuFun-Base' 'Yi3852/MuFun-ACEStep' 'Yi3852/MuFun-ABC'
tokenizer = AutoTokenizer.from_pretrained(hf_path, use_fast=False)
device='cuda'
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True, torch_dtype="bfloat16")
model.to(device)

# single audio
# during inference the audio (converted to a sequence of embeddings) will be placed at the position of the <audio> tag in the prompt
aud="/path/to/your/song.mp3"
inp="\n<audio>Can you listen to this song and tell me its lyrics?" 
res=model.chat(prompt=inp, audio_files=aud, tokenizer=tokenizer)
print(res)

# multiple audios
# for multiple songs each will be placed in the corresponding <audio> tag in the prompt
aud=["/path/to/your/song1.mp3", '/path/to/your/song2.mp3']
inp="\n<audio> This is song1. <audio> This is song2. Which song do you like more? Tell me the reason."
res=model.chat(prompt=inp, audio_files=aud, tokenizer=tokenizer)
print(res)

# analyze only a specific segment of audio using the segs parameter
# format is [start_time, end_time] (in seconds); for multiple audios segs can be passed like [[0,30],[60,90]] or [None,[0,30.0]]
aud="/path/to/your/song.mp3"
inp="\n<audio>How is the rhythm of this music clip?"
res=model.chat(prompt=inp, audio_files=aud, segs=[0,30.0], tokenizer=tokenizer)
print(res)

# setting audio_files=None will work, but using MuFun as a text-only model is not recommended
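The two features above can also be combined, for example to compare only a specific segment from each of two songs (an illustrative sketch; the paths are placeholders and segs follows the multi-audio format described in the comments above):

# compare the first 30 seconds of song1 with the 60-90s section of song2
aud = ["/path/to/your/song1.mp3", "/path/to/your/song2.mp3"]
inp = "\n<audio> This is clip1. <audio> This is clip2. Which clip sounds more energetic?"
res = model.chat(prompt=inp, audio_files=aud, segs=[[0, 30.0], [60.0, 90.0]], tokenizer=tokenizer)
print(res)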

Quantization using bitsandbytes:

from transformers import BitsAndBytesConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
hf_path = 'Yi3852/MuFun-Instruct' # or 'Yi3852/MuFun-Base' 'Yi3852/MuFun-ACEStep' 'Yi3852/MuFun-ABC'
tokenizer = AutoTokenizer.from_pretrained(hf_path, use_fast=False)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,  # or load_in_8bit=True for 8-bit quantization
    llm_int8_skip_modules=["lm_head", 'vision_tower', 'connector']
)
model = AutoModelForCausalLM.from_pretrained(
    hf_path,
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
    quantization_config=quantization_config
)
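The quantized model is then used through the same chat interface as above (note that with device_map="auto" there is no need to call model.to(device); the weights are already placed on the GPU):

# inference with the quantized model works the same way as before
aud = "/path/to/your/song.mp3"
inp = "\n<audio>Describe the mood and instrumentation of this song."
res = model.chat(prompt=inp, audio_files=aud, tokenizer=tokenizer)
print(res)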

Installation

git clone https://github.com/laitselec/MuFun.git
cd MuFun

conda create -n mufun python=3.10 -y
conda activate mufun
pip install --upgrade pip

pip install -e .
pip install flash-attn==2.7.4.post1 --no-build-isolation # optional, otherwise change --attn_implementation to sdpa in train scripts

Data Preparation

See data_preparation_example.ipynb. The dataset is stored in a JSON file, and the format for each sample is as follows:

{
    "id": "LCeUo2tfY4LFpc5r3jiZid",
    "audio": "~/gtzan/genres/blues/blues.00000.wav",
    "conversations": [
        {
            "from": "human",
            "value": "<audio>\nWhat category of music does this track fall under? (choose the genre from: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, and rock.)"
        },
        {
            "from": "gpt",
            "value": "blues"
        }
    ]
},
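A small helper along these lines (a sketch; the file name, id and labels are placeholders) can be used to produce a dataset JSON file in this format:

import json

samples = [
    {
        "id": "example-0001",  # any unique identifier
        "audio": "~/gtzan/genres/blues/blues.00000.wav",
        "conversations": [
            {"from": "human", "value": "<audio>\nWhat genre is this track?"},
            {"from": "gpt", "value": "blues"},
        ],
    },
    # ... more samples
]

# the training data is a single JSON file containing the list of samples
with open("train_data.json", "w") as f:
    json.dump(samples, f, ensure_ascii=False, indent=4)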

Finetuning

After modifying parameters such as the data path in scripts/finetune.sh, run:

sh scripts/finetune.sh

Training script parameters (many of these are inherited from transformers.TrainingArguments):

| Parameter Name | Description |
| --- | --- |
| deepspeed | DeepSpeed configuration file |
| data_path | Path to the training data JSON file |
| eval_data_path | Path to the evaluation data JSON file; the eval_steps parameter determines the evaluation interval |
| pretrained_model_path | Path to the initial model weights |
| per_device_train_batch_size, per_device_eval_batch_size, gradient_accumulation_steps | Adjust based on available GPU memory |
| tune_type_llm, tune_type_vision_tower, tune_type_connector | Typically full or frozen; determines whether the component is trained or frozen |
| learning_rate, warmup_ratio, lr_scheduler_type | Learning rate related parameters |
| save_steps, save_total_limit, output_dir | Checkpoint saving parameters |

Train from Scratch

Modify the parameters in scripts/train_scratch.sh, scripts/warmup_qwen.sh and scripts/trainfull_qwen.sh.

# this will start the warmup training, where --tune_type_llm and --tune_type_vision_tower are set to frozen
sh scripts/train_scratch.sh

# after the above is done, set pretrained_model_path in trainfull_qwen.sh to the resulting checkpoint (checkpoint-xxx)
# then in train_scratch.sh comment out the bash scripts/warmup_qwen.sh line and uncomment the bash scripts/trainfull_qwen.sh line
# this will start full training, where the parameters of all modules are tuned
sh scripts/train_scratch.sh

Reinforcement Learning

Currently we support GRPO; for more details see tinyllava/train/train_grpo.py and trl-main/trl/trainer/grpo_trainer.py.

# install our modified trl library first
cd trl-main/
conda install -c conda-forge pyarrow
pip install .
cd ..

sh scripts/grpo.sh
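The reward setup lives in the files above. As a rough illustration, upstream TRL's GRPOTrainer accepts reward functions that score each sampled completion and return one float per completion; whether the modified trainer in trl-main/ keeps this exact interface is an assumption, so check trl-main/trl/trainer/grpo_trainer.py before relying on it:

# illustrative reward function in the upstream TRL GRPOTrainer style;
# the interface of the modified trainer in trl-main/ may differ (assumption)
def exact_match_reward(completions, **kwargs):
    # per-sample reference answers passed through the dataset columns (hypothetical column name)
    answers = kwargs.get("answer", [""] * len(completions))
    return [1.0 if str(c).strip() == str(a).strip() else 0.0 for c, a in zip(completions, answers)]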

Custom Model Architecture

Our framework is suitable for training general LLaVA-style audio language models. If you want to use a different type of LLM, audio tower, or connector, go to tinyllava/model and tinyllava/data/template and add or modify the code accordingly.
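For example, a connector is typically just a module that projects audio-tower features into the LLM embedding space. A minimal sketch (not the repo's actual code; the class name and dimensions are hypothetical, and the real interface should be copied from the existing connectors in tinyllava/model):

import torch.nn as nn

# minimal illustrative connector: project audio-tower features to the LLM hidden size
class LinearConnector(nn.Module):
    def __init__(self, audio_hidden_size: int, llm_hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(audio_hidden_size, llm_hidden_size)

    def forward(self, audio_features):
        # audio_features: (batch, frames, audio_hidden_size)
        return self.proj(audio_features)  # (batch, frames, llm_hidden_size)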

Citation

@misc{jiang2025advancingfoundationmodelmusic,
      title={Advancing the Foundation Model for Music Understanding}, 
      author={Yi Jiang and Wei Wang and Xianwen Guo and Huiyun Liu and Hanrui Wang and Youri Xu and Haoqi Gu and Zhongqian Xie and Chuanjiang Luo},
      year={2025},
      eprint={2508.01178},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2508.01178}, 
}
