A curated list of Multimodal Large Language Model (MLLM) tuning resources, aligned with our work:
Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model
Multimodal Large Language Models (MLLMs) integrate visual and linguistic reasoning to address complex tasks such as image captioning and visual question answering. While MLLMs are remarkably versatile, their performance on specialized applications remains limited, and tuning them for downstream tasks faces two key challenges: Task-Expert Specialization, where distribution shifts between pre-training and target datasets constrain target performance, and Open-World Stabilization, where catastrophic forgetting erases the model's general knowledge. In this work, we systematically review recent advances in MLLM tuning methodologies, classifying them into three paradigms: (I) Selective Tuning, (II) Additive Tuning, and (III) Reparameterization Tuning. We further benchmark these tuning strategies across popular MLLM architectures and diverse downstream tasks to establish standardized evaluation analysis and systematic tuning principles. Finally, we highlight several open challenges in this domain and propose future research directions.
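To make the three paradigms above concrete, here is a minimal, self-contained PyTorch sketch (not taken from any of the listed papers; class names such as `AdapterLinear` and `LoRALinear`, and hyperparameters such as `rank` and `bottleneck`, are illustrative assumptions) showing how each paradigm modifies a generic `nn.Linear` backbone layer.

```python
import torch
import torch.nn as nn

# (I) Selective Tuning: freeze the backbone and unfreeze only a chosen subset of
# parameters. A simple name-based mask stands in for importance-based selection
# (e.g., gradient- or magnitude-based criteria used by the papers below).
def selective_tuning(model: nn.Module, trainable_keys=("bias",)):
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in trainable_keys)
    return model

# (II) Additive Tuning: keep the backbone frozen and insert a small trainable
# module (a bottleneck adapter) whose output is added to the frozen layer's.
class AdapterLinear(nn.Module):
    def __init__(self, base: nn.Linear, bottleneck: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.adapter = nn.Sequential(
            nn.Linear(base.out_features, bottleneck),
            nn.ReLU(),
            nn.Linear(bottleneck, base.out_features),
        )

    def forward(self, x):
        h = self.base(x)
        return h + self.adapter(h)

# (III) Reparameterization Tuning: express the weight update as a low-rank
# product (LoRA-style), trained while the original weight stays frozen.
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling


if __name__ == "__main__":
    layer = nn.Linear(64, 64)
    x = torch.randn(2, 64)
    print(AdapterLinear(layer)(x).shape)  # torch.Size([2, 64])
    print(LoRALinear(layer)(x).shape)     # torch.Size([2, 64])
```

The papers collected in the tables below refine these skeletons with importance-based parameter selection, model merging, prompt tokens, and regularized or decomposed low-rank updates.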
| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024.10 | AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models | ICLR'25 | link | link |
| 2023.12 | Sparse is Enough in Fine-tuning Pre-trained Large Language Models | ICML'24 | link | link |
| 2023.12 | Gradient-based Parameter Selection for Efficient Fine-Tuning | CVPR'24 | link | link |
| 2023.11 | Unified Low-Resource Sequence Labeling by Sample-Aware Dynamic Sparse Finetuning | EMNLP'23 | link | link |
| 2023.08 | Overcoming Generic Knowledge Loss with Selective Parameter Update | CVPR'24 | link | link |
| 2023.06 | LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation | ICML'23 | link | link |
| 2022.10 | ROSE: Robust Selective Fine-tuning for Pre-trained Language Models | IJCAI'22 | link | link |
| 2022.05 | Parameter-Efficient Sparsity for Large Language Models Fine-Tuning | IJCAI'22 | link | link |
| 2021.10 | Composable Sparse Fine-Tuning for Cross-Lingual Transfer | ACL'22 | link | link |
| 2021.09 | Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning | EMNLP'21 | link | link |
| 2015.06 | Learning both Weights and Connections for Efficient Neural Networks | NeurIPS'15 | link | - |
| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024.12 | Revisiting Weight Averaging for Model Merging | arXiv'24 | link | link |
| 2024.10 | Parameter Competition Balancing for Model Merging | NeurIPS'24 | link | link |
| 2024.06 | Twin-Merging: Dynamic Integration of Modular Expertise in Model Merging | NeurIPS'24 | link | link |
| 2024.05 | EMR-Merging: Tuning-Free High-Performance Model Merging | NeurIPS'24 | link | link |
| 2024.02 | Model Tailor: Mitigating Catastrophic Forgetting in Multi-modal Large Language Models | ICML'24 | link | link |
| 2023.11 | Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch | ICML'24 | link | link |
| 2023.06 | TIES-Merging: Resolving Interference When Merging Models | NeurIPS'23 | link | link |
| 2021.11 | Merging Models with Fisher-Weighted Averaging | NeurIPS'22 | link | link |
| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024.04 | Conditional Prototype Rectification Prompt Learning | TCSVT'25 | link | link |
| 2023.11 | Meta-Adapter: An Online Few-shot Learner for Vision-Language Model | NeurIPS'23 | link | link |
| 2023.09 | GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph | NeurIPS'23 | link | link |
| 2023.04 | Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement | ICCV'23 | link | link |
| 2023.03 | Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens | CVPR'23 | link | link |
| 2023.02 | Side Adapter Network for Open-Vocabulary Semantic Segmentation | CVPR'23 | link | link |
| 2022.11 | Task Residual for Tuning Vision-Language Models | CVPR'23 | link | link |
| 2022.06 | LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning | NeurIPS'22 | link | link |
| 2021.11 | Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling | ECCV'22 | link | link |
| 2021.10 | CLIP-Adapter: Better Vision-Language Models with Feature Adapters | IJCV'23 | link | link |
| 2019.02 | Parameter-Efficient Transfer Learning for NLP | ICML'19 | link | link |
| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2024.03 | PromptKD: Unsupervised Prompt Distillation for Vision-Language Models | CVPR'24 | link | link |
| 2024.03 | Domain-Agnostic Mutual Prompting for Unsupervised Domain Adaptation | CVPR'24 | link | link |
| 2024.01 | Learning to Prompt with Text Only Supervision for Vision-Language Models | AAAI'25 | link | link |
| 2023.11 | ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models | CVPR'24 | link | link |
| 2023.09 | DePT: Decoupled Prompt Tuning | CVPR'24 | link | link |
| 2023.09 | Distribution-Aware Prompt Tuning for Vision-Language Models | ICCV'23 | link | link |
| 2023.08 | Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models | ICCV'23 | link | - |
| 2023.07 | Self-regulating Prompts: Foundational Model Adaptation without Forgetting | ICCV'23 | link | link |
| 2023.03 | Visual-Language Prompt Tuning with Knowledge-Guided Context Optimization | CVPR'23 | link | link |
| 2022.10 | MaPLe: Multi-modal Prompt Learning | CVPR'23 | link | link |
| 2022.10 | Prompt Learning with Optimal Transport for Vision-Language Models | ICLR'23 | link | link |
| 2022.06 | DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations | NeurIPS'22 | link | link |
| 2022.05 | Prompt-aligned Gradient for Prompt Tuning | ICCV'23 | link | link |
| 2022.03 | Visual Prompt Tuning | ECCV'22 | link | link |
| 2022.03 | Conditional Prompt Learning for Vision-Language Models | CVPR'22 | link | link |
| 2021.09 | Learning to Prompt for Vision-Language Models | IJCV'22 | link | link |
| 2021.01 | Prefix-Tuning: Optimizing Continuous Prompts for Generation | ACL'21 | link | link |
| 2020.10 | AUTOPROMPT: Eliciting Knowledge from Language Models with Automatically Generated Prompts | EMNLP'20 | link | link |
| 2019.11 | How Can We Know What Language Models Know? | TACL'20 | link | link |
| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2025.02 | REMEDY: Recipe Merging Dynamics in Large Vision-Language Models | ICLR'25 | link | - |
| 2024.12 | LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation | ICCV'25 | link | link |
| 2024.08 | TeamLoRA: Boosting Low-Rank Adaptation with Expert Collaboration and Competition | arXiv'24 | link | link |
| 2024.06 | Twin-Merging: Dynamic Integration of Modular Expertise in Model Merging | NeurIPS'24 | link | link |
| 2024.06 | Mixture-of-Subspaces in Low-Rank Adaptation | EMNLP'24 | link | link |
| 2024.06 | ShareLoRA: Parameter Efficient and Robust Large Language Model Fine-tuning via Shared Low-Rank Adaptation | arXiv'24 | link | link |
| 2024.05 | Parameter-Efficient Fine-Tuning with Discrete Fourier Transform | ICML'24 | link | link |
| 2024.03 | MTLoRA: Low-Rank Adaptation Approach for Efficient Multi-Task Learning | CVPR'24 | link | link |
| 2024.02 | Multimodal Instruction Tuning with Conditional Mixture of LoRA | ACL'24 | link | link |
| 2023.12 | LoRAMoE: Alleviating World Knowledge Forgetting in Large Language Models via MoE-Style Plugin | ACL'24 | link | link |
| 2023.10 | VeRA: Vector-based Random Matrix Adaptation | ICLR'24 | link | link |
| 2023.07 | LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition | COLM'24 | link | link |
| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 2025.03 | LoRASculpt: Sculpting LoRA for Harmonizing General and Specialized Knowledge in Multimodal Large Language Models | CVPR'25 | link | link |
| 2024.10 | Controlled Low-Rank Adaptation with Subspace Regularization for Continued Training on Large Language Models | arXiv'24 | link | - |
| 2024.07 | Learn to Preserve and Diversify: Parameter-Efficient Group with Orthogonal Regularization for Domain Generalization | ECCV'24 | link | link |
| 2024.06 | CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning | NeurIPS'24 | link | link |
| 2024.06 | MiLoRA: Harnessing Minor Singular Components for Parameter-Efficient LLM Finetuning | NAACL'25 | link | link |
| 2024.04 | PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models | NeurIPS'24 | link | link |
| 2024.03 | LoRA Meets Dropout under a Unified Framework | ACL'24 | link | - |
| 2024.03 | BiLoRA: A Bi-level Optimization Framework for Overfitting-Resilient Low-Rank Adaptation of Large Pre-trained Models | arXiv'24 | link | - |
| 2024.02 | MELoRA: Mini-Ensemble Low-Rank Adapters for Parameter-Efficient Fine-Tuning | ACL'24 | link | link |
| 2024.02 | PRoLoRA: Partial Rotation Empowers More Parameter-Efficient LoRA | ACL'24 | link | link |
| 2024.02 | LoRA+: Efficient Low Rank Adaptation of Large Models | ICML'24 | link | link |
| 2024.02 | DoRA: Weight-Decomposed Low-Rank Adaptation | ICML'24 | link | link |
| 2024.02 | Flora: Low-Rank Adapters Are Secretly Gradient Compressors | ICML'24 | link | link |
| 2023.08 | Bayesian Low-Rank Adaptation for Large Language Models | ICLR'24 | link | link |
This repository is currently maintained by Wenke Huang.
If you have any questions, concerns, or suggestions regarding the contents of this repository or the resources shared here, feel free to reach out! I'm more than happy to assist with any inquiries or help you navigate the materials.
Please don't hesitate to send an email to me at [email protected] or reach out via WeChat.
If you find this repository helpful for your research, we would greatly appreciate it if you could cite our papers.
@misc{MLLMTuning_arXiv25,
title={Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model},
author={Wenke Huang and Jian Liang and Xianda Guo and Yiyang Fang and Guancheng Wan and Xuankun Rong and Chi Wen and Zekun Shi and Qingyun Li and Didi Zhu and Yanbiao Ma and Ke Liang and Bin Yang and He Li and Jiawei Shao and Mang Ye and Bo Du},
year={2025},
eprint={2503.04543},
archivePrefix={arXiv},
primaryClass={cs.CR}
}
@inproceedings{LiangLoRASculpt_CVPR2025,
author = {Liang, Jian and Huang, Wenke and Wan, Guancheng and Yang, Qu and Ye, Mang},
title = {LoRASculpt: Sculpting LoRA for Harmonizing General and Specialized Knowledge in Multimodal Large Language Models},
booktitle = {CVPR},
year = {2025},
}
@inproceedings{FangSEPM_ICML2025,
title = {Catch Your Emotion: Sharpening Emotion Perception in Multimodal Large Language Models},
author = {Fang, Yiyang and Liang, Jian and Huang, Wenke and Li, He and Su, Kehua and Ye, Mang},
booktitle = {ICML},
year = {2025},
}
@misc{ye2025surveysafetylargevisionlanguage,
title={A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations},
author={Mang Ye and Xuankun Rong and Wenke Huang and Bo Du and Nenghai Yu and Dacheng Tao},
year={2025},
eprint={2502.14881},
archivePrefix={arXiv},
primaryClass={cs.CR}
}

[1] LoRASculpt: Sculpting LoRA for Harmonizing General and Specialized Knowledge in Multimodal Large Language Models - CVPR 2025 [Link][Code]
[2] Catch Your Emotion: Sharpening Emotion Perception in Multimodal Large Language Models - ICML 2025 [Link][Code]
[3] A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations - arXiv 2025 [Link][Code]
You Only Live Once.
I hope that all players have fun.
