DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs

🌐 Homepage • 🗃️ arXiv • 📃 PDF • 💻 Code • 🤗 Models

Zhenhailong Wang¹*, Senthil Purushwalkam²*, Caiming Xiong², Silvio Savarese², Heng Ji¹, Ran Xu²

¹University of Illinois Urbana-Champaign ²Salesforce Research

*Equal contribution

Installation

Minimal setup

This is enough to use the DyMU encoders to obtain dynamic-length visual features.

pip install -e .

VLM specific setup

Install the llava or llava-one-vision package as follows.

If using llava-1.5:

conda create -n llava python=3.10 -y
conda activate llava
cd LLaVA
pip install --upgrade pip
pip install -e ".[train]"

If using llava-one-vision:

conda create -n llava_next python=3.10 -y
conda activate llava_next
cd LLaVA-NeXT
pip install --upgrade pip
pip install -e ".[train]"

Upgrade several pip modules for compatibility with open_clip:

pip install --upgrade transformers accelerate sentencepiece deepspeed peft line-profiler
pip install torch-scatter -f https://data.pyg.org/whl/torch-2.1.2+cu121.html
pip install --upgrade timm ipdb

Install the custom open_clip

cd .. # cd to the root of the repo
pip install -e .

Threshold Finding with DToMe

Threshold finding only requires running inference on a set of images. A sufficiently large (e.g., 250K images) and diverse dataset is ideal. The thresholds are stored as a statistic averaged across all batches. The key DToMe function is batch_level_bipartite_soft_matching() in src/open_clip/tome.py.
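The idea of a batch-level threshold averaged across batches can be illustrated with a minimal sketch. This is not the code in src/open_clip/tome.py: the names `batch_threshold` and `RunningMean`, the per-image merge budget `r`, and the toy similarity scores are all assumptions made for illustration, and plain Python lists stand in for tensors.

```python
def batch_threshold(scores_per_image, r):
    """Pick a similarity threshold for one batch (hypothetical helper).

    scores_per_image: one list per image holding the best-match similarity
    of each candidate token. The threshold is chosen so that, on average,
    r tokens per image in the batch score at or above it.
    """
    flat = sorted((s for scores in scores_per_image for s in scores), reverse=True)
    k = min(r * len(scores_per_image), len(flat))  # merge budget for the batch
    if k <= 0:
        return float("inf")  # nothing is merged
    return flat[k - 1]


class RunningMean:
    """Accumulates the average of the per-batch thresholds."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += value
        self.count += 1

    @property
    def value(self):
        return self.total / self.count


# Two toy batches of two images each, three candidate tokens per image.
batches = [
    [[0.9, 0.2, 0.7], [0.8, 0.1, 0.3]],
    [[0.5, 0.4, 0.6], [0.95, 0.85, 0.05]],
]
avg = RunningMean()
for batch in batches:
    avg.update(batch_threshold(batch, r=1))
threshold = avg.value

# With a fixed threshold, the number of merged tokens depends on the image.
new_image_scores = [0.99, 0.9, 0.88, 0.1]
num_merged = sum(s >= threshold for s in new_image_scores)
```

Because the threshold is fixed after this pass, an image with many highly similar tokens merges more of them than a complex image does, which is what makes the output length dynamic.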

Preparing image dataset: prepare a JSON file in the following format, where each "image" value is a path relative to your image directory:

  [
    {
      "image": "cat1.jpg"
    },
    {
      "image": "dog2.png"
    },
  ...
  ]
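A file in this format could be generated from an image directory with a short script such as the sketch below. The function names, the set of accepted extensions, and the choice to search subdirectories recursively are assumptions, not part of the repository.

```python
import json
from pathlib import Path

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def build_image_list(image_dir):
    """Return [{"image": <path relative to image_dir>}, ...] for all images."""
    root = Path(image_dir)
    entries = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            entries.append({"image": path.relative_to(root).as_posix()})
    return entries


def write_image_list(image_dir, output_json):
    """Write the image list to output_json and return the number of entries."""
    entries = build_image_list(image_dir)
    with open(output_json, "w", encoding="utf-8") as f:
        json.dump(entries, f, indent=2)
    return len(entries)
```

For example, `write_image_list("my_images", "my_images.json")` (both paths hypothetical) writes one entry per image found under `my_images`.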

Run threshold finding using the example script:
```
  bash threshold_finding.sh
```

Inference

Download the DyMU encoder checkpoints with pre-computed thresholds from here, or run threshold finding as described in Threshold Finding with DToMe. Put the encoder checkpoints under checkpoints/threshold_checkpoints.

Dynamic length visual encoding usage examples

DyMU with Siglip encoder example:
```
python inference_siglip.py
```
DyMU with OpenAI CLIP encoder example:
```
python inference_openai_clip.py
```

VLM inference with DyMU encoders

Make sure the installation for the VLM you intend to use is complete, as described in VLM specific setup.

Llava-1.5:

Download a pretrained llava-1.5 checkpoint, e.g., https://huggingface.co/liuhaotian/llava-v1.5-7b, and put it under checkpoints/vlm_checkpoints.
Set the mm_vision_tower field in config.json to ViT-L-14-336-tome-72out so that the model uses the DyMU vision tower. (72out is only a template name; any thresholds can be used during inference.)
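The config change can be made by hand or with a small helper like the sketch below. The function name `set_vision_tower` is hypothetical, and the example path in the comment assumes the checkpoint was downloaded into a directory named llava-v1.5-7b.

```python
import json


def set_vision_tower(config_path, tower_name):
    """Set mm_vision_tower in a checkpoint's config.json.

    Returns the previous value so the change can be undone if needed.
    """
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    previous = config.get("mm_vision_tower")
    config["mm_vision_tower"] = tower_name
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return previous


# Example call (the directory name is an assumption about your local layout):
# set_vision_tower(
#     "checkpoints/vlm_checkpoints/llava-v1.5-7b/config.json",
#     "ViT-L-14-336-tome-72out",
# )
```

The same helper applies to the LLaVA-One-Vision checkpoint with the tower name given in that section.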

Run inference example:

conda activate llava
CUDA_VISIBLE_DEVICES=0 python LLaVA/inference_dymu_llava.py

Llava-One-Vision:

Download a pretrained llava-ov checkpoint, e.g., https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-si, and put it under checkpoints/vlm_checkpoints.
Set the mm_vision_tower field in config.json to ViT-SO400M-14-SigLIP-384-tome so that the model uses the DyMU vision tower.

Run inference example:

conda activate llava_next
CUDA_VISIBLE_DEVICES=0 python LLaVA-NeXT/inference_dymu_llava_ov.py

Implementation Notes

In the paper, we demonstrate efficiency gains in terms of FLOPs using Virtual Token Unmerging (VTU) within Self-Attention blocks. However, in practice, we find that directly expanding Q and K to their full lengths and leveraging highly optimized sdpa or a single matmul function leads to shorter wall clock time. Therefore, we default to this faster, simpler implementation. For completeness, we also provide an implementation that strictly follows the exact VTU attention decomposition, located in LLaVA/llava/model/language_model/llava_llama_w_exact_vtu_attn.py. This can be used as a direct drop-in replacement for LLaVA/llava/model/language_model/llava_llama.py. We encourage readers to explore further optimizations to reduce the wall clock time of the exact VTU attention. Note: When using the exact VTU implementation, please explicitly set attn_implementation to eager when loading the model via from_pretrained.
For LLaVA-One-Vision, the input to the encoder is a batch of image crops. In DyMU, since each crop may retain a variable number of tokens after each layer, sequence padding is required, which introduces additional computational overhead. We experimented with adding token packing via a custom Triton kernel, but it currently results in worse wall clock time. Thus we default to the with-padding version. We encourage further exploration of optimization strategies.
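The two notes above involve two simple operations: expanding merged tokens back to full length, and padding variable-length sequences in a batch. The sketch below illustrates both with plain Python lists. The repository works on tensors, and the names `expand_tokens`, `pad_sequences`, and `assignment` are assumptions introduced here, not identifiers from the codebase.

```python
def expand_tokens(merged, assignment):
    """Expand merged tokens back to the original sequence length.

    merged: list of token vectors after merging.
    assignment[i]: index in `merged` of the token that original position i
    was merged into. Positions merged together share one vector.
    """
    return [merged[j] for j in assignment]


def pad_sequences(sequences, pad_value=0.0):
    """Pad variable-length token sequences to a common length.

    Returns (padded, mask), where mask[b][t] is True for real tokens and
    False for padding, so attention can ignore the padded positions.
    """
    max_len = max((len(s) for s in sequences), default=0)
    dim = next((len(s[0]) for s in sequences if s), 0)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append([list(t) for t in seq] + [[pad_value] * dim for _ in range(n_pad)])
        mask.append([True] * len(seq) + [False] * n_pad)
    return padded, mask


# Four original positions were merged into two tokens.
merged = [[1.0, 0.0], [0.0, 1.0]]
assignment = [0, 0, 1, 0]
full = expand_tokens(merged, assignment)

# Two image crops that kept 1 and 3 tokens respectively.
crops = [[[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]]
padded, mask = pad_sequences(crops)
```

Padding keeps the batch rectangular at the cost of computing on padded positions, which is the overhead the second note refers to.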

Citation

@misc{wang2025dymudynamicmergingvirtual,
  title={DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs}, 
  author={Zhenhailong Wang and Senthil Purushwalkam and Caiming Xiong and Silvio Savarese and Heng Ji and Ran Xu},
  year={2025},
  eprint={2504.17040},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.17040}, 
}

Acknowledgement

The codebase builds on several excellent repositories, including open_clip, llava, and llava-next.
