
VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs

📄 Paper | 🤗 VisionSelector-Qwen2.5-VL-3B | 🤗 VisionSelector-Qwen2.5-VL-7B | 🤗 VisionSelector-LLaVA-OV-1.5-8B

Jiaying Zhu1, Yurui Zhu*2, Xin Lu1, Wenrui Yan2, Dong Li1, Kunlin Liu2, Xueyang Fu*1, Zheng-Jun Zha1

1University of Science and Technology of China, 2ZTE Corporation

*Equal Advising

📝 To-do list

  • Release training code
  • Release evaluation code
  • Release comparison method code
  • Release model weights
  • Release inference code

👀 Overview

We introduce VisionSelector, an end-to-end learnable framework that recasts visual token compression as an optimization-driven decision process. VisionSelector integrates seamlessly into existing MLLMs without modifying the backbone, enabling adaptive compression with strong efficiency gains.

Our key technical innovations include:

  • A Differentiable Top-K Selection Mechanism that ensures end-to-end gradient flow while maintaining full compatibility with high-performance acceleration kernels like FlashAttention.
  • A Curriculum Annealing Strategy with a composite loss, which effectively bridges the performance gap between soft training selection and hard inference selection.
  • A backbone-decoupled Learnable Importance Scorer (LIS) that enables a model trained at a single compression rate to generalize robustly to different compression budgets at inference.

VisionSelector is highly efficient, requiring only 12.85M trainable parameters. It achieves substantial performance-efficiency advancements: a 12.14% performance gain at 10% token retention, and a 1.73× prefill acceleration (with 86.08% memory reduction) at 20% retention. VisionSelector consistently outperforms state-of-the-art baselines across 13 image and video understanding benchmarks.
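As a rough illustration of the differentiable Top-K idea above, the sketch below uses a temperature-controlled soft relaxation with a straight-through estimator: selection is hard in the forward pass, gradients reach the scorer through the soft path, and the temperature can be annealed toward hard selection over training. This is a minimal conceptual sketch, not the exact mechanism or loss used in the paper:

import torch

def soft_topk_mask(scores, k, tau):
    # scores: (num_tokens,) importance logits from a learnable scorer.
    # Returns a 0/1 keep-mask over tokens that is hard in the forward pass
    # but backpropagates through a sigmoid relaxation (straight-through).
    topk = torch.topk(scores, k)
    threshold = topk.values[-1]                       # k-th largest score
    soft = torch.sigmoid((scores - threshold) / tau)  # soft membership
    hard = torch.zeros_like(scores)
    hard[topk.indices] = 1.0                          # exactly k tokens kept
    return hard + (soft - soft.detach())              # forward = hard, backward = soft

# Toy usage: anneal tau so training selection approaches hard inference selection.
scores = torch.randn(100, requires_grad=True)
for tau in (1.0, 0.5, 0.1):
    mask = soft_topk_mask(scores, k=10, tau=tau)
    (mask * scores).sum().backward()   # gradients flow to the scorer via the soft path
    scores.grad = None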

Figure: overview of our approach.

💾 Dataset Preparation and Configuration

To reproduce our results and train the VisionSelector module, you need to download and configure the following datasets from the nyu-visionx/Cambrian-10M repository.

1. Dataset Downloading

Please download the following datasets and annotations and place them under the datasets/ folder.

Dataset    Size     Download Link (Hugging Face)
OCR_VQA    ~80K     ocr_vqa.tar.gz
ChartQA    ~28K     chartqa.tar.gz
TextVQA    ~21K     textvqa.tar.gz
COCO       ~364K    coco.tar.gz

Annotation      Download Link (Hugging Face)
Cambrian737K    Cambrian737k.jsonl
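
If you prefer to fetch these files programmatically, the sketch below uses huggingface_hub. The filenames are taken from the links above, but their exact location inside the nyu-visionx/Cambrian-10M repository may differ (e.g., they may sit in a subfolder), so verify the repo layout first and extract the .tar.gz archives into datasets/ afterwards:

from huggingface_hub import hf_hub_download

# Filenames are assumed to live at the repo root of nyu-visionx/Cambrian-10M;
# adjust the paths if the repository stores them in subfolders.
for fname in ("ocr_vqa.tar.gz", "chartqa.tar.gz", "textvqa.tar.gz",
              "coco.tar.gz", "Cambrian737k.jsonl"):
    hf_hub_download(repo_id="nyu-visionx/Cambrian-10M", repo_type="dataset",
                    filename=fname, local_dir="datasets")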

Generating Dataset Annotations:

The large annotation file (Cambrian737k.jsonl) needs to be split into individual JSONL files for each target dataset (OCR_VQA, ChartQA, TextVQA, COCO) to match the required directory structure.

Execute the following commands from the project root to perform this filtering and splitting using the filter_json.py script:

cd VisionSelector
python datasets/filter_json.py
python datasets/sample_merge_json_llavaov.py # for llava-ov-1.5
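
Conceptually, the splitting step filters Cambrian737k.jsonl into one annotation file per source dataset. The sketch below shows this idea only; the field used to identify the source dataset (here `image`) is an assumption, and datasets/filter_json.py remains the authoritative implementation:

import json
from pathlib import Path

# Hypothetical sketch of splitting the merged annotation file per dataset.
SPLITS = {
    "ocr_vqa": "ocr_vqa_cambrian.jsonl",
    "chartqa": "chartqa_cambrian.jsonl",
    "textvqa": "textvqa_cambrian.jsonl",
    "coco": "coco_cambrian.jsonl",
}

root = Path("datasets")
writers = {prefix: (root / out).open("w") for prefix, out in SPLITS.items()}
with (root / "Cambrian737k.jsonl").open() as f:
    for line in f:
        image = json.loads(line).get("image", "")   # assumed field name
        for prefix, writer in writers.items():
            if image.startswith(prefix):
                writer.write(line)
                break
for writer in writers.values():
    writer.close()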

Required Directory Structure:

After downloading and extracting, your project directory should contain the following structure:

VisionSelector/
└── datasets/
    ├── ocr_vqa/
    ├── ocr_vqa_cambrian.jsonl
    ├── chartqa/
    ├── chartqa_cambrian.jsonl
    ├── textvqa/
    ├── textvqa_cambrian.jsonl
    ├── coco/
    ├── coco_cambrian.jsonl
    └── textvqa_ocrvqa_cambrian.jsonl

2. Dataset config for training

The data paths and annotation paths for training are defined in VisionSelector/qwen-vl-finetune/qwenvl/data/__init__.py.

DATASET_NAME = {
    "annotation_path": "/path/to/annotations.json",
    "data_path": "/path/to/image/data",
}
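
For example, an entry for the OCR_VQA split prepared above could look like the following; the paths are illustrative and should point at wherever you placed the extracted data and the generated JSONL files:

OCR_VQA = {
    "annotation_path": "datasets/ocr_vqa_cambrian.jsonl",
    "data_path": "datasets/ocr_vqa",
}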

🔧 Installation - Qwen2.5VL

1. Environment and Basic Dependencies

We recommend setting up a dedicated Conda environment.

conda create -n visionselector python=3.10
conda activate visionselector

2. Package Installation

Navigate to your project root directory (VisionSelector) and install the required packages.

cd VisionSelector
pip install qwen-vl-utils[decord]
pip install -r requirements.txt
pip install transformers==4.50.0

Recommended Package Versions:

For optimal compatibility, we recommend the following package versions:

Package        Recommended Version
torch          2.6.0
torchvision    0.21.0
transformers   4.50.0
deepspeed      0.16.4
flash_attn     2.7.4.post1
triton         3.2.0
accelerate     1.4.0
torchcodec     0.2

🚀 Quick Start - Qwen2.5VL

1. Training

To train VisionSelector (e.g., integrated with Qwen2.5-VL-7B) to select the most informative visual tokens, run the corresponding script in the qwen-vl-finetune directory.

cd VisionSelector/qwen-vl-finetune
bash scripts/sft_7b.sh # for VisionSelector-Qwen2.5-VL-7B
bash scripts/sft_3b.sh # for VisionSelector-Qwen2.5-VL-3B
bash scripts/sft_dynamic.sh # Reimplementation of Dynamic-LLaVA's image predictor on Qwen2.5-VL (Dynamic-Qwen)

2. Evaluation, Inference and Visualizations

We utilize lmms-eval for comprehensive benchmarking.

Setup lmms-eval

First, set up the evaluation environment:

cd VisionSelector/lmms-eval
pip install -e .
cd ../qwen-evaluation

Run Evaluations

Use the provided scripts to evaluate VisionSelector against comparison methods and generate visualizations.

Command                          Purpose
bash run_token_compression.sh    Evaluate the original model and the comparison token compression methods
bash run_selector.sh             Evaluate the VisionSelector method
bash run_dynamic_qwen.sh         Evaluate the Dynamic-Qwen method

To record max GPU memory, prefill time, latency, and the number of visual tokens, set the EVAL_TIME environment variable when launching an evaluation script, for example:

EVAL_TIME=True bash run_selector.sh

Run Inference

You can run inference with the comparison token pruning methods (VisionZip, DivPrune) and with VisionSelector using this script:

bash run_inference.sh
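
For reference, the standard Qwen2.5-VL generation API installed above (transformers==4.50.0 plus qwen-vl-utils) looks like the sketch below. Whether the released VisionSelector checkpoints load directly through these stock classes, or require this repo's patched modeling code, is an assumption to verify; the model ID shown is the base model and the image path is a placeholder:

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # swap in a VisionSelector checkpoint if compatible
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "demo.jpg"},            # placeholder image path
    {"type": "text", "text": "Describe this image."},
]}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                             skip_special_tokens=True)[0])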

Visualizations

To generate activation heatmaps and token pruning visualizations:

bash run_visual.sh # for VisionZip, DivPrune and VisionSelector

🔧 Installation - LLaVA-OV-1.5

1. Environment, Basic Dependencies and Package Installation

To ensure compatibility with the LLaVA-OneVision-1.5 framework, activate the visionselector environment created above and then adjust the transformers version as follows:

conda activate visionselector
pip uninstall transformers
pip install transformers==4.53.1

🚀 Quick Start - LLaVA-OV-1.5

1. Training

To train VisionSelector (e.g., integrated with LLaVA-OneVision-1.5) to select the most informative visual tokens, run the script in the llava-ov-15 directory.

cd VisionSelector/llava-ov-15
bash scripts/finetune_selector_8b.sh # for LLaVA-OneVision-1.5

2. Evaluation

cd llava-ov-15
bash run_ov_token_compression.sh # for the original model and comparison methods
bash run_ov_selector.sh # for VisionSelector

3. Inference

bash run_ov_inference.sh

Citation

If this work contributes to your research, please cite:

@misc{zhu2025visionselectorendtoendlearnablevisual,
      title={VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs}, 
      author={Jiaying Zhu and Yurui Zhu and Xin Lu and Wenrui Yan and Dong Li and Kunlin Liu and Xueyang Fu and Zheng-Jun Zha},
      year={2025},
      eprint={2510.16598},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.16598}, 
}

Acknowledgement

This work is built upon the foundational contributions of several excellent open-source projects. We express our sincere gratitude to the developers of the following resources, which were instrumental in the development and evaluation of VisionSelector: