ILOT-code/FSR

Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning

arXiv

Enwei Tong1, Yuanchao Bai*1, Yao Zhu2, Junjun Jiang1, Xianming Liu1

1Harbin Institute of Technology, 2Zhejiang University


πŸ“’ News

  • [2026-02-05] Our paper "Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning" is now available! Code will be released soon.

πŸ“– Abstract

Vision-language models (VLMs) often produce a massive number of visual tokens, which greatly increase inference latency and memory usage. While training-free token pruning offers a practical remedy, existing methods still struggle to balance local evidence and global context under aggressive compression.

We propose Focus-Scan-Refine (FSR), a human-inspired, plug-and-play pruning framework that mimics how humans answer visual questions. FSR dynamically allocates a limited token budget through a three-stage process: Focus, Scan, and Refine.

Extensive experiments show that FSR consistently improves the accuracy-efficiency trade-off across LLaVA-1.5, LLaVA-NeXT, Qwen2.5-VL, and LLaVA-Video.

πŸš€ Methodology

FSR Framework Overview

FSR mimics the human visual cognitive process ("Focus, Scan, then Refine") to efficiently prune visual tokens:

  1. Focus (Local Evidence): Identifies critical regions by fusing visual saliency with instruction relevance, ensuring the model locks onto query-related objects.
  2. Scan (Global Context): Expands the field of view using Conditional Context Sampling (CCS) to capture diverse background information that complements the focused area.
  3. Refine (Aggregation): Instead of hard pruning, it aggregates discarded but relevant details into context anchors, preserving fine-grained textures without increasing the token budget.
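The three stages above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `fsr_prune`, the 50/50 budget split, the uniform sampling (standing in for Conditional Context Sampling), and the averaging merge (standing in for context-anchor aggregation) are all assumptions.

```python
import numpy as np

def fsr_prune(tokens, saliency, relevance, budget, focus_frac=0.5):
    """Illustrative Focus-Scan-Refine style pruning pass (not the official code).

    tokens:    (N, D) visual token embeddings
    saliency:  (N,) visual saliency scores
    relevance: (N,) instruction-relevance scores
    budget:    number of tokens to keep
    """
    n = len(tokens)

    # Focus: fuse visual saliency with instruction relevance and keep
    # the highest-scoring tokens as the query-related region.
    fused = saliency * relevance
    n_focus = int(budget * focus_frac)
    focus_idx = np.argsort(-fused)[:n_focus]

    # Scan: spend the remaining budget on non-focus tokens to capture
    # global context (uniform sampling here; the paper uses CCS).
    rest = np.setdiff1d(np.arange(n), focus_idx)
    rng = np.random.default_rng(0)
    scan_idx = rng.choice(rest, size=budget - n_focus, replace=False)

    kept_idx = np.concatenate([focus_idx, scan_idx])
    kept = tokens[kept_idx].copy()

    # Refine: instead of hard-dropping the remainder, merge each discarded
    # token into its most similar kept token (its "context anchor").
    # Simple averaging here; the actual aggregation rule is an assumption.
    dropped = np.setdiff1d(np.arange(n), kept_idx)
    for d in dropped:
        anchor = np.argmax(tokens[kept_idx] @ tokens[d])
        kept[anchor] = 0.5 * (kept[anchor] + tokens[d])

    return kept, kept_idx
```

Note that the output stays exactly `budget` tokens: Refine folds discarded detail into existing anchors rather than appending new tokens, matching the "without increasing the token budget" property described above.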

πŸ“Š Performance

We extensively evaluate FSR across diverse settings: standard benchmarks, high-resolution visual processing, advanced architectures, and video understanding.

1. πŸ† Standard Benchmarks (LLaVA-1.5)

On the widely used LLaVA-1.5-7B, FSR consistently outperforms state-of-the-art pruning methods (including HoloV, VisPruner, and CDPruner) across different pruning ratios.

Performance on LLaVA-1.5-7B

2. πŸ–ΌοΈ High-Resolution Inputs (LLaVA-NeXT)

High-resolution models often generate massive redundant tokens. FSR effectively eliminates this redundancy. Notably, on LLaVA-NeXT-13B, FSR even slightly surpasses the original unpruned model at a 77.8% reduction ratio, suggesting that FSR effectively filters out noise.

LLaVA-NeXT-7B Performance
LLaVA-NeXT-13B Performance

3. πŸš€ Advanced Architectures (Qwen2.5-VL)

We extended FSR to Qwen2.5-VL-7B, a stronger baseline with dynamic resolution support. FSR continues to lead, demonstrating strong generalization capabilities across different model architectures.

Performance on Qwen2.5-VL

4. πŸŽ₯ Video Understanding (LLaVA-Video)

FSR generalizes effectively to the temporal domain. On LLaVA-Video-7B-Qwen2, FSR preserves critical spatiotemporal cues, achieving 99.6% of the original performance while removing 60% of the tokens.

Performance on Video Benchmarks

5. ⚑️ Efficiency & Speedup

FSR delivers a superior accuracy-efficiency trade-off. By retaining only 64 visual tokens (~89% reduction), it significantly reduces memory footprint and latency while outperforming other methods.
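The ~89% figure follows from LLaVA-1.5's 576 visual tokens per image (a 24x24 patch grid); a quick check:

```python
# LLaVA-1.5 encodes each image into 576 visual tokens (24x24 patches).
# Keeping a 64-token budget yields the ~89% reduction quoted above.
original_tokens, kept_tokens = 576, 64
reduction = 1 - kept_tokens / original_tokens
print(f"{reduction:.1%}")  # prints 88.9%
```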

Efficiency Analysis

About

Official PyTorch implementation of the paper "Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning"
