- Authors: Peiyu Xie
Multimodal Large Language Models (MLLMs) have demonstrated significant potential across various multimodal tasks, including retrieval, summarization, and reasoning. However, it remains a substantial challenge for MLLMs to understand and precisely retrieve specific moments from a video, which requires fine-grained spatial and temporal understanding. To tackle this issue, we propose Caption Assisted MLLMs from Coarse to finE (CALCE). By selecting key frames from video events and converting audio into captions, CALCE achieves fine-grained segmentation of video events, providing a robust foundation for precise moment retrieval. Moreover, we explore a comprehensive multi-stage training framework, allowing our method to progressively retrieve video moments from coarse to fine.
Create the conda environment and install the dependencies:
```sh
conda create -n CALCE python=3.8
conda activate CALCE
pip install -r requirements.txt
```
We train CALCE on QVHighlights and Charades-STA.
| Dataset | Original | Preprocessed | Captions |
|---|---|---|---|
| QVHighlights | Download | Download | Download |
| Charades-STA | Download | Download | - |
Please download the original data and preprocess it with our scripts (paths need to be customized), or download the preprocessed data we provide and put it under the annotation path.
Then, download the QVHighlights caption files (or extract them with Whisper; see the sketch after the layout below), and put them in the same folder as the video folder:
```
Path/To/QVHighlights
├── videos
│   ├── xxxxxx1.mp4
│   ├── xxxxxx2.mp4
│   └── ......
└── srts
    ├── xxxxxx1.srt
    ├── xxxxxx2.srt
    └── ......
```
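If you prefer to extract the captions yourself, the sketch below shows one way to do it with the openai-whisper package. It is a minimal example, not the repo's official extraction script; the model size and paths are assumptions you should adjust to your setup.
```python
# Minimal caption-extraction sketch using openai-whisper (pip install openai-whisper).
# Not the repo's official script; model size and paths are assumptions.
import os
import whisper

def to_srt_time(t: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 00:01:02,345."""
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{int((t - int(t)) * 1000):03d}"

video_dir = "Path/To/QVHighlights/videos"
srt_dir = "Path/To/QVHighlights/srts"
os.makedirs(srt_dir, exist_ok=True)

model = whisper.load_model("base")  # larger checkpoints give better captions

for name in sorted(os.listdir(video_dir)):
    if not name.endswith(".mp4"):
        continue
    result = model.transcribe(os.path.join(video_dir, name))
    srt_path = os.path.join(srt_dir, name.replace(".mp4", ".srt"))
    with open(srt_path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], start=1):
            f.write(f"{i}\n{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n")
            f.write(f"{seg['text'].strip()}\n\n")
```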
The checkpoints of the two training stages are provided below.
| Dataset | Stage 1 | Stage 2 |
|---|---|---|
| QVHighlights | Download | Download |
| Charades-STA | Download | Download |
Please download the checkpoints and put them under the checkpoint path:
```
CALCE
└── lavis
    ├── datasets
    │   └── annotations
    │       ├── charades
    │       │   ├── train.json
    │       │   └── test.json
    │       └── qvh
    │           ├── train.json
    │           ├── val.json
    │           └── test_dummy.json
    └── results
        ├── charades
        │   ├── CALCE_Charades_60_stage1-1
        │   │   └── checkpoint.best
        │   └── CALCE_Charades_120_stage2-1
        │       └── checkpoint.best
        └── qvh
            ├── CALCE_QVH_75_stage1-1
            │   └── checkpoint.best
            └── CALCE_QVH_150_stage2-1
                └── checkpoint.best
```
We provide CALCE training and inference script examples below.
Please refer to the dataset section above to customize your data paths.
You may need to update the config files for the respective runs to fit your machine; they are currently set to run on 4 A100-80GB GPUs.
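For example, moving to fewer GPUs usually means lowering the launcher's process count and rebalancing the batch size. The snippet below is an assumption based on typical LAVIS-style launchers (`train.py`, `--cfg-path`, `batch_size_train`, and `accum_grad_iters` are not verified against this repo); check the actual run scripts and configs for the real flags and keys.
```sh
# Assumed LAVIS-style launcher; check run_scripts/CALCE/train/*.sh for the real flags.
python -m torch.distributed.run --nproc_per_node=2 train.py \
    --cfg-path lavis/projects/CALCE/train/qvh_stage1.yaml
# In the config, halve batch_size_train and double accum_grad_iters to keep the
# effective batch size when moving from 4 GPUs to 2.
```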
First, evaluate Stage 1:
```sh
sh run_scripts/CALCE/eval/qvh_stage1.sh
sh run_scripts/CALCE/eval/charades_stage1.sh
```
Then merge the Stage 1 results:
```sh
python process_data/merge_stage1_result.py --eval
```
Finally, evaluate Stage 2:
```sh
sh run_scripts/CALCE/eval/qvh_stage2.sh
sh run_scripts/CALCE/eval/charades_stage2.sh
```
Evaluation should return (on the val split):
| QVH | R1@0.5 | R1@0.7 | mIoU | mAP@0.5 | mAP@0.75 |
|---|---|---|---|---|---|
| Stage 1 | 78.00 | 64.84 | 72.89 | 70.81 | 57.73 |
| Stage 2 | 78.13 | 65.42 | 72.95 | 71.02 | 58.11 |

| Charades-STA | R1@0.5 | R1@0.7 | mIoU |
|---|---|---|---|
| Stage 1 | 68.87 | 49.87 | 59.13 |
| Stage 2 | 70.35 | 50.59 | 60.06 |
Train Stage 1:
```sh
sh run_scripts/CALCE/train/qvh_stage1.sh
sh run_scripts/CALCE/train/charades_stage1.sh
```
Get the Stage 1 results on the training split:
```sh
sh run_scripts/CALCE/infer/qvh.sh
sh run_scripts/CALCE/infer/charades.sh
```
Then merge the Stage 1 results:
```sh
python process_data/merge_stage1_result.py
```
Or download the merged files we provide above (a sketch of what the merge conceptually produces follows the layout below). You should have:
```
datasets
└── annotations
    ├── charades
    │   ├── train.json
    │   ├── test.json
    │   ├── train_stage2.json
    │   └── test_stage2.json
    └── qvh
        ├── train.json
        ├── val.json
        ├── test_dummy.json
        ├── train_stage2.json
        ├── val_stage2.json
        └── test_dummy_stage2.json
```
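For intuition, the sketch below shows what the merge conceptually does: it attaches the coarse window predicted in Stage 1 to each annotation so that Stage 2 can refine it. The field names (`qid`, `pred_window`, `stage1_window`) and file paths are illustrative assumptions, not the repo's actual schema; use `process_data/merge_stage1_result.py` for real runs.
```python
# Conceptual sketch of the Stage 1 -> Stage 2 merge. Field names and paths are
# illustrative assumptions; the real logic lives in process_data/merge_stage1_result.py.
import json

def merge_stage1(ann_path: str, pred_path: str, out_path: str) -> None:
    with open(ann_path) as f:
        anns = json.load(f)                          # original annotations, e.g. train.json
    with open(pred_path) as f:
        preds = {p["qid"]: p for p in json.load(f)}  # Stage 1 coarse predictions
    for ann in anns:
        # Attach the coarse [start, end] window so Stage 2 can refine it.
        ann["stage1_window"] = preds[ann["qid"]]["pred_window"]
    with open(out_path, "w") as f:
        json.dump(anns, f)

merge_stage1("lavis/datasets/annotations/qvh/train.json",
             "path/to/stage1_train_predictions.json",   # hypothetical path
             "lavis/datasets/annotations/qvh/train_stage2.json")
```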
Train Stage 2:
```sh
sh run_scripts/CALCE/train/qvh_stage2.sh
sh run_scripts/CALCE/train/charades_stage2.sh
```
We thank the developers of MR.Blip for their public code release.
