- Authors: Peiyu Xie
Multimodal Large Language Models (MLLMs) have demonstrated significant potential across various multimodal tasks, including retrieval, summarization, and reasoning. However, it remains a substantial challenge for MLLMs to understand and precisely retrieve specific moments from a video, which requires fine-grained spatial and temporal understanding. To tackle this issue, we propose Caption Assisted MLLMs from Coarse to finE (CALCE). By selecting key frames from video events and converting audio into captions, CALCE achieves fine-grained segmentation of video events, providing a robust foundation for precise moment retrieval. Moreover, we explore a comprehensive multi-stage training framework, allowing our method to progressively retrieve video moments from coarse to fine.
Create the conda environment and install the dependencies:
```sh
conda create -n CALCE python=3.8
conda activate CALCE
pip install -r requirements.txt
```
We train CALCE on QVHighlights and Charades-STA.
| Dataset | Original | Preprocessed | Captions |
|---|---|---|---|
| QVHighlights | Download | Download | Download |
| Charades-STA | Download | Download | - |
Please download the original data and preprocess it with our scripts (paths need to be customized), or download the preprocessed data we provide and put it under the annotation path.
Then, download the QVHighlights caption files (or extract them with Whisper; see the sketch after the layout below), and put them in the same folder as the video folder:
```
Path/To/QVHighlights
├── videos
│   ├── xxxxxx1.mp4
│   ├── xxxxxx2.mp4
│   └── ......
└── srts
    ├── xxxxxx1.srt
    ├── xxxxxx2.srt
    └── ......
```
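If you prefer to extract the captions yourself, the sketch below shows one way to do it with the openai-whisper package. It is a minimal example, not the repo's official extraction script; the model size and paths are assumptions you should adjust to your setup.
```python
# Minimal caption-extraction sketch using openai-whisper (pip install openai-whisper).
# Not the repo's official script; model size and paths are assumptions.
import os
import whisper

def to_srt_time(t: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 00:01:02,345."""
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{int((t - int(t)) * 1000):03d}"

video_dir = "Path/To/QVHighlights/videos"
srt_dir = "Path/To/QVHighlights/srts"
os.makedirs(srt_dir, exist_ok=True)

model = whisper.load_model("base")  # larger checkpoints give better captions

for name in sorted(os.listdir(video_dir)):
    if not name.endswith(".mp4"):
        continue
    result = model.transcribe(os.path.join(video_dir, name))
    srt_path = os.path.join(srt_dir, name.replace(".mp4", ".srt"))
    with open(srt_path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], start=1):
            f.write(f"{i}\n{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n")
            f.write(f"{seg['text'].strip()}\n\n")
```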
The checkpoints of the two training stages are provided below.
| Dataset | Stage 1 | Stage 2 |
|---|---|---|
| QVHighlights | Download | Download |
| Charades-STA | Download | Download |
Please download the checkpoints and put them under the checkpoint path:
```
CALCE
└── lavis
    ├── datasets
    │   └── annotations
    │       ├── charades
    │       │   ├── train.json
    │       │   └── test.json
    │       └── qvh
    │           ├── train.json
    │           ├── val.json
    │           └── test_dummy.json
    └── results
        ├── charades
        │   ├── CALCE_Charades_60_stage1-1
        │   │   └── checkpoint.best
        │   └── CALCE_Charades_120_stage2-1
        │       └── checkpoint.best
        └── qvh
            ├── CALCE_QVH_75_stage1-1
            │   └── checkpoint.best
            └── CALCE_QVH_150_stage2-1
                └── checkpoint.best
```
We provide CALCE training and inference script examples below.
Please refer to the dataset section above to customize your data paths.
You may need to update the config files for the respective runs to fit your machine; they are currently set to run on 4 A100-80GB GPUs.
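For example, moving to fewer GPUs usually means lowering the launcher's process count and rebalancing the batch size. The snippet below is an assumption based on typical LAVIS-style launchers (`train.py`, `--cfg-path`, `batch_size_train`, and `accum_grad_iters` are not verified against this repo); check the actual run scripts and configs for the real flags and keys.
```sh
# Assumed LAVIS-style launcher; check run_scripts/CALCE/train/*.sh for the real flags.
python -m torch.distributed.run --nproc_per_node=2 train.py \
    --cfg-path lavis/projects/CALCE/train/qvh_stage1.yaml
# In the config, halve batch_size_train and double accum_grad_iters to keep the
# effective batch size when moving from 4 GPUs to 2.
```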
First, evaluate Stage 1:
```sh
sh run_scripts/CALCE/eval/qvh_stage1.sh
sh run_scripts/CALCE/eval/charades_stage1.sh
```
Then merge the Stage 1 results:
```sh
python process_data/merge_stage1_result.py --eval
```
Finally, evaluate Stage 2:
```sh
sh run_scripts/CALCE/eval/qvh_stage2.sh
sh run_scripts/CALCE/eval/charades_stage2.sh
```
Evaluation should return (on the val split):
| QVH | R1@0.5 | R1@0.7 | mIoU | mAP@0.5 | mAP@0.75 |
|---|---|---|---|---|---|
| Stage 1 | 78.00 | 64.84 | 72.89 | 70.81 | 57.73 |
| Stage 2 | 78.13 | 65.42 | 72.95 | 71.02 | 58.11 |

| Charades-STA | R1@0.5 | R1@0.7 | mIoU |
|---|---|---|---|
| Stage 1 | 68.87 | 49.87 | 59.13 |
| Stage 2 | 70.35 | 50.59 | 60.06 |
Train Stage 1:
```sh
sh run_scripts/CALCE/train/qvh_stage1.sh
sh run_scripts/CALCE/train/charades_stage1.sh
```
Get the Stage 1 results on the training split:
```sh
sh run_scripts/CALCE/infer/qvh.sh
sh run_scripts/CALCE/infer/charades.sh
```
Then merge the Stage 1 results:
```sh
python process_data/merge_stage1_result.py
```
Or download the merged files we provide above (a sketch of what the merge conceptually produces follows the layout below). You should have:
```
datasets
└── annotations
    ├── charades
    │   ├── train.json
    │   ├── test.json
    │   ├── train_stage2.json
    │   └── test_stage2.json
    └── qvh
        ├── train.json
        ├── val.json
        ├── test_dummy.json
        ├── train_stage2.json
        ├── val_stage2.json
        └── test_dummy_stage2.json
```
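For intuition, the sketch below shows what the merge conceptually does: it attaches the coarse window predicted in Stage 1 to each annotation so that Stage 2 can refine it. The field names (`qid`, `pred_window`, `stage1_window`) and file paths are illustrative assumptions, not the repo's actual schema; use `process_data/merge_stage1_result.py` for real runs.
```python
# Conceptual sketch of the Stage 1 -> Stage 2 merge. Field names and paths are
# illustrative assumptions; the real logic lives in process_data/merge_stage1_result.py.
import json

def merge_stage1(ann_path: str, pred_path: str, out_path: str) -> None:
    with open(ann_path) as f:
        anns = json.load(f)                          # original annotations, e.g. train.json
    with open(pred_path) as f:
        preds = {p["qid"]: p for p in json.load(f)}  # Stage 1 coarse predictions
    for ann in anns:
        # Attach the coarse [start, end] window so Stage 2 can refine it.
        ann["stage1_window"] = preds[ann["qid"]]["pred_window"]
    with open(out_path, "w") as f:
        json.dump(anns, f)

merge_stage1("lavis/datasets/annotations/qvh/train.json",
             "path/to/stage1_train_predictions.json",   # hypothetical path
             "lavis/datasets/annotations/qvh/train_stage2.json")
```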
Train Stage 2:
```sh
sh run_scripts/CALCE/train/qvh_stage2.sh
sh run_scripts/CALCE/train/charades_stage2.sh
```
We thank the developers of MR.Blip for their public code release.
