TsinghuaC3I/AdsQA


Keywords: Advertisement Videos, VideoQA, Multimodal Reasoning, GRPO

🎉 This work has been accepted by ICCV 2025.
🚩 Its arXiv version is available at https://arxiv.org/abs/2509.08621 .


Xinwei Long1*, Kai Tian1*, Peng Xu1+, Guoli Jia1, Jingxuan Li2, Sa Yang3, Yihua Shao4, Kaiyan Zhang1, Che Jiang1, Hao Xu5, Yang Liu2, Jiaheng Ma2, Bowen Zhou1,6†

1 Tsinghua University

2 Independent Researcher

3 Peking University

4 CASIA

5 Harvard University

6 Shanghai Artificial Intelligence Lab

* Equal Contribution

+ Corresponding authors

Paper PDF | Project Page


AdsQA is the first large-scale benchmark targeting advertisement video understanding through LLMs. Ad videos are rich, symbolic, emotionally charged, and ideal for evaluating cognitive-level reasoning beyond physical perception.

  • 🌟 Why ads? Unlike typical visual data, ads are professionally crafted to convey themes, metaphors, and targeted emotions.
  • 📦 What’s AdsQA? A benchmark built on 1,544 ad videos and 10,962 clips totaling 22.7 hours, annotated via a novel multi-agent pipeline.
  • 🚀 Our Model: ReAd-R is a Reinforced Ad Reasoner trained using reward-based optimization, outperforming chain-of-thought and agent-based methods.
  • 🎯 5 Tasks: Visual Concepts, Emotion, Themes, Persuasion, and Audience.

🔥 AdsQA is used as the test set of ICCV 2025 MARS2 multimodal reasoning challenge.

💥 See the MARS2 official report at https://arxiv.org/abs/2509.14142 .


  • Our Contribution.

    • The AdsQA benchmark introduces a comprehensive, large-scale video QA dataset specifically designed around the complex and information-rich nature of advertisement videos. It offers a diverse and well-structured data source to evaluate LLMs on implicit reasoning tasks.


    Figure: Statistics of AdsQA benchmark (duration, domain, regions, etc).

    • ReAd-R. We propose ReAd-R—a DeepSeek-R1–styled RL reasoning model that reflects, answers, and learns from outcome-based rewards, avoiding costly step-wise/COT supervision.


    Figure: Architecture of ReAd-R.

  • Experiments


Getting Started

Data Acquisition

1. Video Data Acquisition

According to the Terms of Use of the data source, we cannot store or redistribute the original video files. Instead, we provide open-source access to the video URLs. Please follow these steps to acquire the video data:

  • Obtain the complete list of video URLs from this link. The file contains URLs for both the training and test set videos.
  • Use our provided script preprocess/download_videos.py to download all videos.
  • Example usage:
    python preprocess/download_videos.py --url_file [path_to_url_file] --output_dir [video_output_directory]

If any videos are inaccessible or the URLs have expired, please feel free to open an issue or contact us directly via email.
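If you prefer to script the download yourself, a minimal sketch is below. It assumes the URL file contains one video URL per line (the actual format consumed by preprocess/download_videos.py may differ), and the `.mp4` naming scheme is illustrative:

```python
"""Minimal URL-list downloader sketch (not the repository's script)."""
import os
import urllib.request

def read_url_list(path):
    # Keep non-empty, non-comment lines; one URL per line is assumed.
    with open(path) as f:
        return [ln.strip() for ln in f if ln.strip() and not ln.startswith("#")]

def download_all(url_file, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    for i, url in enumerate(read_url_list(url_file)):
        dest = os.path.join(output_dir, f"{i:05d}.mp4")
        if os.path.exists(dest):
            continue  # skip files that were already downloaded
        try:
            urllib.request.urlretrieve(url, dest)
        except Exception as e:
            print(f"failed: {url} ({e})")  # expired URL: open an issue
```

Skipping existing files makes the loop safe to re-run after interrupted or failed downloads.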

2. Video Preprocessing (Optional)

  • For our ReAd-R model, we preprocessed videos using video_clip.py and preprocess/transform_parquet.py. Preprocessed files are also available for convenience at this link.
  • Example usage:
    cd preprocess
    python video_clip.py # splits each video into clips
    python transform_parquet.py # converts the dataset into Parquet format for training.
  • Note: You may customize preprocessing (e.g., different sampling rates, resolutions) based on your specific requirements.
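As a rough illustration of what clip segmentation involves (the actual boundaries come from video_clip.py, and the fixed 10-second clip length here is a hypothetical default):

```python
"""Hypothetical fixed-length clip segmentation sketch."""

def clip_spans(duration_s, clip_len_s=10.0):
    # Cover the video with consecutive (start, end) spans;
    # the last span may be shorter than clip_len_s.
    spans, t = [], 0.0
    while t < duration_s:
        spans.append((t, min(t + clip_len_s, duration_s)))
        t += clip_len_s
    return spans
```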

3. Question and Annotation Data Acquisition

Download the following annotation files from this link:

  • train.json - Training set questions and annotations
  • testset_question.json - Test set (ids, videos, and questions) for inference
  • testset_groundtruth.json - Test set (ids, questions, ground-truth answers, and their meta_info) for model-based auto evaluation.

!!! Important Usage Note: The meta_info field is only for model-based auto evaluation; DO NOT use meta_info as model input during inference.
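One simple way to enforce this rule is to strip the field before building model inputs. The sketch below assumes the annotation files hold a list of dict entries; field names other than meta_info are illustrative:

```python
"""Guard against meta_info leaking into inference inputs (illustrative)."""

def strip_meta_info(examples):
    # Drop the evaluation-only meta_info field from every test example.
    return [{k: v for k, v in ex.items() if k != "meta_info"} for ex in examples]
```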

Training, Inference, and Evaluation

1. Requirements

We use the EasyR1 framework for reinforcement learning (RL) training.

conda create -n ReadR python=3.10
conda activate ReadR

cd ReadR
pip install -e .

2. Train

We provide the training code for ReAd-R. Please use the following script to run the training code.

bash examples/adsqa.sh
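ReAd-R's outcome-based reward optimization is GRPO-style: each question is answered several times, and each answer is scored relative to its group. As a rough illustration (not EasyR1's actual implementation), the group-relative advantage can be computed like this:

```python
"""Illustrative GRPO-style group-relative advantage computation."""
import statistics

def group_advantages(rewards, eps=1e-6):
    # Normalize each sampled answer's outcome reward against the
    # mean and std of its own group; eps avoids division by zero.
    mu = statistics.fmean(rewards)
    sd = statistics.pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]
```

Answers scoring above the group mean get positive advantages and are reinforced, which is what lets training proceed from outcome rewards alone, without step-wise/CoT supervision.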

Meanwhile, we have released our checkpoint (!!! 🔥 Model files re-uploaded on 2025-10-22):

| Model | Link |
| :-- | :-- |
| Qwen2.5-7B-VL-ReAd-R | 🤗 Huggingface |

3. Inference

We provide inference scripts in the evaluation directory. During inference, we use Automatic Speech Recognition (ASR) results as input features. The corresponding ASR data (asr_set.json) is also included in the same directory. The ASR transcripts were generated using Whisper and then translated into English via GPT-4o. For visual processing, we extract frames at 1 FPS (with a maximum of 32 frames) and use the default max_pixel setting from Qwen2.5-VL.
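The 1-FPS / 32-frame cap described above can be sketched as a timestamp-sampling policy. Note this is an assumption about how the cap is applied (here, long videos are sampled uniformly across their full duration); the repository's scripts define the actual behavior:

```python
"""Sketch of a 1-FPS, max-32-frame sampling policy (timestamps only)."""

def sample_timestamps(duration_s, fps=1.0, max_frames=32):
    # One frame per second, capped at max_frames; frames are placed
    # at the midpoints of n equal segments so coverage stays even.
    n = max(1, min(int(duration_s * fps), max_frames))
    step = duration_s / n
    return [step * (i + 0.5) for i in range(n)]
```

Decoding the frames at these timestamps is left to your video reader.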

This script creates a directory for each question (named by its question_id), in which each model's prediction is saved under the given model name (see examples in ./evaluation/results). Please use the following script to run the inference code.

# for GRPO model inference
python evaluation/eval_adQA.py --video_dir your_video_path --file_dir test_file_dir --model_dir test_model_save_path --model_name your_model_name
# for Qwen2.5-VL model inference
python evaluation/eval_adQA_qwen2d5-7b.py --video_dir your_video_path --file_dir test_file_dir --model_dir test_model_save_path --model_name your_model_name

4. Evaluation

We use GPT-4o as the judge model. Please refer to our ./evaluation/model_evaluation.py script to score the prediction results. In this file, you will need to specify the prediction file name, directory, and your GPT-4o API key and base URL. The model-based evaluation results will be saved to the score field in the prediction file.

Please use the following script to run the model-based evaluation code.

# for model-based evaluation
python evaluation/model_evaluation.py --eval_name prediction_file_name --test_file groundtruth_file --results_dir dir_you_save_prediction_files --api_key your_api_key --api_base your_api_url_base
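If you post-process judge outputs yourself, you need to recover a numeric score from free-form text. This helper is an assumption about the reply format (a single numeric score somewhere in the text), not the logic of model_evaluation.py:

```python
"""Illustrative helper: pull a numeric score out of a judge's reply."""
import re

def parse_score(judge_reply):
    # Return the first number in the reply as a float; None if absent.
    m = re.search(r"\d+(?:\.\d+)?", judge_reply)
    return float(m.group()) if m else None
```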

Contact

If you have any questions, please feel free to contact us:

longxw22@mails.tsinghua.edu.cn

tk23@mails.tsinghua.edu.cn

⭐ Citation

If you find our dataset, code, or model useful in your research, please consider citing our paper and MARS2 Workshop:

```
@inproceedings{long2025adsqa,
    author    = {Long, Xinwei and Tian, Kai and Xu, Peng and Jia, Guoli and Li, Jingxuan and Yang, Sa and Shao, Yihua and Zhang, Kaiyan and Jiang, Che and Xu, Hao and Liu, Yang and Ma, Jiaheng and Zhou, Bowen},
    title     = {AdsQA: Towards Advertisement Video Understanding},
    booktitle = {ICCV},
    year      = {2025}
}
```

```
@inproceedings{xu2025mars2,
    author    = {Xu, Peng and Xiong, Shengwu and Zhang, Jiajun and Chen, Yaxiong and Zhou, Bowen and Loy, Chen Change and Clifton, David and Lee, Kyoung Mu and Van Gool, Luc and others},
    title     = {MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook},
    booktitle = {ICCV Workshop},
    year      = {2025}
}
```