Implementation for "Causality Matters: How Temporal Information Emerges in Video Language Models". [paper]
```
├── test.sh       # running scripts
├── test_*.py     # entry files
├── model         # modified models and utils
├── evaluation    # scripts to evaluate performance
├── dataset       # ground truth (GT) for ActivityNet-QA
└── analyze       # scripts to draw figures
```

## Installation

```bash
conda create -n causality python=3.10
conda activate causality
pip install -r requirements.txt
```

If you want to test LLaVA-OneVision, install LLaVA following the instructions at https://github.com/LLaVA-VL/LLaVA-NeXT.
## Datasets

You can download the datasets here:

- TempCompass: https://huggingface.co/datasets/lmms-lab/TempCompass
- NExT-QA: https://huggingface.co/datasets/lmms-lab/NExTQA
- ActivityNet-QA: https://huggingface.co/datasets/lmms-lab/ActivityNetQA

Remember to change the video folder path accordingly when testing.
## Usage

- To collect results, refer to `test.sh`.
- We provide entry files for the different experimental settings:
  - `test_model_rm.py`: Section 4.1
  - `test_model_shuffle.py`: Section 4.2
  - `test_model_ablate.py`: Section 4.3
  - `test_model.py`: Section 5
  - `test_model_app*`: Section 6
- For `mask_type` in `test_*.py`, we provide:
  - `Query_Last`
  - `Frame_Last`
  - `Frame_Query`
  - `Frame_Frame`
  - `Frame{x}_Query1`: the query can attend to all of the frames, but each frame can only attend to:
    - `x = 1`: all of the previous frames
    - `x = 2`: the corresponding areas of all the previous frames
    - `x = 3`: the immediately previous frame
    - `x = 4`: the corresponding area of the immediately previous frame
  - `Frame{x}_Query2_{y}`: the query can only attend to a specific frame
    - `x = 1, 2, 3, 4`: same meanings as introduced above
    - `y` starts from 0
- If you have collected results and want to run analysis, refer to the scripts in the `analyze` folder.
- If you have collected results and want to run evaluation, refer to the scripts in the `evaluation` folder.
- For detailed arguments, check the `parse_arguments()` function in each entry file.
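As a rough illustration of what a `mask_type` variant means, here is a minimal, hypothetical sketch of building such an attention mask. The token layout `[frame tokens ... | query tokens]`, the `build_mask` helper, and the specific restriction shown are illustrative assumptions, not the repo's actual implementation:

```python
# Hypothetical sketch: realize a mask_type variant as a boolean attention
# mask (True = may attend). Names and layout are illustrative assumptions.

def build_mask(n_frames, tokens_per_frame, n_query, mask_type="Frame_Query"):
    """Return a (seq x seq) boolean mask as nested lists, with frame
    tokens first and query tokens last."""
    n_frame_tok = n_frames * tokens_per_frame
    seq = n_frame_tok + n_query
    # Start from a standard causal mask: token i sees tokens j <= i.
    mask = [[j <= i for j in range(seq)] for i in range(seq)]
    if mask_type == "Frame_Query":
        # Illustrative restriction: frame tokens may only attend within
        # their own frame, while query tokens still see everything before them.
        for i in range(n_frame_tok):
            for j in range(n_frame_tok):
                same_frame = (i // tokens_per_frame) == (j // tokens_per_frame)
                mask[i][j] = mask[i][j] and same_frame
    return mask
```

For example, with 2 frames of 2 tokens each plus 1 query token, the query row attends to all 5 positions, while a token of frame 1 no longer attends back to frame 0.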
## Acknowledgements

We would like to express our gratitude to the following excellent projects:

- Qwen2.5-VL: we ran experiments on the Qwen2-VL and Qwen2.5-VL models.
- LLaVA-NeXT: we ran experiments on the LLaVA-OneVision model.
- MovieChat: we referenced its evaluation of open-ended questions.
- cross-modal-information-flow-in-MLLM: we referenced its figure-generation code.

We also sincerely thank the providers and curators of the datasets used in this project.
## Citation

```bibtex
@article{shi2025causality,
  title={Causality Matters: How Temporal Information Emerges in Video Language Models},
  author={Shi, Yumeng and Long, Quanyu and Wu, Yin and Wang, Wenya},
  journal={arXiv preprint arXiv:2508.11576},
  year={2025}
}
```