We propose MDocAgent, a novel multi-modal multi-agent framework for document question answering. It integrates text and image retrieval with five specialized agents (general, critical, text, image, and summarizing) that collaborate to reason across modalities. Experiments on five benchmarks show a 12.1% improvement over state-of-the-art methods, demonstrating its effectiveness on complex real-world documents.
## Installation

- Clone this repository and navigate to the MDocAgent folder:
  ```bash
  git clone https://github.com/aiming-lab/MDocAgent.git
  cd MDocAgent
  ```
- Install the package: create the conda environment and run the install script:
  ```bash
  conda create -n mdocagent python=3.12
  conda activate mdocagent
  bash install.sh
  ```
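Optionally, you can sanity-check that the environment was created and is active before moving on; this is a generic conda check, not a script shipped with the repo.

```bash
# Optional check: mdocagent should appear in the environment list and be active
conda env list
python --version  # should report Python 3.12.x
```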
## Data Preparation

- Create a data directory:
  ```bash
  mkdir data
  cd data
  ```
- Download the dataset from Hugging Face and place it in the `data` directory. The documents for PaperText are the same as for PaperTab; you can use a symbolic link or make a copy (see the sketch after this list).
- Return to the project root:
  ```bash
  cd ../
  ```
- Extract the data:
  ```bash
  python scripts/extract.py --config-name <dataset>  # choose from mmlb / ldu / ptab / ptext / feta
  ```
  The extracted texts and images will be saved in `tmp/<dataset>`.
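For the PaperText/PaperTab overlap mentioned above, a symbolic link is usually enough. The folder names in this sketch are assumptions about how the downloaded data is laid out under `data`; adjust them to match your actual directory names.

```bash
# Reuse PaperTab's documents for PaperText via a symlink instead of copying.
# NOTE: "ptab/documents" and "ptext/documents" are assumed names; adapt to your layout.
ln -s "$(pwd)/data/ptab/documents" "$(pwd)/data/ptext/documents"
```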
## Retrieval

### Text Retrieval

Set the retrieval type to `text` in `config/base.yaml`:

```yaml
defaults:
  - retrieval: text
```

Then run:

```bash
python scripts/retrieve.py --config-name <dataset>
```

### Image Retrieval

Switch the retrieval type to `image` in `config/base.yaml`:

```yaml
defaults:
  - retrieval: image
```

Run the retrieval process again:

```bash
python scripts/retrieve.py --config-name <dataset>
```

The retrieval results will be stored in `data/<dataset>/sample-with-retrieval-results.json`.
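To confirm that retrieval finished, you can pretty-print the beginning of the output file; `mmlb` below is only an example dataset name.

```bash
# Peek at the retrieval output (mmlb is an example; substitute your dataset)
python -m json.tool data/mmlb/sample-with-retrieval-results.json | head -n 40
```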
## Inference

Run the following command:

```bash
python scripts/predict.py --config-name <dataset> run-name=<run-name>
```

Note: `<run-name>` can be any string that uniquely identifies this run (required).

The inference results will be saved to `results/<dataset>/<run-name>/<run-time>.json`.

To use the top-4 retrieval candidates, set `dataset.top_k`:

```bash
python scripts/predict.py --config-name <dataset> run-name=<run-name> dataset.top_k=4
```
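Each inference run writes a timestamped result file, so a quick way to see what a run produced is to list its result files, newest first; `mmlb` and `demo` below are placeholder values.

```bash
# List inference results for a run, newest first (mmlb and demo are example values)
ls -t results/mmlb/demo/
```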
## Evaluation

- Add your OpenAI API key in `config/model/openai.yaml`.
- Run the evaluation (make sure `<run-name>` matches your inference run):
  ```bash
  python scripts/eval.py --config-name <dataset> run-name=<run-name>
  ```

The evaluation results will be saved in `results/<dataset>/<run-name>/results.txt`.

Note: evaluation uses the newest inference result file with the same `<run-name>`.
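Putting the steps together, a full pass over one dataset looks roughly like the sketch below; `mmlb` and `demo` are example values, and `config/base.yaml` still has to be edited by hand between the two retrieval passes as described above.

```bash
# End-to-end sketch for a single dataset (mmlb) and run name (demo)
python scripts/extract.py --config-name mmlb

# Set `retrieval: text` under `defaults` in config/base.yaml, then:
python scripts/retrieve.py --config-name mmlb

# Switch to `retrieval: image` in config/base.yaml, then:
python scripts/retrieve.py --config-name mmlb

# Inference and evaluation (the OpenAI key must be set in config/model/openai.yaml)
python scripts/predict.py --config-name mmlb run-name=demo
python scripts/eval.py --config-name mmlb run-name=demo
```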
## Citation

```bibtex
@article{han2025mdocagent,
  title={MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding},
  author={Han, Siwei and Xia, Peng and Zhang, Ruiyi and Sun, Tong and Li, Yun and Zhu, Hongtu and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2503.13964},
  year={2025}
}
```