We propose MDocAgent, a novel multi-modal multi-agent framework for document question answering. It integrates text and image retrieval with five specialized agents (general, critical, text, image, and summarizing) that collaborate to reason across modalities. Experiments on five benchmarks show a 12.1% improvement over state-of-the-art methods, demonstrating its effectiveness on complex real-world documents.
## Installation

- Clone this repository and navigate to the MDocAgent folder:
  ```bash
  git clone https://github.com/aiming-lab/MDocAgent.git
  cd MDocAgent
  ```
- Install the package: create the conda environment and run the install script:
  ```bash
  conda create -n mdocagent python=3.12
  conda activate mdocagent
  bash install.sh
  ```
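Optionally, you can sanity-check that the environment was created and is active before moving on; this is a generic conda check, not a script shipped with the repo.

```bash
# Optional check: mdocagent should appear in the environment list and be active
conda env list
python --version  # should report Python 3.12.x
```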
## Data Preparation

- Create a data directory:
  ```bash
  mkdir data
  cd data
  ```
- Download the dataset from Hugging Face and place it in the `data` directory. The documents for PaperText are the same as for PaperTab; you can use a symbolic link or make a copy (see the sketch after this list).
- Return to the project root:
  ```bash
  cd ../
  ```
- Extract the data:
  ```bash
  python scripts/extract.py --config-name <dataset>  # choose from mmlb / ldu / ptab / ptext / feta
  ```
  The extracted texts and images will be saved in `tmp/<dataset>`.
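For the PaperText/PaperTab overlap mentioned above, a symbolic link is usually enough. The folder names in this sketch are assumptions about how the downloaded data is laid out under `data`; adjust them to match your actual directory names.

```bash
# Reuse PaperTab's documents for PaperText via a symlink instead of copying.
# NOTE: "ptab/documents" and "ptext/documents" are assumed names; adapt to your layout.
ln -s "$(pwd)/data/ptab/documents" "$(pwd)/data/ptext/documents"
```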
## Retrieval

### Text Retrieval

Set the retrieval type to `text` in `config/base.yaml`:

```yaml
defaults:
  - retrieval: text
```

Then run:

```bash
python scripts/retrieve.py --config-name <dataset>
```

### Image Retrieval

Switch the retrieval type to `image` in `config/base.yaml`:

```yaml
defaults:
  - retrieval: image
```

Run the retrieval process again:

```bash
python scripts/retrieve.py --config-name <dataset>
```

The retrieval results will be stored in `data/<dataset>/sample-with-retrieval-results.json`.
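To confirm that retrieval finished, you can pretty-print the beginning of the output file; `mmlb` below is only an example dataset name.

```bash
# Peek at the retrieval output (mmlb is an example; substitute your dataset)
python -m json.tool data/mmlb/sample-with-retrieval-results.json | head -n 40
```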
## Inference

Run the following command:

```bash
python scripts/predict.py --config-name <dataset> run-name=<run-name>
```

Note: `<run-name>` can be any string that uniquely identifies this run (required).

The inference results will be saved to `results/<dataset>/<run-name>/<run-time>.json`.

To use the top-4 retrieval candidates, set `dataset.top_k`:

```bash
python scripts/predict.py --config-name <dataset> run-name=<run-name> dataset.top_k=4
```
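Each inference run writes a timestamped result file, so a quick way to see what a run produced is to list its result files, newest first; `mmlb` and `demo` below are placeholder values.

```bash
# List inference results for a run, newest first (mmlb and demo are example values)
ls -t results/mmlb/demo/
```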
## Evaluation

- Add your OpenAI API key in `config/model/openai.yaml`.
- Run the evaluation (make sure `<run-name>` matches your inference run):
  ```bash
  python scripts/eval.py --config-name <dataset> run-name=<run-name>
  ```

The evaluation results will be saved in `results/<dataset>/<run-name>/results.txt`.

Note: evaluation uses the newest inference result file with the same `<run-name>`.
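Putting the steps together, a full pass over one dataset looks roughly like the sketch below; `mmlb` and `demo` are example values, and `config/base.yaml` still has to be edited by hand between the two retrieval passes as described above.

```bash
# End-to-end sketch for a single dataset (mmlb) and run name (demo)
python scripts/extract.py --config-name mmlb

# Set `retrieval: text` under `defaults` in config/base.yaml, then:
python scripts/retrieve.py --config-name mmlb

# Switch to `retrieval: image` in config/base.yaml, then:
python scripts/retrieve.py --config-name mmlb

# Inference and evaluation (the OpenAI key must be set in config/model/openai.yaml)
python scripts/predict.py --config-name mmlb run-name=demo
python scripts/eval.py --config-name mmlb run-name=demo
```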
## Citation

```bibtex
@article{han2025mdocagent,
  title={MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding},
  author={Han, Siwei and Xia, Peng and Zhang, Ruiyi and Sun, Tong and Li, Yun and Zhu, Hongtu and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2503.13964},
  year={2025}
}
```