A tool for processing academic papers with .tex source files to extract:
- Object detection results
- LaTeX source code with visual bounding box pairs
- Layout reading orders
- GitHub Repository: https://github.com/Alpha-Innovator/DocGenome
- HuggingFace dataset: https://huggingface.co/datasets/U4R/DocGenome/tree/main
Python Environment
- Python 3.8 or higher
- Anaconda (recommended); see the Anaconda installation guide
TeX Live Distribution
- Required for LaTeX compilation
- Installation guide available at tug.org/texlive
For Ubuntu users:

```bash
sudo apt-get install texlive-full  # Requires ~5.4GB disk space
```

Note: `texlive-full` is recommended to avoid missing-package errors. See the package differences for details.
Create and activate a conda environment:

```bash
conda create --name doc_parser python=3.8
conda activate doc_parser
```
Install the package:

```bash
pip install -e .
```
Run the parser on your LaTeX file:

```bash
python main.py --file_name path_to_paper/paper.tex
```

Results are stored in `path_to_paper/output/result`:
```
path_to_paper
├── output
│   ├── paper_colored/                  # Rendered paper images
│   │   ├── thread-0001-page-01.jpg
│   │   └── ...
│   └── result/
│       ├── layout_annotation.json      # Object detection results (COCO format)
│       ├── reading_annotation.json     # Bounding box to LaTeX source mapping
│       ├── ordering_annotation.json    # Reading order relationships
│       ├── quality_report.json
│       ├── texts.json                  # Original tex contents
│       ├── layout_info.json            # Raw detection results
│       ├── layout_metadata.json        # Paper layout information
│       ├── page_*.jpg                  # Pages with bounding boxes
│       └── block_*.jpg                 # Individual block images
```
Object Detection Results
`layout_annotation.json` and `page_*.jpg` - uses the COCO format
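A minimal sketch of reading the detections, assuming the file follows standard COCO conventions (`images`, `annotations`, and `categories` arrays, with `bbox` as `[x, y, width, height]`):

```python
import json

# Load the COCO-format detections (path matches the output tree above).
with open("path_to_paper/output/result/layout_annotation.json") as f:
    coco = json.load(f)

# Standard COCO: resolve category ids to human-readable names.
id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}

for ann in coco["annotations"][:5]:
    x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
    print(ann["image_id"], id_to_name[ann["category_id"]], (x, y, w, h))
```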
Reading Detection Results
`reading_annotation.json` - maps bounding boxes to the original LaTeX content
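The exact schema of `reading_annotation.json` is not documented here, so a safe first step is to inspect it before relying on specific keys; a sketch:

```python
import json

with open("path_to_paper/output/result/reading_annotation.json") as f:
    reading = json.load(f)

# Print the top-level type and one sample entry to discover the schema.
print(type(reading).__name__)
if isinstance(reading, list):
    print(reading[0])
elif isinstance(reading, dict):
    key = next(iter(reading))
    print(key, reading[key])
```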
Reading Order Results
`ordering_annotation.json` - defines relationships between blocks using triples: (relationship, from, to)
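A sketch of grouping those triples by relationship type, assuming the JSON body is a flat list of `(relationship, from, to)` entries as described above:

```python
import json
from collections import defaultdict

with open("path_to_paper/output/result/ordering_annotation.json") as f:
    triples = json.load(f)

# Group directed edges (from -> to) by their relationship label.
edges = defaultdict(list)
for relationship, src, dst in triples:
    edges[relationship].append((src, dst))

for relationship, pairs in edges.items():
    print(relationship, len(pairs), pairs[:3])
```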
Each bounding box is classified into one of these categories:
| Category | Name | Super Category | Description |
|---|---|---|---|
| 0 | Algorithm | Algorithm | Algorithm environments |
| 1 | Caption | Caption | Figure, Table, Algorithm captions |
| 2 | Equation | Equation | Display equations (equation, align) |
| 3 | Figure | Figure | Figures |
| 4 | Footnote | Footnote | Footnotes |
| 5 | List | List | itemize, enumerate, description |
| 6 | Others | Others | Currently unused |
| 7 | Table | Table | Tables |
| 8 | Text | Text | Plain text without equations |
| 9 | Text-EQ | Text | Text with inline equations |
| 10 | Title | Title | Section/subsection titles |
| 11 | Reference | Reference | References |
| 12 | PaperTitle | Title | Paper title |
| 13 | Code | Algorithm | Code listings |
| 14 | Abstract | Text | Paper abstract |
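For convenience, the table above can be mirrored as a lookup when filtering annotations; a sketch (the `categories` field inside `layout_annotation.json` remains the authoritative source):

```python
# Category ids copied from the table above.
CATEGORY_NAMES = {
    0: "Algorithm", 1: "Caption", 2: "Equation", 3: "Figure",
    4: "Footnote", 5: "List", 6: "Others", 7: "Table",
    8: "Text", 9: "Text-EQ", 10: "Title", 11: "Reference",
    12: "PaperTitle", 13: "Code", 14: "Abstract",
}

# Per the table, the "Text" super category covers Text, Text-EQ, and Abstract.
TEXT_SUPER_CATEGORY = {8, 9, 14}

def is_textual(category_id: int) -> bool:
    return category_id in TEXT_SUPER_CATEGORY
```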
Latexpand Error

ValueError: Failed to run the command "latexpand..."

Solution:
- Check the latexpand version:

  ```bash
  latexpand --help
  ```

- If the version is below 1.6, upgrade:
  - Download latexpand v1.6
  - Update the existing script:

    ```bash
    sudo vim $(which latexpand)
    ```
PDF2Image Error

PDFInfoNotInstalledError: Unable to get page count

Solution:

```bash
sudo apt-get install poppler-utils
```
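After installing poppler-utils, a quick way to confirm that pdf2image can find it (the PDF path is a placeholder):

```python
from pdf2image import convert_from_path

# Raises PDFInfoNotInstalledError if poppler is still missing.
pages = convert_from_path("path_to_paper/paper.pdf")
print(f"Rendered {len(pages)} page(s)")
```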
Missing Block PDF
- If `block_*.pdf` is missing, the LaTeX rendering likely failed.
- This is case-specific and requires manual investigation.
- Custom Environments: Some custom environments (e.g., `\newtheorem{defn}[thm]{Definition}`) require manual addition to `envs.text_envs`; see the sketch after this list.
- Rendering Issues: Some environments may fail during PDF compilation.
- Special Figures: TikZ and similar formats may not be correctly classified
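A hedged sketch of registering a custom environment. It assumes `envs.text_envs` is an importable list of environment names, which is an assumption about this codebase rather than documented API:

```python
import envs

# "defn" matches \newtheorem{defn}[thm]{Definition} in the paper's preamble.
# Assumption: envs.text_envs is a plain list of environment names.
if "defn" not in envs.text_envs:
    envs.text_envs.append("defn")
```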
Build the documentation using Sphinx:

```bash
cd docs
sphinx-build . _build
```

View the documentation by opening `docs/_build/index.html` in a browser.
If you find this package useful, please cite:
```bibtex
@article{xia2024docgenome,
  title={DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models},
  author={Xia, Renqiu and Mao, Song and Yan, Xiangchao and Zhou, Hongbin and Zhang, Bo and Peng, Haoyang and Pi, Jiahao and Fu, Daocheng and Wu, Wenjie and Ye, Hancheng and others},
  journal={arXiv preprint arXiv:2406.11633},
  year={2024}
}
```