A tool for processing academic papers with .tex
source files to extract:
- Object detection results
- LaTeX source code with visual bounding box pairs
- Layout reading orders

- GitHub repository: https://github.com/Alpha-Innovator/DocGenome
- HuggingFace dataset: https://huggingface.co/datasets/U4R/DocGenome/tree/main

- Python Environment
  - Python 3.8 or higher
  - Anaconda (recommended); see the Anaconda installation guide
- TeX Live Distribution
  - Required for LaTeX compilation
  - Installation guide available at tug.org/texlive

  For Ubuntu users:
  sudo apt-get install texlive-full # Requires ~5.4GB disk space
  Note: texlive-full is recommended to avoid missing package errors. See package differences.
- Create and activate conda environment:
  conda create --name doc_parser python=3.8
  conda activate doc_parser
- Install the package:
  pip install -e .
Run the parser on your LaTeX file:
python main.py --file_name path_to_paper/paper.tex
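To process several papers in one go, the command above can be wrapped in a small script. This is only a sketch: it assumes it is run from the repository root (where main.py lives), and `build_command` and `run_parser` are hypothetical helper names, not part of the package.

```python
import subprocess
import sys
from pathlib import Path


def build_command(tex_file):
    """Return the documented command line for one .tex file."""
    return [sys.executable, "main.py", "--file_name", str(tex_file)]


def run_parser(tex_files):
    """Run the parser on each file and return the files whose run failed."""
    failed = []
    for tex_file in tex_files:
        result = subprocess.run(build_command(tex_file))
        if result.returncode != 0:
            failed.append(Path(tex_file))
    return failed
```

For example, `run_parser(Path("papers").glob("*/paper.tex"))` would run the parser on every `paper.tex` one level below a hypothetical `papers/` directory.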
Results are stored in path_to_paper/output/result:
path_to_paper
├── output
│ ├── paper_colored/ # Rendered paper images
│ │ ├── thread-0001-page-01.jpg
│ │ └── ...
│ └── result/
│ ├── layout_annotation.json # Object detection results (COCO format)
│ ├── reading_annotation.json # Bounding box to LaTeX source mapping
│ ├── ordering_annotation.json # Reading order relationships
│ ├── quality_report.json
│ ├── texts.json # Original tex contents
│ ├── layout_info.json # Raw detection results
│ ├── layout_metadata.json # Paper layout information
│ ├── page_*.jpg # Pages with bounding boxes
│ └── block_*.jpg # Individual block images
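Since layout_annotation.json is described as COCO format, it can be inspected with the standard library alone. The sketch below assumes the usual COCO keys (`images`, `annotations`, `categories`, and per-annotation `image_id`, `category_id`, and `bbox` as `[x, y, width, height]`); the actual file may carry extra or different fields, so check it before relying on this. `boxes_per_image` and `load_layout` are illustrative names.

```python
import json
from collections import defaultdict


def boxes_per_image(coco):
    """Group (category name, bbox) pairs by image id from a COCO-style dict."""
    names = {c["id"]: c["name"] for c in coco.get("categories", [])}
    grouped = defaultdict(list)
    for ann in coco.get("annotations", []):
        # Fall back to the numeric id if the category list lacks an entry.
        label = names.get(ann["category_id"], str(ann["category_id"]))
        grouped[ann["image_id"]].append((label, ann["bbox"]))
    return dict(grouped)


def load_layout(path):
    """Read a COCO-style JSON file and group its boxes by image."""
    with open(path, encoding="utf-8") as f:
        return boxes_per_image(json.load(f))
```

A call such as `load_layout("path_to_paper/output/result/layout_annotation.json")` would then return one list of labelled boxes per page image.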
- Object Detection Results: layout_annotation.json and page_*.jpg
  - Uses COCO format
- Reading Detection Results: reading_annotation.json
  - Maps bounding boxes to original LaTeX content
- Reading Order Results: ordering_annotation.json
  - Defines relationships between blocks using triples: (relationship, from, to)
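The (relationship, from, to) triples can be indexed by relationship type for easier traversal. The JSON layout of ordering_annotation.json is not specified here, so this sketch takes triples that have already been loaded as tuples. `successors_by_relation` is a hypothetical helper, and the relationship names in the usage example are invented for illustration.

```python
from collections import defaultdict


def successors_by_relation(triples):
    """Index (relationship, from, to) triples as relation -> from -> [to, ...]."""
    index = defaultdict(lambda: defaultdict(list))
    for relationship, src, dst in triples:
        index[relationship][src].append(dst)
    # Convert nested defaultdicts to plain dicts for predictable lookups.
    return {rel: dict(edges) for rel, edges in index.items()}
```

For instance, `successors_by_relation([("next", 0, 1), ("next", 1, 2)])` groups both edges under the made-up `"next"` relationship, keyed by their source block.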
Each bounding box is classified into one of these categories:
Category | Name | Super Category | Description |
---|---|---|---|
0 | Algorithm | Algorithm | Algorithm environments |
1 | Caption | Caption | Figure, Table, Algorithm captions |
2 | Equation | Equation | Display equations (equation, align) |
3 | Figure | Figure | Figures |
4 | Footnote | Footnote | Footnotes |
5 | List | List | itemize, enumerate, description |
6 | Others | Others | Currently unused |
7 | Table | Table | Tables |
8 | Text | Text | Plain text without equations |
9 | Text-EQ | Text | Text with inline equations |
10 | Title | Title | Section/subsection titles |
11 | Reference | Reference | References |
12 | PaperTitle | Title | Paper title |
13 | Code | Algorithm | Code listings |
14 | Abstract | Text | Paper abstract |
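When post-processing annotations, it can help to have the table above available as a lookup. The mapping below is transcribed directly from the table; the helper name `ids_in_super_category` is an assumption for illustration and does not come from the package.

```python
# Category id -> (name, super category), transcribed from the table above.
CATEGORIES = {
    0: ("Algorithm", "Algorithm"),
    1: ("Caption", "Caption"),
    2: ("Equation", "Equation"),
    3: ("Figure", "Figure"),
    4: ("Footnote", "Footnote"),
    5: ("List", "List"),
    6: ("Others", "Others"),
    7: ("Table", "Table"),
    8: ("Text", "Text"),
    9: ("Text-EQ", "Text"),
    10: ("Title", "Title"),
    11: ("Reference", "Reference"),
    12: ("PaperTitle", "Title"),
    13: ("Code", "Algorithm"),
    14: ("Abstract", "Text"),
}


def ids_in_super_category(name):
    """Return the sorted category ids that share a super category."""
    return sorted(cid for cid, (_, sup) in CATEGORIES.items() if sup == name)
```

This makes it straightforward to treat, say, Text, Text-EQ, and Abstract blocks uniformly by filtering on the "Text" super category.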
- Latexpand Error
  ValueError: Failed to run the command "latexpand..."
  Solution:
  - Check latexpand version:
    latexpand --help
  - If the version is below 1.6, upgrade:
    - Download from latexpand v1.6
    - Update existing script:
      sudo vim $(which latexpand)
- PDF2Image Error
  PDFInfoNotInstalledError: Unable to get page count
  Solution:
  sudo apt-get install poppler-utils
- Missing Block PDF
  - If block_*.pdf is missing, the LaTeX rendering likely failed
  - This is case-specific and requires manual investigation
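The first two errors above come from external executables that are missing or outdated, so a quick pre-flight check can catch them before a long run. This sketch only tests whether the commands are on PATH (`pdfinfo` ships with poppler-utils); it does not verify versions. `missing_tools` is a hypothetical helper name.

```python
import shutil


def missing_tools(tools=("latexpand", "pdfinfo")):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list from `missing_tools()` means both commands were found; any names it returns still need to be installed.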
- Custom Environments: Some custom environments (e.g., \newtheorem{defn}[thm]{Definition}) require manual addition to envs.text_envs
- Rendering Issues: Some environments may fail during PDF compilation
- Special Figures: TikZ and similar formats may not be correctly classified
Build the documentation using Sphinx:
cd docs
sphinx-build . _build
View the documentation by opening docs/_build/index.html in a browser.
If you find this package useful, please cite:
@article{xia2024docgenome,
title={DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models},
author={Xia, Renqiu and Mao, Song and Yan, Xiangchao and Zhou, Hongbin and Zhang, Bo and Peng, Haoyang and Pi, Jiahao and Fu, Daocheng and Wu, Wenjie and Ye, Hancheng and others},
journal={arXiv preprint arXiv:2406.11633},
year={2024}
}