DocParser

A tool for processing academic papers with .tex source files to extract:

- Object detection results
- LaTeX source code with visual bounding box pairs
- Layout reading orders

Project Links

GitHub Repository: https://github.com/Alpha-Innovator/DocGenome
HuggingFace dataset: https://huggingface.co/datasets/U4R/DocGenome/tree/main

Installation

Prerequisites

Python Environment
- Python 3.8 or higher
- Anaconda (recommended) - Installation Guide

TeX Live Distribution
- Required for LaTeX compilation
- Installation guide available at tug.org/texlive

For Ubuntu users:
```
sudo apt-get install texlive-full  # Requires ~5.4GB disk space
```
Note: texlive-full is recommended to avoid missing package errors. See package differences.

Setup

Create and activate conda environment:

```
conda create --name doc_parser python=3.8
conda activate doc_parser
```

Install the package:
```
pip install -e .
```

Usage

Run the parser on your LaTeX file:

```
python main.py --file_name path_to_paper/paper.tex
```
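
To process several papers, the same command can be wrapped in a small driver script. The following is a minimal sketch, not part of DocParser: the `papers/<paper_id>/paper.tex` layout and the helper names `build_command` and `run_batch` are hypothetical, and it assumes `main.py` returns a non-zero exit code when parsing fails.

```python
import subprocess
import sys
from pathlib import Path


def build_command(tex_file, main_script="main.py"):
    """Return the argv list for one parser run (same flag as the command above)."""
    return [sys.executable, main_script, "--file_name", str(tex_file)]


def run_batch(tex_files, main_script="main.py"):
    """Run the parser on each file in turn and return {path: exit code}."""
    results = {}
    for tex_file in tex_files:
        completed = subprocess.run(build_command(tex_file, main_script))
        results[str(tex_file)] = completed.returncode
    return results


if __name__ == "__main__":
    # Hypothetical layout: papers/<paper_id>/paper.tex
    papers = sorted(Path("papers").glob("*/paper.tex"))
    for path, code in run_batch(papers).items():
        # Assumption: a non-zero exit code means the run failed.
        print(path, "ok" if code == 0 else f"failed ({code})")
```

Running the papers one after another keeps the output of each run easy to attribute to its paper.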

Output Structure

Results are stored in path_to_paper/output/result:

path_to_paper
├── output
│   ├── paper_colored/           # Rendered paper images
│   │   ├── thread-0001-page-01.jpg
│   │   └── ...
│   └── result/
│       ├── layout_annotation.json    # Object detection results (COCO format)
│       ├── reading_annotation.json   # Bounding box to LaTeX source mapping
│       ├── ordering_annotation.json  # Reading order relationships
│       ├── quality_report.json      
│       ├── texts.json               # Original tex contents
│       ├── layout_info.json         # Raw detection results
│       ├── layout_metadata.json     # Paper layout information
│       ├── page_*.jpg              # Pages with bounding boxes
│       └── block_*.jpg             # Individual block images
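
Because layout_annotation.json is described as COCO format, it can presumably be inspected with plain `json`. The sketch below assumes the file uses the standard COCO top-level keys (`images`, `annotations`, `categories`); the helper names and the sample data are made up for illustration and are not real DocParser output.

```python
import json
from collections import Counter
from pathlib import Path


def load_layout_annotation(result_dir):
    """Load layout_annotation.json from a result directory."""
    path = Path(result_dir) / "layout_annotation.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def count_blocks_per_category(coco):
    """Count annotations per category name in a COCO-style dict."""
    id_to_name = {c["id"]: c["name"] for c in coco.get("categories", [])}
    counts = Counter()
    for ann in coco.get("annotations", []):
        counts[id_to_name.get(ann["category_id"], "unknown")] += 1
    return dict(counts)


# Tiny hand-made sample in standard COCO layout (assumed, not real output).
sample = {
    "images": [{"id": 1, "file_name": "page_0.jpg", "width": 100, "height": 100}],
    "categories": [{"id": 3, "name": "Figure"}, {"id": 8, "name": "Text"}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 8, "bbox": [0, 0, 10, 10]},
        {"id": 2, "image_id": 1, "category_id": 8, "bbox": [0, 20, 10, 10]},
        {"id": 3, "image_id": 1, "category_id": 3, "bbox": [0, 40, 10, 10]},
    ],
}
print(count_blocks_per_category(sample))  # {'Text': 2, 'Figure': 1}
```

For a real run, `count_blocks_per_category(load_layout_annotation("path_to_paper/output/result"))` would give the same kind of summary, provided the assumed keys are present.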

Output Components

Object Detection Results
- layout_annotation.json and page_*.jpg
- Uses COCO format

Reading Detection Results
- reading_annotation.json
- Maps bounding boxes to original LaTeX content

Reading Order Results
- ordering_annotation.json
- Defines relationships between blocks using triples: (relationship, from, to)
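
The (relationship, from, to) triples can be treated as labelled edges between blocks. The sketch below only illustrates that idea: the exact JSON schema of ordering_annotation.json is not shown here, so the triples are given as plain Python tuples, and the labels `follows` and `refers_to` are placeholders rather than the names DocParser emits.

```python
from collections import defaultdict


def group_by_relationship(triples):
    """Map each relationship label to its list of (from, to) block pairs."""
    grouped = defaultdict(list)
    for relationship, src, dst in triples:
        grouped[relationship].append((src, dst))
    return dict(grouped)


def successors(triples, relationship):
    """Map each source block to the blocks it points to under one relationship."""
    nxt = defaultdict(list)
    for rel, src, dst in triples:
        if rel == relationship:
            nxt[src].append(dst)
    return dict(nxt)


# Made-up triples; the relationship labels are placeholders.
triples = [
    ("follows", 0, 1),
    ("follows", 1, 2),
    ("refers_to", 1, 5),
]
print(successors(triples, "follows"))  # {0: [1], 1: [2]}
```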

Troubleshooting

Common Issues

Latexpand Error
```
ValueError: Failed to run the command "latexpand..."
```
Solution:
- Check the latexpand version: latexpand --help
- If the version is below 1.6, upgrade:
  1. Download latexpand v1.6
  2. Update the installed script with the downloaded version: sudo vim $(which latexpand)

PDF2Image Error

```
PDFInfoNotInstalledError: Unable to get page count
```
Solution:
```
sudo apt-get install poppler-utils
```
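
Both errors above come from external command-line tools, so it can help to check that they are on the PATH before running the parser. This is a small preflight sketch, not part of DocParser. The tool list is an assumption based on the errors in this section: `latexpand`, plus `pdfinfo` and `pdftoppm` from poppler-utils, which pdf2image calls. It does not check versions.

```python
import shutil

# Tool name -> where it usually comes from (assumed list, see lead-in).
REQUIRED_TOOLS = {
    "latexpand": "TeX Live or a standalone latexpand script",
    "pdfinfo": "poppler-utils",
    "pdftoppm": "poppler-utils",
}


def missing_tools(required=REQUIRED_TOOLS, which=shutil.which):
    """Return {tool: provider} for every tool that is not found on PATH."""
    return {tool: src for tool, src in required.items() if which(tool) is None}


if __name__ == "__main__":
    for tool, src in missing_tools().items():
        print(f"missing: {tool} (usually provided by {src})")
```

The `which` parameter exists only so the lookup can be replaced in tests.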

Missing Block PDF
- If block_*.pdf is missing, the LaTeX rendering likely failed
- This is case-specific and requires manual investigation

Known Limitations

- Custom Environments: Some custom environments (e.g., \newtheorem{defn}[thm]{Definition}) require manual addition to envs.text_envs
- Rendering Issues: Some environments may fail during PDF compilation
- Special Figures: TikZ and similar formats may not be correctly classified
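
To find out which custom environments a paper declares, the .tex source can be scanned for \newtheorem definitions. The sketch below is a hypothetical helper, not part of DocParser. It only lists candidate names and does not modify envs.text_envs. It also does not skip commented-out lines or handle environments defined through \newenvironment.

```python
import re

# Matches \newtheorem{name}... and \newtheorem*{name}...; captures the name.
NEWTHEOREM = re.compile(r"\\newtheorem\*?\s*\{([^}]+)\}")


def custom_theorem_envs(tex_source):
    """Return environment names declared with \\newtheorem, in order, without duplicates."""
    seen = []
    for name in NEWTHEOREM.findall(tex_source):
        if name not in seen:
            seen.append(name)
    return seen


sample = r"""
\newtheorem{thm}{Theorem}
\newtheorem{defn}[thm]{Definition}
\newtheorem*{remark}{Remark}
"""
print(custom_theorem_envs(sample))  # ['thm', 'defn', 'remark']
```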

Documentation

Build the documentation using Sphinx:

```
cd docs
sphinx-build . _build
```

View the documentation by opening docs/_build/index.html in a browser.

Acknowledgements

Built using:

Citation

If you find this package useful, please cite:

@article{xia2024docgenome,
  title={DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models},
  author={Xia, Renqiu and Mao, Song and Yan, Xiangchao and Zhou, Hongbin and Zhang, Bo and Peng, Haoyang and Pi, Jiahao and Fu, Daocheng and Wu, Wenjie and Ye, Hancheng and others},
  journal={arXiv preprint arXiv:2406.11633},
  year={2024}
}


Categories

| Category ID | Name | Super Category | Description |
|---|---|---|---|
| 0 | Algorithm | Algorithm | Algorithm environments |
| 1 | Caption | Caption | Figure, Table, Algorithm captions |
| 2 | Equation | Equation | Display equations (equation, align) |
| 3 | Figure | Figure | Figures |
| 4 | Footnote | Footnote | Footnotes |
| 5 | List | List | itemize, enumerate, description |
| 6 | Others | Others | Currently unused |
| 7 | Table | Table | Tables |
| 8 | Text | Text | Plain text without equations |
| 9 | Text-EQ | Text | Text with inline equations |
| 10 | Title | Title | Section/subsection titles |
| 11 | Reference | Reference | References |
| 12 | PaperTitle | Title | Paper title |
| 13 | Code | Algorithm | Code listings |
| 14 | Abstract | Text | Paper abstract |
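
When post-processing annotations it is often convenient to have the category table as a lookup. The sketch below copies the IDs, names, and super categories from the table above into plain Python. The variable and function names are illustrative and are not taken from the DocParser code base.

```python
# (category id, name, super category), copied from the category table.
CATEGORIES = [
    (0, "Algorithm", "Algorithm"),
    (1, "Caption", "Caption"),
    (2, "Equation", "Equation"),
    (3, "Figure", "Figure"),
    (4, "Footnote", "Footnote"),
    (5, "List", "List"),
    (6, "Others", "Others"),
    (7, "Table", "Table"),
    (8, "Text", "Text"),
    (9, "Text-EQ", "Text"),
    (10, "Title", "Title"),
    (11, "Reference", "Reference"),
    (12, "PaperTitle", "Title"),
    (13, "Code", "Algorithm"),
    (14, "Abstract", "Text"),
]

ID_TO_NAME = {cid: name for cid, name, _ in CATEGORIES}
NAME_TO_SUPER = {name: sup for _, name, sup in CATEGORIES}


def members(super_category):
    """Return the category names that belong to one super category."""
    return [name for _, name, sup in CATEGORIES if sup == super_category]


print(ID_TO_NAME[9])    # Text-EQ
print(members("Text"))  # ['Text', 'Text-EQ', 'Abstract']
```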
