Skip to content

cxsoto/article-regions

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 

Repository files navigation

article-regions

A dataset of region-annotated scientific articles from PubMed Central, for document layout analysis/segmentation.

Regions

Nine document regions are annotated:

  • Title
  • Authors
  • Abstract
  • Body
  • Figure
  • Figure Caption
  • Table
  • Table Caption
  • References

Getting and preparing documents

A script is included to download the corresponding article PDFs from PubMed Central, as well as render the article pages to JPG images. Some Linux utilities are used: curl, pdfinfo (from poppler-utils), and convert (from imagemagick). Note that by default ImageMagick disables support for PDF files, but this can easily be fixed by updating its policy file:

sudo sed -i 's/rights="none" pattern="PDF"/rights="read|write" pattern="PDF"/' /etc/ImageMagick-6/policy.xml

Citation

C.X. Soto and S.J. Yoo, "Visual Detection with Context for Document Layout Analysis", proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.

https://www.aclweb.org/anthology/D19-1348.pdf

About

A dataset of region-annotated scientific articles.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages