pdfs

PDF exploration

An ongoing project for the Digital Preservation Unit at the UMich Library has been to learn more about the PDF files in Deep Blue Documents than is readily available from the system itself. We are interested in "preservation metadata": which version of PDF a file uses, which system created it, and so on.

The initial set is 134,643 files.

This GitHub repo contains the scripts I used to analyze the results of running several PDF tools (Tika, JHOVE, and veraPDF) on these files, along with sample result files.

Tika (https://github.com/chrismattmann/tika-python): change the file path in the Python script, then run:

  python3 basicTikaParser.py
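
The contents of `basicTikaParser.py` are not reproduced here. As a rough sketch of the idea, assuming the `tika` package linked above is installed and can start its Tika server, the following pulls a few preservation-relevant fields out of the metadata returned by `parser.from_file`. The helper names and the metadata keys (`pdf:PDFVersion`, `xmp:CreatorTool`, `pdf:docinfo:producer`) are assumptions; key names vary between Tika versions, so check them against real output.

```python
# Hypothetical sketch; not the repository's basicTikaParser.py.

# Metadata keys assumed to hold preservation-relevant values.
# They differ between Tika versions, so adjust to match real output.
FIELDS = {
    "pdf_version": "pdf:PDFVersion",
    "creator_tool": "xmp:CreatorTool",
    "producer": "pdf:docinfo:producer",
}


def pick_fields(metadata):
    """Return the selected fields from a Tika metadata dict (None if absent)."""
    return {name: metadata.get(key) for name, key in FIELDS.items()}


def describe_pdf(path):
    """Parse one file with Tika and return its selected metadata fields."""
    # Imported here so pick_fields can be used without Tika installed.
    from tika import parser

    parsed = parser.from_file(path)
    # "metadata" may be missing or None if parsing failed.
    return pick_fields(parsed.get("metadata") or {})
```

Calling `describe_pdf("../PDF_Assessment_v1.3-1.pdf")` would return a small dict per file, which is easier to aggregate across a large set than the full metadata.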

JHOVE (https://jhove.openpreservation.org/getting-started/):

  /Applications/JHOVE/jhove ../PDF_Assessment_v1.3-1.pdf

Run JHOVE on folders:

  /Applications/JHOVE/jhove -m PDF-hul ./SamplePile/10-1/ > jhove-output.txt
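
The redirected output can then be summarized per file. The sketch below is not the repository's `JhoveOutputParser.py`; it assumes JHOVE's default text output, in which each file starts with a `RepresentationInformation:` line and later carries a `Status:` line. The function name is hypothetical.

```python
from collections import Counter


def jhove_statuses(text):
    """Map each file path in JHOVE text output to its Status value."""
    statuses = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue
        if key == "RepresentationInformation":
            # Start of a new file's report.
            current = value.strip()
        elif key == "Status" and current is not None:
            statuses[current] = value.strip()
    return statuses


def tally(statuses):
    """Count how many files share each Status value."""
    return Counter(statuses.values())
```

Reading `jhove-output.txt` and passing its text to `jhove_statuses`, then to `tally`, gives a quick count of how many files fall under each status.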

PDFinfo (https://www.xpdfreader.com/pdfinfo-man.html):

  ./xpdf-tools-mac-4.03/bin64/pdfinfo ../PDF_Assessment_v1.3-1.pdf
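
pdfinfo prints one `Key: value` pair per line, so its output is easy to turn into a dict. This is a minimal sketch under that assumption; the function names are hypothetical, and the default executable path is simply the one used in the command above.

```python
import subprocess


def parse_pdfinfo(output):
    """Turn pdfinfo's 'Key: value' lines into a dict."""
    fields = {}
    for line in output.splitlines():
        # Split on the first colon only, since values such as dates contain colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def run_pdfinfo(pdf_path, pdfinfo="./xpdf-tools-mac-4.03/bin64/pdfinfo"):
    """Run pdfinfo on one file and return its parsed fields."""
    result = subprocess.run(
        [pdfinfo, pdf_path], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo(result.stdout)
```

With this, `run_pdfinfo("../PDF_Assessment_v1.3-1.pdf").get("PDF version")` would return the version string if pdfinfo reported one.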

ExifTool (https://exiftool.org/):

  exiftool ../PDF_Assessment_v1.3-1.pdf
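
For bulk analysis, ExifTool's `-json` option is convenient because it emits a JSON array with one object per file. A sketch of tallying PDF versions from that output follows; it assumes the version appears under the `PDFVersion` tag, and the function name is hypothetical.

```python
import json
from collections import Counter


def count_pdf_versions(exiftool_json):
    """Count files per PDF version in `exiftool -json` output."""
    records = json.loads(exiftool_json)
    # PDFVersion may be emitted as a number, so normalize to a string.
    return Counter(str(r.get("PDFVersion", "unknown")) for r in records)
```

Files without the tag are grouped under `"unknown"` so they remain visible in the tally.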

PDFBox (https://pdfbox.apache.org/download.html):

  java -jar preflight-app-3.0.0.jar ../PDF_Assessment_v1.3-1.pdf

veraPDF (https://docs.verapdf.org/cli/validation/):

  /Applications/veraPDF/verapdf -f 0 ../PDF_Assessment_v1.3-1.pdf

Run veraPDF on folders:

  /Applications/veraPDF/verapdf -f ua1 --recurse ../SamplePile/10-1/ > veraPDF-output.txt
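
The report written to `veraPDF-output.txt` can be reduced to a pass/fail value per file. The sketch below assumes veraPDF's default XML report, with one `job` element per file containing an `item/name` element and a `validationReport` element carrying an `isCompliant` attribute. Those element names and the function name are assumptions to verify against an actual report.

```python
import xml.etree.ElementTree as ET


def verapdf_compliance(xml_text):
    """Map each file name in a veraPDF XML report to True, False, or None."""
    root = ET.fromstring(xml_text)
    results = {}
    for job in root.iter("job"):
        name = job.findtext("item/name")
        if name is None:
            continue
        report = job.find("validationReport")
        # None marks a job without a validation report (e.g. a parse failure).
        if report is None:
            results[name] = None
        else:
            results[name] = report.get("isCompliant") == "true"
    return results
```

Keeping `None` distinct from `False` separates files that failed validation from files veraPDF could not process at all.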

JHOVE parser (run within /apps):

  python3 JhoveOutputParser.py
