HOCRViewer

Read books in HOCR format with Mirador.

Requirements

Python 3.5
Optional: An SQLite version that supports FTS5 (check with sqlite3 ":memory:" "PRAGMA compile_options;" | grep FTS5)
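
The same check can be run from Python. This tests the SQLite library that the interpreter itself is linked against, which may differ from the sqlite3 command-line binary. A minimal sketch; the function name has_fts5 is illustrative and not part of HOCRViewer:

```python
import sqlite3


def has_fts5():
    """Return True if the SQLite library used by Python can create FTS5 tables."""
    conn = sqlite3.connect(":memory:")
    try:
        # Creating a throwaway FTS5 table raises OperationalError
        # ("no such module: fts5") when the extension is not compiled in.
        conn.execute("CREATE VIRTUAL TABLE fts5_probe USING fts5(content)")
        return True
    except sqlite3.OperationalError:
        return False
    finally:
        conn.close()


if __name__ == "__main__":
    print("FTS5 available:", has_fts5())
```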

Installation

$ pip install -r requirements.txt

Data format

The HOCR file must contain all pages as ocr_page elements. Each of these must have a title attribute that contains the following fields (as per the HOCR Specification):

ppageno: The physical page number
image: The relative path (from the HOCR file) to the page image
bbox: The bounding box of the page in pixels (x0 y0 x1 y1), which gives the dimensions of the image

Additionally, each ocr_page element must have an id attribute that assigns a unique identifier to the page.

Example:

<div class="ocr_page" id="page_0005"
     title="ppageno 4; image spyri_heidi_1880/00000005.tif; bbox 0 0 2013 2985"/>
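
To check a file against these requirements before serving it, something along the following lines may help. It is a sketch using only the standard library, not the parser HOCRViewer itself uses; the names parse_title, PageCollector and collect_pages are hypothetical.

```python
from html.parser import HTMLParser

REQUIRED_FIELDS = ("ppageno", "image", "bbox")


def parse_title(title):
    """Split an hOCR title attribute into a {property: value} dict."""
    fields = {}
    for part in title.split(";"):
        key, _, value = part.strip().partition(" ")
        if key:
            fields[key] = value.strip()
    return fields


class PageCollector(HTMLParser):
    """Collect id, ppageno, image and bbox for every ocr_page element."""

    def __init__(self):
        super().__init__()
        self.pages = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocr_page" not in (attrs.get("class") or "").split():
            return
        if not attrs.get("id"):
            raise ValueError("ocr_page element without an id attribute")
        fields = parse_title(attrs.get("title") or "")
        missing = [name for name in REQUIRED_FIELDS if name not in fields]
        if missing:
            raise ValueError(
                "page %s lacks %s" % (attrs["id"], ", ".join(missing)))
        self.pages.append({
            "id": attrs["id"],
            "ppageno": int(fields["ppageno"]),
            # The image path may be quoted inside the title attribute.
            "image": fields["image"].strip("\"'"),
            "bbox": tuple(int(v) for v in fields["bbox"].split()),
        })


def collect_pages(hocr_markup):
    """Return one dict per ocr_page element found in the markup."""
    collector = PageCollector()
    collector.feed(hocr_markup)
    collector.close()
    return collector.pages


EXAMPLE = '''<div class="ocr_page" id="page_0005"
     title="ppageno 4; image spyri_heidi_1880/00000005.tif; bbox 0 0 2013 2985"/>'''

pages = collect_pages(EXAMPLE)
print(pages)
```

A page without an id, or with a title that lacks one of the three fields, raises a ValueError instead of being skipped silently.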

Alternatively, HOCR files with accompanying images that are stored like the Google 1000 Books dataset (download instructions) can be indexed and viewed as well.

Usage

Simply point the application to a directory containing hOCR files and it will serve a web interface where you can view them:

$ python hocrviewer.py serve /mnt/data/hocr

Alternatively, you can index your files before serving them. This has two main advantages: it significantly reduces response times for the manifests and annotations, and it enables search within the books (not yet usable from Mirador, but keep an eye on this PR).

To do so, run the index subcommand with the path to the directory with your HOCR files as the first argument. By default, the database is written to ~/.config/hocrviewer/hocrviewer.db, but you can override this with the --db-path option, which must be passed before the subcommand:

$ python hocrviewer.py --db-path /tmp/test.db index /mnt/data/hocr

After the index has been created, run the application with the serve subcommand (making sure that you pass the same --db-path value as during indexing).

$ python hocrviewer.py --db-path /tmp/test.db serve

The application exposes all books as IIIF manifests at /iiif/<book_name>, where book_name is the file name of the HOCR file for the book without the .html extension.
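
As an illustration of this naming rule, the manifest URL for a given HOCR file can be derived as sketched below. The base URL http://localhost:5000 and the file name spyri_heidi_1880.html are assumptions made for the example; use the host and port your server actually listens on. The helper manifest_url is hypothetical and not part of HOCRViewer.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def manifest_url(base_url, hocr_path):
    """Build the /iiif/<book_name> URL for an HOCR file path."""
    name = PurePosixPath(hocr_path).name
    # book_name is the file name without the .html extension.
    if name.endswith(".html"):
        name = name[:-len(".html")]
    # Percent-encode the name so spaces and similar characters stay valid.
    return "%s/iiif/%s" % (base_url.rstrip("/"), quote(name))


# Assumed base URL and file name, for illustration only.
print(manifest_url("http://localhost:5000",
                   "/mnt/data/hocr/spyri_heidi_1880.html"))
```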

Planned Features

Search across all books (backend done, user interface missing)
Edit OCR with a custom AnnotationEditor implementation for Mirador
Browse books in a paginated view outside of Mirador (which gets overwhelmed with large libraries)
