
Glossary

Konstantin Baierer edited this page May 9, 2018 · 15 revisions

OCR-D Glossary

A glossary of terms from the domain of image processing and OCR, as used within the OCR-D framework.

Layout and Typography

Block

A block is a polygon inside a page.

Block type

The semantics or function of a block, for example heading, page number, column, or print space.

Glyph

TODO

Grapheme Cluster

See Glyph

Line

See TextLine

Region

See Block

Symbol

See Glyph

TextLine

A TextLine is a block of text without line breaks.

Word

A word is a sequence of glyphs not containing any word-bounding whitespace.
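
The layout terms above can be read as a small data model. The following is a minimal sketch only, not the OCR-D data format: the class names mirror the glossary terms, while the field names (`polygon`, `block_type`, `text`), the pixel-coordinate convention, and the `words()` helper are assumptions made for illustration. Words are approximated here as whitespace-free substrings rather than as sequences of glyphs.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x, y) pixel coordinates; the coordinate convention is an assumption of this sketch.
Point = Tuple[int, int]


@dataclass
class Block:
    """A polygon inside a page, optionally tagged with a block type."""
    polygon: List[Point]
    block_type: str = ""  # hypothetical field, e.g. "heading" or "page number"


@dataclass
class TextLine(Block):
    """A block of text without line breaks."""
    text: str = ""

    def __post_init__(self) -> None:
        # Enforce the glossary definition: no line breaks inside a TextLine.
        if "\n" in self.text or "\r" in self.text:
            raise ValueError("a TextLine must not contain line breaks")

    def words(self) -> List[str]:
        # str.split() without arguments splits on runs of whitespace,
        # so none of the returned words contains whitespace.
        return self.text.split()
```

For example, `TextLine(polygon=[(0, 0), (100, 0), (100, 20), (0, 20)], text="Ein  kurzer Satz").words()` yields the three words of the line, regardless of how many spaces separate them.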

Data

Evaluation data

TODO

Ground Truth

TODO

"Referenzdaten" (reference data)

TODO

Training data

TODO

Activities

Document analysis

TODO

Page segmentation

TODO

Data Persistence

Software repository

The software repository contains all document analysis algorithms developed during the project, including their tests. It will also contain the documentation and the installation instructions for deploying a document analysis workflow.

Research data repository

The research data repository contains the results of all activities during document analysis. At a minimum, it contains the final results for every processed document together with their full provenance. The research data repository must be available locally.

Model repository

The model repository contains all trained (OCR) models for document analysis. It must be available locally. Ideally, a publicly available model repository will also be developed.
