
Glossary

Konstantin Baierer edited this page May 14, 2018 · 15 revisions

OCR-D Glossary

Glossary of terms from the domain of image processing/OCR as used within the OCR-D framework

Layout and Typography

Block

A block is a polygon inside a page.

Block type

The semantics or function of a block, such as heading, page number, column or print space.

Glyph

TODO

Grapheme Cluster

See Glyph

Line

See TextLine

Reading Order

Reading order is the intended order of regions within a document.

Region

See Block

Symbol

See Glyph

TextLine

A TextLine is a block of text without line breaks.

Word

A word is a sequence of glyphs not containing any word-bounding whitespace.

Data

Evaluation data

TODO

Ground Truth

In the context of OCR-D, Ground Truth consists of transcriptions in PAGE-XML format together with the original image.

"Referenzdaten"

TODO

Training data

Most LSTM models are trained on pairs of a line image and its line transcription. These pairs can be generated from the PAGE-XML of the Ground Truth.
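The following is a minimal sketch of how such pairs might be collected from a PAGE-XML document using only the Python standard library. It assumes that each TextLine carries a Coords element with a points attribute and a TextEquiv/Unicode transcription. The function name line_samples, the SAMPLE document and the choice of a bounding box as the crop area are illustrative assumptions, not part of OCR-D. Cropping the actual image is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written PAGE-XML fragment used only for illustration.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page.tif" imageWidth="100" imageHeight="50">
    <TextRegion id="r1">
      <Coords points="0,0 100,0 100,50 0,50"/>
      <TextLine id="l1">
        <Coords points="5,5 95,5 95,20 5,20"/>
        <TextEquiv><Unicode>Erste Zeile</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l2">
        <Coords points="5,25 80,25 80,40 5,40"/>
        <TextEquiv><Unicode>Zweite Zeile</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def local(tag):
    """Strip the XML namespace so the sketch works across PAGE versions."""
    return tag.rsplit("}", 1)[-1]


def line_samples(page_xml):
    """Return a list of ((x0, y0, x1, y1), transcription) per TextLine."""
    samples = []
    for elem in ET.fromstring(page_xml).iter():
        if local(elem.tag) != "TextLine":
            continue
        points, text = None, None
        # Only look at direct children, so Word-level TextEquiv is ignored.
        for child in elem:
            name = local(child.tag)
            if name == "Coords":
                points = [
                    tuple(int(v) for v in pair.split(","))
                    for pair in child.get("points", "").split()
                ]
            elif name == "TextEquiv":
                for sub in child:
                    if local(sub.tag) == "Unicode":
                        text = sub.text or ""
        if points and text is not None:
            xs = [x for x, _ in points]
            ys = [y for _, y in points]
            samples.append(((min(xs), min(ys), max(xs), max(ys)), text))
    return samples


for box, text in line_samples(SAMPLE):
    print(box, text)
```

Each bounding box could then be used to cut the line image out of the page image, which yields the line image/transcription pair.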

Activities

Binarization

Binarization means converting all colors in an image to either black or white.

Controlled term: binarized

See Felix Niklas' interactive demo
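As a minimal sketch of the idea, the function below binarizes a grayscale image given as nested lists of values from 0 (black) to 255 (white). The fixed global threshold of 128 and the function name binarize are assumptions made for illustration. Practical tools typically use adaptive methods such as Otsu or Sauvola thresholding.

```python
def binarize(gray, threshold=128):
    """Map every grayscale value to 0 (black) or 255 (white)."""
    return [
        [255 if value >= threshold else 0 for value in row]
        for row in gray
    ]


# Tiny hypothetical 2x3 "image".
page = [
    [250, 240, 30],
    [200, 20, 10],
]
print(binarize(page))
```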

Dewarping

Manipulating an image so that it is rectangular, all text lines are parallel to the top and bottom edges of the page, and creases, folds and the curving of the page into the spine of the book have been corrected.

See Matt Zucker's entry on Dewarping.

Despeckling

Remove artifacts such as smudges, ink blots, underlinings etc. from an image.

Deskewing

Rotate the image so that all text lines are horizontal.

Grayscale normalization

ISSUE: https://github.com/OCR-D/spec/issues/41

Controlled term: gray_normalized.

Gray normalization is similar to binarization, but instead of a purely bitonal image, the output can also contain shades of gray. This avoids inadvertently merging glyphs that are very close together.

Document analysis

Document analysis is the detection of structure on the document level to create a table of contents.

Reading order detection

Detects the reading order of blocks.
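A deliberately naive sketch of reading order detection is to sort blocks by the top edge and then the left edge of their bounding boxes. The function name naive_reading_order and the block identifiers are hypothetical. This heuristic is only an assumption for simple single-column pages and will give wrong results for multi-column layouts, marginalia and similar cases.

```python
def naive_reading_order(blocks):
    """Order block ids top-to-bottom, then left-to-right.

    blocks maps a block id to its bounding box (x0, y0, x1, y1).
    """
    return sorted(blocks, key=lambda block_id: (blocks[block_id][1], blocks[block_id][0]))


# Hypothetical single-column page.
blocks = {
    "footer": (10, 900, 500, 950),
    "heading": (10, 10, 500, 60),
    "body": (10, 100, 500, 880),
}
print(naive_reading_order(blocks))
```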

Cropping

Detecting the print space in a page, as opposed to the margins. It is a form of block segmentation.
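As a rough sketch, the print space of a clean, binarized page can be approximated by the bounding box of all black pixels. The function name print_space and the toy page are assumptions for illustration. Real scans contain noise and dark borders, so an actual cropping step needs to be considerably more robust.

```python
def print_space(binary):
    """Return (x0, y0, x1, y1) enclosing all black (0) pixels, or None."""
    rows = [y for y, row in enumerate(binary) if 0 in row]
    cols = [x for row in binary for x, value in enumerate(row) if value == 0]
    if not rows:
        return None  # blank page: no print space found
    return (min(cols), min(rows), max(cols), max(rows))


W = 255  # white
page = [
    [W, W, W, W, W],
    [W, 0, W, 0, W],
    [W, W, 0, W, W],
    [W, W, W, W, W],
]
print(print_space(page))
```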

Border removal

See Cropping

Segmentation

Segmentation means detecting areas within an image.

Specific segmentation algorithms are labelled by the semantics of the regions they detect, not by the semantics of the input. For example, an algorithm that detects blocks is called block segmentation.

Block segmentation

Segment an image into blocks. Also determine whether each block is a text or non-text block (e.g. an image).

Controlled term: SEG-BLOCK

Block classification

Determine the type of a detected block. See Block type.

Line segmentation

Segment blocks into textlines.

Controlled term: SEG-LINE
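One classic way to illustrate line segmentation is a horizontal projection profile: rows that contain ink belong to a line, and empty rows separate lines. The sketch below assumes a binarized, deskewed block given as nested lists with 0 for black and 255 for white. The function name segment_lines is hypothetical, and this is not necessarily the approach used by any OCR-D module.

```python
def segment_lines(binary):
    """Return inclusive (top, bottom) row ranges that contain black pixels."""
    lines = []
    start = None
    for y, row in enumerate(binary):
        has_ink = 0 in row
        if has_ink and start is None:
            start = y  # a new line begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))  # the current line ended above
            start = None
    if start is not None:
        lines.append((start, len(binary) - 1))  # line touches the bottom edge
    return lines


W = 255  # white
block = [
    [W, W, W, W],
    [0, 0, W, 0],
    [0, W, W, 0],
    [W, W, W, W],
    [W, 0, 0, W],
]
print(segment_lines(block))
```

The same idea applied to columns within a single line, with a minimum gap width, gives a simple form of word segmentation.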

Word segmentation

Segment a textline into words.

Controlled term: SEG-WORD

Glyph segmentation

Segment a textline into glyphs.

Controlled term: SEG-GLYPH

Data Persistence

Software repository

TODO wrong use of document analysis

The software repository contains all document analysis algorithms developed during the project including tests. It will also contain the documentation and installation instructions for deploying a document analysis workflow.

Ground Truth repository

Contains all the ground truth data.

Research data repository

TODO wrong use of document analysis

The research data repository contains the results of all activities during document analysis. At a minimum, it contains the end results of every processed document together with their full provenance. The research data repository must be available locally.

Model repository

TODO wrong use of document analysis

Contains all trained (OCR) models for document analysis. The model repository must be available locally. Ideally, a publicly available model repository will be developed.
