
Workflow Guide text recognition

Konstantin Baierer edited this page Sep 30, 2020 · 6 revisions

This processor recognizes text in segmented lines.

An overview of the existing model repositories, with short descriptions of the most important models, can be found here.

Available processors

| Processor | Parameter | Remarks | Call |
| --- | --- | --- | --- |
| ocrd-tesserocr-recognize | -P model Fraktur, or model "GT4HistOCR_50000000.997_191951" | Recommended. Model can be found here (/tessdata_best/GT4HistOCR_50000000.997_191951.traineddata) | TESSDATA_PREFIX="/test/data/tesseractmodels/" ocrd-tesserocr-recognize -I OCR-D-DEWARP-LINE -O OCR-D-OCR -P model Fraktur |
| ocrd-calamari-recognize | -P checkpoint "/path/to/models/*.ckpt.json" | Recommended. Model can be found here; you need to **pass your local path to the model on your hard drive** as parameter value for this processor to work! | ocrd-calamari-recognize -I OCR-D-DEWARP-LINE -O OCR-D-OCR -P checkpoint /path/to/models/\*.ckpt.json |

Note: For ocrd-tesserocr, the environment variable TESSDATA_PREFIX must point to the directory where the models in use are stored. (The directory should contain at least the following models: deu.traineddata, eng.traineddata, osd.traineddata.)
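
The requirement in the note above can be checked before a recognition run. The following is a minimal sketch: check_tessdata is a hypothetical helper (not part of ocrd-tesserocr), and the directory is the example path from the call in the table, which you will need to replace with your own.

```shell
# check_tessdata is a hypothetical helper, not part of ocrd-tesserocr.
# It returns 0 if the given directory holds the three models named in the
# note above, and 1 (listing what is missing on stderr) otherwise.
check_tessdata() {
    local dir="${1%/}" missing=0 model
    for model in deu eng osd; do
        if [ ! -f "$dir/$model.traineddata" ]; then
            echo "missing: $dir/$model.traineddata" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example directory copied from the call in the table; adjust to your setup.
export TESSDATA_PREFIX="/test/data/tesseractmodels/"
if check_tessdata "$TESSDATA_PREFIX"; then
    echo "TESSDATA_PREFIX looks complete: $TESSDATA_PREFIX"
else
    echo "TESSDATA_PREFIX is missing models: $TESSDATA_PREFIX"
fi
```

Exporting the variable once in the shell session has the same effect as prefixing each call with TESSDATA_PREFIX=..., as the call in the table does.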

Note: If you want to continue with the optional post-correction, you should also set textequiv_level to glyph or, in the case of ocrd-calamari-recognize, to at least word (word is already the default for ocrd-tesserocr-recognize).
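
As a sketch of what the note above means in practice, the calls from the table can be extended with a second -P option for textequiv_level. The run wrapper and the DRY_RUN variable are hypothetical conveniences that only print the command by default; the file group names and paths are the example values from the table, and TESSDATA_PREFIX is assumed to be exported already.

```shell
# DRY_RUN and run are hypothetical helpers for this sketch:
# with DRY_RUN=1 (the default) the command is printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Tesseract: word is the default, so glyph has to be requested explicitly.
# Assumes TESSDATA_PREFIX is already exported in this shell.
run ocrd-tesserocr-recognize -I OCR-D-DEWARP-LINE -O OCR-D-OCR \
    -P model Fraktur -P textequiv_level glyph

# Calamari: request at least word. The quotes keep the shell from
# expanding the glob, so the processor receives the pattern itself.
run ocrd-calamari-recognize -I OCR-D-DEWARP-LINE -O OCR-D-OCR \
    -P checkpoint '/path/to/models/*.ckpt.json' -P textequiv_level word
```

Both calls write to the same output file group here only because the table uses that name for both; in a real workflow you would pick one processor or give each its own output group.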

Notes on parameter usage

E.g.

  • which parameters do you use with what values?
  • which parameters are insufficiently documented?
  • which aspects of a processor should be parameterizable but are not?

Notes on document-specific usage

E.g. which processors worked best with what material? -- feel free to post sample images here, too.
