Workflow Guide region segmentation

In this processing step, an (optimized) document image is taken as input and segmented into its various regions, including columns. The segments are also classified, either coarsely (text, separator, image, table, ...) or in a fine-grained way (paragraph, marginalia, heading, ...).

Note: The ocrd-tesserocr-segment, ocrd-tesserocr-recognize, ocrd-eynollah-segment, ocrd-sbb-textline-detector and ocrd-cis-ocropy-segment processors segment not only the page into regions, but also the detected text regions into text lines, all in one step. With those processors (and only with those!) you therefore don't need a separate line segmentation step and can continue with step 13 - line-level dewarping.

Note: ocrd-tesserocr-segment-region produces only bounding boxes instead of polygon coordinates, so if you use it, you should post-process with ocrd-segment-repair and plausibilize=True to obtain better results without large overlaps. Alternatively, consider the all-in-one capabilities of ocrd-tesserocr-segment and ocrd-tesserocr-recognize, which can do region segmentation and line segmentation (and optionally also text recognition) in one step by querying Tesseract's internal iterator, which gives access to the more precise polygon outlines instead of just coarse bounding boxes with lots of hard-to-recover overlap. As a further alternative, run ocrd-tesserocr-segment-region with shrink_polygons=True, which accesses that same iterator to calculate convex hull polygons.
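The two-step route from the note above (bounding-box region segmentation followed by repair) could look roughly like the following. This is a minimal sketch, not a definitive recipe: the file group names are taken from the example calls in the table of available processors further down, and the `run` helper with its `RUN` switch is a hypothetical convenience that only prints the commands unless `RUN=1` is set.

```shell
#!/bin/sh
# Hypothetical helper: print the command (dry run) unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Region segmentation with bounding boxes, then plausibilization
# to reduce large overlaps. The second step only runs if the first succeeds.
segment_and_repair() {
    run ocrd-tesserocr-segment-region -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-REG -P find_tables false &&
    run ocrd-segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -P plausibilize true
}

segment_and_repair
```

Run it once as a dry run to check the commands, then with `RUN=1` inside an OCR-D workspace to actually execute them.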

Note: All the ocrd-tesserocr-segment* processors internally delegate to ocrd-tesserocr-recognize, so you can replace calls to these task-specific processors with calls to ocrd-tesserocr-recognize with specific parameters:

processor call	`ocrd-tesserocr-recognize` parameters
ocrd-tesserocr-segment-region -P overwrite_regions true	ocrd-tesserocr-recognize -P textequiv_level region -P segmentation_level region -P overwrite_segments true
ocrd-tesserocr-segment-table -P overwrite_cells true	ocrd-tesserocr-recognize -P textequiv_level cell -P segmentation_level cell -P overwrite_segments true
ocrd-tesserocr-segment-line -P overwrite_lines true	ocrd-tesserocr-recognize -P textequiv_level line -P segmentation_level line -P overwrite_segments true
ocrd-tesserocr-segment-word -P overwrite_words true	ocrd-tesserocr-recognize -P textequiv_level word -P segmentation_level word -P overwrite_segments true
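The mapping in the table above can be written down as a small lookup function, which may help when migrating existing scripts. This is only a sketch: `equivalent_recognize_call` is a hypothetical name, the function prints the replacement call rather than executing it, and it assumes the original call used the matching `overwrite_*` parameter set to `true` as in the table. Any `-I`/`-O` file groups from the original call would have to be appended unchanged.

```shell
#!/bin/sh
# Print the ocrd-tesserocr-recognize call that corresponds to a
# task-specific segment-* processor, following the table above.
equivalent_recognize_call() {
    case "$1" in
        ocrd-tesserocr-segment-region) level=region ;;
        ocrd-tesserocr-segment-table)  level=cell ;;
        ocrd-tesserocr-segment-line)   level=line ;;
        ocrd-tesserocr-segment-word)   level=word ;;
        *)
            echo "no equivalent listed for: $1" >&2
            return 1
            ;;
    esac
    echo "ocrd-tesserocr-recognize -P textequiv_level $level -P segmentation_level $level -P overwrite_segments true"
}

equivalent_recognize_call ocrd-tesserocr-segment-line
```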

Note: The three parameters segmentation_level, textequiv_level and model define the behavior of ocrd-tesserocr-recognize:

segmentation_level determines the highest level to segment. Use "none" to disable segmentation altogether, i.e. only recognize existing segments.
textequiv_level determines the lowest level to segment. Use "none" to segment until the lowest level ("glyph") and disable recognition altogether, only analyse layout.
model determines the model to use for text recognition. Use "" or do not set at all to disable recognition, i.e. only analyse layout.

Examples:

To segment existing regions into lines (and only lines), without recognizing text: segmentation_level="line", textequiv_level="line", model=""
To segment existing regions into lines (and only lines) and recognize text: segmentation_level="line", textequiv_level="line", model="Fraktur"
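On the command line, the two examples above might be invoked as sketched below. The parameter names and values come from the examples. The `segment_lines` and `run` helpers are hypothetical, as is the output file group `OCR-D-OCR`. The input group `OCR-D-SEG-REG` is assumed to already contain region segmentation. Since recognition is disabled when `model` is empty or not set at all, the sketch simply omits the parameter in that case.

```shell
#!/bin/sh
# Hypothetical helper: print the command (dry run) unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# segment_lines INPUT_GROUP OUTPUT_GROUP [MODEL]
# Segments existing regions into lines; recognizes text only if MODEL is given.
segment_lines() {
    in_grp=$1
    out_grp=$2
    model=${3:-}
    set -- ocrd-tesserocr-recognize -I "$in_grp" -O "$out_grp" \
        -P segmentation_level line -P textequiv_level line
    if [ -n "$model" ]; then
        set -- "$@" -P model "$model"
    fi
    run "$@"
}

# Layout analysis only (no model, so no recognition)
segment_lines OCR-D-SEG-REG OCR-D-SEG-LINE
# Line segmentation plus recognition with the Fraktur model
segment_lines OCR-D-SEG-REG OCR-D-OCR Fraktur
```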

For detailed descriptions of behaviour and options, see tesserocr's README and the --help output of ocrd-tesserocr-recognize/segment/segment-region/segment-table/segment-line/segment-word.

Available processors

Processor	Parameter	Remarks	Call
ocrd-tesserocr-segment	`-P find_tables false -P shrink_polygons true`	Recommended. Will reuse internal tesseract iterators to produce a complete segmentation with tight polygons instead of bounding boxes where possible	`ocrd-tesserocr-segment -I OCR-D-DEWARP-PAGE -O OCR-D-SEG -P find_tables false -P shrink_polygons true`
ocrd-eynollah-segment	`-P models`	Models can be found here or downloaded with the OCR-D resource manager; if you didn't download the model with `resmgr`, you need to pass its absolute path on your hard drive as the value of `models`.	`ocrd-eynollah-segment -I OCR-D-IMG -O OCR-D-SEG -P models default`
ocrd-sbb-textline-detector	`-P model modelname`	Models can be found here or downloaded with the OCR-D resource manager; if you didn't download the model with `resmgr`, you need to pass its local filesystem path as the value of `model`.	`ocrd-sbb-textline-detector -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-LINE -P model /path/to/model`
ocrd-cis-ocropy-segment	`-P level-of-operation page`		`ocrd-cis-ocropy-segment -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-LINE -P level-of-operation page`
ocrd-tesserocr-segment-region	`-P find_tables false`	Recommended	`ocrd-tesserocr-segment-region -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-REG -P find_tables false -P shrink_polygons true`
ocrd-segment-repair	`-P plausibilize true`	Only to be used after `ocrd-tesserocr-segment-region`	`ocrd-segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -P plausibilize true`
ocrd-anybaseocr-block-segmentation	`-P block_segmentation_model mrcnn_name -P block_segmentation_weights /path/to/model/block_segmentation_weights.h5`	For available models, take a look at this site or download them via the OCR-D resource manager; if you didn't use `resmgr`, you need to pass the local filesystem path as parameter value.	`ocrd-anybaseocr-block-segmentation -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-REG -P block_segmentation_model mrcnn_name -P block_segmentation_weights /path/to/model/block_segmentation_weights.h5`
ocrd-pc-segmentation			`ocrd-pc-segmentation -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-REG`
ocrd-detectron2-segment		For available models, any model for Detectron2 forks trained on document layout analysis datasets can be integrated; instructions and examples can be found here

Notes on parameter usage

E.g.

which parameters do you use with what values?
which parameters are insufficiently documented?
which aspects of a processor should be parameterizable but are not?

Notes on document-specific usage

E.g. which processors worked best with what material? -- feel free to post sample images here, too.

ocrd-tesserocr-segment-region tends to produce floating_regions on non-standard layouts such as lists, as found e.g. in newspapers. It also struggles with multi-column texts like http://tudigit.ulb.tu-darmstadt.de/show/Gue-11660-24

ocrd-sbb-textline-detector does not segment into headings, paragraphs and regions, but is quite good at finding text regions.

ocrd-tesserocr-segment-region with -P find_tables true subsequently needs a separate segmentation step for the table regions using ocrd-tesserocr-segment-table (see https://github.com/OCR-D/ocrd_tesserocr/issues/134 and https://github.com/OCR-D/ocrd_all/issues/190 for details) or ocrd-cis-ocropy-segment with -P level-of-operation table.
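The table workflow described above might be chained as in the following sketch. The processor names and the `find_tables` and `level-of-operation` parameters come from the text. The output file group `OCR-D-SEG-TAB`, the `segment_with_tables` function and the `run` dry-run helper are hypothetical and only serve as illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the command (dry run) unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# segment_with_tables [tesserocr|ocropy]
# Detect regions including tables, then segment the table regions
# in a separate step with the chosen processor.
segment_with_tables() {
    backend=${1:-tesserocr}
    case "$backend" in
        tesserocr|ocropy) ;;
        *)
            echo "unknown table segmenter: $backend" >&2
            return 1
            ;;
    esac
    run ocrd-tesserocr-segment-region -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-REG \
        -P find_tables true || return 1
    if [ "$backend" = "tesserocr" ]; then
        run ocrd-tesserocr-segment-table -I OCR-D-SEG-REG -O OCR-D-SEG-TAB
    else
        run ocrd-cis-ocropy-segment -I OCR-D-SEG-REG -O OCR-D-SEG-TAB \
            -P level-of-operation table
    fi
}

segment_with_tables tesserocr
```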
