Workflow Guide region segmentation
In this processing step, an (optimized) document image is taken as input and segmented into its regions, including columns. The segments are also classified, either coarsely (text, separator, image, table, ...) or in a fine-grained way (paragraph, marginalia, heading, ...).
Note: The ocrd-tesserocr-segment, ocrd-tesserocr-recognize, ocrd-eynollah-segment, ocrd-sbb-textline-detector and ocrd-cis-ocropy-segment processors segment not only the page into regions, but also the detected text regions into text lines, all in one step. With those (and only with those!) processors you therefore do not need to segment into lines in an extra step and can continue with step 13 - line-level dewarping.
Note: If you use ocrd-tesserocr-segment-region, which uses only bounding boxes instead of polygon coordinates, then you should post-process via ocrd-segment-repair with plausibilize=True to obtain better results without large overlaps. Alternatively, consider using the all-in-one capabilities of ocrd-tesserocr-segment and ocrd-tesserocr-recognize, which can do region segmentation and line segmentation (and optionally also text recognition) in one step by querying Tesseract's internal iterator (accessing the more precise polygon outlines instead of just coarse bounding boxes with lots of hard-to-recover overlap). As a further alternative, run ocrd-tesserocr-segment-region with shrink_polygons=True (accessing that same iterator to calculate convex hull polygons).
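The bounding-box route from the note above (region segmentation followed by repair) could be sketched as a two-step shell sequence. This is only a sketch under assumptions: the file group names follow the example calls used elsewhere in this guide, and the `run` helper, the `DRY_RUN` variable and the `segment_and_repair` function are hypothetical names introduced for illustration. By default the helper only prints the commands, so they can be reviewed before anything is executed.

```shell
# Hypothetical helper: print the command by default, execute it only if DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Segment regions (bounding boxes only), then repair large overlaps.
# $1: input file group, $2: region file group, $3: repaired file group
segment_and_repair() {
    run ocrd-tesserocr-segment-region -I "$1" -O "$2" -P find_tables false &&
    run ocrd-segment-repair -I "$2" -O "$3" -P plausibilize true
}

# File group names are assumptions borrowed from this guide's example calls.
segment_and_repair OCR-D-DEWARP-PAGE OCR-D-SEG-REG OCR-D-SEG-REPAIR
```

Running the same call with `DRY_RUN=0` inside an OCR-D workspace would execute both processors instead of printing them.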
Note: All the ocrd-tesserocr-segment* processors internally delegate to ocrd-tesserocr-recognize, so you can replace calls to these task-specific processors with calls to ocrd-tesserocr-recognize with specific parameters:
processor call | ocrd-tesserocr-recognize parameters |
---|---|
ocrd-tesserocr-segment-region -P overwrite_regions true | ocrd-tesserocr-recognize -P textequiv_level region -P segmentation_level region -P overwrite_segments true |
ocrd-tesserocr-segment-table -P overwrite_cells true | ocrd-tesserocr-recognize -P textequiv_level cell -P segmentation_level cell -P overwrite_segments true |
ocrd-tesserocr-segment-line -P overwrite_lines true | ocrd-tesserocr-recognize -P textequiv_level line -P segmentation_level line -P overwrite_segments true |
ocrd-tesserocr-segment-word -P overwrite_words true | ocrd-tesserocr-recognize -P textequiv_level word -P segmentation_level word -P overwrite_segments true |
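The mapping in the table above is regular enough to be written as a small lookup. The following sketch prints the equivalent ocrd-tesserocr-recognize call for a task-specific processor that was invoked with its `overwrite_*` parameter set to true; the function name `equivalent_recognize_call` is a hypothetical helper, not part of ocrd_tesserocr, and input/output file groups are left out.

```shell
# Hypothetical helper: print the ocrd-tesserocr-recognize call that corresponds
# to a task-specific processor, following the table above.
# Note: sets the global variable "level" as a side effect.
equivalent_recognize_call() {
    case "$1" in
        ocrd-tesserocr-segment-region) level=region ;;
        ocrd-tesserocr-segment-table)  level=cell ;;
        ocrd-tesserocr-segment-line)   level=line ;;
        ocrd-tesserocr-segment-word)   level=word ;;
        *) echo "no mapping known for: $1" >&2; return 1 ;;
    esac
    echo "ocrd-tesserocr-recognize -P textequiv_level $level -P segmentation_level $level -P overwrite_segments true"
}

equivalent_recognize_call ocrd-tesserocr-segment-line
```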
Note: The three parameters segmentation_level, textequiv_level and model define the behavior of ocrd-tesserocr-recognize:
- segmentation_level determines the highest level to segment. Use "none" to disable segmentation altogether, i.e. only recognize existing segments.
- textequiv_level determines the lowest level to segment. Use "none" to segment until the lowest level ("glyph") and disable recognition altogether, i.e. only analyse layout.
- model determines the model to use for text recognition. Use "" or do not set it at all to disable recognition, i.e. only analyse layout.
Examples:
- To segment existing regions into lines (and only lines) without recognizing text: segmentation_level="line", textequiv_level="line", model=""
- To segment existing regions into lines (and only lines) and recognize text: segmentation_level="line", textequiv_level="line", model="Fraktur"
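The two examples above can also be written as JSON parameter sets, which OCR-D processors accept through the `-p` option. The sketch below only generates the JSON; the helper name `line_params` and the file group names in the commented call are assumptions made for illustration, and the helper assumes the model name contains no quotes or backslashes.

```shell
# Hypothetical helper: print a JSON parameter set for ocrd-tesserocr-recognize
# that segments existing regions into lines (and only lines).
# $1: recognition model name; pass an empty string to disable recognition
line_params() {
    printf '{"segmentation_level": "line", "textequiv_level": "line", "model": "%s"}\n' "$1"
}

line_params ""          # layout analysis only
line_params "Fraktur"   # line segmentation plus text recognition

# Possible use inside a workspace (file group names are assumptions):
# ocrd-tesserocr-recognize -I OCR-D-SEG-REG -O OCR-D-SEG-LINE -p "$(line_params Fraktur)"
```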
For detailed descriptions of behaviour and options, see tesserocr's README and the output of ocrd-tesserocr-recognize/segment/segment-region/segment-table/segment-line/segment-word --help.
Processor | Parameter | Remarks | Call |
---|---|---|---|
ocrd-tesserocr-segment | -P find_tables false -P shrink_polygons true | Recommended. Reuses Tesseract's internal iterators to produce a complete segmentation with tight polygons instead of bounding boxes where possible. | ocrd-tesserocr-segment -I OCR-D-DEWARP-PAGE -O OCR-D-SEG -P find_tables false -P shrink_polygons true |
ocrd-eynollah-segment | -P models | Models can be found here or downloaded with the OCR-D resource manager; if you did not download the model with resmgr, you need to pass its absolute path on your hard drive as the parameter value. | ocrd-eynollah-segment -I OCR-D-IMG -O OCR-D-SEG -P models default |
ocrd-sbb-textline-detector | -P model modelname | Models can be found here or downloaded with the OCR-D resource manager; if you did not download the model with resmgr, you need to pass its local filesystem path as the parameter value. | ocrd-sbb-textline-detector -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-LINE -P model /path/to/model |
ocrd-cis-ocropy-segment | -P level-of-operation page | | ocrd-cis-ocropy-segment -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-LINE -P level-of-operation page |
ocrd-tesserocr-segment-region | -P find_tables false | Recommended | ocrd-tesserocr-segment-region -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-REG -P find_tables false -P shrink_polygons true |
ocrd-segment-repair | -P plausibilize true | Only to be used after ocrd-tesserocr-segment-region | ocrd-segment-repair -I OCR-D-SEG-REG -O OCR-D-SEG-REPAIR -P plausibilize true |
ocrd-anybaseocr-block-segmentation | -P block_segmentation_model mrcnn_name -P block_segmentation_weights /path/to/model/block_segmentation_weights.h5 | For available models, take a look at this site or download them via the OCR-D resource manager; if you did not use resmgr, you need to pass the local filesystem path as the parameter value. | ocrd-anybaseocr-block-segmentation -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-REG -P block_segmentation_model mrcnn_name -P block_segmentation_weights /path/to/model/block_segmentation_weights.h5 |
ocrd-pc-segmentation | | | ocrd-pc-segmentation -I OCR-D-DEWARP-PAGE -O OCR-D-SEG-REG |
ocrd-detectron2-segment | | Any model for Detectron2 forks trained on document layout analysis datasets can be integrated; instructions and examples can be found here. | |
E.g.
- which parameters do you use with what values?
- which parameters are insufficiently documented?
- which aspects of a processor should be parameterizable but are not?
E.g. which processors worked best with what material? -- feel free to post sample images here, too.
ocrd-tesserocr-segment-region tends to produce floating_regions on non-standard layouts such as lists, e.g. as found in newspapers. It also struggles with multicolumn texts like http://tudigit.ulb.tu-darmstadt.de/show/Gue-11660-24
ocrd-sbb-textline-detector does no segmentation into headings, paragraphs and regions, but is quite good at finding text regions.
ocrd-tesserocr-segment-region with -P find_tables true subsequently needs a separate segmentation step for the table regions, using either ocrd-tesserocr-segment-table (see https://github.com/OCR-D/ocrd_tesserocr/issues/134 and https://github.com/OCR-D/ocrd_all/issues/190 for details) or ocrd-cis-ocropy-segment with -P level-of-operation table.
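The two-step table handling described above could look roughly like the following. This is a sketch, not a tested workflow: the function name `table_segmentation_commands` is hypothetical, the file group names (in particular OCR-D-SEG-TABLE) are assumptions, and the function only prints the calls so they can be checked before running them.

```shell
# Hypothetical helper: print the two calls needed when table detection is on.
# $1: input file group, $2: region file group, $3: table cell file group
table_segmentation_commands() {
    # Step 1: region segmentation with table detection enabled
    echo "ocrd-tesserocr-segment-region -I $1 -O $2 -P find_tables true"
    # Step 2: separate segmentation of the detected table regions
    echo "ocrd-tesserocr-segment-table -I $2 -O $3"
}

# Review the printed commands first; append "| sh" inside an OCR-D workspace to run them.
table_segmentation_commands OCR-D-DEWARP-PAGE OCR-D-SEG-REG OCR-D-SEG-TABLE
```

For the ocropy alternative, the second call would instead use ocrd-cis-ocropy-segment with -P level-of-operation table.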