Migrate ocrd v3 #216

Open
wants to merge 35 commits into
base: master
Changes from 4 commits
Commits
4f98e6d
adapt to ocrd v3 Processor init (automatic ocrd-tool.json loading)
bertsky Jul 6, 2024
a9168e0
tests: adapt to ocrd v3 init (setup only via run_processor)
bertsky Jul 6, 2024
eb661f4
adapt to ocrd v3 (process→process_page_pcgts)…
bertsky Jul 7, 2024
95d2837
require ocrd>=3.0
bertsky Jul 7, 2024
47dee36
ocrd-tool.json: add cardinality specs
bertsky Aug 13, 2024
e9d562b
require ocrd 3.0 prerelease
bertsky Aug 13, 2024
f6c5ea0
binarize: use final v3 API
bertsky Aug 15, 2024
3fd8265
crop: adapt to final v3 API
bertsky Aug 15, 2024
a66fbbe
deskew: adapt to final v3 API
bertsky Aug 15, 2024
ae10667
fontshape: adapt to final v3 API
bertsky Aug 15, 2024
4c22245
recognize: use final v3 API
bertsky Aug 15, 2024
491003f
segment: adapt to final v3 API
bertsky Aug 16, 2024
0adfdee
segment_line: adapt to final v3 API
bertsky Aug 16, 2024
1d7efa5
segment_region: adapt to final v3 API
bertsky Aug 16, 2024
aadd01b
segment_table: adapt to final v3 API
bertsky Aug 16, 2024
f5099c7
segment_word: adapt to final v3 API
bertsky Aug 16, 2024
013de28
deskew: no segment.id for suffix on page level
bertsky Aug 16, 2024
ff258a3
CI: ex py37, in py311
bertsky Aug 16, 2024
276735b
adapt to v3 b1, replace inheritance w/ proxy pattern
bertsky Aug 25, 2024
7ae25a3
tests: adapt to etree in v3 b1
bertsky Aug 25, 2024
ef09995
require ocrd>=3.0.0b1
bertsky Aug 26, 2024
972ac76
test_recognize: also test with METS Server and METS caching
bertsky Aug 29, 2024
a0d7ffa
limit max_workers=1 (libtesseract is not thread-safe)
bertsky Aug 29, 2024
a406400
conftest: simplify
bertsky Aug 30, 2024
81fe66f
require ocrd>=3.0.0b3
bertsky Aug 30, 2024
4e7fa70
test_cli: use subprocess CLI instead of monkeypatching env for TESSDA…
bertsky Aug 31, 2024
b76a4f5
test: all in pytest call
bertsky Aug 31, 2024
c9b8f3a
test: do not skip failured pages
bertsky Aug 31, 2024
6d26cf0
require ocrd>=3.0.0b4
bertsky Sep 2, 2024
6ca668e
require ocrd>=3.0.0b6 (mp), unlimit max_workers
bertsky Oct 29, 2024
2a8b23b
test: simplify, use all configs in all tests
bertsky Oct 29, 2024
8dc5a4f
Merge branch 'master' into migrate-ocrd-v3
bertsky Oct 30, 2024
23d7f7f
CI: add RAM, more verbose
bertsky Oct 30, 2024
1a157a5
require core >= 3
kba Jan 20, 2025
e0e5e4d
update tesser{act,ocr}
kba Jan 20, 2025
3 changes: 0 additions & 3 deletions .pylintrc
Expand Up @@ -21,8 +21,5 @@ disable =
wrong-import-order,
duplicate-code

# allow indented whitespace (as required by interpreter):
no-space-check=empty-line

# allow non-snake-case identifiers:
good-names=n,i
4 changes: 4 additions & 0 deletions CHANGELOG.md
Expand Up @@ -5,6 +5,10 @@ Versioned according to [Semantic Versioning](http://semver.org/).

## Unreleased

Changed:

* adapt to ocrd 3.0, #216

## [0.19.1] - 2024-07-01

Fixed:
194 changes: 84 additions & 110 deletions ocrd_tesserocr/binarize.py
Expand Up @@ -6,35 +6,26 @@
PSM, RIL
)

from ocrd_utils import (
getLogger,
assert_file_grp_cardinality,
make_file_id,
MIMETYPE_PAGE
)
from ocrd_modelfactory import page_from_file
from ocrd_models.ocrd_page import (
AlternativeImageType,
TextRegionType,
to_xml
)

from .config import OCRD_TOOL
from .recognize import TesserocrRecognize

TOOL = 'ocrd-tesserocr-binarize'

class TesserocrBinarize(TesserocrRecognize):
def __init__(self, *args, **kwargs):
kwargs.setdefault('ocrd_tool', OCRD_TOOL['tools'][TOOL])
super().__init__(*args, **kwargs)
if hasattr(self, 'parameter'):
self.logger = getLogger('processor.TesserocrBinarize')
@property
def executable(self):
return 'ocrd-tesserocr-binarize'

def _init(self):
# use default model (eng) with vanilla tesserocr API
self.tessapi = PyTessBaseAPI()

def process(self):
def process_page_pcgts(self, pcgts, output_file_id=None, page_id=None):
Member

@kba kba Aug 11, 2024

Besides missing the typing, this has a different signature: pcgts is not variadic here. Python allows this, and it is convenient, but I am wondering whether it would be better to be consistent with the typing and signature?

Suggested change
def process_page_pcgts(self, pcgts, output_file_id=None, page_id=None):
def process_page_pcgts(self, *input_pcgts, output_file_id : Optional[str] = None, page_id : Optional[str] = None) -> OcrdPage:
pcgts = input_pcgts[0]

Collaborator Author

Typing – sure, I was just too lazy again.

Variadic – I thought it would be clearer like that. So we would not have to do any arity checking in the function itself – a type checker could simply detect invalid use cases which do pass multiple pages at once. But perhaps I am wrong. (Also, we already have the arity assertion in setup.)
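
To make the trade-off in this thread concrete, here is a minimal sketch that does not use ocrd at all. `Page`, `BaseProcessor`, `run`, `VariadicProcessor` and `SinglePageProcessor` are hypothetical stand-ins invented for illustration; only the two `process_page_pcgts` signatures are taken from the discussion above. It assumes the base class passes pages positionally and the other arguments by keyword.

```python
from typing import Optional


class Page:
    """Hypothetical stand-in for a parsed PAGE document (not the ocrd class)."""

    def __init__(self, page_id: str):
        self.page_id = page_id


class BaseProcessor:
    """Hypothetical base class exposing the variadic hook from the suggestion."""

    def process_page_pcgts(self, *input_pcgts: Optional[Page],
                           output_file_id: Optional[str] = None,
                           page_id: Optional[str] = None) -> Page:
        raise NotImplementedError

    def run(self, *pages: Page) -> Page:
        # Assumption: pages are passed positionally, the rest by keyword.
        return self.process_page_pcgts(
            *pages, output_file_id="OUT_0001", page_id=pages[0].page_id)


class VariadicProcessor(BaseProcessor):
    """Keeps the base signature; extra pages are silently ignored."""

    def process_page_pcgts(self, *input_pcgts, output_file_id=None, page_id=None):
        return input_pcgts[0]


class SinglePageProcessor(BaseProcessor):
    """Narrows the signature to exactly one page, as in the PR."""

    def process_page_pcgts(self, pcgts, output_file_id=None, page_id=None):
        return pcgts


first, second = Page("p1"), Page("p2")
# Both variants behave the same for a single input page.
assert VariadicProcessor().run(first) is first
assert SinglePageProcessor().run(first) is first
# With two pages, the variadic variant quietly uses the first one ...
assert VariadicProcessor().run(first, second) is first
# ... while the narrowed variant rejects the call with a TypeError.
try:
    SinglePageProcessor().run(first, second)
except TypeError as err:
    print("rejected:", err)
```

Under these assumptions the narrowed signature fails loudly at call time when more than one page is passed, whereas the variadic form needs an explicit arity check (or the cardinality assertion mentioned above) to catch the same mistake.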

"""Performs binarization of the region / line with Tesseract on the workspace.

Open and deserialize PAGE input files and their respective images,
Open and deserialize PAGE input file and its respective images,
then iterate over the element hierarchy down to the requested level.

Set up Tesseract to recognize the segment image's layout, and get
Expand All @@ -47,109 +38,92 @@ def process(self):

Produce a new output file by serialising the resulting hierarchy.
"""
assert_file_grp_cardinality(self.input_file_grp, 1)
assert_file_grp_cardinality(self.output_file_grp, 1)

sepmask = self.parameter['tiseg']
oplevel = self.parameter['operation_level']

with PyTessBaseAPI() as tessapi:
for n, input_file in enumerate(self.input_files):
file_id = make_file_id(input_file, self.output_file_grp)
page_id = input_file.pageId or input_file.ID
self.logger.info("INPUT FILE %i / %s", n, page_id)
pcgts = page_from_file(self.workspace.download_file(input_file))
self.add_metadata(pcgts)
page = pcgts.get_Page()
page_image, page_xywh, page_image_info = self.workspace.image_from_page(
page, page_id)
if self.parameter['dpi'] > 0:
dpi = self.parameter['dpi']
self.logger.info("Page '%s' images will use %d DPI from parameter override", page_id, dpi)
elif page_image_info.resolution != 1:
dpi = page_image_info.resolution
if page_image_info.resolutionUnit == 'cm':
dpi = round(dpi * 2.54)
self.logger.info("Page '%s' images will use %d DPI from image meta-data", page_id, dpi)
else:
dpi = 0
self.logger.info("Page '%s' images will use DPI estimated from segmentation", page_id)
tessapi.SetVariable('user_defined_dpi', str(dpi))
self.logger.info("Binarizing on '%s' level in page '%s'", oplevel, page_id)
page = pcgts.get_Page()
page_image, page_xywh, page_image_info = self.workspace.image_from_page(
page, page_id)
if self.parameter['dpi'] > 0:
dpi = self.parameter['dpi']
self.logger.info("Page '%s' images will use %d DPI from parameter override", page_id, dpi)
elif page_image_info.resolution != 1:
dpi = page_image_info.resolution
if page_image_info.resolutionUnit == 'cm':
dpi = round(dpi * 2.54)
self.logger.info("Page '%s' images will use %d DPI from image meta-data", page_id, dpi)
else:
dpi = 0
self.logger.info("Page '%s' images will use DPI estimated from segmentation", page_id)
self.tessapi.SetVariable('user_defined_dpi', str(dpi))
self.logger.info("Binarizing on '%s' level in page '%s'", oplevel, page_id)

if oplevel == 'page':
tessapi.SetPageSegMode(PSM.AUTO_ONLY)
tessapi.SetImage(page_image)
if sepmask:
# will trigger FindLines() → SegmentPage() → AutoPageSeg()
# → SetupPageSegAndDetectOrientation() → FindAndRemoveLines() + FindImages()
tessapi.AnalyseLayout()
page_image_bin = tessapi.GetThresholdedImage()
if page_image_bin:
# update METS (add the image file):
file_path = self.workspace.save_image_file(page_image_bin,
file_id + '.IMG-BIN',
page_id=input_file.pageId,
file_grp=self.output_file_grp)
# update PAGE (reference the image file):
features = page_xywh['features'] + ",binarized"
if sepmask:
features += ",clipped"
page.add_AlternativeImage(AlternativeImageType(
filename=file_path, comments=features))
else:
self.logger.error('Cannot binarize %s', "page '%s'" % page_id)
else:
regions = page.get_TextRegion() + page.get_TableRegion()
if not regions:
self.logger.warning("Page '%s' contains no text regions", page_id)
for region in regions:
region_image, region_xywh = self.workspace.image_from_segment(
region, page_image, page_xywh)
if oplevel == 'region':
tessapi.SetPageSegMode(PSM.SINGLE_BLOCK)
self._process_segment(tessapi, RIL.BLOCK, region, region_image, region_xywh,
"region '%s'" % region.id, input_file.pageId,
file_id + '_' + region.id)
elif isinstance(region, TextRegionType):
lines = region.get_TextLine()
if not lines:
self.logger.warning("Page '%s' region '%s' contains no text lines",
page_id, region.id)
for line in lines:
line_image, line_xywh = self.workspace.image_from_segment(
line, region_image, region_xywh)
tessapi.SetPageSegMode(PSM.SINGLE_LINE)
self._process_segment(tessapi, RIL.TEXTLINE, line, line_image, line_xywh,
"line '%s'" % line.id, input_file.pageId,
file_id + '_' + region.id + '_' + line.id)
if oplevel == 'page':
image = self._process_segment(-1, page, page_image, page_xywh,
page_id, output_file_id)
if image:
return [pcgts, image]
else:
return pcgts

file_id = make_file_id(input_file, self.output_file_grp)
pcgts.set_pcGtsId(file_id)
self.workspace.add_file(
file_id=file_id,
file_grp=self.output_file_grp,
page_id=input_file.pageId,
mimetype=MIMETYPE_PAGE,
local_filename=os.path.join(self.output_file_grp,
file_id + '.xml'),
content=to_xml(pcgts))
result = [pcgts]
regions = page.get_AllRegions(classes=['Text', 'Table'])
if not regions:
self.logger.warning("Page '%s' contains no text regions", page_id)
for region in regions:
region_image, region_xywh = self.workspace.image_from_segment(
region, page_image, page_xywh)
if oplevel == 'region':
image = self._process_segment(RIL.BLOCK, region, region_image, region_xywh,
"region '%s'" % region.id,
output_file_id + '_' + region.id)
if image:
result.append(image)
elif isinstance(region, TextRegionType):
lines = region.get_TextLine()
if not lines:
self.logger.warning("Page '%s' region '%s' contains no text lines",
page_id, region.id)
for line in lines:
line_image, line_xywh = self.workspace.image_from_segment(
line, region_image, region_xywh)
image = self._process_segment(RIL.TEXTLINE, line, line_image, line_xywh,
"line '%s'" % line.id,
output_file_id + '_' + region.id + '_' + line.id)
if image:
result.append(image)

def _process_segment(self, tessapi, ril, segment, image, xywh, where, page_id, file_id):
tessapi.SetImage(image)
return result

def _process_segment(self, ril, segment, image, xywh, where, file_id):
self.tessapi.SetImage(image)
features = xywh['features'] + ",binarized"
image_bin = None
layout = tessapi.AnalyseLayout()
if layout:
image_bin = layout.GetBinaryImage(ril)
if ril == -1:
# page level
self.tessapi.SetPageSegMode(PSM.AUTO_ONLY)
if self.parameter['tiseg']:
features += ",clipped"
# will trigger FindLines() → SegmentPage() → AutoPageSeg()
# → SetupPageSegAndDetectOrientation() → FindAndRemoveLines() + FindImages()
self.tessapi.AnalyseLayout()
image_bin = self.tessapi.GetThresholdedImage()
else:
if ril == RIL.BLOCK:
self.tessapi.SetPageSegMode(PSM.SINGLE_BLOCK)
if ril == RIL.TEXTLINE:
self.tessapi.SetPageSegMode(PSM.SINGLE_LINE)
layout = self.tessapi.AnalyseLayout()
if layout:
image_bin = layout.GetBinaryImage(ril)
if not image_bin:
self.logger.error('Cannot binarize %s', where)
return
return False
# update METS (add the image file):
file_path = self.workspace.save_image_file(image_bin,
file_id + '.IMG-BIN',
page_id=page_id,
file_grp=self.output_file_grp)
file_id += '.IMG-BIN'
file_path = os.path.join(self.output_file_grp, file_id + '.png')
# update PAGE (reference the image file):
features = xywh['features'] + ",binarized"
segment.add_AlternativeImage(AlternativeImageType(
filename=file_path, comments=features))
return image_bin, file_id, file_path
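
For readers skimming the flattened diff, the DPI selection in `process_page_pcgts` can be summarised as a small standalone function. This is only a sketch of the branching visible in the diff: `resolve_dpi` and its parameter names are hypothetical, and the integer coercion of the result is an added simplification, not part of the PR.

```python
def resolve_dpi(param_dpi: float, resolution: float, resolution_unit: str) -> int:
    """Pick the DPI value passed to Tesseract (hypothetical helper).

    Mirrors the order of precedence in the diff:
    1. a positive `dpi` parameter overrides everything,
    2. otherwise image metadata is used unless the resolution is the
       placeholder value 1 (pixels per cm are converted to pixels per inch),
    3. otherwise 0, which leaves estimation to Tesseract.
    """
    if param_dpi > 0:
        return int(param_dpi)
    if resolution != 1:
        dpi = resolution
        if resolution_unit == 'cm':
            # 1 inch = 2.54 cm
            dpi = round(dpi * 2.54)
        return int(dpi)
    return 0


print(resolve_dpi(300, 72, 'inches'))  # parameter override
print(resolve_dpi(0, 118, 'cm'))       # converted from pixels per cm
print(resolve_dpi(0, 1, 'inches'))     # no usable metadata
```

In the diff, the resulting value is then handed to Tesseract via `SetVariable('user_defined_dpi', str(dpi))`.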