tesserocr-deskew - directory $TESSDATA_PREFIX ? #351

Closed
jbarth-ubhd opened this issue Mar 15, 2023 · 12 comments · Fixed by #353

Comments

@jbarth-ubhd

jbarth-ubhd commented Mar 15, 2023

PS: ocrd.sif is from ocrd/all:2022-08-15

> singularity exec -e --env-file /home/hd/hd_hd/hd_wu120/ocrd.env --env MAGICK_TEMPORARY_PATH=/scratch/hd_wu120_job_700507_p01n10 --env TMPDIR=/scratch/hd_wu120_job_700507_p01n10 --env TESSDATA_PREFIX=/home/hd/hd_hd/hd_wu120/ocrd_models/tessdata /home/hd/hd_hd/hd_wu120/ocrd.sif ocrd-tesserocr-deskew -P operation_level page -I OCR-D-003 -O OCR-D-004
GID: readonly variable
UID: readonly variable
Traceback (most recent call last):
  File "/usr/local/bin/ocrd-tesserocr-deskew", line 8, in <module>
    sys.exit(ocrd_tesserocr_deskew())
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/ocrd_tesserocr/cli.py", line 58, in ocrd_tesserocr_deskew
    return ocrd_cli_wrap_processor(TesserocrDeskew, *args, **kwargs)
  File "/build/core/ocrd/ocrd/decorators/__init__.py", line 108, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/build/core/ocrd/ocrd/processor/helpers.py", line 88, in run_processor
    processor.process()
  File "/usr/local/lib/python3.6/site-packages/ocrd_tesserocr/deskew.py", line 68, in process
    psm=PSM.AUTO_OSD
  File "tesserocr.pyx", line 1219, in tesserocr.PyTessBaseAPI.__cinit__
    self._init_api(cpath, clang, oem, NULL, 0, NULL, NULL, False, psm)
  File "tesserocr.pyx", line 1233, in tesserocr.PyTessBaseAPI._init_api
    raise RuntimeError('Failed to init API, possibly an invalid tessdata path: {}'.format(path))
RuntimeError: Failed to init API, possibly an invalid tessdata path: /home/hd/hd_hd/hd_wu120/ocrd_models/tessdata
Command exited with non-zero status 1

But the directory does contain these files:

$ cd /home/hd/hd_hd/hd_wu120/ocrd_models/tessdata && find . -type f -printf "%-30p %9s\n"|sort
./configs/alto                        23
./configs/ambigs.train               146
./configs/api_config                  26
./configs/bigram                     129
./configs/box.train                  311
./configs/box.train.stderr           311
./configs/digits                      37
./configs/get.images                  24
./configs/hocr                        40
./configs/inter                       59
./configs/kannada                    101
./configs/linebox                     70
./configs/logfile                     25
./configs/lstmbox                     26
./configs/lstmdebug                   98
./configs/lstm.train                 282
./configs/makebox                     26
./configs/pdf                         22
./configs/quiet                       21
./configs/rebox                       65
./configs/strokewidth                377
./configs/tsv                         22
./configs/txt                        166
./configs/unlv                        45
./configs/wordstrbox                  29
./deu.traineddata                8628461
./eng.traineddata               15400601
./frak2021_1.069.traineddata     5060763
./fra.traineddata                3972885
./GT4HistOCR_50000000.997_191951.traineddata   4591424
./pdf.ttf                            572
./script/Latin.traineddata     101402885
./tessconfigs/batch                   49
./tessconfigs/batch.nochop            37
./tessconfigs/matdemo                243
./tessconfigs/msdemo                 368
./tessconfigs/nobatch                  1
./tessconfigs/segdemo                295
@bertsky
Collaborator

bertsky commented Mar 15, 2023

Thanks @jbarth-ubhd for the detailed report and analysis.

Simple reason: osd.traineddata is missing. Used to get installed – checking why not.
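
This diagnosis can be checked up front, before a job is submitted. The following is a minimal sketch, assuming that the deskew step needs `osd.traineddata` directly inside the tessdata directory (as the reply above indicates). The helper name `missing_models` is hypothetical and not part of ocrd_tesserocr or tesserocr.

```python
import os
from pathlib import Path


def missing_models(tessdata_dir, required=("osd",)):
    """Return the names in `required` that lack a <name>.traineddata file
    directly inside tessdata_dir (hypothetical helper, not an OCR-D API)."""
    base = Path(tessdata_dir)
    return [name for name in required
            if not (base / f"{name}.traineddata").is_file()]


if __name__ == "__main__":
    # Assumption: the directory to check is the one passed via TESSDATA_PREFIX.
    prefix = os.environ.get("TESSDATA_PREFIX")
    if prefix:
        print("missing models:", missing_models(prefix))
    else:
        print("TESSDATA_PREFIX is not set")
```

For the directory listing in the report, which contains `deu`, `eng` and `fra` but no `osd`, such a check would report `osd` as missing.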

@bertsky
Collaborator

bertsky commented Mar 15, 2023

Got it!

TESSDATA = $(VIRTUAL_ENV)/share/tessdata/

… must now be $(VIRTUAL_ENV)/share/ocrd-resources/ocrd-tesserocr-recognize.

So we have a mismatch between the install-time location and the runtime/resmgr location.

@bertsky
Collaborator

bertsky commented Mar 15, 2023

> must now be $(VIRTUAL_ENV)/share/ocrd-resources/ocrd-tesserocr-recognize.

No, that would not work either, because we use configure --prefix=$(VIRTUAL_ENV), so Tesseract gets compiled with share/tessdata as its data directory.

Rather, there was a superfluous environment variable override:

ENV TESSDATA_PREFIX $XDG_DATA_HOME/ocrd-resources/ocrd-tesserocr-recognize

@jbarth-ubhd
Author

jbarth-ubhd commented Mar 16, 2023

Just wanted to check ocrd resmgr list-available on my workstation (Ubuntu 20.04 with Docker; Docker pulled a lot of files for ocrd/all):

jb@pers16:~> alias docker_ocrd
alias docker_ocrd='sudo docker run --user $(id -u) --workdir /data --volume $PWD/data:/data --volume $PWD/models:/usr/local/share/ocrd-resources ocrd/all'

jb@pers16:~> docker_ocrd ocrd resmgr list-available
Traceback (most recent call last):
  File "/usr/local/bin/ocrd", line 33, in <module>
    sys.exit(load_entry_point('ocrd', 'console_scripts', 'ocrd')())
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/build/core/ocrd/ocrd/cli/resmgr.py", line 47, in list_available
    resmgr = OcrdResourceManager()
  File "/build/core/ocrd/ocrd/resource_manager.py", line 34, in __init__
    self.user_list.parent.mkdir(parents=True)
  File "/usr/lib/python3.6/pathlib.py", line 1248, in mkdir
    self._accessor.mkdir(self, mode)
  File "/usr/lib/python3.6/pathlib.py", line 387, in wrapped
    return strfunc(str(pathobj), *args)
PermissionError: [Errno 13] Permission denied: '/.config/ocrd'

@jbarth-ubhd
Author

ah... with --volume $PWD/.config:/.config it works

jb@pers16:~> sudo docker run --user $(id -u) --workdir /data --volume $PWD/data:/data --volume $PWD/models:/usr/local/share/ocrd-resources --volume $PWD/.config:/.config ocrd/all ocrd resmgr list-available
ocrd-tesserocr-recognize
- Fraktur_GT4HistOCR.traineddata  (https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/Fraktur_5000000/tessdata_fast/Fraktur_50000000.334_450937.traineddata)
  Tesseract LSTM model trained on GT4HistOCR
- ONB.traineddata  (https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/data/ONB/tessdata_best/ONB_1.195_300718_989100.traineddata)
  Tesseract LSTM model based on Austrian National Library newspaper data
  Tesseract LSTM model based on Austrian National Library newspaper data
- equ.traineddata  (https://github.com/tesseract-ocr/tessdata_fast/raw/main/equ.traineddata)
  Tesseract equ model
...
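
A plausible reading of the PermissionError above, offered as an assumption rather than something confirmed in this thread: with `--user $(id -u)` the container user typically has no home directory of its own, so HOME ends up as `/` and the OCR-D config directory resolves to `/.config/ocrd`, which is not writable unless a volume is mounted there. The sketch below only mimics the usual XDG lookup order; `ocrd_config_dir` is a hypothetical helper, not the function OCR-D core uses.

```python
from pathlib import PurePosixPath


def ocrd_config_dir(env):
    """Resolve the config directory the way an XDG-style lookup would:
    $XDG_CONFIG_HOME/ocrd if set, otherwise $HOME/.config/ocrd.
    (Assumed lookup order; hypothetical helper.)"""
    xdg = env.get("XDG_CONFIG_HOME")
    if xdg:
        base = PurePosixPath(xdg)
    else:
        # An unset or empty HOME is treated as "/" here (assumption).
        base = PurePosixPath(env.get("HOME") or "/") / ".config"
    return base / "ocrd"


print(ocrd_config_dir({"HOME": "/"}))                      # /.config/ocrd
print(ocrd_config_dir({"XDG_CONFIG_HOME": "/data/.config"}))  # /data/.config/ocrd
```

Under this reading, mounting a writable host directory at `/.config`, as done above, makes the resolved path writable.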

@jbarth-ubhd
Author

... almost

jb@pers16:~> docker_ocrd ocrd resmgr download ocrd-tesserocr-recognize configs
12:30:17.190 INFO ocrd.cli.resmgr - Downloading resource {'url': 'https://github.com/tesseract-ocr/tesseract/archive/main.tar.gz', 'name': 'configs', 'description': 'Tesseract configs (parameter sets) for use with the standalone tesseract CLI', 'size': 1915529, 'type': 'tarball', 'path_in_archive': 'tesseract-main/tessdata/configs', 'parameter_usage': 'as-is', 'version_range': '>= 0.0.1'}
12:30:17.193 INFO ocrd.resource_manager._download_impl - Downloading https://github.com/tesseract-ocr/tesseract/archive/main.tar.gz to download.tar.xx
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connection.py", line 175, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/usr/local/lib/python3.6/dist-packages/urllib3/util/connection.py", line 72, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution
...

Is this caused by my Ubuntu 20.04 setup with dnsmasq in NetworkManager.conf?

root@pers16:/home/jb# cat /etc/NetworkManager/NetworkManager.conf
[main]
plugins=ifupdown,keyfile,ofono
dns=dnsmasq

no-auto-default=00:01:02:12:40:C5,00:21:9B:5E:BE:17,90:1B:0E:42:7D:AE,

[ifupdown]
managed=false

@jbarth-ubhd
Author

sudo docker run --dns A.B.C.D ... helped.

@jbarth-ubhd
Author

BTW no osd.traineddata in ~/models/ocrd-tesserocr-recognize/

@bertsky
Collaborator

bertsky commented Mar 16, 2023

> ah... with --volume $PWD/.config:/.config it works

yes, sorry, we forgot to document this on https://ocr-d.de/en/models#models-and-docker

now tracking under OCR-D/ocrd-website#318

@bertsky
Collaborator

bertsky commented Mar 16, 2023

> BTW no osd.traineddata in ~/models/ocrd-tesserocr-recognize/

As I said above (see the PR with the fix), TESSDATA_PREFIX must not be set at install time (make all or make install-tesseract).

@bertsky
Collaborator

bertsky commented Mar 16, 2023

> sudo docker run --dns A.B.C.D ... helped.

I remember seeing this problem before. Also happens at build-time (docker build). You can also try with --network=host or --network=bridge.
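
To compare these options, name resolution can be probed from inside the container before a download is attempted. This is a minimal sketch using only the Python standard library; `can_resolve` is a hypothetical helper, and the host name is simply the one from the failing download above.

```python
import socket


def can_resolve(host, port=443, resolver=socket.getaddrinfo):
    """Return True if `host` resolves, False if the lookup fails with
    socket.gaierror (the error shown in the traceback above).
    `resolver` is injectable so the check can be exercised offline."""
    try:
        resolver(host, port)
    except socket.gaierror:
        return False
    return True


if __name__ == "__main__":
    host = "github.com"
    print(host, "resolves" if can_resolve(host) else "does NOT resolve")
```

Running it once with the default network settings and once with `--dns` or `--network=host` shows whether the container's DNS setup is the culprit.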

@jbarth-ubhd
Author

sniff

@kba kba closed this as completed in #353 Mar 16, 2023