Got exception using ocrd_detectron2 with ocrd_all Release v2022-12-01 #15

Closed
stefanCCS opened this issue Dec 5, 2022 · 37 comments

@stefanCCS

I get an exception when running ocrd-detectron2-segment, as shown below. Please clarify (I can provide the workspace if needed):

(ocrd-3.7) ocrdadmin@ocrd-03:/mnt/OCRD/myData/Specials/Detectron2Test$ ocrd-detectron2-segment -I OCR-D-BIN -O ORD-D-REG-DETECTRON2 -p /home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_DocBank_X101.json
13:14:28.481 INFO processor.Detectron2Segment - Using compute device cpu
13:14:28.482 INFO processor.Detectron2Segment - Loading config '/home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.yaml'
Traceback (most recent call last):
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/bin/ocrd-detectron2-segment", line 8, in <module>
    sys.exit(ocrd_detectron2_segment())
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/cli.py", line 9, in ocrd_detectron2_segment
    return ocrd_cli_wrap_processor(Detectron2Segment, *args, **kwargs)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd/decorators/__init__.py", line 117, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 82, in run_processor
    parameter=parameter
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/segment.py", line 91, in __init__
    self.setup()
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/segment.py", line 116, in setup
    cfg.merge_from_file(temp_config)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/config/config.py", line 46, in merge_from_file
    loaded_cfg = self.load_yaml_with_base(cfg_filename, allow_unsafe=allow_unsafe)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/fvcore/common/config.py", line 61, in load_yaml_with_base
    cfg = yaml.safe_load(f)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/__init__.py", line 125, in safe_load
    return load(stream, SafeLoader)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/__init__.py", line 79, in load
    loader = Loader(stream)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/loader.py", line 34, in __init__
    Reader.__init__(self, stream)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/reader.py", line 85, in __init__
    self.determine_encoding()
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/reader.py", line 124, in determine_encoding
    self.update_raw()
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/reader.py", line 178, in update_raw
    data = self.stream.read(size)
  File "/usr/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 10: invalid start byte
@bertsky
Owner

bertsky commented Dec 5, 2022

Can you please show the contents of your model file /home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.yaml, and describe how you got (downloaded) it?

@stefanCCS
Author

It is a VERY BIG file:

(ocrd-3.7) ocrdadmin@ocrd-03:/mnt/OCRD/myData/Specials/Detectron2Test$ ll /home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.yaml
-rw-rw-r-- 1 ocrdadmin ocrdadmin 783884362 Dec  2 10:32 /home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.yaml

I did not download it in advance (I thought this was done automatically when I use -p).

So maybe I should run the following?

ocrd resmgr download ocrd-detectron2-segment DocBank_X101.yaml
ocrd resmgr download ocrd-detectron2-segment DocBank_X101.pth

Also, the output of ocrd-detectron2-segment -L now looks a bit strange to me (I would expect only JSON files, but I also see pth/yaml files):

(ocrd-3.7) ocrdadmin@ocrd-03:/mnt/OCRD/myData/Specials/Detectron2Test$ ocrd-detectron2-segment -L
/home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.pth
/home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.yaml
/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_DocBank_X101.json
/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_Math_R50.json
/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_NewspaperNavigator_R50.json
/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_PubLayNet_R101.json
/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_PubLayNet_R101_JPLeoRX.json
/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_PubLayNet_R50.json
/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_PubLayNet_R50_JPLeoRX.json
/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_PubLayNet_X101.json
/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_TableBank_X152.json

@kba
Contributor

kba commented Dec 5, 2022

I did not download it in advance (I thought this was done automatically when I use -p).

No, we have a PR for that (OCR-D/core#799), but it got delayed because it is difficult to test.

To me it looks like the data got corrupted during download.

Try

ocrd resmgr download --overwrite ocrd-detectron2-segment DocBank_X101.yaml

@bertsky
Owner

bertsky commented Dec 5, 2022

Also, the output of ocrd-detectron2-segment -L now looks a bit strange to me

No, that one seems correct.

I also believe some earlier download attempt must have been corrupted.

@stefanCCS
Author

That did not help; it is still a VERY BIG file.
I assume this is due to the ZIP source file.
See here:

(ocrd-3.7) ocrdadmin@ocrd-03:/mnt/OCRD/myData/Specials$ ocrd resmgr download --overwrite ocrd-detectron2-segment DocBank_X101.yaml
14:30:19.029 INFO ocrd.cli.resmgr - Downloading registered resource 'DocBank_X101.yaml' (https://layoutlm.blob.core.windows.net/docbank/model_zoo/X101.zip)
  [------------------------------------]    0%
14:30:22.528 INFO ocrd.resource_manager._download_impl - Downloading https://layoutlm.blob.core.windows.net/docbank/model_zoo/X101.zip to /home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.yaml
  [####################################]  100%
14:36:41.449 INFO ocrd.cli.resmgr - Installed resource https://layoutlm.blob.core.windows.net/docbank/model_zoo/X101.zip under /home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.yaml
14:36:41.449 INFO ocrd.cli.resmgr - Use in parameters as 'DocBank_X101.yaml'
(ocrd-3.7) ocrdadmin@ocrd-03:/mnt/OCRD/myData/Specials$ ll /home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.yaml
-rw-rw-r-- 1 ocrdadmin ocrdadmin 783884362 Dec  5 14:36 /home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.yaml

@kba
Contributor

kba commented Dec 5, 2022

That did not help; it is still a VERY BIG file.

That is to be expected, it's a ZIP file containing a huge neural network (model itself is 797 MiB).

But does it work now with the processor, i.e. has the bitflip been corrected by redownloading?

@stefanCCS
Author

Sorry, I forgot to mention: I still get the same exception:

(ocrd-3.7) ocrdadmin@ocrd-03:/mnt/OCRD/myData/Specials/Detectron2Test$ ocrd-detectron2-segment -I OCR-D-BIN -O ORD-D-REG-DETECTRON2 -p /home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_DocBank_X101.json
14:51:05.690 INFO processor.Detectron2Segment - Using compute device cpu
14:51:05.690 INFO processor.Detectron2Segment - Loading config '/home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.yaml'
Traceback (most recent call last):
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/bin/ocrd-detectron2-segment", line 8, in <module>
    sys.exit(ocrd_detectron2_segment())
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/cli.py", line 9, in ocrd_detectron2_segment
    return ocrd_cli_wrap_processor(Detectron2Segment, *args, **kwargs)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd/decorators/__init__.py", line 117, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 82, in run_processor
    parameter=parameter
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/segment.py", line 91, in __init__
    self.setup()
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/segment.py", line 116, in setup
    cfg.merge_from_file(temp_config)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/config/config.py", line 46, in merge_from_file
    loaded_cfg = self.load_yaml_with_base(cfg_filename, allow_unsafe=allow_unsafe)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/fvcore/common/config.py", line 61, in load_yaml_with_base
    cfg = yaml.safe_load(f)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/__init__.py", line 125, in safe_load
    return load(stream, SafeLoader)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/__init__.py", line 79, in load
    loader = Loader(stream)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/loader.py", line 34, in __init__
    Reader.__init__(self, stream)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/reader.py", line 85, in __init__
    self.determine_encoding()
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/reader.py", line 124, in determine_encoding
    self.update_raw()
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/reader.py", line 178, in update_raw
    data = self.stream.read(size)
  File "/usr/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 10: invalid start byte

@stefanCCS
Author

That did not help; it is still a VERY BIG file.

That is to be expected, it's a ZIP file containing a huge neural network (model itself is 797 MiB).

But does it work now with the processor, i.e. has the bitflip been corrected by redownloading?

This is still strange to me, as I would expect to get the unzipped YAML file (which should be a very small text file).

@stefanCCS
Author

Now I have used ocrd-detectron2-segment with resources that are NOT in a ZIP file,
and I got a different exception:

 ocrd-detectron2-segment -I OCR-D-BIN -O ORD-D-REG-DETECTRON2 -p /home/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_NewspaperNavigator_R50.js
Traceback (most recent call last):
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/bin/ocrd-detectron2-segment", line 8, in <module>
    sys.exit(ocrd_detectron2_segment())
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1054, in main
    with self.make_context(prog_name, args, **extra) as ctx:
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 920, in make_context
    self.parse_args(ctx, args)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1378, in parse_args
    value, args = param.handle_parse_result(ctx, opts, args)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 2360, in handle_parse_result
    value = self.process_value(ctx, value)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 2322, in process_value
    value = self.callback(ctx, self, value)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd/decorators/parameter_option.py", line 8, in _handle_param_option
    return parse_json_string_or_file(*list(value))
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_utils/str.py", line 179, in parse_json_string_or_file
    raise err       # pylint: disable=raising-bad-type
ValueError: Error parsing '/home/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_NewspaperNavigator_R50.js': Expecting value: line 1 column 1 (char 0)

@kba , @bertsky : If you like, we can do a video call, where I can show this directly ...

@bertsky
Owner

bertsky commented Dec 5, 2022

ocrd-detectron2-segment -I OCR-D-BIN -O ORD-D-REG-DETECTRON2 -p /home/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_NewspaperNavigator_R50.js

You misspelled the file extension: json, not js.
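
The message "Expecting value: line 1 column 1 (char 0)" is what Python's json module reports when it is handed a path-like string instead of JSON text. Below is a minimal sketch of that effect. It assumes, going only by the name parse_json_string_or_file in the traceback, that the -p value is read as a file when it exists and is otherwise parsed as an inline JSON string. The function `load_params` is a hypothetical stand-in, not the ocrd_utils implementation.

```python
import json
import os


def load_params(value):
    """Hypothetical stand-in: read JSON from a file if it exists, else parse the string itself."""
    if os.path.isfile(value):
        with open(value, encoding="utf-8") as f:
            return json.load(f)
    # A mistyped path ends up here and is parsed as if it were JSON text.
    return json.loads(value)


def explain(value):
    """Return the parsed parameters, or the parser's error message as a string."""
    try:
        return load_params(value)
    except json.JSONDecodeError as err:
        return str(err)
```

Under this assumption, a path that does not exist never produces a "file not found" error; it fails in the JSON parser instead, which is why the typo was easy to miss.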

@stefanCCS
Author

oops :-(

@bertsky
Owner

bertsky commented Dec 5, 2022

But does it work now with the processor, i.e. has the bitflip been corrected by redownloading?

This is still strange to me, as I would expect to get the unzipped YAML file (which should be a very small text file).

indeed, it should. Trying to reproduce with most recent version of ocrd_detectron2 (or did you say most recent version of ocrd_all?)...

@stefanCCS
Author

But does it work now with the processor, i.e. has the bitflip been corrected by redownloading?

This is still strange to me, as I would expect to get the unzipped YAML file (which should be a very small text file).

indeed, it should. Trying to reproduce with most recent version of ocrd_detectron2 (or did you say most recent version of ocrd_all?)...

The most recent version of ocrd_all (NOT your new version of ocrd-detectron2-segment)

@stefanCCS
Author

ocrd-detectron2-segment -I OCR-D-BIN -O ORD-D-REG-DETECTRON2 -p /home/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_NewspaperNavigator_R50.js

You misspelled the file extension: json, not js.

Sorry, next try - but still not working:

ocrd-detectron2-segment -I OCR-D-BIN -O ORD-D-REG-DETECTRON2 -p  /home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_NewspaperNavigator_R50.json
15:12:37.763 INFO processor.Detectron2Segment - Using compute device cpu
15:12:37.763 INFO processor.Detectron2Segment - Loading config '/home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/NewspaperNavigator_R_50_PFPN_3x.yaml'
Traceback (most recent call last):
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/bin/ocrd-detectron2-segment", line 8, in <module>
    sys.exit(ocrd_detectron2_segment())
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/cli.py", line 9, in ocrd_detectron2_segment
    return ocrd_cli_wrap_processor(Detectron2Segment, *args, **kwargs)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd/decorators/__init__.py", line 117, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 82, in run_processor
    parameter=parameter
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/segment.py", line 91, in __init__
    self.setup()
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/segment.py", line 116, in setup
    cfg.merge_from_file(temp_config)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/config/config.py", line 46, in merge_from_file
    loaded_cfg = self.load_yaml_with_base(cfg_filename, allow_unsafe=allow_unsafe)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/fvcore/common/config.py", line 61, in load_yaml_with_base
    cfg = yaml.safe_load(f)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/__init__.py", line 125, in safe_load
    return load(stream, SafeLoader)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/__init__.py", line 81, in load
    return loader.get_single_data()
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/constructor.py", line 49, in get_single_data
    node = self.get_single_node()
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/composer.py", line 36, in get_single_node
    document = self.compose_document()
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/composer.py", line 58, in compose_document
    self.get_event()
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/parser.py", line 118, in get_event
    self.current_event = self.state()
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/parser.py", line 193, in parse_document_end
    token = self.peek_token()
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/scanner.py", line 129, in peek_token
    self.fetch_more_tokens()
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/scanner.py", line 223, in fetch_more_tokens
    return self.fetch_value()
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/yaml/scanner.py", line 579, in fetch_value
    self.get_mark())
yaml.scanner.ScannerError: mapping values are not allowed here
  in "/tmp/tmp_jjbvt9h/configs/NewspaperNavigator_R_50_PFPN_3x.yaml", line 19, column 28

@bertsky
Owner

bertsky commented Dec 5, 2022

I can confirm the extraction of the zip-file does not work with resmgr in core v2.43. There's the correct path_in_archive setting, but nothing gets extracted, the file just gets renamed. @kba perhaps my assumption that non-empty path_in_archive would imply type=archive does not hold?
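
A quick way to confirm that a file with a .yaml name is really the unextracted archive is to look at its content rather than its extension. The sketch below is only a diagnostic aid under that assumption, not part of resmgr; `looks_like_zip` is a hypothetical helper. The byte position in the UnicodeDecodeError fits this reading: in a ZIP local file header, the signature, version, flags and compression method occupy offsets 0 to 9 and usually hold ASCII-range values, while the modification time starts at offset 10 and can be any byte.

```python
import zipfile

ZIP_MAGIC = b"PK\x03\x04"  # signature of a ZIP local file header


def looks_like_zip(path):
    """Return True if the file at `path` is a ZIP archive, whatever its extension."""
    with open(path, "rb") as f:
        starts_with_magic = f.read(4) == ZIP_MAGIC
    # is_zipfile() additionally checks for the end-of-central-directory record.
    return starts_with_magic and zipfile.is_zipfile(path)
```

Calling `looks_like_zip` on the installed DocBank_X101.yaml would be expected to return True if the download was merely renamed.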

@bertsky
Owner

bertsky commented Dec 5, 2022

in "/tmp/tmp_jjbvt9h/configs/NewspaperNavigator_R_50_PFPN_3x.yaml", line 19, column 28

Sorry, the URL does not work with wget. It seems Dropbox forces you to interact with the download button, which yields a temporary download link. Too bad. What should we do?

@stefanCCS
Author

in "/tmp/tmp_jjbvt9h/configs/NewspaperNavigator_R_50_PFPN_3x.yaml", line 19, column 28

Sorry, the URL does not work with wget. It seems Dropbox forces you to interact with the download button, which yields a temporary download link. Too bad. What should we do?

I will try out /home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_PubLayNet_R50.json instead ...

@bertsky
Owner

bertsky commented Dec 5, 2022

in "/tmp/tmp_jjbvt9h/configs/NewspaperNavigator_R_50_PFPN_3x.yaml", line 19, column 28

Sorry, the URL does not work with wget. It seems Dropbox forces you to interact with the download button, which yields a temporary download link. Too bad. What should we do?

My bad. The problem was that I misspelled the URL in the tool JSON (& instead of ? for the URL args).

Fixed on master.
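
To illustrate why `&` in place of `?` matters: without the `?`, the intended arguments become part of the URL path, so the server sees no query arguments at all. A minimal sketch using the standard library follows. The URLs and the `dl=1` argument are hypothetical placeholders, not the ones from the tool JSON.

```python
from urllib.parse import parse_qs, urlsplit


def query_args(url):
    """Return the query arguments of `url` as a dict of lists."""
    return parse_qs(urlsplit(url).query)


# Hypothetical URLs for illustration only.
good = "https://example.com/s/abc123/config.yaml?dl=1"
bad = "https://example.com/s/abc123/config.yaml&dl=1"

good_args = query_args(good)       # the argument is recognised
bad_args = query_args(bad)         # empty: no "?" means no query string
bad_path = urlsplit(bad).path      # the "&dl=1" suffix ends up in the path
```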

@bertsky
Owner

bertsky commented Dec 5, 2022

@stefanCCS can you please try again (both examples) after updating (in the usual way, i.e. git pull of the submodule, then remake ocrd-detectron2-segment in the main module)?

@stefanCCS
Author

With ocrd-detectron2-segment I get

...
15:42:19.227 INFO ocrd.cli.resmgr - Use in parameters as 'PubLayNet_R_50_FPN_3x_JPLeoRX.pth'
15:42:22.635 INFO processor.Detectron2Segment - Using compute device cpu
15:42:22.636 ERROR ocrd.ocrd-detectron2-segment.resolve_resource - Could not find resource 'PubLayNet_R_50_FPN_3x.yaml' for ...

This might be related to the following output of ocrd resmgr list-available -e ocrd-detectron2-segment:

- PubLayNet_R_50_FPN_3x_JPLeoRX.yaml  (https://github.com/facebookresearch/detectron2/raw/main/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml)
  PubLayNet via JPLeoRX R50-FPN config
- PubLayNet_R_50_FPN_3x_JPLeoRX.pth  (https://keybase.pub/jpleorx/detectron2-publaynet/mask_rcnn_R_50_FPN_3x/model_final.pth)
  PubLayNet via JPLeoRX R50-FPN weights
...

Perhaps these are not exactly the same names as those used in the JSON preset presets_PubLayNet_R50.json?

@kba
Contributor

kba commented Dec 5, 2022

@kba perhaps my assumption that non-empty path_in_archive would imply type=archive does not hold?

No, it does not, the type defaults to file. This assumption had not occurred to me. It might be possible to hack that in, but I think being explicit about the type is better in any case.

@bertsky
Owner

bertsky commented Dec 5, 2022

@kba perhaps my assumption that non-empty path_in_archive would imply type=archive does not hold?

No, it does not, the type defaults to file. This assumption had not occurred to me. It might be possible to hack that in, but I think being explicit about the type is better in any case.

Yes, and it was ill-conceived to begin with. Just because the URL is a zip-file does not mean the resource itself must be. On the contrary, path_in_archive is simply one level deeper, so theoretically it could be an archive within an archive.

But that changes the question to: why did resmgr download not extract the file in the first place??

@bertsky
Owner

bertsky commented Dec 5, 2022

Yes, and it was ill-conceived to begin with. Just because the URL is a zip-file does not mean the resource itself must be. On the contrary, path_in_archive is simply one level deeper, so theoretically it could be an archive within an archive.

Or am I getting even more confused now? What should resmgr care if a file is an archive or not, except for the purpose of extracting it at install-time?

@kba
Contributor

kba commented Dec 5, 2022

No, it does not, the type defaults to file. This assumption had not occurred to me. It might be possible to hack that in, but I think being explicit about the type is better in any case.

Yes, and it was ill-conceived to begin with. Just because the URL is a zip-file does not mean the resource itself must be. On the contrary, path_in_archive is simply one level deeper, so theoretically it could be an archive within an archive.

But that changes the question to: why did resmgr download not extract the file in the first place??

Because it did not know that it was an archive, so downloaded it and was done for the day.

Or am I getting even more confused now? What should resmgr care if a file is an archive or not, except for the purpose of extracting it at install-time?

The type attribute is semantically imprecise. The file vs. directory distinction is relevant for listing the resources on the disk, the file/directory vs archive is relevant for installation. It would have been better to distinguish "source type" (what is it we're downloading/copying) and "target type" (how should it be stored and listed).
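
The distinction between "source type" and "target type" can be made concrete with a small data model. This is only an illustrative sketch of the idea described above, not the resmgr schema: the class `Resource`, the fields `source_type` and `target_type`, and the example values are hypothetical. Only `type`, `path_in_archive` and the file/directory/archive values come from this discussion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    """Hypothetical resource entry with the two roles of `type` kept apart."""
    name: str
    url: str
    source_type: str = "file"   # what is downloaded: "file" or "archive"
    target_type: str = "file"   # how it is stored and listed: "file" or "directory"
    path_in_archive: Optional[str] = None


def install_steps(res):
    """Return the installation steps implied by the entry, in order."""
    steps = ["download " + res.url]
    if res.source_type == "archive":
        # Extraction happens only when the source is declared as an archive.
        steps.append("extract " + (res.path_in_archive or "."))
    steps.append("store as " + res.target_type + " " + res.name)
    return steps
```

With the default `source_type="file"`, a non-empty `path_in_archive` alone triggers no extraction step, which mirrors the behaviour observed in this issue.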

@bertsky
Owner

bertsky commented Dec 5, 2022

ok, thanks for clarification! So it is correct now (on master).

Alas:

  [------------------------------------]    0%
16:57:25.801 INFO ocrd.resource_manager._download_impl - Downloading https://layoutlm.blob.core.windows.net/docbank/model_zoo/X101.zip to download.tar.xx
  [####################################]  100%
17:01:19.093 INFO ocrd.resource_manager.download - Extracting archive to /tmp/tmpnm6kkfv9/out

...
tarfile.ReadError: file could not be opened successfully

Also ambiguous: the size parameter. It is not clear whether this applies to the (extracted) file or the (zipped) download. The implemented progress bar seems to indicate the latter, but the documentation just says size of the resource in bytes.

@bertsky
Owner

bertsky commented Dec 5, 2022

Downloading https://layoutlm.blob.core.windows.net/docbank/model_zoo/X101.zip to download.tar.xx

@kba it seems that zip files have never been supported in resmgr to date. Should I open an issue?
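
For reference, handling both archive formats with the Python standard library could look roughly like the sketch below. It is not the resmgr code and makes no claim about how the fix will be implemented. `extract_member` is a hypothetical helper, and it assumes the archive comes from a trusted source (member paths are not sanitised here).

```python
import os
import shutil
import tarfile
import tempfile
import zipfile


def extract_member(archive_path, member, target_path):
    """Extract one member of a ZIP or tar archive to `target_path`."""
    with tempfile.TemporaryDirectory() as tmpdir:
        if zipfile.is_zipfile(archive_path):
            with zipfile.ZipFile(archive_path) as archive:
                extracted = archive.extract(member, tmpdir)
        elif tarfile.is_tarfile(archive_path):
            with tarfile.open(archive_path) as archive:
                archive.extract(member, tmpdir)
            extracted = os.path.join(tmpdir, member)
        else:
            raise ValueError("not a ZIP or tar archive: " + archive_path)
        # Move the extracted file out before the temporary directory is removed.
        shutil.move(extracted, target_path)
    return target_path
```

Detecting the format from the file content, rather than from a fixed suffix such as download.tar.xx, avoids a tarfile.ReadError when the download is in fact a ZIP file.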

@kba
Contributor

kba commented Dec 6, 2022

@kba it seems that zip files have never been supported in resmgr to date. Should I open an issue?

Yeah, that's why the type was originally tarball. I have opened OCR-D/core#963 for this.

Also ambiguous: the size parameter. It is not clear whether this applies to the (extracted) file or the (zipped) download. The implemented progress bar seems to indicate the latter, but the documentation just says size of the resource in bytes.

It's only used for the download bar. OCR-D/spec#233

@stefanCCS
Author

After updating the detectron2 module, I was able to run ocrd-detectron2-segment without any errors, using the model "TableBank_X152".

I still have trouble using the model "PubLayNet_R_50_FPN_3x_JPLeoRX", where I get the following error (some name mismatch):

15:15:44.203 INFO ocrd.cli.resmgr - Downloading registered resource 'PubLayNet_R_50_FPN_3x_JPLeoRX.yaml' (https://github.com/facebookresearch/detectron2/raw/main/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml)

15:15:47.575 INFO ocrd.resource_manager.download - https://github.com/facebookresearch/detectron2/raw/main/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml to be downloaded to /home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/PubLayNet_R_50_FPN_3x_JPLeoRX.yaml which already exists and overwrite is False
15:15:47.616 INFO ocrd.cli.resmgr - Installed resource https://github.com/facebookresearch/detectron2/raw/main/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml under /home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/PubLayNet_R_50_FPN_3x_JPLeoRX.yaml
15:15:47.617 INFO ocrd.cli.resmgr - Use in parameters as 'PubLayNet_R_50_FPN_3x_JPLeoRX.yaml'
15:15:52.324 INFO ocrd.cli.resmgr - Downloading registered resource 'PubLayNet_R_50_FPN_3x_JPLeoRX.pth' (https://keybase.pub/jpleorx/detectron2-publaynet/mask_rcnn_R_50_FPN_3x/model_final.pth)

15:15:55.692 INFO ocrd.resource_manager.download - https://keybase.pub/jpleorx/detectron2-publaynet/mask_rcnn_R_50_FPN_3x/model_final.pth to be downloaded to /home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/PubLayNet_R_50_FPN_3x_JPLeoRX.pth which already exists and overwrite is False
15:15:55.732 INFO ocrd.cli.resmgr - Installed resource https://keybase.pub/jpleorx/detectron2-publaynet/mask_rcnn_R_50_FPN_3x/model_final.pth under /home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/PubLayNet_R_50_FPN_3x_JPLeoRX.pth
15:15:55.732 INFO ocrd.cli.resmgr - Use in parameters as 'PubLayNet_R_50_FPN_3x_JPLeoRX.pth'
15:15:58.839 INFO processor.Detectron2Segment - Using compute device cpu
15:15:58.840 ERROR ocrd.ocrd-detectron2-segment.resolve_resource - Could not find resource 'PubLayNet_R_50_FPN_3x.yaml' for executable 'ocrd-detectron2-segment'. Try 'ocrd resmgr download ocrd-detectron2-segment PubLayNet_R_50_FPN_3x.yaml' to download this resource.
ERROR from called application: ExitCode=1

Maybe the JSON preset file is not correct?
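
One way to check for such a name mismatch is to compare the resource names mentioned in a preset file with what is actually installed. The sketch below is a rough aid under loud assumptions: it treats the preset as a flat JSON object and considers every string value ending in .yaml or .pth to be a resource name. The parameter keys and the helper `missing_resources` are hypothetical and not taken from ocrd_detectron2.

```python
import json
import os

RESOURCE_SUFFIXES = (".yaml", ".pth")


def missing_resources(preset_path, resource_dir):
    """List resource file names referenced in a preset JSON but absent from `resource_dir`."""
    with open(preset_path, encoding="utf-8") as f:
        preset = json.load(f)
    # Assumption: resources appear as top-level string values with these suffixes.
    referenced = [
        value for value in preset.values()
        if isinstance(value, str) and value.endswith(RESOURCE_SUFFIXES)
    ]
    return [name for name in referenced
            if not os.path.isfile(os.path.join(resource_dir, name))]
```

A non-empty result for a preset would point to names that `ocrd resmgr download` has not installed under exactly that file name.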

@stefanCCS
Author

Another attempt. I ran:

ocrd resmgr download ocrd-detectron2-segment PubLayNet_R_50_FPN_3x_JPLeoRX.yaml
ocrd resmgr download ocrd-detectron2-segment PubLayNet_R_50_FPN_3x_JPLeoRX.pth
ocrd-detectron2-segment -I OCR-D-BIN -O OCR-D-DETECTRON2-PubLayNet_R50_JPLeoRX -p /home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_PubLayNet_R50_JPLeoRX.json

And got this:

(ocrd-3.7) ocrdadmin@ocrd-03:/mnt/OCRD/myData/Specials/Detectron2Test$ ocrd-detectron2-segment -I OCR-D-BIN -O OCR-D-DETECTRON2-PubLayNet_R50_JPLeoRX -p /home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_PubLayNet_R50_JPLeoRX.json
15:42:19.806 INFO processor.Detectron2Segment - Using compute device cpu
15:42:19.806 INFO processor.Detectron2Segment - Loading config '/home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/PubLayNet_R_50_FPN_3x_JPLeoRX.yaml'
Traceback (most recent call last):
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/bin/ocrd-detectron2-segment", line 8, in <module>
    sys.exit(ocrd_detectron2_segment())
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/cli.py", line 9, in ocrd_detectron2_segment
    return ocrd_cli_wrap_processor(Detectron2Segment, *args, **kwargs)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd/decorators/__init__.py", line 117, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 82, in run_processor
    parameter=parameter
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/segment.py", line 92, in __init__
    self.setup()
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/segment.py", line 117, in setup
    cfg.merge_from_file(temp_config)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/config/config.py", line 46, in merge_from_file
    loaded_cfg = self.load_yaml_with_base(cfg_filename, allow_unsafe=allow_unsafe)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/fvcore/common/config.py", line 103, in load_yaml_with_base
    base_cfg = _load_with_base(base_cfg_file)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/fvcore/common/config.py", line 93, in _load_with_base
    return cls.load_yaml_with_base(base_cfg_file, allow_unsafe=allow_unsafe)
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/fvcore/common/config.py", line 59, in load_yaml_with_base
    with cls._open_cfg(filename) as f:
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/config/config.py", line 34, in _open_cfg
    return PathManager.open(filename, "r")
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/iopath/common/file_io.py", line 1012, in open
    bret = handler._open(path, mode, buffering=buffering, **kwargs)  # type: ignore
  File "/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/iopath/common/file_io.py", line 612, in _open
    opener=opener,
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp026q1q7v/Base-RCNN-FPN.yaml'

--> Please clarify.

@stefanCCS
Author

I get the same error if I manually unzip "DocBank_X101" and copy the files to
/home/ocrdadmin/.local/share/ocrd-resources/ocrd-detectron2-segment/DocBank_X101.pth and .yaml, respectively,
while using

/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/ocrd_detectron2/presets_DocBank_X101.json

@kba
Contributor

kba commented Dec 8, 2022

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp026q1q7v/Base-RCNN-FPN.yaml'

This seems to be an issue with PubLayNet_R_50_FPN_3x_JPLeoRX.yaml, which has this

_BASE_: "../Base-RCNN-FPN.yaml"

which should be

_BASE_: "../configs/Base-RCNN-FPN.yaml"

I think. That solves the FileNotFoundError for me.

Unfortunately, this still gives me AssertionError: The chosen model's number of classes 80 does not match the given list of categories.

So I think this is an issue with the third-party models themselves, not ocrd_detectron2.
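
As a minimal sketch of the manual fix described above, the `_BASE_` line of a downloaded config could be rewritten with a few lines of Python. The helper name `rewrite_base` is hypothetical and not part of ocrd_detectron2 or Detectron2. It uses a regular expression instead of a YAML parser to stay dependency-free, and it assumes that `_BASE_` appears as a top-level key on a line of its own. Whether `../configs/Base-RCNN-FPN.yaml` is the right target depends on where the config ends up being loaded from.

```python
import re

# Matches a top-level "_BASE_: ..." line (assumption: the key is not indented).
_BASE_LINE = re.compile(r"^_BASE_:[ \t]*.*$", flags=re.MULTILINE)


def rewrite_base(config_text: str, new_base: str) -> str:
    """Return config_text with the value of its first top-level _BASE_ key replaced.

    Hypothetical helper for illustration only; all other lines are left untouched.
    """
    # A function is used as the replacement so that backslashes in new_base
    # are not interpreted as regex escapes.
    return _BASE_LINE.sub(lambda match: f'_BASE_: "{new_base}"', config_text, count=1)


# Example input modelled on the line quoted above; the MODEL section is made up.
original = '_BASE_: "../Base-RCNN-FPN.yaml"\nMODEL:\n  MASK_ON: True\n'
patched = rewrite_base(original, "../configs/Base-RCNN-FPN.yaml")
```

In practice one would read the `.yaml` file from the resource directory, pass its text through the helper, and write the result back, keeping a backup of the original.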

@stefanCCS
Author

@kba: Concerning Base-RCNN-FPN.yaml:
If I search for it, I find it 5 times.
In which path is it searched for? (I just want to put a soft link there...)

(ocrd-3.7) ocrdadmin@ocrd-03:~$ find . -name Base-RCNN-FPN.yaml
./ocrd-3.7_rel_2022-11-10/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/configs/Base-RCNN-FPN.yaml
./ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/configs/Base-RCNN-FPN.yaml
./ocrd-3.7_rel_2022-11-24/sub-venv/headless-tf1/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/configs/Base-RCNN-FPN.yaml
./ocrd-3.7_rel_2022-11-24/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/configs/Base-RCNN-FPN.yaml
./ocrd-3.7_rel_2022-11-24/lib/python3.7/site-packages/detectron2/model_zoo/configs/Base-RCNN-FPN.yaml

@stefanCCS
Author

Well, I have put a soft link in all five places - as you can see here:

(ocrd-3.7) ocrdadmin@ocrd-03:/mnt/OCRD/myData/Specials/Detectron2Test$ find ~ -name Base-RCNN-FPN.yaml
/home/ocrdadmin/ocrd-3.7_rel_2022-11-10/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/configs/Base-RCNN-FPN.yaml
/home/ocrdadmin/ocrd-3.7_rel_2022-11-10/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/Base-RCNN-FPN.yaml
/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/configs/Base-RCNN-FPN.yaml
/home/ocrdadmin/ocrd-3.7/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/Base-RCNN-FPN.yaml
/home/ocrdadmin/ocrd-3.7_rel_2022-11-24/sub-venv/headless-tf1/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/configs/Base-RCNN-FPN.yaml
/home/ocrdadmin/ocrd-3.7_rel_2022-11-24/sub-venv/headless-tf1/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/Base-RCNN-FPN.yaml
/home/ocrdadmin/ocrd-3.7_rel_2022-11-24/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/configs/Base-RCNN-FPN.yaml
/home/ocrdadmin/ocrd-3.7_rel_2022-11-24/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/Base-RCNN-FPN.yaml
/home/ocrdadmin/ocrd-3.7_rel_2022-11-24/lib/python3.7/site-packages/detectron2/model_zoo/configs/Base-RCNN-FPN.yaml
/home/ocrdadmin/ocrd-3.7_rel_2022-11-24/lib/python3.7/site-packages/detectron2/model_zoo/Base-RCNN-FPN.yaml

--> Unfortunately, this has not worked :-(

I have created a soft link for the whole configs folder in /tmp like this:

(ocrd-3.7) ocrdadmin@ocrd-03:/tmp$ ln -s /home/ocrdadmin/ocrd-3.7_rel_2022-11-10/sub-venv/headless-tf1/lib/python3.7/site-packages/detectron2/model_zoo/configs/ configs

--> I did this because the error I now always got was this:

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/configs/Base-RCNN-FPN.yaml'

--> This has worked, but I do not understand whether this is always the case or, in general, what the logic behind it is.
--> Especially since, earlier in this issue, the error looked like this (with a random path!):

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp026q1q7v/Base-RCNN-FPN.yaml'

--> So, what is the general solution?
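
On the question of the logic: judging from the traceback (`load_yaml_with_base` in fvcore), a relative `_BASE_` entry appears to be resolved against the directory of the config file that names it, not against the current working directory. Since the processor loads a temporary copy of the config (`temp_config` in the traceback), that directory is a freshly created one under /tmp, which would explain the random path. The following is only a rough sketch of that resolution, not the library's actual code: `resolve_base` is a hypothetical helper, and the example paths (including the subdirectory name `sub`) are made up.

```python
import os.path


def resolve_base(config_path: str, base: str) -> str:
    """Approximate where a _BASE_ entry found in config_path would be looked up.

    Hypothetical helper: absolute paths and URLs are returned unchanged,
    anything else is joined onto the directory of the referring config file.
    """
    if base.startswith("~"):
        base = os.path.expanduser(base)
    if base.startswith(("/", "https://", "http://")):
        return base
    # normpath collapses the ".." so the result looks like the path in the error message.
    return os.path.normpath(os.path.join(os.path.dirname(config_path), base))


# Made-up temporary location of the copied config file.
temp_config = "/tmp/tmpXXXXXXXX/sub/model.yaml"
as_shipped = resolve_base(temp_config, "../Base-RCNN-FPN.yaml")
as_patched = resolve_base(temp_config, "../configs/Base-RCNN-FPN.yaml")
```

Under these assumptions, a soft link at a fixed place in /tmp only helps by coincidence; the more robust fix is a `_BASE_` value that points at a file which really exists relative to the loaded config.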

@bertsky
Owner

bertsky commented Jan 14, 2023

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp026q1q7v/Base-RCNN-FPN.yaml'

This seems to be an issue with PubLayNet_R_50_FPN_3x_JPLeoRX.yaml, which has this

_BASE_: "../Base-RCNN-FPN.yaml"

which should be

_BASE_: "../configs/Base-RCNN-FPN.yaml"

I think. That solves the FileNotFoundError for me.

Yes, some model providers make crazy assumptions about where in the original Detectron2 repo your CWD is. That's why I already have to do a temporary shutil.copytree into the Detectron2 distribution. I'll make an additional workaround for this case in the loader code, so we won't have to manually fix the config files (which I did for myself in the past).

Unfortunately, this still gives me AssertionError: The chosen model's number of classes 80 does not match the given list of categories.

So I think this is an issue with the third-party models themselves, not ocrd_detectron2.

Indeed, this particular config is even worse. I took it from https://github.com/JPLeoRX/detectron2-publaynet. They help themselves by using the vanilla COCO config (which is for photo scenery, not for PubLayNet document images), but overriding the NUM_CLASSES at runtime.

Since this applies to all of JPLeoRX's models, and they are trained on PubLayNet just like hpanwar08's, I think it would suffice to just switch over to hpanwar08's configs. I'll make a fix.
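
The class-count mismatch quoted above can be illustrated with a small sketch. The check below is a hypothetical stand-in, not the actual ocrd_detectron2 code: it only compares a config's class count against a category list. COCO defines 80 object classes, whereas PubLayNet has five layout categories (text, title, list, table, figure).

```python
from typing import Sequence

# The five PubLayNet layout categories.
PUBLAYNET_CATEGORIES = ["text", "title", "list", "table", "figure"]


def check_categories(num_classes: int, categories: Sequence[str]) -> None:
    """Raise AssertionError if the model's class count and the category list disagree.

    Hypothetical helper that mirrors the wording of the error message quoted above.
    """
    if num_classes != len(categories):
        raise AssertionError(
            f"The chosen model's number of classes {num_classes} "
            "does not match the given list of categories."
        )


# A PubLayNet config with NUM_CLASSES set to 5 passes the check.
check_categories(5, PUBLAYNET_CATEGORIES)

# A vanilla COCO config (80 classes) with the same category list does not.
try:
    check_categories(80, PUBLAYNET_CATEGORIES)
    coco_message = None
except AssertionError as error:
    coco_message = str(error)
```

This is why a config that relies on NUM_CLASSES being overridden at runtime fails when loaded as-is.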

@bertsky
Owner

bertsky commented Jan 15, 2023

The PubLayNet/JPLeoRX models should be fixed with 07fbdbf now. @stefanCCS could you please reinstall, redownload and try again?

@stefanCCS
Author

@kba, @bertsky:
As usual, I would prefer just to have a new release of ocrd_all.
It should include the fixes for this issue (#15), for #18, and for something related to OCR-D/core#970.
Will this be available in the near future?

@bertsky
Owner

bertsky commented Jan 16, 2023

Sure, the next ocrd_all will certainly update ocrd_detectron2 to 0.1.5. (You'll still need to run sudo make deps-ubuntu again though, since I doubt we will switch to Python 3.7 / Ubuntu 20.04 so quickly.)

@bertsky bertsky closed this as completed Mar 8, 2023