idea: add page filtering #2
Comments
Good idea. We could probably reuse the @M3ssman Besides front and back cover, what kind of pages do you exclude from processing with OCR-D?
For logical, I would also recommend
In addition, one should consider the attribute values of AFAICS at ULB it's only used on page containers, but according to recent METS specs they may also appear within logical structs. We should consider this, since non-DFG METS doesn't really care about explicit physical structs.
@M3ssman you mean something like (I have no experience and no data to grub.)
@bertsky Yes, like this.
Got it, thanks! Most interesting. Like you said, it would depend on the particular rules of each digitisation process/institution, plus unintended deviations (typos, brackets). So IMO your approach of making this configurable is the only adequate solution. The mechanism (config file, envvar or CLI param) should be discussed for OCR-D, though. IMO we need something to prevent unnecessary downloads and processing. But some dummy fallback output even for filtered pages is actually preferable. (So in #1, I would not filter out these pages in a separate pipeline step, but rather have the filter behave like a processing error.)
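To make the "configurable, but tolerant of typos and brackets" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the environment variable name `PAGE_FILTER_PATTERNS`, the default patterns, and the helper names are hypothetical and not part of OCR-D or this project.

```python
import os
import re

# Hypothetical defaults; real values would depend on each institution's rules.
DEFAULT_PATTERNS = ["cover", "empty"]


def load_patterns(environ=os.environ):
    """Read comma-separated regex patterns from a (hypothetical) env var."""
    raw = environ.get("PAGE_FILTER_PATTERNS", "")
    items = [p.strip() for p in raw.split(",") if p.strip()] or DEFAULT_PATTERNS
    return [re.compile(p, re.IGNORECASE) for p in items]


def normalize(label):
    """Drop square brackets and parentheses, then trim whitespace."""
    return re.sub(r"[\[\]()]", "", label).strip()


def is_filtered(label, patterns):
    """Return True if the normalized label matches any configured pattern."""
    text = normalize(label)
    return any(p.search(text) for p in patterns)
```

A caller could then skip downloading a filtered page and emit a dummy result for it instead, so that downstream steps still see an output for every page.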
Anyway, I wonder how digital objects are structured at your own houses? @bertsky According to the SLUB OAI API, there are 44k+ records in Dresden. @kba According to the SBB OAI API, there are 32k+ in Berlin.
@bertsky Concerning unintended deviations: AFAICS, the annotation of content related information like
Indeed, I can easily research this myself – thanks! In fact, I did (for 15th-18th c. prints), using metha. It looks like other than the obvious Strukturdatenset choices (
In the parallel case, when computing the page range expression, we could add a filter to remove empty or cover pages from the processing pipeline (possibly also just creating an empty annotation for them via ocrd-dummy).
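A rough sketch of such a filter, using only the Python standard library. It assumes a METS file with a logical and a physical `structMap` connected via `smLink` elements. The function name and the set of excluded `TYPE` values are placeholders chosen for illustration, not an agreed-upon list.

```python
import xml.etree.ElementTree as ET

METS = "{http://www.loc.gov/METS/}"
XLINK = "{http://www.w3.org/1999/xlink}"
NS = {"mets": "http://www.loc.gov/METS/"}


def filtered_page_ids(mets_xml, excluded_types):
    """Return physical page IDs not linked to an excluded logical div."""
    root = ET.fromstring(mets_xml)

    # 1. Collect IDs of logical divs whose TYPE is excluded.
    excluded_logical = set()
    for struct_map in root.findall("mets:structMap[@TYPE='LOGICAL']", NS):
        for div in struct_map.iter(METS + "div"):
            if div.get("TYPE") in excluded_types:
                excluded_logical.add(div.get("ID"))

    # 2. Follow structLink entries from those logical divs to physical pages.
    excluded_physical = set()
    for link in root.iter(METS + "smLink"):
        if link.get(XLINK + "from") in excluded_logical:
            excluded_physical.add(link.get(XLINK + "to"))

    # 3. Keep the remaining pages in document order.
    pages = []
    for struct_map in root.findall("mets:structMap[@TYPE='PHYSICAL']", NS):
        for div in struct_map.iter(METS + "div"):
            if div.get("TYPE") == "page" and div.get("ID") not in excluded_physical:
                pages.append(div.get("ID"))
    return pages
```

The resulting list could be joined with commas to form the page range expression for the parallel case. The excluded pages could alternatively be collected in a second list and handed to ocrd-dummy.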