Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix duplicated IDs in ALTO XML when multiple pages are present #4386

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jankal
Copy link

@jankal jankal commented Jan 25, 2025

When running tesseract on a list of images and validating the alto output using the schema (https://www.loc.gov/alto/v3/alto-3-0.xsd), I had the following error:

 Element '{http://www.loc.gov/standards/alto/ns-v3#}ComposedBlock', attribute 'ID': 'cblock_0' is not a valid value of the atomic type '{http://www.loc.gov/standards/alto/ns-v3#}BlockTypeID'.

Digging into the type of BlockTypeID, I found a xsd:ID restriction - meaning the BlockTypeID can only exist once in the whole document / is unique per document.

I incorporated the current page_number into the IDs thus resolving the issue.

@jankal jankal changed the title Fix alto xml duplicates IDs when multiple pages are present Fix duplicates IDs in ALTO XML when multiple pages are present Jan 25, 2025
@jankal jankal changed the title Fix duplicates IDs in ALTO XML when multiple pages are present Fix duplicated IDs in ALTO XML when multiple pages are present Jan 25, 2025
@stweil
Copy link
Member

stweil commented Jan 25, 2025

Thanks for the fix. I just wonder whether we should try to keep the current IDs for the most common use case where only a single page is processed. This could be achieved for example by

  • conditional code
  • using counters (bcnt, ...) which are not reset for each additonal image
  • using an index of (page_number - 1) * 1000 + bcnt ... in the ID strings

…nt pge

This will ensure, validated ALTO XML output is generated while keeping IDs for the first page consistent as before.
@jankal
Copy link
Author

jankal commented Jan 26, 2025

Thanks for your suggestion @stweil!

I added a function GetID to generate the ID. The new GetID function keeps the current ID structure for the first page and starts to add the page_number from the second page on.

Disclosure: I haven't written CPP before, so pls check for rookie mistakes :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants