any chance that the rows and cells in the Abbyy file would be kept? #64

Open
kosloot opened this issue Aug 10, 2021 · 3 comments

kosloot commented Aug 10, 2021

Just a quick question: is there any chance that the rows and cells in the Abbyy file would be kept by the converter?

Originally posted by @pirolen in #62 (comment)


kosloot commented Aug 10, 2021

I would have to look into that. FoLiA has <row> and <cell> constructions, so it should be doable.
(NB: your example file has rows containing only one cell, which is quite odd.)
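
To make the idea concrete, here is a rough sketch of such a mapping in Python. It is not the converter's code. It assumes that Abbyy marks tables as a block with blockType="Table" holding row and cell elements, and it invents a tiny input snippet for illustration (not the attached file). The output borrows the FoLiA element names table, row and cell, but leaves out ids, namespaces and annotation declarations, so it is not valid FoLiA. Putting each cell's text into a single t child is a simplification as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified Abbyy-style input (invented for illustration).
ABBYY_SNIPPET = """
<page>
  <block blockType="Table">
    <row>
      <cell><text><par><line><formatting>cell 1.1</formatting></line></par></text></cell>
      <cell><text><par><line><formatting>cell 1.2</formatting></line></par></text></cell>
    </row>
    <row>
      <cell><text><par><line><formatting>cell 2.1</formatting></line></par></text></cell>
      <cell><text><par><line><formatting>cell 2.2</formatting></line></par></text></cell>
    </row>
  </block>
</page>
"""


def local(tag):
    """Return the tag name without an XML namespace prefix, if any."""
    return tag.rsplit("}", 1)[-1]


def cell_text(cell):
    """Join the text of all line elements below a cell, one line per row of text."""
    lines = []
    for el in cell.iter():
        if local(el.tag) == "line":
            lines.append("".join(el.itertext()).strip())
    return "\n".join(lines)


def convert_tables(abbyy_root):
    """Build one FoLiA-like table element per Abbyy table block."""
    tables = []
    for block in abbyy_root.iter():
        if local(block.tag) != "block" or block.get("blockType") != "Table":
            continue
        table = ET.Element("table")
        for row in block:
            if local(row.tag) != "row":
                continue
            out_row = ET.SubElement(table, "row")
            for cell in row:
                if local(cell.tag) != "cell":
                    continue
                out_cell = ET.SubElement(out_row, "cell")
                # Simplification: the whole cell text goes into one <t> child.
                ET.SubElement(out_cell, "t").text = cell_text(cell)
        tables.append(table)
    return tables


tables = convert_tables(ET.fromstring(ABBYY_SNIPPET.strip()))
print(ET.tostring(tables[0], encoding="unicode"))
```

The row and cell nesting is copied one to one, so rows with a single cell would simply come out as rows with a single cell.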

kosloot self-assigned this Aug 10, 2021

pirolen commented Aug 10, 2021

Nice, thank you very much for investigating it; I was just wondering.

The single-cell oddity arises because these tables actually hold data from registries whose entries can span several lines in a weakly structured way (e.g. using indentation levels).
The paragraphs that Abbyy recognizes do not capture the entry boundaries properly, since the entry-structuring logic of the printed pages is often complex.
The 'table cells' can keep the lines of an entry together, so the table format is simply a workaround that the Abbyy OCR postcorrection app allows: using the app, human correctors manually separate the entries from each other by drawing a table around them.
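
As an illustration of why this matters downstream, here is a small Python sketch of how the entries could be read back if the rows and cells survive conversion. It assumes the Abbyy-style nesting of row, cell, par and line elements without an XML namespace, and the snippet and the function name are invented for this example (they are not taken from my files or from the converter).

```python
import xml.etree.ElementTree as ET

# Invented example: two registry entries, each wrapped in a one-cell row.
# The second entry was split into two paragraphs by the OCR.
SNIPPET = """
<block blockType="Table">
  <row><cell><text>
    <par>
      <line><formatting>entry A, first line</formatting></line>
      <line><formatting>entry A, continuation</formatting></line>
    </par>
  </text></cell></row>
  <row><cell><text>
    <par><line><formatting>entry B, first line</formatting></line></par>
    <par><line><formatting>entry B, continuation</formatting></line></par>
  </text></cell></row>
</block>
"""


def entries_from_single_cell_rows(block):
    """Return one list of text lines per row that has exactly one cell."""
    entries = []
    for row in block.iter("row"):
        cells = row.findall("cell")
        if len(cells) != 1:
            continue  # a genuine multi-column row, not an entry wrapper
        # Paragraph boundaries inside the cell are ignored on purpose:
        # the cell, not the paragraph, marks the entry.
        lines = [
            " ".join("".join(line.itertext()).split())
            for line in cells[0].iter("line")
        ]
        entries.append(lines)
    return entries


entries = entries_from_single_cell_rows(ET.fromstring(SNIPPET.strip()))
for entry in entries:
    print(entry)
```

Both entries come back as two-line units, even though the OCR paragraphs disagree about where the second one starts and ends.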


pirolen commented Aug 11, 2021

Please find attached a proper table example in Abbyy XML; for the printed original, please see the PNG.

b1_3_1_mwtext_ostpreuss_pp109_277_036

b1_3_1_mwtext_ostpreuss_pp109_277_036.table.xml.txt
