[foliatextcontent] Implement adding markup information in the text that points to the substrings #23

Closed
proycon opened this issue Dec 3, 2020 · 4 comments
Assignees
Labels
enhancement New feature or request ready

Comments

@proycon
Owner

proycon commented Dec 3, 2020

This is needed for proycon/flat#92. There is already an option for this in foliatextcontent, but it does not yet work in all cases, most notably when the text content is already present rather than generated by foliatextcontent.

@proycon
Owner Author

proycon commented Dec 3, 2020

<correction> elements over <str> should be translated to the proper <t-correction> elements.
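
A minimal sketch of the idea, not foliatextcontent's actual implementation: given a plain text string and a list of substring spans, wrap each span in a markup element that points back to the annotation it came from. The function name `mark_substrings`, the span tuples, and the IDs are hypothetical. The use of an `id` attribute as the back-reference is an assumption about FoLiA text markup and should be checked against the FoLiA specification.

```python
import xml.etree.ElementTree as ET


def mark_substrings(text, spans):
    """Build a <t> element in which each span is wrapped in a markup element.

    spans: iterable of (offset, length, tag, ref_id) tuples, non-overlapping.
    tag is e.g. "t-str" or "t-correction"; ref_id is the (assumed) reference
    to the <str> or <correction> element the span belongs to.
    """
    t = ET.Element("t")
    cursor = 0
    last = None  # most recently added markup child, if any
    for offset, length, tag, ref_id in sorted(spans):
        if offset < cursor:
            raise ValueError("overlapping spans are not handled in this sketch")
        gap = text[cursor:offset]  # unmarked text before this span
        if last is None:
            t.text = gap
        else:
            last.tail = gap
        last = ET.SubElement(t, tag, {"id": ref_id})
        last.text = text[offset:offset + length]
        cursor = offset + length
    rest = text[cursor:]  # unmarked text after the final span
    if last is None:
        t.text = rest
    else:
        last.tail = rest
    return t


example = mark_substrings(
    "Hello wrold today",
    [(6, 5, "t-correction", "c.1"), (0, 5, "t-str", "s.1")],
)
print(ET.tostring(example, encoding="unicode"))
```

The plain text is preserved exactly: joining all text nodes of the result gives back the input string, which matters when the text content is already present in the document and must not change.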

@pirolen

pirolen commented Dec 4, 2020

Awesome, thanks so much for making this enhancement!

Just wondering (not requesting): would this enable manual correction operations, at least partly? For example, when a superscript number ("17") was misrecognized as an apostrophe:
<t class="OCR" offset="1086">Materialien'</t>

-->

  • delete this token
  • add a new token with "Materialien"
  • add a new token with "17"
  • annotate the "17" token with superscript style.
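
The four steps above can be sketched on plain data, just to make the intended operation concrete. This is not FLAT's or FoLiA's API: `Token`, `Correction`, and `split_token` are hypothetical names, and the `style` field is an assumed stand-in for style markup.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    text: str
    style: Optional[str] = None  # hypothetical style label, e.g. "superscript"


@dataclass
class Correction:
    original: List[Token]  # what the OCR produced
    new: List[Token]       # what the annotator replaced it with


def split_token(tokens, index, replacements):
    """Replace tokens[index] by replacements and record the change."""
    correction = Correction(original=[tokens[index]], new=list(replacements))
    updated = tokens[:index] + list(replacements) + tokens[index + 1:]
    return updated, correction


ocr_tokens = [Token("Materialien'")]
fixed, correction = split_token(
    ocr_tokens,
    0,
    [Token("Materialien"), Token("17", style="superscript")],
)
```

Keeping the original token inside the correction record means the OCR reading stays recoverable after the edit.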

(Once the PAGE-XML to FoLiA converter is there, I could use ucto to generate a test file. Please let me know if I could do something more.)

@proycon
Owner Author

proycon commented Dec 4, 2020

(I don't think this relates directly to this issue, which is about substrings, i.e. arbitrary references on untokenised text.)

I assume you refer to manual annotation in FLAT. Editing corrections in FLAT indeed only works on the token level. If a tokenised document is available with all the markup information present, then the procedure you described would work for the first three steps. The fourth is still an issue, as FLAT doesn't support annotating markup (e.g. style) yet; markup support in FLAT is currently limited to viewing.

The other caveat is preserving all the markup information after tokenisation, which ucto currently doesn't do. You're currently stuck with the markup information mostly on the paragraph level. Neither TICCL nor ucto propagates it to deeper levels, which is what you need if you want to correct it in FLAT. I had already opened a related issue to implement this specific functionality in foliatextcontent: #19. The good news is that this should all be automatically resolvable.

@pirolen

pirolen commented Dec 4, 2020

Awesome, thanks! Sorry about commenting on the wrong issue; I did indeed mean the functionality of FLAT.

@proycon proycon added the ready label Dec 7, 2020
@proycon proycon closed this as completed Jan 8, 2021