How about the HTR-United framework? #22
Comments
It's a good idea to base the structure of our work on something existing. So, in each folder, we would find four subfolders: one for the models created, one for the PDFs of the incunabula, one for the training corpus containing the XML ALTO files exported from eScriptorium and the .txt files, and a last one containing the ground truth. Is that what you are proposing?
The idea is to have a clear picture of the two corpora and the models. So I propose three folders: one for the models, one for the training corpus (PDFs with their corresponding XML files, created automatically via the XML ALTO export including images, plus the .txt file for more clarity), and one with the same structure for the verification corpus (cf. the Inc59 folder). This way it is much easier to review and correct a file that has an error by deleting or replacing it directly.
Do we also need to share the correction of the verification corpus, or can we leave it for the part of the project that concerns a joining member?
Good question. For my part, the verification corpus is the outcome of the results, and the original readings of the model are not included. Some thoughts (not exhaustive, of course) on the joining member or anyone who would like to test our ground truths: since the models (as efficient as they are) and the source files are uploaded to the repository, they can be used to reproduce the experiment and confirm the accuracy of the model and the results it gives on the verification corpus. I don't know whether it would also be useful to upload the machine's first readings, this time of another test page, since the corrected verification corpus has been used to retrain the machine, so it's not "unseen" anymore. Let me know what you think about it.
I have an idea regarding the framework of the project.
Since this project is (in theory) already part of the ENC and of Marco's biannual project funded by the school, it would be interesting (not to mention practical) to try to link our work with the HTR-United initiative (https://htr-united.github.io), which "aims to pool HTR/OCR transcriptions of texts from all periods and in all styles, mainly in French but not exclusively. It was born of a simple need, for projects, to have potential ground truths with which to train models quickly on smaller corpora." More information can be found in their recent article: Alix Chagué, Thibault Clérice, Laurent Romary. HTR-United : Mutualisons la vérité de terrain !. 2021. https://hal.archives-ouvertes.fr/hal-03398740/document
The fact that they have already produced detailed guidelines and workflows that ensure quality control of the data sets and the ground truths facilitates the interoperability of the data, which can be shared, verified, and improved in the long term, guaranteeing their sustainability.
For our project, this entails several things:
a) Transcripts aligned with images, in a standard format such as XML PAGE or XML ALTO;
b) A repository structure in which each folder contains two separate subfolders, one for the training corpus and one for the ground truth, together with their respective sources, i.e. the PDF images of the incunabula (the IIIF reference alone is also acceptable, but it is easier if they are uploaded directly to the folders);
c) A description of the repository in our README.md file, along with any important information about our procedure;
d) A YAML document named (strictly) htr-united.yml, containing all the metadata on the ground truths produced by the repository (see https://htr-united.github.io/document-your-data.html);
e) A CITATION.cff file for citing the repository.
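To make the points above easier to check before each push, here is a minimal sketch of a layout check in Python. It is only an illustration, not part of the HTR-United tooling: the function names are hypothetical, and it assumes that each exported XML file sits next to an image with the same file stem, which may not match how our exports are actually named.

```python
from pathlib import Path

# Files listed in points c), d) and e) above.
REQUIRED_FILES = ("README.md", "htr-united.yml", "CITATION.cff")

# Assumption: page images use one of these extensions.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def missing_required_files(repo_root):
    """Return the required top-level files that are absent from repo_root."""
    root = Path(repo_root)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def unaligned_transcripts(corpus_dir):
    """Return XML files that have no same-stem image in the same folder.

    Assumption: a transcript "page_1.xml" is aligned with "page_1.png"
    (or another image extension) stored alongside it.
    """
    unaligned = []
    for xml_path in sorted(Path(corpus_dir).rglob("*.xml")):
        has_image = any(
            sibling.stem == xml_path.stem
            and sibling.suffix.lower() in IMAGE_SUFFIXES
            for sibling in xml_path.parent.iterdir()
        )
        if not has_image:
            unaligned.append(xml_path)
    return unaligned
```

Calling `missing_required_files(".")` from the repository root and `unaligned_transcripts(...)` on a corpus folder would each return an empty list when everything is in place. The check only looks at file names; it does not validate the content of the XML or of htr-united.yml.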
Let me know what you think.