diff --git a/annexes/bibliography.md b/annexes/bibliography.md
index 98f962d..00fd29e 100644
--- a/annexes/bibliography.md
+++ b/annexes/bibliography.md
@@ -15,6 +15,12 @@ In this paper, the authors present a new way to exploit the ground truth created
 
 ----------
 
+**Balci B., Saadati D., Shiferaw D., *Handwritten Text Recognition using Deep Learning* [http://vision.stanford.edu/teaching/cs231n/reports/2017/pdfs/810.pdf](http://vision.stanford.edu/teaching/cs231n/reports/2017/pdfs/810.pdf)**
+
+In this article, the authors present the work they did to perform handwritten text recognition with deep learning, or more precisely handwritten character recognition, since they use a segmentation-and-recognition method in which the model learns characters individually rather than whole words. After a brief state of the art on the evolution of automatic recognition, they present the preprocessing they chose as well as the methods they used for vocabulary size, classification, training and segmentation. After giving the results of their experiments, they discuss the difficulty that handwriting poses for character recognition and hypothesise that a larger corpus might help them obtain better results.
+
+----------
+
 **Boschetti, Federico, Matteo Romanello, Alison Babeu, David Bamman, et Gregory Crane. 2009. *Improving OCR Accuracy for Classical Critical Editions*. In Research and Advanced Technology for Digital Libraries, édité par Maristella Agosti, José Borbinha, Sarantos Kapidakis, Christos Papatheodorou, et Giannis Tsakonas, 5714:156‑67. Berlin, Heidelberg: Springer Berlin Heidelberg. [https://doi.org/10.1007/978-3-642-04346-8_17](https://doi.org/10.1007/978-3-642-04346-8_17).**
 
 In this paper, the authors compared different software used for their classical editions, in Greek and Latin, and the results they obtain. They use equations, regex and outputs to check the accuracy of the OCR software and search which is the best and how to improve everything.
@@ -74,6 +80,12 @@ In this paper, the authors expose the system they established to do HTR. After i
 
 ----------
 
+**Lin, Junxia, Ledolter, Johannes. (2021). *A Simple and Practical Approach to Improve Misspellings in OCR Text.* [https://arxiv.org/abs/2106.12030](https://arxiv.org/abs/2106.12030)**
+
+In this article, the authors work on the identification and correction of non-word errors in OCR text. They start by presenting the various types of OCR errors that can be found, detailing them one after the other and explaining what they consist of. They then review the state of the art in OCR misspelling correction, followed by a presentation of the dataset used for their work and an explanation of the kinds of words included in their study. The main part of the article describes their proposed method, detailed step by step, and the results it produces, which are quite good according to the authors.
+
+----------
+
 **U. -. Marti and H. Bunke, "On the influence of vocabulary size and language models in unconstrained handwritten text recognition," Proceedings of Sixth International Conference on Document Analysis and Recognition, 2001, pp. 260-265, doi: [10.1109/ICDAR.2001.953795](https://doi.org/10.1109/ICDAR.2001.953795)**
 
 In this paper, the authors present an experiment of text recognition on handwritten text that takes into account the size of the vocabulary given to recognize the text. After introducing the steps done to the images and the text to help process it, they present the perplexity results of their experiments. They present the difference between using language models or not and the efficiency of recognition with the vocabulary size. They conclude by insisting on the importance of language models and ways needed to improve it.
@@ -92,6 +104,18 @@ In this paper, the authors dig deeper into OCR errors, by presenting the common
 
 ----------
 
+**Nguyen, Thi Tuyet Hai, Adam Jatowt, Mickael Coustaty, et Antoine Doucet. 2022. *Survey of Post-OCR Processing Approaches*. ACM Computing Surveys 54 (6):1‑37. [https://doi.org/10.1145/3453476](https://doi.org/10.1145/3453476).**
+
+In this article, the authors survey the various post-OCR processing approaches that exist. They begin with a short history of OCR and digitization, the problems that arose from the poor OCR output produced for most digitized texts at the time, and the consequences for tasks such as information retrieval or natural language processing. They reflect on what can be observed and modified in order to fix those issues, then formalise the post-OCR processing problem mathematically. They first present projects that relied on fully manual post-OCR processing, before turning to semi-automatic approaches, divided by type. They elaborate on lexical approaches and error models, which are isolated-word approaches mainly built on error detection and error correction with lists of candidate corrections. They then present other types of models such as feature-based machine learning, string-to-string transformation, and neural machine translation (NMT), and finally the last type of post-OCR processing, which focuses on neural networks and language models. In the next part, they introduce important, freely accessible resources for post-OCR processing: the metrics, which they explain thoroughly, the datasets used to test and validate post-OCR tools, which they present in detail, and the language resources that can help, detailed by language. They end the article by discussing the various evolutions of the methods mentioned throughout the paper.
+
+----------
+
+**Plamondon, Réjean, Sargur N. Srihari. *“On-Line and Off-Line Handwriting Recognition: A Comprehensive Survey.”* IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000): 63-84. [https://ieeexplore.ieee.org/document/824821](https://ieeexplore.ieee.org/document/824821)**
+
+This article surveys online and offline handwriting recognition. It begins by presenting the main elements of handwritten text recognition, what to look for and how it works, mostly for the online case. It then describes what constitutes online handwritten text recognition and its various uses. In a second part, it focuses on offline HTR, presents some preprocessing steps that can help with recognition, and gives examples and case studies of situations where HTR can be used (postal services or signature verification). It goes on to propose additional steps to help recognition, such as word recognizers, n-gram classes and lexical techniques. It concludes that good progress has been made in online HTR while offline HTR remains mainly a research topic, and expresses the hope that the techniques will improve, in English but also in other languages.
+
+----------
+
 **Reul Christian, Wick Christoph, Noeth Maximilian, Buettner Andreas, Wehner Maximilian, and Springmann Uwe. 2021. "Mixed Model OCR Training on Historical Latin Script for Out-of-the-Box Recognition and Finetuning". In *The 6th International Workshop on Historical Document Imaging and Processing* (*HIP '21*). Association for Computing Machinery, New York, NY, USA, 7–12. DOI:[https://doi.org/10.1145/3476887.3476910](https://doi.org/10.1145/3476887.3476910)**
 
 In this paper, the authors present their project of building a mixed model for a category of historical printings with polyfonts. After exposing previous works in that area, the authors expose their methodology, training and evaluation data, and their transcription guidelines. Then, they describe the various ways they did their experiments, with one option or another, as well as the errors they mostly encountered. Finally, they present the idea of finetuning the work rather than doing it from scratch, before concluding on those different results and what they can hope to do from there.
diff --git a/annexes/glossary.md b/annexes/glossary.md
index e83a099..570a30e 100644
--- a/annexes/glossary.md
+++ b/annexes/glossary.md
@@ -11,6 +11,8 @@ title: "Glossary"
 Formula: CER = Substitution(s) + Insertion(s) + Deletion(s) / Number of characters in the GT \\
 The lower the CER value (with 0 being a perfect score), the better the performance of the OCR model.
 
+**Convolutional Neural Network (CNN)**: CNNs are deep learning models used for processing and analysing visual data. They leverage filters and layers to recognise patterns and features within images.
+
 **Dropout**: Method to counter the overfitting. At each epoch, some neurons in the network will be deactivated aleatory (so that it is not the same each time), which means that the model will be trained with a different neuron configuration each time, which will produce slightly different models each time.
 
 **"Gold" corpus**: Data exclusively created and verified by humans, to obtain a perfect transcription.
diff --git a/annexes/tools.md b/annexes/tools.md
index e7c9dfc..e360ef7 100644
--- a/annexes/tools.md
+++ b/annexes/tools.md
@@ -4,6 +4,7 @@ title: "Tools"
 ---
 
 ##### KaMi App
+[https://huggingface.co/spaces/lterriel/kami-app](https://huggingface.co/spaces/lterriel/kami-app)
 KaMi stands for Kraken Model Inspector
 This tool evaluates the success of a transcription task comparing a correct transcription (reference) and a prediction. The results are then the Levenshtein distance of this evaluation, the Word Error Rate (WER), the Character Error Rate (CER), the Word Accuracy (Wacc), as well as some others statistics taken from the Speech Recognition domain.
 KaMi also offers the possibility to ignore some specificities from the transcription to obtain a more accurate analysis. Thus, it is possible to choose to ignore digits, punctuation, diacritics and cases from the transcription. The statistics at the end will be given with and without what was chosen before initializing the comparison. For example, if everything is selected, the statistics given will be one with a complete comparison, no specificities taken into account, then one with the digits ignored, one with the punctuation ignored, etc. and at the end, one with all the options ignored combined.
@@ -12,7 +13,7 @@ KaMi also offers the possibility to ignore some specificities from the transcrip
 
 ----------
 
 ##### eScriptorium
-[https://escriptorium.paris.inria.fr/](https://escriptorium.paris.inria.fr/)
+[https://escriptorium.inria.fr/](https://escriptorium.inria.fr/)
 eScriptorium is a digital text production pipeline for print and handwritten texts using machine learning techniques. Completely open-source and well-documented, it offers multiple ways to work on the transcription of corpora.
 
 First of all, eScriptorium offers the possibility to import and export many things. The images can be imported in different formats, such as TIFF, JPG or PDF, but also with their IIIF links, provided that the manifest is on hand. In the case where a transcription/segmentation is already available in an ALTO/PAGE XML format, it can also be imported into eScriptorium. Finally, models can be imported into eScriptorium to do segmentation, transcription or training. The interface also allows us to export from it, whether it is the transcription once it is done (in PAGE XML, ALTO XML or text format), the images or the new models. In summary, everything that has been imported can usually be exported right back.
 
 With eScriptorium, it is possible to do automatic and manual transcription. If a model is available to segment and/or transcribe the corpus, it can be applied to the images after being imported on the interface. Once it is done, it is still possible to modify this transcription manually, as the production of the model (the transcription) is available and modifiable. If no model can be used, eScriptorium also proposes manual transcription, line by line, according to the segmentation of the images.
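Since transcriptions can be exported in ALTO XML, a short sketch of how such an export might be read afterwards can be useful. The element names below follow the ALTO standard (TextLine elements containing String elements whose CONTENT attribute carries the text); the exact layout of an eScriptorium export may differ, and `export.xml` is only a placeholder file name:

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    # Strip the XML namespace so the snippet works regardless of the ALTO version declared.
    return tag.rsplit("}", 1)[-1]

def alto_lines(path):
    """Return the text of each TextLine found in an ALTO XML file."""
    root = ET.parse(path).getroot()
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) == "TextLine":
            words = [s.attrib.get("CONTENT", "")
                     for s in elem.iter() if local_name(s.tag) == "String"]
            lines.append(" ".join(words))
    return lines

for line in alto_lines("export.xml"):
    print(line)
```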