From 397df103111ead5220a51aee80a3738c99102c15 Mon Sep 17 00:00:00 2001
From: "Handi (Andy) Xie"
Date: Thu, 15 Feb 2024 14:53:34 -0500
Subject: [PATCH 1/2] Update Technical Project Document Format

Updating Technical Project Document Format, thank you William
---
 Spring2024/Technical Project Document | 34 ++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/Spring2024/Technical Project Document b/Spring2024/Technical Project Document
index c6a6c19..1e1ba57 100644
--- a/Spring2024/Technical Project Document
+++ b/Spring2024/Technical Project Document
@@ -1,48 +1,66 @@
 # Technical Project Document
-George Trammell, Max Karambelas, Andy Xie
-2024-Feb-8 0.0.0-dev
-# Overview
+### George Trammell, Max Karambelas, Andy Xie - 2024-Feb-8 0.0.1-dev
+## Overview
 In this document, based on the available project outline and summary of the project pitch, to the best of your abilities, you will come up with the technical plan or goals for implementing the project such that it best meets the stakeholder requirements.
+
 A. Provide a solution in terms of human actions to confirm if the task is within the scope of automation through AI.
 Manually identifying and segmenting the label from the herbarium sheet.
 Reading and transcribing the text from the label, which includes taxon, geography, collection code, barcode, location, date collected, collector name, collector number, and habitat.
 Entering the transcribed data into a database.
 Validating the accuracy of the transcription against known data.
+
 B. Problem Statement: The project aims to automate the transcription of handwritten labels from herbarium specimens into a digital format. Specifically, it is a machine learning problem that involves developing and improving OCR (Optical Character Recognition) models, with a focus on LSTM-RNN and Transformer-based deep learning models, to accurately recognize and transcribe text from images of specimen labels.
 This includes enhancing OCR functionality for Chinese characters and integrating metadata and contextual information to improve accuracy.
+
 C. Checklist for project completion
 Provide a bulleted list to the best of your current understanding, of the concrete technical goals and artifacts that, when complete, define the completion of the project. This checklist will likely evolve as your project progresses.
 Develop an improved OCR model capable of handling Chinese characters.
 Test and validate the OCR model's accuracy on a dataset of pre-1940 plant specimen images.
 Incorporate metadata and contextual information into the model to enhance accuracy.
 Create clean code and thorough documentation for the project.
+
 D. Outline a path to operationalization.
 For this refined project focusing on the improvement of OCR functionality for digitizing natural history specimens, particularly with an emphasis on Chinese characters, and building a public repository, operationalization involves specific technological solutions and collaboration strategies. The project aims to enhance OCR accuracy by incorporating advanced deep learning models such as LSTM-RNN and Transformer models, while also considering the use of metadata and contextual information (e.g., location, collector details) as knowledge priors to improve classification processes. This necessitates a multi-faceted approach involving data gathering from specified sources, model refinement, and the creation of a publicly accessible repository for disseminating the results.
 To make the project's outcomes accessible and usable beyond a Jupyter notebook or initial proof of concept, a web-based platform or API could be developed, allowing researchers and the public to upload herbarium images for OCR processing. This platform could be hosted on cloud services like AWS, Google Cloud, or Azure, providing scalable resources for processing and storage.
 GitHub will serve as the repository for both the codebase and the dataset, facilitating collaboration and open-source contributions. Technologies like Docker could be employed to containerize the application, ensuring ease of deployment and compatibility across different environments. Additionally, integrating the project's outputs into existing databases or platforms frequented by climate change scientists and biodiversity researchers, such as the GBIF, could further extend its impact and utility.
-
-
-# Resources
-# Data Sets
+## Resources
+### Data Sets
 CNH Portal: https://portal.neherbaria.org/portal/
+
 Pre-1940 plant specimen images in GBIF: https://www.gbif.org/occurrence/gallery?basis_of_record=PRESERVED_SPECIMEN&media_ty[…]axon_key=6&year=1000,1941&advanced=1&occurrence_status=present
+
 International Plant Names Index: https://www.gbif.org/dataset/046bbc50-cae2-47ff-aa43-729fbf53f7c5#dataDescription
+
 Use for synonyms (GBIF is recommended):
 GBIF: https://hosted-datasets.gbif.org/datasets/backbone/current/
+
 IPNI: https://storage.cloud.google.com/ipni-data/
+
 CVIT: https://cvit.iiit.ac.in/research/projects/cvit-projects/matchdocimgs
+
 IAM: https://fki.tic.heia-fr.ch/databases/iam-handwriting-database
-# References
+
+### References
 CRAFT (text detection): https://arxiv.org/abs/1904.01941
+
 TrOCR: https://arxiv.org/abs/2109.10282
+
 "What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis," 10.1109/ICCV.2019.00481
+
 Kubeflow: https://www.kubeflow.org/docs/
+
 Hugging Face Spaces: https://huggingface.co/docs/hub/spaces
+
 GCP Vertex AI: https://cloud.google.com/vertex-ai/docs
+
 AWS SageMaker: https://docs.aws.amazon.com/sagemaker/index.html
+
 TensorFlow Serving: https://github.com/tensorflow/serving
+
 TorchServe: https://github.com/pytorch/serve
+
 # Weekly Meeting Updates
+
 Keep track of ongoing meetings in the Project Description document prepared by Spark staff for your project.
 Note: Once this markdown is finalized and merge, the contents of this should also be appended to the Project Description document.

From e348e5b3065c19fffe647198a1d2b2d6d64ca74a Mon Sep 17 00:00:00 2001
From: "Handi (Andy) Xie"
Date: Fri, 16 Feb 2024 18:17:21 -0500
Subject: [PATCH 2/2] Update and rename Technical Project Document to Technical Project Document.md

---
 ...echnical Project Document => Technical Project Document.md} | 3 +++
 1 file changed, 3 insertions(+)
 rename Spring2024/{Technical Project Document => Technical Project Document.md} (96%)

diff --git a/Spring2024/Technical Project Document b/Spring2024/Technical Project Document.md
similarity index 96%
rename from Spring2024/Technical Project Document
rename to Spring2024/Technical Project Document.md
index 1e1ba57..b7f6bb5 100644
--- a/Spring2024/Technical Project Document
+++ b/Spring2024/Technical Project Document.md
@@ -64,3 +64,6 @@ TorchServe: https://github.com/pytorch/serve
 
 Keep track of ongoing meetings in the Project Description document prepared by Spark staff for your project.
 Note: Once this markdown is finalized and merge, the contents of this should also be appended to the Project Description document.
+## Temp Link
+https://docs.google.com/document/d/1AkQW9WFcBbHqGl8Js3KIth1u3vtOKAgWTyO3nsYgzYI/edit?usp=sharing
+Will update to github repo at the end of semester.