edited OCR readme

CDCgov · Jan 9, 2025 · 3c36b62
1 parent bc27a73
commit 3c36b62
Showing 2 changed files with 113 additions and 49 deletions.
diff --git a/OCR/README.md b/OCR/README.md
@@ -1,44 +1,102 @@
-## OCR
+# OCR Layer - ReportVision
+
+The **OCR Layer** in the ReportVision project processes document images, performs segmentation and optical character recognition (OCR), and computes accuracy metrics by comparing OCR outputs to ground truth data.
+
+---
+
+## Table of Contents
+1. [Introduction](#introduction)
+2. [Installation](#installation)
+3. [Running the Application](#running-the-application)
+4. [Testing](#testing)
+5. [End-to-End Benchmarking](#end-to-end-benchmarking)
+6. [Dockerized Development](#dockerized-development)
+7. [Development Tools](#development-tools)
+8. [Contributing](#contributing)
+
+---
+
+## Introduction
+
+The OCR layer uses **Poetry** for dependency management and virtual environment setup. It provides:
+- An API for performing OCR operations.
+- Support for benchmarking OCR accuracy.
+- Configuration for different OCR models and segmentation templates.
+
+
 
 ### Installation
 
+### Prerequisites
+- Python 3.9 or later
+- [Poetry](https://python-poetry.org/) for dependency management
+- Docker (optional for containerized development)
+
 ```shell
 pipx install poetry
 ```
 
+### Running The Application
 Activate the virtual environment and install dependencies. All subsequent commands assume you are in the virtual environment.
 
 ```shell
 poetry shell
 poetry install
 ```
 
+```shell
+fastapi dev ocr/api.py
+```
+
+### Testing
+
 Run unit tests
 
-```shell
+```shell 
 poetry run pytest
 ```
 
-Run benchmark tests
+### End to End Benchmarking
 
-```shell
-cd tests
-poetry run pytest benchmark_test.py -v
-```
 
-poetry run pytest bench_test.py -v
+#### Overview
+End-to-end benchmarking evaluates OCR accuracy by comparing OCR outputs to ground truth data.
 
-Run main, hoping to convert this to a cli at some point
+End-to-end benchmarking scripts can:
 
-```shell
-poetry run main
-```
+1. Segment and run OCR on a folder of images using a given segmentation template and labels file.
+2. Compare OCR outputs to ground truth data based on matching file names.
+3. Write metrics (confidence, raw distance, Hamming distance, Levenshtein distance) as well as total metrics to a CSV file.
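The two distance metrics named in step 3 can be illustrated with a minimal Python sketch. This shows the textbook definitions only and is not taken from the project's `metrics.py`; the function names `hamming_distance` and `levenshtein_distance` are hypothetical, and the project may treat unequal-length strings differently for the Hamming case.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count positions where the characters differ (equal-length strings only)."""
    if len(a) != len(b):
        # Assumption: the classic definition rejects unequal lengths.
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]
```

A lower distance between the OCR output and the ground truth string indicates a closer match, with zero meaning the two strings are identical.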
 
-To build the OCR service into an executable artifact
 
-```shell
-poetry run build
-```
+To run benchmarking:
+
+1. Locate the file `benchmark_main.py`.
+2. Ensure all the paths/folders exist by downloading from [Google Drive for all segmentation/label files](https://drive.google.com/drive/folders/1WS2FYn0BTxWv0juh7lblzdMaFlI7zbDd?usp=sharing).
+3. Ensure the `ground_truth` folder and its files exist.
+4. Ensure `labels.json` is in the correct format (see `tax_form_segmented_labels.json` as an example).
+5. When running, pass the arguments in this order:
+
+* `/path/to/image/folder` (path to the original image files to run OCR on)
+* `/path/to/segmentation_template.png` (single file)
+* `/path/to/labels.json` (single file)
+* `/path/to/output/folder` (path to the folder where the output is written; it should exist but can be empty)
+* `/path/to/ground/truth_folder` (path to the folder of ground truth data that metrics are compared against)
+* `/path/to/csv_out_folder` (path to the folder where all metrics are written; it should exist but can be empty)
+
+By default, segmentation, OCR, and metrics computation are all run together. To skip OCR or metrics computation, pass the `--no-ocr` or `--no-metrics` flag, respectively. You can also change the backend model by passing `--model=...`.
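The positional order and optional flags described above can be sketched as a small Python helper that assembles the command line. Only the argument order and the `--no-ocr`, `--no-metrics`, and `--model` flags come from the text above; the helper name `build_benchmark_command`, the `poetry run python` launcher, and the script location `tests/benchmark_main.py` are assumptions made for illustration.

```python
from typing import List, Optional


def build_benchmark_command(
    image_folder: str,
    segmentation_template: str,
    labels_file: str,
    output_folder: str,
    ground_truth_folder: str,
    csv_out_folder: str,
    run_ocr: bool = True,
    run_metrics: bool = True,
    model: Optional[str] = None,
    script: str = "tests/benchmark_main.py",  # assumed script location
) -> List[str]:
    """Assemble the benchmark invocation with arguments in the documented order."""
    # Assumption: the script is launched through Poetry's virtual environment.
    command = [
        "poetry", "run", "python", script,
        image_folder,
        segmentation_template,
        labels_file,
        output_folder,
        ground_truth_folder,
        csv_out_folder,
    ]
    if not run_ocr:
        command.append("--no-ocr")
    if not run_metrics:
        command.append("--no-metrics")
    if model is not None:
        command.append(f"--model={model}")
    return command
```

The returned list can be passed to `subprocess.run` once the output and CSV folders have been created, since both are expected to exist before the run.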
+
+Run notes:
+* Benchmarking takes about one second per segment for OCR using the default `trocr` model. Please be patient or set a counter to limit the number of files processed.
+* Only one segment can be input at a time.
+
+
+### Test Data Sets
+
+You can run the script `pytest run reportvision-dataset-1/medical_report_import.py` to pull in all relevant data.
+
+
+### Development Tools
 
 Adding new dependencies
 
@@ -82,44 +140,18 @@ To run the API in prod mode
 poetry run api
 ```
 
-### Test Data Sets
-
-You can also run the script `pytest run reportvision-dataset-1/medical_report_import.py` to pull in all relevant data.
-
-
-### Run end-to-end benchmarking
-
-End-to-end benchmarking scripts can:
-
-1. Segment and run OCR on a folder of images using given segmentation template and labels file.
-2. Compare OCR outputs to ground truth data based on matching file names.
-3. Write metrics (confidence, raw distance, Hamming distance, Levenshtein distance) as well as total metrics to a CSV file.
-
 
-To run benchmarking:
-
-1. Locate file `benchmark_main.py`
-2. Ensure all the paths/folders exist by downloading from [Google Drive for all segmentation/label files](https://drive.google.com/drive/folders/1WS2FYn0BTxWv0juh7lblzdMaFlI7zbDd?usp=sharing)
-3. Ensure `ground_truth` folder and files exist
-4. Ensure `labels.json` is in the correct format (see `tax_form_segmented_labels.json` as an example) 
-5. When running make sure to pass arguments in this order:
+To build the OCR service into an executable artifact
 
-* `/path/to/image/folder` (path to the original image files which we need to run OCR on)
-* `/path/to/segmentation_template.png` (single file)
-* `/path/to/labels.json` (single file)
-* `/path/to/output/folder` (path to folder where the output would be. This should exist but can be empty)
-* `/path/to/ground/truth_folder` (path to folder for metrics that we would compare against)
-* `/path/to/csv_out_folder` (path to folder where all metrics would be. This should exist but can be empty)
+```shell
+poetry run build
+```
 
-By default, segmentation, OCR, and metrics computation are all run together. To disable one or the other, pass the `--no-ocr` or `--no-metrics` flags. You can change the backend model by passing `--model=...` as well.
 
-Run notes:
-* Benchmark takes one second per segment for OCR using the default `trocr` model. Please be patient or set a counter to limit the number of files processed.
-* Only one segment can be input at a time
 
 ### Dockerized Development
 
-It is also possible to run the entire project in a collection of docker containers. This is useful for development and testing purposes as it doesn't require any additional dependencies to be installed on your local machine.
+It is also possible to run the project in a collection of docker containers. This is useful for development and testing purposes as it doesn't require any additional dependencies to be installed.
 
 To start the containers, run the following command:
 
@@ -132,6 +164,38 @@ This will start the following containers:
 - ocr: The OCR service container
 - frontend: The frontend container
 
-The frontend container will automatically reload when changes are made to the frontend code. To access the frontend, navigate to http://localhost:5173 in your browser.
+The frontend container will automatically reload when changes are made to the frontend. To access the frontend, navigate to http://localhost:5173 in your browser.
 
 The OCR service container will restart automatically when changes are made to the OCR code. To access the API, navigate to http://localhost:8000/ in your browser.
+
+
+## Project Architecture
+
+The OCR Layer is organized as follows:
+
+- **`ocr/`**:
+  - **`api.py`**: Defines the API for the OCR service.
+  - **`main.py`**: Entry point script to run the OCR service.
+  - **`segmenter.py`**: Handles image segmentation based on templates and labels.
+  - **`ocr_engine.py`**: Implements the OCR logic using the specified OCR models.
+  - **`metrics.py`**: Computes metrics (e.g., confidence, Levenshtein distance) by comparing OCR results with ground truth.
+  - **`config.py`**: Contains configuration for paths, environment variables, and model settings.
+
+- **`tests/`**: Contains unit tests, integration tests, and benchmarking scripts.
+  - **`benchmark_test.py`**: Tests benchmarking logic for OCR and metrics.
+  - **`unit_test.py`**: Includes unit tests for individual components of the OCR service.
+  - **`benchmark_main.py`**: Main script for running end-to-end benchmarking, including segmentation, OCR, and metrics computation.
+
+- **`data/`**: Location of segmentation templates, labels, ground truth, and test datasets (not included in the repository by default).
+
+- **`reportvision-dataset-1/`**: Example dataset folder for running benchmarks and tests.
+  - **`medical_report_import.py`**: Script to import and prepare medical reports for testing.
+
+- **`Dockerfile`**: Defines the container for running the OCR service in a Dockerized environment.
+
+- **`dev-env.yaml`**: Docker Compose file for setting up a development environment with containers for the OCR service and frontend.
+
+- **`pyproject.toml`**: Poetry configuration file specifying project dependencies and settings.
+
+- **`poetry.lock`**: Lock file generated by Poetry to ensure dependency consistency.
+
diff --git a/user_guide.md b/user_guide.md
@@ -18,7 +18,7 @@ ReportVision is a tool that automates the reading and extracting of labs from PD
 3. [Tesseract5.5](https://formulae.brew.sh/formula/tesseract) (brew install tesseract)
 4. [Java21](https://www.oracle.com/java/technologies/downloads/)
 5. [PostgreSQL](https://www.postgresql.org/)
-6. [Docker](https://www.docker.com/) (required for DB and middleware set up
+6. [Docker](https://www.docker.com/) (required for DB and middleware set up) 
 
 ### Installation and Development Guides