edited OCR readme
arinkulshi-skylight committed Jan 9, 2025
1 parent bc27a73 commit 3c36b62
Showing 2 changed files with 113 additions and 49 deletions.
160 changes: 112 additions & 48 deletions OCR/README.md
@@ -1,44 +1,102 @@
## OCR
# OCR Layer - ReportVision

The **OCR Layer** in the ReportVision project processes document images, performs segmentation and optical character recognition (OCR), and computes accuracy metrics by comparing OCR outputs to ground truth data.

---

## Table of Contents
1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Running the Application](#running-the-application)
4. [Testing](#testing)
5. [End-to-End Benchmarking](#end-to-end-benchmarking)
6. [Dockerized Development](#dockerized-development)
7. [Development Tools](#development-tools)
8. [Contributing](#contributing)

---

## Introduction

The OCR layer uses **Poetry** for dependency management and virtual environment setup. It provides:
- An API for performing OCR operations.
- Support for benchmarking OCR accuracy.
- Configuration for different OCR models and segmentation templates.



### Installation

### Prerequisites
- Python 3.9 or later
- [Poetry](https://python-poetry.org/) for dependency management
- Docker (optional for containerized development)

```shell
pipx install poetry
```

### Running The Application
Activate the virtual environment and install dependencies. All subsequent commands assume you are inside the virtual environment.

```shell
poetry shell
poetry install
```

```shell
fastapi dev ocr/api.py
```

### Testing

Run unit tests

```shell
poetry run pytest
```

Run benchmark tests
### End to End Benchmarking

```shell
cd tests
poetry run pytest benchmark_test.py -v
```

poetry run pytest bench_test.py -v
#### Overview
End-to-end benchmarking evaluates OCR accuracy by:

Run main, hoping to convert this to a cli at some point
End-to-end benchmarking scripts can:

```shell
poetry run main
```
1. Segment and run OCR on a folder of images using given segmentation template and labels file.
2. Compare OCR outputs to ground truth data based on matching file names.
3. Write metrics (confidence, raw distance, Hamming distance, Levenshtein distance) as well as total metrics to a CSV file.
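As a rough illustration of two of the metrics listed above, here is a minimal sketch of Hamming and Levenshtein distance in plain Python. It is not the code from `metrics.py`; the function names are hypothetical, and raising an error on unequal lengths for Hamming distance is an assumption (the project may handle that case differently).

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    # Assumption: unequal lengths are rejected rather than padded.
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # Keep only the previous row of the dynamic-programming table.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]
```

Comparing an OCR output with its ground truth string would then be a call such as `levenshtein_distance(ocr_text, truth_text)`, where a lower value means a closer match.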

To build the OCR service into an executable artifact

```shell
poetry run build
```
To run benchmarking:

1. Locate file `benchmark_main.py`
2. Ensure all the paths/folders exist by downloading from [Google Drive for all segmentation/label files](https://drive.google.com/drive/folders/1WS2FYn0BTxWv0juh7lblzdMaFlI7zbDd?usp=sharing)
3. Ensure `ground_truth` folder and files exist
4. Ensure `labels.json` is in the correct format (see `tax_form_segmented_labels.json` as an example)
5. When running make sure to pass arguments in this order:

* `/path/to/image/folder` (path to the original image files which we need to run OCR on)
* `/path/to/segmentation_template.png` (single file)
* `/path/to/labels.json` (single file)
* `/path/to/output/folder` (path to folder where the output would be. This should exist but can be empty)
* `/path/to/ground/truth_folder` (path to folder for metrics that we would compare against)
* `/path/to/csv_out_folder` (path to folder where all metrics would be. This should exist but can be empty)

By default, segmentation, OCR, and metrics computation are all run together. To disable one or the other, pass the `--no-ocr` or `--no-metrics` flags. You can change the backend model by passing `--model=...` as well.
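The argument order and flags described above could be modelled with `argparse` roughly as follows. This is only a sketch under assumptions: it is not the actual parser in `benchmark_main.py`, the destination names (`image_folder`, `ocr`, `metrics`, and so on) are hypothetical, and using `trocr` as the default value for `--model` is inferred from the run notes.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build an illustrative parser; all argument names here are hypothetical."""
    parser = argparse.ArgumentParser(
        description="End-to-end OCR benchmarking (illustrative sketch)"
    )
    # Positional arguments, in the order the README lists them.
    parser.add_argument("image_folder")
    parser.add_argument("segmentation_template")
    parser.add_argument("labels_file")
    parser.add_argument("output_folder")
    parser.add_argument("ground_truth_folder")
    parser.add_argument("csv_out_folder")
    # Both stages run by default; each flag switches one stage off.
    parser.add_argument("--no-ocr", dest="ocr", action="store_false")
    parser.add_argument("--no-metrics", dest="metrics", action="store_false")
    # Assumed default, based on the run notes naming `trocr` as the default model.
    parser.add_argument("--model", default="trocr")
    return parser
```

With a parser shaped like this, passing `--no-metrics` after the six paths would leave `args.ocr` as `True` and set `args.metrics` to `False`.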

Run notes:
* Benchmark takes one second per segment for OCR using the default `trocr` model. Please be patient or set a counter to limit the number of files processed.
* Only one segment can be input at a time


### Test Data Sets

You can run the script `pytest run reportvision-dataset-1/medical_report_import.py` to pull in all relevant data.


### Development Tools

Adding new dependencies

@@ -82,44 +140,18 @@ To run the API in prod mode
poetry run api
```

### Test Data Sets

You can also run the script `pytest run reportvision-dataset-1/medical_report_import.py` to pull in all relevant data.


### Run end-to-end benchmarking

End-to-end benchmarking scripts can:

1. Segment and run OCR on a folder of images using given segmentation template and labels file.
2. Compare OCR outputs to ground truth data based on matching file names.
3. Write metrics (confidence, raw distance, Hamming distance, Levenshtein distance) as well as total metrics to a CSV file.


To run benchmarking:

1. Locate file `benchmark_main.py`
2. Ensure all the paths/folders exist by downloading from [Google Drive for all segmentation/label files](https://drive.google.com/drive/folders/1WS2FYn0BTxWv0juh7lblzdMaFlI7zbDd?usp=sharing)
3. Ensure `ground_truth` folder and files exist
4. Ensure `labels.json` is in the correct format (see `tax_form_segmented_labels.json` as an example)
5. When running make sure to pass arguments in this order:
To build the OCR service into an executable artifact

* `/path/to/image/folder` (path to the original image files which we need to run OCR on)
* `/path/to/segmentation_template.png` (single file)
* `/path/to/labels.json` (single file)
* `/path/to/output/folder` (path to folder where the output would be. This should exist but can be empty)
* `/path/to/ground/truth_folder` (path to folder for metrics that we would compare against)
* `/path/to/csv_out_folder` (path to folder where all metrics would be. This should exist but can be empty)
```shell
poetry run build
```

By default, segmentation, OCR, and metrics computation are all run together. To disable one or the other, pass the `--no-ocr` or `--no-metrics` flags. You can change the backend model by passing `--model=...` as well.

Run notes:
* Benchmark takes one second per segment for OCR using the default `trocr` model. Please be patient or set a counter to limit the number of files processed.
* Only one segment can be input at a time

### Dockerized Development

It is also possible to run the entire project in a collection of docker containers. This is useful for development and testing purposes as it doesn't require any additional dependencies to be installed on your local machine.
It is also possible to run the project in a collection of docker containers. This is useful for development and testing purposes as it doesn't require any additional dependencies to be installed.

To start the containers, run the following command:

@@ -132,6 +164,38 @@ This will start the following containers:
- ocr: The OCR service container
- frontend: The frontend container

The frontend container will automatically reload when changes are made to the frontend code. To access the frontend, navigate to http://localhost:5173 in your browser.
The frontend container will automatically reload when changes are made to the frontend. To access the frontend, navigate to http://localhost:5173 in your browser.

The OCR service container will restart automatically when changes are made to the OCR code. To access the API, navigate to http://localhost:8000/ in your browser.


## Project Architecture

The OCR Layer is organized as follows:

- **`ocr/`**:
  - **`api.py`**: Defines the API for the OCR service.
  - **`main.py`**: Entry point script to run the OCR service.
  - **`segmenter.py`**: Handles image segmentation based on templates and labels.
  - **`ocr_engine.py`**: OCR logic using the specified OCR models.
  - **`metrics.py`**: Computes metrics (e.g., confidence, Levenshtein distance) by comparing OCR results with ground truth.
  - **`config.py`**: Contains configuration for paths, environment variables, and model settings.

- **`tests/`**: Contains unit tests, integration tests, and benchmarking scripts.
  - **`benchmark_test.py`**: Tests benchmarking logic for OCR and metrics.
  - **`unit_test.py`**: Includes unit tests for individual components of the OCR service.
  - **`benchmark_main.py`**: Main script for running end-to-end benchmarking, including segmentation, OCR, and metrics computation.

- **`data/`**: Location of segmentation templates, labels, ground truth, and test datasets (not included in the repository by default).

- **`reportvision-dataset-1/`**: Example dataset folder for running benchmarks and tests.
  - **`medical_report_import.py`**: Script to import and prepare medical reports for testing.

- **`Dockerfile`**: Defines the container for running the OCR service in a Dockerized environment.

- **`dev-env.yaml`**: Docker Compose file for setting up a development environment with containers for the OCR service and frontend.

- **`pyproject.toml`**: Poetry configuration file specifying project dependencies and settings.

- **`poetry.lock`**: Lock file generated by Poetry to ensure dependency consistency.

2 changes: 1 addition & 1 deletion user_guide.md
@@ -18,7 +18,7 @@ ReportVision is a tool that automates the reading and extracting of labs from PD
3. [Tesseract5.5](https://formulae.brew.sh/formula/tesseract) (brew install tesseract)
4. [Java21](https://www.oracle.com/java/technologies/downloads/)
5. [PostgreSQL](https://www.postgresql.org/)
6. [Docker](https://www.docker.com/) (required for DB and middleware set up
6. [Docker](https://www.docker.com/) (required for DB and middleware set up)

### Installation and Development Guides

