Refactor PDF Processing: Introduce PdfManager with Enhanced Extraction Methods #3

slonyator · 2023-12-05T14:07:28Z

Overview

This pull request represents a comprehensive refactor of the initial PDF processing notebook. The goal is to enhance code readability, maintainability, and functionality. The changes introduce a new class-based architecture, offering a more modular and scalable approach to PDF text, image, and table extraction.

Major Changes

Introduction of PdfManager Class: Centralizes the PDF processing logic, providing a cleaner and more organized code structure. The PdfManager class orchestrates the extraction of text, images, and tables from PDF documents.
Class-Based Extractors: Developed PdfTextExtractor, PdfImageExtractor, and PdfTableExtractor classes. Each class focuses on a specific aspect of PDF processing (text, images, tables), adhering to the Single Responsibility Principle.
Enhanced Text Extraction: Implemented comprehensive text extraction, including formatting details, using the PdfTextExtractor.
Image Processing and OCR: The PdfImageExtractor handles image extraction from PDFs and employs OCR to extract text from images.
Table Extraction and Formatting: The PdfTableExtractor is responsible for extracting tables and converting them to a user-friendly string format.
Customizable Content Retrieval: The process_pdf method in PdfManager allows users to retrieve either the full text from all PDF pages or text from a specific page.
Clean-Up Functionality: Added a clean-up method within PdfManager to handle the removal of temporary files created during processing.
Switch to Poetry for Dependency Management: Transitioned from using pip to Poetry for more efficient dependency management and project reproducibility.
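
As a rough illustration of the structure described above, the sketch below shows a manager that delegates to single-purpose extractors and removes its temporary directory afterwards. It is not the code from this PR: the `extract` and `process_page` method names, the dictionary standing in for a parsed PDF page, and the stub extractor bodies are all assumptions made for the example, and the image/OCR extractor is left out to keep it dependency-free.

```python
from __future__ import annotations

import shutil
import tempfile
from pathlib import Path


class PdfTextExtractor:
    """Stub: the real class would read text and formatting from a PDF page."""

    def extract(self, page: dict) -> str:
        return page.get("text", "")


class PdfTableExtractor:
    """Stub: renders each table (a list of rows) as pipe-separated lines."""

    def extract(self, page: dict) -> str:
        tables = page.get("tables", [])
        return "\n\n".join(
            "\n".join(" | ".join(row) for row in table) for table in tables
        )


class PdfManager:
    """Coordinates the extractors and owns the temporary working directory."""

    def __init__(self) -> None:
        self.extractors = [PdfTextExtractor(), PdfTableExtractor()]
        # Hypothetical scratch space, e.g. for images written out before OCR.
        self.tmp_dir = Path(tempfile.mkdtemp(prefix="pdf_manager_"))

    def process_page(self, page: dict) -> str:
        parts = [extractor.extract(page) for extractor in self.extractors]
        return "\n".join(part for part in parts if part)

    def clean_up(self) -> None:
        # Remove every temporary file created during processing.
        shutil.rmtree(self.tmp_dir, ignore_errors=True)


# A plain dict stands in for a parsed PDF page (assumption for this sketch).
page = {
    "text": "Revenue by year",
    "tables": [[["Year", "Revenue"], ["2023", "10"]]],
}
manager = PdfManager()
try:
    result = manager.process_page(page)
finally:
    manager.clean_up()
```

Calling `clean_up` in a `finally` block means the temporary directory is removed even if extraction raises.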

Benefits

Improved Code Readability: By breaking down the functionality into classes and methods, the code is more readable and easier to understand.
Increased Maintainability: The modular class-based design simplifies updates and maintenance.
Flexible and User-Friendly: Users can easily extract specific content from PDFs with minimal code, suitable for a variety of applications.
Enhanced Dependency Management: With Poetry, dependencies are managed more effectively, and the project setup becomes more straightforward and reproducible.

Conclusion

This refactor significantly improves the structure and capabilities of the initial PDF processing approach. It's a step forward in making PDF data extraction more accessible and efficient for various use cases and enhances the overall management of project dependencies.

refactor: All classes into a single file
temp: Changed PDF doc
feat: Remove text duplicates added
feat: fuzzy-wuzzy check f/ duplicate-detection

To check whether a text block has already been extracted (text is sometimes recognized both as an image and as text, and is therefore extracted twice), we now use fuzzy string matching, which gives a confidence score estimating whether two text blocks are identical. Applying this logic is intended to tackle the issue of text duplicates.

style: Code blacked
style: variable renamed
refactor: Methods changed to static
doc: Doc-Strings added
style: File blacked
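
The duplicate check described in this commit message could look roughly like the sketch below. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for the fuzzy-wuzzy score, so the numbers will not match what the PR's code produces; the function names and the 0.9 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def is_duplicate(candidate: str, seen: list[str], threshold: float = 0.9) -> bool:
    """Return True if `candidate` is near-identical to any block in `seen`."""
    return any(
        SequenceMatcher(None, candidate, other).ratio() >= threshold
        for other in seen
    )


def drop_near_duplicates(blocks: list[str], threshold: float = 0.9) -> list[str]:
    """Keep the first occurrence of each block, skipping near-identical repeats."""
    unique: list[str] = []
    for block in blocks:
        if not is_duplicate(block, unique, threshold):
            unique.append(block)
    return unique


blocks = [
    "Quarterly revenue rose 5%.",  # extracted from the text layer
    "Quarterly revenue rose 5%",   # same sentence recovered again via OCR
    "Costs were flat.",
]
deduplicated = drop_near_duplicates(blocks)
```

Each block is compared against every block already kept, so the cost grows quadratically with the number of blocks, which is one reason the speed of the similarity function matters.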

slonyator and others added 21 commits December 5, 2023 09:42

feat: Switched to poetry

8f9e12f

feat: Init f/ PDF Manager

196fd25

refactor: Code bundled into classes

442058c

A dedicated class was created for each of text, tables, and images in order to give a better overview.

maintenance: packages updated

85a18ef

.gitignore added

d89502c

style: Type Annotations added

8d3e620

fix: Typo

ab4fa23

refactor: Bundled main functionality in a class

d7d0719

To increase code readability, the code for the "main" class was also bundled into a class with helper functions.

feat: extract text from all/specific PDF pages

687b59e

Enhanced the process_pdf method in PdfManager to support returning extracted text from either all pages as a single string, or from a specific page. This update improves usability and flexibility for users working with PDF text extraction.
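
A minimal sketch of the all-pages versus single-page behaviour follows. The real `process_pdf` method lives on `PdfManager` and its signature is not shown in this PR description, so the parameter names, the zero-based page index, and the list of already-extracted page strings used as input are assumptions.

```python
from typing import List, Optional


def process_pdf(pages: List[str], page_number: Optional[int] = None) -> str:
    """Return the text of all pages as one string, or of a single page.

    `pages` holds the already-extracted text of each page (hypothetical input);
    `page_number` is zero-based, and None means "all pages".
    """
    if page_number is None:
        return "\n".join(pages)
    if not 0 <= page_number < len(pages):
        raise IndexError(f"page {page_number} out of range (0..{len(pages) - 1})")
    return pages[page_number]


pages = ["Title page", "Introduction", "Results"]
full_text = process_pdf(pages)
second_page = process_pdf(pages, page_number=1)
```

Using `None` as the default keeps the common case (all pages) a one-argument call, while an explicit range check avoids Python's negative indexing silently returning a page from the end.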

style: Code blacked

e85db0b

Merge pull request #1 from slonyator/code-refactoring

be8ee7b

Refactor PDF Processing: Introduce PdfManager with Enhanced Extraction Methods

feat: fuzzy-wuzzy replaced by rapidfuzz

c9773c4

No real speed improvement so far, just a tiny bit.

Merge pull request #2 from slonyator/speedup-fuzzy-comparison

4d70082

fuzzy-wuzzy replaced by rapidfuzz

feat: Jaro-Winkler f/ string comparison

aa8a204

Replaced the existing string matching method with Jaro-Winkler similarity. Updated the remove_duplicated_text method in PdfManager to utilize Jaro-Winkler for assessing text similarity.
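
For readers unfamiliar with the metric, here is a pure-Python sketch of Jaro-Winkler similarity and of how a `remove_duplicated_text`-style filter could use it. The PR's actual method body is not shown here, so the function below only borrows its name; the 0.95 threshold is an assumption, and in practice a library implementation would normally be used instead of this illustrative one.

```python
from typing import List


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


def remove_duplicated_text(blocks: List[str], threshold: float = 0.95) -> List[str]:
    """Keep a block only if it is not too similar to any block already kept."""
    kept: List[str] = []
    for block in blocks:
        if all(jaro_winkler(block, other) < threshold for other in kept):
            kept.append(block)
    return kept


cleaned = remove_duplicated_text(["Total revenue 2023", "Total revenue 2023 ", "Appendix"])
```

The prefix boost makes the score more forgiving of differences near the end of a string, such as trailing whitespace or punctuation introduced by OCR.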

chore: Unnecessary packages removed

8f5fb43

Merge pull request #3 from slonyator/add-jaro-winkler

e3a1fd0

Jaro-Winkler for String Comparison

chore: Unnecessary files removed

b94f126

Merge branch 'main' into no-duplicates

57e3637

# Conflicts:
#   pdf_image_extractor.py
#   pdf_manager.py
#   pdf_table_extractor.py
#   pdf_text_extractor.py

feat: Minor changes f/ table display

5cb4e58

Merge pull request #4 from slonyator/no-duplicates

6a22c78

No duplicates
