Refactor PDF Processing: Introduce PdfManager with Enhanced Extraction Methods #3

Open
wants to merge 21 commits into
base: main

slonyator

Overview

This pull request represents a comprehensive refactor of the initial PDF processing notebook. The goal is to enhance code readability, maintainability, and functionality. The changes introduce a new class-based architecture, offering a more modular and scalable approach to PDF text, image, and table extraction.

Major Changes

  • Introduction of PdfManager Class: Centralizes the PDF processing logic, providing a cleaner and more organized code structure. The PdfManager class orchestrates the extraction of text, images, and tables from PDF documents.
  • Class-Based Extractors: Developed PdfTextExtractor, PdfImageExtractor, and PdfTableExtractor classes. Each class focuses on a specific aspect of PDF processing (text, images, tables), adhering to the Single Responsibility Principle.
  • Enhanced Text Extraction: Implemented comprehensive text extraction, including formatting details, using the PdfTextExtractor.
  • Image Processing and OCR: The PdfImageExtractor handles image extraction from PDFs and employs OCR to extract text from images.
  • Table Extraction and Formatting: The PdfTableExtractor is responsible for extracting tables and converting them to a user-friendly string format.
  • Customizable Content Retrieval: The process_pdf method in PdfManager allows users to retrieve either the full text from all PDF pages or text from a specific page.
  • Clean-Up Functionality: Added a clean-up method within PdfManager to handle the removal of temporary files created during processing.
  • Switch to Poetry for Dependency Management: Transitioned from using pip to Poetry for more efficient dependency management and project reproducibility.
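
The structure described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this pull request: only the names PdfManager and process_pdf come from the description, while the `PageExtractor` protocol, the `extract` method, the `num_pages` and `temp_files` fields, the `page` parameter, and the name `clean_up` are assumptions made for the example. Real extractors would call into a PDF library; here they are left abstract so the sketch stays self-contained.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional, Protocol


class PageExtractor(Protocol):
    """Hypothetical interface shared by the text, image, and table extractors."""

    def extract(self, pdf_path: str, page_number: int) -> str:
        ...


@dataclass
class PdfManager:
    """Orchestrates the extractors; field names here are assumptions."""

    pdf_path: str
    num_pages: int
    extractors: List[PageExtractor]
    temp_files: List[Path] = field(default_factory=list)

    def process_pdf(self, page: Optional[int] = None) -> str:
        """Return text from one page, or from all pages when page is None."""
        if page is None:
            pages = range(self.num_pages)
        else:
            if not 0 <= page < self.num_pages:
                raise IndexError(f"page {page} out of range")
            pages = range(page, page + 1)

        chunks = []
        for number in pages:
            # Each extractor handles one concern (text, images, tables).
            for extractor in self.extractors:
                text = extractor.extract(self.pdf_path, number)
                if text:
                    chunks.append(text)
        return "\n".join(chunks)

    def clean_up(self) -> None:
        """Delete temporary files recorded during processing."""
        for path in self.temp_files:
            path.unlink(missing_ok=True)  # requires Python 3.8+
        self.temp_files.clear()
```

Keeping the extractors behind one small interface means a new content type could be supported by adding a class, without touching process_pdf.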

Benefits

  • Improved Code Readability: By breaking down the functionality into classes and methods, the code is more readable and easier to understand.
  • Increased Maintainability: The modular class-based design simplifies updates and maintenance.
  • Flexible and User-Friendly: Users can easily extract specific content from PDFs with minimal code, suitable for a variety of applications.
  • Enhanced Dependency Management: With Poetry, dependencies are managed more effectively, and the project setup becomes more straightforward and reproducible.

Conclusion

This refactor significantly improves the structure and capabilities of the initial PDF processing approach. It's a step forward in making PDF data extraction more accessible and efficient for various use cases and enhances the overall management of project dependencies.

slonyator and others added 21 commits December 5, 2023 09:42
A dedicated class was created for each of text, tables, and images in order to
have a better overview.
To improve code readability, the code for the "main" class was also
bundled into a class with helper functions.
Enhanced the process_pdf method in PdfManager to support returning
extracted text from either all pages as a single string, or from a
specific page. This update improves usability and flexibility for
users working with PDF text extraction.
Refactor PDF Processing: Introduce PdfManager with Enhanced Extraction Methods
refactor: All classes into a single file

temp: Changed PDF doc

feat: Remove text duplicates added

feat: fuzzy-wuzzy check f/ duplicate-detection

To check whether a text block has already been extracted (text is sometimes
recognized both as an image and as text and is therefore extracted twice), we
now use fuzzy string matching, which gives a similarity score estimating
whether two text blocks are identical. By applying this check we want to
tackle the issue of text duplicates.
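
The duplicate check described above could look roughly like this. It is a sketch under assumptions: the commit uses fuzzy-wuzzy, whereas this example substitutes `difflib.SequenceMatcher` from the standard library so it runs without extra dependencies. The function names and the threshold of 90 are illustrative and not taken from the pull request.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Similarity score in [0, 100]; case and whitespace runs are normalized."""
    a_norm = " ".join(a.lower().split())
    b_norm = " ".join(b.lower().split())
    return 100.0 * SequenceMatcher(None, a_norm, b_norm).ratio()


def remove_near_duplicates(blocks: List[str], threshold: float = 90.0) -> List[str]:
    """Keep the first occurrence of each block; drop later near-identical ones."""
    kept: List[str] = []
    for block in blocks:
        # A block is new only if it is dissimilar to everything kept so far.
        if all(similarity(block, seen) < threshold for seen in kept):
            kept.append(block)
    return kept
```

Each block is compared against every block already kept, so the cost grows quadratically with the number of blocks. The threshold trades missed duplicates against wrongly dropped text and would need tuning on real OCR output.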

style: Code blacked

style: variable renamed

refactor: Methods changed to static

doc: Doc-Strings added

style: File blacked
No real speed improvement so far, only a marginal one.
Replaced the existing string matching method with Jaro-Winkler
similarity.
Updated the remove_duplicated_text method in PdfManager to utilize
Jaro-Winkler for assessing text similarity.
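
For reference, Jaro-Winkler similarity can be computed as in the sketch below. This is a plain-Python version of the standard definition, written for illustration only. The commit does not say which library remove_duplicated_text calls, so the function names and the defaults (prefix scale 0.1, prefix capped at 4 characters) are assumptions based on the usual formulation.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2  # transpositions

    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)
```

Because of the prefix boost, the measure favors strings that agree at the start. It was designed for short strings such as names, so its behavior on long text blocks is worth checking against real documents.
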
Jaro-Winkler for String Comparison
# Conflicts:
#	pdf_image_extractor.py
#	pdf_manager.py
#	pdf_table_extractor.py
#	pdf_text_extractor.py