Refactor PDF Processing: Introduce PdfManager with Enhanced Extraction Methods #3
Open: slonyator wants to merge 21 commits into g-stavrakis:main from slonyator:main
Separate classes were created for text, tables, and images to give a better overview.
To improve code readability, the code for the "main" class was also bundled into a class with helper functions.
Enhanced the process_pdf method in PdfManager to support returning extracted text from either all pages as a single string, or from a specific page. This update improves usability and flexibility for users working with PDF text extraction.
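The two return modes described above can be illustrated with a small sketch. This is not the code from the pull request: the function name `select_text`, the `page_number` parameter, and the dictionary layout (page index mapped to a list of text blocks) are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def select_text(pages: Dict[int, List[str]], page_number: Optional[int] = None) -> str:
    """Return the text of one page, or of all pages joined in page order.

    `pages` is a hypothetical structure: page index -> extracted text blocks.
    """
    if page_number is not None:
        if page_number not in pages:
            raise KeyError(f"page {page_number} was not extracted")
        return "\n".join(pages[page_number])
    # No page requested: concatenate every page, sorted by page index.
    return "\n".join("\n".join(pages[n]) for n in sorted(pages))


extracted = {0: ["Title", "Intro paragraph"], 1: ["Second page text"]}
print(select_text(extracted))                 # all pages as a single string
print(select_text(extracted, page_number=1))  # only the page with index 1
```

Raising on an unknown page is one possible design choice; returning an empty string would be another, depending on how callers are expected to handle missing pages.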
Refactor PDF Processing: Introduce PdfManager with Enhanced Extraction Methods
refactor: All classes into a single file
temp: Changed PDF doc
feat: Remove text duplicates added
feat: fuzzy-wuzzy check for duplicate detection
To check whether a text block has already been extracted (text is sometimes recognized both as an image and as text, and is therefore extracted twice), we now use fuzzy matching, an NLP-style approach that gives a confidence score estimating whether two text blocks are identical. This logic is intended to tackle the issue of text duplicates.
style: Code blacked
style: variable renamed
refactor: Methods changed to static
doc: Doc-Strings added
style: File blacked
No real speed improvement so far, just a tiny bit.
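The fuzzy duplicate check described in the commits above can be sketched as follows. To keep the example free of third-party dependencies it uses the standard library's `difflib.SequenceMatcher` as a stand-in for fuzzy-wuzzy; the function names and the threshold of 90 are assumptions, not values taken from the pull request.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Score in [0, 100]; 100 means the normalized strings are identical."""
    # Normalize case and whitespace so layout differences do not lower the score.
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return 100.0 * SequenceMatcher(None, norm_a, norm_b).ratio()


def remove_near_duplicates(blocks: List[str], threshold: float = 90.0) -> List[str]:
    """Keep a block only if it is not too similar to any block already kept."""
    kept: List[str] = []
    for block in blocks:
        if all(similarity(block, seen) < threshold for seen in kept):
            kept.append(block)
    return kept


blocks = [
    "Quarterly revenue grew by 12 percent.",
    "Quarterly revenue grew by 12 percent",  # e.g. an OCR copy missing the period
    "Costs remained flat.",
]
print(remove_near_duplicates(blocks))
```

Each block is compared against every block kept so far, so the cost grows quadratically with the number of blocks; this is fine for a single page but may matter for large documents.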
fuzzy-wuzzy replaced by rapidfuzz
Replaced the existing string matching method with Jaro-Winkler similarity. Updated the remove_duplicated_text method in PdfManager to utilize Jaro-Winkler for assessing text similarity.
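For readers unfamiliar with the metric, here is a minimal pure-Python sketch of Jaro-Winkler similarity and a duplicate filter built on it. The pull request uses rapidfuzz for this; the code below is an independent illustration, and `drop_similar_blocks` and its 0.9 threshold are hypothetical. Some implementations apply the prefix boost only above a minimum Jaro score; this sketch always applies it.

```python
from typing import List


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Jaro score boosted for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1.0 - score)


def drop_similar_blocks(blocks: List[str], threshold: float = 0.9) -> List[str]:
    """Hypothetical helper: keep blocks that score below the threshold."""
    kept: List[str] = []
    for block in blocks:
        if all(jaro_winkler(block, seen) < threshold for seen in kept):
            kept.append(block)
    return kept


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
print(drop_similar_blocks(["Total: 42 units", "Total: 42 units.", "Appendix"]))
```

Because the prefix boost rewards strings that start the same way, Jaro-Winkler suits short blocks such as headings or table cells; for long paragraphs an edit-distance ratio may discriminate better.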
Jaro-Winkler for String Comparison
# Conflicts: # pdf_image_extractor.py # pdf_manager.py # pdf_table_extractor.py # pdf_text_extractor.py
No duplicates
Overview
This pull request represents a comprehensive refactor of the initial PDF processing notebook. The goal is to enhance code readability, maintainability, and functionality. The changes introduce a new class-based architecture, offering a more modular and scalable approach to PDF text, image, and table extraction.
Major Changes
PdfManager class: Centralizes the PDF processing logic, providing a cleaner and more organized code structure. The PdfManager class orchestrates the extraction of text, images, and tables from PDF documents.
PdfTextExtractor, PdfImageExtractor, and PdfTableExtractor classes: Each class focuses on a specific aspect of PDF processing (text, images, tables), adhering to the Single Responsibility Principle.
PdfTextExtractor handles text extraction from PDFs.
PdfImageExtractor handles image extraction from PDFs and employs OCR to extract text from images.
PdfTableExtractor is responsible for extracting tables and converting them to a user-friendly string format.
The process_pdf method in PdfManager allows users to retrieve either the full text from all PDF pages or text from a specific page.
PdfManager also handles the removal of temporary files created during processing.
Benefits
Conclusion
This refactor significantly improves the structure and capabilities of the initial PDF processing approach. It's a step forward in making PDF data extraction more accessible and efficient for various use cases and enhances the overall management of project dependencies.