PDF Report Processing Script

This script is designed to process a set of PDF reports, extract specific information from them, and store the extracted data in a CSV file. It also moves the processed PDF files to an output folder and appends the first page of each PDF to a new output PDF file. The script uses various custom classes to perform the extraction tasks and relies on external libraries such as pdf2image, PyPDF2, and pandas.

How the script works

Import necessary libraries and modules.
Set the maximum number of image pixels to avoid errors when processing large images.
Get a list of all PDF files in the input directory.
Create an empty PDF object to store the appended first pages of the processed PDFs.
Define a function append_pdf() that reads a PDF file, extracts the first page, and adds it to the output PDF object.
Create instances of custom classes for extracting specific information from the PDFs, such as Name, ID, Date, Time, PaceMaker, HR, VenECT, HRVar, STSegAn, SupEct, Pauses, and Conclusions.
Initialize an empty pandas DataFrame to store the extracted data.
Iterate over the list of PDF files in the input directory: a. Convert the first page of the PDF to an image. b. Save the image as a JPEG file. c. Extract the required information from the image using the custom class instances. d. Create a dictionary containing the extracted data. e. Convert the dictionary to a pandas DataFrame and append it to the output DataFrame. f. Move the processed PDF file to the output folder and rename it with the extracted ID, Name, and Date. g. Call the append_pdf() function to append the first page of the processed PDF to the output PDF object. h. Calculate the time taken to process the PDF and estimate the remaining time to process all PDFs.
Print the first 10 rows of the output DataFrame and its shape.
Save the output DataFrame as a CSV file.

Usage

To use this script, you need to have the required libraries installed and the input PDF files placed in the "input" directory. The script will process the PDFs, extract the required information, and store it in a CSV file named "output.csv". The processed PDF files will be moved to the "output" directory, and the appended first pages will be saved as a new PDF file named "output.pdf".

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
__pycache__		__pycache__
input		input
README.md		README.md
classes.py		classes.py
main.py		main.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PDF Report Processing Script

How the script works

Usage

About

Releases

Packages

Languages

furkanbr/Report-Scrapper

Folders and files

Latest commit

History

Repository files navigation

PDF Report Processing Script

How the script works

Usage

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages