This repository has been archived by the owner on Jan 24, 2025. It is now read-only.

Error when loading from directory/folder #2

Closed
keyvan-najafy opened this issue Oct 16, 2024 · 1 comment

Comments

keyvan-najafy commented Oct 16, 2024

Hi,
Before I begin, thank you for creating such a great package!
I'm trying to load a bunch of PDFs from Google Drive into Google Colab and extract their tables.
When I run it on a single PDF (thus using the load_from_file functionality), everything works great, but when I give the load_pdfs_images function a directory path, I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-7ff012d2aaf5> in <cell line: 1>()
----> 1 images, highres_images, names, text_lines = load_pdfs_images(input_dir)


1 frames
/usr/local/lib/python3.10/dist-packages/surya/input/load.py in load_from_folder(folder_path, max_pages, start_page, dpi, load_text_lines)
     69             images.extend(image)
     70             names.extend(name)
---> 71             text_lines.extend(text_line)
     72         else:
     73             try:


TypeError: 'NoneType' object is not iterable
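
For context, this is the error that `list.extend` raises whenever it is handed `None` instead of an iterable. A minimal sketch that reproduces it without surya; `page_result` is a hypothetical stand-in for whatever the loader returned, not a name from the library:

```python
# Minimal reproduction, independent of surya: list.extend needs an iterable.
text_lines = []
page_result = None  # hypothetical stand-in for a loader result that is None

try:
    text_lines.extend(page_result)
    error_message = None
except TypeError as exc:
    # Same error type as in the traceback above.
    error_message = str(exc)

print(error_message)
```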

In surya/input/load.py, load_from_folder (as shown in the error above) calls the load_pdf function in the same module to assign a value to the text_line variable.
Inside load_pdf, the following code produces None for the text_lines variable:

    if load_text_lines:
        from surya.input.pdflines import get_page_text_lines # Putting import here because pypdfium2 causes warnings if its not the top import
        text_lines = get_page_text_lines(
            pdf_path,
            page_indices,
            [i.size for i in images]
        )

It seems that get_page_text_lines returns None when it reaches empty PDF pages, which raises an error for the whole process instead of skipping the empty page.
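
One possible guard, sketched under assumptions: the accumulation loop could check for `None` before extending. The code below does not use surya. `collect_from_paths` and `fake_loader` are hypothetical names, and the loader is assumed to return `(images, names, text_lines)` with `text_lines` possibly `None`, as described above. Padding with one `None` per page is an illustrative choice to keep the three lists aligned; whether downstream code accepts such placeholders is not confirmed here.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical loader result: (images, names, text_lines), where text_lines
# may be None, as described in the issue.
LoadResult = Tuple[list, list, Optional[list]]


def collect_from_paths(paths: List[str], loader: Callable[[str], LoadResult]) -> LoadResult:
    """Accumulate results from several files, tolerating a None text_lines."""
    images, names, text_lines = [], [], []
    for path in paths:
        image, name, text_line = loader(path)
        images.extend(image)
        names.extend(name)
        if text_line is None:
            # Assumption: pad with one placeholder per page so lists stay aligned.
            text_lines.extend([None] * len(image))
        else:
            text_lines.extend(text_line)
    return images, names, text_lines


def fake_loader(path: str) -> LoadResult:
    # Stand-in for a real PDF loader; "empty.pdf" simulates the failing case.
    if path == "empty.pdf":
        return ["page0"], ["empty"], None
    return ["page0", "page1"], ["doc", "doc"], [["line a"], ["line b"]]


images, names, text_lines = collect_from_paths(["doc.pdf", "empty.pdf"], fake_loader)
print(len(images), len(names), len(text_lines))
```

With this guard, a file whose text lines come back as `None` no longer aborts the whole folder; the affected pages simply carry a `None` entry.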

VikParuchuri (Owner) commented

Will fix this shortly
