Bug with filetype causes null dereference #10

conjuncts · 2024-10-21T04:16:55Z

Hey VikParuchuri, thanks for the great library!

I ran into a rather strange bug (not really the fault of tabled) where filetype is unable to recognize the file type of a PDF, so filetype.guess returns None.

!wget -q -O bulk/3.pdf https://www.nature.com/articles/s41467-023-38544-z.pdf

import filetype
out = filetype.guess("./bulk/3.pdf")
print(out) # None

As a result, there is a null dereference:

fileinput.py:13, in load_pdfs_images(input_path, max_pages, start_page)

     [17](~/.../tabled/fileinput.py:17) return images, highres_images, names, text_lines
...
---> [52](~/.../surya/input/load.py:52)     if input_type.extension == "pdf":
     [53](~/.../surya/input/load.py:53)         return load_pdf(input_path, max_pages, start_page, dpi=dpi, load_text_lines=load_text_lines)
     [54](~/.../surya/input/load.py:54)     else:

AttributeError: 'NoneType' object has no attribute 'extension'

Maybe something like this could work?

def load_from_file(input_path, max_pages=None, start_page=None, dpi=settings.IMAGE_DPI, load_text_lines=False):
    input_type = filetype.guess(input_path)
    if (input_type and input_type.extension == "pdf") or input_path.endswith(".pdf"):
        return load_pdf(input_path, max_pages, start_page, dpi=dpi, load_text_lines=load_text_lines)
    else:
        return load_image(input_path)
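
A slightly more defensive variant, as a rough sketch only: it pulls the PDF check into a helper that tolerates a None guess, compares the extension case-insensitively, and finally looks for the %PDF- marker near the start of the file. The helper name looks_like_pdf, its guess parameter, and the 1024-byte window are my own assumptions for illustration, not anything that exists in surya or tabled.

```python
from pathlib import Path


def looks_like_pdf(input_path, guess=None):
    """Return True if the file is probably a PDF (hypothetical helper).

    `guess` is any callable shaped like filetype.guess: it takes a path and
    returns an object with an `extension` attribute, or None on failure.
    """
    # 1. Ask the detector, but never dereference a None result.
    kind = guess(input_path) if guess is not None else None
    if kind is not None and getattr(kind, "extension", None) == "pdf":
        return True

    # 2. Fall back to the file extension, ignoring case (.pdf, .PDF, ...).
    if Path(input_path).suffix.lower() == ".pdf":
        return True

    # 3. Last resort: look for the PDF header marker in the first bytes.
    #    The 1024-byte window is an assumption chosen for this sketch.
    try:
        with open(input_path, "rb") as fh:
            return b"%PDF-" in fh.read(1024)
    except OSError:
        # A missing or unreadable file is simply "not a PDF" here.
        return False
```

In load_from_file the condition would then become `if looks_like_pdf(input_path, guess=filetype.guess):`, with the else branch still calling load_image as before.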

Thanks!
