ValueError: read length must be non-negative or -1 #1474

Closed
tynanseltzer opened this issue Dec 8, 2022 · 11 comments · Fixed by #1477

Comments

@tynanseltzer

tynanseltzer commented Dec 8, 2022

Replace this: What happened? What were you trying to achieve?

Environment

Which environment were you using when you encountered the problem?

$ python -m platform

macOS-10.16-x86_64-i386-64bit

$ python -c "import PyPDF2;print(PyPDF2.__version__)"

2.11.2

Code + PDF

This is a minimal, complete example that shows the issue:

def write_public(pdf_name, output_folders):
    split_names = gen_split_names(pdf_name)
    input_pdf = PdfFileReader(open(pdf_name, "rb"))
    num_partitions = len(output_folders)
    for num_person in range(len(split_names)):
        if num_person == len(split_names) - 1:
            end = input_pdf.numPages
        else:
            end = split_names[num_person+1][0]
        output = PdfFileWriter()
        for page_num in range(split_names[num_person][0], end):
            output.addPage(input_pdf.getPage(page_num))
        name = split_names[num_person][1][0] + split_names[num_person][1][1] + ".pdf"
        out_path = output_folders[num_person % num_partitions] + "/" + name
        with open(out_path, "wb") as outstream:
            output.write(outstream)
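
The page-range logic in the loop above can be sketched on its own, without PyPDF2. This is a minimal illustration, assuming each entry of split_names begins with the index of that person's first page; the helper name page_ranges and the sample numbers are hypothetical and not part of the reported code.

```python
def page_ranges(start_pages, num_pages):
    """Pair each start page with the page where the next section begins.

    The last section runs to the end of the document, mirroring the
    `end = input_pdf.numPages` branch in the reported code.
    """
    ranges = []
    for i, start in enumerate(start_pages):
        if i == len(start_pages) - 1:
            end = num_pages  # last person: read to the final page
        else:
            end = start_pages[i + 1]  # stop where the next person starts
        ranges.append((start, end))
    return ranges


# Hypothetical input: three people starting at pages 0, 3 and 7 of a 10-page PDF.
example = page_ranges([0, 3, 7], 10)
```

Each (start, end) pair is half-open, so it can be passed straight to range(start, end) as the loop does.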

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

I (legally) can't share the PDF, but I can say that I ran this code on some 240 pdfs, and it broke on only this one pdf. The only thing I find different about this pdf is a bunch of math equations in LaTeX. Please feel free to close this issue, I understand non reproducible bug reports aren't exactly useful, but figured I'd give it a shot.

Traceback

This is the complete Traceback I see:
File "/Users/tynanseltzer/test2/splitter.py", line 135, in <module>
write_public(PUBLIC_PDF, fls)
File "/Users/tynanseltzer/test2/splitter.py", line 64, in write_public
output.write(outstream)
File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 832, in write
self.write_stream(stream)
File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 805, in write_stream
self._sweep_indirect_references(self._root)
File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 954, in _sweep_indirect_references
data = self._resolve_indirect_object(data)
File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 999, in _resolve_indirect_object
real_obj = data.pdf.get_object(data)
File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_reader.py", line 1222, in get_object
retval = read_object(self.stream, self) # type: ignore
File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/generic/_data_structures.py", line 872, in read_object
stream.read(-20)
ValueError: read length must be non-negative or -1

@pubpub-zz
Collaborator

@tynanseltzer
This bug is simple to fix; however, it sits just before the point where another exception is raised.
Can you replace line 872,
stream.read(-20), with stream.seek(-20, 1)
and rerun your program? You should still get an error starting with "Invalid Elementary...".
Can you report the full message you get?
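
For anyone following along, the difference between the two calls can be sketched with the standard library alone. This assumes the stream is a buffered binary reader, like the one returned by open(name, "rb"); the sample bytes and offsets are made up for illustration. A buffered reader rejects a negative length other than -1, whereas seek(-20, 1) moves 20 bytes back from the current position, so the next read shows the context around the failure.

```python
import io

# Hypothetical data standing in for the PDF bytes (50 bytes in total).
data = b"0123456789" * 5

# Wrap the bytes in a BufferedReader to mimic a file opened with "rb".
stream = io.BufferedReader(io.BytesIO(data))
stream.seek(40)

# read(-20) is not a way to step backwards: the reader raises ValueError.
try:
    stream.read(-20)
    read_error = None
except ValueError as exc:
    read_error = exc

# seek(-20, 1) rewinds 20 bytes relative to the current position (whence=1).
stream.seek(-20, 1)
position = stream.tell()
context = stream.read(20)
```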

@tynanseltzer
Author

tynanseltzer commented Dec 8, 2022

Here you go. Thank you!

Traceback (most recent call last):
  File "/Users/tynanseltzer/msri/splitter.py", line 135, in <module>
    write_public(PUBLIC_PDF, fls)
  File "/Users/tynanseltzer/msri/splitter.py", line 64, in write_public
    output.write(outstream)
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 832, in write
    self.write_stream(stream)
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 805, in write_stream
    self._sweep_indirect_references(self._root)
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 954, in _sweep_indirect_references
    data = self._resolve_indirect_object(data)
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 999, in _resolve_indirect_object
    real_obj = data.pdf.get_object(data)
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_reader.py", line 1222, in get_object
    retval = read_object(self.stream, self)  # type: ignore
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/generic/_data_structures.py", line 873, in read_object
    raise PdfReadError(
PyPDF2.errors.PdfReadError: Invalid Elementary Object starting with b'e' @17651045: b'\nendobj \n5746 0 obj endobj\nendobj \n5747 0 obj [/ICCBased 5772 0 R]\nendobj \n5748 '

@pubpub-zz
Collaborator

This looks like a null object (cf. pdf-association/safedocs#2 (comment)).
@tynanseltzer can you replace the read_object function with this code and confirm that the error disappears?

def read_object(
    stream: StreamType,
    pdf: Any,  # PdfReader
    forced_encoding: Union[None, str, List[str], Dict[int, str]] = None,
) -> Union[PdfObject, int, str, ContentStream]:
    tok = stream.read(1)
    stream.seek(-1, 1)  # reset to start
    if tok == b"/":
        return NameObject.read_from_stream(stream, pdf)
    elif tok == b"<":
        # hexadecimal string OR dictionary
        peek = stream.read(2)
        stream.seek(-2, 1)  # reset to start

        if peek == b"<<":
            return DictionaryObject.read_from_stream(stream, pdf, forced_encoding)
        else:
            return read_hex_string_from_stream(stream, forced_encoding)
    elif tok == b"[":
        return ArrayObject.read_from_stream(stream, pdf, forced_encoding)
    elif tok == b"t" or tok == b"f":
        return BooleanObject.read_from_stream(stream)
    elif tok == b"(":
        return read_string_from_stream(stream, forced_encoding)
    elif tok == b"e" and stream.read(6) == b"endobj" :
        return NullObject.read_from_stream(stream)
    elif tok == b"n":
        return NullObject.read_from_stream(stream)
    elif tok == b"%":
        # comment
        while tok not in (b"\r", b"\n"):
            tok = stream.read(1)
            # Prevents an infinite loop by raising an error if the stream is at
            # the EOF
            if len(tok) <= 0:
                raise PdfStreamError("File ended unexpectedly.")
        tok = read_non_whitespace(stream)
        stream.seek(-1, 1)
        return read_object(stream, pdf, forced_encoding)
    elif tok in b"0123456789+-.":
        # number object OR indirect reference
        peek = stream.read(20)
        stream.seek(-len(peek), 1)  # reset to start
        if IndirectPattern.match(peek) is not None:
            return IndirectObject.read_from_stream(stream, pdf)
        else:
            return NumberObject.read_from_stream(stream)
    else:
        stream.seek(-20,1)
        raise PdfReadError(
            f"Invalid Elementary Object starting with {tok!r} @{stream.tell()}: {stream.read(80).__repr__()}"
        )

@tynanseltzer
Author

Error is now

Traceback (most recent call last):
  File "/Users/tynanseltzer/msri/splitter.py", line 135, in <module>
    write_public(PUBLIC_PDF, fls)
  File "/Users/tynanseltzer/msri/splitter.py", line 64, in write_public
    output.write(outstream)
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 832, in write
    self.write_stream(stream)
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 805, in write_stream
    self._sweep_indirect_references(self._root)
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 954, in _sweep_indirect_references
    data = self._resolve_indirect_object(data)
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 999, in _resolve_indirect_object
    real_obj = data.pdf.get_object(data)
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_reader.py", line 1222, in get_object
    retval = read_object(self.stream, self)  # type: ignore
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/generic/_data_structures.py", line 902, in read_object
    return NullObject.read_from_stream(stream)
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/generic/_base.py", line 93, in read_from_stream
    raise PdfReadError("Could not read Null object")
PyPDF2.errors.PdfReadError: Could not read Null object

at what I believe is the identical crash point.

@pubpub-zz
Collaborator

New attempt:

def read_object(
    stream: StreamType,
    pdf: Any,  # PdfReader
    forced_encoding: Union[None, str, List[str], Dict[int, str]] = None,
) -> Union[PdfObject, int, str, ContentStream]:
    tok = stream.read(1)
    stream.seek(-1, 1)  # reset to start
    if tok == b"/":
        return NameObject.read_from_stream(stream, pdf)
    elif tok == b"<":
        # hexadecimal string OR dictionary
        peek = stream.read(2)
        stream.seek(-2, 1)  # reset to start

        if peek == b"<<":
            return DictionaryObject.read_from_stream(stream, pdf, forced_encoding)
        else:
            return read_hex_string_from_stream(stream, forced_encoding)
    elif tok == b"[":
        return ArrayObject.read_from_stream(stream, pdf, forced_encoding)
    elif tok == b"t" or tok == b"f":
        return BooleanObject.read_from_stream(stream)
    elif tok == b"(":
        return read_string_from_stream(stream, forced_encoding)
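    elif tok == b"e" and stream.read(6) == b"endobj":
        # Empty object body ("N G obj endobj"): rewind so the caller still
        # sees "endobj", and treat the object as null
        stream.seek(-6, 1)
        return NullObject()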
    elif tok == b"n":
        return NullObject.read_from_stream(stream)
    elif tok == b"%":
        # comment
        while tok not in (b"\r", b"\n"):
            tok = stream.read(1)
            # Prevents an infinite loop by raising an error if the stream is at
            # the EOF
            if len(tok) <= 0:
                raise PdfStreamError("File ended unexpectedly.")
        tok = read_non_whitespace(stream)
        stream.seek(-1, 1)
        return read_object(stream, pdf, forced_encoding)
    elif tok in b"0123456789+-.":
        # number object OR indirect reference
        peek = stream.read(20)
        stream.seek(-len(peek), 1)  # reset to start
        if IndirectPattern.match(peek) is not None:
            return IndirectObject.read_from_stream(stream, pdf)
        else:
            return NumberObject.read_from_stream(stream)
    else:
        stream.seek(-20,1)
        raise PdfReadError(
            f"Invalid Elementary Object starting with {tok!r} @{stream.tell()}: {stream.read(80).__repr__()}"
        )
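
The idea behind this second attempt can be shown in isolation. The following is a simplified sketch, not PyPDF2's implementation: read_object_body is a hypothetical helper that returns None when an object has no body (as in "5746 0 obj endobj" from the error message) and leaves the stream positioned at "endobj" either way. Unlike the branch above, it always restores the position after peeking, including when the peek does not match.

```python
import io


def read_object_body(stream):
    """Return None for an empty object body, else the raw bytes before "endobj".

    Hypothetical helper for illustration only. In both cases the stream is
    left positioned at the start of "endobj" (or at EOF if it is missing).
    """
    start = stream.tell()
    peek = stream.read(6)
    stream.seek(start)  # always undo the peek, whether or not it matched
    if peek == b"endobj":
        return None  # empty body: treat the object as null

    rest = stream.read()
    end = rest.find(b"endobj")
    body = rest if end == -1 else rest[:end]
    stream.seek(start + len(body))  # stop right before "endobj"
    return body.strip()


# Stream positioned just after "5746 0 obj " in the malformed file.
empty = io.BytesIO(b"endobj\n5747 0 obj")
null_result = read_object_body(empty)

# Stream positioned just after "5747 0 obj " for a well-formed object.
normal = io.BytesIO(b"[/ICCBased 5772 0 R]\nendobj")
normal_result = read_object_body(normal)
```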

@tynanseltzer
Author

Ran without error.

@tynanseltzer
Author

tynanseltzer commented Dec 8, 2022

Unsure if this should be left open, feel free to close. Thank you deeply for your help!

@pubpub-zz
Collaborator

preparing the PR. will be closed once merged

@MartinThoma
Member

You're both amazing ❤️

@pubpub-zz The PR was perfect - short, fixed the issue + added a test. Good work 👍

@tynanseltzer Your initial error report + being responsive surely helped a lot to get this fixed. We appreciate that. If you want, I would add you as a contributor to https://pypdf2.readthedocs.io/en/latest/meta/CONTRIBUTORS.html

@MartinThoma
Member

I will make a release to PyPI this weekend. That release will contain the bugfix.

@tynanseltzer
Author

> @tynanseltzer Your initial error report + being responsive surely helped a lot to get this fixed. We appreciate that. If you want, I would add you as a contributor to https://pypdf2.readthedocs.io/en/latest/meta/CONTRIBUTORS.html

@MartinThoma I'm ok, thank you though. Appreciate you and pubpub working with a bug report that contained "I will not reproduce the pdf that causes the bug."
