ValueError: read length must be non-negative or -1 #1474

tynanseltzer · 2022-12-08T00:53:11Z

Replace this: What happened? What were you trying to achieve?

Environment

Which environment were you using when you encountered the problem?

$ python -m platform

macOS-10.16-x86_64-i386-64bit

$ python -c "import PyPDF2;print(PyPDF2.__version__)"

2.11.2

Code + PDF

This is a minimal, complete example that shows the issue:

def write_public(pdf_name, output_folders):
    split_names = gen_split_names(pdf_name)
    input_pdf = PdfFileReader(open(pdf_name, "rb"))
    num_partitions = len(output_folders)
    for num_person in range(len(split_names)):
        if num_person == len(split_names) - 1:
            end = input_pdf.numPages
        else:
            end = split_names[num_person+1][0]
        output = PdfFileWriter()
        for page_num in range(split_names[num_person][0], end):
            output.addPage(input_pdf.getPage(page_num))
        out_path = (output_folders[num_person % num_partitions] + "/"
                    + split_names[num_person][1][0] + split_names[num_person][1][1] + ".pdf")
        with open(out_path, "wb") as outstream:
            output.write(outstream)
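
The page-range bookkeeping in write_public can be checked without any PDF at all. The following is a minimal sketch, assuming split_names reduces to a sorted list of starting page indices; page_ranges and folder_for are hypothetical helper names, not part of the reporter's script or of PyPDF2.

```python
def page_ranges(starts, num_pages):
    """Return half-open (first, end) page ranges, one per starting index.

    Each range runs up to the next starting index; the last one runs
    to the end of the document, mirroring the if/else in write_public.
    """
    ranges = []
    for i, first in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else num_pages
        ranges.append((first, end))
    return ranges


def folder_for(index, folders):
    """Pick an output folder round-robin, as num_person % num_partitions does."""
    return folders[index % len(folders)]


ranges = page_ranges([0, 3, 7], 10)
folder = folder_for(2, ["a", "b"])
```

With three starting indices and a ten-page document, every page lands in exactly one range, and the third split wraps around to the first folder.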

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

I (legally) can't share the PDF, but I can say that I ran this code on some 240 PDFs, and it broke on only this one. The only thing I find different about this PDF is a bunch of math equations in LaTeX. Please feel free to close this issue; I understand non-reproducible bug reports aren't exactly useful, but I figured I'd give it a shot.

Traceback

This is the complete Traceback I see:
File "/Users/tynanseltzer/test2/splitter.py", line 135, in <module>
write_public(PUBLIC_PDF, fls)
File "/Users/tynanseltzer/test2/splitter.py", line 64, in write_public
output.write(outstream)
File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 832, in write
self.write_stream(stream)
File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 805, in write_stream
self._sweep_indirect_references(self._root)
File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 954, in _sweep_indirect_references
data = self._resolve_indirect_object(data)
File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 999, in _resolve_indirect_object
real_obj = data.pdf.get_object(data)
File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_reader.py", line 1222, in get_object
retval = read_object(self.stream, self) # type: ignore
File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/generic/_data_structures.py", line 872, in read_object
stream.read(-20)
ValueError: read length must be non-negative or -1

pubpub-zz · 2022-12-08T17:58:13Z

@tynanseltzer
This bug is quite simple to correct; however, it sits just before the point where another exception is raised.
Can you replace line 872,
stream.read(-20), with stream.seek(-20, 1)
and rerun your program? You should still get an error starting with "Invalid Elementary...".
Can you report the full message you get?
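
For context, the ValueError in the title comes from Python's buffered file objects rather than from PyPDF2 itself: open(..., "rb") returns an io.BufferedReader, whose read() rejects sizes below -1, whereas seek(-20, 1) moves the position back 20 bytes from where it currently is. A small sketch of the difference, wrapping an in-memory buffer in io.BufferedReader as a stand-in for the real file:

```python
import io

# Stand-in for the PDF file: 30 filler bytes followed by a keyword.
data = b"x" * 30 + b"endobj"
stream = io.BufferedReader(io.BytesIO(data))
stream.read(30)  # advance to offset 30


def try_negative_read(s):
    """Return the error text if read(-20) is rejected, else None."""
    try:
        s.read(-20)
    except ValueError as exc:
        return str(exc)
    return None


error_message = try_negative_read(stream)

# Seeking relative to the current position (whence=1) is the call that
# actually rewinds by 20 bytes.
stream.seek(-20, io.SEEK_CUR)
position_after_seek = stream.tell()
```

The rejected read() leaves the position untouched, so the relative seek starts from offset 30.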

tynanseltzer · 2022-12-08T18:11:59Z

Here you go. Thank you!

Traceback (most recent call last):
  File "/Users/tynanseltzer/msri/splitter.py", line 135, in <module>
    write_public(PUBLIC_PDF, fls)
  File "/Users/tynanseltzer/msri/splitter.py", line 64, in write_public
    output.write(outstream)
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 832, in write
    self.write_stream(stream)
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 805, in write_stream
    self._sweep_indirect_references(self._root)
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 954, in _sweep_indirect_references
    data = self._resolve_indirect_object(data)
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 999, in _resolve_indirect_object
    real_obj = data.pdf.get_object(data)
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_reader.py", line 1222, in get_object
    retval = read_object(self.stream, self)  # type: ignore
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/generic/_data_structures.py", line 873, in read_object
    raise PdfReadError(
PyPDF2.errors.PdfReadError: Invalid Elementary Object starting with b'e' @17651045: b'\nendobj \n5746 0 obj endobj\nendobj \n5747 0 obj [/ICCBased 5772 0 R]\nendobj \n5748 '
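
The bytes quoted in the error message contain "5746 0 obj endobj": an object with nothing between obj and endobj. A rough way to look for such empty objects in raw PDF bytes is a regular expression. This is only a sketch (find_empty_objects is a hypothetical helper, and a plain regex can be fooled by binary or compressed stream data), run here on the excerpt from the traceback:

```python
import re

# Object header immediately followed by endobj, with only whitespace between.
EMPTY_OBJECT = re.compile(rb"(\d+)\s+(\d+)\s+obj\s+endobj")


def find_empty_objects(raw):
    """Return (object number, generation) pairs for objects with no body."""
    return [(int(m.group(1)), int(m.group(2))) for m in EMPTY_OBJECT.finditer(raw)]


excerpt = (
    b"\nendobj \n5746 0 obj endobj\nendobj \n"
    b"5747 0 obj [/ICCBased 5772 0 R]\nendobj \n5748 "
)
empty_objects = find_empty_objects(excerpt)
```

Object 5747 has an array as its body and is not reported; only 5746 is.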

pubpub-zz · 2022-12-08T21:15:38Z

This looks like a null object (cf. pdf-association/safedocs#2 (comment)).
@tynanseltzer, can you replace the read_object function with this code and confirm that the error disappears?

def read_object(
    stream: StreamType,
    pdf: Any,  # PdfReader
    forced_encoding: Union[None, str, List[str], Dict[int, str]] = None,
) -> Union[PdfObject, int, str, ContentStream]:
    tok = stream.read(1)
    stream.seek(-1, 1)  # reset to start
    if tok == b"/":
        return NameObject.read_from_stream(stream, pdf)
    elif tok == b"<":
        # hexadecimal string OR dictionary
        peek = stream.read(2)
        stream.seek(-2, 1)  # reset to start

        if peek == b"<<":
            return DictionaryObject.read_from_stream(stream, pdf, forced_encoding)
        else:
            return read_hex_string_from_stream(stream, forced_encoding)
    elif tok == b"[":
        return ArrayObject.read_from_stream(stream, pdf, forced_encoding)
    elif tok == b"t" or tok == b"f":
        return BooleanObject.read_from_stream(stream)
    elif tok == b"(":
        return read_string_from_stream(stream, forced_encoding)
    elif tok == b"e" and stream.read(6) == b"endobj" :
        return NullObject.read_from_stream(stream)
    elif tok == b"n":
        return NullObject.read_from_stream(stream)
    elif tok == b"%":
        # comment
        while tok not in (b"\r", b"\n"):
            tok = stream.read(1)
            # Prevents an infinite loop by raising an error if the stream is at
            # the EOF
            if len(tok) <= 0:
                raise PdfStreamError("File ended unexpectedly.")
        tok = read_non_whitespace(stream)
        stream.seek(-1, 1)
        return read_object(stream, pdf, forced_encoding)
    elif tok in b"0123456789+-.":
        # number object OR indirect reference
        peek = stream.read(20)
        stream.seek(-len(peek), 1)  # reset to start
        if IndirectPattern.match(peek) is not None:
            return IndirectObject.read_from_stream(stream, pdf)
        else:
            return NumberObject.read_from_stream(stream)
    else:
        stream.seek(-20,1)
        raise PdfReadError(
            f"Invalid Elementary Object starting with {tok!r} @{stream.tell()}: {stream.read(80).__repr__()}"
        )

tynanseltzer · 2022-12-08T21:23:03Z

Error is now

Traceback (most recent call last):
  File "/Users/tynanseltzer/msri/splitter.py", line 135, in <module>
    write_public(PUBLIC_PDF, fls)
  File "/Users/tynanseltzer/msri/splitter.py", line 64, in write_public
    output.write(outstream)
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 832, in write
    self.write_stream(stream)
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 805, in write_stream
    self._sweep_indirect_references(self._root)
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 954, in _sweep_indirect_references
    data = self._resolve_indirect_object(data)
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_writer.py", line 999, in _resolve_indirect_object
    real_obj = data.pdf.get_object(data)
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/_reader.py", line 1222, in get_object
    retval = read_object(self.stream, self)  # type: ignore
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/generic/_data_structures.py", line 902, in read_object
    return NullObject.read_from_stream(stream)
  File "/Users/tynanseltzer/anaconda3/lib/python3.9/site-packages/PyPDF2/generic/_base.py", line 93, in read_from_stream
    raise PdfReadError("Could not read Null object")
PyPDF2.errors.PdfReadError: Could not read Null object

at what I believe is the identical crash point.

pubpub-zz · 2022-12-08T21:32:11Z

New attempt:

def read_object(
    stream: StreamType,
    pdf: Any,  # PdfReader
    forced_encoding: Union[None, str, List[str], Dict[int, str]] = None,
) -> Union[PdfObject, int, str, ContentStream]:
    tok = stream.read(1)
    stream.seek(-1, 1)  # reset to start
    if tok == b"/":
        return NameObject.read_from_stream(stream, pdf)
    elif tok == b"<":
        # hexadecimal string OR dictionary
        peek = stream.read(2)
        stream.seek(-2, 1)  # reset to start

        if peek == b"<<":
            return DictionaryObject.read_from_stream(stream, pdf, forced_encoding)
        else:
            return read_hex_string_from_stream(stream, forced_encoding)
    elif tok == b"[":
        return ArrayObject.read_from_stream(stream, pdf, forced_encoding)
    elif tok == b"t" or tok == b"f":
        return BooleanObject.read_from_stream(stream)
    elif tok == b"(":
        return read_string_from_stream(stream, forced_encoding)
    elif tok == b"e" and stream.read(6) == b"endobj" :
        stream.seek(-6,1)
        return NullObject()
    elif tok == b"n":
        return NullObject.read_from_stream(stream)
    elif tok == b"%":
        # comment
        while tok not in (b"\r", b"\n"):
            tok = stream.read(1)
            # Prevents an infinite loop by raising an error if the stream is at
            # the EOF
            if len(tok) <= 0:
                raise PdfStreamError("File ended unexpectedly.")
        tok = read_non_whitespace(stream)
        stream.seek(-1, 1)
        return read_object(stream, pdf, forced_encoding)
    elif tok in b"0123456789+-.":
        # number object OR indirect reference
        peek = stream.read(20)
        stream.seek(-len(peek), 1)  # reset to start
        if IndirectPattern.match(peek) is not None:
            return IndirectObject.read_from_stream(stream, pdf)
        else:
            return NumberObject.read_from_stream(stream)
    else:
        stream.seek(-20,1)
        raise PdfReadError(
            f"Invalid Elementary Object starting with {tok!r} @{stream.tell()}: {stream.read(80).__repr__()}"
        )
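
The change relative to the first attempt is the rewind: once stream.read(6) has confirmed that the next token is endobj, the stream is moved back, presumably so that the caller still finds endobj where it expects it, and a NullObject is returned directly instead of being parsed from the stream. The same peek-and-rewind idea in isolation, as a sketch on an in-memory buffer (next_is_endobj is a hypothetical helper, not a PyPDF2 function):

```python
import io


def next_is_endobj(stream):
    """Report whether the next six bytes are b"endobj" without consuming them."""
    start = stream.tell()
    peek = stream.read(6)
    stream.seek(start)  # rewind so later parsing sees the same bytes
    return peek == b"endobj"


empty_body = io.BytesIO(b"endobj\n5747 0 obj")
null_body = io.BytesIO(b"null\nendobj")

empty_result = next_is_endobj(empty_body)
null_result = next_is_endobj(null_body)
```

Saving the position with tell() and restoring it also covers the case where fewer than six bytes remain, which a fixed seek(-6, 1) would not.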

tynanseltzer · 2022-12-08T21:35:46Z

Ran without error.

tynanseltzer · 2022-12-08T21:38:19Z

Unsure if this should be left open, feel free to close. Thank you deeply for your help!

pubpub-zz · 2022-12-08T21:39:57Z

Preparing the PR. This issue will be closed once it is merged.

fixes py-pdf#1474

MartinThoma · 2022-12-08T22:07:02Z

You're both amazing ❤️

@pubpub-zz The PR was perfect - short, fixed the issue + added a test. Good work 👍

@tynanseltzer Your initial error report + being responsive surely helped a lot to get this fixed. We appreciate that. If you want, I would add you as a contributor to https://pypdf2.readthedocs.io/en/latest/meta/CONTRIBUTORS.html

MartinThoma · 2022-12-08T22:07:26Z

I will make a release to PyPI this weekend. That release will contain the bugfix.

tynanseltzer · 2022-12-08T22:14:02Z

> @tynanseltzer Your initial error report + being responsive surely helped a lot to get this fixed. We appreciate that. If you want, I would add you as a contributor to https://pypdf2.readthedocs.io/en/latest/meta/CONTRIBUTORS.html

@MartinThoma I'm ok, thank you though. Appreciate you and pubpub working with a bug report that contained "I will not reproduce the pdf that causes the bug."

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Dec 8, 2022

ROB : accept empty object as null objects

c71a225

fixes py-pdf#1474

pubpub-zz mentioned this issue Dec 8, 2022

ROB : accept empty object as null objects #1477

Merged

MartinThoma closed this as completed in #1477 Dec 8, 2022

MartinThoma closed this as completed in 7f586ae Dec 8, 2022
