Can not obtain correct PDF version because of warning "cannot recognize version marker" is not present #3226

Closed
Matmaus opened this issue Mar 4, 2024 · 6 comments
Labels
not a bug not a bug / user error / unable to reproduce

Comments

@Matmaus

Matmaus commented Mar 4, 2024

Description of the bug

File: PDF 2.0 with offset start.pdf (taken from PDF association GitHub page with example PDFs)

Since version v1.23.1 (or v1.23.0; I am not sure, because v1.23.0 contains a critical bug, fixed in v1.23.1, that prevents me from running it), PyMuPDF has stopped emitting the "cannot recognize version marker" warning. It was emitted in all previous versions. I use this warning as a signal to detect the PDF version myself instead of relying on the one reported by PyMuPDF (see #1435).
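For readers following along, a minimal sketch of how such a check might look. The helper name `version_marker_unrecognized` is hypothetical and not part of PyMuPDF; it only inspects the warning string, and it matches on a substring because the transcripts in this thread show the message both with and without a "format error: " prefix.

```python
# Hypothetical helper (not a PyMuPDF API): decide whether the warning text
# reported by MuPDF mentions an unrecognized version marker.
MARKER = "cannot recognize version marker"


def version_marker_unrecognized(warnings: str) -> bool:
    """Return True if any warning line contains the version-marker message."""
    # Substring match, so a prefix such as "format error: " does not matter.
    return any(MARKER in line for line in warnings.splitlines())
```

With PyMuPDF this would be called right after opening the document, as `version_marker_unrecognized(pymupdf.TOOLS.mupdf_warnings())`, using the same warnings call shown in the reproduction steps.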

Here is a description of the linked PDF file (source)

This is an example of a PDF file that was updated from a PDF 1.7 file to a PDF 2.0 file. This shows how an incremental save might be used when an existing PDF 1.7 file is updated and you want to mark the PDF as a PDF 2.0 file. The page should display the string "PDF 2.0 files have spacing" if it is properly parsed and interpreted; a different string will display if the viewer is not capable of reading the incremental save in the file. This example also shows how a PDF "file" may contain more than just PDF data. The comments at the beginning of the file are not in PDF syntax and are not considered as part of the PDF data. Note that file offsets in the PDF cross-reference table are relative to the start of the PDF data, and not to the beginning of the file itself.
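As an illustration of detecting the version without relying on the warning, here is a rough sketch that scans the raw bytes for a `%PDF-x.y` header anywhere in the file (so leading non-PDF data is tolerated) and for `/Version /x.y` entries such as an incremental save might add to the catalog. The function name `sniff_pdf_version` is made up for this example, and taking the highest version found is a simplifying assumption: the scan will miss a catalog stored in a compressed object stream and may match a `/Version` key that belongs to some other dictionary.

```python
import re

# Header marker; searched over the whole buffer because the file may start
# with data that is not PDF syntax.
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")
# A /Version name entry as it would appear in an uncompressed dictionary.
VERSION_ENTRY_RE = re.compile(rb"/Version\s*/(\d)\.(\d)")


def sniff_pdf_version(data: bytes):
    """Return the highest (major, minor) found in raw bytes, or None."""
    candidates = []
    header = HEADER_RE.search(data)
    if header:
        candidates.append((int(header.group(1)), int(header.group(2))))
    for match in VERSION_ENTRY_RE.finditer(data):
        candidates.append((int(match.group(1)), int(match.group(2))))
    # Assumption: the highest version seen is the effective one.
    return max(candidates) if candidates else None
```

A typical call would be `sniff_pdf_version(open(path, "rb").read())`, where `path` is whatever file is being inspected.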

How to reproduce the bug

To reproduce

import fitz as pymupdf

doc = pymupdf.open('PDF 2.0 with offset start.pdf')  # see section above
doc.metadata['format']
doc.is_repaired
pymupdf.TOOLS.mupdf_warnings()

Version 1.22.5:

>>> import fitz as pymupdf
>>> doc = pymupdf.open('PDF 2.0 with offset start.pdf')
>>> doc.is_repaired
True
>>> doc.metadata['format']
'PDF 1.7'
>>> pymupdf.TOOLS.mupdf_warnings()
'cannot recognize version marker\ntrying to repair broken xref\nrepairing PDF document'

Version 1.23.26:

>>> import fitz as pymupdf
>>> doc = pymupdf.open('PDF 2.0 with offset start.pdf')
>>> doc.is_repaired
True
>>> doc.metadata['format']
'PDF 1.7'
>>> pymupdf.TOOLS.mupdf_warnings()
'trying to repair broken xref\nrepairing PDF document'

Expected behaviour

Either the detected PDF version should be 2.0, or at least the "cannot recognize version marker" warning should be emitted.

PyMuPDF version

1.23.26

Operating system

Linux

Python version

3.10

@JorjMcKie
Collaborator

>>> import fitz
>>> doc = fitz.open("PDF 2.0 UTF-8 string and annotation.pdf")
>>> doc.metadata
{'format': 'PDF 2.0', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': '', 'creationDate': '', 'modDate': '', 'trapped': '', 'encryption': None}
>>> doc.is_repaired
False
>>> fitz.TOOLS.mupdf_warnings()
''
>>> fitz.__version__
'1.23.26'
>>> doc = fitz.open("PDF 2.0 with offset start.pdf")
>>> doc.metadata
{'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': '', 'creationDate': '', 'modDate': '', 'trapped': '', 'encryption': None}
>>> doc.is_repaired
True
>>> fitz.TOOLS.mupdf_warnings()
'format error: cannot recognize version marker\ntrying to repair broken xref\nrepairing PDF document'

I think this behavior is correct - and no bug at all.

@JorjMcKie JorjMcKie added the not a bug not a bug / user error / unable to reproduce label Mar 4, 2024
@Matmaus
Author

Matmaus commented Mar 4, 2024

I am sorry, I did not mean to submit the issue in this state; it was an accident that I was not even aware of.

@Matmaus Matmaus changed the title Can not obtain correct PDF version because Can not obtain correct PDF version because of warning "cannot recognize version marker" is not present Mar 4, 2024
@Matmaus
Author

Matmaus commented Mar 4, 2024

I fixed the title and description, as they were incomplete.

@Matmaus
Author

Matmaus commented Mar 4, 2024

My output is different, @JorjMcKie: the warning I would like to see is missing.

Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import fitz
>>> fitz.__version__
'1.23.26'
>>> doc=fitz.open("PDF 2.0 with offset start.pdf")
>>> doc.metadata
{'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': '', 'creationDate': '', 'modDate': '', 'trapped': '', 'encryption': None}
>>> doc.is_repaired
True
>>> fitz.TOOLS.mupdf_warnings()
'trying to repair broken xref\nrepairing PDF document'

@JorjMcKie
Collaborator

This behavior is reinstated in one of the upcoming MuPDF versions:

Python 3.12.1 (tags/v3.12.1:2305ca5, Dec  7 2023, 22:03:25) [MSC v.1937 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.14.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import fitz
In [2]: fitz.__version__
Out[2]: '1.23.26'
In [3]: fitz.mupdf_version_tuple
Out[3]: (1, 24, 0)
In [4]: doc=fitz.open("PDF 2.0 with offset start.pdf")
In [5]: fitz.TOOLS.mupdf_warnings()
Out[5]: 'format error: cannot recognize version marker\ntrying to repair broken xref\nrepairing PDF document'

@Matmaus
Author

Matmaus commented Mar 4, 2024

Great, so all I need to do is wait for a future PyMuPDF release?


2 participants