Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jules #3204

Merged
merged 4 commits into from
Feb 27, 2024
Merged

Jules #3204

merged 4 commits into from
Feb 27, 2024

Conversation

julian-smith-artifex-com
Copy link
Collaborator

Install pymupdf command that runs fitz/__main__.py:main().

Also made most Package constructor args keyword-only.
This uses pipcl.py's new `entry_points` support.

Note that direct installation with `setup.py install` does not implement this.

tests/test_general.py: test_cli() - basic test of command-line `pymupdf`
command.

Addresses #3199.
Checks that fitz.mupdf.pdf_subset_fonts2() can be called.
Copy link
Collaborator

@JorjMcKie JorjMcKie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Script test_font.py is insufficient as it only confirms that the call to pdf_subset_fonts() does not crash.

Could we include additional tests:

  • Fonts actually are being subset, e.g. by comparing file sizes before / after. So we do need examples with non-subset fonts.
  • Text extraction is not being affected in any way by replacing fonts with their subsets, e.g. by comparing "dict" text extractions before / after.

As per Robin's statement, always all document pages should be used in the array - not a subset. One might argue, that in this case, we shouldn't ever need more than the PDF argument.

I have tried mupdf.pdf_subset_fonts2 and it did not work: no font subsets were created.

@julian-smith-artifex-com
Copy link
Collaborator Author

Script test_font.py is insufficient as it only confirms that the call to pdf_subset_fonts() does not crash.

It's only intended to check the python bindings are working. Additional checks can be added afterwards.

If you prefer, i could remove the test from this PR for now, until we have better tests written?

[...]

I have tried mupdf.pdf_subset_fonts2 and it did not work: no font subsets were created.

Unfortunately there's a mistake in the implementation in mupdf; i have a fix.

@JorjMcKie
Copy link
Collaborator

Ah, I understand.
We need to decide whether we want to safeguard against crashes in MuPDF's pdf_subset_fonts() at all.
If yes, we certainly should process more / many PDFs.
I vote for not doing this: I already confirmed that all our PDFs in the test/resources folder are processed without crash.

Otherwise, I have a test script that makes a fresh PDF with text using almost 10 different fonts and a consequential file size of 2 MB. Font types are OTF, TTF and CID.
After font subset, the file size is only 5% of the original.
If you want, you could include this one now instead ... or we postpone testing this completely.

@JorjMcKie JorjMcKie self-requested a review February 27, 2024 08:18
@julian-smith-artifex-com julian-smith-artifex-com merged commit c41f831 into main Feb 27, 2024
2 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Feb 27, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants