Some font ligatures and/or digraphs extracted as NULL bytes #1
Comments
Thank you for the report. I've implemented ligature skipping support in v0.1.7 and released it. Please close this issue if there are no further problems. If you have any other feedback, please leave a comment.
Hi @HawkClaws. Thanks for getting back to me and addressing the issue. :) I've rechecked the original example files, and they appear to be fixed, but other parts of this document contain ligatures that I guess the example pages didn't have. I'm not sure what the full extent of the issues is across the whole document, and sharing the entire PDF is probably not possible, as I would need to redact a lot. I'll share a few more pages that hopefully cover the other issues. Hopefully these cover the rest of them... Thanks again.
@xdave
@HawkClaws Just tried 0.1.9, and things seem to be working well for this document now. Thank you :)
Hi again @HawkClaws. I'm re-opening this because I just re-checked some things and noticed that the ligature removal is now causing words to be misspelled. For example, in page961.pdf above, the word "Effective" now shows up as "Eective", and the word "Button" similarly shows up as "buon". The ligature letters seem to be removed outright, which makes the text unsearchable using those words.
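To illustrate the difference between skipping a ligature and expanding it, here is a minimal sketch using only Python's standard library. It assumes the extractor returns real Unicode ligature code points (the block U+FB00 to U+FB06, which covers "ff", "fi", "fl", "ffi", "ffl" and two "st" forms); it does not help when the ligature arrives as a null byte, and a "tt" ligature has no code point in that block, so it would need some font-specific mapping. The function names `expand_ligatures` and `drop_ligatures` are hypothetical and are not part of pdfplumber or of this project.

```python
import unicodedata

# Unicode "Alphabetic Presentation Forms" Latin ligatures: U+FB00..U+FB06.
LIG_FIRST, LIG_LAST = "\ufb00", "\ufb06"


def expand_ligatures(text: str) -> str:
    """Replace each Latin ligature code point with its plain letters (NFKC)."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if LIG_FIRST <= ch <= LIG_LAST else ch
        for ch in text
    )


def drop_ligatures(text: str) -> str:
    """Remove ligature code points entirely (this loses letters)."""
    return "".join(ch for ch in text if not (LIG_FIRST <= ch <= LIG_LAST))


sample = "E\ufb00ective"          # "E" + ff-ligature + "ective"
print(drop_ligatures(sample))     # letters are lost
print(expand_ligatures(sample))   # letters are restored
```

Dropping the ligature reproduces the "Eective" symptom described above, while expanding it keeps the word searchable.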
I have looked into this, and the characters do not seem to be extracted at the pdfplumber stage.
This seems to be related to the discussion in jsvine/pdfplumber#904.
What I'm experiencing is that some ligatures, such as "tt", "ti", "th", and others, are not caught by this auto-expansion process. Because of these stray characters, when I attempt to open the generated Markdown, I am first prompted like this:
When I open it in text mode, the text contains null bytes like this:
I don't have a lot of control over how these PDF files are encoded, so I was wondering if you had any insight into how to handle these properly. Thanks.
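As a stopgap on the output side, the null bytes can at least be made visible and counted so the Markdown opens as ordinary text. The sketch below assumes the extracted text is already available as a Python string; `mark_null_bytes` and `words_with_nulls` are hypothetical helper names, and the U+FFFD placeholder is just one possible choice. It does not recover the missing letters.

```python
def mark_null_bytes(text: str, placeholder: str = "\ufffd") -> tuple[str, int]:
    """Replace NUL characters with a visible placeholder and report how many were found."""
    count = text.count("\x00")
    return text.replace("\x00", placeholder), count


def words_with_nulls(text: str) -> list[str]:
    """List whitespace-separated words that contain at least one NUL character."""
    return [word for word in text.split() if "\x00" in word]


extracted = "E\x00ective Date: press the Bu\x00on"
cleaned, found = mark_null_bytes(extracted)
print(found)                        # number of NUL characters replaced
print(words_with_nulls(extracted))  # words that need manual or font-aware repair
```

Listing the affected words gives a rough idea of which ligatures a given document still needs handled.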
Here are a couple of example files:
page2.pdf
page22.pdf