Some font ligatures and/or digraphs extracted as NULL bytes #1

Open
xdave opened this issue Jan 20, 2025 · 6 comments

Comments

@xdave

xdave commented Jan 20, 2025

Seems to be an issue related to the discussion here jsvine/pdfplumber#904

What I'm experiencing is that some ligatures are not caught by this auto-expansion process, such as "tt", "ti", "th", and others. Because of these unmapped characters, when I try to open the generated Markdown, I am first prompted like this:

Image
And when opening it in Text mode, the text contains the null bytes like this:

Image

I don't have much control over how these PDF files are encoded, so I was wondering if you had any insight into how to handle them properly. Thanks.

Here's a couple of example files:

page2.pdf

page22.pdf
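
For anyone trying to reproduce the symptom, a minimal sketch for locating the NULL bytes in already-extracted text might look like the following. The helper name `find_nul_contexts`, its `width` parameter, and the sample string are illustrative assumptions, not part of pdfplumber or this library.

```python
def find_nul_contexts(text: str, width: int = 8) -> list[tuple[int, str, str]]:
    """Return (index, text_before, text_after) for every NUL character in text."""
    hits = []
    for index, char in enumerate(text):
        if char == "\x00":
            # Keep a little surrounding text so the damaged word is recognizable.
            before = text[max(0, index - width):index]
            after = text[index + 1:index + 1 + width]
            hits.append((index, before, after))
    return hits


# Hypothetical sample: a word whose "tt" ligature came out as a single NUL byte.
sample = "Press the bu\x00on to continue"
for index, before, after in find_nul_contexts(sample):
    print(f"NUL at offset {index}: {before!r} ... {after!r}")
```

Running this over the text of each page should show which words are affected, which may help when deciding which pages to share as examples.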

@HawkClaws
Owner

@xdave

Thank you for the report.

I've implemented ligature skipping support and released it in version 0.1.7.

Please close this issue if there are no further problems. If you have any other feedback, please leave a comment.

@xdave
Author

xdave commented Jan 21, 2025

Hi @HawkClaws. Thanks for getting back to me and addressing the issue. :) I've rechecked the original example files, and they appear to be fixed, but other parts of this document contain ligatures that I guess those pages didn't have. I'm not sure what the full set of issues is across the whole document, and sharing the entire PDF is probably not possible, as I would need to redact a lot. I'll share a few more pages that hopefully cover the other issues:

page238.pdf

page254.pdf

page961.pdf

Hopefully these cover the rest of them... thanks again.

@HawkClaws
Owner

@xdave
Table elements do not seem to have been supported, so I have added support for them.

@xdave
Author

xdave commented Jan 21, 2025

@HawkClaws Just tried 0.1.9, and things seem to be working well for this document now. Thank you :)

xdave closed this as completed Jan 21, 2025
@xdave
Author

xdave commented Jan 21, 2025

Hi again @HawkClaws. I'm re-opening this because I just re-checked some things and noticed that the ligature removal is now causing words to be misspelled. For example, in the page961.pdf above, the word "Effective" now shows up as "Eective", and the word "Button" similarly as "buon". The ligature letters are simply removed, which makes the text unsearchable using those words.
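
The misspellings are what one would expect if the unmapped glyphs are simply dropped. A minimal sketch, assuming each ligature arrives as a single NULL byte (an assumption, not confirmed from the library's source), compares dropping the byte with substituting a visible placeholder. The function names `drop_nuls` and `mark_nuls` are hypothetical.

```python
PLACEHOLDER = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def drop_nuls(text: str) -> str:
    """Remove NUL characters entirely; the letters they stood for are lost."""
    return text.replace("\x00", "")


def mark_nuls(text: str, placeholder: str = PLACEHOLDER) -> str:
    """Replace each NUL with a visible placeholder so the gap stays detectable."""
    return text.replace("\x00", placeholder)


# Assumed raw extraction of "Effective" with the "ff" ligature unmapped.
extracted = "E\x00ective"
print(drop_nuls(extracted))
print(mark_nuls(extracted))
```

Neither variant recovers the missing letters, but the placeholder version keeps damaged words easy to find while still avoiding NULL bytes in the Markdown output.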

xdave reopened this Jan 21, 2025
@HawkClaws
Owner

> For example, in the page961.pdf above, the word "Effective" is now showing up as "Eective", and the word "Button", similarly as "buon". The letters just seem to be removed, which effectively makes it unsearchable using those words.

I have looked into this, and those characters do not seem to be extracted at the pdfplumber stage, so it is difficult to handle in this library, which is only a wrapper.
The affected strings cannot be copied or searched in Google Chrome either, so it seems that you would need to use an OCR library.
Let me put this on hold for now.
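
One way to check where the text is lost is to look at the raw character objects that pdfplumber exposes through `page.chars`, each of which carries `text` and `fontname` keys. The sketch below is hedged: the helper name `count_unmapped_by_font` is hypothetical, the sample data is hand-written, and treating NULL bytes and `(cid:N)` strings as signs of a glyph without a usable Unicode mapping is an assumption about how such glyphs surface.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_unmapped_by_font(chars: Iterable[Mapping[str, object]]) -> Counter:
    """Count characters per font whose extracted text looks unmapped."""
    counts: Counter = Counter()
    for char in chars:
        text = str(char.get("text", ""))
        # Assumption: unmapped glyphs appear as NUL bytes or "(cid:N)" strings.
        if "\x00" in text or text.startswith("(cid:"):
            counts[str(char.get("fontname", "unknown"))] += 1
    return counts


# Hand-written stand-ins for pdfplumber char dicts (the font name is made up).
sample_chars = [
    {"text": "E", "fontname": "ABCDEF+BodyFont"},
    {"text": "\x00", "fontname": "ABCDEF+BodyFont"},
    {"text": "e", "fontname": "ABCDEF+BodyFont"},
]
print(count_unmapped_by_font(sample_chars))
```

With pdfplumber installed, the same function could be fed `page.chars` for each page of an opened PDF. If the affected glyphs already arrive there without real text, the letters cannot be restored by post-processing, which is consistent with the suggestion to fall back to OCR.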
