Some font ligatures and/or digraphs extracted as NULL bytes #1

xdave · 2025-01-20T21:02:39Z

This seems to be related to the discussion in jsvine/pdfplumber#904.

What I'm experiencing is that some ligatures are not caught by this auto-expansion process, such as "tt", "ti", "th", and others. Because of these characters, when I try to open the generated Markdown, I first get a prompt like this:

And when I open it in text mode, the text contains null bytes like this:

I don't have much control over how these PDF files are encoded, so I was wondering if you had any insight into how to handle them properly. Thanks.

Here are a couple of example files:

page2.pdf

page22.pdf
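For anyone trying to reproduce this, here is a minimal sketch for locating the NUL characters in already-extracted text. It is not part of this library or of pdfplumber; the helper names `find_null_bytes` and `expand_unicode_ligatures` are made up for illustration. The second helper shows why pairs like "tt", "ti", and "th" are harder: Unicode has dedicated code points for only a small set of ligatures (ff, fi, fl, ffi, ffl, st), so a generic expansion has nothing to map the others from.

```python
import unicodedata

NUL = "\x00"


def find_null_bytes(text, context=10):
    """Return (index, snippet) pairs for every NUL character in text.

    NULs in the snippet are shown as the visible symbol U+2400 so the
    result can be printed safely.
    """
    hits = []
    for i, ch in enumerate(text):
        if ch == NUL:
            start = max(0, i - context)
            snippet = text[start:i + context + 1].replace(NUL, "\u2400")
            hits.append((i, snippet))
    return hits


def expand_unicode_ligatures(text):
    """Expand ligatures that have their own Unicode code point (e.g. U+FB01)."""
    # NFKC also normalizes other compatibility characters, not only ligatures.
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    sample = "Bu\x00on and \ufb01le"
    print(find_null_bytes(sample, context=2))
    print(repr(expand_unicode_ligatures(sample)))
```

A NUL carries no information about which letters it replaced, so normalization leaves it untouched; only the position can be reported.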

HawkClaws · 2025-01-21T00:13:29Z

@xdave

Thank you for the report.

I've implemented ligature skipping and released it in version 0.1.7.

Please close this if there are no issues. If you have any other feedback, please leave a comment.

xdave · 2025-01-21T03:04:22Z

Hi @HawkClaws. Thanks for getting back to me and addressing the issue. :) I've rechecked the original example files, and they appear to be fixed, but other parts of this document contain ligatures that I guess those pages didn't have. I'm not sure what the full set of issues is across the whole document, and sharing the entire PDF is probably not possible, as I would need to redact a lot. I'll share a few more pages that hopefully cover the other issues:

page238.pdf

page254.pdf

page961.pdf

Hopefully these cover the rest of them... thanks again.

HawkClaws · 2025-01-21T05:01:34Z

@xdave
Table elements were not supported, so I have added support for them.

xdave · 2025-01-21T05:45:54Z

@HawkClaws Just tried 0.1.9, and things seem to be working well for this document now. Thank you :)

xdave · 2025-01-21T06:07:09Z

Hi again @HawkClaws. I'm re-opening this because I just re-checked some things and noticed that the ligature removal is now causing words to be misspelled. For example, in the page961.pdf above, the word "Effective" now shows up as "Eective", and the word "Button" similarly shows up as "buon". The letters are simply removed, which makes the text unsearchable using those words.
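One possible workaround on the search side, sketched here under a big assumption: the NUL placeholders are kept in the output instead of being dropped. The helper `tolerant_pattern` and the `SUSPECT_PAIRS` list are hypothetical, and the list is only a guess based on the pairs mentioned in this thread. The idea is to build a regular expression in which each suspect letter group may match either the literal letters or a single NUL.

```python
import re

# Hypothetical list of letter groups that the font may draw as one glyph.
# Longer groups come first so "ffi" is tried before "ff".
SUSPECT_PAIRS = ("ffi", "ffl", "ff", "fi", "fl", "tt", "ti", "th")


def tolerant_pattern(query):
    """Compile a regex that matches query even where a ligature became NUL."""
    lowered = query.lower()
    parts = []
    i = 0
    while i < len(query):
        for pair in SUSPECT_PAIRS:
            if lowered.startswith(pair, i):
                literal = re.escape(query[i:i + len(pair)])
                # Accept either the real letters or one NUL placeholder.
                parts.append("(?:" + literal + r"|\x00)")
                i += len(pair)
                break
        else:
            parts.append(re.escape(query[i]))
            i += 1
    return re.compile("".join(parts), re.IGNORECASE)


if __name__ == "__main__":
    print(bool(tolerant_pattern("Effective").search("E\x00ective")))
    print(bool(tolerant_pattern("Button").search("bu\x00on")))
    print(bool(tolerant_pattern("Effective").search("Eective")))
```

This only works while the placeholder is still present. Once the character is removed entirely, as in "Eective", the position of the missing letters is lost and the pattern no longer matches.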

HawkClaws · 2025-01-21T06:12:03Z

> For example, in the page961.pdf above, the word "Effective" is now showing up as "Eective", and the word "Button", similarly as "buon". The letters just seem to be removed, which effectively makes it unsearchable using those words.

I have checked this, and the characters do not seem to be extracted at the pdfplumber stage, so it seems difficult to handle in this wrapper library.
The strings also cannot be copied or searched in Google Chrome, so it seems you would need to use an OCR library.
Let me put this on hold for now.
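To check where the information is lost, one could look at the per-character data before any text assembly. The sketch below assumes the affected glyphs arrive with a NUL in their `text` field, which may not hold for every PDF. `count_unmapped_glyphs` is a hypothetical helper; it takes a list of dicts shaped like pdfplumber's `page.chars` entries (each with `text` and `fontname` keys) and counts NUL glyphs per font.

```python
from collections import Counter


def count_unmapped_glyphs(chars):
    """Count glyphs whose text contains NUL, grouped by font name.

    chars is a list of dicts with "text" and "fontname" keys, the same
    shape as the entries of pdfplumber's page.chars.
    """
    counts = Counter()
    for ch in chars:
        if "\x00" in ch.get("text", ""):
            counts[ch.get("fontname", "?")] += 1
    return counts


# With pdfplumber installed, this could be called as:
#   import pdfplumber
#   with pdfplumber.open("page961.pdf") as pdf:
#       print(count_unmapped_glyphs(pdf.pages[0].chars))
```

If the counts cluster on one or two fonts, that would suggest those fonts lack a usable glyph-to-Unicode mapping in the PDF, in which case the original letters are not recoverable from the text layer and OCR is the remaining option.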

xdave closed this as completed Jan 21, 2025

xdave reopened this Jan 21, 2025
