Update PDF.js (2025 edition) #6784
Comments
From hypothesis/pdf.js-hypothes.is#31 (comment):
Digging into this issue, on page 2 of the TraceMonkey paper used in the PDF.js demo I see differences in Unicode normalization between the text extracted from the PDF by our

In the text layer:

Here "flow" uses a single-character ligature, whereas the text returned from PDF.js's text extraction APIs uses separate "f" and "l" characters. This may have been caused by mozilla/pdf.js#16200.

We have a
Related to the change in Unicode normalization, there may be some accessibility issues arising from this. See nvaccess/nvda#14740.
To summarize the issue for future reference:

Older versions of PDF.js applied Unicode normalization to the text from both the text extraction API and the hidden text layer in the DOM. The latest version, by default, still normalizes the text returned by the text extraction API but no longer normalizes the text in the text layer. The rationale is that this makes the characters in the text layer align better with the visual text. A ligature, for example, that is one character visually is now one character in the text layer instead of two.

Hypothesis uses the text extraction API to efficiently gather text from the PDF in order to find matches for saved annotations' quotes. After a match has been found, however, we need to highlight the corresponding text in the text layer. If the text layer and the extraction API return different text, the position range calculated from the extraction API's text needs to be translated into a position range within the text layer. Likewise, when saving a new annotation, we should ideally translate the position range in the text layer into a range in the text returned by the API. There is some tolerance here because the position saved for annotations is only a hint for quote anchoring.

We already deal with the fact that whitespace can differ between the text layer and the extraction API result. This is handled by

There are a few approaches we could take:
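To make the offset translation described above concrete, here is a minimal sketch in TypeScript. It is an illustration only, not necessarily one of the approaches considered in the comment and not the Hypothesis client's actual code. The names `OffsetMap`, `buildOffsetMap` and `toOriginalRange` are hypothetical. It assumes the normalized text can be approximated by applying NFKC to each code point of the text layer's text independently, which may not match PDF.js's own normalization exactly (for example, for combining sequences).

```typescript
// Hypothetical helper types and functions, for illustration only.
interface OffsetMap {
  // Text after per-code-point normalization (e.g. "ﬂow" becomes "flow").
  normalized: string;
  // For each UTF-16 code unit of `normalized`, the start and end offsets
  // (in the original string) of the character it was produced from.
  starts: number[];
  ends: number[];
}

// Normalize `original` one code point at a time and record where each
// normalized code unit came from. Assumption: per-code-point NFKC is a
// close enough stand-in for the normalization applied to the extracted text.
function buildOffsetMap(
  original: string,
  form: "NFKC" | "NFKD" = "NFKC"
): OffsetMap {
  let normalized = "";
  const starts: number[] = [];
  const ends: number[] = [];
  let pos = 0;
  for (const ch of original) {
    const norm = ch.normalize(form);
    for (let i = 0; i < norm.length; i++) {
      starts.push(pos);
      ends.push(pos + ch.length);
    }
    normalized += norm;
    pos += ch.length;
  }
  return { normalized, starts, ends };
}

// Translate a [start, end) range in the normalized text into a range in the
// original text. A range that starts or ends inside an expanded character
// (such as a ligature) is widened to cover that whole character.
function toOriginalRange(
  map: OffsetMap,
  start: number,
  end: number
): [number, number] {
  if (start < 0 || end < start || end > map.normalized.length) {
    throw new RangeError("range is outside the normalized text");
  }
  if (start === end) {
    // Empty range: collapse to a single position in the original text.
    const lastEnd = map.ends.length > 0 ? map.ends[map.ends.length - 1] : 0;
    const p = start < map.starts.length ? map.starts[start] : lastEnd;
    return [p, p];
  }
  return [map.starts[start], map.ends[end - 1]];
}

// Usage: the text layer contains the "ﬂ" ligature (U+FB02), while the quote
// is matched against the normalized text "flow".
const exampleMap = buildOffsetMap("\uFB02ow");
const matchStart = exampleMap.normalized.indexOf("low"); // 1
const exampleRange = toOriginalRange(exampleMap, matchStart, matchStart + 3);
console.log(exampleRange); // [0, 3]
```

Widening a range to whole characters means a match that begins in the middle of a ligature highlights the entire ligature, which seems acceptable given that saved positions are only a hint for quote anchoring.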
PDF.js was last updated nearly three years ago.
This issue covers updating to the latest version and addressing any compatibility issues we run into.
Issues identified: