Update PDF.js (2025 edition) #6784

robertknight · 2025-01-29T16:19:22Z

PDF.js was last updated nearly three years ago.

This issue covers updating to the latest version and addressing any compatibility issues we run into.

Issues identified:

Mismatch of Unicode normalization between text returned by text extraction APIs and text layer. There may also be mismatches in normalization with saved annotations' quotes.
Position of highlights is not updated after changing zoom level
Clicking on annotations doesn't select them in the sidebar, although creating a selection that includes an annotation does indicate there is a highlight (something preventing clicks from propagating?)
The supported browser range for the "legacy" build changed. It only goes back to Safari 16.4 (Mar 2022, now just under 2 years old) and Chrome 103 (May 2022)

robertknight · 2025-01-29T16:20:18Z

From hypothesis/pdf.js-hypothes.is#31 (comment):

The console shows a warning "Text layer text does not match page text. Highlights will be mis-aligned."

Intermittently I have seen an error about a getDownloadInfo property referencing code in pdf-metadata.ts in the client. This method still exists upstream, so it looks like we might be trying to access it before app.pdfDocument has been set?

robertknight · 2025-01-29T16:22:51Z

The most obvious issue is that the position of highlights gets out of sync with the text when zooming in and out. PDF.js has changed the way they update the DOM after zoom. Previously it would remove and recreate various elements and that would trigger re-creation of the highlights. Now it looks like it does something cheaper where a --scale-factor CSS variable is updated and that scales various PDF.js viewer elements. It doesn't affect the highlight elements we have drawn however.

robertknight · 2025-01-29T16:45:47Z

The console shows a warning "Text layer text does not match page text. Highlights will be mis-aligned."

Digging into this issue, on page 2 of the TraceMonkey paper used in the PDF.js demo I see differences in Unicode normalization between the text extracted from the PDF by our getPageTextContent function and the textContent property of the DOM node from the text layer where the user made a text selection.

In the text layer:

Here "ﬂow" uses a single char ligature, whereas the text returned from PDF.js's text extraction APIs uses separate "f" and "l" characters.

This may have been caused by mozilla/pdf.js#16200. We have a translateOffsets function that can handle whitespace differences between the text layer and the extracted text. We may also need to handle Unicode normalization differences here.

robertknight · 2025-01-30T12:30:45Z

Related to the change in Unicode normalization, there may be some accessibility issues arising from this. See nvaccess/nvda#14740.

robertknight · 2025-01-30T14:45:09Z

Here "ﬂow" uses a single char ligature, whereas the text returned from PDF.js's text extraction APIs uses separate "f" and "l" characters.

To summarize the issue for future reference:

Older versions of PDF.js used to apply Unicode normalization to the text coming from both the text extraction API and in the hidden text layer in the DOM. The latest version normalizes the text in the text layer but, by default, not that returned by the text extraction API. The rationale is that it makes the characters in the text layer align better with the visual text. A ligature for example that is one character visually is now one character in the text instead of two.

Hypothesis uses the text extraction API to efficiently gather text from the PDF in order to find matches for saved annotations' quotes. After a match has been found however, we need to highlight the corresponding text in the text layer. If the text layer and extraction API return different text, the position range calculated based on text from the extraction API needs to be translated into a position range within the text layer. Likewise when saving a new annotation, we ideally should translate the position range in the text layer into that of the text returned by the API. There is some tolerance here because the position saved for annotations is only a hint for quote anchoring.

We already deal with the fact that whitespace can be different in the text layer versus the extraction API result. This is handled by translateOffsets as mentioned above.

There are a few approaches we could take:

Request un-normalized text from the text extraction API, so that it matches the text layer. This would mean the extracted text will be different from what was previously extracted which will have consequences for quote anchoring, although the fuzzy matching does allow some tolerance for this. Saving characters like ligatures in annotation quotes may cause some headaches elsewhere in Hypothesis (eg. in search), although we do have some normalization that happens.
Expand the existing logic for handling whitespace differences between extracted text and the text layer, so that it can also handle differences in normalization. See here. This is probably the right way to go, but only if it can be done without adding a ton of complexity.

robertknight marked this as a duplicate of #6184 Jan 31, 2025

robertknight mentioned this issue Jan 31, 2025

Update PDF.js (2024 edition) #6184

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update PDF.js (2025 edition) #6784

Update PDF.js (2025 edition) #6784

robertknight commented Jan 29, 2025 •

edited

Loading

robertknight commented Jan 29, 2025

robertknight commented Jan 29, 2025

robertknight commented Jan 29, 2025 •

edited

Loading

robertknight commented Jan 30, 2025

robertknight commented Jan 30, 2025

Update PDF.js (2025 edition) #6784

Update PDF.js (2025 edition) #6784

Comments

robertknight commented Jan 29, 2025 • edited Loading

robertknight commented Jan 29, 2025

robertknight commented Jan 29, 2025

robertknight commented Jan 29, 2025 • edited Loading

robertknight commented Jan 30, 2025

robertknight commented Jan 30, 2025

robertknight commented Jan 29, 2025 •

edited

Loading

robertknight commented Jan 29, 2025 •

edited

Loading