Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update PDF.js (2025 edition) #6784

Open
robertknight opened this issue Jan 29, 2025 · 5 comments
Open

Update PDF.js (2025 edition) #6784

robertknight opened this issue Jan 29, 2025 · 5 comments

Comments

@robertknight
Copy link
Member

robertknight commented Jan 29, 2025

PDF.js was last updated nearly three years ago.

This issue covers updating to the latest version and addressing any compatibility issues we run into.

Issues identified:

  • Mismatch of Unicode normalization between text returned by text extraction APIs and text layer. There may also be mismatches in normalization with saved annotations' quotes.
  • Position of highlights is not updated after changing zoom level
  • Clicking on annotations doesn't select them in the sidebar, although creating a selection that includes an annotation does indicate there is a highlight (something preventing clicks from propagating?)
  • The supported browser range for the "legacy" build changed. It only goes back to Safari 16.4 (Mar 2022, now just under 2 years old) and Chrome 103 (May 2022)
@robertknight
Copy link
Member Author

From hypothesis/pdf.js-hypothes.is#31 (comment):

The console shows a warning "Text layer text does not match page text. Highlights will be mis-aligned."

Intermittently I have seen an error about a getDownloadInfo property referencing code in pdf-metadata.ts in the client. This method still exists upstream, so it looks like we might be trying to access it before app.pdfDocument has been set?

@robertknight
Copy link
Member Author

The most obvious issue is that the position of highlights gets out of sync with the text when zooming in and out. PDF.js has changed the way they update the DOM after zoom. Previously it would remove and recreate various elements and that would trigger re-creation of the highlights. Now it looks like it does something cheaper where a --scale-factor CSS variable is updated and that scales various PDF.js viewer elements. It doesn't affect the highlight elements we have drawn however.

Image

@robertknight
Copy link
Member Author

robertknight commented Jan 29, 2025

The console shows a warning "Text layer text does not match page text. Highlights will be mis-aligned."

Digging into this issue, on page 2 of the TraceMonkey paper used in the PDF.js demo I see differences in Unicode normalization between the text extracted from the PDF by our getPageTextContent function and the textContent property of the DOM node from the text layer where the user made a text selection.

In the text layer:

Image

Here "flow" uses a single char ligature, whereas the text returned from PDF.js's text extraction APIs uses separate "f" and "l" characters.

This may have been caused by mozilla/pdf.js#16200. We have a translateOffsets function that can handle whitespace differences between the text layer and the extracted text. We may also need to handle Unicode normalization differences here.

@robertknight
Copy link
Member Author

Related to the change in Unicode normalization, there may be some accessibility issues arising from this. See nvaccess/nvda#14740.

@robertknight
Copy link
Member Author

Here "flow" uses a single char ligature, whereas the text returned from PDF.js's text extraction APIs uses separate "f" and "l" characters.

To summarize the issue for future reference:

Older versions of PDF.js used to apply Unicode normalization to the text coming from both the text extraction API and in the hidden text layer in the DOM. The latest version normalizes the text in the text layer but, by default, not that returned by the text extraction API. The rationale is that it makes the characters in the text layer align better with the visual text. A ligature for example that is one character visually is now one character in the text instead of two.

Hypothesis uses the text extraction API to efficiently gather text from the PDF in order to find matches for saved annotations' quotes. After a match has been found however, we need to highlight the corresponding text in the text layer. If the text layer and extraction API return different text, the position range calculated based on text from the extraction API needs to be translated into a position range within the text layer. Likewise when saving a new annotation, we ideally should translate the position range in the text layer into that of the text returned by the API. There is some tolerance here because the position saved for annotations is only a hint for quote anchoring.

We already deal with the fact that whitespace can be different in the text layer versus the extraction API result. This is handled by translateOffsets as mentioned above.

There are a few approaches we could take:

  1. Request un-normalized text from the text extraction API, so that it matches the text layer. This would mean the extracted text will be different from what was previously extracted which will have consequences for quote anchoring, although the fuzzy matching does allow some tolerance for this. Saving characters like ligatures in annotation quotes may cause some headaches elsewhere in Hypothesis (eg. in search), although we do have some normalization that happens.
  2. Expand the existing logic for handling whitespace differences between extracted text and the text layer, so that it can also handle differences in normalization. See here. This is probably the right way to go, but only if it can be done without adding a ton of complexity.

@robertknight robertknight marked this as a duplicate of #6184 Jan 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant