installation.rst: remove need to install tesseract #3266

deeplow · 2024-03-14T19:41:49Z

As stated in the document already, Tesseract OCR is already bundled into the application.

github-actions · 2024-03-14T19:42:12Z

CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅

deeplow · 2024-03-14T19:42:52Z

I have read the CLA Document and I hereby sign the CLA

deeplow · 2024-03-14T19:45:08Z

On the Dangerzone project our original read of the documentation made us assume that we'd have to install Tesseract OCR on the host, which can be quite challenging to do in a cross-platform way.

However, later we found that this was not the case and it made an incredible difference (thanks PyMuPDF team!). So hopefully this commit helps future users of the library realize that they don't have to install Tesseract.

jamie-lemon · 2024-03-14T23:54:51Z

Thanks for pointing this stuff out.

I think for this one it is better if bullet point 1 remains and says:

1. Ensure that Tesseract is installed

Because we can't assume it is installed (even though it probably is, via the MuPDF installation). Possibly someone could have removed it - who knows?

Also, further up in the section, the text says: "PyMuPDF will already contain all the logic to support OCR functions. But it additionally does need Tesseract’s language support data, so installation of Tesseract-OCR is still required."

Perhaps that needs to say: "PyMuPDF will already contain all the logic to support OCR functions. But it additionally does need Tesseract’s language support data, which should already be installed on your system."

deeplow · 2024-03-22T11:27:33Z

"Thanks for pointing this stuff out."

You're welcome.

"I think for this one it is better if bullet point 1, remains and it says: Ensure that Tesseract is installed. Because we can't assume it is installed (even though it probably is via the MuPDF installation), possibly someone could have removed it - who knows?"

I don't know about this. I'd be inclined to recommend against it, because if I were reading that I'd assume that I had to install Tesseract. If it stays like that then there's little point in this PR. It's your call, but I'd be inclined to remove it.

jamie-lemon · 2024-03-22T13:24:45Z

Fair enough, we can drop the 1st bullet point. If you update this PR to do this part further up:
"PyMuPDF will already contain all the logic to support OCR functions. But it additionally does need Tesseract’s language support data, so installation of Tesseract-OCR is still required."
->
"PyMuPDF will already contain all the logic to support OCR functions. But it additionally does need Tesseract’s language support data, which should already be installed on your system."

Then we are good. :)

deeplow · 2024-03-22T13:30:58Z

Great point! I can do something like that but I think we need to tweak that phrase a bit:

PyMuPDF will already contain all the logic to support OCR functions. But it additionally does need Tesseract’s language support data, which should already be installed on your system.

This technically is not the case, I think. The Tesseract data is the sole thing that's missing. How about something like this:

PyMuPDF will already contain all the logic to support OCR functions. But it additionally does need Tesseract’s language support data, which you can download here.

How does this sound?

jamie-lemon · 2024-03-22T13:40:41Z

Sure - we try to avoid links which are not descriptive - so "here" is out!

Let's say:

PyMuPDF will already contain all the logic to support OCR functions. But it additionally does need Tesseract’s language support data.
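For readers wondering how that language data might be located at runtime, here is a minimal sketch. It assumes the data directory is advertised through the `TESSDATA_PREFIX` environment variable (the variable Tesseract itself uses); the helper name `find_tessdata` is hypothetical and not part of PyMuPDF.

```python
import os
from pathlib import Path


def find_tessdata(env=None):
    """Return the tessdata directory named by TESSDATA_PREFIX, or None.

    Hypothetical helper: it only inspects the environment mapping and
    checks that the directory exists. It does not verify that any
    language files are actually present inside it.
    """
    if env is None:
        env = os.environ
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        return None  # variable unset or empty
    path = Path(prefix)
    return path if path.is_dir() else None
```

A caller could use the `None` result to show a message explaining that only the language data is missing, rather than asking the user to install Tesseract itself.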

deeplow · 2024-03-22T13:43:07Z

Done 👌

jamie-lemon · 2024-03-22T13:51:38Z

Nearly - because this is RST markup and not straight MD, links are a bit different - you need to change it to:

I have to do a screen shot here to prevent this comment form parsing the backticks as code :)

See: https://sublime-and-sphinx-guide.readthedocs.io/en/latest/references.html#links-to-external-web-pages
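As a rough illustration of the difference between the two link syntaxes (not necessarily what the screenshot showed), the sketch below rewrites a Markdown inline link of the form `[text](url)` into the RST external link form described on the page linked above. The function name `md_link_to_rst` and the example URL are made up for this sketch, and the regex only handles simple inline links.

```python
import re

# Matches a simple Markdown inline link: [text](url)
_MD_LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def md_link_to_rst(text: str) -> str:
    """Rewrite Markdown inline links as RST external links.

    RST wraps the link in backticks, puts the URL in angle brackets
    after the link text, and ends with a trailing underscore.
    """
    return _MD_LINK.sub(r"`\1 <\2>`_", text)


print(md_link_to_rst("[language support data](https://example.com/tessdata)"))
```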

deeplow · 2024-03-25T15:45:15Z

Whoops. Sorry about that. Fixed.

jamie-lemon

LGTM

installation.rst: remove need to install tesseract

7559ba4

As stated in the document already, Tesseract OCR is already bundled into the application.

github-actions bot added a commit that referenced this pull request Mar 14, 2024

@deeplow has signed the CLA from Pull Request #3266

d36311d

Note that tessdata needs to be installed

42d1438

Fix link (RST -- not Markdown)

6a8fb14

jamie-lemon approved these changes Mar 25, 2024

jamie-lemon merged commit 8e32ba1 into pymupdf:main Mar 25, 2024
2 checks passed

github-actions bot locked and limited conversation to collaborators Mar 25, 2024
