Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

installation.rst: remove need to install tesseract #3266

Merged
merged 3 commits into from
Mar 25, 2024

Conversation

deeplow
Copy link
Contributor

@deeplow deeplow commented Mar 14, 2024

As stated in the document already, Tesseract OCR is already bundled into the application.

As stated in the document already, Tesseract OCR is already bundled into the application.
Copy link
Contributor

github-actions bot commented Mar 14, 2024

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

@deeplow
Copy link
Contributor Author

deeplow commented Mar 14, 2024

I have read the CLA Document and I hereby sign the CLA

github-actions bot added a commit that referenced this pull request Mar 14, 2024
@deeplow
Copy link
Contributor Author

deeplow commented Mar 14, 2024

On the Dangerzone project our original read of the documentation made us assume that we'd have to install Tesseract OCR on the host, which can be quite challenging to do in a cross-platform way.

However, later we found that this was not the case and it made an incredible difference (thanks PyMuPDF team!). So hopefully this commit helps future users of the library realize that they don't have to install Tesseract.

@jamie-lemon
Copy link
Collaborator

Thanks for pointing this stuff out.

I think for this one it is better if bullet point 1, remains and it says:

1. Ensure that Tesseract is installed

Because we can't assume it is installed (even though it probably is vi the MuPDF installation), possibly someone could have removed it - who knows?

Also further up the section it says in the text: "PyMuPDF will already contain all the logic to support OCR functions. But it additionally does need Tesseract’s language support data, so installation of Tesseract-OCR is still required."

Perhaps that needs to say: "PyMuPDF will already contain all the logic to support OCR functions. But it additionally does need Tesseract’s language support data, which should already be installed on your system."

@deeplow
Copy link
Contributor Author

deeplow commented Mar 22, 2024

Thanks for pointing this stuff out.

You're welcome.

I think for this one it is better if bullet point 1, remains and it says:

  1. Ensure that Tesseract is installed

Because we can't assume it is installed (even though it probably is vi the MuPDF installation), possibly someone could have removed it - who knows?

I don't know about this. I'd be inclined to recommend against, because if I were reading that I'd assume that I'd have to install it. If it stays like that then there's little point in this PR. However you prefer, but I'd be inclined to remove it.

@jamie-lemon
Copy link
Collaborator

Fair enough, we can top the 1st bullet point. If you update this PR to do this part further up:
"PyMuPDF will already contain all the logic to support OCR functions. But it additionally does need Tesseract’s language support data, so installation of Tesseract-OCR is still required."
->
"PyMuPDF will already contain all the logic to support OCR functions. But it additionally does need Tesseract’s language support data, which should already be installed on your system."

Then we are good. :)

@deeplow
Copy link
Contributor Author

deeplow commented Mar 22, 2024

Great point! I can do something like that but I think we need to tweak that phrase a bit:

PyMuPDF will already contain all the logic to support OCR functions. But it additionally does need Tesseract’s language support data, which should already be installed on your system.

This technically is not the case , I think. The tesseract data is the sole thing that's missing. How about something like this:

PyMuPDF will already contain all the logic to support OCR functions. But it additionally does need Tesseract’s language support data, which you can download here.

How does this sound?

@jamie-lemon
Copy link
Collaborator

Sure - we try to avoid links which are not descriptive - so here is out !

Let's say:

PyMuPDF will already contain all the logic to support OCR functions. But it additionally does need Tesseract’s language support data.

@deeplow
Copy link
Contributor Author

deeplow commented Mar 22, 2024

Done 👌

@jamie-lemon
Copy link
Collaborator

Nearly - because this is RST markup and not straight MD links are a bit different - you need to change it to:

Screenshot 2024-03-22 at 13 49 58

I have to do a screen shot here to prevent this comment form parsing the backticks as code :)

See: https://sublime-and-sphinx-guide.readthedocs.io/en/latest/references.html#links-to-external-web-pages

@deeplow
Copy link
Contributor Author

deeplow commented Mar 25, 2024

Whoops. Sorry about that. Fixed.

Copy link
Collaborator

@jamie-lemon jamie-lemon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jamie-lemon jamie-lemon merged commit 8e32ba1 into pymupdf:main Mar 25, 2024
2 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Mar 25, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants