OCR on pages 2+ is only recognized in browsers but not by poppler/etree unless done in two steps #1459

Open
jribault opened this issue Jan 14, 2025 · 0 comments


I have a PDF whose first page is already OCRed (and which has an OCRed footer on every page). When I run ocrmypdf on it and force OCR on pages 2–99999999, the resulting PDF has selectable text on all pages in a browser, but when I process it in Python (using a library based on poppler and etree), only the text from page 1 is accessible.

VN---3335Y_Y.fs.pdf

ocrmypdf --pages 2-99999999 --tesseract-timeout 3600 --oversample 300 --clean -l vie+eng --force-ocr \
    VN/VN---3335Y_Y.fs.pdf \
    VNsearchable/VN---3335Y_Y.fs.pdf

VN---3335Y_Y.fs-problem.pdf

Workaround
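First, remove the OCR layer on pages 2+ and produce an intermediate PDF. With --force-ocr the pages are rasterized, and --tesseract-timeout 0 means no new text is recognized, so those pages end up with no text layer: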

ocrmypdf --pages 2-99999999 --tesseract-timeout 0 --oversample 300 --clean -l vie+eng --force-ocr \
    VN/VN---3335Y_Y.fs.pdf \
    VNsearchable/VN---3335Y_Y.fs-step1.pdf

VN---3335Y_Y.fs-step1.pdf

Then run with --skip-text on the intermediate file:

ocrmypdf --tesseract-timeout 3600 --oversample 300 --clean -l vie+eng --skip-text \
    VNsearchable/VN---3335Y_Y.fs-step1.pdf \
    VNsearchable/VN---3335Y_Y.pdf

VN---3335Y_Y.fs.pdf

Now the text on every page (including page 2 onwards) is fully accessible via poppler + etree.
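For anyone trying to reproduce this, here is a minimal sketch of how the per-page text could be checked with poppler and etree. It is not the library I actually use: it assumes poppler's `pdftotext` is on the PATH and that `pdftotext -bbox` emits XHTML with one `page` element per page and one `word` element per recognized word. The function names `words_per_page` and `pdf_words_per_page` are made up for illustration.

```python
import subprocess
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://www.w3.org/1999/xhtml}'."""
    return tag.rsplit("}", 1)[-1]


def words_per_page(bbox_xhtml):
    """Count <word> elements under each <page> element of bbox-style XHTML."""
    root = ET.fromstring(bbox_xhtml)
    counts = []
    for element in root.iter():
        if local_name(element.tag) == "page":
            counts.append(
                sum(1 for child in element.iter() if local_name(child.tag) == "word")
            )
    return counts


def pdf_words_per_page(pdf_path):
    """Run pdftotext -bbox on a PDF (assumed to be installed) and count words per page."""
    result = subprocess.run(
        ["pdftotext", "-bbox", pdf_path, "-"],  # "-" sends the output to stdout
        check=True,
        capture_output=True,
    )
    return words_per_page(result.stdout)
```

Calling `pdf_words_per_page` on the one-step output and on the two-step output should show whether pages 2+ report zero words in one file and a non-zero count in the other.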

Expected Behavior

Forcing OCR on pages 2+ in one command should yield the same PDF as doing it in two steps (first removing the OCR layer on pages 2+ then re-running OCR with --skip-text).

If I’ve misunderstood anything or missed any important detail, please let me know. I really appreciate your help in troubleshooting this!
