App Crash when highlight a word on Arabic pdf #12478
Comments
Do you actually have a proper set of tesseract data installed for your language? I'm unfamiliar with tesseract and its error messages, but what's in the logs vaguely smells like PEBCAK, at least in part ;). |
Yes, I installed the Arabic Tesseract trained data with all complementary files from https://github.com/tesseract-ocr/tessdata/tree/3.04.00 as mentioned in the user guide, and it worked normally on another device, as mentioned before. The log may seem odd because I tried many books to make sure the problem is not in a specific file. I think this part of the log is the main problem, if anyone can help: |
That's too old for the updated tesseract we use starting with 07/2024. The info message in KOReader is up-to-date:
@offset-torque: that part of the user guide needs to be updated. |
I missed that too! I should update my ocr files too :D |
Noticed while triaging koreader#12478
If the person who is responsible for the OCR integration in KOReader corrects the current guide section with up-to-date information, I can include it in the upcoming update. If this is not you, please ping the relevant dev. |
It's basically what was quoted above (i.e., the current in-app help text). I have no idea who wrote what's currently in the guide for that section (and it doesn't matter all that much, nothing much has changed in practice, we just need to stop pointing to outdated links, basically). |
I think it was me 👀 |
Outdated info I linked is in our wiki page (under the title of "Dictionary support") so that's out of my jurisdiction. We have two options:
Considering that more users will follow the user guide than the wiki page, I suggest the second option. In short, until one of our devs spends a little effort to revise this tiny section, the user guide will stay like this. |
Thank you so much, now there is no crash. Unfortunately, most words are not recognized, and a message appears saying there is no OCR Tesseract data. I think this has to do with the quality of the Arabic trained data, but any help is appreciated. |
Have you tried the 3 variants ( |
No, only one. I'll try the other two when I return from work. Thank you for your support. |
I must say I always had a bad experience highlight/lookup'ing in scanned PDF (including in mine made of book page photos made at public libraries, just concatenating the JPG - or after conversion to B&W TIF, in a PDF). I.e., in our sample.pdf, setting I don't know if OCR is already at play there, or if it is some other part of the code that should pick up segments of text in the bitmap before giving it to OCR. (I remember that 6 years ago, I opened #3688, mentioning it may have something to do with the book dpi.) |
Thanks for the dpi tip. I didn't imagine English OCR also having problems. I hope the OCR settings improve in the next update. |
Yes, wrong OCR coordinates are still a mystery to me. Couldn't figure it out |
Done. |
I will expand the user guide section according to the updated wiki. |
I've just now quickly reworded the following section that mentioned deprecated (Unfortunately, you can't add new entries to arrays in the Advanced settings UI, so we can't quite entirely get rid of the manual edit nonsense). |
#12481 I can't get Forced OCR to work even with the correct data copied to that folder. |
I don't know about the Android version, but it worked on my Kindle, although the results were not very good. Until the Force OCR problem is resolved, you can OCR your PDF and read the OCRed version directly in KOReader. After a lot of trials, Pdf24 gives fairly good results and Free |
Thank you for your reply. I tried to OCR my file and made a PDF with a text layer with ABBYY. Strangely enough, even though the result file is smaller and I can select words on my computer, Koreader just won't let me highlight or look up words in the dictionary. Long pressing had no effect whatsoever. I wonder if there is a format requirement for pdfs with text layers to work in Koreader. I really want to highlight notes in this book. |
That's strange, it works for me. Maybe you left the Force OCR option on; you must switch it off for the original OCR to work. Or maybe, as you said, it differs according to the PDF version. |
Force OCR means "ignore the embedded text layer because it's beyond atrocious". You presumably rarely want to turn that on at all, and certainly not by default. |
Yes, that's correct if the file originally has a text layer. Most English books I read have an EPUB version or a PDF with a text layer, so this option is rarely needed. In other languages (like Arabic for me), most books I need are scanned PDFs with no text layer, so I either OCR the file first using another app or website, which is time-consuming, or read it directly with the Force OCR option on, which would be a much better option if it worked properly. |
If there's no text layer OCR is always performed. Force OCR only refers to ignoring the text layer (i.e., forcing OCR even though it's not necessary). Is there a test document perhaps? But it's always possible OCR just doesn't do a great job. |
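The rule described in the comment above can be written down as a tiny decision function. This is only an illustrative sketch in Python, not KOReader's actual code (which is Lua); the function name `text_source` and its two boolean parameters are hypothetical.

```python
def text_source(has_text_layer: bool, force_ocr: bool) -> str:
    """Return where selected text would come from: "ocr" or "text_layer".

    Hypothetical helper mirroring the rule stated in the thread.
    """
    # No embedded text layer: OCR is the only option, whatever the toggle says.
    if not has_text_layer:
        return "ocr"
    # Text layer present: OCR runs only if the user chose to ignore that layer.
    return "ocr" if force_ocr else "text_layer"
```

Under this reading, a scanned PDF goes through OCR with the toggle on or off, and a PDF that already has a good text layer should be read with Force OCR off.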
Thank you, that's a great discovery 😅. Unfortunately, most words are not recognized, which is why I thought no OCR was performed automatically. Mostly, a message appears saying "No OCR data available". Here's an example of a scanned Arabic PDF: |
The recent Tesseract upgrade has changed its behavior a bit in a way we don't (hopefully not can't) deal with yet. The older version used to nearly always return some nonsense while the newer version is better about saying it doesn't know. The interface still goes by the logic that no results means you need to set it up. |
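One way to picture the distinction described above is to separate "the language data is not installed" from "the engine ran but recognized nothing". The sketch below is a Python illustration only, not how KOReader handles it; `classify_ocr_result` and its return labels are hypothetical, and the only outside fact assumed is Tesseract's `<lang>.traineddata` file naming.

```python
from pathlib import Path


def classify_ocr_result(text: str, tessdata_dir: str, lang: str) -> str:
    """Classify an OCR attempt as "missing_data", "not_recognized", or "ok".

    Hypothetical helper: `text` is whatever the OCR engine returned.
    """
    # Tesseract language models are stored as <lang>.traineddata.
    model = Path(tessdata_dir) / f"{lang}.traineddata"
    if not model.is_file():
        # Nothing to run OCR with: this is the real "set it up" case.
        return "missing_data"
    if not text.strip():
        # Data is installed, but the engine declined to guess.
        return "not_recognized"
    return "ok"
```

With a check like this, an empty result on a correctly installed `ara.traineddata` would be reported as a recognition failure rather than as a missing-data prompt.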
I can confirm that most words don't seem to be recognized, even with the best model. The issue with Arabic support seems to be known (Cf. tesseract-ocr/tesseract#2047). There are other models available here, plus this one mentioned in the issue above. I don't get better results on that test document, but maybe you'll have more luck on other documents? When testing them, make sure to rename the file to |
Yes the last update is slightly better, I hope it could be fixed soon. |
Thank you 🙏, Yes the Arabic tesseract support is so messed up. |
Click on a file, and then on the "Raw" or download (📥) button. |
Yes, I found them 😅. I didn't see them at first among the other files. Thank you again for your support. |
Can you try after changing
--- i/frontend/document/koptinterface.lua
+++ w/frontend/document/koptinterface.lua
@@ -24,7 +24,7 @@ local KoptInterface = {
-- in `$TESSDATA_PREFIX/` on more recent versions).
tessocr_data = not os.getenv('TESSDATA_PREFIX') and DataStorage:getDataDir().."/data/tessdata" or nil,
ocr_lang = "eng",
- ocr_type = 3, -- default 0, for more accuracy use 3
+ ocr_type = -1, -- default 0, for more accuracy use 3
last_context_size = nil,
default_context_size = 1024*1024,
} |
Thanks for the tip, the recognition is much better now, of course still far from perfect especially for sentences, but for single words the recognition is now better by miles. |
Issue
When trying to highlight any word while the Force OCR option is "on" and the Arabic language is chosen, KOReader crashes immediately.
Note 1: I'm using the 3.04 Tesseract Arabic files and changed the language in the default.custom.lua file, as in the manual, to replace Chinese with Arabic.
Note 2: A friend of mine is using the exact same Tesseract files and has no problem, or only rare crashes. His device is a jailbroken Kindle Oasis 9th generation.
Steps to reproduce
1- open Koreader through kual launcher
2- choose any pdf book ( mainly happens with Arabic pdfs)
3- choose Force OCR option & Arabic language
4- try to highlight any word or paragraph
crash.log (if applicable)
crash.log is a file that is automatically created when KOReader crashes. It can normally be found in the KOReader directory:
/mnt/private/koreader for Cervantes
koreader/ directory for Kindle
.adds/koreader/ directory for Kobo
applications/koreader/ directory for Pocketbook
Android logs are kept in memory. Please go to [Menu] → Help → Bug Report to save these logs to a file.
Please try to include the relevant sections in your issue description.
You can upload the whole crash.log file (zipped if necessary) on GitHub by dragging and dropping it onto this textbox.
If your issue doesn't directly concern a Lua crash, we'll quite likely need you to reproduce the issue with verbose debug logging enabled before providing the logs to us.
To do so, go to Top menu → Hamburger menu → Help → Report a bug and tap Enable verbose logging. Restart as requested, then repeat the steps for your issue.
If you instead opt to inline it, please do so behind a spoiler tag:
crash.log