App Crash when highlight a word on Arabic pdf #12478

mhmadaladin · 2024-09-06T16:57:47Z

KOReader version: 7/2024
Device: Kindle PW3, jailbroken with LanguageBreak

Issue

When I try to highlight any word while the Force OCR option is "on" and Arabic is selected as the language, KOReader crashes immediately.

Note 1: I'm using the 3.04 Tesseract Arabic files, and I changed the language in the defaults.custom.lua file, as described in the manual, to replace Chinese with Arabic.

Note 2: A friend of mine uses the exact same Tesseract files and has no problems, or only rare crashes. His device is a jailbroken Kindle Oasis (9th generation).

Steps to reproduce

1- Open KOReader through the KUAL launcher
2- Choose any PDF book (it mainly happens with Arabic PDFs)
3- Choose the Force OCR option and the Arabic language
4- Try to highlight any word or paragraph

`crash.log` (if applicable)


crash.log


NiLuJe · 2024-09-06T21:20:00Z

Do you actually have a proper set of tesseract data installed for your language?

I'm unfamiliar with tesseract and its error messages, but what's in the logs vaguely smells like PEBCAK, at least in part ;).

mhmadaladin · 2024-09-06T21:41:20Z

Yes, I installed the Arabic Tesseract trained data with all complementary files from https://github.com/tesseract-ocr/tessdata/tree/3.04.00 as mentioned in the user guide, and it worked normally on another device, as mentioned before.

The log may look odd because I tried many books to make sure the problem is not specific to one file.

I think this part of the log is the main problem, if anyone can help:
Error: LSTM requested, but not present!! Loading tesseract.
mgr->GetComponent(TESSDATA_INTTEMP, &fp):Error:Assert failed:in file build/arm-kindlepw2-linux-gnueabi/thirdparty/tesseract/source/src/classify/adaptmatch.cpp, line 539
lipc-wait-event exited normally with status: 0
Aborted
---------**

benoit-pierre · 2024-09-06T21:56:13Z

Yes, I installed the Arabic Tesseract trained data with all complimentary files from https://github.com/tesseract-ocr/tessdata/tree/3.04.00 as mentioned in the user guide, and it worked normally on other device as mentioned before

That's too old for the updated tesseract we use starting with 07/2024. The info message in KOReader is up-to-date:

No OCR results or no language data.

KOReader has a build-in OCR engine for recognizing words in scanned PDF and DjVu documents. In order to use OCR in scanned pages, you need to install tesseract trained data for your document language.

You can download language data files for Tesseract version 5.3.4 from https://tesseract-ocr.github.io/tessdoc/Data-Files

Copy the language data files (e.g., eng.traineddata for English and spa.traineddata for Spanish) into koreader/data/tessdata

@offset-torque: that part of the user guide needs to be updated.
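
As a rough way to confirm which language files are actually present after the copy step quoted above, here is a minimal Python sketch. The helper names `installed_languages` and `has_language` are hypothetical (they are not part of KOReader), and the check only looks at file names: it cannot tell whether a file was built for Tesseract 3.04 or for 5.x.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the language codes that have a .traineddata file in tessdata_dir.

    Hypothetical helper: it only inspects file names, not file contents.
    """
    directory = Path(tessdata_dir)
    if not directory.is_dir():
        return []
    return sorted(p.stem for p in directory.glob("*.traineddata"))


def has_language(tessdata_dir, lang):
    """True if <lang>.traineddata exists (this says nothing about its version)."""
    return lang in installed_languages(tessdata_dir)
```

On a device, `tessdata_dir` would be the koreader/data/tessdata folder mentioned in the help text; run against a copy of that folder on a computer, `has_language(path, "ara")` should be true once the Arabic file is in place.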

mergen3107 · 2024-09-06T22:02:39Z

I missed that too! I should update my ocr files too :D


offset-torque · 2024-09-07T04:17:25Z

that part of the user guide needs to be updated.

If the person who is responsible for the OCR integration in KOReader corrects the current guide section with up-to-date information, I can include it in the upcoming update. If this is not you, please ping the relevant dev.

NiLuJe · 2024-09-07T04:53:33Z

It's basically what was quoted above (i.e., the current in-app help text). I have no idea who wrote what's currently in the guide for that section (and it doesn't matter all that much, nothing much has changed in practice, we just need to stop pointing to outdated links, basically).

mergen3107 · 2024-09-07T04:57:27Z

I think it was me 👀
On mobile now, can’t edit stuff much, sorry

offset-torque · 2024-09-07T06:02:43Z

I have no idea who wrote what's currently in the guide for that section (and it doesn't matter all that much, nothing much has changed in practice, we just need to stop pointing to outdated links, basically).

The outdated info I linked to is on our wiki page (under the title "Dictionary support"), so that's outside my jurisdiction.

We have two options:

1- Someone updates the wiki page so this link directs the user to the correct information
2- Someone updates the OCR section in the user guide properly (it has not been touched for at least 3 years)

Considering that more users will follow the user guide than the wiki page, I suggest the second option. In short, until one of our devs spends a little effort revising this tiny section, the user guide will stay like this.

mhmadaladin · 2024-09-07T06:28:54Z

Yes, I installed the Arabic Tesseract trained data with all complimentary files from https://github.com/tesseract-ocr/tessdata/tree/3.04.00 as mentioned in the user guide, and it worked normally on other device as mentioned before

That's too old for the updated tesseract we use starting with 07/2024. The info message in KOReader is up-to-date:

No OCR results or no language data.
KOReader has a build-in OCR engine for recognizing words in scanned PDF and DjVu documents. In order to use OCR in scanned pages, you need to install tesseract trained data for your document language.
You can download language data files for Tesseract version 5.3.4 from https://tesseract-ocr.github.io/tessdoc/Data-Files
Copy the language data files (e.g., eng.traineddata for English and spa.traineddata for Spanish) into koreader/data/tessdata

@offset-torque: that part of the user guide needs to be updated.

Thank you so much. There is no crash now, but unfortunately most words are not recognized, and a message appears saying there is no OCR Tesseract data. I think this has to do with the quality of the Arabic trained data, but any help is appreciated.

benoit-pierre · 2024-09-07T10:00:40Z

Have you tried the 3 variants (tessdata, tessdata-best, tessdata-fast)?

mhmadaladin · 2024-09-07T10:55:39Z

Have you tried the 3 variants (tessdata, tessdata-best, tessdata-fast)?

No, only one. I'll try the other two when I return from work. Thank you for your support.

poire-z · 2024-09-07T11:14:29Z

I must say I have always had a bad experience highlighting or looking up words in scanned PDFs (including my own, made from photos of book pages taken at public libraries, by simply concatenating the JPGs, or converting them to B&W TIFs first, into a PDF).
I thought I would update my tessdata, using https://github.com/tesseract-ocr/tessdata_fast/blob/main/eng.traineddata, and see, but still no luck.

I.e., in our sample.pdf, with Forced OCR set to on, long-pressing on that "population" word gives me a highlight where the black bar is and a "No OCR results or no language data" message.

I don't know if OCR is already at play there, or if it is some other part of the code that should pick up segments of text in the bitmap before giving it to OCR.

(I remember that 6 years ago, I opened #3688, mentioning it may have something to do with the book's DPI.)

mhmadaladin · 2024-09-07T12:54:19Z

I must say I always had a bad experience highlight/lookup'ing in scanned PDF (including in mine made of book page photos made at public libraries, just concatenating the JPG - or after conversion to B&W TIF, in a PDF). Thought I would update my tessdata, using https://github.com/tesseract-ocr/tessdata_fast/blob/main/eng.traineddata and see, but still no.

Ie. in our sample.pdf, setting Forced OCR: on, long-pressin on that "population" word, I get a highlight where the black bar is and a "No OCR results or no language data" message.

I don't know if OCR is already at play there, or if it is some other part of the code that should pick up segments of text in the bitmap before giving it to OCR.

(I remember that 6 years ago, I opened #3688, mentionning it may have something to do with the book dpi.)

Thanks for the DPI tip. I didn't imagine English OCR also had problems. I hope the OCR settings will be improved in the next update.

mergen3107 · 2024-09-07T13:56:02Z

Yes, the wrong OCR coordinates are still a mystery to me. I couldn't figure it out.

NiLuJe · 2024-09-07T14:14:56Z

Someone updates the wiki page so this link directs the user to the correct information

Done.


offset-torque · 2024-09-07T14:56:42Z

I will expand the user guide section according to the updated wiki.

NiLuJe · 2024-09-07T15:12:29Z

I've just now quickly reworded the following section that mentioned deprecated defaults.lua stuff, that might need to be reworded to align with however the defaults stuff is explained elsewhere in the guide ;).

(Unfortunately, you can't add new entries to arrays in the Advanced settings UI, so we can't quite entirely get rid of the manual edit nonsense).

Steven630 · 2024-09-08T02:46:38Z

#12481 I can't get Forced OCR to work even with the correct data copied to that folder.

mhmadaladin · 2024-09-08T08:30:38Z

#12481 I can't get Forced OCR to work even with the correct data copied to that folder.

I don't know about the Android version, but it worked on my Kindle, although the results are not very good. Until the Force OCR problem is resolved, you can OCR your PDF yourself and read the OCRed version directly in KOReader. After a lot of trials, I found that PDF24 gives fairly good results and is free:
https://tools.pdf24.org/en/ocr-pdf

Steven630 · 2024-09-08T08:49:23Z

#12481 I can't get Forced OCR to work even with the correct data copied to that folder.

I don't know about Android version, but it worked on my kindle, although not very good result, you can OCR your pdf and read the OCRed version directly on koreader, until the Force OCR problem resolved, after a lot of trials Pdf24 give fairly good results and Free
https://tools.pdf24.org/en/ocr-pdf

Thank you for your reply. I tried to OCR my file and made a PDF with a text layer using ABBYY. Strangely enough, even though the resulting file is smaller and I can select words on my computer, KOReader just won't let me highlight words or look them up in the dictionary. Long-pressing has no effect whatsoever. I wonder if there is a format requirement for PDFs with text layers to work in KOReader. I really want to highlight notes in this book.

mhmadaladin · 2024-09-08T09:09:03Z

#12481 I can't get Forced OCR to work even with the correct data copied to that folder.

I don't know about Android version, but it worked on my kindle, although not very good result, you can OCR your pdf and read the OCRed version directly on koreader, until the Force OCR problem resolved, after a lot of trials Pdf24 give fairly good results and Free
https://tools.pdf24.org/en/ocr-pdf

Thank you for your reply. I tried to OCR my file and made a PDF with a text layer with ABBYY. Strangely enough, even though the result file is smaller and I can select words on my computer, Koreader just won't let me highlight or look up words in the dictionary. Long pressing had no effect whatsoever. I wonder if there is a format requirement for pdfs with text layers to work in Koreader. I really want to highlight notes in this book.

That's strange, it works for me. Maybe you left the Force OCR option on; you must switch it off for the file's existing OCR text layer to be used. Or maybe, as you said, it differs depending on the PDF version.

Frenzie · 2024-09-08T12:22:40Z

Force OCR means "ignore the embedded text layer because it's beyond atrocious". You presumably rarely want to turn that on at all, and certainly not by default.

mhmadaladin · 2024-09-08T13:20:03Z

Force OCR means "ignore the embedded text layer because it's beyond atrocious". You presumably rarely want to turn that on at all, and certainly not by default.

Yes, that's correct if the file has a text layer to begin with. Most English books I read have an EPUB version or a PDF with a text layer, so this option is rarely needed. In other languages (like Arabic, for me), most books I need are scanned PDFs with no text layer, so I either OCR the file first using another app or website, which is time-consuming, or read it directly with the Force OCR option on, which would be a much better option if it worked properly.

Frenzie · 2024-09-08T18:54:09Z

If there's no text layer OCR is always performed. Force OCR only refers to ignoring the text layer (i.e., forcing OCR even though it's not necessary).

Is there a test document perhaps? But it's always possible OCR just doesn't do a great job.
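
The rule described in this comment can be written as a one-line predicate. This is only an illustrative Python sketch of the behavior as stated here; `needs_ocr` and its parameters are hypothetical names, not KOReader code.

```python
def needs_ocr(has_text_layer, force_ocr):
    """OCR runs when there is no text layer, or when Force OCR overrides one."""
    return force_ocr or not has_text_layer


# A scanned PDF without a text layer is OCRed regardless of the setting.
scanned_without_text_layer = needs_ocr(has_text_layer=False, force_ocr=False)
# A PDF with a usable text layer is only OCRed when Force OCR is on.
text_pdf_with_force = needs_ocr(has_text_layer=True, force_ocr=True)
```

Under this reading, turning Force OCR on changes nothing for a scanned document that has no text layer.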

mhmadaladin · 2024-09-08T20:44:22Z

If there's no text layer OCR is always performed. Force OCR only refers to ignoring the text layer (i.e., forcing OCR even though it's not necessary).

Is there a test document perhaps? But it's always possible OCR just doesn't do a great job.

Thank you, that's a great discovery 😅. Unfortunately, most words are not recognized, which is why I thought no OCR was performed automatically. Mostly, a message appears saying no OCR data is available. Here's an example of a scanned Arabic PDF:
الحياة_الخالدة_لهنرييتا_لاكس_،_ريبيكا_سكلوت.pdf

Frenzie · 2024-09-08T22:03:02Z

The recent Tesseract upgrade has changed its behavior a bit in a way we don't (hopefully not can't) deal with yet. The older version used to nearly always return some nonsense while the newer version is better about saying it doesn't know. The interface still goes by the logic that no results means you need to set it up.
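
To make the distinction concrete, here is a hypothetical Python sketch of how "language data is missing" could be separated from "the engine found nothing". The function name, its parameters, and the returned labels are invented for illustration and do not reflect how KOReader's interface is implemented.

```python
def ocr_feedback(language_data_installed, recognized_text):
    """Classify an OCR attempt so the two failure cases get different messages.

    Hypothetical helper; the labels are illustrative only.
    """
    if not language_data_installed:
        return "setup_needed"   # the user still has to install trained data
    if not recognized_text.strip():
        return "no_result"      # data is present, the engine recognized nothing
    return "ok"
```

With a split like this, an empty result from a correctly installed language would no longer be reported as a setup problem.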

benoit-pierre · 2024-09-08T22:18:10Z

I can confirm that most words don't seem to be recognized, even with the best model. The issue with Arabic support seems to be known (Cf. tesseract-ocr/tesseract#2047).

There are other models available here, plus this one mentioned in the issue above. I don't get better results on that test document, but maybe you'll have more luck on other documents? When testing them, make sure to rename the file to ara.traineddata.
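
The renaming step can be done while copying. Below is a minimal Python sketch, assuming the alternative model was downloaded to a computer first; `install_model` is a hypothetical helper, and the target folder is whatever tessdata directory your KOReader installation uses.

```python
import shutil
from pathlib import Path


def install_model(downloaded_file, tessdata_dir, lang="ara"):
    """Copy a downloaded model into tessdata_dir as <lang>.traineddata.

    Hypothetical helper: any existing <lang>.traineddata is overwritten,
    so keep a backup if you want to switch back to the previous model.
    """
    target_dir = Path(tessdata_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{lang}.traineddata"
    shutil.copyfile(downloaded_file, target)
    return target
```

Because only one file can carry the name ara.traineddata, trying several models means repeating this copy for each one and testing in between.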

mhmadaladin · 2024-09-09T07:45:44Z

The recent Tesseract upgrade has changed its behavior a bit in a way we don't (hopefully not can't) deal with yet. The older version used to nearly always return some nonsense while the newer version is better about saying it doesn't know. The interface still goes by the logic that no results means you need to set it up.

Yes the last update is slightly better, I hope it could be fixed soon.

mhmadaladin · 2024-09-09T07:51:03Z

I can confirm that most words don't seem to be recognized, even with the best model. The issue with Arabic support seems to be known (Cf. tesseract-ocr/tesseract#2047).

There are other models available here, plus this one mentioned in the issue above. I don't get better results on that test document, but maybe you'll have more luck on other documents? When testing them, make sure to rename the file to ara.traineddata.

Thank you 🙏. Yes, the Arabic Tesseract support is so messed up.
Sorry, but how do I download the trained data files from the first link? I can't see any download option.

benoit-pierre · 2024-09-09T08:24:20Z

Sorry but how to download the train files in the first link, I can't see any option to download.

Click on a file, and then on the "Raw" or download (📥) button.

mhmadaladin · 2024-09-09T09:49:38Z

Sorry but how to download the train files in the first link, I can't see any option to download.

Click on a file, and then on the "Raw" or download (📥) button.

Yes, I found them 😅. I didn't see them at first among the other files. Thank you again for your support.

benoit-pierre · 2024-09-10T15:15:20Z

Can you try after changing ocr_type from 3 to -1 in frontend/document/koptinterface.lua?

--- i/frontend/document/koptinterface.lua
+++ w/frontend/document/koptinterface.lua
@@ -24,7 +24,7 @@ local KoptInterface = {
     -- in `$TESSDATA_PREFIX/` on more recent versions).
     tessocr_data = not os.getenv('TESSDATA_PREFIX') and DataStorage:getDataDir().."/data/tessdata" or nil,
     ocr_lang = "eng",
-    ocr_type = 3, -- default 0, for more accuracy use 3
+    ocr_type = -1, -- default 0, for more accuracy use 3
     last_context_size = nil,
     default_context_size = 1024*1024,
 }
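
For anyone who prefers not to edit the file by hand, the same one-line change can be applied to the file's text with a small script. This is a Python sketch under the assumption that the file contains a single `ocr_type = <number>` assignment at the start of a line, as in the diff above; `set_ocr_type` is a hypothetical helper, and reading and writing the file on the device is left out.

```python
import re


def set_ocr_type(lua_source, new_value):
    """Return lua_source with the first `ocr_type = <int>` assignment changed.

    Only the number is replaced; the trailing Lua comment is left untouched.
    Raises ValueError if no such assignment is found.
    """
    pattern = re.compile(r"^([ \t]*ocr_type[ \t]*=[ \t]*)-?\d+", flags=re.MULTILINE)
    patched, count = pattern.subn(
        lambda m: m.group(1) + str(new_value), lua_source, count=1
    )
    if count == 0:
        raise ValueError("no ocr_type assignment found")
    return patched
```

Calling `set_ocr_type(text, 3)` on the patched text restores the original value, which makes it easy to compare the two settings on the same document.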

mhmadaladin · 2024-09-20T01:39:11Z

Can you try after changing ocr_type from 3 to -1 in frontend/document/koptinterface.lua?

--- i/frontend/document/koptinterface.lua
+++ w/frontend/document/koptinterface.lua
@@ -24,7 +24,7 @@ local KoptInterface = {
     -- in `$TESSDATA_PREFIX/` on more recent versions).
     tessocr_data = not os.getenv('TESSDATA_PREFIX') and DataStorage:getDataDir().."/data/tessdata" or nil,
     ocr_lang = "eng",
-    ocr_type = 3, -- default 0, for more accuracy use 3
+    ocr_type = -1, -- default 0, for more accuracy use 3
     last_context_size = nil,
     default_context_size = 1024*1024,
 }

Thanks for the tip, the recognition is much better now. It is of course still far from perfect, especially for sentences, but for single words the recognition is now better by miles.

mhmadaladin · 2024-09-28T01:18:00Z

Can you try after changing ocr_type from 3 to -1 in frontend/document/koptinterface.lua?

--- i/frontend/document/koptinterface.lua
+++ w/frontend/document/koptinterface.lua
@@ -24,7 +24,7 @@ local KoptInterface = {
     -- in `$TESSDATA_PREFIX/` on more recent versions).
     tessocr_data = not os.getenv('TESSDATA_PREFIX') and DataStorage:getDataDir().."/data/tessdata" or nil,
     ocr_lang = "eng",
-    ocr_type = 3, -- default 0, for more accuracy use 3
+    ocr_type = -1, -- default 0, for more accuracy use 3
     last_context_size = nil,
     default_context_size = 1024*1024,
 }

I have a question, if you can help: now when I highlight text, the text is not shown in the bookmarks list; it appears as an empty line. Is there an option I can change to make the highlighted text appear in the bookmarks list?
I attached a picture to explain.

NiLuJe added the need more info label Sep 6, 2024

NiLuJe added a commit to NiLuJe/koreader that referenced this issue Sep 6, 2024

ReaderHighlight: Fix an old typo in the OCR help string

e3da323

Boticed while triaging koreader#12478

NiLuJe mentioned this issue Sep 6, 2024

ReaderHighlight: Fix an old typo in the OCR help string #12479

Merged

NiLuJe added a commit that referenced this issue Sep 7, 2024

ReaderHighlight: Fix an old typo in the OCR help string (#12479)

ffc4929

Noticed while triaging #12478 ;)

benoit-pierre mentioned this issue Nov 15, 2024

Lines empty in bookmarks list when highlighting Arabic text #12736

Open
