DO NOT MERGE: Fix reversal of numerals in Arabic script #2266
Why? Can you give an example where this rule is wrong?
What do you mean? If the problem is that it looks different when you paste it into GitHub, you can work around it like this:
=> שלום עולם! |
E.g., for RTL languages the processing is done in LTR order, so a word-final combining mark is seen first and triggers the error that the word begins with a combining mark. |
I tried with <div dir=rtl>, but the output loses its line feeds and collapses into one paragraph. When copied into GitHub, the text also shows little red marks around the numbers in Arabic script, probably some kind of control marks for a change of direction. However, no such marks are visible in Notepad++. |
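The effect described above can be illustrated with a minimal Python sketch. It is not Tesseract's code: the helper `starts_with_combiner` and the sample word are assumptions made for illustration, and the reversal simply stands in for reading an RTL word in the opposite order.

```python
import unicodedata

def starts_with_combiner(word: str) -> bool:
    """Hypothetical check: does the word begin with a combining mark (category M*)?"""
    return bool(word) and unicodedata.category(word[0]).startswith("M")

# Illustrative Arabic word ending in U+064B (ARABIC FATHATAN), a combining mark.
word = "\u0643\u062A\u0627\u0628\u064B"

# Reading the characters in the opposite order puts the final mark first.
reversed_word = word[::-1]

print(starts_with_combiner(word))           # False
print(starts_with_combiner(reversed_word))  # True
```

In logical order the mark follows its base letter, so the check passes; once the order is flipped, the same mark leads the word and would trip a "word started with a combiner" style of validation.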
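One way to test the guess that the red marks are directional control characters is to scan the text for the Unicode bidi formatting code points. The sketch below is only an assumption about how that could be checked; the function name `find_bidi_controls` and the sample string are hypothetical and not taken from the OCR output.

```python
# Unicode bidirectional formatting characters (marks, embeddings, overrides, isolates).
BIDI_CONTROLS = {
    "\u061C",  # ARABIC LETTER MARK
    "\u200E", "\u200F",  # LRM, RLM
    "\u202A", "\u202B", "\u202C", "\u202D", "\u202E",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",  # LRI, RLI, FSI, PDI
}

def find_bidi_controls(text: str) -> list[tuple[int, str]]:
    """Return (index, code point) pairs for every bidi control character in text."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if ch in BIDI_CONTROLS]

# Hypothetical sample: Arabic-Indic digits 123 wrapped in RLE ... PDF.
sample = "\u202B\u0661\u0662\u0663\u202C"
print(find_bidi_controls(sample))  # [(0, 'U+202B'), (4, 'U+202C')]
```

Running the same function over the contents of the .txt file would show whether such invisible characters are actually present around the numerals.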
Please save as .txt and upload. |
It is already there as .txt at https://github.com/Shreeshrii/tessdata_arabic/blob/master/Arabic-TOC-ara-Amiri.txt. I was trying to edit it with a div in GitHub. |
https://raw.githubusercontent.com/Shreeshrii/tessdata_arabic/master/Arabic-TOC-ara-Amiri.txt looks right in Firefox. |
Is it legal for a combining mark to be the last character in a word? |
I do not know enough about Arabic to answer that, but in Indic languages combining marks for dependent vowels are quite common as the last character in a word. |
I have tested only with the TOC image since it was easy to verify. I still need to check cases where the numbers are in the middle of a sentence. |
So 'is_combiner' is one of these: |
Bottom line: it seems to me that the commented-out code does not affect Arabic or any other RTL language. Therefore, I suggest undoing the first part. |
These do not seem to be handled correctly. So this PR should NOT be merged.
Thanks, @amitdo. I will undo and test. I am closing this PR as it does not handle the reversal in all cases. |
Errors on reverting the change:
=== Phase UP: Generating unicharset and unichar properties files === |
At iteration 75/100/101, Mean rms=1.173%, delta=2.748%, char train=11.154%, word train=18.172%, skip ratio=1%, New best char error = 11.154 wrote best model:./ara-Amiri-Revert-from-Arabic/ara-Amiri-Revert11.154_75.checkpoint wrote checkpoint.
Encoding of string failed! Failure bytes: d9 8b d8 a7 d8 af d9 83 d8 a4 d9 85 db 94 d8 a9 d9 8a d8 b1 d8 a7 d8 af d8 a7 d9 84 d8 a7 20 d8 a9 d8 a6 d9 8a d9 87 d9 84 d8 a7 20 d8 a1 d8 a7 d8 b6 d8 b9 d8 a3 d8 a8 20 d8 a9 d9 82 d8 ab d9 84 d8 a7 20 d8 af d9 8a d8 af d8 ac d8 aa
Encoding of string failed! Failure bytes: d9 92 d9 86 d9 90 d9 85 20 db 94 db 8c da be d8 aa 20 d9 84 d8 b2 d8 ba 20 d9 88 d8 ac 20 d8 a7 da af d9 88 db 81 20 db 94 da ba db 8c d8 b1 da a9 20 db 94 db 92 db 81 20 d8 b4 db 8c d9 be 20 29 d9 a3 20 d9 86 d8 b3 d8 ad 20 22 20 d9 90 d8 b2 d9 85 d8 b1 20 d8 9b 29 20 d8 b1 da af d9 85 20 d8 aa d8 b1 d8 a7 da be d8 a8 d8 8c d9 86 d8 a7 d8 aa d8 b3 da a9 d8 a7 d9 be 20 da be d8 aa d8 a7 d8 b3 20 d8 8f |
Word started with a combiner:0x64b
Encoding of string failed! Failure bytes: d9 8b d8 a7 d8 af d9 83 d8 a4 d9 85 db 94 d8 a9 d9 8a d8 b1 d8 a7 d8 af d8 a7 d9 84 d8 a7 20 d8 a9 d8 a6 d9 8a d9 87 d9 84 d8 a7 20 d8 a1 d8 a7 d8 b6 d8 b9 d8 a3 d8 a8 20 d8 a9 d9 82 d8 ab d9 84 d8 a7 20 d8 af d9 8a d8 af d8 ac d8 aa |
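The failure bytes in the log can be decoded to confirm which character the trainer is complaining about. The following is a small Python sketch that decodes only the leading bytes of the logged sequence as UTF-8 and inspects the first code point; the variable names are illustrative.

```python
import unicodedata

# Leading bytes of the logged "Failure bytes" sequence (hex, space separated).
failure_bytes = "d9 8b d8 a7 d8 af d9 83"

# bytes.fromhex ignores the spaces; the log bytes are UTF-8.
text = bytes.fromhex(failure_bytes).decode("utf-8")
first = text[0]

print(hex(ord(first)), unicodedata.name(first), unicodedata.category(first))
# 0x64b ARABIC FATHATAN Mn
```

The first code point is U+064B, a nonspacing combining mark, which matches the "Word started with a combiner:0x64b" message: the string as fed to the encoder begins with a mark that would normally end a word.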
I missed this part:
|
I was wrong :-) |
This was the easy part :-) |
Thanks, @amitdo. These are useful reference links. |
Fixes issue #2263
Also addresses related issues:
Add Indic numerals and missing punctuation to Arabic tesseract-ocr/langdata#131
Arabic training data has room for improvement #2047
See https://github.com/Shreeshrii/tessdata_arabic for the fine-tuned traineddata file, test image, and OCR output.
The OCR output looks correct when viewed in Notepad++ in RTL view. However, I am not able to copy it into GitHub.