When I render the text "مين" (b'\u0645\u064a\u0646') using PIL (default: libraqm layout), where the text is shaped by the libraqm library, I get:
The text was transformed into b'\ufee3\uFC94'. That is the expected behavior, because "\u064a\u0646" was transformed into the ligature "\uFC94" and the initial b'\u0645' was transformed into its initial form b'\ufee3'.
Note: in b'\u0645\u064a\u0646' all letters are in unshaped form.
But when I use arabic_reshaper(text) I get:
NOTE: the image was created using PIL (basic_layout). This means that PIL renders letter by letter from left to right, so arabic_reshaper does the shaping itself, but it fails in this case.
The text I get from the reshaper is b'\ufee3\ufef4\ufee6' instead of the expected b'\ufee3\uFC94'.
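To make the two outputs easier to compare, here is a small sketch that uses only the standard library `unicodedata` module. The helper names `describe` and `logical` are made up for this illustration. It lists the code point names of each string and shows that both normalize (NFKC) back to the same unshaped text, so the difference is purely in which presentation forms were chosen.

```python
import unicodedata

expected = "\ufee3\ufc94"       # output reported for the libraqm layout
actual = "\ufee3\ufef4\ufee6"   # output reported for arabic_reshaper


def describe(text):
    """Return 'U+XXXX NAME' for every character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


def logical(text):
    """Fold presentation forms back to the base (unshaped) letters."""
    return unicodedata.normalize("NFKC", text)


for label, shaped in (("expected", expected), ("actual", actual)):
    print(label)
    for line in describe(shaped):
        print("  ", line)
    print("   unshaped:", logical(shaped).encode("unicode_escape"))
```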
The source of the problem: the ligature regex is applied and the first match is '\u0645\u064A' (ARABIC LIGATURE MEEM WITH YEH), which has only an isolated form, so you correctly skip it (continue) here: https://github.com/mpcabd/python-arabic-reshaper/blob/master/arabic_reshaper/arabic_reshaper.py#L220
But because the subsequent ligature "\u064a\u0646" overlaps with the previous match "\u0645\u064a", it is not returned by the finditer function and is therefore never applied.
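The non-overlapping behavior of `re.finditer` can be reproduced in isolation. The pattern below is a simplified stand-in with just the two ligatures discussed here, not the library's real `_ligatures_re`.

```python
import re

MEEM_YEH = "\u0645\u064a"   # first ligature candidate (skipped by the reshaper)
YEH_NOON = "\u064a\u0646"   # second candidate, shares the YEH with the first

# Simplified stand-in for the reshaper's ligature regex (assumption).
pattern = re.compile(f"({MEEM_YEH})|({YEH_NOON})")
text = "\u0645\u064a\u0646"

# finditer consumes the YEH as part of the first match, so the
# overlapping YEH+NOON candidate is never reported.
spans = [m.span() for m in pattern.finditer(text)]
print(spans)

# Searching again from one character past the start of the skipped
# match does find the overlapping candidate.
retry = pattern.search(text, 1)
print(retry.span())
```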
I have a simple fix:
diff --git a/arabic_reshaper/arabic_reshaper.py b/arabic_reshaper/arabic_reshaper.py
index 4721a6a..a94cd0f 100644
--- a/arabic_reshaper/arabic_reshaper.py
+++ b/arabic_reshaper/arabic_reshaper.py
@@ -186,14 +186,17 @@ class ArabicReshaper(object):
             if delete_tatweel:
                 text = text.replace(TATWEEL, '')
-            for match in re.finditer(self._ligatures_re, text):
+            regex_start = 0
+            matchIt = re.finditer(self._ligatures_re, text)
+            match = next(matchIt, None)
+            while match:
                 group_index = next((
                     i for i, group in enumerate(match.groups()) if group
                 ), -1)
                 forms = self._get_ligature_forms_from_re_group_index(
                     group_index
                 )
-                a, b = match.span()
+                a, b = tuple(i+regex_start for i in match.span())
                 a_form = output[a][FORM]
                 b_form = output[b - 1][FORM]
                 ligature_form = None
@@ -218,9 +221,13 @@ class ArabicReshaper(object):
                     else:
                         ligature_form = MEDIAL
                 if not forms[ligature_form]:
+                    regex_start = a+1
+                    matchIt = re.finditer(self._ligatures_re, text[regex_start:])
+                    match = next(matchIt, None)
                     continue
                 output[a] = (forms[ligature_form], NOT_SUPPORTED)
                 output[a+1:b] = repeat(('', NOT_SUPPORTED), b - 1 - a)
+                match = next(matchIt, None)
         result = []
         if not delete_harakat and -1 in positions_harakat:
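The idea behind the patch (restart the search one character after the start of a skipped match) can also be sketched as a standalone generator. This is only an illustration, not the library's code. `find_ligatures`, `is_usable`, and the two-ligature pattern are hypothetical names and simplifications introduced here. It uses `pattern.search(text, pos)` instead of slicing the text, so match spans stay absolute and no offset bookkeeping is needed.

```python
import re


def find_ligatures(pattern, text, is_usable):
    """Yield usable matches; after a rejected match, retry from start + 1."""
    pos = 0
    while True:
        match = pattern.search(text, pos)
        if match is None:
            return
        if is_usable(match):
            yield match
            # Continue after the applied ligature (always advance at least 1).
            pos = max(match.end(), match.start() + 1)
        else:
            # Give overlapping candidates a chance to match.
            pos = match.start() + 1


# Simplified stand-in pattern: group 1 = MEEM+YEH, group 2 = YEH+NOON.
pattern = re.compile("(\u0645\u064a)|(\u064a\u0646)")
text = "\u0645\u064a\u0646"


def usable(match):
    # Pretend MEEM+YEH has no form for this position, as in the report.
    return match.group(2) is not None


print([m.span() for m in find_ligatures(pattern, text, usable)])
```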
jurajmichalak1 pushed a commit to jurajmichalak1/python-arabic-reshaper that referenced this issue (Oct 13, 2022)