Problem with non-overlapping matches of ligature pattern in string when ligature is skipped due to its form mismatch #86

jurajmichalak1 · 2022-10-13T13:42:39Z

When I render the text "مين" (b'\u0645\u064a\u0646') with PIL using its default libraqm layout, where libraqm reshapes the text, I get:

The text was transformed into b'\ufee3\uFC94'. That is the expected behavior: "\u064a\u0646" became the ligature "\uFC94", and the leading b'\u0645' took its initial form b'\ufee3'.
Note: in b'\u0645\u064a\u0646' all letters are in their unshaped form.

But when I use arabic_reshaper(text) I get:

NOTE: the image was created using PIL with the basic layout. That means PIL renders letter by letter from left to right, so arabic_reshaper has to do the reshaping itself, and it fails in this case.
The text I get from the reshaper is b'\ufee3\ufef4\ufee6' instead of the expected b'\ufee3\uFC94'.

The source of the problem: the ligature regex is applied and the first match is '\u0645\u064A' (ARABIC LIGATURE MEEM WITH YEH), which has only an isolated form, so you correctly skip it (continue) here:
https://github.com/mpcabd/python-arabic-reshaper/blob/master/arabic_reshaper/arabic_reshaper.py#L220
But because the subsequent ligature "\u064a\u0646" overlaps with the previous match "\u0645\u064a", finditer does not return it, and therefore it is never applied.
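
The non-overlapping behaviour can be seen in isolation with the standard `re` module. This is only a minimal sketch: `toy_ligatures_re` is a hypothetical two-alternative pattern standing in for the library's real `_ligatures_re`, which is much larger.

```python
import re

# Hypothetical stand-in for the real ligature regex: group 1 is MEEM+YEH,
# group 2 is YEH+NOON. The library's actual pattern is not reproduced here.
toy_ligatures_re = re.compile("(\u0645\u064a)|(\u064a\u0646)")
text = "\u0645\u064a\u0646"

# finditer only yields non-overlapping matches, scanning left to right,
# so once MEEM+YEH is consumed at (0, 2) the YEH is no longer available.
spans = [m.span() for m in toy_ligatures_re.finditer(text)]
print(spans)

# Searching again from index 1 does find the overlapping YEH+NOON candidate.
second = toy_ligatures_re.search(text, 1)
print(second.span())
```

The first print shows a single span covering MEEM+YEH; the second shows that YEH+NOON is only found when the scan is restarted one character after the start of the skipped match.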

I have a simple fix:

diff --git a/arabic_reshaper/arabic_reshaper.py b/arabic_reshaper/arabic_reshaper.py
index 4721a6a..a94cd0f 100644
--- a/arabic_reshaper/arabic_reshaper.py
+++ b/arabic_reshaper/arabic_reshaper.py
@@ -186,14 +186,17 @@ class ArabicReshaper(object):
             if delete_tatweel:
                 text = text.replace(TATWEEL, '')
 
-            for match in re.finditer(self._ligatures_re, text):
+            regex_start = 0
+            matchIt = re.finditer(self._ligatures_re, text)
+            match = next(matchIt, None)
+            while match:
                 group_index = next((
                     i for i, group in enumerate(match.groups()) if group
                 ), -1)
                 forms = self._get_ligature_forms_from_re_group_index(
                     group_index
                 )
-                a, b = match.span()
+                a, b = tuple(i+regex_start for i in match.span())
                 a_form = output[a][FORM]
                 b_form = output[b - 1][FORM]
                 ligature_form = None
@@ -218,9 +221,13 @@ class ArabicReshaper(object):
                     else:
                         ligature_form = MEDIAL
                 if not forms[ligature_form]:
+                    regex_start = a+1
+                    matchIt = re.finditer(self._ligatures_re, text[regex_start:])
+                    match = next(matchIt, None)
                     continue
                 output[a] = (forms[ligature_form], NOT_SUPPORTED)
                 output[a+1:b] = repeat(('', NOT_SUPPORTED), b - 1 - a)
+                match = next(matchIt, None)
 
         result = []
         if not delete_harakat and -1 in positions_harakat:
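
The same rescanning idea can also be sketched without slicing the text, by passing a start position to a compiled pattern's `search` method so that match spans stay absolute. This is an illustrative alternative under stated assumptions, not the code from the patch: `iter_ligature_matches`, `is_usable`, and `toy_re` are hypothetical names, and the pattern is assumed to be compiled and to never match an empty string.

```python
import re

def iter_ligature_matches(pattern, text, is_usable):
    """Yield usable matches, rescanning after a skipped candidate.

    `pattern` is assumed to be a compiled regex that never matches the
    empty string; `is_usable` is a hypothetical callback standing in for
    the "does this ligature have the required form" check.
    """
    pos = 0
    while True:
        match = pattern.search(text, pos)
        if match is None:
            return
        if is_usable(match):
            yield match
            pos = match.end()        # continue after the applied ligature
        else:
            pos = match.start() + 1  # retry from the next character

# Toy pattern: group 1 is MEEM+YEH, group 2 is YEH+NOON.
toy_re = re.compile("(\u0645\u064a)|(\u064a\u0646)")

def usable(match):
    # Pretend MEEM+YEH (group 1) has no form for this context.
    return match.group(1) is None

found = [m.span() for m in iter_ligature_matches(toy_re, "\u0645\u064a\u0646", usable)]
print(found)
```

Because `search(text, pos)` reports spans relative to the full string, no offset arithmetic like `regex_start` is needed in this variant.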

…tches of ligature pattern in string when previous overlapping ligature candidate is skipped due to its form mismatch

jurajmichalak1 mentioned this issue Oct 13, 2022

fix issue #86 - missed ligature due to non-overlapping regex matches … #87

Open