Capital roman letters not converted in some languages #127

tventimi · 2024-09-11T19:40:00Z

I have found that with some languages, when doing a roman-to-script conversion, capital letters are not converted. I have seen this both in the dev and prod versions. Below are a few examples. I have noticed this issue in general for South Asian languages, though not for all of them (Tamil, Hindi, and Malayalam seem fine).

Nepali (Devanagari): Rāshṭriya > Rआस्ह्ट्रिय
Gujarati: Mahākavi > Mઅહાકવિ
Telugu: Shankarabharanam > Sహన్కరభరనం

scossu · 2024-09-11T21:11:19Z

Thanks for reporting. Are all the languages where you encountered the error case insensitive?

tventimi · 2024-09-12T11:57:09Z

I'm not sure (I don't really know much about these languages), but I can check with our South Asian specialist here at Princeton, and let you know.

tventimi · 2024-09-12T14:14:31Z

I confirmed that these languages are not case sensitive. I should mention that these may not be the only affected languages. It seems like it may be a general issue with languages that use data from aksharamukha, but I can investigate in more detail if that would be helpful.

scossu · 2024-09-19T13:16:27Z

I found the cause for this. The technical solution is relatively simple: converting the source string to all-lowercase when transliterating case-insensitive languages. The challenge is that we'll have to mark all scripts we know to be case insensitive. I may want to create a list and ask Jessalyn to circulate it among LC departments.

I have a working solution and have already marked the three languages you mention in this ticket, plus Chinese and Korean (and I can safely add a few such as Arabic and Hebrew). Even though most of these won't be affected because they have no R2S mapping, it might be good for completeness of information (it would show in the /languages API endpoint).
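The lowercasing approach described above can be illustrated with a minimal sketch. It is not the code merged into ScriptShifter: the function name `normalize_source` and the `CASE_INSENSITIVE` set are hypothetical, and the set lists only the three table names that appear in this ticket.

```python
# Hypothetical sketch of the idea, not ScriptShifter's actual implementation.
# Table names are the ones used in this ticket; the set itself is an assumption.
CASE_INSENSITIVE = {"nepali_devanagari", "gujarati", "telugu"}


def normalize_source(src: str, lang: str, t_dir: str = "s2r") -> str:
    """Lowercase Roman input before R2S lookup for case-insensitive scripts."""
    if t_dir == "r2s" and lang in CASE_INSENSITIVE:
        # The target script has no case, so "M" and "m" should hit the same key.
        return src.lower()
    return src


normalized = normalize_source("Mahākavi", "gujarati", t_dir="r2s")
```

With this in place, a capitalized source token reaches the mapping table in the same form as its lowercase equivalent, so it no longer falls through unconverted. Scripts not in the set, and script-to-Roman requests, are left untouched.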

scossu · 2024-09-28T21:32:46Z

@tventimi This should be fixed in #134.

The tests in your initial post now yield (in Python API):

[nav] In [1]: from scriptshifter.trans import transliterate
WARNING:scriptshifter:No SMTP host defined. Feedback form won't be available.

[ins] In [2]: transliterate("Rāshṭriya", "nepali_devanagari", t_dir="r2s")
INFO:scriptshifter.trans:Transliteration is from Roman to nepali_devanagari.
Out[2]: ('रास्ह्ट्रिय', [])

[ins] In [3]: transliterate("Mahākavi", "gujarati", t_dir="r2s")
INFO:scriptshifter.trans:Transliteration is from Roman to gujarati.
Out[3]: ('મહાકવિ', [])

[ins] In [4]: transliterate("Shankarabharanam", "telugu", t_dir="r2s")
INFO:scriptshifter.trans:Transliteration is from Roman to telugu.
Out[4]: ('స్హన్కరభరనం', [])

I can't tell if the results are the expected ones, but the Roman letters are gone.
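A quick way to confirm that no Roman letters survive in a result, without knowing the target language, is to look for code points whose Unicode name contains "LATIN". This is only a rough sanity check under that assumption, and the helper name `leftover_latin` is made up for illustration; it says nothing about whether the transliteration itself is correct.

```python
import unicodedata


def leftover_latin(text: str) -> list[str]:
    """Return the characters in text whose Unicode name marks them as Latin."""
    return [ch for ch in text if "LATIN" in unicodedata.name(ch, "")]


# Outputs quoted in this ticket: before the fix and after the fix.
before = leftover_latin("Rआस्ह्ट्रिय")
after = leftover_latin("रास्ह्ट्रिय")
```

Here `before` contains the stray capital and `after` is empty, which matches what the session above shows.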

tventimi · 2024-09-30T18:10:14Z

Great, thank you! Do you want me to ask someone at Princeton about verifying these results (with the understanding that the other Nepali issue reported in #132 is still being investigated)?

tventimi · 2024-10-14T16:30:37Z

This issue has come up again with the new Uighur (Arabic) table. As an example, try roman-to-script conversion of the following string.

Chaghatay Uyghur tili tătqiqatidin ilmi maqalilăr

scossu · 2024-10-19T02:17:17Z

Fixed in 3510651.

Uighur was not marked as case insensitive.

scossu · 2024-10-19T02:22:38Z

Incidentally, while fixing the Uighur Arabic table I ran into a couple of duplicate keys that might cause problems:

In https://github.com/lcnetdev/scriptshifter/blob/test/scriptshifter/tables/data/uighur_arabic.yml:

\uFEEA is duplicated in lines 130 and 162 with different mappings
\u0647 is duplicated in lines 131 and 164 with different mappings
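Duplicate keys like these are easy to miss because many YAML loaders silently keep only the last value. A rough, line-based check can flag them; the sketch below assumes keys are written as double-quoted strings at the start of a line, as in the entries quoted in this thread, and it ignores nesting, so a key repeated in two different sections of a table would also be reported. The function name and the sample values are invented for illustration.

```python
import re
from collections import defaultdict

# Matches a double-quoted mapping key at the start of a line, e.g. "\uFEEA": ...
KEY_RE = re.compile(r'^\s*"((?:[^"\\]|\\.)*)"\s*:')


def find_duplicate_keys(text: str) -> dict[str, list[int]]:
    """Map each repeated quoted key to the 1-based line numbers where it occurs."""
    seen = defaultdict(list)
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = KEY_RE.match(line)
        if match:
            seen[match.group(1)].append(lineno)
    return {key: lines for key, lines in seen.items() if len(lines) > 1}


# Made-up sample; the values are placeholders, not the real table contents.
sample = r'''"\uFEEA": "x"
"\u0647": "y"
"\uFEEA": "z"
'''
duplicates = find_duplicate_keys(sample)
```

For the sample, `duplicates` maps the repeated key to lines 1 and 3. To scan a real table, pass the file's text to `find_duplicate_keys` and review each hit by hand.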

tventimi · 2024-10-21T14:47:41Z

Thanks, @scossu, for noticing this. I looked back at the LOC Uighur table (https://www.loc.gov/catdir/cpso/romanization/uighur.pdf). In the rows for "ă" and "h" there are entries in which more than one Arabic character may correspond to a Roman character (in such cases, the characters are separated by a comma in the respective cell). The result is that there is some ambiguity. So, \uFEEA could be "ă" in the medial or final position, or "h" in the final position. \u0647 could be "ă" in the medial or final position, or "h" when alone.

Taking position into account can partly resolve the ambiguity, but not entirely. Since I don't know the language, I don't know how best to address this. But if we want a quick fix until we can confirm with a language specialist, I would say we could make the following changes:

line 162 - "%\uFEEA": "h"
line 164 - "%\u0647%": "h"

Does this seem like it would work?

scossu · 2024-10-22T00:59:39Z

Yes, that seems like a simple fix until you get more feedback from a language expert.

scossu · 2024-10-22T01:12:00Z

Fixed in e2f0d2b.

scossu self-assigned this Sep 11, 2024

scossu added the bug and framework labels Sep 11, 2024

scossu mentioned this issue Sep 28, 2024

Fix case-insensitive R2S. #134

Merged

scossu closed this as completed in #134 Sep 30, 2024

scossu reopened this Oct 15, 2024

scossu closed this as completed Oct 19, 2024
