Capital roman letters not converted in some languages #127

Closed
tventimi opened this issue Sep 11, 2024 · 12 comments · Fixed by #134

@tventimi
Contributor

I have found that with some languages, when doing a roman-to-script conversion, capital letters are not converted. I have seen this both in the dev and prod versions. Below are a few examples. I have noticed this issue in general for South Asian languages, though not for all of them (Tamil, Hindi, and Malayalam seem fine).

Nepali (Devanagari): Rāshṭriya > Rआस्ह्ट्रिय
Gujarati: Mahākavi > Mઅહાકવિ
Telugu: Shankarabharanam > Sహన్కరభరనం

@scossu scossu self-assigned this Sep 11, 2024
@scossu scossu added bug Something isn't working framework labels Sep 11, 2024
@scossu
Collaborator

scossu commented Sep 11, 2024

Thanks for reporting. Are all the languages where you encountered the error case insensitive?

@tventimi
Contributor Author

I'm not sure (I don't really know much about these languages), but I can check with our South Asian specialist here at Princeton, and let you know.

@tventimi
Contributor Author

I confirmed that these languages are not case sensitive. I should mention that these languages may not be the only affected ones. It seems like it may be a general issue with languages that use data from aksharamukha, but I can investigate in more detail if that would be helpful.

@scossu
Collaborator

scossu commented Sep 19, 2024

I found the cause for this. The technical solution is relatively simple: converting the source string to all-lowercase when transliterating case-insensitive languages. The challenge is that we'll have to mark all scripts we know to be case insensitive. I may want to create a list and ask Jessalyn to circulate it among LC departments.

I have a working solution and have already marked the three languages you mention in this ticket, plus Chinese and Korean (and I can safely add a few such as Arabic and Hebrew). Even though most of these won't be affected because they have no R2S mapping, it might be good for completeness of information (it would show in the /languages API endpoint).
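
For readers following along, the approach described above can be sketched roughly as follows. This is a toy illustration, not ScriptShifter's actual code: the `Table` class, the `case_sensitive` flag name, the `transliterate_r2s` function, and the five-entry mapping are all hypothetical, and the matching strategy (greedy longest match) is an assumption made only to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Table:
    """Toy stand-in for a transliteration table (hypothetical structure)."""
    r2s_map: dict[str, str]
    case_sensitive: bool = True  # assumed flag name, for illustration only


def transliterate_r2s(src: str, table: Table) -> str:
    """Greedy longest-match Roman-to-script conversion over a toy table."""
    # The fix under discussion: fold case before matching when the
    # target language is known to be case insensitive.
    if not table.case_sensitive:
        src = src.lower()

    keys = sorted(table.r2s_map, key=len, reverse=True)
    out = []
    i = 0
    while i < len(src):
        for key in keys:
            if src.startswith(key, i):
                out.append(table.r2s_map[key])
                i += len(key)
                break
        else:
            # Unmapped characters (e.g. a capital "M") pass through unchanged.
            out.append(src[i])
            i += 1
    return "".join(out)


# Illustrative subset only; the real Gujarati table is far larger.
TOY_MAP = {"ma": "મ", "h\u0101": "હા", "ka": "ક", "vi": "વિ", "a": "અ"}
WORD = "Mah\u0101kavi"

print(transliterate_r2s(WORD, Table(TOY_MAP, case_sensitive=True)))
print(transliterate_r2s(WORD, Table(TOY_MAP, case_sensitive=False)))
```

With the flag left at its default, the capital "M" matches no key and is passed through, which mirrors the symptom reported in this ticket. With `case_sensitive=False`, the input is lowercased first and every character finds a mapping.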

@scossu
Collaborator

scossu commented Sep 28, 2024

@tventimi This should be fixed in #134.

The tests in your initial post now yield (in Python API):

[nav] In [1]: from scriptshifter.trans import transliterate
WARNING:scriptshifter:No SMTP host defined. Feedback form won't be available.

[ins] In [2]: transliterate("Rāshṭriya", "nepali_devanagari", t_dir="r2s")
INFO:scriptshifter.trans:Transliteration is from Roman to nepali_devanagari.
Out[2]: ('रास्ह्ट्रिय', [])

[ins] In [3]: transliterate("Mahākavi", "gujarati", t_dir="r2s")
INFO:scriptshifter.trans:Transliteration is from Roman to gujarati.
Out[3]: ('મહાકવિ', [])

[ins] In [4]: transliterate("Shankarabharanam", "telugu", t_dir="r2s")
INFO:scriptshifter.trans:Transliteration is from Roman to telugu.
Out[4]: ('స్హన్కరభరనం', [])

I can't tell if the results are the expected ones, but the Roman letters are gone.
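
A check like "the Roman letters are gone" can be automated without knowing the target language. The sketch below is a hypothetical helper (`leftover_latin` is not part of ScriptShifter) that uses only the standard library and lists any Latin-script letters remaining in an output string. It says nothing about whether the transliteration itself is correct.

```python
import unicodedata


def leftover_latin(text: str) -> list[str]:
    """Return the Latin-script letters still present in a transliterated string."""
    return [
        ch
        for ch in text
        if ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")
    ]


# Strings copied from this ticket: the original report and the output after the fix.
print(leftover_latin("Rआस्ह्ट्रिय"))  # ['R']
print(leftover_latin("रास्ह्ट्रिय"))   # []
```

An empty list only means no Latin letters survived, so the results still need review by someone who reads the script.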

@tventimi
Contributor Author

Great, thank you! Do you want me to ask someone at Princeton about verifying these results? (With the understanding that the other issue with Nepali reported in #132 is still being investigated).

@tventimi
Contributor Author

This issue has come up again with the new Uighur (Arabic) table. As an example, try roman-to-script conversion of the following string.

Chaghatay Uyghur tili tătqiqatidin ilmi maqalilăr

@scossu scossu reopened this Oct 15, 2024
@scossu
Collaborator

scossu commented Oct 19, 2024

Fixed in 3510651.

Uighur was not marked as case insensitive.

@scossu scossu closed this as completed Oct 19, 2024
@scossu
Collaborator

scossu commented Oct 19, 2024

Incidentally, while fixing the Uighur Arabic table I ran into a couple of duplicate keys that might cause problems:

In https://github.com/lcnetdev/scriptshifter/blob/test/scriptshifter/tables/data/uighur_arabic.yml:

  • \uFEEA is duplicated in lines 130 and 162 with different mappings
  • \u0647 is duplicated in lines 131 and 164 with different mappings
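
Duplicate keys like these are easy to miss because many YAML loaders keep only the last value for a repeated key without warning. A rough, standard-library-only way to spot them is to scan the file text line by line. The sketch below assumes table entries are written one per line as `"key": "value"` with double-quoted keys; `find_duplicate_keys` and the sample data are hypothetical and are not taken from the actual `uighur_arabic.yml` file.

```python
import re
from collections import defaultdict

# Matches a double-quoted mapping key at the start of a line, e.g.  "\uFEEA": "h"
KEY_RE = re.compile(r'^\s*"((?:[^"\\]|\\.)*)"\s*:')


def find_duplicate_keys(yaml_text: str) -> dict[str, list[int]]:
    """Map each repeated quoted key (as raw text) to the line numbers where it occurs."""
    seen = defaultdict(list)
    for lineno, line in enumerate(yaml_text.splitlines(), start=1):
        match = KEY_RE.match(line)
        if match:
            seen[match.group(1)].append(lineno)
    return {key: lines for key, lines in seen.items() if len(lines) > 1}


# Made-up sample that imitates the kind of conflict described above.
sample = "\n".join([
    r'map:',
    r'  "\uFEEA": "ă"',
    r'  "\u0647": "ă"',
    r'  "\u0628": "b"',
    r'  "\uFEEA": "h"',
    r'  "\u0647": "h"',
])

print(find_duplicate_keys(sample))
```

The scan ignores nesting, so a key that legitimately appears in two different sections of a file would also be flagged, and it compares keys as raw text, so an escaped key and the same character written literally are treated as different.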

@tventimi
Contributor Author

Thanks, @scossu, for noticing this. I looked back at the LOC Uighur table (https://www.loc.gov/catdir/cpso/romanization/uighur.pdf). In the rows for "ă" and "h" there are entries in which more than one Arabic character may correspond to a Roman character (in such cases, the characters are separated by a comma in the respective cell). The result is that there is some ambiguity. So, \uFEEA could be "ă" in the medial or final position, or "h" in the final position. \u0647 could be "ă" in the medial or final position, or "h" when alone.

Taking position into account can partly resolve the ambiguity, but not entirely. Since I don't know the language, I don't know how best to address this. But if we want a quick fix until we can confirm with a language specialist, I would say we could make the following changes:

line 162 - "%\uFEEA": "h"
line 164 - "%\u0647%": "h"

Does this seem like it would work?

@scossu
Collaborator

scossu commented Oct 22, 2024

Yes, that seems like a simple fix until you get more feedback from a language expert.

@scossu
Collaborator

scossu commented Oct 22, 2024

Fixed in e2f0d2b.
