Capital roman letters not converted in some languages #127
Comments
Thanks for reporting. Are all the languages where you encountered the error case insensitive? |
I'm not sure (I don't really know much about these languages), but I can check with our South Asian specialist here at Princeton, and let you know. |
I confirmed that these languages are not case sensitive. I should mention that these languages may not be the only affected ones. It seems like it may be a general issue with languages that use data from aksharamukha, but I can investigate in more detail if that would be helpful. |
I found the cause for this. The technical solution is relatively simple: converting the source string to all-lowercase when transliterating case-insensitive languages. The challenge is that we'll have to mark all scripts we know to be case insensitive. I may want to create a list and ask Jessalyn to circulate it among LC departments. I have a working solution and have already marked the three languages you mention in this ticket, plus Chinese and Korean (and I can safely add a few such as Arabic and Hebrew). Even though most of these won't be affected because they have no R2S mapping, it might be good for completeness of information (it would show in the |
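For readers following along, the lowercasing approach described in this comment could look roughly like the sketch below. It is not the ScriptShifter code: `CASE_INSENSITIVE_LANGS`, `normalize_source`, and `toy_r2s` are hypothetical names, and the mapping table is a placeholder rather than real transliteration data.

```python
# Hypothetical sketch: none of these names come from ScriptShifter.
# Languages flagged as having no upper/lower case distinction.
CASE_INSENSITIVE_LANGS = {"nepali", "gujarati", "telugu"}


def normalize_source(src: str, lang: str) -> str:
    """Lowercase the Roman source when the language is flagged case insensitive."""
    if lang in CASE_INSENSITIVE_LANGS:
        return src.lower()
    return src


def toy_r2s(src: str, lang: str, table: dict) -> str:
    """Greedy longest-match replacement using a lowercase-only table."""
    src = normalize_source(src, lang)
    keys = sorted(table, key=len, reverse=True)  # try longer keys first
    out = []
    i = 0
    while i < len(src):
        for key in keys:
            if src.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No mapping for this character: pass it through unchanged.
            out.append(src[i])
            i += 1
    return "".join(out)


# Placeholder table, not real transliteration data.
table = {"m": "<m>", "a": "<a>"}
print(toy_r2s("Ma", "gujarati", table))  # capital M is lowercased, then mapped
print(toy_r2s("Ma", "unflagged", table))  # capital M passes through untouched
```

The second call reproduces the symptom in this ticket: with a lowercase-only table and no case-insensitive flag, the capital letter finds no mapping and is left in the output as a Roman character.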
@tventimi This should be fixed in #134. The tests in your initial post now yield (in Python API):
I can't tell if the results are the expected ones, but the Roman letters are gone. |
Great, thank you! Do you want me to ask someone at Princeton about verifying these results? (With the understanding that the other issue with Nepali reported in #132 is still being investigated). |
This issue has come up again with the new Uighur (Arabic) table. As an example, try roman-to-script conversion of the following string. Chaghatay Uyghur tili tătqiqatidin ilmi maqalilăr |
Fixed in 3510651. Uighur was not marked as case insensitive. |
Incidentally, while fixing the Uighur Arabic table I ran into a couple of duplicate keys that might cause problems: In https://github.com/lcnetdev/scriptshifter/blob/test/scriptshifter/tables/data/uighur_arabic.yml:
|
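A rough way to spot duplicate keys like the ones mentioned above is a line-based scan of a mapping block. This is only a sketch under assumptions: it expects one double-quoted key per line, treats the text it is given as a single mapping (run it on one section of a table at a time), and the function name `find_duplicate_keys` is made up for illustration. It uses only the standard library and does not parse YAML properly.

```python
import re
from collections import defaultdict

# Matches lines of the form:   "key": value   (double-quoted keys only).
KEY_RE = re.compile(r'^(\s*)"((?:[^"\\]|\\.)*)"\s*:')


def find_duplicate_keys(text: str) -> dict:
    """Return {key: [line numbers]} for quoted keys repeated at the same indentation."""
    seen = defaultdict(list)
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = KEY_RE.match(line)
        if match:
            indent, key = len(match.group(1)), match.group(2)
            seen[(indent, key)].append(lineno)
    return {key: lines for (_, key), lines in seen.items() if len(lines) > 1}


# Made-up sample block; the keys and values are placeholders.
sample = "\n".join([
    r'  "k1": "a"',
    r'  "k2": "b"',
    r'  "k1": "c"',
])
print(find_duplicate_keys(sample))
```

Keys are compared as written in the file, so an escaped key such as `"%\uFEEA"` would only match another occurrence spelled the same way.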
Thanks, @scossu, for noticing this. I looked back at the LOC Uighur table (https://www.loc.gov/catdir/cpso/romanization/uighur.pdf). In the rows for "ă" and "h" there are entries in which more than one Arabic character may correspond to a Roman character (in such cases, the characters are separated by a comma in the respective cell). The result is that there is some ambiguity. So, \uFEEA could be "ă" in the medial or final position, or "h" in the final position. \u0647 could be "ă" in the medial or final position, or "h" when alone. Taking position into account can partly resolve the ambiguity, but not entirely. Since I don't know the language, I don't know how best to address this. But if we want a quick fix until we can confirm with a language specialist, I would say we could make the following changes: line 162 - "%\uFEEA": "h" Does this seem like it would work? |
Yes, that seems like a simple fix until you get more feedback from a language expert. |
Fixed in e2f0d2b. |
I have found that with some languages, when doing a roman-to-script conversion, capital letters are not converted. I have seen this both in the dev and prod versions. Below are a few examples. I have noticed this issue in general for South Asian languages, though not for all of them (Tamil, Hindi, and Malayalam seem fine).
Nepali (Devanagari): Rāshṭriya > Rआस्ह्ट्रिय
Gujarati: Mahākavi > Mઅહાકવિ
Telugu: Shankarabharanam > Sహన్కరభరనం