diff --git a/ext/oriya.html b/ext/oriya.html
new file mode 100644
index 0000000..994ee01
--- /dev/null
+++ b/ext/oriya.html
@@ -0,0 +1,1266 @@
+
+{{ombox
+| text = ଏହା ରୋମାନାଇଜେସନ (ରୋମାନ ଲିପିରେ ଓଡ଼ିଆ ଲେଖା) ପାଇଁ । କାମକରୁଥିବା ମାନକ: [[ଆନ୍ତର୍ଜାତୀୟ ଉଚ୍ଚାରଣଗତ ଲିପି]] (ଆଇପିଏ - IPA), [https://www.loc.gov/catdir/cpso/romanization/oriya.pdf ALA/LC ରୋମାନ] ଓ ALA/LC ଫାଇଲ ନାମ (ରୋମାନ ଓ [https://r12a.github.io/app-conversion/ ହେକ୍ସ NCR]) ଏବଂ [[:wikt:ଉଇକିଅଭିଧାନ:ରୋମାନ ଟ୍ରାନ୍ସଲିଟରେସନ|ଭାଷାକୋଷ]] । ବାମ ବାକ୍ସରେ ୟୁନିକୋଡ଼ ଫଣ୍ଟରେ ଟାଇପ କରା ଲେଖା କପି କରନ୍ତୁ ବା ଟାଇପ କରନ୍ତୁ । ଉପରୁ ମାନକ ଅପସନ ବାଛନ୍ତୁ । ମୋବାଇଲ ବ୍ୟବହାର କରୁଥିଲେ ''ଡେସ୍କଟପ ଦେଖଣା'' ବାଛନ୍ତୁ ।
+
+[[ସହଯୋଗ:ଓଡ଼ିଆ/ରୋମାନାଇଜେସନ/ନମୁନା_ଲେଖା|ନମୁନା ଲେଖା]] • See this page in [[ସହଯୋଗ:ଓଡ଼ିଆ/ରୋମାନାଇଜେସନ/en|English]]
+
+}}
+{{clear}}
+
+{{clear}}
+
+== ସ୍ରୋତ ==
+* ଏହି ଲିପ୍ୟାନ୍ତରଣ କନଭର୍ଟର [[ସହଯୋଗ:IPA/Odia|ଓଡ଼ିଆ ଆଇପିଏ]], [[ଏଏଲଏ-ଏଲସି ରୋମାନାଇଜେସନ]] ଏବଂ [[wikt:or:ଉଇକିଅଭିଧାନ:ରୋମାନ ଟ୍ରାନ୍ସଲିଟରେସନ|ଉଇକିଅଭିଧାନ ରୋମାନ ଟ୍ରାନ୍ସଲିଟରେସନ]] ନିୟମ ଆଧାରରେ ତିଆରି ।
+
+== ପରିଚାଳନା ==
+* ତିଆରି ଓ ପରିଚାଳନା: [[ବ୍ୟବହାରକାରୀ:Psubhashish]]
+* ଆଇପିଏ ଓ ରୋମାନ ସହଯୋଗ: [[ବ୍ୟବହାରକାରୀ:Prateek Pattanaik]] ଓ [[User:Bikash Ojha]]
+* ''ସୋର୍ସ କୋଡ଼ [[ବ୍ୟବହାରକାରୀ:Jnanaranjan sahu]], [[ବ୍ୟବହାରକାରୀ:ଶିତିକଣ୍ଠ ଦାଶ]], [[ବ୍ୟବହାରକାରୀ:TWO^0|ମନୋଜ ସାହୁକାର]] ଏବଂ [[ବ୍ୟବହାରକାରୀ:Psubhashish]]ଙ୍କ ତିଆରି [[ଉଇକିପିଡ଼ିଆ:କନଭର୍ଟର|କନଭର୍ଟର]] ଆଧାରରେ ତିଆରି''
+
+== ସୋର୍ସ କୋଡ଼ ଓ ଲାଇସେନ୍ସ ==
+* [https://github.com/OdiaWikimedia/Converter/blob/master/IPA-Romanization/ipa-roman ସୋର୍ସ କୋଡ଼] (ଶେଷ ଅପଡେଟ ୨୪ ମଇ ୨୦୨୩)
+* ଏହି ସଫ୍ଟୱେର [https://www.gnu.org/licenses/gpl-3.0.en.html GNU General Public License v3.0]ରେ ଉପଲବ୍ଧ
+
+== ଅନ୍ୟାନ୍ୟ ଟୁଲ ==
+* [[ଉଇକିପିଡ଼ିଆ:ଟୁଲ/ଏନକୋଡ଼ିଙ୍ଗ କନଭର୍ଟର|ଏନକୋଡ଼ିଙ୍ଗ କନଭର୍ଟର]]: ଶ୍ରୀଲିପି, ଆକୃତି, ସମ୍ବାଦ ଆଦି ଅଣ-ମାନକ ଏନକୋଡ଼ିଂରୁ ଇଉନିକୋଡ଼ ଓଡ଼ିଆରେ ରୂପାନ୍ତର
+* [[ଉଇକିପିଡ଼ିଆ:ଟୁଲ/ଅନ୍ୟ ଲିପିରୁ ଓଡ଼ିଆକୁ ରୂପାନ୍ତର|ଅନ୍ୟ ଲିପିରୁ ଓଡ଼ିଆକୁ ରୂପାନ୍ତର]]: ଅହମିୟା/ବଙ୍ଗଳା, ଦେବନାଗରୀ, ଗୁଜରାଟୀ, ରୋମାନ ଓ ଉର୍ଦ୍ଦୁରୁ ଓଡ଼ିଆକୁ ରୂପାନ୍ତର (''ପରୀକ୍ଷାମୂଳକ ତେଣୁ କିଛି ଭୁଲ ରହିପାରେ'')
+* [https://github.com/OdiaWikimedia/Wordlist ଶବ୍ଦତାଲିକା]: ଓଡ଼ିଆ ଉଇକିପିଡ଼ିଆ, ଉଇକିପାଠାଗାର ଓ ଉଇକିଅଭିଧାନରେ ଥିବା ଶବ୍ଦଗୁଡ଼ିକର ଏକ ଲମ୍ବା ତାଲିକା । ଏସବୁ ଟେକ୍ସଟ-ଟୁ-ସ୍ପିଚ, ସ୍ପିଚ-ଟୁ-ଟେକ୍ସଟ ଓ ବାକି ସେହିଭଳି ସଫ୍ଟଓଏର ତିଆରି କାମରେ ଲାଗିବ
+* [[/ପାନଗ୍ରାମ|ପାନଗ୍ରାମ]]: ବର୍ଣ୍ଣମାଳାର ସବୁ ମୌଳିକ ଅକ୍ଷର ଥିବା ଏକ ବାକ୍ୟ
+
diff --git a/scriptshifter/tables/__init__.py b/scriptshifter/tables/__init__.py
index 7df1958..a2098c9 100644
--- a/scriptshifter/tables/__init__.py
+++ b/scriptshifter/tables/__init__.py
@@ -16,7 +16,7 @@
from yaml import Loader
from scriptshifter import DB_PATH
-from scriptshifter.exceptions import BREAK, ConfigError
+from scriptshifter.exceptions import BREAK, ApiError, ConfigError
__doc__ = """
@@ -209,6 +209,9 @@ def populate_table(conn, tid, tname):
     if "roman_to_script" in data:
         flags |= FEAT_R2S
+    if not data.get("general", {}).get("case_sensitive", True):
+        flags |= FEAT_CASEI
+
     conn.execute(
             "UPDATE tbl_language SET features = ? WHERE id = ?",
             (flags, tid))
@@ -555,6 +558,9 @@ def get_lang_general(conn, lang):
             FROM tbl_language WHERE name = ?""", (lang,))
     lang_data = lang_q.fetchone()
+    if not lang_data:
+        raise ApiError(f"No language data found for {lang}", 404)
+
     return {
         "id": lang_data[0],
         "data": {
diff --git a/scriptshifter/tables/data/_chinese_base.yml b/scriptshifter/tables/data/_chinese_base.yml
index 0be6e01..4c51521 100644
--- a/scriptshifter/tables/data/_chinese_base.yml
+++ b/scriptshifter/tables/data/_chinese_base.yml
@@ -1,13 +1,13 @@
# This file is derived and kept in sync with Princeton's OCLC Connexion Pinyin
# converter (https://github.com/pulibrary/oclcpinyin/).
-general: # Section names and other keywords are all snake_cased.
+general: # Section names and other keywords are all snake_cased.
name: Chinese base (from Princeton)
parents:
- _ignore_base
script_to_roman:
- map: # Mapping section.
+ map: # Mapping section.
"\u5DF4\u57FA\u65AF\u5766\u4F0A\u65AF\u862D\u5171\u548C\u570B": "Bajisitan Yisilan Gongheguo "
"\u5DF4\u57FA\u65AF\u5766\u4F0A\u65AF\u5170\u5171\u548C\u56FD": "Bajisitan Yisilan Gongheguo "
"\u5DF4\u97F3\u90ED\u695E\u8499\u53E4\u81EA\u6CBB\u5DDE": "Bayinguoleng Menggu Zizhizhou "
diff --git a/scriptshifter/tables/data/arabic.yml b/scriptshifter/tables/data/arabic.yml
index 8679c87..5e76b9b 100644
--- a/scriptshifter/tables/data/arabic.yml
+++ b/scriptshifter/tables/data/arabic.yml
@@ -1,9 +1,11 @@
# Arabic S2R using the 3rd-party ArabicTransliterator library:
# https://github.com/MTG/ArabicTransliterator
+---
 general:
   name: Arabic
   description: Arabic S2R using a 3rd party library.
+  case_sensitive: false
 script_to_roman:
   hooks:
diff --git a/scriptshifter/tables/data/chinese.yml b/scriptshifter/tables/data/chinese.yml
index 15ad34f..a846713 100644
--- a/scriptshifter/tables/data/chinese.yml
+++ b/scriptshifter/tables/data/chinese.yml
@@ -2,10 +2,13 @@
#
# All other Chinese mappings are kept in _chinese_base.yml. This mapping only
# adds an overlay for parsing numerals and Scriptshifter-specific features.
+
+---
 general:
   name: Chinese
   parents:
     - _chinese_base
+  case_sensitive: false
 options:
   - id: marc_field
diff --git a/scriptshifter/tables/data/gujarati.yml b/scriptshifter/tables/data/gujarati.yml
index e72278e..19ea6cc 100644
--- a/scriptshifter/tables/data/gujarati.yml
+++ b/scriptshifter/tables/data/gujarati.yml
@@ -1,5 +1,7 @@
+---
 general:
   name: Gujarati
+  case_sensitive: false
 script_to_roman:
   hooks:
diff --git a/scriptshifter/tables/data/hebrew.yml b/scriptshifter/tables/data/hebrew.yml
index d30aac4..0a45bf7 100644
--- a/scriptshifter/tables/data/hebrew.yml
+++ b/scriptshifter/tables/data/hebrew.yml
@@ -1,6 +1,8 @@
+---
 general:
   name: Hebrew
   description: Hebrew S2R.
+  case_sensitive: false
 options:
   - id: genre
@@ -19,4 +21,3 @@ script_to_roman:
post_config:
-
- hebrew.dicta_api.s2r_post_config
-
diff --git a/scriptshifter/tables/data/kannada.yml b/scriptshifter/tables/data/kannada.yml
index 4b60a29..6a956a1 100644
--- a/scriptshifter/tables/data/kannada.yml
+++ b/scriptshifter/tables/data/kannada.yml
@@ -1,5 +1,7 @@
+---
 general:
   name: Kannada
+  case_sensitive: false
 script_to_roman:
   hooks:
diff --git a/scriptshifter/tables/data/malayalam.yml b/scriptshifter/tables/data/malayalam.yml
index ae3dad5..7d38d6a 100644
--- a/scriptshifter/tables/data/malayalam.yml
+++ b/scriptshifter/tables/data/malayalam.yml
@@ -1,5 +1,7 @@
+---
 general:
   name: Malayalam
+  case_sensitive: false
 script_to_roman:
   hooks:
diff --git a/scriptshifter/tables/data/marathi_devanagari.yml b/scriptshifter/tables/data/marathi_devanagari.yml
index 5e99971..8cc9d34 100644
--- a/scriptshifter/tables/data/marathi_devanagari.yml
+++ b/scriptshifter/tables/data/marathi_devanagari.yml
@@ -1,5 +1,7 @@
+---
 general:
   name: Marathi (Devanagari)
+  case_sensitive: false
 script_to_roman:
   hooks:
diff --git a/scriptshifter/tables/data/nepali_devanagari.yml b/scriptshifter/tables/data/nepali_devanagari.yml
index 2aa20a2..67339fb 100644
--- a/scriptshifter/tables/data/nepali_devanagari.yml
+++ b/scriptshifter/tables/data/nepali_devanagari.yml
@@ -1,5 +1,7 @@
+---
 general:
   name: Nepali (Devanagari)
+  case_sensitive: false
 script_to_roman:
   hooks:
diff --git a/scriptshifter/tables/data/oriya.yml b/scriptshifter/tables/data/oriya.yml
index a3a911e..4ebaef0 100644
--- a/scriptshifter/tables/data/oriya.yml
+++ b/scriptshifter/tables/data/oriya.yml
@@ -1,5 +1,7 @@
+---
 general:
   name: Oriya
+  case_sensitive: false
 script_to_roman:
   hooks:
diff --git a/scriptshifter/tables/data/pali.yml b/scriptshifter/tables/data/pali.yml
index 41462f4..ff23125 100644
--- a/scriptshifter/tables/data/pali.yml
+++ b/scriptshifter/tables/data/pali.yml
@@ -1,5 +1,7 @@
+---
 general:
   name: Pali
+  case_sensitive: false
 script_to_roman:
   hooks:
diff --git a/scriptshifter/tables/data/sanskrit_devanagari.yml b/scriptshifter/tables/data/sanskrit_devanagari.yml
index 8bd162f..3d5f9af 100644
--- a/scriptshifter/tables/data/sanskrit_devanagari.yml
+++ b/scriptshifter/tables/data/sanskrit_devanagari.yml
@@ -1,5 +1,7 @@
+---
 general:
   name: Sanskrit (Devanagari)
+  case_sensitive: false
 script_to_roman:
   hooks:
diff --git a/scriptshifter/tables/data/sinhalese.yml b/scriptshifter/tables/data/sinhalese.yml
index 58ae0db..1f639c4 100644
--- a/scriptshifter/tables/data/sinhalese.yml
+++ b/scriptshifter/tables/data/sinhalese.yml
@@ -1,5 +1,7 @@
+---
 general:
   name: Sinhalese
+  case_sensitive: false
 script_to_roman:
   hooks:
diff --git a/scriptshifter/tables/data/telugu.yml b/scriptshifter/tables/data/telugu.yml
index a1267b8..2618d40 100644
--- a/scriptshifter/tables/data/telugu.yml
+++ b/scriptshifter/tables/data/telugu.yml
@@ -1,5 +1,7 @@
+---
 general:
   name: Telugu
+  case_sensitive: false
 script_to_roman:
   hooks:
diff --git a/scriptshifter/tables/data/thai.yml b/scriptshifter/tables/data/thai.yml
index e0d4bb4..10b80f8 100644
--- a/scriptshifter/tables/data/thai.yml
+++ b/scriptshifter/tables/data/thai.yml
@@ -1,5 +1,7 @@
+---
 general:
   name: Thai
+  case_sensitive: false
 options:
   - id: ThaiTranscription
diff --git a/scriptshifter/tables/data/yiddish.yml b/scriptshifter/tables/data/yiddish.yml
index 9539695..2c961c6 100644
--- a/scriptshifter/tables/data/yiddish.yml
+++ b/scriptshifter/tables/data/yiddish.yml
@@ -1,5 +1,7 @@
+---
 general:
   name: Yiddish
+  case_sensitive: false
 options:
   - id: loshn_koydesh
diff --git a/scriptshifter/trans.py b/scriptshifter/trans.py
index 9b1e552..7d68601 100644
--- a/scriptshifter/trans.py
+++ b/scriptshifter/trans.py
@@ -5,7 +5,7 @@
from scriptshifter.exceptions import BREAK, CONT
from scriptshifter.tables import (
- BOW, EOW, WORD_BOUNDARY, FEAT_R2S, FEAT_S2R, HOOK_PKG_PATH,
+ BOW, EOW, WORD_BOUNDARY, FEAT_CASEI, FEAT_R2S, FEAT_S2R, HOOK_PKG_PATH,
get_connection, get_lang_dcap, get_lang_general, get_lang_hooks,
get_lang_ignore, get_lang_map, get_lang_normalize)
@@ -111,6 +111,10 @@ def transliterate(src, lang, t_dir="s2r", capitalize=False, options={}):
                 f"Roman-to-script not yet supported for {lang}."
             )
+        # Normalize case before post_config and rule-based normalization.
+        if not ctx.general["case_sensitive"]:
+            ctx._src = ctx.src.lower()
+
         # This hook may take over the whole transliteration process or delegate
         # it to some external process, and return the output string directly.
         if _run_hook("post_config", ctx) == BREAK:
@@ -309,6 +313,9 @@ def transliterate(src, lang, t_dir="s2r", capitalize=False, options={}):
 def _normalize_src(ctx, norm_rules):
+    """
+    Normalize source text according to rules.
+    """
     for nk, nv in norm_rules.items():
         ctx._src = ctx.src.replace(nk, nv)
     logger.debug(f"Normalized source: {ctx.src}")
diff --git a/tests/data/script_samples/unclassified.csv b/tests/data/script_samples/unclassified.csv
index 8b6d3ca..6cd0941 100644
--- a/tests/data/script_samples/unclassified.csv
+++ b/tests/data/script_samples/unclassified.csv
@@ -2,7 +2,7 @@ armenian,Մեդիա իրավունք : (ուսումնական ձեռնարկ) ,
armenian,Ա Բ Գ Դ Ե Զ Է Ը Թ Ժ Ի Լ Խ Ծ Կ Հ Ձ Ղ Ճ Մ Յ Ն Շ Ո Չ Պ Ջ Ռ Ս Վ Տ Ր Ւ Փ Ք Օ Ֆ ՙ ՚ ՛ ՜ ՝ ՞ ՟ ա բ գ դ ե զ է ը թ ժ ի լ խ ծ կ ձ ղ ճ մ յ ն շ ո չ պ ջ ռ ս վ տ ր ց ւ փ ք օ ֆ և ։ ֊ .,A B G D E Y Z Ē Ě Tʻ Zh I L Kh Ts K H Dz Gg Ch M Y N Sh O Chʻ P J Ṛ S V T R Tsʻ W U Pʻ Kʻ Ew Ev Ō Fa b g d e y z ē ě tʻ zh i l kh ts k h dz gh ch m y n sh o chʻ p j ṛ s v t r tsʻ w u pʻ kʻ ew ev ō f,,
georgian,ადგილობრივი თვითმმართველობის კოდექსი : საქართველოს ორგანული კანონი; 2018 წლის 7 სექტებრის მდგომარეობით.,Adgilobrivi tʻvitʻmmartʻvelobis kodekʻsi : Sakʻartʻvelos organuli kanoni; 2018 clis 7 sekʻtembris mdgomareobitʻ.,,
hindi,परमहंस की पीड़ा : महान क्रांतिकारी रामप्रसाद बिस्मिल के जीवन पर आधारित उपन्यास,Paramahaṃsa kī pīṛā : mahāna krāntikārī Rāmaprasāda Bismila ke jīvana para ādhārita upanyāsa,,
-mongolian_mongol_bichig,ᠳᠠᠶᠢᠴᠢᠩ ᠭᠦᠷᠦᠨ ᠦ ᠦᠶ ᠡ ᠶᠢᠨ ᠥᠯᠠᠨ ᠺᠡᠯᠡᠨ ᠦ ᠦᠰᠦᠭ ᠬᠠᠪᠰᠸᠷᠸᠭᠰᠠᠨ ᠰᠸᠷᠪᠸᠯᠵᠢ ᠪᠢᠴᠢᠭ ᠦᠨ ᠰᠸᠳᠸᠯᠸᠯ,Dayicing gu̇ru̇n-u̇ u̇y-e-yin olan kelen-u̇ u̇su̇g qabsuruġsan surbulji bicig-u̇n sudulul,,
+mongolian_mongol_bichig,ᠳᠠᠶᠢᠴᠢᠩ ᠭᠦᠷᠦᠨ ᠦ ᠦᠶᠡ ᠶᠢᠨ ᠣᠯᠠᠨ ᠬᠡᠯᠡᠨ ᠦ ᠦᠰᠦᠭ ᠬᠠᠪᠰᠤᠷᠤᠭᠰᠠᠨ ᠰᠤᠷᠪᠤᠯᠵᠢ ᠪᠢᠴᠢᠭ ᠦᠨ ᠰᠤᠳᠤᠯᠤᠯ,dayicing gu̇ru̇n-u̇ u̇y-e-yin olan kelen-u̇ u̇su̇g qabsuruġsan surbulji bicig-u̇n sudulul,,
,আগবাৰীত ফুলিলে সোনে মোৰ চম্পা,Āgabārīta phulile soṇe mora campā,,
,Milli dövlətçilik hərəkatının yüksəlişi və Xalq Cümhuriyyəti dövründə Azərbaycançılıq ideyası,Milli dövlätçilik häräkatının yüksälişi vä Xalq Cümhuriyyäti dövründä azärbaycançılıq ideyası,,
,مجنون مجنون دوشون منى شعر توپلوسو ,Macnūn macnūn düşün manī : şiʻr toplūsū,,