diff --git a/ext/oriya.html b/ext/oriya.html new file mode 100644 index 0000000..994ee01 --- /dev/null +++ b/ext/oriya.html @@ -0,0 +1,1266 @@ + +{{ombox +| text = ଏହା ରୋମାନାଇଜେସନ (ରୋମାନ ଲିପିରେ ଓଡ଼ିଆ ଲେଖା) ପାଇଁ । କାମକରୁଥିବା ମାନକ: [[ଆନ୍ତର୍ଜାତୀୟ ଉଚ୍ଚାରଣଗତ ଲିପି]] (ଆଇପିଏ - IPA), [https://www.loc.gov/catdir/cpso/romanization/oriya.pdf ALA/LC ରୋମାନ] ଓ ALA/LC ଫାଇଲ ନାମ (ରୋମାନ ଓ [https://r12a.github.io/app-conversion/ ହେକ୍ସ NCR]) ଏବଂ [[:wikt:ଉଇକିଅଭିଧାନ:ରୋମାନ ଟ୍ରାନ୍ସଲିଟରେସନ|ଭାଷାକୋଷ]] । ବାମ ବାକ୍ସରେ ୟୁନିକୋଡ଼ ଫଣ୍ଟରେ ଟାଇପ କରା ଲେଖା କପି କରନ୍ତୁ ବା ଟାଇପ କରନ୍ତୁ । ଉପରୁ ମାନକ ଅପସନ ବାଛନ୍ତୁ । ମୋବାଇଲ ବ୍ୟବହାର କରୁଥିଲେ ''ଡେସ୍କଟପ ଦେଖଣା'' ବାଛନ୍ତୁ । +
+[[ସହଯୋଗ:ଓଡ଼ିଆ/ରୋମାନାଇଜେସନ/ନମୁନା_ଲେଖା|ନମୁନା ଲେଖା]] • See this page in [[ସହଯୋଗ:ଓଡ଼ିଆ/ରୋମାନାଇଜେସନ/en|English]] +
+}} +{{clear}} + +
+ +{{clear}} + +== ସ୍ରୋତ == +* ଏହି ଲିପ୍ୟାନ୍ତରଣ କନଭର୍ଟର [[ସହଯୋଗ:IPA/Odia|ଓଡ଼ିଆ ଆଇପିଏ]], [[ଏଏଲଏ-ଏଲସି ରୋମାନାଇଜେସନ]] ଏବଂ [[wikt:or:ଉଇକିଅଭିଧାନ:ରୋମାନ ଟ୍ରାନ୍ସଲିଟରେସନ|ଉଇକିଅଭିଧାନ ରୋମାନ ଟ୍ରାନ୍ସଲିଟରେସନ]] ନିୟମ ଆଧାରରେ ତିଆରି । + +== ପରିଚାଳନା == +* ତିଆରି ଓ ପରିଚାଳନା: [[ବ୍ୟବହାରକାରୀ:Psubhashish]] +* ଆଇପିଏ ଓ ରୋମାନ ସହଯୋଗ: [[ବ୍ୟବହାରକାରୀ:Prateek Pattanaik]] ଓ [[User:Bikash Ojha]] +* ''ସୋର୍ସ କୋଡ଼ [[ବ୍ୟବହାରକାରୀ:Jnanaranjan sahu]], [[ବ୍ୟବହାରକାରୀ:ଶିତିକଣ୍ଠ ଦାଶ]], [[ବ୍ୟବହାରକାରୀ:TWO^0|ମନୋଜ ସାହୁକାର]] ଏବଂ [[ବ୍ୟବହାରକାରୀ:Psubhashish]]ଙ୍କ ତିଆରି [[ଉଇକିପିଡ଼ିଆ:କନଭର୍ଟର|କନଭର୍ଟର]] ଆଧାରରେ ତିଆରି'' + +== ସୋର୍ସ କୋଡ଼ ଓ ଲାଇସେନ୍ସ== +* [https://github.com/OdiaWikimedia/Converter/blob/master/IPA-Romanization/ipa-roman ସୋର୍ସ କୋଡ଼] (ଶେଷ ଅପଡେଟ ୨୪ ମଇ ୨୦୨୩) +* ଏହି ସଫ୍ଟୱେର [https://www.gnu.org/licenses/gpl-3.0.en.html GNU General Public License v3.0]ରେ ଉପଲବ୍ଧ + +== ଅନ୍ୟାନ୍ୟ ଟୁଲ == +* [[ଉଇକିପିଡ଼ିଆ:ଟୁଲ/ଏନକୋଡ଼ିଙ୍ଗ କନଭର୍ଟର|ଏନକୋଡ଼ିଙ୍ଗ କନଭର୍ଟର]]: ଶ୍ରୀଲିପି, ଆକୃତି, ସମ୍ବାଦ ଆଦି ଅଣ-ମାନକ ଏନକୋଡ଼ିଂରୁ ଇଉନିକୋଡ଼ ଓଡ଼ିଆରେ ରୂପାନ୍ତର +* [[ଉଇକିପିଡ଼ିଆ:ଟୁଲ/ଅନ୍ୟ ଲିପିରୁ ଓଡ଼ିଆକୁ ରୂପାନ୍ତର|ଅନ୍ୟ ଲିପିରୁ ଓଡ଼ିଆକୁ ରୂପାନ୍ତର]]: ଅହମିୟା/ବଙ୍ଗଳା, ଦେବନାଗରୀ, ଗୁଜରାଟୀ, ରୋମାନ ଓ ଉର୍ଦ୍ଦୁରୁ ଓଡ଼ିଆକୁ ରୂପାନ୍ତର (''ପରୀକ୍ଷାମୂଳକ ତେଣୁ କିଛି ଭୁଲ ରହିପାରେ'') +* [https://github.com/OdiaWikimedia/Wordlist ଶବ୍ଦତାଲିକା]: ଓଡ଼ିଆ ଉଇକିପିଡ଼ିଆ, ଉଇକିପାଠାଗାର ଓ ଉଇକିଅଭିଧାନରେ ଥିବା ଶବ୍ଦଗୁଡ଼ିକର ଏକ ଲମ୍ବା ତାଲିକା । ଏସବୁ ଟେକ୍ଟଟ-ଟୁ-ସ୍ପିଚ, ସ୍ପିଚ-ଟୁ-ଟେକ୍ସଟ ଓ ବାକି ସେହିଭଳି ସଫ୍ଟଓଏର ତିଆରି କାମରେ ଲାଗିବ +* [[/ପାନଗ୍ରାମ|ପାନଗ୍ରାମ]]: ବର୍ଣ୍ଣମାଳାର ସବୁ ମୌଳିକ ଅକ୍ଷର ଥିବା ଏକ ବାକ୍ୟ + diff --git a/scriptshifter/tables/__init__.py b/scriptshifter/tables/__init__.py index 7df1958..a2098c9 100644 --- a/scriptshifter/tables/__init__.py +++ b/scriptshifter/tables/__init__.py @@ -16,7 +16,7 @@ from yaml import Loader from scriptshifter import DB_PATH -from scriptshifter.exceptions import BREAK, ConfigError +from scriptshifter.exceptions import BREAK, ApiError, ConfigError __doc__ = """ @@ -209,6 +209,9 @@ def populate_table(conn, tid, tname): if "roman_to_script" in data: flags |= FEAT_R2S + if not data.get("general", 
{}).get("case_sensitive", True): + flags |= FEAT_CASEI + conn.execute( "UPDATE tbl_language SET features = ? WHERE id = ?", (flags, tid)) @@ -555,6 +558,9 @@ def get_lang_general(conn, lang): FROM tbl_language WHERE name = ?""", (lang,)) lang_data = lang_q.fetchone() + if not lang_data: + raise ApiError(f"No language data found for {lang}", 404) + return { "id": lang_data[0], "data": { diff --git a/scriptshifter/tables/data/_chinese_base.yml b/scriptshifter/tables/data/_chinese_base.yml index 0be6e01..4c51521 100644 --- a/scriptshifter/tables/data/_chinese_base.yml +++ b/scriptshifter/tables/data/_chinese_base.yml @@ -1,13 +1,13 @@ # This file is derived and kept in sync with Princeton's OCLC Connexion Pinyin # converter (https://github.com/pulibrary/oclcpinyin/). -general: # Section names and other keywords are all snake_cased. +general: # Section names and other keywords are all snake_cased. name: Chinese base (from Princeton) parents: - _ignore_base script_to_roman: - map: # Mapping section. + map: # Mapping section. "\u5DF4\u57FA\u65AF\u5766\u4F0A\u65AF\u862D\u5171\u548C\u570B": "Bajisitan Yisilan Gongheguo " "\u5DF4\u57FA\u65AF\u5766\u4F0A\u65AF\u5170\u5171\u548C\u56FD": "Bajisitan Yisilan Gongheguo " "\u5DF4\u97F3\u90ED\u695E\u8499\u53E4\u81EA\u6CBB\u5DDE": "Bayinguoleng Menggu Zizhizhou " diff --git a/scriptshifter/tables/data/arabic.yml b/scriptshifter/tables/data/arabic.yml index 8679c87..5e76b9b 100644 --- a/scriptshifter/tables/data/arabic.yml +++ b/scriptshifter/tables/data/arabic.yml @@ -1,9 +1,11 @@ # Arabic S2R using the 3rd-party ArabicTransliterator library: # https://github.com/MTG/ArabicTransliterator +--- general: name: Arabic description: Arabic S2R using a 3rd party library. 
+ case_sensitive: false script_to_roman: hooks: diff --git a/scriptshifter/tables/data/chinese.yml b/scriptshifter/tables/data/chinese.yml index 15ad34f..a846713 100644 --- a/scriptshifter/tables/data/chinese.yml +++ b/scriptshifter/tables/data/chinese.yml @@ -2,10 +2,13 @@ # # All other Chinese mappings are kept in _chinese_base.yml. This mapping only # adds an overlay for parsing numerals and Scriptshifter-specific features. + +--- general: name: Chinese parents: - _chinese_base + case_sensitive: false options: - id: marc_field diff --git a/scriptshifter/tables/data/gujarati.yml b/scriptshifter/tables/data/gujarati.yml index e72278e..19ea6cc 100644 --- a/scriptshifter/tables/data/gujarati.yml +++ b/scriptshifter/tables/data/gujarati.yml @@ -1,5 +1,7 @@ +--- general: name: Gujarati + case_sensitive: false script_to_roman: hooks: diff --git a/scriptshifter/tables/data/hebrew.yml b/scriptshifter/tables/data/hebrew.yml index d30aac4..0a45bf7 100644 --- a/scriptshifter/tables/data/hebrew.yml +++ b/scriptshifter/tables/data/hebrew.yml @@ -1,6 +1,8 @@ +--- general: name: Hebrew description: Hebrew S2R. 
+ case_sensitive: false options: - id: genre @@ -19,4 +21,3 @@ script_to_roman: post_config: - - hebrew.dicta_api.s2r_post_config - diff --git a/scriptshifter/tables/data/kannada.yml b/scriptshifter/tables/data/kannada.yml index 4b60a29..6a956a1 100644 --- a/scriptshifter/tables/data/kannada.yml +++ b/scriptshifter/tables/data/kannada.yml @@ -1,5 +1,7 @@ +--- general: name: Kannada + case_sensitive: false script_to_roman: hooks: diff --git a/scriptshifter/tables/data/malayalam.yml b/scriptshifter/tables/data/malayalam.yml index ae3dad5..7d38d6a 100644 --- a/scriptshifter/tables/data/malayalam.yml +++ b/scriptshifter/tables/data/malayalam.yml @@ -1,5 +1,7 @@ +--- general: name: Malayalam + case_sensitive: false script_to_roman: hooks: diff --git a/scriptshifter/tables/data/marathi_devanagari.yml b/scriptshifter/tables/data/marathi_devanagari.yml index 5e99971..8cc9d34 100644 --- a/scriptshifter/tables/data/marathi_devanagari.yml +++ b/scriptshifter/tables/data/marathi_devanagari.yml @@ -1,5 +1,7 @@ +--- general: name: Marathi (Devanagari) + case_sensitive: false script_to_roman: hooks: diff --git a/scriptshifter/tables/data/nepali_devanagari.yml b/scriptshifter/tables/data/nepali_devanagari.yml index 2aa20a2..67339fb 100644 --- a/scriptshifter/tables/data/nepali_devanagari.yml +++ b/scriptshifter/tables/data/nepali_devanagari.yml @@ -1,5 +1,7 @@ +--- general: name: Nepali (Devanagari) + case_sensitive: false script_to_roman: hooks: diff --git a/scriptshifter/tables/data/oriya.yml b/scriptshifter/tables/data/oriya.yml index a3a911e..4ebaef0 100644 --- a/scriptshifter/tables/data/oriya.yml +++ b/scriptshifter/tables/data/oriya.yml @@ -1,5 +1,7 @@ +--- general: name: Oriya + case_sensitive: false script_to_roman: hooks: diff --git a/scriptshifter/tables/data/pali.yml b/scriptshifter/tables/data/pali.yml index 41462f4..ff23125 100644 --- a/scriptshifter/tables/data/pali.yml +++ b/scriptshifter/tables/data/pali.yml @@ -1,5 +1,7 @@ +--- general: name: Pali + 
case_sensitive: false script_to_roman: hooks: diff --git a/scriptshifter/tables/data/sanskrit_devanagari.yml b/scriptshifter/tables/data/sanskrit_devanagari.yml index 8bd162f..3d5f9af 100644 --- a/scriptshifter/tables/data/sanskrit_devanagari.yml +++ b/scriptshifter/tables/data/sanskrit_devanagari.yml @@ -1,5 +1,7 @@ +--- general: name: Sanskrit (Devanagari) + case_sensitive: false script_to_roman: hooks: diff --git a/scriptshifter/tables/data/sinhalese.yml b/scriptshifter/tables/data/sinhalese.yml index 58ae0db..1f639c4 100644 --- a/scriptshifter/tables/data/sinhalese.yml +++ b/scriptshifter/tables/data/sinhalese.yml @@ -1,5 +1,7 @@ +--- general: name: Sinhalese + case_sensitive: false script_to_roman: hooks: diff --git a/scriptshifter/tables/data/telugu.yml b/scriptshifter/tables/data/telugu.yml index a1267b8..2618d40 100644 --- a/scriptshifter/tables/data/telugu.yml +++ b/scriptshifter/tables/data/telugu.yml @@ -1,5 +1,7 @@ +--- general: name: Telugu + case_sensitive: false script_to_roman: hooks: diff --git a/scriptshifter/tables/data/thai.yml b/scriptshifter/tables/data/thai.yml index e0d4bb4..10b80f8 100644 --- a/scriptshifter/tables/data/thai.yml +++ b/scriptshifter/tables/data/thai.yml @@ -1,5 +1,7 @@ +--- general: name: Thai + case_sensitive: false options: - id: ThaiTranscription diff --git a/scriptshifter/tables/data/yiddish.yml b/scriptshifter/tables/data/yiddish.yml index 9539695..2c961c6 100644 --- a/scriptshifter/tables/data/yiddish.yml +++ b/scriptshifter/tables/data/yiddish.yml @@ -1,5 +1,7 @@ +--- general: name: Yiddish + case_sensitive: false options: - id: loshn_koydesh diff --git a/scriptshifter/trans.py b/scriptshifter/trans.py index 9b1e552..7d68601 100644 --- a/scriptshifter/trans.py +++ b/scriptshifter/trans.py @@ -5,7 +5,7 @@ from scriptshifter.exceptions import BREAK, CONT from scriptshifter.tables import ( - BOW, EOW, WORD_BOUNDARY, FEAT_R2S, FEAT_S2R, HOOK_PKG_PATH, + BOW, EOW, WORD_BOUNDARY, FEAT_CASEI, FEAT_R2S, FEAT_S2R, 
HOOK_PKG_PATH, get_connection, get_lang_dcap, get_lang_general, get_lang_hooks, get_lang_ignore, get_lang_map, get_lang_normalize) @@ -111,6 +111,10 @@ def transliterate(src, lang, t_dir="s2r", capitalize=False, options={}): f"Roman-to-script not yet supported for {lang}." ) + # Normalize case before post_config and rule-based normalization. + if not ctx.general["case_sensitive"]: + ctx._src = ctx.src.lower() + # This hook may take over the whole transliteration process or delegate # it to some external process, and return the output string directly. if _run_hook("post_config", ctx) == BREAK: @@ -309,6 +313,9 @@ def transliterate(src, lang, t_dir="s2r", capitalize=False, options={}): def _normalize_src(ctx, norm_rules): + """ + Normalize source text according to rules. + """ for nk, nv in norm_rules.items(): ctx._src = ctx.src.replace(nk, nv) logger.debug(f"Normalized source: {ctx.src}") diff --git a/tests/data/script_samples/unclassified.csv b/tests/data/script_samples/unclassified.csv index 8b6d3ca..6cd0941 100644 --- a/tests/data/script_samples/unclassified.csv +++ b/tests/data/script_samples/unclassified.csv @@ -2,7 +2,7 @@ armenian,Մեդիա իրավունք : (ուսումնական ձեռնարկ) , armenian,Ա Բ Գ Դ Ե Զ Է Ը Թ Ժ Ի Լ Խ Ծ Կ Հ Ձ Ղ Ճ Մ Յ Ն Շ Ո Չ Պ Ջ Ռ Ս Վ Տ Ր Ւ Փ Ք Օ Ֆ ՙ ՚ ՛ ՜ ՝ ՞ ՟ ա բ գ դ ե զ է ը թ ժ ի լ խ ծ կ ձ ղ ճ մ յ ն շ ո չ պ ջ ռ ս վ տ ր ց ւ փ ք օ ֆ և ։ ֊ .,A B G D E Y Z Ē Ě Tʻ Zh I L Kh Ts K H Dz Gg Ch M Y N Sh O Chʻ P J Ṛ S V T R Tsʻ W U Pʻ Kʻ Ew Ev Ō Fa b g d e y z ē ě tʻ zh i l kh ts k h dz gh ch m y n sh o chʻ p j ṛ s v t r tsʻ w u pʻ kʻ ew ev ō f,, georgian,ადგილობრივი თვითმმართველობის კოდექსი : საქართველოს ორგანული კანონი; 2018 წლის 7 სექტებრის მდგომარეობით.,Adgilobrivi tʻvitʻmmartʻvelobis kodekʻsi : Sakʻartʻvelos organuli kanoni; 2018 clis 7 sekʻtembris mdgomareobitʻ.,, hindi,परमहंस की पीड़ा : महान क्रांतिकारी रामप्रसाद बिस्मिल के जीवन पर आधारित उपन्यास,Paramahaṃsa kī pīṛā : mahāna krāntikārī Rāmaprasāda Bismila ke jīvana para ādhārita upanyāsa,, 
-mongolian_mongol_bichig,ᠳᠠᠶᠢᠴᠢᠩ ᠭᠦᠷᠦᠨ ᠦ ᠦᠶ ᠡ ᠶᠢᠨ ᠥᠯᠠᠨ ᠺᠡᠯᠡᠨ ᠦ ᠦᠰᠦᠭ ᠬᠠᠪᠰᠸᠷᠸᠭᠰᠠᠨ ᠰᠸᠷᠪᠸᠯᠵᠢ ᠪᠢᠴᠢᠭ ᠦᠨ ᠰᠸᠳᠸᠯᠸᠯ,Dayicing gu̇ru̇n-u̇ u̇y-e-yin olan kelen-u̇ u̇su̇g qabsuruġsan surbulji bicig-u̇n sudulul,, +mongolian_mongol_bichig,ᠳᠠᠶᠢᠴᠢᠩ ᠭᠦᠷᠦᠨ ᠦ ᠦᠶ᠎ᠡ ᠶᠢᠨ ᠣᠯᠠᠨ ᠬᠡᠯᠡᠨ ᠦ ᠦᠰᠦᠭ ᠬᠠᠪᠰᠤᠷᠤᠭᠰᠠᠨ ᠰᠤᠷᠪᠤᠯᠵᠢ ᠪᠢᠴᠢᠭ ᠦᠨ ᠰᠤᠳᠤᠯᠤᠯ,dayicing gu̇ru̇n-u̇ u̇y-e-yin olan kelen-u̇ u̇su̇g qabsuruġsan surbulji bicig-u̇n sudulul,, ,আগবাৰীত ফুলিলে সোনে মোৰ চম্পা,Āgabārīta phulile soṇe mora campā,, ,Milli dövlətçilik hərəkatının yüksəlişi və Xalq Cümhuriyyəti dövründə Azərbaycançılıq ideyası,Milli dövlätçilik häräkatının yüksälişi vä Xalq Cümhuriyyäti dövründä azärbaycançılıq ideyası,, ,مجنون مجنون دوشون منى شعر توپلوسو ,Macnūn macnūn düşün manī : şiʻr toplūsū,,
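
Note on the `case_sensitive` change: the patch adds `case_sensitive: false` to several tables, sets a `FEAT_CASEI` feature bit in `populate_table` when a table opts out of case sensitivity, and lowercases the source in `transliterate` before the `post_config` hook runs. The following is a minimal, self-contained sketch of that flow, not the project's code. The flag values, the `FEAT_S2R` check, and the helper names `compute_features` and `prepare_source` are assumptions made for illustration; only the `FEAT_*` names and the `general.case_sensitive` key come from the diff.

```python
# Illustrative sketch only. Bit values and helper names are assumptions,
# not taken from scriptshifter itself.
FEAT_S2R = 1 << 0    # assumed bit value
FEAT_R2S = 1 << 1    # assumed bit value
FEAT_CASEI = 1 << 2  # assumed bit value


def compute_features(data):
    """Derive a feature bitmask from a parsed table config (a dict)."""
    flags = 0
    if "script_to_roman" in data:  # assumed to mirror the R2S check
        flags |= FEAT_S2R
    if "roman_to_script" in data:
        flags |= FEAT_R2S
    # A table counts as case-sensitive unless it explicitly opts out.
    if not data.get("general", {}).get("case_sensitive", True):
        flags |= FEAT_CASEI
    return flags


def prepare_source(src, general):
    """Lowercase the source text when the table is case-insensitive."""
    if not general.get("case_sensitive", True):
        return src.lower()
    return src


# Hypothetical parsed config shaped like the oriya.yml hunk above.
oriya = {
    "general": {"name": "Oriya", "case_sensitive": False},
    "script_to_roman": {"hooks": {}},
}

flags = compute_features(oriya)
print(bool(flags & FEAT_CASEI))
print(prepare_source("ABC", oriya["general"]))
```

The sketch reads the key with `.get("case_sensitive", True)` in both places, so a table that omits the key is treated as case-sensitive. The patched `transliterate` indexes `ctx.general["case_sensitive"]` directly; whether `get_lang_general` always supplies that key is not visible in this diff, so it may be worth confirming.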