repository setup

sinaahmadi · Mar 25, 2023 · commit f89f2e5
Showing 21 changed files with 79,459 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,38 @@
+### macOS ###
+# General
+.DS_Store
+.AppleDouble
+.LSOverride
+
+# Icon must end with two \r
+Icon
+
+
+# Thumbnails
+._*
+
+# Files that might appear in the root of a volume
+.DocumentRevisions-V100
+.fseventsd
+.Spotlight-V100
+.TemporaryItems
+.Trashes
+.VolumeIcon.icns
+.com.apple.timemachine.donotpresent
+
+# Directories potentially created on remote AFP share
+.AppleDB
+.AppleDesktop
+Network Trash Folder
+Temporary Items
+.apdisk
+
+### macOS Patch ###
+# iCloud generated files
+*.icloud
+
+hidden/
+fastText-0.9.2
+corpora
+*.vec
+*.bin
diff --git a/Kurdish-alphabets.png b/Kurdish-alphabets.png
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2023 Sina Ahmadi
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,96 @@
+# Language Identification of Kurdish & Zaza-Gorani Languages
+---
+
+![Kurdish alphabets](Kurdish-alphabets.png)
+
+Language identification, or language detection, is the task of determining the language in which a sentence is written. This repository provides models for language identification of Kurdish and Zaza-Gorani languages written in their Kurdified Perso-Arabic and Latin scripts. Our models can predict the following languages and scripts:
+
+* Northern Kurdish / کورمانجی (Kurmanji, `kmr`) - both scripts with `kuarab` & `kulatn` labels
+* Central Kurdish / سۆرانی (Sorani, `ckb`) - both scripts with `ckbarab` & `ckblatn` labels
+* Southern Kurdish / کوردیی خوارین (`sdh`) 
+* Gorani / گۆرانی (Hawrami, `hac`)
+* Zazaki / Zazakî (`zza`) - both scripts, with `zza` for the Bedirxan script and `zzawiki` for the script used on [Zazaki Wikipedia](https://diq.wikipedia.org)
+* Arabic / اَلْعَرَبِيَّةُ (`ar`)
+* Persian / فارسی (`fa`)
+* Turkish / Türkçe (`tr`)
+
+## How to use?
+
+You can load the trained [models](models) with [fastText](https://fasttext.cc), either in Python or from the command line. For details, see the documentation of the fastText Python module: [https://fasttext.cc/docs/en/python-module.html](https://fasttext.cc/docs/en/python-module.html).
+
+Two models are provided:
+
+* [models/KLID_model.ftz](models/KLID_model.ftz): use this if you do not need to detect the script. It predicts language codes only.
+* [models/KLID_model_scr.ftz](models/KLID_model_scr.ftz): use this if you want the script label in addition to the language code. It predicts both language and script.
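
The script-aware labels named in the language list above can be unpacked into a language code and a script name with a small lookup table. The sketch below is an illustration and not part of the repository: the helper name `split_script_label` and the script names on the right-hand side (`Arab`, `Latn`, `Bedirxan`, `Wikipedia`) are assumptions, and it assumes that `KLID_model_scr.ftz` emits the labels exactly as they are spelled in that list.

```python
from typing import Optional, Tuple

PREFIX = "__label__"  # fastText's default label prefix

# Script-aware labels as spelled in the README's language list.
# The script names on the right are illustrative choices, not model output.
SCRIPT_LABELS = {
    "kuarab": ("ku", "Arab"),
    "kulatn": ("ku", "Latn"),
    "ckbarab": ("ckb", "Arab"),
    "ckblatn": ("ckb", "Latn"),
    "zza": ("zza", "Bedirxan"),
    "zzawiki": ("zza", "Wikipedia"),
}


def split_script_label(label: str) -> Tuple[str, Optional[str]]:
    """Split a label into (language, script); script is None when it is not in the table."""
    name = label[len(PREFIX):] if label.startswith(PREFIX) else label
    return SCRIPT_LABELS.get(name, (name, None))


print(split_script_label("__label__ckblatn"))
print(split_script_label("__label__sdh"))
```

The table is only meaningful for output of the script-aware model; with `KLID_model.ftz`, a plain `zza` label says nothing about the script.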
+
+Here is an example in Python:
+
+```python
+>>> import fasttext
+>>> model = fasttext.load_model("models/KLID_model.ftz")
+
+# Central Kurdish
+>>> model.predict("لەزۆربەی یارییەکان گوڵ تۆمار دەکات") 
+(('__label__ckb',), array([1.00002003]))
+>>> model.predict("لەزۆربەی یارییەکان گوڵ تۆمار دەکات", k=5)
+(('__label__ckb', '__label__ku'), array([1.00002003e+00, 1.00000989e-05]))
+>>> model.predict("باڵیۆزی عێراق")
+(('__label__ckb',), array([1.00001979]))
+
+# Southern Kurdish
+>>> model.predict("چەس ئمڕوو چە قوومیاس؟!!") 
+(('__label__sdh',), array([1.00003743]))
+
+# Gorani
+>>> model.predict("داستانێ فرەتەر و درێژتەرەنه و دەسی سەر پەی") 
+(('__label__hac',), array([0.99998134]))
+
+# Northern Kurdish (Kurmanji), Perso-Arabic script
+>>> model.predict("ئەگەر بێژم ئەز فەرهادم") 
+(('__label__ku',), array([0.93445575]))
+
+# Zazaki
+>>> model.predict("Seba naye zî ganî ma rayîr û metodanê xo xurtêr bikerê.") 
+(('__label__zza',), array([1.00003004]))
+
+# Northern Kurdish (Kurmanji), Latin script
+>>> model.predict("Amerîkayîyan di sala 2004 de zîndana Ebû Xerîb girtin.") 
+(('__label__ku',), array([0.99766862]))
+
+# Central Kurdish
+>>> model.predict("Emin filsêkim le kitêban dest nekewtbû bełam") 
+(('__label__ckb',), array([1.00001991]))
+
+# Central Kurdish (the model returns `ku` and `sdh` for these two sentences)
+>>> model.predict("گەرەکمە پێی بێژم نامگەرەکە") 
+(('__label__ku',), array([0.99485904])) 
+>>> model.predict("جا ئەتوو وەرە دەگەڵ وی ڕێک کەوە")
+(('__label__sdh',), array([0.84034669])) 
+
+# English (not among the supported languages; the model still returns `zza`)
+>>> model.predict("To be, or not to be") 
+(('__label__zza',), array([1.00003004]))
+```
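
The return value of `model.predict` for a single string is a pair: a tuple of label strings and a NumPy array of scores. Some scores in the session above are slightly greater than 1. A minimal sketch for turning that pair into plain `(code, probability)` tuples follows; `tidy_prediction` is a hypothetical helper, and clamping the scores to the range [0, 1] is an editorial choice rather than anything the repository prescribes.

```python
from typing import Iterable, List, Sequence, Tuple

PREFIX = "__label__"  # fastText's default label prefix


def tidy_prediction(labels: Sequence[str], scores: Iterable[float]) -> List[Tuple[str, float]]:
    """Pair each label (prefix removed) with its score clamped to [0, 1]."""
    tidy = []
    for label, score in zip(labels, scores):
        code = label[len(PREFIX):] if label.startswith(PREFIX) else label
        tidy.append((code, min(1.0, max(0.0, float(score)))))
    return tidy


# Values copied from the k=5 call in the session above.
result = tidy_prediction(("__label__ckb", "__label__ku"), [1.00002003e+00, 1.00000989e-05])
print(result)
```

With a loaded model, the same helper can be applied directly as `tidy_prediction(*model.predict(text, k=5))`.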
+
+If you would like to train your own models, you can use the datasets provided in the [datasets](datasets) folder. All the datasets are merged into [train](datasets/train.txt) and [train_scr](datasets/train.txt); these two files contain the instances tagged without and with script labels, respectively.
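
fastText's supervised mode reads one example per line, with the label written as a `__label__` prefix followed by the text. Assuming the files in `datasets` follow that convention (their contents are not shown in this excerpt), training lines can be built and parsed as sketched below. Both function names are hypothetical.

```python
from typing import Tuple

PREFIX = "__label__"  # fastText's default label prefix


def to_fasttext_line(label: str, text: str) -> str:
    """Format one training example; runs of whitespace and newlines become single spaces."""
    return f"{PREFIX}{label} {' '.join(text.split())}"


def parse_fasttext_line(line: str) -> Tuple[str, str]:
    """Split a single-label training line back into (label, text)."""
    first, _, rest = line.strip().partition(" ")
    if not first.startswith(PREFIX):
        raise ValueError(f"line does not start with {PREFIX!r}: {line!r}")
    return first[len(PREFIX):], rest


line = to_fasttext_line("ckb", "باڵیۆزی عێراق")
print(line)
print(parse_fasttext_line(line))
```

A file written in this format can then be passed to `fasttext.train_supervised(input=path)`.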
+
+## Cite this corpus
+If you're using the models, please cite the project along with the following paper ([bib file](https://sinaahmadi.github.io/bibliography/ahmadi2023fieldmatters.bib) | [PDF](https://sinaahmadi.github.io/docs/articles/ahmadi2023fieldmatters.pdf)). 
+
+```
+@inproceedings{ahmadi2023fieldmatters,
+  title = "Approaches to Corpus Creation for Low-Resource Language Technology: the Case of {Southern Kurdish and Laki}",
+  author = "Ahmadi, Sina and Azin, Zahra and Belelli, Sara and Anastasopoulos, Antonios",
+  booktitle = "Proceedings of the second workshop on NLP applications to field linguistics",
+  month = may,
+  year = "2023",
+  address = "Dubrovnik, Croatia",
+  publisher = "The 17th Conference of the European Chapter of the Association for Computational Linguistics"
+}
+```
+
+## License
+
+[MIT](LICENSE)