0 parents
commit f89f2e5
Showing 21 changed files with 79,459 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
### macOS ### | ||
# General | ||
.DS_Store | ||
.AppleDouble | ||
.LSOverride | ||
|
||
# Icon must end with two \r | ||
Icon | ||
|
||
|
||
# Thumbnails | ||
._* | ||
|
||
# Files that might appear in the root of a volume | ||
.DocumentRevisions-V100 | ||
.fseventsd | ||
.Spotlight-V100 | ||
.TemporaryItems | ||
.Trashes | ||
.VolumeIcon.icns | ||
.com.apple.timemachine.donotpresent | ||
|
||
# Directories potentially created on remote AFP share | ||
.AppleDB | ||
.AppleDesktop | ||
Network Trash Folder | ||
Temporary Items | ||
.apdisk | ||
|
||
### macOS Patch ### | ||
# iCloud generated files | ||
*.icloud | ||
|
||
hidden/ | ||
fastText-0.9.2 | ||
corpora | ||
*.vec | ||
*.bin |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
MIT License | ||
|
||
Copyright (c) 2023 Sina Ahmadi | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in all | ||
copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,96 @@ | ||
# Language Identification of Kurdish & Zaza-Gorani Languages | ||
--- | ||
|
||
 | ||
|
||
Language identification (or language detection) is the task of determining the language in which a sentence is written. This repository provides models for language identification of Kurdish and Zaza-Gorani languages in their Kurdified Perso-Arabic and Latin scripts. Our models can predict the following languages and scripts: | ||
|
||
* Northern Kurdish / کورمانجی (Kurmanji, `kmr`) - both scripts with `kuarab` & `kulatn` labels | ||
* Central Kurdish / سۆرانی (Sorani, `ckb`) - both scripts with `ckbarab` & `ckblatn` labels | ||
* Southern Kurdish / کوردیی خوارین (`sdh`) | ||
* Gorani / گۆرانی (Hawrami, `hac`) | ||
* Zazaki / Zazakî (`zza`) - both scripts with `zza` for Bedirxan and `zzawiki` for the script used on [Zazaki Wikipedia](https://diq.wikipedia.org) | ||
* Arabic / اَلْعَرَبِيَّةُ (`ar`) | ||
* Persian / فارسی (`fa`) | ||
* Turkish / Türkçe (`tr`) | ||
|
||
## How to use? | ||
|
||
You can load the trained [models](models) using [fastText](https://fasttext.cc) in Python or on the command line. For details, see the fastText Python module documentation at [https://fasttext.cc/docs/en/python-module.html](https://fasttext.cc/docs/en/python-module.html). | ||
|
||
Two models are provided: | ||
|
||
* [models/KLID_model.ftz](models/KLID_model.ftz): use this if you don't need to detect the script of the language. This predicts language codes only. | ||
* [models/KLID_model_scr.ftz](models/KLID_model_scr.ftz): use this if you want the script label in addition to the language code. This predicts language and script. | ||
|
||
Here is an example in Python: | ||
|
||
```python | ||
>>> import fasttext | ||
>>> model = fasttext.load_model("models/KLID_model.ftz") | ||
|
||
# Central Kurdish | ||
>>> model.predict("لەزۆربەی یارییەکان گوڵ تۆمار دەکات") | ||
(('__label__ckb',), array([1.00002003])) | ||
>>> model.predict("لەزۆربەی یارییەکان گوڵ تۆمار دەکات", k=5) | ||
(('__label__ckb', '__label__ku'), array([1.00002003e+00, 1.00000989e-05])) | ||
>>> model.predict("باڵیۆزی عێراق") | ||
(('__label__ckb',), array([1.00001979])) | ||
|
||
# Southern Kurdish | ||
>>> model.predict("چەس ئمڕوو چە قوومیاس؟!!") | ||
(('__label__sdh',), array([1.00003743])) | ||
|
||
# Gorani | ||
>>> model.predict("داستانێ فرەتەر و درێژتەرەنه و دەسی سەر پەی") | ||
(('__label__hac',), array([0.99998134])) | ||
|
||
# Kurmanji | ||
>>> model.predict("ئەگەر بێژم ئەز فەرهادم") | ||
(('__label__ku',), array([0.93445575])) | ||
|
||
# Zazaki | ||
>>> model.predict("Seba naye zî ganî ma rayîr û metodanê xo xurtêr bikerê.") | ||
(('__label__zza',), array([1.00003004])) | ||
|
||
# Northern Kurdish | ||
>>> model.predict("Amerîkayîyan di sala 2004 de zîndana Ebû Xerîb girtin.") | ||
(('__label__ku',), array([0.99766862])) | ||
|
||
# Central Kurdish | ||
>>> model.predict("Emin filsêkim le kitêban dest nekewtbû bełam") | ||
(('__label__ckb',), array([1.00001991])) | ||
|
||
# Central Kurdish | ||
>>> model.predict("گەرەکمە پێی بێژم نامگەرەکە") | ||
(('__label__ku',), array([0.99485904])) | ||
>>> model.predict("جا ئەتوو وەرە دەگەڵ وی ڕێک کەوە") | ||
(('__label__sdh',), array([0.84034669])) | ||
|
||
# English | ||
>>> model.predict("To be, or not to be") | ||
(('__label__zza',), array([1.00003004])) | ||
``` | ||
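
As the outputs above show, `predict` returns a tuple of labels carrying the `__label__` prefix together with an array of scores, and those scores can land slightly above 1.0. The following is a minimal sketch of one way to post-process that output into plain language codes and clamped probabilities. The helper name `parse_prediction` is hypothetical and not part of this repository or of fastText.

```python
LABEL_PREFIX = "__label__"  # prefix that fastText puts on every label


def parse_prediction(labels, scores):
    """Turn fastText's (labels, scores) output into [(code, probability), ...].

    Hypothetical helper for illustration; not part of fastText or this repo.
    """
    results = []
    for label, score in zip(labels, scores):
        # Strip the prefix if present, otherwise keep the label unchanged.
        code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        # Scores can drift slightly outside [0, 1]; clamp them for display.
        probability = min(max(float(score), 0.0), 1.0)
        results.append((code, probability))
    return results


# Labels and scores copied from the k=5 Central Kurdish example above.
print(parse_prediction(("__label__ckb", "__label__ku"), [1.00002003e+00, 1.00000989e-05]))
```

With a loaded model, the same helper could be called as `parse_prediction(*model.predict(text, k=5))`. Note that a high score does not guarantee a correct label: the English example above is assigned `zza` with full confidence because English is not among the trained languages.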
|
||
If you would like to train your own models, you can use the datasets provided in the [datasets](datasets) folder. All the datasets are merged into [train](datasets/train.txt) and [train_scr](datasets/train.txt), which contain the instances tagged without and with their scripts, respectively. | ||
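
Before training, it can help to check how many instances each label has. The sketch below assumes the training files follow fastText's standard supervised format, where each line carries one or more `__label__<code>` tokens followed by the text; this README does not state the format explicitly. The helper names `count_labels` and `train_and_save` are hypothetical, and the training call uses fastText's default hyperparameters, which are not necessarily the ones used for the provided models.

```python
from collections import Counter


def count_labels(lines, prefix="__label__"):
    """Count how many training instances carry each label."""
    counts = Counter()
    for line in lines:
        # In fastText's supervised format, labels are the tokens with the prefix.
        for token in line.split():
            if token.startswith(prefix):
                counts[token[len(prefix):]] += 1
    return counts


def train_and_save(train_path, model_path):
    """Train a classifier with fastText's defaults and save it (illustrative)."""
    import fasttext  # imported lazily so count_labels works without fastText

    model = fasttext.train_supervised(input=train_path)
    model.save_model(model_path)
    return model


# Illustrative lines built from sentences quoted in this README; the real
# training data lives in the datasets folder.
sample = [
    "__label__ckb باڵیۆزی عێراق",
    "__label__zza Seba naye zî ganî ma rayîr û metodanê xo xurtêr bikerê.",
    "__label__ckb لەزۆربەی یارییەکان گوڵ تۆمار دەکات",
]
print(count_labels(sample))
```

On the real data, the counts could be computed with `count_labels(open("datasets/train.txt", encoding="utf-8"))`. The `.ftz` extension of the provided models suggests they were compressed with fastText's quantization, but the settings used are not documented here.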
|
||
## Cite this corpus | ||
If you're using the models, please cite the project along with the following paper ([bib file](https://sinaahmadi.github.io/bibliography/ahmadi2023fieldmatters.bib) | [PDF](https://sinaahmadi.github.io/docs/articles/ahmadi2023fieldmatters.pdf)). | ||
|
||
``` | ||
@inproceedings{ahmadi2023fieldmatters, | ||
title = "Approaches to Corpus Creation for Low-Resource Language Technology: the Case of {Southern Kurdish and Laki}", | ||
author = "Ahmadi, Sina and Azin, Zahra and Belelli, Sara and Anastasopoulos, Antonios", | ||
booktitle = "Proceedings of the second workshop on NLP applications to field linguistics", | ||
month = may, | ||
year = "2023", | ||
address = "Dubrovnik, Croatia", | ||
publisher = "The 17th Conference of the European Chapter of the Association for Computational Linguistics" | ||
} | ||
``` | ||
|
||
## License | ||
|
||
[MIT](LICENSE) |