Code-Switching Language Tagger

General Description

python-tagger is a language identification tool written in Python based on similar code written in Scala. Note that this code has only been tested on the language pairs English-Spanish and English-French.

Requirements

To run the script, the following files are required:

training corpora for each language in UTF-8,
the latest version of the Stanford Named Entity Recognizer from here,
a gold standard for computing priors and evaluation in tab-separated format.

Additionally, running it requires Python3 and NLTK. The script assumes that the gold standard is already tokenized with the language tags in the rightmost column. Note that as of now, the only classifiers supported for Named Entity Recognition are for English and Spanish.

Usage

eval.py [options] infile

usage: evaluator.py [-h] [--ngram ngram] [--tokenize] [--header]
                    [--gold-delimiter GOLD_DELIMITER] [-v]
                    [infile]

Tag a mixed-language text by language

positional arguments:
  infile                corpus file (Default: stdin)

optional arguments:
  -h, --help            show this help message and exit
  --ngram ngram         size of character ngrams (Default: 5)
  --tokenize            tokenize flag (Default: False)
  --header              header flag (Default: False)
  --gold-delimiter GOLD_DELIMITER
                        delimiter for gold standard file (Default: tab)
  -v, --verbose         verbose flag (Default: False)

Further options in config.ini file:

[DEFAULT]
- LANG_SET = Comma-separated list of language tags corresponding to language tags in gold standard
- NGRAM = Size of ngram model
- TOKENIZE = Tokenize test corpus (unused)
- HEADER = Presence of Header row included in test corpus
- VERBOSE = Print diagnostic information
[TRAIN_PATHS]
- LANG1_TRAIN = Path to training data for Language 1
- LANG2_TRAIN = Path to training data for Language 2
[CLASS_PATHS]
- CLASS_JAR = Path to Stanford NER .jar file
- LANG1_CLASS = Path to NER classifier for Language 1
- LANG2_CLASS = Path to NER classifier for Language 2
[GOLD]
- GOLD_PATH = Path to gold standard
- GOLD_DELIMITER = Quoted delimeter for gold standard
- LANG1_OTHER = Other tags in gold standard corresponding to Language 1
- LANG2_OTHER = Other tags in gold standard corresponding to Language 2
- NE_TAG = Tag in gold standard corresponding to Named Entity
- OTHER_TAGS = Unwanted tags in gold standard (unused)
[ADVANCED]
- NER_CHUNK_SIZE = Token batch size for calls to Named Entity Recognizer

TODO

Translate from Scala
Update and document code
Write test code
Convert to Python package

Name		Name	Last commit message	Last commit date
Latest commit History 90 Commits
TrainingCorpora		TrainingCorpora
.gitignore		.gitignore
README.md		README.md
cngram.py		cngram.py
config.ini		config.ini
cs_model.py		cs_model.py
eval_gui.py		eval_gui.py
evaluator.py		evaluator.py
hmm.py		hmm.py
spanish.ancora.distsim.s512.crf.ser.gz		spanish.ancora.distsim.s512.crf.ser.gz

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Code-Switching Language Tagger

General Description

Requirements

Usage

TODO

About

Releases

Packages

Contributors 6

Languages

Bilingual-Annotation-Task-Force/python-tagger

Folders and files

Latest commit

History

Repository files navigation

Code-Switching Language Tagger

General Description

Requirements

Usage

TODO

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 6

Languages

Packages