Repository for Cross-lingual Parsing with Polyglot Training and Multi-treebank Learning: A Faroese Case Study, submitted to the DeepLo Workshop at EMNLP 2019.
The code will be tidied up over the next couple of weeks.
This project is developed in Python 3.6 using a Conda environment.
Create a Conda environment with Python 3.6:
conda create -n multilingual_parsing python=3.6
Activate the Conda environment:
source activate multilingual_parsing
This project uses some new AllenNLP features which are not available in the official 0.8.4 release. As such, we build the unreleased 0.8.5 version from the master branch on GitHub. If there are any problems, try updating pip, setuptools and wheel as mentioned here.
cd
git clone https://github.com/allenai/allennlp.git
cd allennlp
pip install --editable .
Make the library available in $PYTHONPATH. From the multilingual_parsing directory:
cd /path/to/multilingual_parsing
export PYTHONPATH="$PWD/library"
# or permanently:
vim ~/.bashrc
export PYTHONPATH=/path/to/multilingual_parsing/library
source ~/.bashrc
You will need to obtain the original Faroese data for these experiments.
- Clone the original repository to somewhere in your file system, e.g. in your home directory:
cd $HOME && git clone https://github.com/ftyers/cross-lingual-parsing.git
- Change directory to your clone of this repo and create a symbolic link to the original data:
cd path/to/multilingual-parsing
ln -s /home/user/cross-lingual-parsing/data/ .
This should create a directory structure multilingual-parsing/data/.
- Download UD v2.2 treebanks.
./scripts/get_ud_treebank.sh
We follow the same process as the CoNLL 2018 shared task to create datasets with automatically predicted POS tags: we perform jack-knifing on the training set to predict POS tags, and POS tags on the development set are predicted with a model trained on the gold-standard training set.
./scripts/create_k_folds.sh
./scripts/train_with_cross_val.sh
./scripts/predict_with_cross_val.sh
# train a model on full training data and predict the dev set:
./scripts/train_source_tagger.sh monolingual
./scripts/predict_with_cross_val.sh dev
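For intuition, the jack-knifing step splits the training treebank into k folds and tags each fold with a model trained on the remaining folds, so the training data ends up with realistically noisy predicted tags. Below is a minimal sketch of the fold-splitting logic only; the file names, fold count and helper names are illustrative assumptions, not what scripts/create_k_folds.sh actually does.

```python
# Illustrative sketch of the k-fold split used for jack-knifing.
# File names and the fold count are assumptions for illustration.

def read_conllu_sentences(path):
    """Return a list of sentences, each a list of raw CoNLL-U lines."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                current.append(line)
            elif current:
                sentences.append(current)
                current = []
    if current:
        sentences.append(current)
    return sentences

def write_folds(sentences, k, prefix):
    """For each fold i, write the held-out sentences (to be tagged by a model
    trained on the rest) and the complementary training file."""
    for i in range(k):
        heldout = sentences[i::k]  # every k-th sentence, offset by i
        train = [s for j, s in enumerate(sentences) if j % k != i]
        for name, subset in (("heldout", heldout), ("train", train)):
            with open(f"{prefix}.fold{i}.{name}.conllu", "w", encoding="utf-8") as out:
                for sent in subset:
                    out.writelines(sent)
                    out.write("\n")

if __name__ == "__main__":
    sents = read_conllu_sentences("da_ddt-ud-train.conllu")  # example treebank
    write_folds(sents, k=10, prefix="da_ddt")
```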
- Train a source model on the source treebanks. The model_type argument can be either monolingual or multilingual and determines whether a monolingual or multilingual model is used. You will already have trained the source taggers in the previous step.
./scripts/train_source_parser.sh <model_type>
- Use a source model to predict annotations for the files translated into the source languages. The model_type argument can be either monolingual or multilingual and determines whether a monolingual or multilingual model is used.
# first supply tags
./scripts/predict_source_tagger.sh monolingual user
# parse the translations
./scripts/predict_source_parser.sh <model_type> user
- Project from the source languages to the target language (see the projection sketch after this list).
./scripts/project_all.sh <model_type>
- Take only the valid sentences.
./scripts/validate_all.sh <model_type> single
- Combine sentences where we have 3/4 valid projected sentences.
python utils/treebanks_union.py <model_type>
- Perform MST voting over the matching, validated sentences (see the voting sketch after this list).
./scripts/merge_all.sh <model_type>
- Validate the voted sentences.
./scripts/validate_all.sh <model_type> combined
- Check for double-headed sentences (see the structural-check sketch after this list).
./scripts/check_double_headed_all.sh <model_type>
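For intuition, here are illustrative sketches of three of the steps above. They are simplified assumptions about the underlying logic, not the code the scripts actually run.

Projection copies each source token's tag and head onto the aligned Faroese token. The sketch below assumes a 1-to-1 word alignment given as a dict of 1-based token indices; the data structures and function name are hypothetical.

```python
# Rough sketch of annotation projection through a 1-to-1 word alignment.
# `alignment` maps source token IDs to target token IDs (1-based, as in
# CoNLL-U); unaligned target tokens remain unannotated.

def project_sentence(source_tokens, alignment, target_length):
    """source_tokens: list of dicts with 'id', 'upos', 'head'.
    Returns a list of (upos, head) or None for each target position."""
    projected = [None] * (target_length + 1)  # index 0 unused (artificial root)
    for tok in source_tokens:
        tgt_id = alignment.get(tok["id"])
        if tgt_id is None:
            continue
        # The head is projected through the alignment too; a head aligned to
        # nothing yields an unattached token, filtered out by validation.
        tgt_head = 0 if tok["head"] == 0 else alignment.get(tok["head"])
        projected[tgt_id] = (tok["upos"], tgt_head)
    return projected[1:]
```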
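MST voting, roughly: each surviving projection votes for its arcs, and the merged tree is the maximum spanning arborescence over the vote-weighted graph. The sketch below uses networkx for the arborescence; that choice is an assumption for illustration, since the repository's merging scripts implement their own decoding.

```python
# Sketch of MST voting over several projected parses of the same sentence.
# Each parse is a list of heads (parse[i] = head of token i+1, 0 = root).

from collections import Counter
import networkx as nx

def vote_mst(parses, n_tokens):
    votes = Counter()
    for heads in parses:
        for dep, head in enumerate(heads, start=1):
            votes[(head, dep)] += 1            # arc head -> dependent

    graph = nx.DiGraph()
    graph.add_nodes_from(range(n_tokens + 1))  # node 0 is the artificial root
    for (head, dep), count in votes.items():
        graph.add_edge(head, dep, weight=count)

    tree = nx.maximum_spanning_arborescence(graph)
    merged = [0] * n_tokens
    for head, dep in tree.edges():
        merged[dep - 1] = head
    return merged

# Example: three projections voting over a three-token sentence.
print(vote_mst([[2, 0, 2], [2, 0, 2], [3, 0, 2]], 3))  # -> [2, 0, 2]
```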
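Finally, a sketch of the kind of structural checks the validation and double-headed steps perform: every token should have a head and reach the artificial root without cycles, and a sentence with more than one token attached directly to the root is flagged as double-headed. Again, this illustrates the idea rather than the scripts' actual checks.

```python
# Illustrative structural checks for a projected/merged sentence.
# heads[i] is the head of token i+1 (0 = artificial root, None = unattached).

def is_valid(heads):
    """True if every token has an in-range head and reaches the root
    without running into a cycle."""
    n = len(heads)
    if any(h is None or not 0 <= h <= n for h in heads):
        return False
    for i in range(1, n + 1):
        seen, node = set(), i
        while node != 0:
            if node in seen:          # cycle that never reaches the root
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

def is_double_headed(heads):
    """True if more than one token is attached directly to the root."""
    return sum(1 for h in heads if h == 0) > 1

print(is_valid([2, 0, 2]))            # True
print(is_valid([2, 3, 2]))            # False: tokens 2 and 3 form a cycle
print(is_double_headed([0, 0, 2]))    # True: two tokens attached to the root
```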
Train tagging and parsing models on the synthetic target treebank. model_type can be either monolingual or multilingual. We train a tagger here so that we can produce silver tags for the final test set.
./scripts/train_target_tagger.sh <model_type>
./scripts/train_target_parser.sh <model_type>
Predict the Faroese test set with the various target taggers and parsers.
./scripts/predict_target_tagger.sh <model_type>
./scripts/predict_target_parser.sh <model_type>
./scripts/train_multi_target_tagger.sh <model_type>
./scripts/train_multi_target_parser.sh <model_type>
./scripts/predict_multi_target_tagger.sh <model_type>
./scripts/predict_multi_target_parser.sh <model_type>