Repository for Cross-lingual Parsing with Polyglot Training and Multi-treebank Learning: A Faroese Case Study, submitted to the DeepLo Workshop at EMNLP 2019.
The code will be tidied up over the next couple of weeks.
This project is developed in Python 3.6 using a Conda environment.
Create a Conda environment with Python 3.6:
conda create -n multilingual_parsing python=3.6
Activate the Conda environment:
source activate multilingual_parsing
This project uses some new AllenNLP features which are not available in the official 0.8.4 release. As such, we build the unreleased 0.8.5 version from the master branch on GitHub. If there are any problems, try updating pip, setuptools and wheel as mentioned here.
cd
git clone https://github.com/allenai/allennlp.git
cd allennlp
pip install --editable .
Make the library available in $PYTHONPATH. From the multilingual_parsing directory:
cd /path/to/multilingual_parsing
export PYTHONPATH="$PWD/library"
# or permanently:
vim ~/.bashrc
export PYTHONPATH=/path/to/multilingual_parsing/library
source ~/.bashrc
You will need to obtain the original Faroese data for these experiments.
- Clone the original repository to somewhere in your file system, e.g. in your home directory:
cd $HOME && git clone https://github.com/ftyers/cross-lingual-parsing.git
- Change directory to your clone of this repo and create a symbolic link to the original data:
cd path/to/multilingual-parsing
ln -s /home/user/cross-lingual-parsing/data/ .
This should create a directory structure multilingual-parsing/data/.
- Download UD v2.2 treebanks.
./scripts/get_ud_treebank.sh
We follow the same process as the CoNLL 2018 shared task to create datasets with automatically predicted POS tags: we perform jack-knifing on the training set to predict POS tags, and POS tags on the development set are predicted with a model trained on the gold-standard training set.
./scripts/create_k_folds.sh
./scripts/train_with_cross_val.sh
./scripts/predict_with_cross_val.sh
# train a model on full training data and predict the dev set:
./scripts/train_source_tagger.sh monolingual
./scripts/predict_with_cross_val.sh dev
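For intuition, the jack-knifing step splits the training treebank into k folds and tags each fold with a model trained on the remaining folds, so the training data ends up with realistically noisy predicted tags. Below is a minimal sketch of the fold-splitting logic only; the file names, fold count and helper names are illustrative assumptions, not what scripts/create_k_folds.sh actually does.

```python
# Illustrative sketch of the k-fold split used for jack-knifing.
# File names and the fold count are assumptions for illustration.

def read_conllu_sentences(path):
    """Return a list of sentences, each a list of raw CoNLL-U lines."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                current.append(line)
            elif current:
                sentences.append(current)
                current = []
    if current:
        sentences.append(current)
    return sentences

def write_folds(sentences, k, prefix):
    """For each fold i, write the held-out sentences (to be tagged by a model
    trained on the rest) and the complementary training file."""
    for i in range(k):
        heldout = sentences[i::k]  # every k-th sentence, offset by i
        train = [s for j, s in enumerate(sentences) if j % k != i]
        for name, subset in (("heldout", heldout), ("train", train)):
            with open(f"{prefix}.fold{i}.{name}.conllu", "w", encoding="utf-8") as out:
                for sent in subset:
                    out.writelines(sent)
                    out.write("\n")

if __name__ == "__main__":
    sents = read_conllu_sentences("da_ddt-ud-train.conllu")  # example treebank
    write_folds(sents, k=10, prefix="da_ddt")
```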
- Train a source model on the source treebanks. The model_type argument can be either monolingual or multilingual and determines whether a monolingual or multilingual model is used. You will already have trained the source taggers in the previous step.
./scripts/train_source_parser.sh <model_type>
- Use a source model to predict annotations for the files translated into the source languages. The model_type argument can be either monolingual or multilingual and determines whether a monolingual or multilingual model is used.
# first supply tags
./scripts/predict_source_tagger.sh monolingual user
# parse the translations
./scripts/predict_source_parser.sh <model_type> user
- Project from the source languages to the target language (see the projection sketch after this list).
./scripts/project_all.sh <model_type>
- Take only the valid sentences.
./scripts/validate_all.sh <model_type> single
- Combine sentences where we have 3/4 valid projected sentences.
python utils/treebanks_union.py <model_type>
- Perform MST voting over the matching, validated sentences (see the voting sketch after this list).
./scripts/merge_all.sh <model_type>
- Validate the voted sentences.
./scripts/validate_all.sh <model_type> combined
- Check for double-headed sentences (see the structural-check sketch after this list).
./scripts/check_double_headed_all.sh <model_type>
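For intuition, here are illustrative sketches of three of the steps above. They are simplified assumptions about the underlying logic, not the code the scripts actually run.

Projection copies each source token's tag and head onto the aligned Faroese token. The sketch below assumes a 1-to-1 word alignment given as a dict of 1-based token indices; the data structures and function name are hypothetical.

```python
# Rough sketch of annotation projection through a 1-to-1 word alignment.
# `alignment` maps source token IDs to target token IDs (1-based, as in
# CoNLL-U); unaligned target tokens remain unannotated.

def project_sentence(source_tokens, alignment, target_length):
    """source_tokens: list of dicts with 'id', 'upos', 'head'.
    Returns a list of (upos, head) or None for each target position."""
    projected = [None] * (target_length + 1)  # index 0 unused (artificial root)
    for tok in source_tokens:
        tgt_id = alignment.get(tok["id"])
        if tgt_id is None:
            continue
        # The head is projected through the alignment too; a head aligned to
        # nothing yields an unattached token, filtered out by validation.
        tgt_head = 0 if tok["head"] == 0 else alignment.get(tok["head"])
        projected[tgt_id] = (tok["upos"], tgt_head)
    return projected[1:]
```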
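MST voting, roughly: each surviving projection votes for its arcs, and the merged tree is the maximum spanning arborescence over the vote-weighted graph. The sketch below uses networkx for the arborescence; that choice is an assumption for illustration, since the repository's merging scripts implement their own decoding.

```python
# Sketch of MST voting over several projected parses of the same sentence.
# Each parse is a list of heads (parse[i] = head of token i+1, 0 = root).

from collections import Counter
import networkx as nx

def vote_mst(parses, n_tokens):
    votes = Counter()
    for heads in parses:
        for dep, head in enumerate(heads, start=1):
            votes[(head, dep)] += 1            # arc head -> dependent

    graph = nx.DiGraph()
    graph.add_nodes_from(range(n_tokens + 1))  # node 0 is the artificial root
    for (head, dep), count in votes.items():
        graph.add_edge(head, dep, weight=count)

    tree = nx.maximum_spanning_arborescence(graph)
    merged = [0] * n_tokens
    for head, dep in tree.edges():
        merged[dep - 1] = head
    return merged

# Example: three projections voting over a three-token sentence.
print(vote_mst([[2, 0, 2], [2, 0, 2], [3, 0, 2]], 3))  # -> [2, 0, 2]
```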
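Finally, a sketch of the kind of structural checks the validation and double-headed steps perform: every token should have a head and reach the artificial root without cycles, and a sentence with more than one token attached directly to the root is flagged as double-headed. Again, this illustrates the idea rather than the scripts' actual checks.

```python
# Illustrative structural checks for a projected/merged sentence.
# heads[i] is the head of token i+1 (0 = artificial root, None = unattached).

def is_valid(heads):
    """True if every token has an in-range head and reaches the root
    without running into a cycle."""
    n = len(heads)
    if any(h is None or not 0 <= h <= n for h in heads):
        return False
    for i in range(1, n + 1):
        seen, node = set(), i
        while node != 0:
            if node in seen:          # cycle that never reaches the root
                return False
            seen.add(node)
            node = heads[node - 1]
    return True

def is_double_headed(heads):
    """True if more than one token is attached directly to the root."""
    return sum(1 for h in heads if h == 0) > 1

print(is_valid([2, 0, 2]))            # True
print(is_valid([2, 3, 2]))            # False: tokens 2 and 3 form a cycle
print(is_double_headed([0, 0, 2]))    # True: two tokens attached to the root
```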
Train tagging and parsing models on the synthetic target treebank. model_type can be either monolingual or multilingual. We train a tagger here so that we can produce silver tags for the final test set.
./scripts/train_target_tagger.sh <model_type>
./scripts/train_target_parser.sh <model_type>
Predict the Faroese test set with the various target taggers and parsers.
./scripts/predict_target_tagger.sh <model_type>
./scripts/predict_target_parser.sh <model_type>
./scripts/train_multi_target_tagger.sh <model_type>
./scripts/train_multi_target_parser.sh <model_type>
./scripts/predict_multi_target_tagger.sh <model_type>
./scripts/predict_multi_target_parser.sh <model_type>