Next round of adjustments #2

Draft

wants to merge 32 commits into base: master

Changes from all commits · 32 commits

Commits
a34629f
how to setup a venv
ulf1 Nov 14, 2020
ec746d5
relax version requirements
ulf1 Nov 14, 2020
68e8085
how to download data
ulf1 Nov 14, 2020
2babd51
dvc requires specific version range for networkx
ulf1 Nov 14, 2020
ec1bfaf
Unit tests with Github Actions
ulf1 Nov 14, 2020
a2d5905
install with extra_require
ulf1 Nov 14, 2020
94f23ea
split up stages
ulf1 Nov 14, 2020
d11455d
DVC configuration instructions
ulf1 Nov 16, 2020
5a2b28a
store model in a subfolder v1
ulf1 Nov 16, 2020
e2974e5
downgrade to 3.6
ulf1 Nov 17, 2020
06060f3
install python pkgs via requirements.txt to make use of the --use-fea…
ulf1 Nov 17, 2020
8fb449c
set odo.dwds.de as new SSH endpoint
ulf1 Nov 17, 2020
499bcfc
hash updated
ulf1 Nov 17, 2020
d30f624
comments about the packages' purpose
ulf1 Nov 17, 2020
9934160
Note
ulf1 Nov 17, 2020
3597e19
DVC remote changed
ulf1 Nov 17, 2020
601b83e
version downgrade
ulf1 Nov 17, 2020
3935713
shebang was missing
ulf1 Nov 28, 2020
9ca61d8
path to package corrected
ulf1 Nov 28, 2020
a44049f
main.py as pkg script
ulf1 Nov 28, 2020
b06d98b
consult the requirements file for dependencies
ulf1 Nov 29, 2020
23909ee
readme updated
ulf1 Nov 29, 2020
111b24b
systests scripts refactored for a wider range of shells
ulf1 Nov 29, 2020
c2170f6
move global vars to systests folder
ulf1 Nov 29, 2020
8197352
python coding examples
ulf1 Nov 29, 2020
9b9c1b6
distributed examples evenly
ulf1 Nov 29, 2020
38de6cc
add missing deps to setup.py
ulf1 Nov 29, 2020
efc395e
avoid ray>=1
ulf1 Nov 29, 2020
8dcf03b
use ray 0.8
ulf1 Nov 29, 2020
0cc3210
set zip_safe=True
ulf1 Nov 29, 2020
b5b65ad
try ray==1.0.0
ulf1 Nov 29, 2020
fa0d008
install reqs files with resolver
ulf1 Nov 29, 2020
2 changes: 2 additions & 0 deletions .dvc/config
@@ -0,0 +1,2 @@
['remote "imsnpars-data"']
    url = ssh://odo.dwds.de/home/imsnpars/v1
27 changes: 27 additions & 0 deletions .github/workflows/unittests.yml
@@ -0,0 +1,27 @@
name: Python application

on: [push]

jobs:
  build:

    runs-on: ubuntu-18.04

    steps:
    - uses: actions/checkout@v1
    - name: Set up Python 3.6
      uses: actions/setup-python@v1
      with:
        python-version: 3.6
    - name: Install dependencies
      run: |
        python setup.py develop -q
    - name: Download training data and the serialized model
      run: |
        dvc pull -r imsnpars-data
    - name: Lint with flake8
      run: |
        # build the exclude list from the non-comment entries of .gitignore
        flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')
    - name: Unit Test with pytest
      run: |
        pytest
214 changes: 185 additions & 29 deletions README.md
@@ -1,8 +1,12 @@
# IMSnPars
IMS Neural Dependency Parser is a re-implementation of the transition- and graph-based parsers described in [Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations](https://aclweb.org/anthology/Q16-1023).

If you are using this software, please cite the paper [DOI:10.18653/v1/P19-1012](http://doi.org/10.18653/v1/P19-1012) by Agnieszka Faleńska ([github](https://github.com/AgnieszkaFalenska/), [www](https://www.ims.uni-stuttgart.de/en/institute/team/Falenska/)) and Jonas Kuhn ([www](https://www.ims.uni-stuttgart.de/en/institute/team/Kuhn-00013/)).


## Releases

### acl2019 branch
The parser was developed for the paper [The (Non-)Utility of Structural Features in BiLSTM-based
Dependency Parsers](https://www.aclweb.org/anthology/P19-1012) (see the [acl2019 branch](https://github.com/AgnieszkaFalenska/IMSnPars/tree/acl2019) for all the paper-specific changes and analysis tools):

@@ -21,56 +25,208 @@
}
```

## Usage

### Install virtual env

```sh
python3.6 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt --use-feature=2020-resolver
pip install -r requirements-dev.txt --use-feature=2020-resolver
python setup.py develop -q
```
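
As a quick sanity check (a sketch, assuming the installation above succeeded), verify that the core dependencies import inside the venv:

```sh
python -c "import dynet, networkx, conllu; print('env ok')"
```
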
### Download training data and serialized model
> **Comment from the PR author:** DVC requires an SSH account.
>
> - Solution 1: a public www folder on odo
> - Solution 2: a persistent ID (data publication)
>
> Please contact the system administrator for a user account.

```sh
# initialize DVC in the git repo
# dvc init

# set the DVC endpoint
dvc remote add imsnpars-data ssh://odo.dwds.de/home/imsnpars/v1

# configure your credentials (your SSH username/password on odo.dwds.de)
dvc remote modify --local imsnpars-data port 22
dvc remote modify --local imsnpars-data user YOURNAME
dvc remote modify --local imsnpars-data password TOPSECRETPW

# download data
dvc pull -r imsnpars-data
```
### Training a new Model
There are two types of dependency parsers available;
the type is selected with the `--parser` flag:

- **Transition-based parser** (`TRANS`)
- **Graph-based parser** (`GRAPH`)

The training set must be a `.conllu` file,
and its path is given with the `--train` flag.
The `--save` flag specifies the file path
where the trained, serialized model is stored.

```sh
imsnparser.py \
    --parser [TRANS or GRAPH] \
    --train [train_file] \
    --save [model_file]
```

For example, given the [HDT treebank dataset](#download-training-data-and-serialized-model) downloaded above,
we can train a new model with the transition-based parser (`TRANS`):

```sh
imsnparser.py \
    --parser TRANS \
    --train=data/hdt/train.conllu \
    --save=data/model/my-new-model-v1.2.3
```

### Inference with a Pre-trained Model
For evaluation as well as production purposes,
we can load a model trained as in the [previous section](#training-a-new-model).
Both the input data (`--test`) and the output data (`--output`) are `.conllu` files.

```sh
imsnparser.py \
    --parser [TRANS or GRAPH] \
    --model [model_file] \
    --test [test_file] \
    --output [output_file]
```

Analogous to the previous example,
we can run inference on the HDT test set with our pre-trained model:

```sh
mkdir -p data/output
imsnparser.py \
    --parser TRANS \
    --model=data/model/my-new-model-v1.2.3 \
    --test=data/hdt/test.conllu \
    --output=data/output/predicted.conllu
```

### Other settings
The parser supports many other options. All of them can be seen after running:
```sh
imsnparser.py --parser TRANS --help
imsnparser.py --parser GRAPH --help
```

### Python Usage
*IMSnPars* requires the CoNLL-U fields `'form', 'lemma', 'xpos'` (TIGER tagset) as its input sequence.

The following examples require the HDT treebank; see the [download](#download-training-data-and-serialized-model) section.

```py
import imsnpars.configure
import conllu

# load the parser
parser = imsnpars.configure.create_parser("data/model")

# open the CoNLL-U file
fp = open("data/hdt/test.conllu", "r")

# and create a generator
sentences = conllu.parse_incr(fp, fields=conllu.parser.DEFAULT_FIELDS)

# loop over the generator from here
sent = next(sentences)
tmp = imsnpars.configure.parse(parser, sent)

for token in tmp:
    print(token['head'])

# close the file pointer
fp.close()
```

To avoid too many I/O operations,
it is advisable to load the whole `.conllu` file into RAM:

```py
import imsnpars.configure
import conllu

# read the whole file as List[TokenList] into the RAM
with open("data/hdt/test.conllu", "r") as fp:
    sentences = [sent for sent in conllu.parse_incr(
        fp, fields=conllu.parser.DEFAULT_FIELDS)]

print(f"#num examples {len(sentences)}")

# parse all examples
parser = imsnpars.configure.create_parser("data/model")
parsed = [imsnpars.configure.parse(parser, sent) for sent in sentences]

# check
for token in parsed[123]:
    print(token['head'])
```


### Distribute with Ray.io
Due to the large size of pre-trained neural network models,
it is reasonable to distribute large batches across nodes.
The trade-off is batch size versus the overhead of loading the model on each worker.

```py
import ray
import psutil
import imsnpars.configure
import conllu
from typing import List
from conllu.models import TokenList
import math

# start ray
num_cpus = max(1, int(psutil.cpu_count() * 0.8))
ray.init(num_cpus=num_cpus)

@ray.remote
def imsnparser_ray(sentences: List[TokenList],
                   model_path: str = "data/model") -> List[TokenList]:
    # load parser
    parser = imsnpars.configure.create_parser(model_path)
    # parse all sentences
    parsed = [imsnpars.configure.parse(parser, sent) for sent in sentences]
    # done
    return parsed

# read the whole file as List[TokenList] into the RAM
with open("data/hdt/test.conllu", "r") as fp:
    sentences = [sent for sent in conllu.parse_incr(
        fp, fields=conllu.parser.DEFAULT_FIELDS)]

# distributed batches
batch_size = math.ceil(len(sentences) / num_cpus)

# start computation
future_batches = [
    imsnparser_ray.remote(
        sentences[(i * batch_size):((i + 1) * batch_size)],
        model_path="data/model")
    for i in range(num_cpus)]

# wait for the results
parsed_batches = ray.get(future_batches)
```
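
The per-worker batches can then be flattened back into a single list; a short sketch:

```py
# flatten the per-worker batches back into one List[TokenList]
parsed = [sent for batch in parsed_batches for sent in batch]
assert len(parsed) == len(sentences)

# release the worker processes
ray.shutdown()
```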

### Tests

*IMSnPars* comes with four testing scripts to check if everything works fine:
1. `systests/test_trans_parser.sh` -- trains a new transition-based parser on small fake data and uses this model for prediction
2. `systests/test_graph_parser.sh` -- trains a new graph-based parser on small fake data and uses this model for prediction
3. `systests/test_all_trans_parsers.sh` -- trains multiple transition-based models with different sets of options
4. `systests/test_all_graph_parsers.sh` -- trains multiple graph-based models with different sets of options

Please make sure that the software is installed as a Python package, e.g. run `python setup.py develop -q`.

We recommend running the first two scripts before using *IMSnPars* for other purposes (both tests take less than a minute). Both scripts should end with a message that everything went fine. The transition-based parser achieves LAS=64.61 on the fake data; the graph-based one achieves LAS=66.47.
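
A minimal invocation from the repository root might look like this (a sketch, assuming the scripts take no arguments):

```sh
# install as a package first, then run the two quick system tests
python setup.py develop -q
bash systests/test_trans_parser.sh
bash systests/test_graph_parser.sh
```
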
2 changes: 1 addition & 1 deletion data/model/model.args.dvc
@@ -1,3 +1,3 @@
outs:
- md5: a31aa3da85c9f2c63d1bc522ef2307d6
path: model.args
2 changes: 1 addition & 1 deletion data/model/model.params.dvc
@@ -1,3 +1,3 @@
outs:
- md5: 30766ff2c5620beb6f2d0c853f04d36f
path: model.params
6 changes: 3 additions & 3 deletions imsnpars/configure.py
@@ -1,5 +1,5 @@
import argparse
from pathlib import Path
import imsnpars.nparser.options as parser_options
import imsnpars.tools.utils as parser_utils
from imsnpars.nparser import builder as parser_builder
@@ -8,7 +8,7 @@


def create_parser(model_path):
    model = Path(f"{model_path}/model").as_posix()
    argParser = argparse.ArgumentParser(add_help=False)
    argParser.add_argument(
        "--parser", choices=["GRAPH", "TRANS"], required=True
@@ -23,7 +23,7 @@ def create_parser(model_path):

    opts = parser_utils.NParserOptions()
    parser_options.fillParserOptions(args, opts)
    opts.load(Path(f"{model_path}/model.args").as_posix())

    parser = parser_builder.buildParser(opts)
    parser.load(args.model)
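
As a side note, the switch to `Path(f"{model_path}/model")` lets `create_parser` accept a plain string as well as a `pathlib.Path`; a sketch, assuming the model lives at `data/model`:

```py
from pathlib import Path
import imsnpars.configure

parser_a = imsnpars.configure.create_parser("data/model")        # plain str works
parser_b = imsnpars.configure.create_parser(Path("data/model"))  # Path works too
```
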
3 changes: 3 additions & 0 deletions requirements-dev.txt
@@ -0,0 +1,3 @@
pytest
flake8
autoflake
18 changes: 18 additions & 0 deletions requirements.txt
@@ -0,0 +1,18 @@
# neural network model
dynet>=2.0.0
networkx<2.5,>=2.1

# preprocessing
conllu>=3.1.1

# data management
dvc>=1.6.6
paramiko>=2.7.2

# computing
ray>=1.0.0
psutil>=5.7.2

# python tools
boltons>=20.2.1
Click>=7.1.2
8 changes: 0 additions & 8 deletions scripts/get_global_vars.sh

This file was deleted.

17 changes: 8 additions & 9 deletions imsnpars/main.py → scripts/imsnparser.py
@@ -1,17 +1,16 @@
#!/usr/bin/env python3
"""
Created on 18.08.2017 @author: falensaa
2020-11-27 imsnpars.main.py renamed as scripts/imsnparser.py (by @ulf1)
"""
import logging
import sys
import random
import argparse

from imsnpars.tools import utils, evaluator, training
from imsnpars.nparser import builder
from imsnpars.nparser import options

def buildParserFromArgs():
    argParser = argparse.ArgumentParser(description="""IMS Neural Parser""", add_help=False)