Next round of adjustments #2

Draft

wants to merge 32 commits into base: master

Changes from all commits · 32 commits

Commits
a34629f
how to setup a venv
ulf1 Nov 14, 2020
ec746d5
relax version requirements
ulf1 Nov 14, 2020
68e8085
how to download data
ulf1 Nov 14, 2020
2babd51
dvc requires specific version range for networkx
ulf1 Nov 14, 2020
ec1bfaf
Unit tests with Github Actions
ulf1 Nov 14, 2020
a2d5905
install with extra_require
ulf1 Nov 14, 2020
94f23ea
split up stages
ulf1 Nov 14, 2020
d11455d
DVC configuration instructions
ulf1 Nov 16, 2020
5a2b28a
store model in a subfolder v1
ulf1 Nov 16, 2020
e2974e5
downgrade to 3.6
ulf1 Nov 17, 2020
06060f3
install python pkgs via requirements.txt to make use of the --use-fea…
ulf1 Nov 17, 2020
8fb449c
set odo.dwds.de as new SSH endpoint
ulf1 Nov 17, 2020
499bcfc
hash updated
ulf1 Nov 17, 2020
d30f624
comments about the packages' purpose
ulf1 Nov 17, 2020
9934160
Note
ulf1 Nov 17, 2020
3597e19
DVC remote changed
ulf1 Nov 17, 2020
601b83e
version downgrade
ulf1 Nov 17, 2020
3935713
shebang was missing
ulf1 Nov 28, 2020
9ca61d8
path to package corrected
ulf1 Nov 28, 2020
a44049f
main.py as pkg script
ulf1 Nov 28, 2020
b06d98b
consult the requirements file for dependencies
ulf1 Nov 29, 2020
23909ee
readme updated
ulf1 Nov 29, 2020
111b24b
systests scripts refactored for a wider range of shells
ulf1 Nov 29, 2020
c2170f6
move global vars to systests folder
ulf1 Nov 29, 2020
8197352
python coding examples
ulf1 Nov 29, 2020
9b9c1b6
distributed examples evenly
ulf1 Nov 29, 2020
38de6cc
add missing deps to setup.py
ulf1 Nov 29, 2020
efc395e
avoid ray>=1
ulf1 Nov 29, 2020
8dcf03b
use ray 0.8
ulf1 Nov 29, 2020
0cc3210
set zip_safe=True
ulf1 Nov 29, 2020
b5b65ad
try ray==1.0.0
ulf1 Nov 29, 2020
fa0d008
install reqs files with resolver
ulf1 Nov 29, 2020
2 changes: 2 additions & 0 deletions .dvc/config
@@ -0,0 +1,2 @@
['remote "imsnpars-data"']
    url = ssh://odo.dwds.de/home/imsnpars/v1
27 changes: 27 additions & 0 deletions .github/workflows/unittests.yml
@@ -0,0 +1,27 @@
name: Python application

on: [push]

jobs:
  build:

    runs-on: ubuntu-18.04

    steps:
    - uses: actions/checkout@v1
    - name: Set up Python 3.6
      uses: actions/setup-python@v1
      with:
        python-version: 3.6
    - name: Install dependencies
      run: |
        python setup.py develop -q
    - name: Download training data and the serialized model
      run: |
        dvc pull -r imsnpars-data
    - name: Lint with flake8
      run: |
        # build the exclude list from the non-comment entries of .gitignore
        flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')
    - name: Unit Test with pytest
      run: |
        pytest
214 changes: 185 additions & 29 deletions README.md
@@ -1,8 +1,12 @@
# IMSnPars
IMS Neural Dependency Parser is a re-implementation of the transition- and graph-based parsers described in [Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations](https://aclweb.org/anthology/Q16-1023).

If you are using this software, please cite the paper [DOI:10.18653/v1/P19-1012](http://doi.org/10.18653/v1/P19-1012) by Agnieszka Faleńska ([github](https://github.com/AgnieszkaFalenska/), [www](https://www.ims.uni-stuttgart.de/en/institute/team/Falenska/)) and Jonas Kuhn ([www](https://www.ims.uni-stuttgart.de/en/institute/team/Kuhn-00013/)).


## Releases

### acl2019 branch
The parser was developed for the paper [The (Non-)Utility of Structural Features in BiLSTM-based
Dependency Parsers](https://www.aclweb.org/anthology/P19-1012) (see the [acl2019 branch](https://github.com/AgnieszkaFalenska/IMSnPars/tree/acl2019) for all the paper-specific changes and analysis tools):

@@ -21,56 +25,208 @@
}
```

## Usage

### Install virtual env

```sh
python3.6 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt --use-feature=2020-resolver
pip install -r requirements-dev.txt --use-feature=2020-resolver
python setup.py develop -q
```
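
As a quick sanity check (a sketch, assuming the installation above succeeded), verify that the core dependencies import inside the venv:

```sh
python -c "import dynet, networkx, conllu; print('env ok')"
```
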
### Download training data and serialized model
> **Comment from the PR author:** DVC requires an SSH account.
>
> - Solution 1: a public www folder on odo
> - Solution 2: a persistent ID (data publication)
>
> Please contact the system administrator for a user account.

```sh
# initialize DVC in the git repo
# dvc init

# set the DVC endpoint
dvc remote add imsnpars-data ssh://odo.dwds.de/home/imsnpars/v1

# configure your credentials (your SSH username/password on odo.dwds.de)
dvc remote modify --local imsnpars-data port 22
dvc remote modify --local imsnpars-data user YOURNAME
dvc remote modify --local imsnpars-data password TOPSECRETPW

# download data
dvc pull -r imsnpars-data
```
### Training a new Model
There are two types of dependency parsers available;
the type is selected with the `--parser` flag:

- **Transition-based parser** (`TRANS`)
- **Graph-based parser** (`GRAPH`)

The training set must be a `.conllu` file,
and its path is given with the `--train` flag.
The `--save` flag specifies the file path
where the trained, serialized model is stored.

```sh
imsnparser.py \
    --parser [TRANS or GRAPH] \
    --train [train_file] \
    --save [model_file]
```

For example, given the [HDT treebank dataset](#download-training-data-and-serialized-model) downloaded above,
we can train a new model with the transition-based parser (`TRANS`):

```sh
imsnparser.py \
    --parser TRANS \
    --train=data/hdt/train.conllu \
    --save=data/model/my-new-model-v1.2.3
```

### Inference with a Pre-trained Model
For evaluation as well as production purposes,
we can load a model trained as in the [previous section](#training-a-new-model).
Both the input data (`--test`) and the output data (`--output`) are `.conllu` files.

```sh
imsnparser.py \
    --parser [TRANS or GRAPH] \
    --model [model_file] \
    --test [test_file] \
    --output [output_file]
```

Analogous to the previous example,
we can run inference on the HDT test set with our pre-trained model:

```sh
mkdir -p data/output
imsnparser.py \
    --parser TRANS \
    --model=data/model/my-new-model-v1.2.3 \
    --test=data/hdt/test.conllu \
    --output=data/output/predicted.conllu
```

### Other settings
The parser supports many other options. All of them can be seen after running:
```sh
imsnparser.py --parser TRANS --help
imsnparser.py --parser GRAPH --help
```

### Python Usage
*IMSnPars* requires the CoNLL-U fields `'form', 'lemma', 'xpos'` (TIGER tagset) as its input sequence.

The following examples require the HDT treebank; see the [download](#download-training-data-and-serialized-model) section.

```py
import imsnpars.configure
import conllu

# load the parser
parser = imsnpars.configure.create_parser("data/model")

# open the CoNLL-U file
fp = open("data/hdt/test.conllu", "r")

# and create a generator
sentences = conllu.parse_incr(fp, fields=conllu.parser.DEFAULT_FIELDS)

# loop over the generator from here
sent = next(sentences)
tmp = imsnpars.configure.parse(parser, sent)

for token in tmp:
    print(token['head'])

# close the file pointer
fp.close()
```

To avoid too many I/O operations,
it is advisable to load the whole `.conllu` file into RAM:

```py
import imsnpars.configure
import conllu

# read the whole file as List[TokenList] into the RAM
with open("data/hdt/test.conllu", "r") as fp:
    sentences = [sent for sent in conllu.parse_incr(
        fp, fields=conllu.parser.DEFAULT_FIELDS)]

print(f"#num examples {len(sentences)}")

# parse all examples
parser = imsnpars.configure.create_parser("data/model")
parsed = [imsnpars.configure.parse(parser, sent) for sent in sentences]

# check
for token in parsed[123]:
    print(token['head'])
```


### Distribute with Ray.io
Due to the large size of pre-trained neural network models,
it is reasonable to distribute large batches across nodes.
The trade-off is batch size versus the overhead of loading the model on each worker.

```py
import ray
import psutil
import imsnpars.configure
import conllu
from typing import List
from conllu.models import TokenList
import math

# start ray
num_cpus = max(1, int(psutil.cpu_count() * 0.8))
ray.init(num_cpus=num_cpus)

@ray.remote
def imsnparser_ray(sentences: List[TokenList],
                   model_path: str = "data/model") -> List[TokenList]:
    # load parser
    parser = imsnpars.configure.create_parser(model_path)
    # parse all sentences
    parsed = [imsnpars.configure.parse(parser, sent) for sent in sentences]
    # done
    return parsed

# read the whole file as List[TokenList] into the RAM
with open("data/hdt/test.conllu", "r") as fp:
    sentences = [sent for sent in conllu.parse_incr(
        fp, fields=conllu.parser.DEFAULT_FIELDS)]

# distributed batches
batch_size = math.ceil(len(sentences) / num_cpus)

# start computation
future_batches = [
    imsnparser_ray.remote(
        sentences[(i * batch_size):((i + 1) * batch_size)],
        model_path="data/model")
    for i in range(num_cpus)]

# wait for the results
parsed_batches = ray.get(future_batches)
```
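
The per-worker batches can then be flattened back into a single list; a short sketch:

```py
# flatten the per-worker batches back into one List[TokenList]
parsed = [sent for batch in parsed_batches for sent in batch]
assert len(parsed) == len(sentences)

# release the worker processes
ray.shutdown()
```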

### Tests

*IMSnPars* comes with four testing scripts to check if everything works fine:
1. `systests/test_trans_parser.sh` -- trains a new transition-based parser on small fake data and uses this model for prediction
2. `systests/test_graph_parser.sh` -- trains a new graph-based parser on small fake data and uses this model for prediction
3. `systests/test_all_trans_parsers.sh` -- trains multiple transition-based models with different sets of options
4. `systests/test_all_graph_parsers.sh` -- trains multiple graph-based models with different sets of options

Please make sure that the software is installed as a Python package, e.g. run `python setup.py develop -q`.

We recommend running the first two scripts before using *IMSnPars* for other purposes (both tests take less than a minute). Both scripts should end with a message that everything went fine. The transition-based parser achieves LAS=64.61 on the fake data; the graph-based one achieves LAS=66.47.
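
A minimal invocation from the repository root might look like this (a sketch, assuming the scripts take no arguments):

```sh
# install as a package first, then run the two quick system tests
python setup.py develop -q
bash systests/test_trans_parser.sh
bash systests/test_graph_parser.sh
```
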
2 changes: 1 addition & 1 deletion data/model/model.args.dvc
@@ -1,3 +1,3 @@
outs:
- md5: a31aa3da85c9f2c63d1bc522ef2307d6
path: model.args
2 changes: 1 addition & 1 deletion data/model/model.params.dvc
@@ -1,3 +1,3 @@
outs:
- md5: 30766ff2c5620beb6f2d0c853f04d36f
path: model.params
6 changes: 3 additions & 3 deletions imsnpars/configure.py
@@ -1,5 +1,5 @@
import argparse
from pathlib import Path
import imsnpars.nparser.options as parser_options
import imsnpars.tools.utils as parser_utils
from imsnpars.nparser import builder as parser_builder
@@ -8,7 +8,7 @@


def create_parser(model_path):
    model = Path(f"{model_path}/model").as_posix()
    argParser = argparse.ArgumentParser(add_help=False)
    argParser.add_argument(
        "--parser", choices=["GRAPH", "TRANS"], required=True
@@ -23,7 +23,7 @@ def create_parser(model_path):

    opts = parser_utils.NParserOptions()
    parser_options.fillParserOptions(args, opts)
    opts.load(Path(f"{model_path}/model.args").as_posix())

    parser = parser_builder.buildParser(opts)
    parser.load(args.model)
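
As a side note, the switch to `Path(f"{model_path}/model")` lets `create_parser` accept a plain string as well as a `pathlib.Path`; a sketch, assuming the model lives at `data/model`:

```py
from pathlib import Path
import imsnpars.configure

parser_a = imsnpars.configure.create_parser("data/model")        # plain str works
parser_b = imsnpars.configure.create_parser(Path("data/model"))  # Path works too
```
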
3 changes: 3 additions & 0 deletions requirements-dev.txt
@@ -0,0 +1,3 @@
pytest
flake8
autoflake
18 changes: 18 additions & 0 deletions requirements.txt
@@ -0,0 +1,18 @@
# neural network model
dynet>=2.0.0
networkx<2.5,>=2.1

# preprocessing
conllu>=3.1.1

# data management
dvc>=1.6.6
paramiko>=2.7.2

# computing
ray>=1.0.0
psutil>=5.7.2

# python tools
boltons>=20.2.1
Click>=7.1.2
8 changes: 0 additions & 8 deletions scripts/get_global_vars.sh

This file was deleted.

17 changes: 8 additions & 9 deletions imsnpars/main.py → scripts/imsnparser.py
@@ -1,17 +1,16 @@
#!/usr/bin/env python3
"""
Created on 18.08.2017 @author: falensaa
2020-11-27 imsnpars.main.py renamed as scripts/imsnparser.py (by @ulf1)
"""
import logging
import sys
import random
import argparse

from imsnpars.tools import utils, evaluator, training
from imsnpars.nparser import builder
from imsnpars.nparser import options

def buildParserFromArgs():
    argParser = argparse.ArgumentParser(description="""IMS Neural Parser""", add_help=False)