Commit 2b2170a (1 parent: 393358e), committed by Jesse Myrberg on Jul 27, 2017. Showing 7 changed files with 153 additions and 35 deletions.
# finnlem

**finnlem** is a [neural network](https://en.wikipedia.org/wiki/Artificial_neural_network) based [lemmatizer](https://en.wikipedia.org/wiki/Lemmatisation) model for the [Finnish language](https://en.wikipedia.org/wiki/Finnish_language).

A trained neural network can map given Finnish words into their base form:
```
Original        Base Form
'koira'     --> 'koira'
'koiran'    --> 'koira'
'koiraa'    --> 'koira'
'koiraksi'  --> 'koira'
'koirasta'  --> 'koira'
```
The model is a [tensorflow](https://www.tensorflow.org) implementation of a [sequence-to-sequence](https://arxiv.org/abs/1406.1078) recurrent neural network model. This repository contains the code and data needed for training and making predictions with the model. The [datasets](src/data/datasets) contain over 2M samples in total.

## Features
* Easy-to-use Python wrappers for sequence-to-sequence modeling
* Automatic session handling, model checkpointing and logging
* Support for tensorboard
* Sequence-to-sequence model features: [Bahdanau](https://arxiv.org/abs/1409.0473) and [Luong](https://arxiv.org/abs/1508.04025) attention, residual connections, dropout, beam search decoding, ...

## Installation
You should have the latest versions (as of 7/2017) of:
* keras
* nltk
* numpy
* pandas
* tensorflow (1.3.0 or greater, with CUDA 8.0 and cuDNN 6.0 or greater)
* unidecode

After this, clone this repository to your local machine.

## Example usage

### Python

The following is a simple example of using some of the features in the Python API. More detailed descriptions of the functions and parameters are available in the source code documentation.

#### 1. Fit a dictionary with default parameters
```python
from dictionary import Dictionary

# Documents whose characters are fitted into the dictionary
docs = ['abcdefghijklmnopqrstuvwxyz','åäö','DNP','#-']

# Create a new Dictionary object
d = Dictionary()

# Fit characters of each document
d.fit(docs)

# Save for later usage
d.save('./data/dictionaries/lemmatizer.dict')
```
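Conceptually, fitting a character-level dictionary just collects every character seen in the training documents and assigns each one an integer id. The following standalone sketch illustrates the idea with the same documents; it is illustrative only, not the repo's actual `Dictionary` implementation (which also handles special tokens and frequency pruning).

```python
# Standalone sketch of character-level dictionary fitting:
# collect every distinct character and assign it an integer id.
docs = ['abcdefghijklmnopqrstuvwxyz', 'åäö', 'DNP', '#-']

chars = sorted(set(''.join(docs)))
char_to_id = {c: i for i, c in enumerate(chars)}

print(len(char_to_id))  # 34 distinct characters
```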

#### 2. Create and train a Seq2Seq model with default parameters
```python
from model_wrappers import Seq2Seq

# Create a new model
model = Seq2Seq(model_dir='./data/models/lemmatizer',
                dict_path='./data/dictionaries/lemmatizer.dict')

# Create some documents to train on
source_docs = ['koira','koiran','koiraa','koirana','koiraksi','koirassa']*128
target_docs = ['koira','koira','koira','koira','koira','koira']*128

# Train 100 batches, saving a checkpoint every 25th batch
for i in range(100):
    loss, global_step = model.train(source_docs, target_docs, save_every_n_batch=25)
    print('Global step %d loss: %f' % (global_step, loss))
```

#### 3. Make predictions on a test set
```python
test_docs = ['koiraa','koirana','koiraksi']
pred_docs = model.decode(test_docs)
print(pred_docs) # --> [['koira'],['koira'],['koira']]
```

### Command line

The following is a slightly more involved example of using the command line to train and predict from files.

#### 1. Fit a dictionary with default parameters
```
python -m dict_train ^
    --dict-save-path ./data/dictionaries/lemmatizer.dict ^
    --dict-train-path ./data/dictionaries/lemmatizer.vocab
```
The dictionary train path file(s) should contain one document per line ([example](src/data/dictionaries/lemmatizer.vocab)).
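A minimal way to produce a file in this layout, using a few sample words (the real `lemmatizer.vocab` is far larger; `sample.vocab` here is a hypothetical file name):

```python
# Write a tiny dictionary training file: one document per line
docs = ['koira', 'koiran', 'koiraa', 'koiraksi']
with open('sample.vocab', 'w', encoding='utf-8') as f:
    f.write('\n'.join(docs) + '\n')

# Reading it back line by line recovers the original documents
with open('sample.vocab', encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)  # ['koira', 'koiran', 'koiraa', 'koiraksi']
```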

#### 2. Create and train a Seq2Seq model with default parameters
```
python -m model_train ^
    --model-dir ./data/models/lemmatizer ^
    --dict-path ./data/dictionaries/lemmatizer.dict ^
    --train-data-path ./data/datasets/lemmatizer_train.csv ^
    --validation-data-path ./data/datasets/lemmatizer_validation.csv ^
    --validate-n-rows 5000
```
A second model can also be trained with custom parameters:
```
python -m model_train ^
    --model-dir ./data/models/lemmatizer2 ^
    --dict-path ./data/dictionaries/lemmatizer.dict ^
    --train-data-path ./data/datasets/lemmatizer_train.csv ^
    --optimizer 'adam' ^
    --learning-rate 0.0001 ^
    --dropout-rate 0.2 ^
    --batch-size 128 ^
    --file-batch-size 8192 ^
    --max-file-pool-size 50 ^
    --shuffle-files True ^
    --shuffle-file-batches True ^
    --save-every-n-batch 500 ^
    --validate-every-n-batch 100 ^
    --validation-data-path ./data/datasets/lemmatizer_validation.csv ^
    --validate-n-rows 5000
```
The model train and validation data path file(s) should contain one source and target document per line, separated by a comma ([example](src/data/datasets/lemmatizer_validation.csv)).
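A file in this layout can be produced with the standard `csv` module; the pairs and the `sample_train.csv` file name below are illustrative only:

```python
import csv

# Write a tiny train/validation file: one "source,target" pair per line
pairs = [('koiran', 'koira'), ('koiraa', 'koira'), ('koiraksi', 'koira')]
with open('sample_train.csv', 'w', encoding='utf-8', newline='') as f:
    csv.writer(f).writerows(pairs)

# Each line now holds a comma-separated source/target pair
with open('sample_train.csv', encoding='utf-8') as f:
    rows = f.read().splitlines()
print(rows)  # ['koiran,koira', 'koiraa,koira', 'koiraksi,koira']
```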

#### 3. Make predictions on test set
```
python -m model_decode ^
    --model-dir ./data/models/lemmatizer ^
    --source-data-path ./data/datasets/lemmatizer_test.csv ^
    --decoded-data-path ./data/decoded/lemmatizer_decoded.csv
```
The model source data path file(s) should contain either:
* one source document per line, or
* one source and target document per line, separated by a comma ([example](src/data/datasets/lemmatizer_test.csv))
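Both layouts can be handled by splitting each line on the first comma and treating the target as optional. A sketch of such a reader (`read_source_lines` is a hypothetical helper, not the repo's actual loader):

```python
def read_source_lines(lines):
    """Parse lines of the form 'source' or 'source,target'.

    Returns (source, target) tuples; target is None when absent.
    """
    rows = []
    for line in lines:
        parts = line.split(',', 1)  # split on the first comma only
        source = parts[0]
        target = parts[1] if len(parts) > 1 else None
        rows.append((source, target))
    return rows

print(read_source_lines(['koiraa,koira', 'koirana']))
# [('koiraa', 'koira'), ('koirana', None)]
```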

## Acknowledgements and references
* [JayParks/tf-seq2seq](https://github.com/JayParks/tf-seq2seq): Example sequence-to-sequence implementation in tensorflow
* [Omorfi](https://github.com/flammie/omorfi): Finnish open source morphology tool
* [FinnTreeBank](http://www.ling.helsinki.fi/kieliteknologia/tutkimus/treebank/): Source for datasets
* [Finnish Dependency Parser](http://bionlp.utu.fi/finnish-parser.html): Source for datasets
# List of available command line parameters

## dict_train

## model_train

## model_decode
# List of relevant Python API objects, methods and parameters

## Dictionary

## Seq2Seq model