Added datasets and trained model. Updated documentation.
Showing 38 changed files with 2,970,675 additions and 37 deletions.
.gitignore
@@ -1,8 +1,3 @@
-# Own
-other/
-data/
-src/training/
-
# Eclipse IDE
.settings/
.project
README.md
@@ -4,12 +4,12 @@
A trained neural network can map given Finnish words into their base form with quite reasonable accuracy. These are examples of the model output:
```
-[ORIGINAL] [BASE FORM]
-Kiinalaisessa --> kiinalainen
-osinkotulojen --> osinko#tulo
-Rajoittavalla --> rajoittaa
-multimediaopetusmateriaalia --> multi#media#opetus#materiaali
-ei-rasistisella --> ei-rasistinen
+[ORIGINAL] --> [BASE FORM]
+Kiinalaisessa --> kiinalainen
+osinkotulojen --> osinko#tulo
+Rajoittavalla --> rajoittaa
+multimediaopetusmateriaalia --> multi#media#opetus#materiaali
+ei-rasistisella --> ei-rasistinen
```
The model is a [tensorflow](https://www.tensorflow.org) implementation of a [sequence-to-sequence](https://arxiv.org/abs/1406.1078) (Seq2Seq) recurrent neural network model.
This repository contains the code and data needed for training and making predictions with the model. The [datasets](src/data/datasets) contain over 2M samples in total.
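Since the package is a character-level lemmatizer, each training pair boils down to two character sequences. Below is a purely illustrative sketch of how one of the example pairs above could be framed for a character-level Seq2Seq model; the actual preprocessing and special tokens used by this package may differ:

```
# Illustrative only: one lemmatization pair viewed as character sequences.
# '#' marks compound-word boundaries in the base form (as in the examples above);
# '<EOS>' is a hypothetical end-of-sequence token, not necessarily the one this package uses.
source = list("osinkotulojen")             # encoder input, one character per step
target = list("osinko#tulo") + ["<EOS>"]   # decoder target sequence

print(source)  # ['o', 's', 'i', 'n', 'k', 'o', 't', 'u', 'l', 'o', 'j', 'e', 'n']
print(target)  # ['o', 's', 'i', 'n', 'k', 'o', '#', 't', 'u', 'l', 'o', '<EOS>']
```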
@@ -34,14 +34,14 @@ After this, clone this repository to your local machine.
## Example usage

-Three steps are required in order to make predictions with a trained model:
+Three steps are required in order to get from zero to making predictions with a trained model:

1. **Dictionary training**: A dictionary is created from training documents, which are processed the same way as the Seq2Seq model inputs later on.
The dictionary handles the vocabulary/integer mappings required by Seq2Seq (see the sketch after this list).
2. **Model training**: The Seq2Seq model is trained in batches on training documents that contain a source and a target.
3. **Model decoding**: Unseen source documents are fed into the Seq2Seq model, which predicts the target.
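To make step 1 concrete, the following generic sketch shows the kind of vocabulary/integer mapping such a dictionary provides for character-level documents. The reserved ids and names here are hypothetical; the actual Dictionary implementation in this package may differ:

```
# Generic sketch of a character/integer mapping (not this package's actual Dictionary class).
docs = ['koira', 'koiran', 'koiraa']

# Build the vocabulary from the training documents.
chars = sorted({c for doc in docs for c in doc})
char2id = {c: i + 2 for i, c in enumerate(chars)}   # ids 0 and 1 reserved, e.g. for padding / end-of-sequence
id2char = {i: c for c, i in char2id.items()}

# Map documents to integer sequences for the Seq2Seq model, and back again.
encoded = [[char2id[c] for c in doc] for doc in docs]
decoded = [''.join(id2char[i] for i in seq) for seq in encoded]
assert decoded == docs
```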

-### Python ([See list of available methods here](src/python_api.md))
+### Python ([See list of relevant Python API classes](doc/python_api.md))

The following is a simple example of using some of the features in the Python API.
More detailed descriptions of the functions and parameters are available in the source code documentation.
@@ -69,11 +69,11 @@ from model_wrappers import Seq2Seq

# Create a new model
model = Seq2Seq(model_dir='./data/models/lemmatizer',
-                dict_path='./data/dictionaries/lemmatizer.dict'))
+                dict_path='./data/dictionaries/lemmatizer.dict')

# Create some documents to train on
source_docs = ['koira','koiran','koiraa','koirana','koiraksi','koirassa']*128
-target_docs = ['koira','koira','koira','koira','koira','koira','koira']*128
+target_docs = ['koira','koira','koira','koira','koira','koira']*128

# Train 100 batches, save checkpoint every 25th batch
for i in range(100):
@@ -89,9 +89,9 @@ print(pred_docs) # --> [['koira'],['koira'],['koira']]
```

-### Command line ([See list of available commands here](src/commands.md))
+### Command line ([See list of available commands here](doc/commands.md))

-The following is a bit more complicated example of using the command line to train and predict from files.
+The following demonstrates the use of the command line for training and predicting from files.

#### 1. Dictionary training - fit a dictionary with default parameters
```
@@ -125,15 +125,16 @@ The model test data path file(s) should contain either:
## Extensions
* To use tensorboard, run the command ```python -m tensorflow.tensorboard --logdir=model_dir```,
where ```model_dir``` is the Seq2Seq model checkpoint folder.
-* The model was originally created for summarizing Finnish news, by using news contents as the sources, and news titles as the targets.
+* The model was originally created for summarizing the Finnish news, by using news contents as the sources, and news titles as the targets.
This proved to be quite a difficult task due to the rich morphology of the Finnish language and a lack of computational resources. My first
-approach to tackle this was to use the base forms for each word, which is what this package can do. In the end, using this model to convert
-every word to their base form would've taken too long.
+approach for tackling the morphology was to use the base forms for each word, which is what the model in this package does by default. However,
+using this model to convert every word to its base form ended up being too slow to be used as an input for the second model in real time.

-In the end, I decided to use the [Finnish SnowballStemmer from nltk](http://www.nltk.org/_modules/nltk/stem/snowball.html), and train
-the model with 100k vocabulary. After 36 hours of training with loss decreasing very slowly, I decided to keep this package as the character-level.
+In the end, I decided to try the [Finnish SnowballStemmer from nltk](http://www.nltk.org/_modules/nltk/stem/snowball.html) in order to get the "base words",
+and started training the model with a 100k vocabulary. After 36 hours of training with the loss decreasing very slowly, I decided to stop and keep this package as a character-level lemmatizer.
However, in [model_wrappers.py](src/model_wrappers.py), there is a global variable *DOC_HANDLER_FUNC*, which enables one to change the preprocessing method easily from
-characters to words by setting ```DOC_HANDLER_FUNC='WORD'```.
+characters to words by setting ```DOC_HANDLER_FUNC='WORD'```. Try changing the variable and/or writing your own preprocessing function *doc_to_tokens* if you'd like to
+experiment with the word-level model (a hypothetical sketch follows below).
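As a purely hypothetical illustration of the two preprocessing modes (the actual *doc_to_tokens* in [model_wrappers.py](src/model_wrappers.py) may have a different signature and behaviour):

```
# Hypothetical sketches of character-level vs. word-level preprocessing;
# not the actual doc_to_tokens implementation in model_wrappers.py.
def doc_to_tokens_char(doc):
    """Character-level: 'koiran omistaja' -> ['k', 'o', 'i', 'r', 'a', 'n', ' ', 'o', ...]"""
    return list(doc)

def doc_to_tokens_word(doc):
    """Word-level: 'koiran omistaja' -> ['koiran', 'omistaja']"""
    return doc.split()
```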

## Acknowledgements and references

@@ -142,4 +143,5 @@ every word to their base form would've taken too long.
* [FinnTreeBank](http://www.ling.helsinki.fi/kieliteknologia/tutkimus/treebank/): Source for datasets
* [Finnish Dependency Parser](http://bionlp.utu.fi/finnish-parser.html): Source for datasets

---
Jesse Myrberg ([email protected])