diff --git a/AbstractRecurrent.lua b/AbstractRecurrent.lua
index cb87ff4..d379661 100644
--- a/AbstractRecurrent.lua
+++ b/AbstractRecurrent.lua
@@ -282,7 +282,7 @@ function AbstractRecurrent:getGradHiddenState(step, input)
 end
 
 -- set stored grad hidden state
-function AbstractRecurrent:setGradHiddenState(step, hiddenState)
+function AbstractRecurrent:setGradHiddenState(step, gradHiddenState)
    error"Not Implemented"
 end
 
diff --git a/README.md b/README.md
index 971e584..6a71561 100644
--- a/README.md
+++ b/README.md
@@ -27,14 +27,19 @@ Modules that `forward` entire sequences through a decorated `AbstractRecurrent`
 * [RecurrentAttention](#rnn.RecurrentAttention) : a generalized attention model for [REINFORCE modules](https://github.com/nicholas-leonard/dpnn#nn.Reinforce);
 
 Miscellaneous modules and criterions :
- * [MaskZero](#rnn.MaskZero) : zeroes the `output` and `gradOutput` rows of the decorated module for commensurate `input` rows which are tensors of zeros;
+ * [MaskZero](#rnn.MaskZero) : zeroes the `output` and `gradOutput` rows of the decorated module for commensurate
+   * `input` rows which are tensors of zeros (version 1);
+   * `zeroMask` elements which are 1 (version 2);
 * [LookupTableMaskZero](#rnn.LookupTableMaskZero) : extends `nn.LookupTable` to support zero indexes for padding. Zero indexes are forwarded as tensors of zeros;
- * [MaskZeroCriterion](#rnn.MaskZeroCriterion) : zeros the `gradInput` and `loss` rows of the decorated criterion for commensurate `zeroMask` elements which are 1;
+ * [MaskZeroCriterion](#rnn.MaskZeroCriterion) : zeros the `gradInput` and `loss` rows of the decorated criterion for commensurate
+   * `input` rows which are tensors of zeros (version 1);
+   * `zeroMask` elements which are 1 (version 2);
 * [SeqReverseSequence](#rnn.SeqReverseSequence) : reverses an input sequence on a specific dimension;
 * [VariableLength](#rnn.VariableLength): decorates a `Sequencer` to accept and produce a table of variable length inputs and outputs;
 
 Criterions used for handling sequential inputs and targets :
- * [SequencerCriterion](#rnn.SequencerCriterion) : sequentially applies the same criterion to a sequence of inputs and targets (Tensor or Table).
+ * [AbstractSequencerCriterion](#rnn.AbstractSequencerCriterion) : abstract class for criterions that handle sequences (tensor or table);
+ * [SequencerCriterion](#rnn.SequencerCriterion) : sequentially applies the same criterion to a sequence of inputs and targets;
 * [RepeaterCriterion](#rnn.RepeaterCriterion) : repeatedly applies the same criterion with the same target on a sequence.
 
@@ -95,30 +100,7 @@ Additional differentiable criterions
 ## Examples ##
 
-The following are example training scripts using this package :
-
- * [RNN/LSTM/GRU](examples/recurrent-language-model.lua) for Penn Tree Bank dataset;
- * [Noise Contrastive Estimate](examples/noise-contrastive-estimate.lua) for training multi-layer [SeqLSTM](#rnn.SeqLSTM) language models on the [Google Billion Words dataset](https://github.com/Element-Research/dataload#dl.loadGBW). The example uses [MaskZero](#rnn.MaskZero) to train independent variable length sequences using the [NCEModule](https://github.com/Element-Research/dpnn#nn.NCEModule) and [NCECriterion](https://github.com/Element-Research/dpnn#nn.NCECriterion). This script is our fastest yet boasting speeds of 20,000 words/second (on NVIDIA Titan X) with a 2-layer LSTM having 250 hidden units, a batchsize of 128 and sequence length of a 100. Note that you will need to have [Torch installed with Lua instead of LuaJIT](http://torch.ch/docs/getting-started.html#_);
- * [Recurrent Model for Visual Attention](examples/recurrent-visual-attention.lua) for the MNIST dataset;
- * [Encoder-Decoder LSTM](examples/encoder-decoder-coupling.lua) shows you how to couple encoder and decoder `LSTMs` for sequence-to-sequence networks;
- * [Simple Recurrent Network](examples/simple-recurrent-network.lua) shows a simple example for building and training a simple recurrent neural network;
- * [Simple Sequencer Network](examples/simple-sequencer-network.lua) is a version of the above script that uses the Sequencer to decorate the `rnn` instead;
- * [Sequence to One](examples/sequence-to-one.lua) demonstrates how to do many to one sequence learning as is the case for sentiment analysis;
- * [Multivariate Time Series](examples/recurrent-time-series.lua) demonstrates how train a simple RNN to do multi-variate time-series predication.
-
-### External Resources
-
- * [rnn-benchmarks](https://github.com/glample/rnn-benchmarks) : benchmarks comparing Torch (using this library), Theano and TensorFlow.
- * [Harvard Jupyter Notebook Tutorial](http://nbviewer.jupyter.org/github/CS287/Lectures/blob/gh-pages/notebooks/ElementRNNTutorial.ipynb) : an in-depth tutorial for how to use the Element-Research rnn package by Harvard University;
- * [dpnn](https://github.com/Element-Research/dpnn) : this is a dependency of the __rnn__ package. It contains useful nn extensions, modules and criterions;
- * [dataload](https://github.com/Element-Research/dataload) : a collection of torch dataset loaders;
- * [RNN/LSTM/BRNN/BLSTM training script ](https://github.com/nicholas-leonard/dp/blob/master/examples/recurrentlanguagemodel.lua) for Penn Tree Bank or Google Billion Words datasets;
- * A brief (1 hours) overview of Torch7, which includes some details about the __rnn__ packages (at the end), is available via this [NVIDIA GTC Webinar video](http://on-demand.gputechconf.com/gtc/2015/webinar/torch7-applied-deep-learning-for-vision-natural-language.mp4). In any case, this presentation gives a nice overview of Logistic Regression, Multi-Layer Perceptrons, Convolutional Neural Networks and Recurrent Neural Networks using Torch7;
- * [Sequence to Sequence mapping using encoder-decoder RNNs](https://github.com/rahul-iisc/seq2seq-mapping) : a complete training example using synthetic data.
- * [ConvLSTM](https://github.com/viorik/ConvLSTM) is a repository for training a [Spatio-temporal video autoencoder with differentiable memory](http://arxiv.org/abs/1511.06309).
- * An [time series example](https://github.com/rracinskij/rnntest01/blob/master/rnntest01.lua) for univariate timeseries prediction.
- * [Sagar Waghmare](https://github.com/sagarwaghmare69) wrote a nice [tutorial](tutorials/ladder.md) on how to use rnn with nngraph to reproduce the [Lateral Connections in Denoising Autoencoders Support Supervised Learning](http://arxiv.org/pdf/1504.08215.pdf).
-
+A complete list of examples is available in the [examples directory](examples/README.md).
 
 ## Citation ##
 
 If you use __rnn__ in your work, we'd really appreciate it if you could cite the following paper:
 
 Léonard, Nicholas, Sagar Waghmare, Yang Wang, and Jin-Hwa Kim. [rnn: Recurrent Library for Torch.](http://arxiv.org/abs/1511.07889) arXiv preprint arXiv:1511.07889 (2015).
 
 Any significant contributor to the library will also get added as an author to the paper.
-A [significant contributor](https://github.com/Element-Research/rnn/graphs/contributors)
+A [significant contributor](https://github.com/torch/rnn/graphs/contributors)
 is anyone who added at least 300 lines of code to the library.
 
 ## Troubleshooting ##
 
@@ -136,8 +118,8 @@ Most issues can be resolved by updating the various dependencies:
 ```bash
 luarocks install torch
 luarocks install nn
-luarocks install dpnn
 luarocks install torchx
+luarocks install dataload
 ```
 
 If you are using CUDA :
@@ -156,24 +138,40 @@ If that doesn't fix it, open an issue on github.
 
 ## AbstractRecurrent ##
-An abstract class inherited by [Recurrent](#rnn.Recurrent), [RecLSTM](#rnn.RecLSTM) and [GRU](#rnn.GRU).
+An abstract class inherited by [Recurrence](#rnn.Recurrence), [RecLSTM](#rnn.RecLSTM) and [GRU](#rnn.GRU).
 The constructor takes a single argument :
 ```lua
-rnn = nn.AbstractRecurrent([rho])
-```
-Argument `rho` is the maximum number of steps to backpropagate through time (BPTT).
-Sub-classes can set this to a large number like 99999 (the default) if they want to backpropagate through
-the entire sequence whatever its length. Setting lower values of rho are
-useful when long sequences are forward propagated, but we only whish to
-backpropagate through the last `rho` steps, which means that the remainder
-of the sequence doesn't need to be stored (so no additional cost).
-
-### [recurrentModule] getStepModule(step) ###
+rnn = nn.AbstractRecurrent(stepmodule)
+```
+The `stepmodule` argument is an `nn.Module` instance that is [cloned with shared parameters](#nn.Module.sharedClone) at each time-step.
+Sub-classes can call [getStepModule(step)](#rnn.AbstractRecurrent.getStepModule) to automatically clone the `stepmodule`
+and share its parameters for each time-`step`.
+Each call to `forward/updateOutput` calls `self:getStepModule(self.step)` and increments the `self.step` attribute.
+That is, each `forward` call to an `AbstractRecurrent` instance memorizes a new `step`, retaining the previous `stepmodule` clones.
+Although they share parameters and their gradients, each `stepmodule` clone has its own `output` and `gradInput` states.
+
+A good example of a `stepmodule` is the [StepLSTM](#rnn.StepLSTM) used internally by the `RecLSTM`, an `AbstractRecurrent` instance.
+The `StepLSTM` implements a single time-step for an LSTM.
+The `RecLSTM` calls `getStepModule(step)` to clone the `StepLSTM` for each time-step.
+The `RecLSTM` handles the feeding back of previous `StepLSTM.output` states and the current `input` state into the `StepLSTM`.
+
+Many libraries implement RNNs as modules that forward entire sequences.
+This library also supports this use case by wrapping `AbstractRecurrent` modules into [Sequencer](#rnn.Sequencer) modules,
+or more directly via the stand-alone [SeqLSTM](#rnn.SeqLSTM) and [SeqGRU](#rnn.SeqGRU) modules.
+The `rnn` library also provides the `AbstractRecurrent` interface to support real-time RNNs.
+These are RNNs for which the entire `input` sequence is not known in advance.
+Typically, this is because `input[t+1]` is dependent on `output[t] = RNN(input[t])`.
+The `AbstractRecurrent` interface makes it easy to build these real-time RNNs.
+A good example is the [RecurrentAttention](#rnn.RecurrentAttention) module, which implements an attention model using real-time RNNs.
+
+### [stepmodule] getStepModule(step) ###
 Returns a module for time-step `step`. This is used internally by sub-classes
-to obtain copies of the internal `recurrentModule`. These copies share
+to obtain copies of the internal `stepmodule`. These copies share
 `parameters` and `gradParameters` but each has its own `output`, `gradInput`
 and any other intermediate states.
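+
+To make the stepping behaviour concrete, here is a minimal sketch (the dimensions are arbitrary illustrations) that steps an `nn.RecLSTM`, an `AbstractRecurrent` instance, through a sequence one time-step at a time, then backpropagates in reverse order :
+
+```lua
+require 'rnn'
+
+local inputsize, outputsize, seqlen, batchsize = 4, 3, 5, 2
+local rnn = nn.RecLSTM(inputsize, outputsize)
+
+local inputs, gradOutputs = {}, {}
+for step = 1, seqlen do
+   inputs[step] = torch.randn(batchsize, inputsize)
+   rnn:forward(inputs[step]) -- internally calls getStepModule(step)
+   gradOutputs[step] = torch.randn(batchsize, outputsize)
+end
+
+-- BPTT : backward is called in reverse order of the forward calls
+for step = seqlen, 1, -1 do
+   rnn:backward(inputs[step], gradOutputs[step])
+end
+
+rnn:forget() -- forget the current sequence before starting a new one
+```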
+
 ### setOutputStep(step) ###
 This is a method reserved for internal use by [Recursor](#rnn.Recursor)
 when doing backward propagation. It sets the object's `output` attribute
 to point to the output at time-step `step`.
 This method was introduced to solve a very annoying bug.
 
-### maskZero(nInputDim) ###
-Decorates the internal `recurrentModule` with [MaskZero](#rnn.MaskZero).
-The `output` Tensor (or table thereof) of the `recurrentModule`
-will have each row (i.e. samples) zeroed when the commensurate row of the `input`
-is a tensor of zeros.
+### [self] maskZero(v1) ###
+
+Decorates the internal `stepmodule` with [MaskZero](#rnn.MaskZero).
+The `stepmodule` is the module that is [cloned with shared parameters](#nn.Module.sharedClone) at each time-step.
+The `output` and `gradOutput` Tensor (or table thereof) of the `stepmodule`
+will have each row (that is, samples) zeroed where
+ * the commensurate row of the `input` is a tensor of zeros (version 1 with `v1=true`); or
+ * the commensurate element of the `zeroMask` tensor is 1 (version 2; the default).
+
+Version 2 (the default) requires that [`setZeroMask(zeroMask)`](#rnn.AbstractRecurrent.setZeroMask)
+be called beforehand. The `zeroMask` must be a `seqlen x batchsize` ByteTensor or CudaByteTensor.
+
+![zeroMask](doc/image/zeroMask.png)
+In the above figure, we can see an `input` and commensurate `zeroMask` of size `seqlen=4 x batchsize=3`.
+The `input` could have additional dimensions like `seqlen x batchsize x inputsize`.
+The dark blocks in the `input` separate different sequences in each sample/row.
+The same elements in the `zeroMask` are set to 1, while the remainder are set to 0.
+For version 1, the dark blocks in the `input` would have a norm of 0, from which a `zeroMask` is automatically interpolated.
+For version 2, the `zeroMask` is provided before calling `forward(input)`,
+thereby alleviating the need to call `norm` at each zero-masked module.
+
+The zero-masking implemented by `maskZero()` and `setZeroMask()` makes it possible to pad sequences with different lengths in the same batch with zero vectors.
+
+At a given time-step `t`, a sample `i` is masked when:
+ * the `input[i]` is a row of zeros (version 1) where `input` is a batched time-step; or
+ * the `zeroMask[{t,i}] = 1` (version 2).
+
+When a sample time-step is masked, the hidden state is effectively reset (that is, forgotten) for the next non-mask time-step.
+In other words, it is possible to separate unrelated sequences with a masked element.
 
-The `nInputDim` argument must specify the number of non-batch dims
-in the first Tensor of the `input`. In the case of an `input` table,
-the first Tensor is the first one encountered when doing a depth-first search.
+The `maskZero()` method returns `self`.
+The `maskZero()` method can be called on any `nn.Module`.
+Zero-masking only supports batch mode.
 
-Calling this method makes it possible to pad sequences with different lengths in the same batch with zero vectors.
+See the [noise-contrastive-estimate.lua](examples/noise-contrastive-estimate.lua) script for an example implementation of version 2 zero-masking.
+See the [simple-bisequencer-network-variable.lua](examples/simple-bisequencer-network-variable.lua) script for an example implementation of version 1 zero-masking.
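+
+As a minimal sketch of version 1 zero-masking (the dimensions are arbitrary illustrations), the `zeroMask` is interpolated from `input` rows that are entirely zero :
+
+```lua
+require 'rnn'
+
+local lstm = nn.Sequencer(nn.RecLSTM(3, 2))
+lstm:maskZero(true) -- v1=true enables version 1 zero-masking
+
+local input = torch.randn(4, 2, 3) -- seqlen x batchsize x inputsize
+input[{2, 1}]:zero() -- mask the first sample at the second time-step
+
+local output = lstm:forward(input)
+-- output[{2, 1}] is a row of zeros, and the hidden state of sample 1
+-- is reset for the third time-step
+```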
-When a sample time-step is masked (i.e. `input` is a row of zeros), then
-the hidden state is effectively reset (i.e. forgotten) for the next non-mask time-step.
-In other words, it is possible seperate unrelated sequences with a masked element.
+
+### setZeroMask(zeroMask) ###
-### trimZero(nInputDim) ###
-Decorates the internal `recurrentModule` with [TrimZero](#rnn.TrimZero).
+
+Sets the `zeroMask` of the RNN.
+
+For example,
+```lua
+seqlen, batchsize = 2, 4
+inputsize, outputsize = 3, 1
+-- an AbstractRecurrent instance encapsulated by a Sequencer
+lstm = nn.Sequencer(nn.RecLSTM(inputsize, outputsize))
+lstm:maskZero() -- enable version 2 zero-masking
+-- zero-mask the sequence
+zeroMask = torch.ByteTensor(seqlen, batchsize):zero()
+zeroMask[{1,3}] = 1
+zeroMask[{2,4}] = 1
+lstm:setZeroMask(zeroMask)
+-- forward sequence
+input = torch.randn(seqlen, batchsize, inputsize)
+output = lstm:forward(input)
+print(output)
+(1,.,.) =
+ -0.1715
+  0.0212
+  0.0000
+  0.3301
+
+(2,.,.) =
+  0.1695
+ -0.2507
+ -0.1700
+  0.0000
+[torch.DoubleTensor of size 2x4x1]
+```
+The `output` is indeed zeroed for the third sample in the first time-step (`zeroMask[{1,3}] = 1`)
+and for the fourth sample in the second time-step (`zeroMask[{2,4}] = 1`).
+The `gradOutput` would also be zeroed in the same way.
+The `setZeroMask()` method can be called on any `nn.Module`.
+
+When `zeroMask=false`, the zero-masking is disabled.
+
 
 ### [output] updateOutput(input) ###
 Forward propagates the input for the current step. The outputs or intermediate
 states of the previous steps are used recurrently. This is transparent to the
 caller as the previous outputs and intermediate states are memorized. This
 method also increments the `step` attribute by 1.
 
-### updateGradInput(input, gradOutput) ###
+### [gradInput] updateGradInput(input, gradOutput) ###
 Like `backward`, this method should be called in the reverse order of
 `forward` calls used to propagate a sequence. So for example :
 
@@ -233,13 +292,13 @@ Like `updateGradInput`, but for accumulating gradients w.r.t. parameters.
 
 ### recycle(offset) ###
 This method goes hand in hand with `forget`. It is useful when the
 current time-step is greater than `rho`, at which point it starts recycling
-the oldest `recurrentModule` `sharedClones`,
+the oldest `stepmodule` `sharedClones`,
 such that they can be reused for storing the next step. This `offset`
 is used for modules like `nn.Recurrent` that use a different module
 for the first step. Default offset is 0.
 
-### forget(offset) ###
+### forget() ###
 This method brings back all states to the start of the sequence buffers,
 i.e. it forgets the current sequence. It also resets the `step` attribute to 1.
 It is highly recommended to call `forget` after each parameter update.
@@ -249,23 +308,15 @@ the result of now changed parameters. It is also good practice to call
 `forget` at the start of each new sequence.
 
-### maxBPTTstep(rho) ###
+### maxBPTTstep(seqlen) ###
 This method sets the maximum number of time-steps for which to perform
-backpropagation through time (BPTT). So say you set this to `rho = 3` time-steps,
+backpropagation through time (BPTT). So say you set this to `seqlen = 3` time-steps,
 feed-forward for 4 steps, and then backpropagate, only the last 3 steps will be used
 for the backpropagation. If your AbstractRecurrent instance is wrapped
-by a [Sequencer](#rnn.Sequencer), this will be handled auto-magically by the Sequencer.
-Otherwise, setting this value to a large value (i.e. 9999999), is good for most, if not all, cases.
-
-### backwardOnline() ###
-This method was deprecated Jan 6, 2016.
-Since then, by default, `AbstractRecurrent` instances use the
-backwardOnline behaviour.
-See [updateGradInput](#rnn.AbstractRecurrent.updateGradInput) for details.
+by a [Sequencer](#rnn.Sequencer), this will be handled auto-magically by the `Sequencer`.
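+
+As a minimal sketch of truncated BPTT (the dimensions are arbitrary illustrations) :
+
+```lua
+require 'rnn'
+
+local rnn = nn.RecLSTM(3, 2)
+rnn:maxBPTTstep(3) -- only backpropagate through the last 3 time-steps
+
+local inputs, gradOutputs = {}, {}
+for step = 1, 4 do
+   inputs[step] = torch.randn(2, 3)
+   rnn:forward(inputs[step])
+   gradOutputs[step] = torch.randn(2, 2)
+end
+
+-- only the last 3 steps contribute to the BPTT
+for step = 4, 2, -1 do
+   rnn:backward(inputs[step], gradOutputs[step])
+end
+rnn:forget()
+```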
 
 ### training() ###
-In training mode, the network remembers all previous `rho` (number of time-steps)
+In training mode, the network remembers all previous `seqlen` (number of time-steps)
 states. This is necessary for BPTT.
 
 ### evaluate() ###
@@ -274,6 +325,31 @@ only the previous step is remembered. This is very efficient memory-wise,
 such that evaluation can be performed using potentially infinite-length
 sequences.
+
+
+### [hiddenState] getHiddenState(step, [input]) ###
+Returns the stored hidden state.
+For example, the hidden state `h[step]` would be returned where `h[step] = f(x[step], h[step-1])`.
+The `input` is only required for `step=0` as it is used to initialize `h[0] = 0`.
+See [encoder-decoder-coupling.lua](examples/encoder-decoder-coupling.lua) for an example.
+
+### setHiddenState(step, hiddenState) ###
+Sets the hidden state of the RNN.
+This is useful to implement encoder-decoder coupling to form sequence to sequence networks.
+See [encoder-decoder-coupling.lua](examples/encoder-decoder-coupling.lua) for an example.
+
+### getGradHiddenState(step, [input]) ###
+Returns the stored gradient of the hidden state: `grad(h[t])`.
+The `input` is used to initialize the last step of the RNN with zeros.
+See [encoder-decoder-coupling.lua](examples/encoder-decoder-coupling.lua) for an example.
+
+### setGradHiddenState(step, gradHiddenState) ###
+Sets the stored grad hidden state for a specific time-`step`.
+This is useful to implement encoder-decoder coupling to form sequence to sequence networks.
+See [encoder-decoder-coupling.lua](examples/encoder-decoder-coupling.lua) for an example.
 
 ### Decorate it with a Sequencer ###
 
@@ -478,7 +554,6 @@ The `nn.GRU(inputSize, outputSize [,rho [,p [, mono]]])` constructor takes 3 arg
 * `outputSize` : a number specifying the size of the output;
 * `rho` : the maximum amount of backpropagation steps to take back in time. Limits the number of previous steps kept in memory. Defaults to 9999;
 * `p` : dropout probability for inner connections of GRUs.
- * `mono` : Monotonic sample for dropouts inside GRUs. Only needed in a `TrimZero` + `BGRU`(p>0) situation.
 
 ![GRU](http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/10/Screen-Shot-2015-10-23-at-10.36.51-AM.png)
 
@@ -489,7 +564,7 @@
 r[t] = σ(W[x->r]x[t] + W[s->r]s[t−1] + b[1->r]) (2)
 h[t] = tanh(W[x->h]x[t] + W[hr->c](s[t−1]r[t]) + b[1->h]) (3)
 s[t] = (1-z[t])h[t] + z[t]s[t-1] (4)
 ```
-where `W[s->q]` is the weight matrix from `s` to `q`, `t` indexes the time-step, `b[1->q]` are the biases leading into `q`, `σ()` is `Sigmoid`, `x[t]` is the input and `s[t]` is the output of the module (eq. 4). Note that unlike the [LSTM](#rnn.LSTM), the GRU has no cells.
+where `W[s->q]` is the weight matrix from `s` to `q`, `t` indexes the time-step, `b[1->q]` are the biases leading into `q`, `σ()` is `Sigmoid`, `x[t]` is the input and `s[t]` is the output of the module (eq. 4). Note that unlike the [RecLSTM](#rnn.RecLSTM), the GRU has no cells.
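+
+As a minimal usage sketch (the dimensions are arbitrary illustrations), a GRU can be decorated with a `Sequencer` to forward an entire sequence :
+
+```lua
+require 'rnn'
+
+local gru = nn.Sequencer(nn.GRU(10, 5))
+
+local input = {} -- a sequence of 3 time-steps with a batchsize of 2
+for step = 1, 3 do
+   input[step] = torch.randn(2, 10)
+end
+
+local output = gru:forward(input) -- a table of 3 tensors of size 2x5
+```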
 
 The GRU was benchmarked on the `PennTreeBank` dataset using the [recurrent-language-model.lua](examples/recurrent-language-model.lua) script.
 It slightly outperformed [FastLSTM](https://github.com/torch/rnn/blob/master/deprecated/README.md#rnn.FastLSTM) (deprecated),
 however, since LSTMs have more parameters than GRUs,
@@ -666,10 +741,10 @@ training script for an example of its use.
 
 An extremely general container for implementing pretty much any type of recurrence.
 
 ```lua
-rnn = nn.Recurrence(recurrentModule, outputSize, nInputDim, [rho])
+rnn = nn.Recurrence(stepmodule, outputSize, nInputDim, [rho])
 ```
 
-`Recurrence` manages a single `recurrentModule`, which should
+`Recurrence` manages a single `stepmodule`, which should
 output a Tensor or table : `output(t)`
 given an input table : `{input(t), output(t-1)}`.
 Using a mix of `Recursor` (say, via `Sequencer`) with `Recurrence`, one can implement
@@ -760,6 +835,32 @@ local rnn = nn.Sequential()
 
 This abstract class implements a light interface shared by
 subclasses like : `Sequencer`, `Repeater`, `RecurrentAttention`, `BiSequencer` and so on.
+
+### remember([mode]) ###
+When `mode='neither'` (the default behavior of the class), the Sequencer will additionally call [forget](#nn.AbstractRecurrent.forget) before each call to `forward`.
+When `mode='both'` (the default when calling this function), the Sequencer will never call [forget](#nn.AbstractRecurrent.forget).
+In this case, it is up to the user to call `forget` between independent sequences.
+This behavior is only applicable to decorated AbstractRecurrent `modules`.
+Accepted values for argument `mode` are as follows :
+
+ * 'eval' only affects evaluation (recommended for RNNs)
+ * 'train' only affects training
+ * 'neither' affects neither training nor evaluation (default behavior of the class)
+ * 'both' affects both training and evaluation (recommended for LSTMs)
+
+### [bool] hasMemory() ###
+
+Returns true if the instance has memory.
+See [remember()](#rnn.AbstractSequencer.remember) for details.
+
+### setZeroMask(zeroMask) ###
+
+Expects a `seqlen x batchsize` `zeroMask`.
+The `zeroMask` is then applied to each of the decorated module's `seqlen` time-steps by indexing `zeroMask[step]`.
+When `zeroMask=false`, the zero-masking is disabled.
 
 ## Sequencer ##
 
@@ -770,11 +871,13 @@ to be applied from left to right, on each element of the input sequence.
 seq = nn.Sequencer(module)
 ```
 
-This Module is a kind of [decorator](http://en.wikipedia.org/wiki/Decorator_pattern)
+The `Sequencer` is a kind of [decorator](http://en.wikipedia.org/wiki/Decorator_pattern)
 used to abstract away the intricacies of `AbstractRecurrent` modules.
 While an `AbstractRecurrent` instance requires that a sequence be presented one input at a time, each with its own call to `forward` (and `backward`),
 the `Sequencer` forwards an `input` sequence (a table) into an `output` sequence (a table of the same length).
-It also takes care of calling `forget` on AbstractRecurrent instances.
+It also takes care of calling `forget` on `AbstractRecurrent` instances.
+
+The `Sequencer` inherits from [AbstractSequencer](#rnn.AbstractSequencer).
 
 ### Input/Output Format
 
@@ -866,7 +969,7 @@ Accepted values for argument `mode` are as follows :
 
 * 'both' affects both training and evaluation (recommended for LSTMs)
 
 ### forget() ###
-Calls the decorated AbstractRecurrent module's `forget` method.
+Calls the decorated `AbstractRecurrent` module's `forget` method.
 
 ## SeqLSTM ##
 
@@ -1129,53 +1232,85 @@ A complete implementation of Ref. A is available [here](examples/recurrent-visua
 
 ## MaskZero ##
 
-This module zeroes the `output` rows of the decorated module
-for commensurate `input` rows which are tensors of zeros.
+
+This module implements *zero-masking*.
+Zero-masking zeroes specific rows/samples of a module's `output` and `gradInput` states.
+Zero-masking is used for efficiently processing variable length sequences.
 
 ```lua
-mz = nn.MaskZero(module, nInputDim)
+mz = nn.MaskZero(module, [v1, maskinput, maskoutput])
 ```
 
-The `output` Tensor (or table thereof) of the decorated `module`
-will have each row (samples) zeroed when the commensurate row of the `input`
-is a tensor of zeros.
+This module zeroes the `output` and `gradOutput` rows of the decorated `module` where
+ * the commensurate row of the `input` is a tensor of zeros (version 1 with `v1=true`); or
+ * the commensurate element of the `zeroMask` tensor is 1 (version 2 with `v1=false`, the default).
 
-The `nInputDim` argument must specify the number of non-batch dims
-in the first Tensor of the `input`. In the case of an `input` table,
-the first Tensor is the first one encountered when doing a depth-first search.
+Version 2 (the default) requires that [`setZeroMask(zeroMask)`](#rnn.MaskZero.setZeroMask)
+be called beforehand. The `zeroMask` must be a `torch.ByteTensor` or `torch.CudaByteTensor` of size `batchsize`.
 
-This decorator makes it possible to pad sequences with different lengths in the same batch with zero vectors.
+At a given time-step `t`, a sample `i` is masked when:
+ * the `input[i]` is a row of zeros (version 1) where `input` is a batched time-step; or
+ * the `zeroMask[{t,i}] = 1` (version 2).
+
+When a sample time-step is masked, the hidden state is effectively reset (that is, forgotten) for the next non-mask time-step.
+In other words, it is possible to separate unrelated sequences with a masked element.
+
+When `maskoutput=true` (the default), `output` and `gradOutput` are zero-masked.
+When `maskinput=true` (not the default), `input` and `gradInput` are zero-masked.
 
-Caveat: `MaskZero` not guarantee that the `output` and `gradInput` tensors of the internal modules
-of the decorated `module` will be zeroed as well when the `input` is zero as well.
-`MaskZero` only affects the immediate `gradInput` and `output` of the module that it encapsulates.
+Zero-masking only supports batch mode.
+
+Caveat: `MaskZero` does not guarantee that the `output` and `gradOutput` tensors of the internal modules
+of the decorated `module` will be zeroed.
+`MaskZero` only affects the immediate `gradOutput` and `output` of the module that it encapsulates.
 However, for most modules, the gradient update for that time-step will be zero because
 backpropagating a gradient of zeros will typically yield zeros all the way to the input.
-In this respect, modules to avoid in encapsulating inside a `MaskZero` are `AbsractRecurrent`
+In this respect, modules that shouldn't be encapsulated inside a `MaskZero` are `AbstractRecurrent`
 instances, as gradients flow between different time-steps internally.
 Instead, call the [AbstractRecurrent.maskZero](#rnn.AbstractRecurrent.maskZero) method
-to encapsulate the internal `recurrentModule`.
-
-
-## TrimZero ##
-
-WARNING : only use this module if your input contains lots of zeros.
-In almost all cases, [`MaskZero`](#rnn.MaskZero) will be faster, especially with CUDA.
+to encapsulate the internal `stepmodule`.
 
-Ref. A : [TrimZero: A Torch Recurrent Module for Efficient Natural Language Processing](https://bi.snu.ac.kr/Publications/Conferences/Domestic/KIIS2016S_JHKim.pdf)
+See the [noise-contrastive-estimate.lua](examples/noise-contrastive-estimate.lua) script for an example implementation of version 2 zero-masking.
+See the [simple-bisequencer-network-variable.lua](examples/simple-bisequencer-network-variable.lua) script for an example implementation of version 1 zero-masking.
 
-The usage is the same with `MaskZero`.
+
+### setZeroMask(zeroMask) ###
+
+Sets the `zeroMask` of the `MaskZero` module (required for version 2 forwards).
+For example,
 ```lua
-mz = nn.TrimZero(module, nInputDim)
+batchsize = 3
+inputsize, outputsize = 2, 1
+-- an nn.Linear module decorated with MaskZero (version 2)
+module = nn.MaskZero(nn.Linear(inputsize, outputsize))
+-- zero-mask the second sample/row
+zeroMask = torch.ByteTensor(batchsize):zero()
+zeroMask[2] = 1
+module:setZeroMask(zeroMask)
+-- forward
+input = torch.randn(batchsize, inputsize)
+output = module:forward(input)
+print(output)
+ 0.6597
+ 0.0000
+ 0.8170
+[torch.DoubleTensor of size 3x1]
+```
+The `output` is indeed zeroed for the second sample (`zeroMask[2] = 1`).
+The `gradInput` would also be zeroed in the same way because the `gradOutput` would be zeroed:
+```lua
+gradOutput = torch.randn(batchsize, outputsize)
+gradInput = module:backward(input, gradOutput)
+print(gradInput)
+ 0.8187  0.0534
+ 0.0000  0.0000
+ 0.1742  0.0114
+[torch.DoubleTensor of size 3x2]
 ```
-The only difference from `MaskZero` is that it reduces computational costs by varying a batch size, if any, for the case that varying lengths are provided in the input.
-Notice that when the lengths are consistent, `MaskZero` will be faster, because `TrimZero` has an operational cost.
-
-In short, the result is the same with `MaskZero`'s, however, `TrimZero` is faster than `MaskZero` only when sentence lengths is costly vary.
+For `Container` modules, a call to `setZeroMask()` is propagated to all component modules that expect a `zeroMask`.
 
-In practice, e.g. language model, `TrimZero` is expected to be faster than `MaskZero` about 30%. (You can test with it using `test/test_trimzero.lua`.)
+When `zeroMask=false`, the zero-masking is disabled.
 
 ## LookupTableMaskZero ##
@@ -1189,6 +1324,8 @@ The `output` Tensor will have each row zeroed when the commensurate row of the `
 This lookup table makes it possible to pad sequences with different lengths in the same batch with zero vectors.
 
+Note that this module ignores version 2 zero-masking, and therefore expects zero-valued `input` indexes wherever masking is needed.
+
 ## MaskZeroCriterion ##
@@ -1285,6 +1422,28 @@
 The module doesn't support CUDA.
 
+
+## AbstractSequencerCriterion ##
+
+```lua
+asc = nn.AbstractSequencerCriterion(stepcriterion, [sizeAverage])
+```
+
+Similar to the `stepmodule` passed to the [AbstractRecurrent](#rnn.AbstractRecurrent) constructor,
+the `stepcriterion` is internally cloned for each time-step.
+Unlike the `stepmodule`, the `stepcriterion` never has any parameters to share.
+
+### [criterion] getStepCriterion(step) ###
+
+Returns a `criterion` clone of the `stepcriterion` (stored in `self.clones[1]`) for a specific time-`step`.
+
+### setZeroMask(zeroMask) ###
+
+Expects a `seqlen x batchsize` `zeroMask`.
+The `zeroMask` is then passed to `seqlen` criterions by indexing `zeroMask[step]`.
+When `zeroMask=false`, the zero-masking is disabled.
 
 ## SequencerCriterion ##
 
@@ -1322,7 +1481,7 @@ which are repeatedly presented with the same target.
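+
+As a minimal sketch of how the sequence criterions are used (the dimensions are arbitrary illustrations), a `SequencerCriterion` applies its `stepcriterion` to each time-step :
+
+```lua
+require 'rnn'
+
+local seqlen, batchsize, nclass = 3, 2, 4
+local crit = nn.SequencerCriterion(nn.ClassNLLCriterion())
+local logsoftmax = nn.LogSoftMax()
+
+local input, target = {}, {}
+for step = 1, seqlen do
+   -- batchsize x nclass log-probabilities per time-step
+   input[step] = logsoftmax:forward(torch.randn(batchsize, nclass)):clone()
+   target[step] = torch.LongTensor(batchsize):random(1, nclass)
+end
+
+local loss = crit:forward(input, target)
+local gradInput = crit:backward(input, target)
+```
+
+A `RepeaterCriterion` would be used the same way, except that a single `target` is reused for every time-step.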
 ## Module ##
 
 The Module interface has been further extended with methods that facilitate
-stochastic gradient descent like [updateGradParameters](#nn.Module.updageGradParameters) (i.e. momentum learning),
+stochastic gradient descent like [updateGradParameters](#nn.Module.updateGradParameters) (for momentum learning),
 [weightDecay](#nn.Module.weightDecay),
 [maxParamNorm](#nn.Module.maxParamNorm) (for regularization), and so on.
 
diff --git a/doc/image/zeroMask.png b/doc/image/zeroMask.png
new file mode 100644
index 0000000..7ef8d70
Binary files /dev/null and b/doc/image/zeroMask.png differ
diff --git a/examples/README.md b/examples/README.md
index c9ca2f0..e68be30 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -1,15 +1,33 @@
 # Examples
 
-This directory contains various training scripts.
+This document outlines the variety of training scripts and external resources.
 
-Torch blog posts
- * The torch.ch blog contains detailed posts about the *rnn* package.
- 1. [recurrent-visual-attention.lua](recurrent-visual-attention.lua): training script used in [Recurrent Model for Visual Attention](http://torch.ch/blog/2015/09/21/rmva.html). Implements the REINFORCE learning rule to learn an attention mechanism for classifying MNIST digits, sometimes translated.
- 2. [noise-contrastive-esimate.lua](noise-contrastive-estimate.lua): one of two training scripts used in [Language modeling a billion words](http://torch.ch/blog/2016/07/25/nce.html). Single-GPU script for training recurrent language models on the Google billion words dataset.
- 3. [multigpu-nce-rnnlm.lua](multigpu-nce-rnnlm.lua) : 4-GPU version of `noise-contrastive-estimate.lua` for training larger multi-GPU models. Two of two training scripts used in the [Language modeling a billion words](http://torch.ch/blog/2016/07/25/nce.html).
 
-Simple training scripts.
- * Showcases the fundamental principles of the package. In chronological order of introduction date.
+## Advanced training scripts
+
+This section lists advanced training scripts that train RNNs on real-world datasets.
+ 1. [recurrent-language-model.lua](recurrent-language-model.lua): trains a stack of LSTM, GRU, MuFuRu, or Simple RNN on the Penn Tree Bank dataset with or without dropout.
+ 2. [recurrent-visual-attention.lua](recurrent-visual-attention.lua): training script used in [Recurrent Model for Visual Attention](http://torch.ch/blog/2015/09/21/rmva.html). Implements the REINFORCE learning rule to learn an attention mechanism for classifying MNIST digits, sometimes translated. Showcases `nn.RecurrentAttention`, `nn.SpatialGlimpse` and `nn.Reinforce`.
+ 3. [noise-contrastive-estimate.lua](noise-contrastive-estimate.lua): one of two training scripts used in [Language modeling a billion words](http://torch.ch/blog/2016/07/25/nce.html). Single-GPU script for training recurrent language models on the Google billion words dataset. This example showcases version 2 zero-masking. Version 2 is more efficient than version 1 as the `zeroMask` is interpolated only once.
+ 4. [multigpu-nce-rnnlm.lua](multigpu-nce-rnnlm.lua) : 4-GPU version of `noise-contrastive-estimate.lua` for training larger multi-GPU models. Two of two training scripts used in the [Language modeling a billion words](http://torch.ch/blog/2016/07/25/nce.html). This script is for training multi-layer [SeqLSTM](/README.md#rnn.SeqLSTM) language models on the [Google Billion Words dataset](https://github.com/Element-Research/dataload#dl.loadGBW).
The example uses [MaskZero](/README.md#rnn.MaskZero) to train independent variable length sequences using the [NCEModule](/README.md#nn.NCEModule) and [NCECriterion](/README.md#nn.NCECriterion). This script is our fastest yet, boasting speeds of 20,000 words/second (on NVIDIA Titan X) with a 2-layer LSTM having 250 hidden units, a batchsize of 128 and a sequence length of 100. Note that you will need to have [Torch installed with Lua instead of LuaJIT](http://torch.ch/docs/getting-started.html#_);
+ 5. [twitter-sentiment-rnn.lua](twitter-sentiment-rnn.lua) : trains a stack of RNNs on a Twitter sentiment analysis task. This is a text classification problem that uses a sequence-to-one architecture, where only the last RNN's last time-step is used for classification.
+
+## Simple training scripts
+
+This section lists simple training scripts that train RNNs on dummy datasets.
+These scripts showcase the fundamental principles of the package (a minimal end-to-end sketch follows the list).
 1. [simple-recurrent-network.lua](simple-recurrent-network.lua): uses the `nn.LookupRNN` module to instantiate a Simple RNN. Illustrates the first AbstractRecurrent instance in action. It has since been surpassed by the more flexible `nn.Recursor` and `nn.Recurrence`. The `nn.Recursor` class decorates any module to make it conform to the nn.AbstractRecurrent interface. The `nn.Recurrence` implements the recursive `h[t] <- forward(h[t-1], x[t])`. Together, `nn.Recursor` and `nn.Recurrence` can be used to implement a wide range of experimental recurrent architectures.
 2. [simple-sequencer-network.lua](simple-sequencer-network.lua): uses the `nn.Sequencer` module to accept a batch of sequences as `input` of size `seqlen x batchsize x ...`. Both tables and tensors are accepted as input and produce the same type of output (table->table, tensor->tensor). The `Sequencer` class abstracts away the implementation of back-propagation through time. It also provides a `remember(['neither','both'])` method for triggering what the `Sequencer` remembers between iterations (forward,backward,update).
 3. [simple-recurrence-network.lua](simple-recurrence-network.lua): uses the `nn.Recurrence` module to define the h[t] <- sigmoid(h[t-1], x[t]) Simple RNN. Decorates it using `nn.Sequencer` so that an entire batch of sequences (`input`) can be forward and backward propagated per update.
+ 4. [simple-bisequencer-network.lua](simple-bisequencer-network.lua): uses a `nn.BiSequencerLM` and two `nn.LookupRNN` to implement a simple bi-directional language model.
+ 5. [simple-bisequencer-network-variable.lua](simple-bisequencer-network-variable.lua): uses `nn.RecLSTM`, `nn.LookupTableMaskZero`, `nn.ZipTable`, `nn.MaskZero` and `nn.MaskZeroCriterion` to implement a simple bi-directional LSTM language model. This example uses version 1 zero-masking where the `zeroMask` is automatically interpolated from the `input`.
+ 6. [sequence-to-one.lua](sequence-to-one.lua): a simple sequence-to-one example that uses `Recurrence` to build an RNN and `SelectTable(-1)` to select the last time-step for discriminating the sequence.
+ 7. [encoder-decoder-coupling.lua](encoder-decoder-coupling.lua): uses two stacks of `nn.SeqLSTM` to implement an encoder and decoder. The final hidden state of the encoder initializes the hidden state of the decoder. Example of sequence-to-sequence learning.
+ 8. [nested-recurrence-lstm.lua](nested-recurrence-lstm.lua): demonstrates how RNNs can be nested to form complex RNNs.
+ 9. [recurrent-time-series.lua](recurrent-time-series.lua): demonstrates how to train a simple RNN to do multi-variate time-series prediction.
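+
+For readers who want the gist before opening the scripts, here is a minimal end-to-end sketch in the spirit of [simple-sequencer-network.lua](simple-sequencer-network.lua) (the dimensions, dummy data and learning rate are arbitrary illustrations) :
+
+```lua
+require 'rnn'
+
+local seqlen, batchsize, nindex, hiddensize = 5, 3, 100, 10
+
+local rnn = nn.Sequencer(
+   nn.Sequential()
+      :add(nn.LookupRNN(nindex, hiddensize))
+      :add(nn.Linear(hiddensize, nindex))
+      :add(nn.LogSoftMax())
+)
+local crit = nn.SequencerCriterion(nn.ClassNLLCriterion())
+
+for iteration = 1, 10 do
+   -- dummy language-modeling data : predict random next-word indices
+   local input = torch.LongTensor(seqlen, batchsize):random(1, nindex)
+   local target = torch.LongTensor(seqlen, batchsize):random(1, nindex)
+
+   local output = rnn:forward(input)
+   local loss = crit:forward(output, target)
+
+   rnn:zeroGradParameters()
+   rnn:backward(input, crit:backward(output, target))
+   rnn:updateParameters(0.1) -- learning rate
+   rnn:forget() -- forget the current sequence before the next iteration
+end
+```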
+
+## External resources
+
+ * [rnn-benchmarks](https://github.com/glample/rnn-benchmarks) : benchmarks comparing Torch (using this library), Theano and TensorFlow.
+ * [dataload](https://github.com/Element-Research/dataload) : a collection of torch dataset loaders;
+ * A brief (1 hour) overview of Torch7, which includes some details about the __rnn__ package (at the end), is available via this [NVIDIA GTC Webinar video](http://on-demand.gputechconf.com/gtc/2015/webinar/torch7-applied-deep-learning-for-vision-natural-language.mp4). In any case, this presentation gives a nice overview of Logistic Regression, Multi-Layer Perceptrons, Convolutional Neural Networks and Recurrent Neural Networks using Torch7;
+ * [Sagar Waghmare](https://github.com/sagarwaghmare69) wrote a nice [tutorial](tutorials/ladder.md) on how to use rnn with nngraph to reproduce the [Lateral Connections in Denoising Autoencoders Support Supervised Learning](http://arxiv.org/pdf/1504.08215.pdf).
diff --git a/examples/simple-bisequencer-network.lua b/examples/simple-bisequencer-network.lua
index 2d87004..cd14ead 100644
--- a/examples/simple-bisequencer-network.lua
+++ b/examples/simple-bisequencer-network.lua
@@ -10,11 +10,7 @@ lr = 0.1
 
 -- forward rnn
 -- build simple recurrent neural network
-local fwd = nn.Recurrent(
-   hiddenSize, nn.LookupTable(nIndex, hiddenSize),
-   nn.Linear(hiddenSize, hiddenSize), nn.Sigmoid(),
-   seqlen
-)
+local fwd = nn.LookupRNN(nIndex, hiddenSize)
 
 -- backward rnn (will be applied in reverse order of input sequence)
 local bwd = fwd:clone()
diff --git a/examples/twitter_sentiment_rnn.lua b/examples/twitter-sentiment-rnn.lua
similarity index 100%
rename from examples/twitter_sentiment_rnn.lua
rename to examples/twitter-sentiment-rnn.lua