
Commit

Update professor-forcing.md
ethancaballero authored Nov 3, 2016
1 parent 53da06b commit 59f63a3
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion notes/professor-forcing.md
@@ -31,7 +31,7 @@ TLDR; The authors adapt Generative Adversarial Networks (GANs) to RNNs and train

- Props to the authors for a very clear and well-written paper. This is rarer than it should be :)
- It's an interesting idea to also match the states of the RNN instead of just the outputs. Intuitively, matching the outputs should implicitly match the state distribution. I wonder if the authors tried this and it didn't work as expected.
^It's significantly harder to use GAN on outputs because they are discrete (as opposed to continuous like the hidden states). They would have had to estimate discrete outputs with policy gradient like in seqGAN, which is harder to get to converge, which is why they probably just stuck with the hidden states, which already contain info about the output (the index of the highest probability in the distribution) anyway. The Professor Forcing method is unique in that one has access to the continuous probability distribution of each token at each timestep of the two sequence generation modes being pushed closer together. Conversely, when applying GANs to push real samples and generated samples closer together, as is traditionally done in models like seqGAN, one only has access to the next discrete token (not the continuous probability distribution of the next token) at each timestep, which prevents the straightforward differentiation used in Professor Forcing from being applied and forces one to use policy gradient estimation.
- Note from [Ethan Caballero](https://github.com/ethancaballero) about why they chose to match hidden states: It's significantly harder to use GANs on sampled (argmax) output tokens because they are discrete (as opposed to continuous, like the hidden states and their respective softmaxes). They would have had to estimate discrete outputs with policy gradients as in [seqGAN](https://github.com/dennybritz/deeplearning-papernotes/blob/master/notes/seq-gan.md), which is [harder to get to converge](https://www.quora.com/Do-you-have-any-ideas-on-how-to-get-GANs-to-work-with-text), so they probably just stuck with the hidden states, which already contain info about the discrete sampled outputs (the index of the highest probability in the distribution) anyway. The Professor Forcing method is unique in that one has access to the continuous probability distribution of each token at each timestep in the two sequence generation modes being pushed closer together. Conversely, when applying GANs to push real samples and generated samples closer together, as is traditionally done in models like seqGAN, one only has access to the next discrete token (not the continuous probability distribution over the next token) at each timestep, which prevents the straightforward differentiation used in Professor Forcing and forces one to use policy gradient estimation (see the sketch after this list). However, one might be able to use straightforward differentiation to train seqGANs in the traditional sampling case by swapping out each discrete sampled token for its continuous distributional word embedding (from pretrained word2vec, GloVe, etc.), but no one has tried it yet TTBOMK.
- I would've liked to see a comparison of the two regularization terms in the generator. The experiments don't make it clear whether both or only one of them is used.
- I'm guessing that this architecture is quite challenging to train. Would've liked to see a bit more detail about when/how they trade off the training of the discriminator and generator.
- Translation is another obvious task to apply this to. I'm interested in whether or not this works for seq2seq.
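
To make the hidden-state-matching point concrete, here is a minimal sketch (not the authors' code) of the Professor Forcing setup: the same RNN is unrolled in teacher-forced and free-running mode, and a discriminator is trained to tell the two trajectories of hidden states apart. Because the hidden states are continuous, the generator gets its adversarial gradient by ordinary backpropagation through them, with no policy-gradient estimation; the sampled tokens only break the gradient path through the inputs. All module choices and sizes below (GRU cell, MLP discriminator, toy dimensions) are illustrative assumptions, not the paper's architecture, and the usual maximum-likelihood loss is omitted.

```python
# Minimal Professor-Forcing-style sketch in PyTorch (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim, hidden_dim, seq_len, batch = 50, 32, 64, 20, 8

embed = nn.Embedding(vocab_size, embed_dim)
rnn_cell = nn.GRUCell(embed_dim, hidden_dim)
readout = nn.Linear(hidden_dim, vocab_size)
# Toy discriminator: scores a whole (flattened) trajectory of hidden states.
discriminator = nn.Sequential(
    nn.Linear(seq_len * hidden_dim, 128), nn.ReLU(), nn.Linear(128, 1)
)

def unroll(tokens, teacher_forcing):
    """Unroll the RNN over seq_len steps and return the stacked hidden states."""
    h = torch.zeros(tokens.size(0), hidden_dim)
    inp = tokens[:, 0]
    states = []
    for t in range(seq_len):
        h = rnn_cell(embed(inp), h)
        states.append(h)
        logits = readout(h)
        if teacher_forcing and t + 1 < tokens.size(1):
            inp = tokens[:, t + 1]       # ground-truth next token
        else:
            inp = logits.argmax(dim=-1)  # model's own prediction (free-running)
    return torch.stack(states, dim=1).reshape(tokens.size(0), -1)

tokens = torch.randint(vocab_size, (batch, seq_len))  # toy "real" sequences
forced_states = unroll(tokens, teacher_forcing=True)
free_states = unroll(tokens, teacher_forcing=False)

# Discriminator loss: teacher-forced trajectories are "real", free-running are "fake".
d_real = discriminator(forced_states.detach())
d_fake = discriminator(free_states.detach())
d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
         F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))

# Generator (RNN) loss: make free-running hidden states look teacher-forced.
# The gradient flows straight back through the continuous hidden-state chain,
# even though the argmax on the input path is non-differentiable.
g_loss = F.binary_cross_entropy_with_logits(
    discriminator(free_states), torch.ones_like(d_fake)
)
# In practice one would alternate optimizer steps on d_loss (discriminator
# params only) and g_loss (RNN params only), alongside the standard NLL loss.
```

The flattened-MLP discriminator is just to keep the sketch short; the paper's discriminator consumes the behavior sequence with its own architecture, and the adversarial term is added on top of the standard teacher-forcing likelihood.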
