Reconstruction Network for Video Captioning #6
harshraj22 started this conversation in Paper Review
Reconstruction Network for Video Captioning 📓
CVPR 2018
The main pitch of the paper is the reconstruction network (they propose two different reconstructor variants and benchmark both in the paper), which tries to reconstruct the feature representation of the video (extracted by Inception V4) from the hidden states of the decoder RNN. This forces the decoder's hidden states to retain more information about the video, leading to better learning.
The encoder-decoder architecture they use for caption generation is fairly standard, with video features extracted by InceptionV4 and captions generated by an LSTM. A point worth noting is that, in order to capture future context while generating a word, they use temporal attention over the extracted frame features instead of the popular bidirectional LSTM approach.
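A minimal PyTorch sketch of such temporal (soft) attention over per-frame features is shown below. The dimensions (1536-d InceptionV4 features, 512-d decoder hidden state) and module names are my own assumptions for illustration, not the authors' exact implementation.

```python
# Sketch: soft temporal attention over per-frame video features,
# conditioned on the current decoder hidden state.
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    def __init__(self, feat_dim=1536, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, dec_hidden):
        # feats: (batch, num_frames, feat_dim) -- per-frame InceptionV4 features
        # dec_hidden: (batch, hidden_dim)      -- current decoder LSTM hidden state
        energy = torch.tanh(self.feat_proj(feats) + self.hidden_proj(dec_hidden).unsqueeze(1))
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=1)   # (batch, num_frames)
        context = (weights.unsqueeze(-1) * feats).sum(dim=1)             # (batch, feat_dim)
        return context, weights
```

At each decoding step, the attended context vector would typically be concatenated with the previous word embedding and fed to the decoder LSTM.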
1. Global Reconstructor:
The hidden states from the decoder are mean pooled. The reconstructor is essentially an RNN that takes, at each step, a decoder hidden state along with the mean-pooled vector as input. The hidden states from the reconstructor are then mean pooled, and this result is used to compute the loss. The loss is simply the Euclidean distance between the mean-pooled reconstructor output and the mean pool of the video features (see the first sketch below the list).
2. Local Reconstructor:
It reconstructs each of the extracted frame features. The reconstructor RNN uses attention over the hidden states of the decoder to generate its own hidden states, which are compared against the frame features to compute the loss (see the second sketch below the list).
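The first sketch below illustrates the global reconstructor idea: an RNN consumes each decoder hidden state together with the mean-pooled decoder states, and its mean-pooled outputs are compared to the mean-pooled video features with a Euclidean (L2) loss. Dimensions and layer choices are assumptions, not the paper's exact configuration.

```python
# Sketch: global reconstructor -- reconstruct the mean-pooled video feature
# from the decoder's hidden states.
import torch
import torch.nn as nn


class GlobalReconstructor(nn.Module):
    def __init__(self, hidden_dim=512, feat_dim=1536):
        super().__init__()
        # input at each step: a decoder hidden state + the mean-pooled decoder states
        self.rnn = nn.LSTM(hidden_dim * 2, feat_dim, batch_first=True)

    def forward(self, dec_hiddens):
        # dec_hiddens: (batch, num_words, hidden_dim) -- decoder hidden states
        mean_hidden = dec_hiddens.mean(dim=1, keepdim=True)                      # (batch, 1, hidden_dim)
        rnn_in = torch.cat([dec_hiddens, mean_hidden.expand_as(dec_hiddens)], dim=-1)
        rec_hiddens, _ = self.rnn(rnn_in)                                        # (batch, num_words, feat_dim)
        return rec_hiddens.mean(dim=1)                                           # reconstructed global feature


def global_reconstruction_loss(rec_global, video_feats):
    # video_feats: (batch, num_frames, feat_dim); compare against their mean pool
    target = video_feats.mean(dim=1)
    return torch.norm(rec_global - target, p=2, dim=1).mean()
```

The second sketch illustrates the local reconstructor idea: at each frame step the reconstructor attends over all decoder hidden states, feeds the attended context into an RNN cell, and the resulting hidden states are trained to match the per-frame video features. Again, module names and dimensions are illustrative assumptions.

```python
# Sketch: local reconstructor -- reconstruct per-frame features by attending
# over the decoder's hidden states at every frame step.
import torch
import torch.nn as nn


class LocalReconstructor(nn.Module):
    def __init__(self, hidden_dim=512, feat_dim=1536, attn_dim=256):
        super().__init__()
        self.hid_proj = nn.Linear(hidden_dim, attn_dim)
        self.state_proj = nn.Linear(feat_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)
        self.rnn_cell = nn.LSTMCell(hidden_dim, feat_dim)

    def forward(self, dec_hiddens, num_frames):
        # dec_hiddens: (batch, num_words, hidden_dim) -- decoder hidden states
        batch = dec_hiddens.size(0)
        h = dec_hiddens.new_zeros(batch, self.rnn_cell.hidden_size)
        c = dec_hiddens.new_zeros(batch, self.rnn_cell.hidden_size)
        outputs = []
        for _ in range(num_frames):
            # soft attention over decoder hidden states, conditioned on the
            # reconstructor's own previous state h
            energy = torch.tanh(self.hid_proj(dec_hiddens) + self.state_proj(h).unsqueeze(1))
            weights = torch.softmax(self.score(energy).squeeze(-1), dim=1)
            context = (weights.unsqueeze(-1) * dec_hiddens).sum(dim=1)    # (batch, hidden_dim)
            h, c = self.rnn_cell(context, (h, c))
            outputs.append(h)
        return torch.stack(outputs, dim=1)  # (batch, num_frames, feat_dim) reconstructed features
```

The local reconstruction loss would then be, for example, the mean Euclidean distance between these outputs and the corresponding per-frame InceptionV4 features.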