Reconstruction Network for Video Captioning #6
harshraj22 started this conversation in Paper Review
Reconstruction Network for Video Captioning 📓
CVPR 2018
The main pitch of the paper is the reconstruction network (they propose two different reconstructor variants and benchmark both in the paper), which tries to reconstruct the feature representation of the video (extracted by Inception V4) from the hidden states of the decoder RNN. This forces the decoder's hidden states to retain more information about the video, leading to better learning.
The encoder-decoder architecture they use for caption generation is fairly standard, with video features extracted by InceptionV4 and captions generated by an LSTM. A point worth noting is that, in order to capture future context while generating a word, they use temporal attention over the extracted frame features instead of the popular bidirectional LSTM approach.
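A minimal PyTorch sketch of such temporal (soft) attention over per-frame features is shown below. The dimensions (1536-d InceptionV4 features, 512-d decoder hidden state) and module names are my own assumptions for illustration, not the authors' exact implementation.

```python
# Sketch: soft temporal attention over per-frame video features,
# conditioned on the current decoder hidden state.
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    def __init__(self, feat_dim=1536, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, dec_hidden):
        # feats: (batch, num_frames, feat_dim) -- per-frame InceptionV4 features
        # dec_hidden: (batch, hidden_dim)      -- current decoder LSTM hidden state
        energy = torch.tanh(self.feat_proj(feats) + self.hidden_proj(dec_hidden).unsqueeze(1))
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=1)   # (batch, num_frames)
        context = (weights.unsqueeze(-1) * feats).sum(dim=1)             # (batch, feat_dim)
        return context, weights
```

At each decoding step, the attended context vector would typically be concatenated with the previous word embedding and fed to the decoder LSTM.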
1. Global Reconstructor:
The hidden states from the decoder are mean pooled. The reconstructor is essentially an RNN that takes, at each step, a decoder hidden state along with the mean-pooled vector as input. The hidden states from the reconstructor are then mean pooled, and this result is used to compute the loss. The loss is simply the Euclidean distance between the mean-pooled reconstructor output and the mean pool of the video features (see the first sketch below the list).
2. Local Reconstructor:
It reconstructs each of the extracted frame features. The reconstructor RNN uses attention over the hidden states of the decoder to generate its own hidden states, which are compared against the frame features to compute the loss (see the second sketch below the list).
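The first sketch below illustrates the global reconstructor idea: an RNN consumes each decoder hidden state together with the mean-pooled decoder states, and its mean-pooled outputs are compared to the mean-pooled video features with a Euclidean (L2) loss. Dimensions and layer choices are assumptions, not the paper's exact configuration.

```python
# Sketch: global reconstructor -- reconstruct the mean-pooled video feature
# from the decoder's hidden states.
import torch
import torch.nn as nn


class GlobalReconstructor(nn.Module):
    def __init__(self, hidden_dim=512, feat_dim=1536):
        super().__init__()
        # input at each step: a decoder hidden state + the mean-pooled decoder states
        self.rnn = nn.LSTM(hidden_dim * 2, feat_dim, batch_first=True)

    def forward(self, dec_hiddens):
        # dec_hiddens: (batch, num_words, hidden_dim) -- decoder hidden states
        mean_hidden = dec_hiddens.mean(dim=1, keepdim=True)                      # (batch, 1, hidden_dim)
        rnn_in = torch.cat([dec_hiddens, mean_hidden.expand_as(dec_hiddens)], dim=-1)
        rec_hiddens, _ = self.rnn(rnn_in)                                        # (batch, num_words, feat_dim)
        return rec_hiddens.mean(dim=1)                                           # reconstructed global feature


def global_reconstruction_loss(rec_global, video_feats):
    # video_feats: (batch, num_frames, feat_dim); compare against their mean pool
    target = video_feats.mean(dim=1)
    return torch.norm(rec_global - target, p=2, dim=1).mean()
```

The second sketch illustrates the local reconstructor idea: at each frame step the reconstructor attends over all decoder hidden states, feeds the attended context into an RNN cell, and the resulting hidden states are trained to match the per-frame video features. Again, module names and dimensions are illustrative assumptions.

```python
# Sketch: local reconstructor -- reconstruct per-frame features by attending
# over the decoder's hidden states at every frame step.
import torch
import torch.nn as nn


class LocalReconstructor(nn.Module):
    def __init__(self, hidden_dim=512, feat_dim=1536, attn_dim=256):
        super().__init__()
        self.hid_proj = nn.Linear(hidden_dim, attn_dim)
        self.state_proj = nn.Linear(feat_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)
        self.rnn_cell = nn.LSTMCell(hidden_dim, feat_dim)

    def forward(self, dec_hiddens, num_frames):
        # dec_hiddens: (batch, num_words, hidden_dim) -- decoder hidden states
        batch = dec_hiddens.size(0)
        h = dec_hiddens.new_zeros(batch, self.rnn_cell.hidden_size)
        c = dec_hiddens.new_zeros(batch, self.rnn_cell.hidden_size)
        outputs = []
        for _ in range(num_frames):
            # soft attention over decoder hidden states, conditioned on the
            # reconstructor's own previous state h
            energy = torch.tanh(self.hid_proj(dec_hiddens) + self.state_proj(h).unsqueeze(1))
            weights = torch.softmax(self.score(energy).squeeze(-1), dim=1)
            context = (weights.unsqueeze(-1) * dec_hiddens).sum(dim=1)    # (batch, hidden_dim)
            h, c = self.rnn_cell(context, (h, c))
            outputs.append(h)
        return torch.stack(outputs, dim=1)  # (batch, num_frames, feat_dim) reconstructed features
```

The local reconstruction loss would then be, for example, the mean Euclidean distance between these outputs and the corresponding per-frame InceptionV4 features.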