Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning #11
harshraj22 started this conversation in Paper Review
Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning 📓
CVPR 2018
The authors tackle dense video captioning: temporally localizing events in a video and generating a caption for each of them.
They encode video frames into clip-level features using a 3D CNN. These features are then fed into a bidirectional LSTM, which encodes both the past and future context of the video at each time step.
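A minimal sketch of the bidirectional context encoding. This is a toy simple-RNN stand-in for the paper's LSTM, with made-up dimensions; the point is only that each time step ends up with a concatenation of a forward (past) and a backward (future) hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, H = 8, 16, 32                 # clip count, feature dim, hidden size (toy values)
clips = rng.standard_normal((T, D))  # stand-in for 3D-CNN clip features

# Toy tanh-RNN cell standing in for the paper's LSTM.
Wx = rng.standard_normal((D, H)) * 0.1
Wh = rng.standard_normal((H, H)) * 0.1

def run(seq):
    h = np.zeros(H)
    out = []
    for x in seq:
        h = np.tanh(x @ Wx + h @ Wh)
        out.append(h)
    return np.stack(out)

fwd = run(clips)                # past context at each step
bwd = run(clips[::-1])[::-1]    # future context at each step
context = np.concatenate([fwd, bwd], axis=1)  # (T, 2H): past + future per step
```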
At each time step, K independent classifiers produce K scores corresponding to K candidate time intervals for events. The K scores from the forward and backward LSTM passes are merged, and only the intervals whose merged score exceeds a fixed threshold are retained. They do not perform non-maximum suppression, since events in a video can overlap.
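The scoring-and-thresholding step could be sketched as below. The merge rule (product of the two directions' confidences) and the threshold value are assumptions for illustration, not necessarily the paper's exact choices.

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 8, 4                    # time steps, candidate intervals per step (toy values)

# Hypothetical per-step, per-interval confidences from the two LSTM passes.
fwd_scores = rng.random((T, K))
bwd_scores = rng.random((T, K))

merged = fwd_scores * bwd_scores        # one simple way to fuse both directions
threshold = 0.5
keep = np.argwhere(merged > threshold)  # (t, k) pairs of retained proposals
# Note: no non-maximum suppression, so overlapping proposals are all kept.
```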
For the captioning part, an LSTM generates one word at a time, attending over the video features within the proposed time interval.
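One decoder step of this attention could look like the following toy sketch. The bilinear attention form and all dimensions are assumptions; it just shows a softmax-weighted sum of the proposal's features conditioned on the decoder state.

```python
import numpy as np

rng = np.random.default_rng(2)
Tp, F, H = 5, 32, 16    # proposal length, feature dim, decoder hidden size (toy)

feats = rng.standard_normal((Tp, F))    # video features inside the proposed interval
h_dec = rng.standard_normal(H)          # current decoder LSTM hidden state
Wa = rng.standard_normal((H, F)) * 0.1  # hypothetical bilinear attention weights

scores = feats @ (Wa.T @ h_dec)         # relevance of each clip to the decoder state
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                    # softmax attention weights
attended = alpha @ feats                # weighted sum fed to the next word prediction
```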
Loss Function:
The total loss is a combination of two losses: one for the event proposals and one for captioning.
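A minimal sketch of combining the two losses, assuming a weighted sum; the actual weighting is among the pending details below.

```python
# Hypothetical weighted sum; the weight `lam` is an assumption, not from the paper.
def total_loss(proposal_loss, caption_loss, lam=1.0):
    return proposal_loss + lam * caption_loss
```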
Pending:
- [ ] Details about the loss functions
- [ ] More details about the captioning module
- [ ] Images