Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning #11
harshraj22 started this conversation in Paper Review
Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning 📓
CVPR 2018
The authors tackle dense video captioning: temporally localizing events in a video and generating a caption for each of them.
They encode video frames into clip-level features using a 3D CNN. These features are then fed into a bidirectional LSTM, which encodes both the past and future context of the video at each time step.
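A minimal sketch of the bidirectional context encoding. This is a toy simple-RNN stand-in for the paper's LSTM, with made-up dimensions; the point is only that each time step ends up with a concatenation of a forward (past) and a backward (future) hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, H = 8, 16, 32                 # clip count, feature dim, hidden size (toy values)
clips = rng.standard_normal((T, D))  # stand-in for 3D-CNN clip features

# Toy tanh-RNN cell standing in for the paper's LSTM.
Wx = rng.standard_normal((D, H)) * 0.1
Wh = rng.standard_normal((H, H)) * 0.1

def run(seq):
    h = np.zeros(H)
    out = []
    for x in seq:
        h = np.tanh(x @ Wx + h @ Wh)
        out.append(h)
    return np.stack(out)

fwd = run(clips)                # past context at each step
bwd = run(clips[::-1])[::-1]    # future context at each step
context = np.concatenate([fwd, bwd], axis=1)  # (T, 2H): past + future per step
```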
At each time step, K independent classifiers produce K scores corresponding to K candidate time intervals for events. The K scores from the forward and backward LSTM passes are merged, and only the intervals whose merged score exceeds a fixed threshold are retained. They do not perform non-maximum suppression, since events in a video can overlap.
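The scoring-and-thresholding step could be sketched as below. The merge rule (product of the two directions' confidences) and the threshold value are assumptions for illustration, not necessarily the paper's exact choices.

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 8, 4                    # time steps, candidate intervals per step (toy values)

# Hypothetical per-step, per-interval confidences from the two LSTM passes.
fwd_scores = rng.random((T, K))
bwd_scores = rng.random((T, K))

merged = fwd_scores * bwd_scores        # one simple way to fuse both directions
threshold = 0.5
keep = np.argwhere(merged > threshold)  # (t, k) pairs of retained proposals
# Note: no non-maximum suppression, so overlapping proposals are all kept.
```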
For the captioning part, an LSTM generates one word at a time, attending over the video features within the proposed time interval.
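One decoder step of this attention could look like the following toy sketch. The bilinear attention form and all dimensions are assumptions; it just shows a softmax-weighted sum of the proposal's features conditioned on the decoder state.

```python
import numpy as np

rng = np.random.default_rng(2)
Tp, F, H = 5, 32, 16    # proposal length, feature dim, decoder hidden size (toy)

feats = rng.standard_normal((Tp, F))    # video features inside the proposed interval
h_dec = rng.standard_normal(H)          # current decoder LSTM hidden state
Wa = rng.standard_normal((H, F)) * 0.1  # hypothetical bilinear attention weights

scores = feats @ (Wa.T @ h_dec)         # relevance of each clip to the decoder state
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                    # softmax attention weights
attended = alpha @ feats                # weighted sum fed to the next word prediction
```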
Loss Function:
The total loss is a combination of two losses: one for the event proposals and one for captioning.
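A minimal sketch of combining the two losses, assuming a weighted sum; the actual weighting is among the pending details below.

```python
# Hypothetical weighted sum; the weight `lam` is an assumption, not from the paper.
def total_loss(proposal_loss, caption_loss, lam=1.0):
    return proposal_loss + lam * caption_loss
```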
Pending:
- [ ] Details about the loss functions
- [ ] More details about the captioning module
- [ ] Images