1
Week 3 Presentation Ngoc Ta Aidean Sharghi
2
Research Papers
- Bidirectional Attentive Fusion With Context Gating for Dense Video Captioning
- End-to-End Dense Video Captioning with Masked Transformer
- Video Captioning via Hierarchical Reinforcement Learning
- M3: Multimodal Memory Modelling for Video Captioning
3
End-to-End Dense Video Captioning with Masked Transformer
Code available:
Dataset: ActivityNet Captions and YouCookII
Method: An end-to-end model composed of an encoder and two decoders (a proposal decoder and a captioning decoder), using self-attention for dense video captioning.
1. The encoder encodes the input video into suitable visual representations.
2. The proposal decoder then decodes from these representations with different anchors to form video event proposals.
3. The captioning decoder employs a differentiable masking network to restrict its attention to the proposed event, which ensures consistency between the proposal and the captioning during training (see the sketch below).
Self-attention enables the use of an efficient non-recurrent structure during encoding.
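Step 3 hinges on a mask that is differentiable with respect to the proposal boundaries. Below is a minimal PyTorch sketch, not the authors' code: the sigmoid-product mask, the `sharpness` parameter, and the log-mask attention bias are illustrative assumptions.

```python
import torch

def soft_proposal_mask(start, end, num_steps, sharpness=10.0):
    """Soft 0/1 mask over time steps, differentiable w.r.t. start/end."""
    t = torch.arange(num_steps, dtype=torch.float32)
    # Product of two sigmoids: ~1 inside [start, end], ~0 outside.
    return torch.sigmoid(sharpness * (t - start)) * torch.sigmoid(sharpness * (end - t))

def masked_attention(query, keys, values, mask):
    """Scaled dot-product attention restricted to the proposed event via the mask."""
    scores = (keys @ query) / keys.shape[-1] ** 0.5   # similarity per time step, (T,)
    scores = scores + torch.log(mask + 1e-8)          # suppress frames outside the event
    weights = torch.softmax(scores, dim=0)
    return weights @ values                           # attended visual feature, (d,)

# Toy usage: 100 encoded frames of dim 64, proposal roughly spanning steps 20-55.
T, d = 100, 64
enc = torch.randn(T, d)
mask = soft_proposal_mask(torch.tensor(20.0), torch.tensor(55.0), T)
context = masked_attention(torch.randn(d), enc, enc, mask)
print(context.shape)  # torch.Size([64])
```

Because the mask is built from sigmoids rather than hard thresholds, gradients can flow back to the proposal boundaries, which is what ties the proposal and captioning decoders together during training.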
4
Video Captioning via Hierarchical Reinforcement Learning
Code: N/A
Dataset: MSR-VTT and Charades
Method: Aims at improving the fine-grained generation of video descriptions with rich activities; gains the largest improvements on detailed descriptions of longer videos.
1. Video frame features are first extracted by a pretrained CNN model.
2. The features are passed through a low-level Bi-LSTM encoder and a high-level LSTM encoder to obtain the low-level and high-level encoder outputs.
3. HRL works as a decoder composed of three components: a low-level worker, a high-level manager, and an internal critic (see the control-flow sketch below).
4. The whole pipeline terminates once an end-of-sentence token is reached.
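A control-flow sketch of the manager/worker/critic interaction is below. All three modules are stand-in stubs (random choices), not the paper's networks; only the decoding loop structure is the point.

```python
import random

EOS = "<eos>"
VOCAB = ["a", "man", "is", "slicing", "an", "onion", EOS]

def manager(context):          # high-level policy: emits a latent sub-goal
    return random.random()

def worker(goal, context):     # low-level policy: emits the next word toward the goal
    return random.choice(VOCAB)

def internal_critic(segment):  # signals whether the current sub-goal has been achieved
    return len(segment) >= 3

def hrl_decode(video_context, max_len=20):
    sentence, segment = [], []
    goal = manager(video_context)
    while len(sentence) < max_len:
        word = worker(goal, video_context)
        sentence.append(word)
        segment.append(word)
        if word == EOS:                   # whole pipeline terminates at end-of-sentence
            break
        if internal_critic(segment):      # segment finished: manager issues a new goal
            goal = manager(video_context)
            segment = []
    return sentence

print(hrl_decode(video_context=None))
```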
5
M3: Multimodal Memory Modelling for Video Captioning
Code: N/A
Dataset: MSVD and MSR-VTT
Method: M3 contains three parts: a CNN-based video encoder, a multimodal memory, and an LSTM-based text decoder.
- The CNN-based video encoder first extracts video frame/clip features using pretrained 2D/3D CNNs.
- The LSTM-based text decoder models sentence generation and writes the updated representation to the memory.
- The multimodal memory contains a memory matrix Mem that interacts with both the video and the sentence.
Procedure (sketched below):
1. Write hidden states to update the memory.
2. Read the updated memory content to perform soft attention over the video features.
3. Write the selected visual information to update the memory again.
4. Read the updated memory content for next-word prediction.
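A minimal numpy sketch of the four-step read/write loop around a memory matrix follows. Dot-product addressing with a softmax and additive writes are assumptions for illustration, not the paper's exact update equations.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_read(mem, query):
    """Soft-attention read: weight memory slots by similarity to the query."""
    weights = softmax(mem @ query)            # (slots,)
    return weights @ mem                      # (dim,)

def memory_write(mem, content, query):
    """Soft write: blend new content into the slots addressed by the query."""
    weights = softmax(mem @ query)            # (slots,)
    return mem + np.outer(weights, content)   # (slots, dim)

slots, dim = 8, 32
mem = np.zeros((slots, dim))
h_t = np.random.randn(dim)                    # decoder hidden state
v_feats = np.random.randn(20, dim)            # frame/clip features from the CNN encoder

# 1. Write the hidden state to update the memory.
mem = memory_write(mem, h_t, query=h_t)
# 2. Read the updated memory and use it to soft-attend over the video features.
read = memory_read(mem, query=h_t)
visual = softmax(v_feats @ read) @ v_feats
# 3. Write the selected visual information to update the memory again.
mem = memory_write(mem, visual, query=visual)
# 4. Read the updated memory content for next-word prediction.
word_context = memory_read(mem, query=h_t)
print(word_context.shape)                     # (32,)
```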
6
Bidirectional Attentive Fusion With Context Gating for Dense Video Captioning
Code available:
Dataset: ActivityNet Captions (20k untrimmed real-life YouTube videos, 120 seconds long on average)
Goal: to automatically localize events in a video and describe each one with a sentence.
Method: A novel bidirectional proposal framework (Bidirectional SST) encodes both past and future contexts, motivated by the observation that both past and future contexts help better localize the current event.
7
Bidirectional Attentive Fusion With Context Gating for Dense Video Captioning
In the forward pass, they learn k binary classifiers corresponding to k anchors densely at each time step. In the backward pass, they reverse the video sequence input and predict proposals backward. This means the forward pass encodes past context plus current event information, while the backward pass encodes future context plus current event information. Finally, they merge the proposal scores for the same predictions from the two passes and output the final proposals (a fusion sketch follows below).
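A small numpy sketch of how forward- and backward-pass anchor scores for the same proposal could be merged is below. The anchor lengths, the random scores, and the multiplicative fusion are illustrative assumptions, not the paper's exact fusion rule.

```python
import numpy as np

T = 50                                    # encoded time steps
anchor_lengths = [4, 8, 16, 32]           # k anchors (lengths chosen for illustration)
K = len(anchor_lengths)

rng = np.random.default_rng(0)
fwd = rng.random((T, K))   # fwd[t, k]: score of the proposal ending at t (past + current)
bwd = rng.random((T, K))   # bwd[t', k]: score from the reversed input (future + current)

proposals = []
for k, L in enumerate(anchor_lengths):
    for end in range(L, T):
        start = end - L
        # Running over the reversed sequence, the backward pass reaches this
        # proposal's start point at reversed step T-1-start.
        score = fwd[end, k] * bwd[T - 1 - start, k]   # merge the two passes
        proposals.append((start, end, score))

# Output the highest-scoring proposals as the final set.
proposals.sort(key=lambda p: p[2], reverse=True)
print(proposals[:3])
```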
8
Bidirectional Attentive Fusion With Context Gating for Dense Video Captioning
9
Bidirectional Attentive Fusion With Context Gating for Dense Video Captioning
To construct a more discriminative proposal representation, they fuse proposal state information and detected video content together, which helps discriminate highly overlapping events. To output more confident results, they use a joint ranking technique to select high-confidence proposal-caption pairs by taking both the proposal score and the caption confidence into consideration (an illustrative sketch follows below).
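The sketch below illustrates the joint-ranking idea only: the length-normalized caption log-probability as the "caption confidence", the weight alpha, and the threshold are assumptions, not the paper's exact formula.

```python
import math

def joint_score(proposal_score, caption_logprob, caption_len, alpha=0.5):
    """Combine proposal confidence with a length-normalized caption confidence."""
    caption_conf = math.exp(caption_logprob / max(caption_len, 1))
    return alpha * proposal_score + (1 - alpha) * caption_conf

# Candidate (proposal score, caption log-prob, caption length) triples.
candidates = [
    (0.92, -6.0, 8),    # confident proposal, fluent caption
    (0.95, -20.0, 7),   # confident proposal, poor caption
    (0.40, -5.5, 9),    # weak proposal, fluent caption
]

# Rank pairs by the joint score and keep those above a confidence threshold.
ranked = sorted(((joint_score(p, lp, n), (p, lp, n)) for p, lp, n in candidates), reverse=True)
high_confidence = [c for s, c in ranked if s > 0.6]
print(high_confidence)
```

Ranking on the combined score rather than the proposal score alone lets a fluent caption rescue a borderline proposal, and a poor caption demote an otherwise confident one.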
10
Bidirectional Attentive Fusion With Context Gating for Dense Video Captioning
11
THANK YOU