1 Week 7 Presentation Ngoc Ta Aidean Sharghi
An attention function maps a query and a set of key-value pairs to an output. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
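A minimal NumPy sketch of such an attention function, using scaled dot-product scores with a softmax as the compatibility function (as in the Transformer paper); the function name and shapes are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k) queries, K: (n_k, d_k) keys, V: (n_k, d_v) values.
    Returns, for each query, a weighted sum of the values; the weights
    come from a softmax over the query-key dot products."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # query-key compatibility
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the keys
    return weights @ V                                  # weighted sum of values
```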

2 End-to-end Video-level Representation Learning for Action Recognition
Deep network with Temporal Pyramid Pooling (DTPP): builds a video-level representation (rather than frame-level features) from enough frames sparsely sampled across the video.
The spatial stream (orange) takes RGB images as input; the temporal stream (blue) takes optical flow stacks as input.
Temporal pyramid pooling (TPP) aggregates the frame features into a video-level representation.
Finally, the scores of the two streams are combined by weighted averaging.
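A rough sketch of how temporal pyramid pooling over sampled frame features could be implemented; the pyramid levels (1, 2, 4 bins) and the use of max-pooling are illustrative assumptions, not details given on the slide.

```python
import numpy as np

def temporal_pyramid_pooling(frame_feats, levels=(1, 2, 4)):
    """frame_feats: (T, D) features of the sparsely sampled frames.
    At each pyramid level the frames are split into equal temporal bins,
    each bin is max-pooled, and all pooled vectors are concatenated into
    one fixed-length video-level representation."""
    T, _ = frame_feats.shape
    pooled = []
    for n_bins in levels:
        edges = np.linspace(0, T, n_bins + 1).astype(int)
        for b in range(n_bins):
            lo, hi = edges[b], max(edges[b + 1], edges[b] + 1)
            pooled.append(frame_feats[lo:hi].max(axis=0))
    return np.concatenate(pooled)   # length: sum(levels) * D
```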

3 Reconstruction Network for Video Captioning
Encoder (CNN): extracts semantic representations of video frames.
Decoder (LSTM): generates natural language describing the visual content.
Reconstructor: exploits the backward flow to reproduce the frame representations.
Temporal attention mechanism: selects the key frames/elements for captioning by assigning a weight to the representation of each frame.
METEOR: 34 with SA-LSTM, 32 with S2VT.
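A sketch of the temporal attention step, assuming a standard additive soft-attention form; the parameter names W_f, W_h, and w are hypothetical, and the encoder, decoder, and reconstructor themselves are omitted.

```python
import numpy as np

def temporal_attention(frame_feats, decoder_state, W_f, W_h, w):
    """frame_feats: (T, D_f), decoder_state: (D_h,),
    W_f: (D_f, D_a), W_h: (D_h, D_a), w: (D_a,).
    Scores each frame against the current decoder state, normalizes the
    scores with a softmax, and returns the attention-weighted sum of the
    frame representations (the context vector) plus the weights."""
    scores = np.tanh(frame_feats @ W_f + decoder_state @ W_h) @ w   # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ frame_feats, weights
```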

4 End-to-End Dense Video Captioning with Masked Transformer
Encoder: encodes the video into visual representations.
Proposal decoder: decodes the representations with different anchors to form event proposals.
Captioning decoder: uses a masking network to restrict its attention to the proposed events.
Attention Is All You Need (Transformer): relies on self-attention to compute representations of its input and output sequences.
Multi-head attention: allows the model to attend to information from different representation subspaces at different positions.
METEOR: 10 on ActivityNet Captions, 6.6 on YouCookII.
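A sketch of how the captioning decoder's attention can be restricted to one event proposal; the paper's masking network is differentiable, whereas this illustration simply uses a hard mask over the proposal interval.

```python
import numpy as np

def masked_attention_weights(scores, start, end):
    """scores: (n_q, T) raw attention scores over T encoded frames.
    Pushes the scores outside the proposal [start, end) to -inf before
    the softmax, so the captioning decoder only attends to frames that
    fall inside the proposed event."""
    T = scores.shape[-1]
    mask = np.zeros(T)
    mask[:start] = -np.inf
    mask[end:] = -np.inf
    masked = scores + mask
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)
```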

5 M3: Multimodal Memory Modelling for Video Captioning
CNN-based video encoder, multimodal memory, and LSTM-based text decoder.
Builds a memory shared between visual and textual information.
Temporal soft-attention: selects the visual information most related to each word.
Procedure:
1. Write the hidden state to update the memory.
2. Read the updated memory content to perform soft-attention.
3. Write the selected visual information to update the memory again.
4. Read the updated memory content for the next word prediction.
METEOR: 26.6 on MSR-VTT.
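A toy shared memory illustrating the read/write cycle above; the slot update and addressing rules here are simplified assumptions, not the actual M3 mechanisms. In the four-step procedure, steps 1 and 3 would call write() and steps 2 and 4 would call read().

```python
import numpy as np

class MultimodalMemory:
    """A toy memory shared between visual and textual information."""

    def __init__(self, n_slots, dim):
        self.M = np.zeros((n_slots, dim))   # memory slots

    def write(self, content, address):
        """content: (dim,) vector to store, address: (n_slots,) soft weights.
        Blends the new content into each slot according to the address."""
        self.M = (1 - address[:, None]) * self.M + address[:, None] * content

    def read(self, query):
        """query: (dim,) vector. Soft-attention read over the slots."""
        scores = self.M @ query
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ self.M
```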

6 Jointly Localizing and Describing Events for Dense Video Captioning
Base layers reduce the temporal dimension of the feature map and increase the size of the temporal receptive fields.
Nine anchor layers are stacked on top of the base layers; these anchor layers progressively decrease the temporal dimension of the feature map.
For each anchor layer, its output feature map is fed into a prediction layer to produce a fixed set of predictions in a one-shot manner.
Reinforcement learning: the caption LSTM is optimized with a METEOR-based reward loss to obtain better captions.
METEOR: 13 on ActivityNet Captions.
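A minimal sketch of a METEOR-reward policy-gradient loss for the caption LSTM, assuming a self-critical baseline (e.g. the METEOR score of the greedily decoded caption); the exact reward formulation in the paper may differ, and the METEOR scorer itself is external.

```python
import torch

def meteor_reward_loss(log_probs, sampled_meteor, baseline_meteor):
    """log_probs: (T,) log-probabilities of the words in a sampled caption.
    Policy-gradient (REINFORCE) loss: sampled captions whose METEOR score
    exceeds the baseline are reinforced, those below it are suppressed.
    The two METEOR scores come from an external scorer (not shown here)."""
    advantage = sampled_meteor - baseline_meteor
    return -advantage * log_probs.sum()
```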

7 Weakly Supervised Dense Event Captioning in Videos
Sentence localizer: an attention mechanism called Crossing Attention with two sub-attention computations (one computes attention between the final hidden state of the video and the caption features; the other computes attention between the final hidden state of the caption and the video features).
The input video is divided into multiple anchor segments at multiple scales, and a fully connected layer is trained to predict the anchor whose generated caption obtains the highest METEOR score; regression is then performed around that best anchor.
Caption generator: performs soft clipping by defining a continuous mask function.
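One way to realize such a continuous mask for soft clipping is a difference of two sigmoids over frame indices, which keeps the segment boundaries differentiable; this particular functional form is an illustrative assumption, not necessarily the one used in the paper.

```python
import numpy as np

def soft_clip_mask(T, start, end, sharpness=10.0):
    """Continuous mask over T frame indices that is close to 1 inside the
    predicted segment [start, end] and decays smoothly to 0 outside it."""
    t = np.arange(T)
    rise = 1.0 / (1.0 + np.exp(-sharpness * (t - start)))
    fall = 1.0 / (1.0 + np.exp(-sharpness * (t - end)))
    return rise - fall
```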

