Week 7 Presentation (Ngoc Ta, Aidean Sharghi)
Attention function: maps a query and a set of key-value pairs to an output. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key (a minimal sketch follows).
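To make this concrete, here is a minimal NumPy sketch of weighted-sum attention. The scaled dot-product compatibility function and the softmax normalization are assumptions borrowed from the Transformer, not part of the generic definition above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Weighted sum of values V, with weights from query-key compatibility.

    Q: (n_q, d), K: (n_k, d), V: (n_k, d_v)
    """
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # compatibility of each query with each key
    weights = softmax(scores, axis=-1)        # attention weights sum to 1 over the keys
    return weights @ V                        # weighted sum of the values
```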
End-to-end Video-level Representation Learning for Action Recognition
Deep network with Temporal Pyramid Pooling (DTPP): builds a video-level representation (rather than frame-level features) from frames sparsely sampled across the whole video.
- Spatial stream (orange) takes RGB images as input.
- Temporal stream (blue) takes optical flow stacks as input.
- Temporal pyramid pooling (TPP) aggregates the frame features into the video-level representation (see the sketch below).
- Finally, the scores of the two streams are fused by a weighted average.
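A rough NumPy sketch of temporal pyramid pooling over frame features; the pyramid levels (1, 2, 4) and the use of max pooling within each segment are assumptions for illustration, not necessarily the paper's exact configuration.

```python
import numpy as np

def temporal_pyramid_pooling(frame_features, levels=(1, 2, 4)):
    """Aggregate per-frame features (T, d) into one fixed-size video-level vector.

    At each pyramid level the timeline is split into that many segments,
    each segment is pooled, and all pooled vectors are concatenated.
    Assumes T >= max(levels).
    """
    T, d = frame_features.shape
    pooled = []
    for n_segments in levels:
        bounds = np.linspace(0, T, n_segments + 1).astype(int)
        for i in range(n_segments):
            segment = frame_features[bounds[i]:bounds[i + 1]]
            pooled.append(segment.max(axis=0))       # pool within this temporal segment
    return np.concatenate(pooled)                    # length d * sum(levels)
```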
RECONSTRUCTION NETWORK FOR VIDEO CAPTIONING
- Encoder (CNN): extracts semantic representations of the video frames.
- Decoder (LSTM): generates natural-language descriptions of the visual content.
- Reconstructor: exploits the backward flow (sentence to video) to reproduce the frame representations.
- Temporal attention mechanism: selects the key frames/elements for captioning by assigning a weight to the representation of each frame (see the sketch below).
- METEOR: 34 with SA-LSTM, 32 with S2VT.
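The sketch below shows generic soft temporal attention of this kind: each frame representation is scored against the decoder state, the scores are normalized, and the weighted sum forms the context for the next word. The additive scoring form and the projection matrices W_f, W_h, w are hypothetical, not the paper's exact equations.

```python
import numpy as np

def temporal_attention(frame_feats, decoder_state, W_f, W_h, w):
    """Soft temporal attention: weight each frame by its relevance to the decoder state.

    frame_feats: (T, d_f), decoder_state: (d_h,)
    W_f: (d_a, d_f), W_h: (d_a, d_h), w: (d_a,)  -- learned projections (hypothetical shapes)
    """
    scores = np.tanh(frame_feats @ W_f.T + decoder_state @ W_h.T) @ w   # (T,) relevance scores
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                                   # attention weights over frames
    return weights @ frame_feats                                        # context vector for this word
```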
End-to-End Dense Video Captioning with Masked Transformer
- Encoder: encodes the video into visual representations.
- Proposal decoder: decodes the representations with different anchors to form event proposals.
- Captioning decoder: uses a masking network to restrict its attention to the proposed events (see the sketch below).
Background (Attention Is All You Need):
- Transformer: relies on self-attention to compute representations of its input and output sequences.
- Multi-head attention: allows the model to attend to information from different representation subspaces at different positions.
METEOR: 10 on ActivityNet Captions, 6.6 on YouCookII.
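A minimal sketch of attention restricted to a proposal segment. In the paper the mask is produced by a differentiable masking network; here a hard [start, end) window is assumed purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V, start, end):
    """Attention restricted to a proposed event segment [start, end).

    Positions outside the proposal receive -inf scores, so their weights become 0.
    Q: (n_q, d), K: (T, d), V: (T, d_v)
    """
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n_q, T)
    mask = np.full(K.shape[0], -np.inf)
    mask[start:end] = 0.0                     # keep only frames inside the proposal
    return softmax(scores + mask, axis=-1) @ V
```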
M3: Multimodal Memory Modelling for Video Captioning
- CNN-based video encoder, a multimodal memory, and an LSTM-based text decoder.
- Builds a shared visual and textual memory.
- Temporal soft-attention: selects the visual information most related to each word.
Procedure (sketched in code below):
1. Write hidden states to update the memory.
2. Read the updated memory content to perform soft-attention.
3. Write the selected visual information to update the memory again.
4. Read the updated memory content for next-word prediction.
METEOR: 26.6 on MSR-VTT.
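The following sketch mirrors the four-step write/read procedure for one decoding step. The additive memory updates and the projection matrices (W_write_h, W_write_v, W_read) are simplifying assumptions; the actual M3 updates are more elaborate.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def m3_step(memory, decoder_hidden, frame_feats, W_write_h, W_write_v, W_read):
    """One decoding step of the write/read procedure above (shapes hypothetical).

    memory: (d_m,) shared visual-textual memory
    decoder_hidden: (d_h,), frame_feats: (T, d_f)
    """
    # 1. Write: update the memory with the decoder hidden state
    memory = memory + W_write_h @ decoder_hidden
    # 2. Read: use the memory to compute soft-attention over the frame features
    scores = frame_feats @ (W_read @ memory)              # (T,)
    context = softmax(scores) @ frame_feats               # selected visual information
    # 3. Write: update the memory again with the selected visual content
    memory = memory + W_write_v @ context
    # 4. Read: the updated memory (plus context) conditions next-word prediction
    return memory, context
```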
Jointly Localizing and Describing Events for Dense Video Captioning
- Base layers reduce the temporal dimension of the feature map and increase the size of the temporal receptive fields.
- Nine anchor layers are stacked on top of the base layers; they progressively decrease the temporal dimension of the feature map.
- For each anchor layer, its output feature map is fed into a prediction layer to produce a fixed set of predictions in a one-shot manner.
- Reinforcement learning: the caption LSTM is optimized with a METEOR-based reward loss to produce better captions (see the sketch below).
METEOR: 13 on ActivityNet Captions.
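A minimal sketch of a METEOR-based policy-gradient (REINFORCE) objective for the caption LSTM. The use of a greedy-decoding baseline is an assumption for illustration; the paper's exact reward formulation may differ.

```python
import numpy as np

def meteor_reward_loss(logprobs_sampled, meteor_sampled, meteor_baseline):
    """Policy-gradient loss with a METEOR reward.

    logprobs_sampled: per-word log-probabilities of a caption sampled from the LSTM, shape (L,)
    meteor_sampled:   METEOR score of that sampled caption against the references
    meteor_baseline:  METEOR of a baseline caption (e.g. greedy decoding) -- an assumption here

    Minimizing this loss raises the probability of captions that beat the baseline's METEOR.
    """
    advantage = meteor_sampled - meteor_baseline
    return -advantage * np.sum(logprobs_sampled)
```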
Weakly Supervised Dense Event Captioning in Videos
- Sentence localizer: uses a "Crossing Attention" mechanism with two sub-attention computations (one computes attention between the final hidden state of the video and the caption features; the other computes attention between the final hidden state of the caption and the video features); see the sketch below.
- The input video is divided into multiple anchor segments at multiple scales, and a fully connected layer is trained to predict the anchor whose generated caption attains the highest METEOR score; regression is then performed around that best anchor.
- Caption generator: performs soft clipping by defining a continuous mask function.
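A sketch of the two sub-attentions in Crossing Attention. Dot-product scoring and a shared feature dimension d for the video and caption features are assumptions made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def crossing_attention(video_feats, video_final, caption_feats, caption_final):
    """Two sub-attentions between a video and a caption.

    video_feats: (T, d), video_final: (d,)      -- per-frame features and final video hidden state
    caption_feats: (L, d), caption_final: (d,)  -- per-word features and final caption hidden state
    """
    # the video's final hidden state attends over the caption features
    v2c = softmax(caption_feats @ video_final) @ caption_feats   # (d,)
    # the caption's final hidden state attends over the video features
    c2v = softmax(video_feats @ caption_final) @ video_feats     # (d,)
    return v2c, c2v
```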