Week 7 Presentation Ngoc Ta Aidean Sharghi


Attention function: maps a query and a set of key-value pairs to an output. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
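The attention function described above can be sketched as scaled dot-product attention (a minimal NumPy sketch; the toy query/key/value matrices are illustrative, not from the slides):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: the compatibility function is the
    scaled dot product of the query with each key, and the output is a
    weighted sum of the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # compatibility(query, key)
    weights = softmax(scores, axis=-1)    # one weight per value
    return weights @ V, weights

# toy example: 2 queries, 3 key-value pairs, dimension 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = attention(Q, K, V)
```

Each row of `w` is a probability distribution over the three values, so each output row is a convex combination of the value vectors.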

End-to-end Video-level Representation Learning for Action Recognition Deep network with Temporal Pyramid Pooling (DTPP): builds a video-level representation (rather than frame-level features) from enough frames sparsely sampled across the video. The spatial stream (orange) takes RGB images as input; the temporal stream (blue) takes optical flow stacks as input. Temporal pyramid pooling (TPP) aggregates the frame features into a video-level representation. Finally, the scores of the two streams are combined by a weighted average.
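The TPP aggregation step can be sketched as follows (an assumed configuration: max-pooling within each segment and pyramid levels {1, 2, 4}; the paper's exact levels and pooling operator may differ):

```python
import numpy as np

def temporal_pyramid_pooling(frames, levels=(1, 2, 4)):
    """Pool T frame-level feature vectors into one fixed-length
    video-level vector: at each pyramid level, split the time axis into
    that many segments, max-pool each segment, and concatenate all the
    pooled vectors."""
    T, D = frames.shape
    pooled = []
    for level in levels:
        bounds = np.linspace(0, T, level + 1).astype(int)
        for i in range(level):
            # guard against empty segments when T < level
            seg = frames[bounds[i]:max(bounds[i + 1], bounds[i] + 1)]
            pooled.append(seg.max(axis=0))
    return np.concatenate(pooled)  # length D * sum(levels)

# 25 sparsely sampled frames, each with an 8-dim feature (toy sizes)
feats = np.random.default_rng(1).normal(size=(25, 8))
video_repr = temporal_pyramid_pooling(feats)
```

The output length is fixed (`D * sum(levels)`) regardless of how many frames were sampled, which is what makes the representation video-level.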

Reconstruction Network for Video Captioning Encoder (CNN): extracts semantic representations of video frames. Decoder (LSTM): generates a natural-language description of the visual content. Reconstructor: exploits the backward flow to reproduce the frame representations. A temporal attention mechanism selects the key frames/elements for captioning by assigning a weight to the representation of each frame. METEOR: 34 with SA-LSTM, 32 with S2VT.
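The reconstructor's backward-flow objective can be illustrated with a simple reconstruction loss. This is a sketch only: it assumes a mean Euclidean distance between the original and reproduced frame features, which is one plausible choice, not necessarily the paper's exact formulation:

```python
import numpy as np

def reconstruction_loss(original, reconstructed):
    """Average Euclidean distance between each original frame feature
    and the feature reproduced from the decoder's hidden states."""
    return float(np.mean(np.linalg.norm(original - reconstructed, axis=1)))

# toy features: 10 frames, 16-dim each; the "reconstruction" is the
# original plus small noise, standing in for the reconstructor's output
rng = np.random.default_rng(2)
orig = rng.normal(size=(10, 16))
recon = orig + 0.1 * rng.normal(size=(10, 16))
loss = reconstruction_loss(orig, recon)
```

Training adds this term to the usual captioning loss, so the decoder's hidden states are pushed to retain enough information to rebuild the video representation.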

End-to-End Dense Video Captioning with Masked Transformer Encoder: encodes the video into visual representations. Proposal decoder: decodes the representations with different anchors to form event proposals. Captioning decoder: uses a masking network to restrict its attention to the proposed events. From "Attention Is All You Need": the Transformer relies on self-attention to compute representations of its input and output sequences; multi-head attention allows the model to attend to information from different representation subspaces at different positions. METEOR: 10 on ActivityNet Captions, 6.6 on YouCookII.
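Restricting attention to a proposal can be sketched by masking the attention scores outside the proposal window (an illustrative hard mask; the paper's masking network produces a differentiable mask so the whole model trains end to end):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V, start, end):
    """Attention over a temporal sequence, restricted to the proposal
    window [start, end): positions outside it get -inf scores and
    therefore zero weight after the softmax."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    mask = np.full(K.shape[0], -np.inf)
    mask[start:end] = 0.0
    weights = softmax(scores + mask, axis=-1)
    return weights @ V, weights

# toy sequence of 6 time steps, one query (hypothetical sizes)
rng = np.random.default_rng(3)
Q = rng.normal(size=(1, 4))
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 4))
out, w = masked_attention(Q, K, V, start=2, end=5)
```

Only positions 2-4 receive nonzero weight, so the captioning decoder "sees" just the proposed event.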

M3: Multimodal Memory Modelling for Video Captioning CNN-based video encoder, multimodal memory, and LSTM-based text decoder. The model builds a visual and textual shared memory; temporal soft-attention selects the visual information most related to each word. Procedure: 1. Write hidden states to update the memory. 2. Read the updated memory content to perform soft-attention. 3. Write the selected visual information to update the memory again. 4. Read the updated memory content for next-word prediction. METEOR: 26.6 on MSR-VTT.
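One decoding step of the four-step procedure can be sketched with a toy memory. The blending `write` and dot-product `read` below are deliberate simplifications of the model's learned write gates and attention (all sizes are hypothetical):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def write(memory, content, alpha=0.5):
    """Write: blend new content into the slots (stand-in for the
    model's learned, gated memory update)."""
    return (1 - alpha) * memory + alpha * content

def read(memory, query):
    """Read: soft-attention over memory slots, keyed by the query."""
    w = softmax(memory @ query)
    return w @ memory

# 4 memory slots of dimension 8, one step of the procedure
rng = np.random.default_rng(4)
M = rng.normal(size=(4, 8))
h = rng.normal(size=8)        # decoder hidden state
visual = rng.normal(size=8)   # attention-selected visual feature

M = write(M, h)               # 1. write hidden state
context = read(M, h)          # 2. read for soft-attention
M = write(M, visual)          # 3. write selected visual info
next_in = read(M, h)          # 4. read for next-word prediction
```

The point of the shared memory is that steps 1 and 3 interleave textual and visual writes, so step 4 reads a representation conditioned on both modalities.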

Jointly Localizing and Describing Events for Dense Video Captioning Base layers reduce the temporal dimension of the feature map and increase the size of the temporal receptive fields. Nine anchor layers are stacked on top of the base layers; these anchor layers progressively decrease the temporal dimension of the feature map. For each anchor layer, the output feature map is fed into a prediction layer to produce a fixed set of predictions in a one-shot manner. Reinforcement learning: the caption LSTM is optimized with a METEOR-based reward loss to produce better captions. METEOR: 13 on ActivityNet Captions.
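One way to picture the stacked anchor layers: each temporal cell of a layer emits a fixed set of segment anchors, so deeper layers (smaller temporal dimension) produce fewer but longer anchors. A sketch with assumed anchor scales:

```python
import numpy as np

def temporal_anchors(num_cells, scales=(1.0, 1.5)):
    """One-shot anchors for a layer whose feature map has `num_cells`
    temporal cells: each cell emits one (center, length) segment per
    scale, in normalized [0, 1] video coordinates."""
    anchors = []
    for i in range(num_cells):
        center = (i + 0.5) / num_cells
        for s in scales:
            anchors.append((center, s / num_cells))
    return np.array(anchors)

# a finer (shallow) layer vs. a coarser (deeper) layer
a8 = temporal_anchors(8)  # 8 temporal cells -> short anchors
a2 = temporal_anchors(2)  # 2 temporal cells -> long anchors
```

The prediction layer then regresses offsets from each anchor and scores it as an event proposal, all in one forward pass.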

Weakly Supervised Dense Event Captioning in Videos Sentence localizer: uses a "Crossing Attention" mechanism with two sub-attention computations (one computes attention between the final hidden state of the video and the caption features; the other computes attention between the final hidden state of the caption and the video features). The input video is divided into multiple anchor segments at multiple scales, and a fully connected layer is trained to predict the anchor that yields the highest METEOR score for the generated caption sentence; regression is then performed around that best anchor. Caption generator: performs soft clipping by defining a continuous mask function.
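The continuous mask behind soft clipping can be sketched as a product of two sigmoids, a common differentiable stand-in for a hard segment indicator (the paper's exact mask function and parameterization may differ):

```python
import numpy as np

def soft_mask(T, center, width, sharpness=50.0):
    """Continuous, differentiable mask over T frame positions: close to
    1 inside the segment [center - width/2, center + width/2] (in
    normalized time) and decaying smoothly to 0 outside, built as the
    product of a rising and a falling sigmoid."""
    t = (np.arange(T) + 0.5) / T  # normalized frame positions
    left = 1.0 / (1.0 + np.exp(-sharpness * (t - (center - width / 2))))
    right = 1.0 / (1.0 + np.exp(-sharpness * ((center + width / 2) - t)))
    return left * right

# mask 20 frames, keeping roughly the middle 30% of the video
m = soft_mask(T=20, center=0.5, width=0.3)
```

Because the mask is continuous in `center` and `width`, gradients from the captioning loss can flow back into the localization parameters, which is what weak supervision requires here.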