Week 7 Presentation (Ngoc Ta, Aidean Sharghi)
Attention function: maps a query and a set of key-value pairs to an output. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key (a minimal sketch follows).
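To make this concrete, here is a minimal NumPy sketch of weighted-sum attention. The scaled dot-product compatibility function and the softmax normalization are assumptions borrowed from the Transformer, not part of the generic definition above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Weighted sum of values V, with weights from query-key compatibility.

    Q: (n_q, d), K: (n_k, d), V: (n_k, d_v)
    """
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # compatibility of each query with each key
    weights = softmax(scores, axis=-1)        # attention weights sum to 1 over the keys
    return weights @ V                        # weighted sum of the values
```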
End-to-end Video-level Representation Learning for Action Recognition
Deep network with Temporal Pyramid Pooling (DTPP): builds a video-level representation (rather than frame-level features) from frames sparsely sampled across the whole video.
- Spatial stream (orange) takes RGB images as input.
- Temporal stream (blue) takes optical flow stacks as input.
- Temporal pyramid pooling (TPP) aggregates the frame features into the video-level representation (see the sketch below).
- Finally, the scores of the two streams are fused by a weighted average.
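A rough NumPy sketch of temporal pyramid pooling over frame features; the pyramid levels (1, 2, 4) and the use of max pooling within each segment are assumptions for illustration, not necessarily the paper's exact configuration.

```python
import numpy as np

def temporal_pyramid_pooling(frame_features, levels=(1, 2, 4)):
    """Aggregate per-frame features (T, d) into one fixed-size video-level vector.

    At each pyramid level the timeline is split into that many segments,
    each segment is pooled, and all pooled vectors are concatenated.
    Assumes T >= max(levels).
    """
    T, d = frame_features.shape
    pooled = []
    for n_segments in levels:
        bounds = np.linspace(0, T, n_segments + 1).astype(int)
        for i in range(n_segments):
            segment = frame_features[bounds[i]:bounds[i + 1]]
            pooled.append(segment.max(axis=0))       # pool within this temporal segment
    return np.concatenate(pooled)                    # length d * sum(levels)
```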
RECONSTRUCTION NETWORK FOR VIDEO CAPTIONING
- Encoder (CNN): extracts semantic representations of the video frames.
- Decoder (LSTM): generates natural-language descriptions of the visual content.
- Reconstructor: exploits the backward flow (sentence to video) to reproduce the frame representations.
- Temporal attention mechanism: selects the key frames/elements for captioning by assigning a weight to the representation of each frame (see the sketch below).
- METEOR: 34 with SA-LSTM, 32 with S2VT.
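The sketch below shows generic soft temporal attention of this kind: each frame representation is scored against the decoder state, the scores are normalized, and the weighted sum forms the context for the next word. The additive scoring form and the projection matrices W_f, W_h, w are hypothetical, not the paper's exact equations.

```python
import numpy as np

def temporal_attention(frame_feats, decoder_state, W_f, W_h, w):
    """Soft temporal attention: weight each frame by its relevance to the decoder state.

    frame_feats: (T, d_f), decoder_state: (d_h,)
    W_f: (d_a, d_f), W_h: (d_a, d_h), w: (d_a,)  -- learned projections (hypothetical shapes)
    """
    scores = np.tanh(frame_feats @ W_f.T + decoder_state @ W_h.T) @ w   # (T,) relevance scores
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                                   # attention weights over frames
    return weights @ frame_feats                                        # context vector for this word
```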
End-to-End Dense Video Captioning with Masked Transformer
- Encoder: encodes the video into visual representations.
- Proposal decoder: decodes the representations with different anchors to form event proposals.
- Captioning decoder: uses a masking network to restrict its attention to the proposed events (see the sketch below).
Background (Attention Is All You Need):
- Transformer: relies on self-attention to compute representations of its input and output sequences.
- Multi-head attention: allows the model to attend to information from different representation subspaces at different positions.
METEOR: 10 on ActivityNet Captions, 6.6 on YouCookII.
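A minimal sketch of attention restricted to a proposal segment. In the paper the mask is produced by a differentiable masking network; here a hard [start, end) window is assumed purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V, start, end):
    """Attention restricted to a proposed event segment [start, end).

    Positions outside the proposal receive -inf scores, so their weights become 0.
    Q: (n_q, d), K: (T, d), V: (T, d_v)
    """
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n_q, T)
    mask = np.full(K.shape[0], -np.inf)
    mask[start:end] = 0.0                     # keep only frames inside the proposal
    return softmax(scores + mask, axis=-1) @ V
```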
M3: Multimodal Memory Modelling for Video Captioning
- CNN-based video encoder, a multimodal memory, and an LSTM-based text decoder.
- Builds a shared visual and textual memory.
- Temporal soft-attention: selects the visual information most related to each word.
Procedure (sketched in code below):
1. Write hidden states to update the memory.
2. Read the updated memory content to perform soft-attention.
3. Write the selected visual information to update the memory again.
4. Read the updated memory content for next-word prediction.
METEOR: 26.6 on MSR-VTT.
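The following sketch mirrors the four-step write/read procedure for one decoding step. The additive memory updates and the projection matrices (W_write_h, W_write_v, W_read) are simplifying assumptions; the actual M3 updates are more elaborate.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def m3_step(memory, decoder_hidden, frame_feats, W_write_h, W_write_v, W_read):
    """One decoding step of the write/read procedure above (shapes hypothetical).

    memory: (d_m,) shared visual-textual memory
    decoder_hidden: (d_h,), frame_feats: (T, d_f)
    """
    # 1. Write: update the memory with the decoder hidden state
    memory = memory + W_write_h @ decoder_hidden
    # 2. Read: use the memory to compute soft-attention over the frame features
    scores = frame_feats @ (W_read @ memory)              # (T,)
    context = softmax(scores) @ frame_feats               # selected visual information
    # 3. Write: update the memory again with the selected visual content
    memory = memory + W_write_v @ context
    # 4. Read: the updated memory (plus context) conditions next-word prediction
    return memory, context
```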
Jointly Localizing and Describing Events for Dense Video Captioning
- Base layers reduce the temporal dimension of the feature map and increase the size of the temporal receptive fields.
- Nine anchor layers are stacked on top of the base layers; they progressively decrease the temporal dimension of the feature map.
- For each anchor layer, its output feature map is fed into a prediction layer to produce a fixed set of predictions in a one-shot manner.
- Reinforcement learning: the caption LSTM is optimized with a METEOR-based reward loss to produce better captions (see the sketch below).
METEOR: 13 on ActivityNet Captions.
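A minimal sketch of a METEOR-based policy-gradient (REINFORCE) objective for the caption LSTM. The use of a greedy-decoding baseline is an assumption for illustration; the paper's exact reward formulation may differ.

```python
import numpy as np

def meteor_reward_loss(logprobs_sampled, meteor_sampled, meteor_baseline):
    """Policy-gradient loss with a METEOR reward.

    logprobs_sampled: per-word log-probabilities of a caption sampled from the LSTM, shape (L,)
    meteor_sampled:   METEOR score of that sampled caption against the references
    meteor_baseline:  METEOR of a baseline caption (e.g. greedy decoding) -- an assumption here

    Minimizing this loss raises the probability of captions that beat the baseline's METEOR.
    """
    advantage = meteor_sampled - meteor_baseline
    return -advantage * np.sum(logprobs_sampled)
```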
Weakly Supervised Dense Event Captioning in Videos
- Sentence localizer: uses a "Crossing Attention" mechanism with two sub-attention computations (one computes attention between the final hidden state of the video and the caption features; the other computes attention between the final hidden state of the caption and the video features); see the sketch below.
- The input video is divided into multiple anchor segments at multiple scales, and a fully connected layer is trained to predict the anchor whose generated caption attains the highest METEOR score; regression is then performed around that best anchor.
- Caption generator: performs soft clipping by defining a continuous mask function.
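A sketch of the two sub-attentions in Crossing Attention. Dot-product scoring and a shared feature dimension d for the video and caption features are assumptions made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def crossing_attention(video_feats, video_final, caption_feats, caption_final):
    """Two sub-attentions between a video and a caption.

    video_feats: (T, d), video_final: (d,)      -- per-frame features and final video hidden state
    caption_feats: (L, d), caption_final: (d,)  -- per-word features and final caption hidden state
    """
    # the video's final hidden state attends over the caption features
    v2c = softmax(caption_feats @ video_final) @ caption_feats   # (d,)
    # the caption's final hidden state attends over the video features
    c2v = softmax(video_feats @ caption_final) @ video_feats     # (d,)
    return v2c, c2v
```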