Sequence to Sequence - Video to Text
Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko
Presented by Dewal Gupta, UCSD CSE 291G, Winter 2019
BACKGROUND
Challenge: generate a description for a given video
Important for: describing videos for the blind, human-robot interaction
Challenging because:
- diverse set of scenes and actions
- necessary to recognize the salient action in context
- unlike in still images, subjects don't have to face the camera
PREVIOUS WORK: Template Models
Tag the video with captions and use them as a bag of words
Two-stage pipeline:
- first: tag the video with semantic information about objects and actions, treated as a classification problem; FGM labels a subject, verb, object, and place
- second: generate a sentence from the semantic information
S2VT approach: avoids separating content identification from sentence generation
Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild - Mooney et al., 2014
PREVIOUS WORK: Mean Pooling
- CNN trained on object classification (a subset of ImageNet)
- 2-layer LSTM takes the mean-pooled video feature and the previous word as input (see the sketch below)
- Ignores video frame ordering
Translating Videos to Natural Language Using Deep Recurrent Neural Networks - Mooney et al., 2015
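For intuition, a minimal sketch of the mean-pooling video representation, assuming per-frame CNN features have already been extracted; names and dimensions are illustrative:

```python
import numpy as np

def mean_pool_video_feature(frame_features: np.ndarray) -> np.ndarray:
    """Collapse a (num_frames, feature_dim) array of per-frame CNN
    features into a single video descriptor by averaging over time,
    which discards frame ordering entirely."""
    return frame_features.mean(axis=0)

# usage: 40 frames of 4096-d features -> one 4096-d video vector
video_feat = mean_pool_video_feature(np.random.rand(40, 4096))
```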
PREVIOUS WORK: Exploiting Temporal Structure
Encoder:
- trains a 3D ConvNet on action recognition
- fixed-length frame input
- exploits local temporal structure
Describing Videos by Exploiting Temporal Structure - Courville et al., 2015
PREVIOUS WORK: Exploiting Temporal Structure
Decoder:
- soft temporal attention over frame features (similar to our HW 2); a sketch follows below
- exploits global temporal structure
Describing Videos by Exploiting Temporal Structure - Courville et al., 2015
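For reference, a minimal numpy sketch of soft temporal attention as used in such decoders; the parameter names (W_f, W_h, w) and shapes are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def temporal_attention(frame_feats, decoder_state, W_f, W_h, w):
    """Soft attention over frames: score each frame feature against the
    current decoder state, softmax the scores, and return the weighted
    sum as the context vector for this decoding step.
    frame_feats: (T, D_f), decoder_state: (D_h,),
    W_f: (D_f, K), W_h: (D_h, K), w: (K,)."""
    scores = np.tanh(frame_feats @ W_f + decoder_state @ W_h) @ w   # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                        # softmax
    return weights @ frame_feats                                    # (D_f,)
```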
GOAL
An end-to-end differentiable model that can:
- handle variable video length (i.e. variable input length)
- learn temporal structure
- learn a language model capable of generating descriptive sentences
MODEL: LSTM
- a single LSTM stack, shared between encoding and decoding
- 2 LSTM layers with 1000 hidden units (h_t) each
- red (first) layer: models visual elements
- green (second) layer: models linguistic elements
- frame features come from a CNN trained on the ILSVRC-2012 object classification subset of the ImageNet dataset [30]
- a sketch of the encode-then-decode unrolling follows below
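For intuition, a minimal PyTorch-style sketch (not the authors' released code) of the layout above: one two-layer LSTM stack unrolled first over frames and then over words. The dimensions follow the slides (500-d inputs, 1000 hidden units), but the padding scheme and all names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class S2VTSketch(nn.Module):
    """Minimal sketch of the S2VT layout: one stack of two LSTMs unrolled
    first over video frames (encoding), then over words (decoding).
    <BOS>/<EOS> handling and loss masking are omitted."""

    def __init__(self, feat_dim=4096, embed_dim=500, hidden=1000, vocab_size=10000):
        super().__init__()
        self.embed_dim, self.hidden = embed_dim, hidden
        self.frame_proj = nn.Linear(feat_dim, embed_dim)       # CNN feature -> 500-d
        self.word_embed = nn.Embedding(vocab_size, embed_dim)  # word id -> 500-d
        self.lstm1 = nn.LSTMCell(embed_dim, hidden)            # "red" visual layer
        self.lstm2 = nn.LSTMCell(hidden + embed_dim, hidden)   # "green" language layer
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T_v, feat_dim) pre-extracted CNN features
        # captions:    (B, T_w) word ids (teacher forcing)
        B = frame_feats.size(0)
        h1 = c1 = h2 = c2 = frame_feats.new_zeros(B, self.hidden)
        pad_word = frame_feats.new_zeros(B, self.embed_dim)
        pad_frame = frame_feats.new_zeros(B, self.embed_dim)

        # encoding stage: frames go to layer 1, the word slot of layer 2 is padded
        for t in range(frame_feats.size(1)):
            h1, c1 = self.lstm1(self.frame_proj(frame_feats[:, t]), (h1, c1))
            h2, c2 = self.lstm2(torch.cat([h1, pad_word], dim=1), (h2, c2))

        # decoding stage: frame slot is padded, previous word goes to layer 2
        logits = []
        for t in range(captions.size(1)):
            h1, c1 = self.lstm1(pad_frame, (h1, c1))
            word = self.word_embed(captions[:, t])
            h2, c2 = self.lstm2(torch.cat([h1, word], dim=1), (h2, c2))
            logits.append(self.out(h2))
        return torch.stack(logits, dim=1)   # (B, T_w, vocab_size)
```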
MODEL: VGG-16
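The RGB frame features are taken from a late fully connected layer of VGG-16; below is a sketch of extracting such features with torchvision, which is a tooling assumption on my part rather than the authors' original pipeline (requires a recent torchvision for the weights API):

```python
import torch
import torchvision.models as models

# Sketch: use VGG-16 with its final classification layer removed as a
# per-frame feature extractor, yielding one 4096-d vector per RGB frame.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

with torch.no_grad():
    frames = torch.rand(40, 3, 224, 224)   # 40 preprocessed RGB frames (illustrative)
    feats = vgg(frames)                    # (40, 4096) per-frame features
```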
MODEL: AlexNet
Used for both RGB and optical flow inputs!
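The flow CNN consumes optical flow rendered as 3-channel images (x and y displacements centered at 128 plus a magnitude channel, as described in the original paper), so a standard image CNN such as AlexNet can process it. A sketch of that conversion, where the scaling constant is an assumption:

```python
import numpy as np

def flow_to_image(flow_x, flow_y, scale=16.0):
    """Pack optical flow into a 3-channel uint8 image (x, y, magnitude).
    The scale factor is a placeholder, not the value used by the authors."""
    mag = np.sqrt(flow_x ** 2 + flow_y ** 2)
    img = np.stack([
        np.clip(flow_x * scale + 128, 0, 255),   # x displacement around mid-gray
        np.clip(flow_y * scale + 128, 0, 255),   # y displacement around mid-gray
        np.clip(mag * scale, 0, 255),            # flow magnitude
    ], axis=-1)
    return img.astype(np.uint8)
```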
MODEL: Details
- Text embedding of 500 dimensions: self-trained, a simple linear transformation
- RGB networks are pre-trained on a subset of ImageNet, reusing the networks from the original works
- Optical flow network is pre-trained on the UCF101 action classification task, from the original 'Action Tubes' work
- All layers are frozen during training except the last layers
- Flow and RGB predictions are combined by a "shallow fusion" technique (sketched below); alpha is tuned on the validation set
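A minimal sketch of the shallow fusion step as a weighted sum of the two models' next-word probability distributions; the alpha value used here is a placeholder, since the paper tunes it on the validation set:

```python
import numpy as np

def shallow_fusion(p_rgb, p_flow, alpha=0.5):
    """Fuse the RGB and flow models' next-word distributions with a
    convex combination; alpha weighs the RGB model."""
    return alpha * p_rgb + (1.0 - alpha) * p_flow

# usage at one decoding step, over a toy 3-word vocabulary
p = shallow_fusion(np.array([0.7, 0.2, 0.1]), np.array([0.5, 0.3, 0.2]), alpha=0.6)
```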
DATASETS
Three datasets are used:
- Microsoft Video Description corpus (MSVD): web clips with human annotations
- MPII Movie Description Corpus (MPII-MD): Hollywood clips with descriptions from scripts and audio description (originally produced for the visually impaired)
- Montreal Video Annotation Dataset (M-VAD): Hollywood clips with audio descriptions
All three use single-sentence descriptions
DATASETS: Metrics
Authors use the METEOR metric:
- uses exact-token, stemmed-token, and WordNet-synonym matches
- correlates better with human judgement than BLEU or ROUGE
- outperforms CIDEr when there are fewer references (these datasets had only 1 reference per clip)
- BLEU does not take recall into consideration directly; it only penalizes brevity
In the slide's notation, m is the number of unigram (or n-gram) matches after alignment, w_r is the length of the reference, and w_t is the length of the candidate (formula reconstructed below).
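The formula itself did not survive extraction; it can be reconstructed from the standard METEOR definition using the notation above (the chunk-based fragmentation penalty is part of the standard metric but is not spelled out on the slide):

```latex
\[
P = \frac{m}{w_t}, \qquad
R = \frac{m}{w_r}, \qquad
F_{\mathrm{mean}} = \frac{10\,P\,R}{R + 9\,P}
\]
\[
\mathrm{Penalty} = 0.5\left(\frac{\#\mathrm{chunks}}{m}\right)^{3}, \qquad
\mathrm{METEOR} = F_{\mathrm{mean}}\,\bigl(1 - \mathrm{Penalty}\bigr)
\]
```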
RESULTS: MSVD
FGM is template-based:
- not very descriptive
- predicts a noun, verb, object, and place
- builds the sentence from a template
RESULTS: MSVD
Mean-pooling-based method: very similar to the authors' method
RESULTS: MSVD
Temporal Attention method: an encoder/decoder using attention
RESULTS: Frame Ordering
Training with a random ordering of frames results in "considerably lower" performance
RESULTS: Optical Flow
- Flow leads to better performance only when combined with RGB, not when used alone
- Flow can be very different even for the same activity (e.g. a person vs. a panda eating)
- Flow can't account for polysemous words like "play", e.g. "play guitar" vs. "play golf"
RESULTS: SOTA
Authors claim the accurate comparison is with the GoogLeNet variant with NO 3D-CNN (global temporal attention only)
- a questionable claim
RESULTS: MPII-MD, M-VAD
- Similar performance to Visual-Labels (VL)
- VL uses more semantic information (e.g. object detection) but no temporal information
RESULTS: Edit Distance
- Levenshtein distance: the edit distance between two strings (a sketch follows below)
- 42.9% of generated sentences exactly match a sentence in the MSVD training corpus
- the model struggles to learn on M-VAD
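For reference, a minimal implementation of Levenshtein distance (shown over characters; the same routine applies to word tokens):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings: the minimum number of
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# usage: distance 0 means the generated caption already exists verbatim
# in the training corpus
print(levenshtein("a man is playing a guitar", "a man is playing guitar"))  # 2
```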
CRITICISM
- Model appears to fail to learn temporal relations: the mean-pooling technique, which makes no use of them, performs nearly as well
- Model struggles on the M-VAD dataset noticeably more than on the others, for unclear reasons
- Authors should have reported BLEU and/or CIDEr scores as well (other studies include them)
- Could conduct a user study (where humans judge the captions)?
- Could improve by using better text embeddings?
FURTHER WORK
- Use Inception-ResNet-v2 as the backbone CNN
- Train the CNN against mined video "attributes"
- Same architecture otherwise; achieves a +5% METEOR score on MSVD
End-to-End Video Captioning with Multitask Reinforcement Learning - Li & Gong, 2019
FURTHER WORK
- Use a 3D CNN instead of LSTMs to get better clip embeddings (toy sketch below)
- 3D CNNs have proven better in activity recognition
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification - Xie et al., 2017
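For intuition, a toy sketch of a 3D-convolutional clip encoder; this is an illustrative stand-in, not the architecture proposed by Xie et al.:

```python
import torch
import torch.nn as nn

class TinyClipEncoder(nn.Module):
    """Toy 3D-CNN clip encoder: convolves jointly over time and space,
    then global-average-pools to one embedding per clip."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1),      # (B, 64, T, H, W)
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),              # spatial downsampling
            nn.Conv3d(64, embed_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                           # pool over T, H, W
        )

    def forward(self, clip):                                   # clip: (B, 3, T, H, W)
        return self.net(clip).flatten(1)                       # (B, embed_dim)

# usage: a batch of two 16-frame 112x112 clips -> two 512-d embeddings
emb = TinyClipEncoder()(torch.rand(2, 3, 16, 112, 112))
```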
CONCLUSION
The authors build an end-to-end differentiable model that can:
- handle variable video length (i.e. variable input length)
- learn temporal structure
- learn a language model capable of generating descriptive sentences
The model has become a baseline for many video captioning works
EXAMPLES