Long-term Recurrent Convolutional Networks for Visual Recognition and Description
Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach
Presented by: Harshul Gupta

LRCN is a class of models that is both spatially and temporally deep. It uses a ConvNet (encoder) to encode a deep spatial state vector and an LSTM (decoder) to generate natural language strings.
ConvNet: time-invariant and independent at each timestep
✔ Enables training that is parallelizable over all timesteps of the input
✔ Independent batch processing
LSTM: allows modeling of sequential data of varying lengths.
The LRCN architecture takes a sequence of T inputs and generates a sequence of T outputs.
The ConvNet used is CaffeNet, a slight modification of AlexNet in which max pooling is done before local response normalization (LRN), so the LRN takes less compute and memory; the two architectures are otherwise virtually indistinguishable, and Caffe includes both.
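To make the encoder-decoder split concrete, here is a minimal PyTorch-style sketch (not the authors' Caffe implementation); the torchvision AlexNet backbone, layer sizes, and class names are illustrative assumptions.

```python
# Minimal LRCN-style sketch (assumed sizes; not the original Caffe model).
import torch
import torch.nn as nn
import torchvision.models as models

class LRCN(nn.Module):
    def __init__(self, feat_dim=4096, hidden_dim=256, num_classes=101):
        super().__init__()
        # ConvNet encoder: applied independently to every frame (time-invariant).
        backbone = models.alexnet(weights=None)
        self.cnn = nn.Sequential(backbone.features, nn.Flatten(),
                                 backbone.classifier[:5])  # up to the fc features
        # LSTM decoder: models the temporal sequence of frame features.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):                  # frames: (B, T, 3, 224, 224)
        B, T = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1))  # (B*T, feat_dim): parallel over timesteps
        feats = feats.view(B, T, -1)
        out, _ = self.lstm(feats)               # (B, T, hidden_dim)
        return self.head(out)                   # one prediction per timestep
```

Because the ConvNet is applied to each frame independently, the (B*T)-sized batch in the sketch is exactly the "parallelizable over all timesteps" property noted above.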

Activity Recognition
Sequential input < F1, F2, F3, ..., Ft > (different visual frames)
Scalar output < Y > (final label)
How can the LRCN model be used for the activity recognition task?
✔ Predict the video class at each time step < Y1, Y2, Y3, ..., Yt >
✔ Average these predictions for the final classification (Y), as in the sketch below
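A minimal sketch of this per-timestep prediction and averaging, reusing the hypothetical LRCN model from the previous sketch:

```python
import torch

def classify_video(model, frames):
    """frames: (1, T, 3, H, W) -> single class index for the whole clip."""
    with torch.no_grad():
        per_step_logits = model(frames)          # (1, T, num_classes): Y1..Yt
        probs = per_step_logits.softmax(dim=-1)  # per-timestep class distributions
        video_probs = probs.mean(dim=1)          # average over the T timesteps
    return video_probs.argmax(dim=-1)            # final label Y
```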

Both RGB and optical flow are given as inputs, and a weighted average of both is used for the final prediction.
Advantage of using both RGB and optical-flow models?
Typing: strongly correlated with the presence of certain objects, such as a keyboard, and thus best learned by the RGB model.
Juggling: involves more generic objects that are frequently seen in other activities (such as balls and humans), and is thus best identified from class-specific motion cues.
Because the RGB and flow signals are complementary, the best models take both into account (see the fusion sketch below).
Variants of the LRCN architecture are used:
LRCN-fc6: the LSTM is placed after the first fully connected layer of the CNN
LRCN-fc7: the LSTM is placed after the second fully connected layer of the CNN
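The weighted late fusion could look like the following sketch; the function name is hypothetical, and the 1/3 vs. 2/3 weighting simply illustrates giving the flow stream more weight than RGB.

```python
import torch

def fuse_two_stream(rgb_probs, flow_probs, w_flow=2 / 3):
    """Late fusion of per-class probabilities from the RGB and optical-flow LRCNs.

    Weighting flow higher (e.g. 2/3 vs 1/3) illustrates favoring the stronger stream.
    """
    return (1.0 - w_flow) * rgb_probs + w_flow * flow_probs

# rgb_probs, flow_probs: (num_classes,) probability vectors for one video
# prediction = fuse_two_stream(rgb_probs, flow_probs).argmax()
```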

Evaluation
Evaluation is done on the UCF-101 dataset, which consists of over 12,000 videos categorized into 101 human action classes.
The LRCN-fc6 network yields the best results for both RGB and flow.
Since the flow network outperformed the RGB network, weighting the flow network higher led to better accuracy.
LRCN outperforms the baseline single-frame model by 3.82%. The single-frame baseline uses a single architecture that fuses information from all frames at the last stage; its learnt spatiotemporal features did not capture motion well.
http://blog.qure.ai/notes/deep-learning-for-videos-action-recognition-review

Image Description
Scalar input < F >
Sequential output < Y1, Y2, Y3, ..., Yt >
How can the LRCN model be used for the image description task?
✔ Simply duplicate the input F at all T timesteps
Inputs at timestep t (factored variant):
LSTM-1: the embedded ground-truth word ("one-hot") from the previous timestep.
LSTM-2: the outputs of the first LSTM plus the image representation, producing a joint representation of the visual and language inputs.
Can you think of any advantage this might offer? It creates a separation of responsibilities by ensuring that the hidden state of the lower LSTM is conditionally independent of the visual input, forcing all of its hidden-state capacity to represent only the caption. (A sketch of this factored design follows.)
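A minimal sketch of the factored two-layer design, assuming PyTorch and illustrative embedding/hidden sizes: the first LSTM sees only word embeddings, and the image feature is injected only at the second LSTM.

```python
import torch
import torch.nn as nn

class FactoredCaptioner(nn.Module):
    def __init__(self, vocab_size, img_dim=4096, emb_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # LSTM-1: language only -- its state is conditionally independent of the image.
        self.lstm1 = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        # LSTM-2: fuses LSTM-1 output with the (duplicated) image feature at every timestep.
        self.lstm2 = nn.LSTM(hidden_dim + img_dim, hidden_dim, batch_first=True)
        self.vocab_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, prev_words):        # img_feat: (B, img_dim); prev_words: (B, T)
        h1, _ = self.lstm1(self.embed(prev_words))  # (B, T, hidden_dim), no visual input
        img_rep = img_feat.unsqueeze(1).expand(-1, prev_words.size(1), -1)
        h2, _ = self.lstm2(torch.cat([h1, img_rep], dim=-1))
        return self.vocab_head(h2)                  # next-word logits at each timestep
```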

Evaluation
Datasets: Flickr30k (30,000 training images) and COCO 2014 (80,000 training images)
Metrics (computed as in the sketch below):
Recall@K: the number of images for which a correct caption is retrieved within the top K results (higher is better)
Median Rank (Medr): the median rank of the first retrieved ground-truth caption (lower is better)
LRCN with a 2-layer factored LSTM performs better in terms of both Medr and Recall@K.
Human evaluator rankings (lower is better): generated sentences were ranked by Amazon Mechanical Turk workers based on correctness, grammar, and relevance.
Retrieval-based approaches find a set of nearest-neighbor images in the training set from which a caption may be borrowed for the query image, selecting the caption that best represents the "consensus" of the candidate captions gathered from those neighbors. Even so, a crowdsourced study shows that humans still prefer a system that generates novel captions.
https://arxiv.org/pdf/1505.04467.pdf
https://arxiv.org/pdf/1410.1090.pdf
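For concreteness, a sketch of how the two retrieval metrics could be computed, assuming `ranks` holds the rank of the first correct caption for each query image (names are illustrative):

```python
import numpy as np

def recall_at_k(ranks, k):
    """Fraction of images whose first correct caption appears in the top-k results (higher is better)."""
    ranks = np.asarray(ranks)
    return float((ranks <= k).mean())

def median_rank(ranks):
    """Median rank of the first retrieved ground-truth caption (lower is better)."""
    return float(np.median(ranks))

# Example: ranks = [1, 3, 12, 2, 7] -> recall_at_k(ranks, 5) == 0.6, median_rank(ranks) == 3.0
```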

Video Description
Sequential input < F1, F2, F3, ..., Ft > (T inputs)
Sequential output { Y1, Y2, Y3, ..., Yt' } (T' outputs)
For different input and output lengths, an "encoder-decoder" approach is taken:
Encoder: maps the input sequence to a fixed-length vector
Decoder: unrolls this vector into a sequential output of arbitrary length
Under this model, the system as a whole may be thought of as having T + T' timesteps of input and output (see the sketch below).
A semantic representation of the video is first recognized using the maximum a posteriori (MAP) estimate of a CRF that takes video features as unaries, yielding predictions of the objects, subjects, and verbs present in the video; this representation is then translated into a sentence with the encoder-decoder LSTM architecture.
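A hedged sketch of the encoder-decoder unrolling, with the CRF step abstracted into a precomputed per-timestep semantic vector and all sizes assumed for illustration:

```python
import torch
import torch.nn as nn

class VideoCaptionSeq2Seq(nn.Module):
    def __init__(self, vocab_size, sem_dim=128, emb_dim=256, hidden_dim=256):
        super().__init__()
        # Encoder: maps the T semantic-representation vectors to a fixed-length state.
        self.encoder = nn.LSTM(sem_dim, hidden_dim, batch_first=True)
        # Decoder: unrolls that state into T' output words.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.vocab_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, sem_seq, prev_words):
        # sem_seq: (B, T, sem_dim) CRF-derived video representation (assumed precomputed)
        # prev_words: (B, T') previous ground-truth words (teacher forcing)
        _, state = self.encoder(sem_seq)           # fixed-length summary of the video
        h, _ = self.decoder(self.embed(prev_words), state)
        return self.vocab_head(h)                  # (B, T', vocab_size) word logits
```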

Similar to the image description task: the semantic representation is encoded as a single fixed-length vector, so the entire visual input representation is provided to the LSTM decoder at each time step. A further variant is the same, but replaces the CRF max (one-hot MAP estimate) with the CRF probability distribution.

Evaluation
Dataset: the TACoS (Textually Annotated Cooking Scenes) multilevel dataset, which has 44,762 video/sentence pairs (about 40,000 for training/validation). It contains multiple sentence descriptions and longer videos; however, they are restricted to the cooking scenario.
LSTM outperforms an SMT-based approach to video description.
The simpler decoder architectures (b) and (c) achieve better performance than (a) (BLEU-4 scores).
https://www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/research/vision-and-language/tacos-multi-level-corpus/

Conclusions
LRCN is a flexible framework for vision problems involving sequences.
Able to handle:
✔ Sequences in the input (video)
✔ Sequences in the output (natural language description)