Long-term Recurrent Convolutional Networks for Visual Recognition and Description
Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach
Presented by: Harshul Gupta

LRCN is a class of models that is both spatially and temporally deep. It uses a ConvNet (encoder) to encode a deep spatial state vector and an LSTM (decoder) to generate natural language strings.
ConvNet: time-invariant and independent at each timestep
✔ Enables training that is parallelizable over all timesteps of the input
✔ Independent batch processing
LSTM: allows modeling of sequential data of varying lengths.
The LRCN architecture takes a sequence of T inputs and generates a sequence of T outputs.
The ConvNet used is CaffeNet, a slight modification of AlexNet in which max pooling is done before local response normalization (LRN), so the LRN takes less compute and memory; the two architectures are otherwise virtually indistinguishable, and Caffe includes both.
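To make the encoder-decoder split concrete, here is a minimal PyTorch-style sketch (not the authors' Caffe implementation); the torchvision AlexNet backbone, layer sizes, and class names are illustrative assumptions.

```python
# Minimal LRCN-style sketch (assumed sizes; not the original Caffe model).
import torch
import torch.nn as nn
import torchvision.models as models

class LRCN(nn.Module):
    def __init__(self, feat_dim=4096, hidden_dim=256, num_classes=101):
        super().__init__()
        # ConvNet encoder: applied independently to every frame (time-invariant).
        backbone = models.alexnet(weights=None)
        self.cnn = nn.Sequential(backbone.features, nn.Flatten(),
                                 backbone.classifier[:5])  # up to the fc features
        # LSTM decoder: models the temporal sequence of frame features.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):                  # frames: (B, T, 3, 224, 224)
        B, T = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1))  # (B*T, feat_dim): parallel over timesteps
        feats = feats.view(B, T, -1)
        out, _ = self.lstm(feats)               # (B, T, hidden_dim)
        return self.head(out)                   # one prediction per timestep
```

Because the ConvNet is applied to each frame independently, the (B*T)-sized batch in the sketch is exactly the "parallelizable over all timesteps" property noted above.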

Activity Recognition
Sequential input < F1, F2, F3, ..., Ft > (different visual frames)
Scalar output < Y > (final label)
How can the LRCN model be used for the activity recognition task?
✔ Predict the video class at each time step < Y1, Y2, Y3, ..., Yt >
✔ Average these predictions for the final classification (Y), as in the sketch below
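A minimal sketch of this per-timestep prediction and averaging, reusing the hypothetical LRCN model from the previous sketch:

```python
import torch

def classify_video(model, frames):
    """frames: (1, T, 3, H, W) -> single class index for the whole clip."""
    with torch.no_grad():
        per_step_logits = model(frames)          # (1, T, num_classes): Y1..Yt
        probs = per_step_logits.softmax(dim=-1)  # per-timestep class distributions
        video_probs = probs.mean(dim=1)          # average over the T timesteps
    return video_probs.argmax(dim=-1)            # final label Y
```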

Both RGB and optical flow are given as inputs, and a weighted average of both is used for the final prediction.
Advantage of using both RGB and optical-flow models?
Typing: strongly correlated with the presence of certain objects, such as a keyboard, and thus best learned by the RGB model.
Juggling: involves more generic objects that are frequently seen in other activities (such as balls and humans), and is thus best identified from class-specific motion cues.
Because the RGB and flow signals are complementary, the best models take both into account (see the fusion sketch below).
Variants of the LRCN architecture are used:
LRCN-fc6: the LSTM is placed after the first fully connected layer of the CNN
LRCN-fc7: the LSTM is placed after the second fully connected layer of the CNN
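The weighted late fusion could look like the following sketch; the function name is hypothetical, and the 1/3 vs. 2/3 weighting simply illustrates giving the flow stream more weight than RGB.

```python
import torch

def fuse_two_stream(rgb_probs, flow_probs, w_flow=2 / 3):
    """Late fusion of per-class probabilities from the RGB and optical-flow LRCNs.

    Weighting flow higher (e.g. 2/3 vs 1/3) illustrates favoring the stronger stream.
    """
    return (1.0 - w_flow) * rgb_probs + w_flow * flow_probs

# rgb_probs, flow_probs: (num_classes,) probability vectors for one video
# prediction = fuse_two_stream(rgb_probs, flow_probs).argmax()
```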

Evaluation
Evaluation is done on the UCF-101 dataset, which consists of over 12,000 videos categorized into 101 human action classes.
The LRCN-fc6 network yields the best results for both RGB and flow.
Since the flow network outperformed the RGB network, weighting the flow network higher led to better accuracy.
LRCN outperforms the baseline single-frame model by 3.82%. The single-frame baseline uses a single architecture that fuses information from all frames at the last stage; its learnt spatiotemporal features did not capture motion well.
http://blog.qure.ai/notes/deep-learning-for-videos-action-recognition-review

Image Description
Scalar input < F >
Sequential output < Y1, Y2, Y3, ..., Yt >
How can the LRCN model be used for the image description task?
✔ Simply duplicate the input F at all T timesteps
Inputs at timestep t (factored variant):
LSTM-1: the embedded ground-truth word ("one-hot") from the previous timestep.
LSTM-2: the outputs of the first LSTM plus the image representation, producing a joint representation of the visual and language inputs.
Can you think of any advantage this might offer? It creates a separation of responsibilities by ensuring that the hidden state of the lower LSTM is conditionally independent of the visual input, forcing all of its hidden-state capacity to represent only the caption. (A sketch of this factored design follows.)
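A minimal sketch of the factored two-layer design, assuming PyTorch and illustrative embedding/hidden sizes: the first LSTM sees only word embeddings, and the image feature is injected only at the second LSTM.

```python
import torch
import torch.nn as nn

class FactoredCaptioner(nn.Module):
    def __init__(self, vocab_size, img_dim=4096, emb_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # LSTM-1: language only -- its state is conditionally independent of the image.
        self.lstm1 = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        # LSTM-2: fuses LSTM-1 output with the (duplicated) image feature at every timestep.
        self.lstm2 = nn.LSTM(hidden_dim + img_dim, hidden_dim, batch_first=True)
        self.vocab_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, prev_words):        # img_feat: (B, img_dim); prev_words: (B, T)
        h1, _ = self.lstm1(self.embed(prev_words))  # (B, T, hidden_dim), no visual input
        img_rep = img_feat.unsqueeze(1).expand(-1, prev_words.size(1), -1)
        h2, _ = self.lstm2(torch.cat([h1, img_rep], dim=-1))
        return self.vocab_head(h2)                  # next-word logits at each timestep
```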

Evaluation
Datasets: Flickr30k (30,000 training images) and COCO 2014 (80,000 training images)
Metrics (computed as in the sketch below):
Recall@K: the number of images for which a correct caption is retrieved within the top K results (higher is better)
Median Rank (Medr): the median rank of the first retrieved ground-truth caption (lower is better)
LRCN with a 2-layer factored LSTM performs better in terms of both Medr and Recall@K.
Human evaluator rankings (lower is better): generated sentences were ranked by Amazon Mechanical Turk workers based on correctness, grammar, and relevance.
Retrieval-based approaches find a set of nearest-neighbor images in the training set from which a caption may be borrowed for the query image, selecting the caption that best represents the "consensus" of the candidate captions gathered from those neighbors. Even so, a crowdsourced study shows that humans still prefer a system that generates novel captions.
https://arxiv.org/pdf/1505.04467.pdf
https://arxiv.org/pdf/1410.1090.pdf
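For concreteness, a sketch of how the two retrieval metrics could be computed, assuming `ranks` holds the rank of the first correct caption for each query image (names are illustrative):

```python
import numpy as np

def recall_at_k(ranks, k):
    """Fraction of images whose first correct caption appears in the top-k results (higher is better)."""
    ranks = np.asarray(ranks)
    return float((ranks <= k).mean())

def median_rank(ranks):
    """Median rank of the first retrieved ground-truth caption (lower is better)."""
    return float(np.median(ranks))

# Example: ranks = [1, 3, 12, 2, 7] -> recall_at_k(ranks, 5) == 0.6, median_rank(ranks) == 3.0
```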

Video Description
Sequential input < F1, F2, F3, ..., Ft > (T inputs)
Sequential output { Y1, Y2, Y3, ..., Yt' } (T' outputs)
For different input and output lengths, an "encoder-decoder" approach is taken:
Encoder: maps the input sequence to a fixed-length vector
Decoder: unrolls this vector into a sequential output of arbitrary length
Under this model, the system as a whole may be thought of as having T + T' timesteps of input and output (see the sketch below).
A semantic representation of the video is first recognized using the maximum a posteriori (MAP) estimate of a CRF that takes video features as unaries, yielding predictions of the objects, subjects, and verbs present in the video; this representation is then translated into a sentence with the encoder-decoder LSTM architecture.
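A hedged sketch of the encoder-decoder unrolling, with the CRF step abstracted into a precomputed per-timestep semantic vector and all sizes assumed for illustration:

```python
import torch
import torch.nn as nn

class VideoCaptionSeq2Seq(nn.Module):
    def __init__(self, vocab_size, sem_dim=128, emb_dim=256, hidden_dim=256):
        super().__init__()
        # Encoder: maps the T semantic-representation vectors to a fixed-length state.
        self.encoder = nn.LSTM(sem_dim, hidden_dim, batch_first=True)
        # Decoder: unrolls that state into T' output words.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.vocab_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, sem_seq, prev_words):
        # sem_seq: (B, T, sem_dim) CRF-derived video representation (assumed precomputed)
        # prev_words: (B, T') previous ground-truth words (teacher forcing)
        _, state = self.encoder(sem_seq)           # fixed-length summary of the video
        h, _ = self.decoder(self.embed(prev_words), state)
        return self.vocab_head(h)                  # (B, T', vocab_size) word logits
```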

Similar to the image description task: the semantic representation is encoded as a single fixed-length vector, so the entire visual input representation is provided to the LSTM decoder at each time step. A further variant is the same, but replaces the CRF max (one-hot MAP estimate) with the CRF probability distribution.

Evaluation
Dataset: the TACoS (Textually Annotated Cooking Scenes) multilevel dataset, which has 44,762 video/sentence pairs (about 40,000 for training/validation). It contains multiple sentence descriptions and longer videos; however, they are restricted to the cooking scenario.
LSTM outperforms an SMT-based approach to video description.
The simpler decoder architectures (b) and (c) achieve better performance than (a) (BLEU-4 scores).
https://www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/research/vision-and-language/tacos-multi-level-corpus/

Conclusions
LRCN is a flexible framework for vision problems involving sequences.
Able to handle:
✔ Sequences in the input (video)
✔ Sequences in the output (natural language description)