Unsupervised Learning of Video Representations using LSTMs

Presentation transcript:

Unsupervised Learning of Video Representations using LSTMs Nitish Srivastava, Elman Mansimov, Ruslan Salakhutdinov, University of Toronto

Agenda Quick Intro, Supervised vs. Unsupervised, Problem Definition, Model Description, Experiments and Datasets, Results.

Quick Intro “Videos are an abundant and rich source of visual information and can be seen as a window into the physics of the world we live in”.

Supervised vs. Unsupervised Supervised learning has performed well for visual representations of images. For video representations, many more labels are required: videos are higher-dimensional entities, and problems are defined over time rather than over a single frame. In videos there are many features to collect at each frame, and each feature may require many labels. Data can be collected in a way that helps us solve a specific problem.

Supervised vs. Unsupervised Let's play a game, supervised or unsupervised: Shakespeare text generation, predicting a video's next frame, detecting hand gestures from a video sequence, LSTM.

Model Description The model aims to achieve three goals: reconstruct the same sequence as the input, predict the future frames, and serve as pre-training for the supervised task of action recognition.

Model Description – Previous Work ICA-based: Hurri, Jarmo, "Simple-cell-like receptive fields maximize temporal coherence in natural video", 2003. ISA-based: Le, Q., "Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis", 2011. Generative-model based: Memisevic, R., "Learning to represent spatial transformations with factored higher-order Boltzmann machines", 2010; Ranzato, M., "Video (language) modeling: a baseline for generative models of natural videos", 2014. The latter used recurrent neural networks to predict the next frame, argued that the squared loss function is not optimal, and proposed quantizing the images into a large dictionary.

Model Description - LSTM Long Short-Term Memory (LSTM) units are used to represent the physics of the world over each time step. LSTM units sum activities over time; the same operations are applied at every step of the sequence.
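As a reference for how this gating and summation works, here is a minimal NumPy sketch of a single standard LSTM step. It is a generic formulation, not the authors' implementation; the parameter names (W_*, U_*, b_*) and the params dictionary layout are assumptions made for illustration.

```python
# Minimal sketch of one standard LSTM time step (generic formulation, not the
# paper's code). Parameter names and the params dict layout are illustrative.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: gated sums let the cell accumulate activities over time."""
    W_i, U_i, b_i = params["i"]   # input gate
    W_f, U_f, b_f = params["f"]   # forget gate
    W_o, U_o, b_o = params["o"]   # output gate
    W_g, U_g, b_g = params["g"]   # candidate cell update

    i = sigmoid(W_i @ x_t + U_i @ h_prev + b_i)
    f = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)
    o = sigmoid(W_o @ x_t + U_o @ h_prev + b_o)
    g = np.tanh(W_g @ x_t + U_g @ h_prev + b_g)

    c_t = f * c_prev + i * g      # cell state: a running, gated sum over time
    h_t = o * np.tanh(c_t)        # hidden state emitted at this step
    return h_t, c_t
```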

Model Description - LSTM

Model Description - Configuration

Model Description - Autoencoder Used for input reconstruction. Two RNNs: the encoder LSTM and the decoder LSTM. The target sequence is the input sequence in reverse order. (Figure: encoder and decoder LSTMs.)
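A minimal PyTorch sketch of such an encoder-decoder LSTM autoencoder is given below, assuming an unconditioned decoder that receives zeros at every step and reconstructs the input in reverse order; the module names, single-layer setup, and flattened frame vectors are assumptions for illustration, not the paper's code.

```python
# Sketch of an LSTM autoencoder: the encoder state summarizes the clip and the
# decoder reconstructs the frames in reverse order. Names/shapes are illustrative.
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    def __init__(self, frame_dim, hidden_dim):
        super().__init__()
        self.encoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.readout = nn.Linear(hidden_dim, frame_dim)

    def forward(self, frames):                      # frames: (B, T, frame_dim)
        _, state = self.encoder(frames)             # (h, c) summarizes the input
        B, T, D = frames.shape
        zeros = frames.new_zeros(B, T, D)           # unconditioned decoder input
        out, _ = self.decoder(zeros, state)
        recon = self.readout(out)                   # reconstruction, reversed order
        target = torch.flip(frames, dims=[1])       # target: input in reverse
        return recon, target
```

Training would then minimize a reconstruction loss between `recon` and `target`.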

Model Description - Autoencoder Question: what prevents the network from learning the identity function? Answer: the fixed number of hidden units prevents trivial mappings, and because the LSTM operations are applied recursively, the same dynamics must be applied at every time step.

Model Description – Future Prediction Used for predicting the upcoming frames. Same as the autoencoder, but the decoder predicts the future frames. The hidden state coming out of the encoder captures the information from the previous frames. (Figure: encoder and decoder LSTMs.)

Model Description – Conditional Decoder During decoding, the last generated frame is fed back as the next input. Pros: allows the decoder to model multiple modes in the target distribution. Cons: the model may pick up the strong short-range correlations between frames and forget the long-term structure of the sequence.
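A possible sketch of such a conditional decoding loop, where each generated frame is fed back as the next decoder input; the function and argument names are hypothetical and assume a batch-first nn.LSTM decoder and a linear readout.

```python
# Sketch of conditional decoding: the frame generated at step t becomes the
# decoder input at step t+1. All names are illustrative.
import torch

def conditional_decode(decoder, readout, state, first_frame, num_steps):
    """decoder: batch-first nn.LSTM; readout: nn.Linear(hidden_dim, frame_dim);
    state: (h, c) from the encoder; first_frame: (B, 1, frame_dim)."""
    frame = first_frame
    outputs = []
    for _ in range(num_steps):
        out, state = decoder(frame, state)   # one decoding step
        frame = readout(out)                 # predicted frame, fed back next step
        outputs.append(frame)
    return torch.cat(outputs, dim=1)         # (B, num_steps, frame_dim)
```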

Model Description – Composite Model

Model Description – Composite Model Overcomes the shortcomings of each model: the autoencoder tends to learn trivial representations that memorize the input, but this memorization is not useful for predicting the future; the future predictor tends to store information only about the last few frames, while knowledge of all the frames is needed for input reconstruction.
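One way to sketch the composite model: a single encoder LSTM whose final state initializes both a reconstruction decoder and a future-prediction decoder. As before, the module names, dimensions, and unconditioned decoders are assumptions for illustration.

```python
# Sketch of the composite model: one shared encoder, two decoders.
import torch
import torch.nn as nn

class CompositeLSTM(nn.Module):
    def __init__(self, frame_dim, hidden_dim):
        super().__init__()
        self.encoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.recon_decoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.future_decoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.readout = nn.Linear(hidden_dim, frame_dim)

    def forward(self, frames, num_future):           # frames: (B, T, frame_dim)
        B, T, D = frames.shape
        _, state = self.encoder(frames)               # shared summary of the input
        recon_in = frames.new_zeros(B, T, D)
        future_in = frames.new_zeros(B, num_future, D)
        recon_out, _ = self.recon_decoder(recon_in, state)
        future_out, _ = self.future_decoder(future_in, state)
        # Targets: the reversed input frames and the next num_future frames.
        return self.readout(recon_out), self.readout(future_out)
```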

Experiments and Datasets UCF-101: 13,320 videos belonging to 101 different actions. HMDB-51: 5,100 videos belonging to 51 different actions. Moving MNIST: each video is 20 frames of 64x64 patches and consists of two randomly sampled digits moving with random velocities. Sports-1M dataset: 1 million YouTube clips.

Results – Moving MNIST Each layer consists of 2048 LSTM units. The loss function is the cross-entropy loss. The model took 10 frames as input, reconstructed them, and predicted the next 10.
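For intuition, a per-pixel cross-entropy loss of this kind could be written as below, treating each pixel as a Bernoulli target; the tensor shapes and the use of logits are assumptions for illustration.

```python
# Sketch: per-pixel cross-entropy between predicted frames and binary targets.
import torch.nn.functional as F

def pixel_cross_entropy(pred_logits, target_frames):
    """pred_logits and target_frames: (B, T, 64 * 64), targets in [0, 1]."""
    return F.binary_cross_entropy_with_logits(pred_logits, target_frames)
```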

Results – Moving MNIST

Results – Natural Image 32x32 patches from the UCF-101 dataset. The loss function is the squared error loss. The model took 16 frames as input, reconstructed them, and predicted the next 13.

Results – Natural Image

Results - Comparison

Results – Generalization over Time Took the Moving MNIST digits with 64x64 patches; training was done using 10 frames as input. The model was then run for 100 steps, and the activities of 200 randomly chosen LSTM units remained periodic. For reconstruction, after about 15 steps only blobs are returned.

Results – Out-of-Domain Inputs With a single moving digit, the model does a good job but tries to hallucinate a second digit. With three digits, it returns only blobs.

Results – Feature Visualization

Results – Feature Visualization The output features look much more like blobs than the input features, and their stripes are much shorter. In the input, the model needs to learn the velocity and direction of the movement, so longer features are useful for encoding and storing that information across many features. In the output, the features only need to be shorter and fatter to reduce the error, and they should not be longer than the digits.

Results – Supervised Learning Unsupervised pre-training was performed on a subset of the Sports-1M dataset. The trained encoder was taken as the base for the LSTM classifier, and a softmax layer was added on top. Fine-tuning was performed on the UCF-101 and HMDB-51 datasets.
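A minimal sketch of this fine-tuning setup: the pretrained encoder LSTM is reused and a linear layer (with the softmax applied in the loss) is placed on top. Classifying from the last hidden state and the class count of 101 for UCF-101 are simplifying assumptions, not necessarily the paper's exact setup.

```python
# Sketch: action classifier built on the pretrained encoder LSTM.
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    def __init__(self, pretrained_encoder, hidden_dim, num_classes=101):
        super().__init__()
        self.encoder = pretrained_encoder            # weights from unsupervised pre-training
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):                       # frames: (B, T, frame_dim)
        out, _ = self.encoder(frames)                # assumes a batch-first nn.LSTM
        return self.classifier(out[:, -1])           # logits; softmax applied in the loss
```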

Results – Supervised Learning

Results – Supervised Learning The first model uses only RGB data, the second uses features generated from optical flow, and the third uses both.

Questions