Unsupervised Learning of Video Representations using LSTMs

Presentation transcript:

Unsupervised Learning of Video Representations using LSTMs Nitish Srivastava, Elman Mansimov, Ruslan Salakhutdinov, University of Toronto

Agenda Quick Intro, Supervised vs. Unsupervised, Problem Definition, Model Description, Experiments and Datasets, Results.

Quick Intro “Videos are an abundant and rich source of visual information and can be seen as a window into the physics of the world we live in”.

Supervised vs. Unsupervised Supervised learning has performed well for visual representations of images. For video representations, many more labels are required: videos are higher-dimensional entities, and problems are defined over time rather than over a single frame. In videos there are many features to collect at each frame, and each feature may require many labels. Data can be collected in a way that helps us solve a specific problem.

Supervised vs. Unsupervised Let's play a game, supervised or unsupervised: Shakespeare text generation, predicting a video's next frame, detecting hand gestures from a video sequence, LSTM.

Model Description The model aims to achieve three goals: reconstruct the same sequence as the input, predict the future frames, and serve as pre-training for the supervised task of action recognition.

Model Description – Previous Work ICA-based: Hurri, Jarmo, "Simple-cell-like receptive fields maximize temporal coherence in natural video", 2003. ISA-based: Le, Q., "Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis", 2011. Generative-model based: Memisevic, R., "Learning to represent spatial transformations with factored higher-order Boltzmann machines", 2010; Ranzato, M., "Video (language) modeling: a baseline for generative models of natural videos", 2014. The latter used recurrent neural networks to predict the next frame, argued that the squared loss function is not optimal, and proposed quantizing the images into a large dictionary.

Model Description - LSTM Long Short-Term Memory (LSTM) units are used to represent the physics of the world over each time step. LSTM units sum activities over time; the same operations are applied at every step of the sequence.
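As a reference for how this gating and summation works, here is a minimal NumPy sketch of a single standard LSTM step. It is a generic formulation, not the authors' implementation; the parameter names (W_*, U_*, b_*) and the params dictionary layout are assumptions made for illustration.

```python
# Minimal sketch of one standard LSTM time step (generic formulation, not the
# paper's code). Parameter names and the params dict layout are illustrative.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: gated sums let the cell accumulate activities over time."""
    W_i, U_i, b_i = params["i"]   # input gate
    W_f, U_f, b_f = params["f"]   # forget gate
    W_o, U_o, b_o = params["o"]   # output gate
    W_g, U_g, b_g = params["g"]   # candidate cell update

    i = sigmoid(W_i @ x_t + U_i @ h_prev + b_i)
    f = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)
    o = sigmoid(W_o @ x_t + U_o @ h_prev + b_o)
    g = np.tanh(W_g @ x_t + U_g @ h_prev + b_g)

    c_t = f * c_prev + i * g      # cell state: a running, gated sum over time
    h_t = o * np.tanh(c_t)        # hidden state emitted at this step
    return h_t, c_t
```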

Model Description - LSTM

Model Description - Configuration

Model Description - Autoencoder Used for input reconstruction. Two RNNs: the encoder LSTM and the decoder LSTM. The target sequence is the input sequence in reverse order. (Figure: encoder and decoder LSTMs.)
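A minimal PyTorch sketch of such an encoder-decoder LSTM autoencoder is given below, assuming an unconditioned decoder that receives zeros at every step and reconstructs the input in reverse order; the module names, single-layer setup, and flattened frame vectors are assumptions for illustration, not the paper's code.

```python
# Sketch of an LSTM autoencoder: the encoder state summarizes the clip and the
# decoder reconstructs the frames in reverse order. Names/shapes are illustrative.
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    def __init__(self, frame_dim, hidden_dim):
        super().__init__()
        self.encoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.readout = nn.Linear(hidden_dim, frame_dim)

    def forward(self, frames):                      # frames: (B, T, frame_dim)
        _, state = self.encoder(frames)             # (h, c) summarizes the input
        B, T, D = frames.shape
        zeros = frames.new_zeros(B, T, D)           # unconditioned decoder input
        out, _ = self.decoder(zeros, state)
        recon = self.readout(out)                   # reconstruction, reversed order
        target = torch.flip(frames, dims=[1])       # target: input in reverse
        return recon, target
```

Training would then minimize a reconstruction loss between `recon` and `target`.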

Model Description - Autoencoder Question: what prevents the network from learning the identity function? Answer: the fixed number of hidden units prevents trivial mappings, and because the LSTM operations are applied recursively, the same dynamics must be applied at every time step.

Model Description – Future Prediction Used for predicting the upcoming frames. Same as the autoencoder, but the decoder predicts the future frames. The hidden state coming out of the encoder captures the information from the previous frames. (Figure: encoder and decoder LSTMs.)

Model Description – Conditional Decoder During decoding, the last generated frame is fed back as the next input. Pros: allows the decoder to model multiple modes in the target distribution. Cons: the model may pick up the strong short-range correlations between frames and forget the long-term structure of the sequence.
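A possible sketch of such a conditional decoding loop, where each generated frame is fed back as the next decoder input; the function and argument names are hypothetical and assume a batch-first nn.LSTM decoder and a linear readout.

```python
# Sketch of conditional decoding: the frame generated at step t becomes the
# decoder input at step t+1. All names are illustrative.
import torch

def conditional_decode(decoder, readout, state, first_frame, num_steps):
    """decoder: batch-first nn.LSTM; readout: nn.Linear(hidden_dim, frame_dim);
    state: (h, c) from the encoder; first_frame: (B, 1, frame_dim)."""
    frame = first_frame
    outputs = []
    for _ in range(num_steps):
        out, state = decoder(frame, state)   # one decoding step
        frame = readout(out)                 # predicted frame, fed back next step
        outputs.append(frame)
    return torch.cat(outputs, dim=1)         # (B, num_steps, frame_dim)
```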

Model Description – Composite Model

Model Description – Composite Model Overcomes the shortcomings of each model: the autoencoder tends to learn trivial representations that memorize the input, but this memorization is not useful for predicting the future; the future predictor tends to store information only about the last few frames, while knowledge of all the frames is needed for input reconstruction.
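One way to sketch the composite model: a single encoder LSTM whose final state initializes both a reconstruction decoder and a future-prediction decoder. As before, the module names, dimensions, and unconditioned decoders are assumptions for illustration.

```python
# Sketch of the composite model: one shared encoder, two decoders.
import torch
import torch.nn as nn

class CompositeLSTM(nn.Module):
    def __init__(self, frame_dim, hidden_dim):
        super().__init__()
        self.encoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.recon_decoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.future_decoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.readout = nn.Linear(hidden_dim, frame_dim)

    def forward(self, frames, num_future):           # frames: (B, T, frame_dim)
        B, T, D = frames.shape
        _, state = self.encoder(frames)               # shared summary of the input
        recon_in = frames.new_zeros(B, T, D)
        future_in = frames.new_zeros(B, num_future, D)
        recon_out, _ = self.recon_decoder(recon_in, state)
        future_out, _ = self.future_decoder(future_in, state)
        # Targets: the reversed input frames and the next num_future frames.
        return self.readout(recon_out), self.readout(future_out)
```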

Experiments and Datasets UCF-101: 13,320 videos belonging to 101 different actions. HMDB-51: 5,100 videos belonging to 51 different actions. Moving MNIST: each video is 20 frames of 64x64 patches and consists of two randomly sampled digits moving with random velocities. Sports-1M dataset: 1 million YouTube clips.

Results – Moving MNIST Each layer consists of 2048 LSTM units. The loss function is the cross-entropy loss. The model took 10 frames as input, reconstructed them, and predicted the next 10.
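For intuition, a per-pixel cross-entropy loss of this kind could be written as below, treating each pixel as a Bernoulli target; the tensor shapes and the use of logits are assumptions for illustration.

```python
# Sketch: per-pixel cross-entropy between predicted frames and binary targets.
import torch.nn.functional as F

def pixel_cross_entropy(pred_logits, target_frames):
    """pred_logits and target_frames: (B, T, 64 * 64), targets in [0, 1]."""
    return F.binary_cross_entropy_with_logits(pred_logits, target_frames)
```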

Results – Moving MNIST

Results – Natural Image 32x32 patches from the UCF-101 dataset. The loss function is the squared error loss. The model took 16 frames as input, reconstructed them, and predicted the next 13.

Results – Natural Image

Results - Comparison

Results – Generalization over Time Took the Moving MNIST digits with 64x64 patches; training was done using 10 frames as input. The model was then run for 100 steps, and the activities of 200 randomly chosen LSTM units remained periodic. For reconstruction, after about 15 steps only blobs are returned.

Results – Out-of-Domain Inputs With a single moving digit, the model does a good job but tries to hallucinate a second digit. With three digits, it returns only blobs.

Results – Feature Visualization

Results – Feature Visualization The output features look much more like blobs than the input features, and their stripes are much shorter. In the input, the model needs to learn the velocity and direction of the movement, so longer features are useful for encoding and storing that information across many features. In the output, the features only need to be shorter and fatter to reduce the error, and they should not be longer than the digits.

Results – Supervised Learning Unsupervised pre-training was performed on a subset of the Sports-1M dataset. The trained encoder was taken as the base for the LSTM classifier, and a softmax layer was added on top. Fine-tuning was performed on the UCF-101 and HMDB-51 datasets.
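A minimal sketch of this fine-tuning setup: the pretrained encoder LSTM is reused and a linear layer (with the softmax applied in the loss) is placed on top. Classifying from the last hidden state and the class count of 101 for UCF-101 are simplifying assumptions, not necessarily the paper's exact setup.

```python
# Sketch: action classifier built on the pretrained encoder LSTM.
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    def __init__(self, pretrained_encoder, hidden_dim, num_classes=101):
        super().__init__()
        self.encoder = pretrained_encoder            # weights from unsupervised pre-training
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):                       # frames: (B, T, frame_dim)
        out, _ = self.encoder(frames)                # assumes a batch-first nn.LSTM
        return self.classifier(out[:, -1])           # logits; softmax applied in the loss
```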

Results – Supervised Learning

Results – Supervised Learning The first model uses only RGB data, the second uses features generated from optical flow, and the third uses both.

Questions