1
Unsupervised Learning of Video Representations using LSTMs
Nitish Srivastava, Elman Mansimov, Ruslan Salakhutdinov, University of Toronto
2
Agenda Quick Intro Supervised vs. Unsupervised Problem Definition
Model Description Experiments and Datasets Results
3
Quick Intro “Videos are an abundant and rich source of visual information and can be seen as a window into the physics of the world we live in”.
4
Supervised vs. Unsupervised
Supervised learning has performed well for visual representations of images, but video representation requires far more labels: videos are higher-dimensional entities; problems are defined over time, not over a single frame; each frame yields many features, and each feature can carry many labels. Data can be collected in a way that helps solve a specific problem.
5
Supervised vs. Unsupervised
Let's play a game: supervised or unsupervised? Shakespeare text generation Predicting a video's next frame Detecting hand gestures from a video sequence LSTM: supervised or unsupervised?
6
Model Description The model aims to achieve 3 goals:
Predict the same sequence as the input (reconstruction) Predict the future frames Serve as pre-training for the supervised task of action recognition
7
Model Description – Previous Work
ICA-based: Hurri, Jarmo, "Simple-cell-like receptive fields maximize temporal coherence in natural video", 2003. ISA-based: Le, Q., "Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis", 2011. Generative-model-based: Memisevic, R., "Learning to represent spatial transformations with factored higher-order Boltzmann machines"; Ranzato, M., "Video (language) modeling: a baseline for generative models of natural videos", 2014, which used recurrent neural networks to predict the next frame, argued that the squared loss function is not optimal, and proposed quantizing the images into a large dictionary.
8
Model Description - LSTM
Long Short-Term Memory (LSTM) units are used to represent the physics of the world at each time step. LSTM units sum activities over time, and the same operations are applied at every step.
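The gating arithmetic behind those summed activities can be sketched in a few lines. Below is a minimal single-step LSTM cell in NumPy; the function name `lstm_step` and the stacked-weights layout are illustrative choices, not from the paper:

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4H, D) input weights, U: (4H, H) recurrent
    weights, b: (4H,) biases; the four gate pre-activations are stacked.
    Returns the new hidden state h and cell state c."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b            # all four pre-activations at once
    i = 1 / (1 + np.exp(-z[:H]))          # input gate
    f = 1 / (1 + np.exp(-z[H:2*H]))       # forget gate
    o = 1 / (1 + np.exp(-z[2*H:3*H]))     # output gate
    g = np.tanh(z[3*H:])                  # candidate cell update
    c = f * c_prev + i * g                # cell state sums activities over time
    h = o * np.tanh(c)
    return h, c
```

Applying the same `lstm_step` at every frame is exactly the "same operation at each step" property the slide refers to.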
9
Model Description - LSTM
10
Model Description - Configuration
11
Model Description - Autoencoder
Used for input reconstruction. Two RNNs: the encoder LSTM and the decoder LSTM. The decoder's target sequence is the input sequence in reverse order.
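A toy sketch of the encoder-decoder wiring, with a plain tanh RNN standing in for the LSTMs; the helper names (`encode`, `decode`, `reconstruction_targets`) are made up for illustration. The key detail from the slide is that the decoder's target is the input in reverse order:

```python
import numpy as np

def encode(frames, Wx, Wh):
    """Encoder stand-in: fold the input frames into one fixed-size state."""
    h = np.zeros(Wh.shape[0])
    for x in frames:
        h = np.tanh(Wx @ x + Wh @ h)
    return h

def decode(h, Wh, Wy, steps):
    """Decoder stand-in: unroll from the encoder state, emitting one
    reconstructed frame per step."""
    outs = []
    for _ in range(steps):
        h = np.tanh(Wh @ h)
        outs.append(Wy @ h)
    return np.stack(outs)

def reconstruction_targets(frames):
    """Training target: the input sequence in REVERSE order."""
    return frames[::-1]
```

Reversing the target means the decoder starts with the most recently seen frame, which is the easiest one to reconstruct from the final encoder state.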
12
Model Description - Autoencoder
Question: what prevents the network from learning the identity function? Answer: the fixed number of hidden units prevents trivial mappings, and because LSTM operations are applied recursively, the same dynamics must be applied at every step.
13
Model Description – Future Prediction
Used for predicting the next sequence of frames. Same as the autoencoder, except the decoder predicts the future frames instead of reconstructing the input. The hidden state coming out of the encoder captures the information from the previous frames.
14
Model Description – Conditional Decoder
During decoding, the last output frame is fed back as the next input. Pros: allows the decoder to model multiple modes in the target distribution. Cons: the model may latch onto the strong correlations between consecutive frames and forget the long-term structure of the sequence.
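The difference between a conditional and an unconditioned decoder is only in what is fed in at each step. A hedged tanh-RNN sketch (function and parameter names are illustrative, not the paper's):

```python
import numpy as np

def decode(h0, Wh, Wx, Wy, steps, conditional):
    """Decoder unroll. If `conditional`, the frame emitted at step t-1 is
    fed back as the input at step t; otherwise the input is all zeros."""
    h = h0
    y = np.zeros(Wy.shape[0])
    outs = []
    for _ in range(steps):
        x = y if conditional else np.zeros_like(y)  # feedback vs. no input
        h = np.tanh(Wh @ h + Wx @ x)
        y = Wy @ h
        outs.append(y)
    return np.stack(outs)
```

The first emitted frame is identical in both modes (there is nothing to feed back yet); the trajectories diverge from the second step on.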
15
Model Description – Composite Model
16
Model Description – Composite Model
Overcomes the shortcomings of each individual model: the autoencoder alone tends to learn trivial representations that memorize the input, and such memorization is not useful for predicting the future; the future predictor alone tends to store information only about the last few frames, while reconstructing the input requires knowledge of all the frames.
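The composite wiring can be sketched as one shared encoder feeding two decoders, with the training loss summing both branches. A toy tanh-RNN version (all names illustrative; the paper uses LSTMs and, on Moving MNIST, cross-entropy rather than squared error):

```python
import numpy as np

def run_decoder(h, Wh, Wy, steps):
    """Unroll one decoder branch from the shared encoder state."""
    outs = []
    for _ in range(steps):
        h = np.tanh(Wh @ h)
        outs.append(Wy @ h)
    return np.stack(outs)

def composite_loss(frames_in, frames_future, Wx, Wh, Wr, Wp, Wy):
    """Encode once, decode twice: one branch reconstructs the input
    (target reversed), the other predicts the future frames; the
    objective is the sum of both squared errors."""
    h = np.zeros(Wh.shape[0])
    for x in frames_in:                    # shared encoder
        h = np.tanh(Wx @ x + Wh @ h)
    recon = run_decoder(h, Wr, Wy, len(frames_in))
    pred = run_decoder(h, Wp, Wy, len(frames_future))
    return (np.mean((recon - frames_in[::-1]) ** 2)
            + np.mean((pred - frames_future) ** 2))
```

Because both branches share the encoder state, the representation must both retain the full input and carry what is needed to extrapolate forward.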
17
Experiments and Dataset
UCF-101: videos belonging to 101 different action classes. HMDB-51: 5,100 videos belonging to 51 different action classes. Moving MNIST: each video is 20 frames of 64x64 patches and consists of two randomly sampled digits moving with random velocities. Sport 1M: 1 million YouTube clips.
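A minimal generator in the spirit of Moving MNIST, using a bright square patch as a stand-in for an MNIST digit (the real dataset overlays two digits per 20-frame, 64x64 video; all names and parameters here are illustrative):

```python
import numpy as np

def moving_patch_video(T=20, size=64, patch=8, seed=0):
    """One square patch moves with a random constant velocity and
    bounces off the frame walls, mimicking Moving MNIST dynamics."""
    rng = np.random.default_rng(seed)
    pos = rng.integers(0, size - patch, size=2).astype(float)
    vel = rng.uniform(-3, 3, size=2)
    frames = np.zeros((T, size, size), dtype=np.float32)
    for t in range(T):
        x, y = pos.astype(int)
        frames[t, y:y + patch, x:x + patch] = 1.0
        pos += vel
        for k in range(2):                 # bounce off the walls
            if pos[k] < 0 or pos[k] > size - patch:
                vel[k] = -vel[k]
                pos[k] = np.clip(pos[k], 0, size - patch)
    return frames
```

Because the dataset is generated on the fly, the supply of training videos is effectively infinite, which is part of what makes Moving MNIST a convenient testbed.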
18
Results – Moving MNIST
Each layer consists of 2048 LSTM units. The loss function is cross-entropy. The model takes 10 frames as input, reconstructs them, and predicts the next 10.
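The per-pixel cross-entropy used on Moving MNIST treats each pixel as an independent Bernoulli variable; it is straightforward to write down (the `eps` clipping is a standard numerical-stability addition, not from the paper):

```python
import numpy as np

def pixel_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy over pixels: pred holds predicted
    pixel-on probabilities, target holds binary ground-truth pixels."""
    pred = np.clip(pred, eps, 1 - eps)     # avoid log(0)
    return -np.mean(target * np.log(pred)
                    + (1 - target) * np.log(1 - pred))
```

A maximally uncertain prediction of 0.5 everywhere gives a loss of ln 2 per pixel, a useful baseline when reading the paper's loss curves.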
19
Results – Moving MNIST
20
Results – Natural Image
32x32 patches from the UCF-101 dataset. The loss function is squared error. The model takes 16 frames as input, reconstructs them, and predicts the next 13.
21
Results – Natural Image
22
Results - Comparison
23
Results – Generalization over time
Took the Moving MNIST digits with 64x64 patches. Training used 10 frames as input; the model was then run forward for 100 steps. The activations of 200 randomly chosen LSTM units remain periodic. For reconstruction, after about 15 steps only blobs are returned.
24
Results – Out-of-Domain Inputs
With one moving digit, the model does a good job but tries to hallucinate a second digit; with three digits, it returns only blobs.
25
Results – Feature Visualization
26
Results – Feature Visualization
The output features match more blobs than the input features, and their stripes are much shorter. In the input, longer features are wanted because the model must learn the velocity and direction of the movement and store that information across many features. In the output, shorter and fatter features reduce the error, and features should be no longer than the digits themselves.
27
Results – Supervised Learning
Unsupervised pre-training was performed on a subset of the Sport 1M dataset. The trained encoder was taken as the base for the LSTM classifier, and a Softmax layer was added on top. Fine-tuning was performed on the UCF-101 and HMDB-51 datasets.
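Adding the Softmax layer on top of the pretrained encoder amounts to a linear map from the encoder's final state into the action classes, followed by a softmax; a minimal sketch (the helper names and the 101-class shape below are illustrative, matching UCF-101):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over class scores."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def classify(encoder_state, W_softmax):
    """Action classifier head: the pretrained encoder's final state is
    projected by the added Softmax layer into class probabilities."""
    return softmax(W_softmax @ encoder_state)
```

During fine-tuning, both `W_softmax` and the encoder weights are updated on the labeled UCF-101/HMDB-51 data, starting from the unsupervised initialization.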
28
Results – Supervised Learning
29
Results – Supervised Learning
The first model uses only RGB data; the second uses flow-generated features; the third uses both.
30
Questions