1
Unsupervised Learning of Video Representations using LSTMs
Nitish Srivastava, Elman Mansimov, Ruslan Salakhutdinov, University of Toronto
2
Agenda Quick Intro Supervised vs. Unsupervised Problem Definition
Model Description Experiments and Datasets Results
3
Quick Intro “Videos are an abundant and rich source of visual information and can be seen as a window into the physics of the world we live in”.
4
Supervised vs. Unsupervised
Supervised learning has performed well for visual representations of images, but video representation requires far more labels: videos are higher-dimensional entities; problems are defined over time, not over a single frame; each frame yields many features, and each feature can carry many labels. Data can be collected in a way that helps solve a specific problem.
5
Supervised vs. Unsupervised
Let's play a game: supervised or unsupervised? Shakespeare text generation Predicting a video's next frame Detecting hand gestures from a video sequence LSTM: supervised or unsupervised?
6
Model Description The model aims to achieve 3 goals:
Predict the same sequence as the input (reconstruction) Predict the future frames Serve as pre-training for the supervised task of action recognition
7
Model Description – Previous Work
ICA-based: Hurri, Jarmo, "Simple-cell-like receptive fields maximize temporal coherence in natural video", 2003. ISA-based: Le, Q., "Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis", 2011. Generative-model-based: Memisevic, R., "Learning to represent spatial transformations with factored higher-order Boltzmann machines"; Ranzato, M., "Video (language) modeling: a baseline for generative models of natural videos", 2014, which used recurrent neural networks to predict the next frame, argued that the squared loss function is not optimal, and proposed quantizing the images into a large dictionary.
8
Model Description - LSTM
Long Short-Term Memory (LSTM) units are used to represent the physics of the world at each time step. LSTM units sum activities over time, and the same operations are applied at every step.
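The gating arithmetic behind those summed activities can be sketched in a few lines. Below is a minimal single-step LSTM cell in NumPy; the function name `lstm_step` and the stacked-weights layout are illustrative choices, not from the paper:

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4H, D) input weights, U: (4H, H) recurrent
    weights, b: (4H,) biases; the four gate pre-activations are stacked.
    Returns the new hidden state h and cell state c."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b            # all four pre-activations at once
    i = 1 / (1 + np.exp(-z[:H]))          # input gate
    f = 1 / (1 + np.exp(-z[H:2*H]))       # forget gate
    o = 1 / (1 + np.exp(-z[2*H:3*H]))     # output gate
    g = np.tanh(z[3*H:])                  # candidate cell update
    c = f * c_prev + i * g                # cell state sums activities over time
    h = o * np.tanh(c)
    return h, c
```

Applying the same `lstm_step` at every frame is exactly the "same operation at each step" property the slide refers to.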
9
Model Description - LSTM
10
Model Description - Configuration
11
Model Description - Autoencoder
Used for input reconstruction. Two RNNs: the encoder LSTM and the decoder LSTM. The decoder's target sequence is the input sequence in reverse order.
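A toy sketch of the encoder-decoder wiring, with a plain tanh RNN standing in for the LSTMs; the helper names (`encode`, `decode`, `reconstruction_targets`) are made up for illustration. The key detail from the slide is that the decoder's target is the input in reverse order:

```python
import numpy as np

def encode(frames, Wx, Wh):
    """Encoder stand-in: fold the input frames into one fixed-size state."""
    h = np.zeros(Wh.shape[0])
    for x in frames:
        h = np.tanh(Wx @ x + Wh @ h)
    return h

def decode(h, Wh, Wy, steps):
    """Decoder stand-in: unroll from the encoder state, emitting one
    reconstructed frame per step."""
    outs = []
    for _ in range(steps):
        h = np.tanh(Wh @ h)
        outs.append(Wy @ h)
    return np.stack(outs)

def reconstruction_targets(frames):
    """Training target: the input sequence in REVERSE order."""
    return frames[::-1]
```

Reversing the target means the decoder starts with the most recently seen frame, which is the easiest one to reconstruct from the final encoder state.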
12
Model Description - Autoencoder
Question: what prevents the network from learning the identity function? Answer: the fixed number of hidden units prevents trivial mappings, and because LSTM operations are applied recursively, the same dynamics must be applied at every step.
13
Model Description – Future Prediction
Used for predicting the next sequence of frames. Same as the autoencoder, except the decoder predicts the future frames instead of reconstructing the input. The hidden state coming out of the encoder captures the information from the previous frames.
14
Model Description – Conditional Decoder
During decoding, the last output frame is fed back as the next input. Pros: allows the decoder to model multiple modes in the target distribution. Cons: the model may latch onto the strong correlations between consecutive frames and forget the long-term structure of the sequence.
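The difference between a conditional and an unconditioned decoder is only in what is fed in at each step. A hedged tanh-RNN sketch (function and parameter names are illustrative, not the paper's):

```python
import numpy as np

def decode(h0, Wh, Wx, Wy, steps, conditional):
    """Decoder unroll. If `conditional`, the frame emitted at step t-1 is
    fed back as the input at step t; otherwise the input is all zeros."""
    h = h0
    y = np.zeros(Wy.shape[0])
    outs = []
    for _ in range(steps):
        x = y if conditional else np.zeros_like(y)  # feedback vs. no input
        h = np.tanh(Wh @ h + Wx @ x)
        y = Wy @ h
        outs.append(y)
    return np.stack(outs)
```

The first emitted frame is identical in both modes (there is nothing to feed back yet); the trajectories diverge from the second step on.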
15
Model Description – Composite Model
16
Model Description – Composite Model
Overcomes the shortcomings of each individual model: the autoencoder alone tends to learn trivial representations that memorize the input, and such memorization is not useful for predicting the future; the future predictor alone tends to store information only about the last few frames, while reconstructing the input requires knowledge of all the frames.
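The composite wiring can be sketched as one shared encoder feeding two decoders, with the training loss summing both branches. A toy tanh-RNN version (all names illustrative; the paper uses LSTMs and, on Moving MNIST, cross-entropy rather than squared error):

```python
import numpy as np

def run_decoder(h, Wh, Wy, steps):
    """Unroll one decoder branch from the shared encoder state."""
    outs = []
    for _ in range(steps):
        h = np.tanh(Wh @ h)
        outs.append(Wy @ h)
    return np.stack(outs)

def composite_loss(frames_in, frames_future, Wx, Wh, Wr, Wp, Wy):
    """Encode once, decode twice: one branch reconstructs the input
    (target reversed), the other predicts the future frames; the
    objective is the sum of both squared errors."""
    h = np.zeros(Wh.shape[0])
    for x in frames_in:                    # shared encoder
        h = np.tanh(Wx @ x + Wh @ h)
    recon = run_decoder(h, Wr, Wy, len(frames_in))
    pred = run_decoder(h, Wp, Wy, len(frames_future))
    return (np.mean((recon - frames_in[::-1]) ** 2)
            + np.mean((pred - frames_future) ** 2))
```

Because both branches share the encoder state, the representation must both retain the full input and carry what is needed to extrapolate forward.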
17
Experiments and Dataset
UCF-101: videos belonging to 101 different action classes. HMDB-51: 5,100 videos belonging to 51 different action classes. Moving MNIST: each video is 20 frames of 64x64 patches and consists of two randomly sampled digits moving with random velocities. Sport 1M: 1 million YouTube clips.
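A minimal generator in the spirit of Moving MNIST, using a bright square patch as a stand-in for an MNIST digit (the real dataset overlays two digits per 20-frame, 64x64 video; all names and parameters here are illustrative):

```python
import numpy as np

def moving_patch_video(T=20, size=64, patch=8, seed=0):
    """One square patch moves with a random constant velocity and
    bounces off the frame walls, mimicking Moving MNIST dynamics."""
    rng = np.random.default_rng(seed)
    pos = rng.integers(0, size - patch, size=2).astype(float)
    vel = rng.uniform(-3, 3, size=2)
    frames = np.zeros((T, size, size), dtype=np.float32)
    for t in range(T):
        x, y = pos.astype(int)
        frames[t, y:y + patch, x:x + patch] = 1.0
        pos += vel
        for k in range(2):                 # bounce off the walls
            if pos[k] < 0 or pos[k] > size - patch:
                vel[k] = -vel[k]
                pos[k] = np.clip(pos[k], 0, size - patch)
    return frames
```

Because the dataset is generated on the fly, the supply of training videos is effectively infinite, which is part of what makes Moving MNIST a convenient testbed.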
18
Results – Moving MNIST
Each layer consists of 2048 LSTM units. The loss function is cross-entropy. The model takes 10 frames as input, reconstructs them, and predicts the next 10.
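The per-pixel cross-entropy used on Moving MNIST treats each pixel as an independent Bernoulli variable; it is straightforward to write down (the `eps` clipping is a standard numerical-stability addition, not from the paper):

```python
import numpy as np

def pixel_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy over pixels: pred holds predicted
    pixel-on probabilities, target holds binary ground-truth pixels."""
    pred = np.clip(pred, eps, 1 - eps)     # avoid log(0)
    return -np.mean(target * np.log(pred)
                    + (1 - target) * np.log(1 - pred))
```

A maximally uncertain prediction of 0.5 everywhere gives a loss of ln 2 per pixel, a useful baseline when reading the paper's loss curves.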
19
Results – Moving MNIST
20
Results – Natural Image
32x32 patches from the UCF-101 dataset. The loss function is squared error. The model takes 16 frames as input, reconstructs them, and predicts the next 13.
21
Results – Natural Image
22
Results - Comparison
23
Results – Generalization over time
Took the Moving MNIST digits with 64x64 patches. Training used 10 frames as input; the model was then run forward for 100 steps. The activations of 200 randomly chosen LSTM units remain periodic. For reconstruction, after about 15 steps only blobs are returned.
24
Results – Out-of-Domain Inputs
With one moving digit, the model does a good job but tries to hallucinate a second digit; with three digits, it returns only blobs.
25
Results – Feature Visualization
26
Results – Feature Visualization
The output features match more blobs than the input features, and their stripes are much shorter. In the input, longer features are wanted because the model must learn the velocity and direction of the movement and store that information across many features. In the output, shorter and fatter features reduce the error, and features should be no longer than the digits themselves.
27
Results – Supervised Learning
Unsupervised pre-training was performed on a subset of the Sport 1M dataset. The trained encoder was taken as the base for the LSTM classifier, and a Softmax layer was added on top. Fine-tuning was performed on the UCF-101 and HMDB-51 datasets.
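Adding the Softmax layer on top of the pretrained encoder amounts to a linear map from the encoder's final state into the action classes, followed by a softmax; a minimal sketch (the helper names and the 101-class shape below are illustrative, matching UCF-101):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over class scores."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def classify(encoder_state, W_softmax):
    """Action classifier head: the pretrained encoder's final state is
    projected by the added Softmax layer into class probabilities."""
    return softmax(W_softmax @ encoder_state)
```

During fine-tuning, both `W_softmax` and the encoder weights are updated on the labeled UCF-101/HMDB-51 data, starting from the unsupervised initialization.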
28
Results – Supervised Learning
29
Results – Supervised Learning
The first model uses only RGB data; the second uses flow-generated features; the third uses both.
30
Questions