Unsupervised Learning of Video Representations using LSTMs

1 Unsupervised Learning of Video Representations using LSTMs
Nitish Srivastava, Elman Mansimov, Ruslan Salakhutdinov, University of Toronto

2 Agenda
Quick Intro
Supervised vs. Unsupervised
Problem Definition
Model Description
Experiments and Datasets
Results

3 Quick Intro “Videos are an abundant and rich source of visual information and can be seen as a window into the physics of the world we live in”.

4 Supervised vs. Unsupervised
Supervised learning has performed well for visual representations of images. For video representations, many more labels would be required:
Videos are higher-dimensional entities.
Problems are defined over time, not over a single frame.
There are many features to collect at each frame, and each feature can carry many labels.
With unsupervised learning, data can instead be collected in a way that helps solve a specific problem without labeling effort.

5 Supervised vs. Unsupervised
Let's play a game: is each LSTM task below supervised or unsupervised?
Shakespeare text generation
Predicting a video's next frame
Detecting hand gestures in a video sequence

6 Model Description
The model aims to achieve three goals:
Reconstruct the same sequence as the input.
Predict the future frames.
Serve as pre-training for the supervised task of action recognition.

7 Model Description – Previous Work
ICA-based: Hurri, J., "Simple-cell-like receptive fields maximize temporal coherence in natural video", 2003.
ISA-based: Le, Q., "Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis", 2011.
Generative-model-based:
Memisevic, R., "Learning to represent spatial transformations with factored higher-order Boltzmann machines", 2010.
Ranzato, M., "Video (language) modeling: a baseline for generative models of natural videos", 2014, which used recurrent neural networks to predict the next frame, argued that the squared loss function is not optimal, and proposed quantizing the images into a large dictionary.

8 Model Description - LSTM
Uses Long Short-Term Memory (LSTM) units to represent the physics of the world at each time step.
LSTM units sum activity over time; the same operations are applied at every step of the way.
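
To make the recurrence concrete, here is a minimal sketch of a single LSTM step in plain NumPy (standard gate formulation; all variable names are illustrative, not taken from the paper):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x, h_prev, c_prev, W, b):
        """One LSTM time step; W maps [x; h_prev] to the four gate pre-activations."""
        H = h_prev.shape[0]
        z = np.concatenate([x, h_prev]) @ W + b   # pre-activations, shape (4*H,)
        i = sigmoid(z[0*H:1*H])                   # input gate
        f = sigmoid(z[1*H:2*H])                   # forget gate
        o = sigmoid(z[2*H:3*H])                   # output gate
        g = np.tanh(z[3*H:4*H])                   # candidate cell update
        c = f * c_prev + i * g                    # cell state accumulates gated activity over time
        h = o * np.tanh(c)                        # hidden state
        return h, c

The same weights W and b are applied at every time step, which is what lets a unit integrate information over the whole sequence.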

9 Model Description - LSTM

10 Model Description - Configuration

11 Model Description - Autoencoder
Used for input reconstruction.
Two RNNs: the encoder LSTM and the decoder LSTM.
The target sequence is the input sequence in reverse order.
[Figure: encoder LSTM unrolled over the input, its final state initializing the decoder LSTM]

12 Model Description - Autoencoder
Question: what prevents the network from learning the identity function?
Answer: the fixed number of hidden units prevents trivial mappings, and LSTM operations are applied recursively, so the same dynamics must hold at every stage.
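
A compact sketch of this encoder-decoder reconstruction setup, reusing the lstm_step helper from the earlier sketch (the weight names and the linear pixel readout are assumptions for illustration):

    import numpy as np

    def autoencode(frames, W_enc, b_enc, W_dec, b_dec, W_out, b_out):
        """Encode a frame sequence, then decode it; the target is the reversed input."""
        H = b_enc.shape[0] // 4
        h, c = np.zeros(H), np.zeros(H)
        for x in frames:                          # encoder: run over the input sequence
            h, c = lstm_step(x, h, c, W_enc, b_enc)
        # h, c now summarize the whole input; the decoder starts from this state
        outputs, x = [], np.zeros_like(frames[0])
        for _ in range(len(frames)):
            h, c = lstm_step(x, h, c, W_dec, b_dec)
            outputs.append(h @ W_out + b_out)     # linear readout back to pixel space
        targets = frames[::-1]                    # reconstruct in reverse order
        return outputs, targets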

13 Model Description – Future Prediction
Used for predicting the subsequent frames.
Same as the autoencoder, but the decoder predicts the future frames instead.
The hidden state coming out of the encoder captures the information from the previous frames.
[Figure: encoder LSTM feeding a future-prediction decoder LSTM]

14 Model Description – Conditional Decoder
During decoding, use the last output frame as the next input.
Pros: allows the decoder to model multiple modes in the target distribution.
Cons: the model may pick up the strong correlations between consecutive frames and forget the long-term structure of the sequence.
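
A sketch of the decoder loop with and without this conditioning, again reusing lstm_step (names are illustrative; conditional=True feeds each predicted frame back in as the next input):

    import numpy as np

    def decode(h, c, n_steps, W_dec, b_dec, W_out, b_out, frame_dim, conditional=True):
        """Unroll the decoder; optionally condition each step on the previous prediction."""
        x, preds = np.zeros(frame_dim), []
        for _ in range(n_steps):
            h, c = lstm_step(x, h, c, W_dec, b_dec)
            frame = h @ W_out + b_out
            preds.append(frame)
            x = frame if conditional else np.zeros(frame_dim)  # feedback vs. zero input
        return preds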

15 Model Description – Composite Model

16 Model Description – Composite Model
Overcomes the shortcomings of each individual model (a sketch follows below):
The autoencoder tends to learn trivial representations that memorize the input, and this memorization is not useful for predicting the future.
The future predictor tends to store information only about the last few frames, while input reconstruction needs knowledge of all the frames.
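
A minimal sketch of the composite layout, assuming the lstm_step and decode helpers from the earlier sketches: one shared encoder whose final state is copied into two separate decoders, one for reconstruction and one for future prediction.

    import numpy as np

    def composite_forward(inputs, future_len, p):
        """Shared encoder; one decoder reconstructs the input, the other predicts the future."""
        H = p['b_enc'].shape[0] // 4
        h, c = np.zeros(H), np.zeros(H)
        for x in inputs:
            h, c = lstm_step(x, h, c, p['W_enc'], p['b_enc'])
        D = inputs[0].shape[0]
        # both decoders start from copies of the same encoder state
        recon = decode(h.copy(), c.copy(), len(inputs), p['W_rec'], p['b_rec'],
                       p['W_rout'], p['b_rout'], D, conditional=False)
        future = decode(h.copy(), c.copy(), future_len, p['W_fut'], p['b_fut'],
                        p['W_fout'], p['b_fout'], D, conditional=False)
        return recon, future  # compare recon to inputs[::-1] and future to the next frames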

17 Experiments and Datasets
UCF-101: 13,320 videos belonging to 101 different action classes.
HMDB-51: 5,100 videos belonging to 51 different action classes.
Moving MNIST: each video is 20 frames of 64x64 patches and consists of two randomly sampled digits with random velocities.
Sports-1M: 1 million YouTube clips.

18 Results – Moving MNIST
Each layer consists of 2048 LSTM units.
The loss function is cross-entropy.
Took 10 frames as input, reconstructed them, and predicted the next 10.
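
Since the Moving MNIST pixels are treated as (near-)binary, a per-pixel logistic cross-entropy of roughly this form is the natural reading (a sketch; predictions are assumed to be sigmoid outputs in (0, 1)):

    import numpy as np

    def pixel_cross_entropy(pred, target, eps=1e-7):
        """Mean binary cross-entropy between predicted and target pixel intensities."""
        pred = np.clip(pred, eps, 1.0 - eps)      # guard against log(0)
        return -np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))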

19 Results – Moving MNIST

20 Results – Natural Image
32x32 patches from the UCF-101 dataset.
The loss function is squared error.
Took 16 frames as input, reconstructed them, and predicted the next 13.
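
For these real-valued natural image patches the objective is instead squared error (sketch):

    import numpy as np

    def pixel_squared_error(pred, target):
        """Mean squared error between predicted and target patches."""
        return np.mean((pred - target) ** 2)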

21 Results – Natural Image

22 Results - Comparison

23 Results – Generalization over time
Took the Moving MNIST digits with 64x64 patches; training was done using 10 frames as input.
The model was then run for 100 steps.
The activities of 200 randomly chosen LSTM units remain periodic over the whole run.
For reconstruction, after 15 steps only blobs are returned.

24 Results – Out-of-Domain Inputs
For a single digit, the model does a good job but tries to hallucinate a second digit.
For three digits, it returns only blobs.

25 Results – Feature Visualization

26 Results – Feature Visualization
The output features contain more blobs than the input features, and their stripes are much shorter.
In the input, longer features are wanted because they encode the velocity and direction of the movement, storing that information across many features.
In the output, shorter and fatter features reduce the error; features should not be longer than the digits themselves.

27 Results – Supervised Learning
Unsupervised pre-training was performed on a subset of the Sports-1M dataset.
The trained encoder was taken as the base for the LSTM classifier, and a softmax layer was added.
Fine-tuning was performed on the UCF-101 and HMDB-51 datasets.
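
A sketch of that classifier head, assuming the pretrained lstm_step encoder weights from the earlier sketches (the classifier weight names are hypothetical):

    import numpy as np

    def classify(frames, p, W_cls, b_cls):
        """Run the pretrained encoder over the clip, then a softmax over action classes."""
        H = p['b_enc'].shape[0] // 4
        h, c = np.zeros(H), np.zeros(H)
        for x in frames:
            h, c = lstm_step(x, h, c, p['W_enc'], p['b_enc'])
        logits = h @ W_cls + b_cls
        e = np.exp(logits - logits.max())         # numerically stable softmax
        return e / e.sum()                        # class probabilities

During fine-tuning, both the encoder weights and the softmax layer would be updated on the labeled clips.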

28 Results – Supervised Learning

29 Results – Supervised Learning
The first model uses only RGB data.
The second uses optical-flow-generated features.
The third uses both.

30 Questions

