CSC 2535: Computation in Neural Networks Lecture 9 Learning Multiple Layers of Features Greedily Geoffrey Hinton.




The story so far
We want to learn models with multiple layers of non-linear features.
Perceptrons: Use a layer of hand-coded, non-adaptive features followed by a layer of adaptive decision units. Needs a supervision signal for each training case. Only one layer of adaptive weights.
Back-propagation: Use multiple layers of adaptive features and train by backpropagating error derivatives. Learning time scales poorly for deep networks.
Support Vector Machines: Use a very large set of fixed features. Does not learn multiple layers of features.

The story so far (continued)
Boltzmann Machines: Nice local learning rule that works in arbitrary networks. Inference requires MCMC. Maximum likelihood learning requires unbiased samples from the model’s distribution, and these are very hard to get.
Sigmoid Belief Nets: It's nice to have a generative model that we can generate from. Exact inference is intractable, but learning still works if we use approximate inference (e.g. factorial distributions).
Restricted Boltzmann Machines: Exact inference is very easy because the posterior over hidden configurations is factorial. Maximum likelihood learning is slow, but contrastive divergence learning is fast and often works well. But we can only learn one layer of adaptive features!

Recursive Restricted Boltzmann Machines
First learn a layer of hidden features. Then treat the feature activations as data and learn a second layer of hidden features, and so on for as many hidden layers as we want.
Is this just a hack? Can we treat all of the hidden layers as part of one big generative model rather than a hierarchy of separate models? Can we prove that adding more layers will always help?

Learning by dividing and conquering
Re-weighting the data: In boosting, we learn a sequence of simple models. After learning each model, we re-weight the data so that the next model learns to deal with the cases that the previous models found difficult. There is a nice guarantee that the overall model gets better.
Projecting the data: In PCA, we find the leading eigenvector and then project the data into the orthogonal subspace.
Distorting the data: In projection pursuit, we find a non-Gaussian direction and then distort the data so that it is Gaussian along this direction.
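As a rough illustration of the "projecting the data" strategy, the NumPy sketch below finds the leading eigenvector of the covariance matrix and then removes that direction from the data, so that a second call to the same base learner finds a new, orthogonal direction. The function names and the synthetic data are assumptions made for this example only.

```python
import numpy as np

def leading_direction(X):
    """Unit eigenvector of the covariance of X with the largest eigenvalue."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return eigvecs[:, -1]

def deflate(X, u):
    """Project the centred data onto the subspace orthogonal to unit vector u."""
    Xc = X - X.mean(axis=0)
    return Xc - np.outer(Xc @ u, u)

rng = np.random.default_rng(0)
# Synthetic data with a different spread along each axis (illustrative only).
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.3])

u1 = leading_direction(X)    # first call to the base learner
X1 = deflate(X, u1)          # pass transformed data to the next learner
u2 = leading_direction(X1)   # second call finds a direction orthogonal to u1
```

After deflation the data has no variance left along `u1`, which is why the second direction comes out orthogonal to the first.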

Another way to divide and conquer
Re-representing the data: Each time the base learner is called, it passes a transformed version of the data to the next learner.
Can we learn a deep, dense DAG one layer at a time, starting at the bottom, and still guarantee that learning each layer improves the overall model of the training data? This seems very unlikely. Surely we need to know the weights in higher layers to learn lower layers?

Why it's hard to learn one layer at a time
To learn W, we need the posterior distribution in the first hidden layer.
Problem 1: The posterior is typically intractable because of “explaining away”.
Problem 2: The posterior depends on the prior as well as the likelihood. So to learn W, we need to know the weights in higher layers, even if we are only approximating the posterior. All the weights interact.
Problem 3: We need to integrate over all possible configurations of the higher variables to get the prior for the first hidden layer. Yuk!
[Figure: three layers of hidden variables above the data; labels: prior, likelihood, W.]

Using complementary priors to eliminate explaining away
A “complementary” prior is defined as one that exactly cancels the correlations created by explaining away. So the posterior factors.
Under what conditions do complementary priors exist? Complementary priors do not exist in general: parameter counting shows that complementary priors cannot exist if the relationship between the hidden variables and the data is defined by a separate conditional probability table for each hidden configuration.
[Figure: three layers of hidden variables above the data; labels: prior, likelihood.]

An example of a complementary prior
The distribution generated by this infinite DAG with replicated weights is the equilibrium distribution for a compatible pair of conditional distributions: p(v|h) and p(h|v). An ancestral pass of the DAG is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium. So this infinite DAG defines the same distribution as an RBM.
[Figure: infinite directed net with layers v0, h0, v1, h1, v2, h2, etc.]

Inference in a DAG with replicated weights
The variables in h0 are conditionally independent given v0. Inference is trivial: we just multiply v0 by $W^T$. This is because the model above h0 implements a complementary prior.
Inference in the DAG is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium starting at the data.
[Figure: infinite directed net with layers v0, h0, v1, h1, v2, h2, etc.]
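A small numerical check of the claim that the hidden units are conditionally independent given the visible vector: for a tiny RBM, the posterior obtained by enumerating every hidden configuration should match the product of per-unit logistic probabilities computed from the transposed weight matrix times v0. The sizes, the random weights, and the hidden biases `b_hid` are assumptions for this sketch.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_vis, n_hid = 4, 3
W = rng.normal(scale=0.5, size=(n_vis, n_hid))  # W[i, j] links visible i to hidden j
b_hid = rng.normal(scale=0.5, size=n_hid)       # hidden biases (assumed for the sketch)
v0 = np.array([1.0, 0.0, 1.0, 1.0])

# Fast inference: one matrix product, then a logistic function per hidden unit.
p_on = sigmoid(W.T @ v0 + b_hid)

# Brute force: enumerate all 2**n_hid hidden configurations.
hs = np.array(list(itertools.product([0, 1], repeat=n_hid)), dtype=float)
log_unnorm = hs @ (W.T @ v0 + b_hid)  # the part of -E(v0, h) that depends on h
posterior = np.exp(log_unnorm)
posterior /= posterior.sum()

# Product of independent per-unit probabilities for each configuration.
factorial = np.prod(np.where(hs == 1, p_on, 1.0 - p_on), axis=1)
max_gap = float(np.max(np.abs(posterior - factorial)))
```

`max_gap` should be zero up to floating-point rounding, because the energy is a sum of terms that each involve a single hidden unit.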

A picture of the Boltzmann machine learning algorithm for an RBM
Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.
[Figure: alternating Gibbs chain between visible units i and hidden units j at t = 0, t = 1, t = 2, ..., t = infinity; the state at t = infinity is labelled "a fantasy".]

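The learning rule for a logistic DAG is:
$$\Delta w_{ij} \propto \langle h_j^0 (v_i^0 - \hat{v}_i^0) \rangle$$
where $\hat{v}_i^0$ is the probability that visible unit i would be on given the hidden states $h^0$.
With replicated weights this becomes:
$$\Delta w_{ij} \propto \langle h_j^0 (v_i^0 - v_i^1) \rangle + \langle v_i^1 (h_j^0 - h_j^1) \rangle + \langle h_j^1 (v_i^1 - v_i^2) \rangle + \dots$$
The derivatives for the recognition weights are zero.
[Figure: infinite directed net with layers v0, h0, v1, h1, v2, h2, etc.]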

Pros and cons of replicating the weights
Advantages: There are many fewer parameters. There is an efficient approximate learning procedure. After learning, inference of hidden states is fast and accurate.
Disadvantages: The model is much less powerful than a deep network that has different weights in each layer. The brain clearly uses deep networks.

Contrastive divergence learning: A quick way to learn an RBM
Start with a training vector on the visible units. Update all the hidden units in parallel. Update all the visible units in parallel to get a “reconstruction”. Update the hidden units again.
This is not following the gradient of the log likelihood, but it works well. It is easy to understand what it does if we consider the infinite directed net.
[Figure: visible units i and hidden units j at t = 0 (data) and t = 1 (reconstruction).]
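The four steps of contrastive divergence can be sketched in NumPy as follows. This is a minimal CD-1 illustration under stated assumptions, not the lecture's own code: the learning rate, the toy two-pattern dataset, the network sizes, and the helper names (`cd1_step`, `reconstruction_error`) are all made up for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruction_error(v, W, b_vis, b_hid):
    """Squared error of a deterministic up-down pass (a monitoring heuristic only)."""
    ph = sigmoid(v @ W + b_hid)
    pv = sigmoid(ph @ W.T + b_vis)
    return float(np.mean((v - pv) ** 2))

def cd1_step(v0, W, b_vis, b_hid, rng, lr=0.2):
    """One CD-1 update on a batch v0 whose rows are binary training vectors."""
    # t = 0: update all hidden units in parallel, driven by the data.
    ph0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # t = 1: update all visible units in parallel to get a reconstruction.
    pv1 = sigmoid(h0 @ W.T + b_vis)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    # Update the hidden units again (probabilities are used for the statistics).
    ph1 = sigmoid(v1 @ W + b_hid)
    n = len(v0)
    W = W + lr * (v0.T @ ph0 - v1.T @ ph1) / n
    b_vis = b_vis + lr * (v0 - v1).mean(axis=0)
    b_hid = b_hid + lr * (ph0 - ph1).mean(axis=0)
    return W, b_vis, b_hid

rng = np.random.default_rng(0)
patterns = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = np.repeat(patterns, 10, axis=0)   # toy dataset: two repeated patterns
W = 0.01 * rng.normal(size=(6, 4))
b_vis, b_hid = np.zeros(6), np.zeros(4)

err_before = reconstruction_error(data, W, b_vis, b_hid)
for _ in range(2000):
    W, b_vis, b_hid = cd1_step(data, W, b_vis, b_hid, rng)
err_after = reconstruction_error(data, W, b_vis, b_hid)
```

Reconstruction error is only a convenient progress indicator here; it is not the quantity that contrastive divergence optimizes.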

Multilayer contrastive divergence
Start by learning one hidden layer. Then re-present the data as the activities of the hidden units. The same learning algorithm can now be applied to the re-presented data.
Can we prove that each step of this greedy learning improves the log probability of the data under the overall model? What is the overall model?
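The greedy procedure can be sketched as a loop that trains one RBM, re-presents the data as hidden activities, and repeats. This is an illustrative sketch only: `train_rbm` and `greedy_stack` are hypothetical helper names, the data is random, and passing hidden probabilities (rather than sampled binary states) up to the next layer is an assumption of this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hid, rng, steps=200, lr=0.1):
    """Fit one RBM with CD-1 and return (W, b_vis, b_hid)."""
    n_vis = data.shape[1]
    W = 0.01 * rng.normal(size=(n_vis, n_hid))
    b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
    for _ in range(steps):
        ph0 = sigmoid(data @ W + b_hid)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T + b_vis)
        v1 = (rng.random(pv1.shape) < pv1).astype(float)
        ph1 = sigmoid(v1 @ W + b_hid)
        W += lr * (data.T @ ph0 - v1.T @ ph1) / len(data)
        b_vis += lr * (data - v1).mean(axis=0)
        b_hid += lr * (ph0 - ph1).mean(axis=0)
    return W, b_vis, b_hid

def greedy_stack(data, hidden_sizes, rng):
    """Train one RBM per entry of hidden_sizes, bottom layer first."""
    layers, x = [], data
    for n_hid in hidden_sizes:
        W, b_vis, b_hid = train_rbm(x, n_hid, rng)
        layers.append((W, b_vis, b_hid))
        # Re-present the data as the activities of the hidden units.
        x = sigmoid(x @ W + b_hid)
    return layers, x

rng = np.random.default_rng(0)
data = (rng.random((50, 8)) < 0.3).astype(float)   # random binary toy data
layers, top_features = greedy_stack(data, [6, 4], rng)
```

Each call to `train_rbm` sees only the output of the layer below it, which is what makes the procedure greedy.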

A simplified version with all hidden layers the same size
The RBM at the top can be viewed as shorthand for an infinite directed net. When learning W1 we can view the model in two quite different ways:
The model is an RBM composed of the data layer and h1.
The model is an infinite DAG with tied weights.
After learning W1 we untie it from the other weight matrices. We then learn W2, which is still tied to all the matrices above it.
[Figure: layers h3, h2, h1, data.]

The generative model
To generate data: Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling for a long time. Then perform a top-down ancestral pass to get states for all the other layers.
The lower-level, bottom-up connections are not part of the generative model. They are there to do fast approximate inference.
[Figure: layers h3, h2, h1, data.]
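The two-stage generation procedure (alternating Gibbs sampling in the top-level RBM, then a single top-down pass) might look like the following sketch. The weights are random rather than learned, the layer sizes are arbitrary, and `generate`, `top_rbm`, and `down_layers` are names invented for this illustration; a finite number of Gibbs steps only approximates an equilibrium sample.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p, rng):
    """Draw binary states with the given on-probabilities."""
    return (rng.random(p.shape) < p).astype(float)

def generate(top_rbm, down_layers, rng, gibbs_steps=500):
    """Sample one data vector from a stack with an RBM on top.

    top_rbm     : (W, b_lo, b_hi) for the RBM joining the top two layers.
    down_layers : list of (G, b) generative weights, ordered from just
                  below the top RBM down to the data layer.
    """
    W, b_lo, b_hi = top_rbm
    lo = sample(np.full(W.shape[0], 0.5), rng)   # arbitrary starting state
    # Alternating Gibbs sampling in the top-level RBM.
    for _ in range(gibbs_steps):
        hi = sample(sigmoid(lo @ W + b_hi), rng)
        lo = sample(sigmoid(hi @ W.T + b_lo), rng)
    # Top-down ancestral pass through the directed layers.
    state = lo
    for G, b in down_layers:
        state = sample(sigmoid(state @ G + b), rng)
    return state

rng = np.random.default_rng(0)
# Layer sizes (data=12, h1=8, h2=6, h3=5) are illustrative only.
top_rbm = (rng.normal(size=(6, 5)), np.zeros(6), np.zeros(5))   # h2 <-> h3
down_layers = [
    (rng.normal(size=(6, 8)), np.zeros(8)),     # h2 -> h1
    (rng.normal(size=(8, 12)), np.zeros(12)),   # h1 -> data
]
fantasy = generate(top_rbm, down_layers, rng)
```

No bottom-up weights appear anywhere in `generate`, which reflects the point that the recognition connections are not part of the generative model.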

Why does greedy learning work?
The weights, W, in the bottom level RBM define p(v|h) and they also, indirectly, define p(h). So we can express the RBM model as
$$p(v) = \sum_h p(h)\, p(v \mid h)$$
If we leave p(v|h) alone and build a better model of p(h), we will improve p(v). We need a better model of the posterior hidden vectors produced by applying W to the data.
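The decomposition of the RBM's distribution into a prior over hidden vectors and a conditional over visible vectors, p(v) = sum over h of p(h) p(v|h), can be verified by brute force on a tiny RBM. The sizes and random parameters below are assumptions chosen so that every configuration can be enumerated.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
n_vis, n_hid = 3, 2
W = rng.normal(size=(n_vis, n_hid))
b_vis = rng.normal(size=n_vis)
b_hid = rng.normal(size=n_hid)

vs = np.array(list(itertools.product([0, 1], repeat=n_vis)), dtype=float)  # (8, 3)
hs = np.array(list(itertools.product([0, 1], repeat=n_hid)), dtype=float)  # (4, 2)

# Joint distribution p(v, h) proportional to exp(-E(v, h)).
neg_energy = vs @ W @ hs.T + (vs @ b_vis)[:, None] + (hs @ b_hid)[None, :]
joint = np.exp(neg_energy)
joint /= joint.sum()

p_v = joint.sum(axis=1)   # marginal over visible configurations
p_h = joint.sum(axis=0)   # prior over hidden configurations implied by W

# p(v|h) is factorial: a product of logistic probabilities per visible unit.
pv_on = sigmoid(hs @ W.T + b_vis)   # shape (4, 3)
p_v_given_h = np.where(vs[:, None, :] == 1, pv_on[None], 1.0 - pv_on[None]).prod(axis=2)

mixture = p_v_given_h @ p_h   # sum_h p(h) p(v|h) for every v
```

`mixture` and `p_v` agree up to rounding. Replacing `p_h` with a different distribution while keeping `p_v_given_h` fixed is the kind of change the next layer is allowed to make.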

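Why the hidden configurations should be treated as data when learning the next layer of weights
After learning the first layer of weights:
$$\log p(v) \ge \sum_{\alpha} p(h = \alpha \mid v) \left[ \log p(h = \alpha) + \log p(v \mid h = \alpha) \right] + \mathrm{entropy}\big(p(h \mid v)\big)$$
If we freeze the generative weights that define the likelihood term and the recognition weights that define the distribution over hidden configurations, we get:
$$\log p(v) \ge \sum_{\alpha} p(h = \alpha \mid v) \log p(h = \alpha) + \mathrm{constant}$$
Maximizing the RHS is equivalent to maximizing the log prob of “data” $\alpha$ that occurs with probability $p(h = \alpha \mid v)$.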

Why greedy learning works
Each time we learn a new layer, the inference at the layer below becomes incorrect, but the variational bound on the log prob of the data improves. Since the bound starts as an equality, learning a new layer never decreases the log prob of the data, provided we start the learning from the tied weights that implement the complementary prior.
Now that we have a guarantee we can loosen the restrictions and still feel confident:
Allow layers to vary in size.
Do not start the learning at each layer from the weights in the layer below.

Back-fitting
After we have learned all the layers greedily, the weights in the lower layers will no longer be optimal. We can improve them in two ways:
Untie the recognition weights from the generative weights and learn recognition weights that take into account the non-complementary prior implemented by the weights in higher layers.
Improve the generative weights to take into account the non-complementary priors implemented by the weights in higher layers.
What algorithm should we use for improving on the weights that are learned greedily?

A neural network model of digit recognition
The top two layers form a restricted Boltzmann machine whose free energy landscape models the low dimensional manifolds of the digits. The valleys have names.
The model learns a joint density for labels and images. To perform recognition we can start with a neutral state of the label units and do one or two iterations of the top-level RBM. Or we can just compute the harmony of the RBM with each of the 10 labels.
[Figure: 28 x 28 pixel image, 500 units, 500 units, 2000 top-level units, with 10 label units.]
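The second recognition method, comparing the harmony of the top-level RBM under each candidate label, can be sketched by computing the free energy (negative harmony) of the feature vector concatenated with each one-hot label and picking the lowest. The sketch assumes the label units sit alongside the penultimate features on the visible side of the top-level RBM; the tiny sizes and random weights are illustrative, and `free_energy` and `classify` are names made up here.

```python
import itertools
import numpy as np

def free_energy(v, W, b_vis, b_hid):
    """F(v) = -b_vis.v - sum_j log(1 + exp(b_hid_j + (v W)_j)); harmony is -F."""
    return -(v @ b_vis) - np.logaddexp(0.0, v @ W + b_hid).sum(axis=-1)

def classify(features, W, b_vis, b_hid, n_labels):
    """Return the label whose one-hot code gives the lowest free energy."""
    one_hot = np.eye(n_labels)
    candidates = np.hstack([np.tile(features, (n_labels, 1)), one_hot])
    energies = free_energy(candidates, W, b_vis, b_hid)
    return int(np.argmin(energies)), energies

rng = np.random.default_rng(0)
n_feat, n_labels, n_top = 6, 3, 4   # stand-ins for 500, 10 and 2000
W = rng.normal(size=(n_feat + n_labels, n_top))
b_vis = rng.normal(size=n_feat + n_labels)
b_hid = rng.normal(size=n_top)
features = (rng.random(n_feat) < 0.5).astype(float)

label, energies = classify(features, W, b_vis, b_hid, n_labels)

# Brute-force check of the free energy for the first candidate label:
# F(v) = -log sum_h exp(-E(v, h)), enumerating all top-level states h.
hs = np.array(list(itertools.product([0, 1], repeat=n_top)), dtype=float)
v = np.concatenate([features, np.eye(n_labels)[0]])
brute = -np.logaddexp.reduce(v @ b_vis + hs @ (v @ W + b_hid))
```

The closed-form free energy avoids summing over all top-level states, which is what makes scoring all 10 labels cheap even with 2000 top-level units.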

Show the movie

Samples generated by running the top-level RBM with one label clamped. There are 1000 iterations of alternating Gibbs sampling between samples.

Examples of correctly recognized MNIST test digits (the 49 closest calls)

How well does it discriminate on the MNIST test set with no extra information about geometric distortions?
Method                                      Test error
Up-down net with RBM pre-training + CD10    1.25%
SVM (Decoste & Scholkopf)                   1.4%
Backprop with 1000 hiddens (Platt)          1.5%
Backprop with 500 --> 300 hiddens           1.5%
Separate hierarchy of RBMs per class        1.7%
Learned motor program extraction            ~1.8%
K-Nearest Neighbor                          ~3.3%
It's better than backprop and much more neurally plausible because the neurons only need to send one kind of signal, and the teacher can be another sensory input.

All 125 errors

Samples generated by running the top-level RBM with one label clamped. Initialized by an up-pass from a random binary image. 20 iterations between samples.

The wake-sleep algorithm
Wake phase: Use the recognition weights to perform a bottom-up pass. Train the generative weights to reconstruct activities in each layer from the layer above.
Sleep phase: Use the generative weights to generate samples from the model. Train the recognition weights to reconstruct activities in each layer from the layer below.
[Figure: layers h3, h2, h1, data.]
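One wake-sleep update can be sketched as below for a net with just two hidden layers (the slide's figure shows three). To keep the example short, all biases except the top-level generative bias are omitted; that simplification, the learning rate, the layer sizes, and the parameter names `R1`, `R2`, `G1`, `G2`, `b_top` are assumptions of this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def wake_sleep_step(v, p, rng, lr=0.05):
    """One wake phase and one sleep phase for a single data vector v."""
    # Wake phase: bottom-up pass using the recognition weights.
    h1 = sample(sigmoid(v @ p["R1"]), rng)
    h2 = sample(sigmoid(h1 @ p["R2"]), rng)
    # Train generative weights to reconstruct each layer from the layer above.
    p["b_top"] += lr * (h2 - sigmoid(p["b_top"]))
    p["G2"] += lr * np.outer(h2, h1 - sigmoid(h2 @ p["G2"]))
    p["G1"] += lr * np.outer(h1, v - sigmoid(h1 @ p["G1"]))
    # Sleep phase: generate a fantasy with the generative weights.
    d2 = sample(sigmoid(p["b_top"]), rng)
    d1 = sample(sigmoid(d2 @ p["G2"]), rng)
    d0 = sample(sigmoid(d1 @ p["G1"]), rng)
    # Train recognition weights to reconstruct each layer from the layer below.
    p["R1"] += lr * np.outer(d0, d1 - sigmoid(d0 @ p["R1"]))
    p["R2"] += lr * np.outer(d1, d2 - sigmoid(d1 @ p["R2"]))
    return p

rng = np.random.default_rng(0)
n_v, n_h1, n_h2 = 8, 5, 3
params = {
    "R1": 0.1 * rng.normal(size=(n_v, n_h1)),    # data -> h1 (recognition)
    "R2": 0.1 * rng.normal(size=(n_h1, n_h2)),   # h1 -> h2 (recognition)
    "G2": 0.1 * rng.normal(size=(n_h2, n_h1)),   # h2 -> h1 (generative)
    "G1": 0.1 * rng.normal(size=(n_h1, n_v)),    # h1 -> data (generative)
    "b_top": np.zeros(n_h2),                     # generative bias of top layer
}
G1_initial = params["G1"].copy()
data = (rng.random((20, n_v)) < 0.4).astype(float)   # random toy data
for _ in range(5):
    for v in data:
        params = wake_sleep_step(v, params, rng)
```

Both phases use the same local delta rule: each weight changes in proportion to the presynaptic state times the difference between the postsynaptic state and its predicted probability.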

The flaws in the wake-sleep algorithm
The recognition weights are trained to invert the generative model in parts of the space where there is no data. This is wasteful.
The recognition weights follow the gradient of the wrong divergence. They minimize KL(P||Q) but the variational bound requires minimization of KL(Q||P). This leads to incorrect mode-averaging.
The posterior over the top hidden layer is very far from independent because the independent prior cannot eliminate explaining away effects.

Mode averaging
If we generate from the model, half the instances of a 1 at the data layer will be caused by a (1,0) at the hidden layer and half will be caused by a (0,1). So the recognition weights will learn to produce (0.5,0.5). This represents a distribution that puts half its mass on very improbable hidden configurations. It's much better to just pick one mode and pay one bit.
[Figure: a small net labelled with the values -10, -10, +20, +20, -20, and a plot of P marking the minimum of KL(Q||P) and the minimum of KL(P||Q).]
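The trade-off between the two divergences can be checked numerically. The sketch below reads the numbers in the figure as two hidden units with bias -10 each, weights of +20 to a single data unit, and a data-unit bias of -20; that reading is an assumption. It computes the true posterior P over hidden configurations given that the data unit is on, and compares a factorial Q that averages the modes with one that (almost) commits to a single mode.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed reading of the figure: hidden biases, weights, data-unit bias.
b_h = np.array([-10.0, -10.0])
w = np.array([20.0, 20.0])
b_v = -20.0

configs = np.array(list(itertools.product([0, 1], repeat=2)), dtype=float)
prior = np.where(configs == 1, sigmoid(b_h), 1.0 - sigmoid(b_h)).prod(axis=1)
likelihood = sigmoid(b_v + configs @ w)   # p(v = 1 | h)
posterior = prior * likelihood
posterior /= posterior.sum()              # P(h | v = 1)

def factorial(q):
    """Distribution over configs with independent on-probabilities q."""
    return np.where(configs == 1, q, 1.0 - q).prod(axis=1)

def kl_bits(a, b):
    """KL(a || b) in bits, skipping zero-probability entries of a."""
    mask = a > 0
    return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

q_avg = factorial(np.array([0.5, 0.5]))            # mode-averaging Q
q_mode = factorial(np.array([1 - 1e-6, 1e-6]))     # Q that picks (1, 0)

kl_q_p_avg = kl_bits(q_avg, posterior)    # KL(Q||P) for the averaging Q
kl_q_p_mode = kl_bits(q_mode, posterior)  # KL(Q||P) for the single-mode Q
kl_p_q_avg = kl_bits(posterior, q_avg)    # KL(P||Q) for the averaging Q
kl_p_q_mode = kl_bits(posterior, q_mode)  # KL(P||Q) for the single-mode Q
```

Under KL(Q||P) the single-mode Q costs close to one bit while the averaging Q costs several bits; under KL(P||Q) the ordering reverses, which is why recognition weights trained in the sleep phase drift toward (0.5, 0.5).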

The up-down algorithm: A contrastive divergence version of wake-sleep
Replace the top layer of the DAG by an RBM. This eliminates bad variational approximations caused by top-level units that are independent in the prior. It is nice to have an associative memory at the top.
Replace the ancestral pass in the sleep phase by a top-down pass starting with the state of the RBM produced by the wake phase. This makes sure the recognition weights are trained in the vicinity of the data. It also reduces mode averaging. If the recognition weights prefer one mode, they will stick with that mode even if the generative weights like some other mode just as much.

The receptive fields of the first hidden layer

The generative fields of the first hidden layer