CIAR Second Summer School Tutorial Lecture 1b Contrastive Divergence and Deterministic Energy-Based Models Geoffrey Hinton.

Restricted Boltzmann Machines
We restrict the connectivity to make inference and learning easier: only one layer of hidden units, and no connections between hidden units. In an RBM it takes only one step to reach thermal equilibrium when the visible units are clamped, so we can quickly get the exact value of ⟨v_i h_j⟩ under the data. (Figure: bipartite graph of hidden units j above visible units i.)
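The one-step exact inference can be sketched in a few lines of NumPy. The sizes, random weights, and variable names below are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy RBM: 6 visible units, 4 hidden units (illustrative sizes).
W = rng.normal(0, 0.1, size=(6, 4))   # visible-to-hidden weights
b_h = np.zeros(4)                     # hidden biases

v = rng.integers(0, 2, size=6).astype(float)  # a clamped visible vector

# Because the hidden units are conditionally independent given v,
# one matrix multiply gives the exact posterior p(h_j = 1 | v).
p_h = sigmoid(v @ W + b_h)

# Exact positive-phase statistics <v_i h_j> under this posterior:
pos_stats = np.outer(v, p_h)
```

Because there are no hidden-to-hidden connections, no iteration is needed: this single pass already is the equilibrium distribution over the hidden units.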

A picture of the Boltzmann machine learning algorithm for an RBM
Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel. (Figure: the alternating chain at t = 0, 1, 2, …, ∞; the sample at t = ∞ is a "fantasy".)

The short-cut
Start with a training vector on the visible units. Update all the hidden units in parallel. Update all the visible units in parallel to get a "reconstruction". Update the hidden units again. (Figure: the chain truncated at t = 0, the data, and t = 1, the reconstruction.) This is not following the gradient of the log likelihood, but it works very well.
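The short-cut above is the CD-1 update. A minimal sketch for a binary RBM, with illustrative sizes and a learning rate chosen for the example:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_v, b_h, lr=0.1):
    """One CD-1 step for a binary RBM (toy sketch)."""
    # Up: sample the hidden units given the data.
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Down: update the visible units to get a "reconstruction".
    p_v1 = sigmoid(h0 @ W.T + b_v)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    # Up again: hidden probabilities for the reconstruction.
    p_h1 = sigmoid(v1 @ W + b_h)
    # Approximate gradient: <v h>_data - <v h>_reconstruction.
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b_v += lr * (v0 - v1)
    b_h += lr * (p_h0 - p_h1)
    return W, b_v, b_h

W = rng.normal(0, 0.1, size=(6, 4))
b_v, b_h = np.zeros(6), np.zeros(4)
v0 = rng.integers(0, 2, size=6).astype(float)
W, b_v, b_h = cd1_update(v0, W, b_v, b_h)
```

Note how the reconstruction replaces the equilibrium ("t = ∞") statistics of the full algorithm; this is exactly what makes the update biased but cheap.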

Contrastive divergence
The aim is to minimize the amount by which a step toward equilibrium improves the data distribution:

CD = KL(P⁰ ‖ P∞) − KL(P¹ ‖ P∞)

where P⁰ is the data distribution, P¹ is the distribution after one step of the Markov chain, and P∞ is the model's distribution. Minimize the divergence between the data distribution and the model's distribution; maximize the divergence between the confabulations and the model's distribution.

Contrastive divergence
Contrastive divergence makes the awkward terms cancel. The remaining subtlety is that changing the parameters also changes the distribution of confabulations.

How to learn a set of features that are good for reconstructing images of the digit 2
50 binary feature neurons sit above a 16 x 16 pixel image. For the Bartlett data (reality), increment the weights between each active pixel and each active feature. For the reconstruction (which has lower energy than reality), decrement the weights between each active pixel and each active feature.

The final 50 x 256 weights Each neuron grabs a different feature.

How well can we reconstruct the digit images from the binary feature activations?
For new test images from the digit class the model was trained on, the reconstructions from the activated binary features are close to the data. For images from an unfamiliar digit class, the network tries to see every image as a 2.

Another use of contrastive divergence CD is an efficient way to learn Restricted Boltzmann Machines. But it can also be used for learning other types of energy-based model that have multiple hidden layers. Methods very similar to CD have been used for learning non-probabilistic energy-based models (LeCun, Hertzmann).

Energy-Based Models with deterministic hidden units
Use multiple layers of deterministic hidden units with non-linear activation functions. Hidden activities contribute additively to the global energy, E: familiar features help, violated constraints hurt. (Figure: a stack of hidden layers j, k above the data, contributing energies E_j, E_k.)

Frequently Approximately Satisfied constraints
On a smooth intensity patch the sides balance the middle. The intensities in a typical image satisfy many different linear constraints very accurately, and violate a few constraints by a lot. The constraint violations fit a heavy-tailed distribution, so the negative log probabilities of constraint violations can be used as energies. (Figure: a "- + -" smoothness filter, and energy as a function of violation for a Gaussian versus a Cauchy.)
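The two energy shapes in the figure can be written down directly. A small sketch, using the standard unnormalized forms (the exact scale factors are illustrative): the Gaussian energy grows quadratically with the violation, while the Cauchy energy flattens out, so occasional large violations are cheap.

```python
import numpy as np

def gauss_energy(v):
    # Negative log of an (unnormalized) Gaussian: grows quadratically.
    return 0.5 * v ** 2

def cauchy_energy(v):
    # Negative log of an (unnormalized) Cauchy: flattens out, so a large
    # constraint violation costs far less than the quadratic would charge.
    return np.log1p(v ** 2)

small, big = 0.5, 10.0
# Near zero the two energies are comparable; for a big violation the
# heavy-tailed Cauchy energy is roughly an order of magnitude lower.
e_small = (gauss_energy(small), cauchy_energy(small))
e_big = (gauss_energy(big), cauchy_energy(big))
```

This is also why learning tolerates badly violated constraints (see the missing-inputs slide later): the Cauchy energy has a small slope far from zero.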

Reminder: Maximum likelihood learning is hard
To get high log probability for d we need low energy for d and high energy for its main rivals, c. To sample from the model, use Markov Chain Monte Carlo. But what kind of chain can we use when the hidden units are deterministic and the visible units are real-valued?

Hybrid Monte Carlo
We could find good rivals by repeatedly making a random perturbation to the data and accepting the perturbation with a probability that depends on the energy change. But this diffuses very slowly over flat regions and cannot cross energy barriers easily. In high-dimensional spaces, it is much better to use the gradient to choose good directions. HMC adds a random momentum and then simulates a particle moving on an energy surface: it beats diffusion, scales well, and can cross energy barriers. Back-propagation can give us the gradient of the energy surface.
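A minimal sketch of one HMC step, using standard leapfrog integration and a Metropolis accept test (the step size, trajectory length, and target energy here are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)

def hmc_step(x, energy, grad, n_leapfrog=20, eps=0.1):
    """One Hybrid Monte Carlo proposal with a Metropolis accept test."""
    p = rng.standard_normal(x.shape)          # fresh random momentum
    x_new, p_new = x.copy(), p.copy()
    # Leapfrog integration of the Hamiltonian dynamics.
    p_new -= 0.5 * eps * grad(x_new)
    for _ in range(n_leapfrog - 1):
        x_new += eps * p_new
        p_new -= eps * grad(x_new)
    x_new += eps * p_new
    p_new -= 0.5 * eps * grad(x_new)
    # Accept or reject based on the change in total (potential + kinetic) energy.
    dH = (energy(x_new) + 0.5 * p_new @ p_new) - (energy(x) + 0.5 * p @ p)
    return x_new if rng.random() < np.exp(-dH) else x

# Toy example: sample from the quadratic energy E(x) = 0.5 |x|^2.
E = lambda x: 0.5 * x @ x
dE = lambda x: x
x = np.zeros(2)
for _ in range(100):
    x = hmc_step(x, E, dE)
```

In the model of these slides, `grad` is exactly what back-propagation supplies: the derivative of the global energy with respect to the data vector.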

Trajectories with different initial momenta

Backpropagation can compute the gradient that Hybrid Monte Carlo needs
Do a forward pass computing hidden activities. Do a backward pass all the way to the data to compute the derivative of the global energy w.r.t. each component of the data vector. This works with any smooth non-linearity. (Figure: the same stack of hidden layers contributing energies E_j, E_k above the data.)
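The forward and backward passes can be sketched for a tiny one-hidden-layer energy model. The sizes, tanh non-linearity, and random weights are illustrative; the point is that the backward pass runs all the way to the data, not just to the first weight layer:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy deterministic energy model: hidden activities contribute
# additively to the global energy E (illustrative sizes).
W1 = rng.normal(0, 0.5, size=(4, 3))   # data -> hidden weights
w2 = rng.normal(0, 0.5, size=3)        # learned per-hidden energy scales

def energy_and_grad(d):
    a = d @ W1                       # forward pass: pre-activations
    h = np.tanh(a)                   # hidden activities (any smooth non-linearity)
    E = w2 @ h                       # each activity adds to the global energy
    # Backward pass, all the way down to the data vector:
    dE_da = w2 * (1.0 - h ** 2)      # tanh'(a) = 1 - tanh(a)^2
    dE_dd = W1 @ dE_da               # derivative of E w.r.t. each data component
    return E, dE_dd

d = rng.normal(size=4)
E, g = energy_and_grad(d)
```

The returned `dE_dd` is exactly the gradient a Hybrid Monte Carlo sampler would follow in dataspace.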

The online HMC learning procedure
Start at a datavector, d, and use backprop to compute ∂E(d)/∂θ for every parameter θ. Run HMC for many steps with frequent renewal of the momentum to get an equilibrium sample, c. Each step involves a forward and backward pass to get the gradient of the energy in dataspace. Use backprop to compute ∂E(c)/∂θ. Update the parameters by Δθ ∝ ∂E(c)/∂θ − ∂E(d)/∂θ.

The shortcut
Instead of taking the negative samples from the equilibrium distribution, use slight corruptions of the datavectors: only add random momentum once, and only follow the dynamics for a few steps. This has much less variance because a datavector and its confabulation form a matched pair. It gives a very biased estimate of the gradient of the log likelihood, but a good estimate of the gradient of the contrastive divergence (i.e. the amount by which F falls during the brief HMC). It is very hard to say anything about what this method does to the log likelihood because it only looks at rivals in the vicinity of the data. It is also hard to say exactly what it does to the contrastive divergence because the Markov chain defines what we mean by "vicinity", and the chain keeps changing as the parameters change. But it works well empirically, and it can be proved to work well in some very simple cases.

A simple 2-D dataset The true data is uniformly distributed within the 4 squares. The blue dots are samples from the model.

The network for the 4 squares task
Two input units feed 20 logistic units, which feed 3 logistic units; each hidden unit contributes an energy, E, equal to its activity times a learned scale.

Learning the constraints on an arm
A 3-D arm with 4 links and 5 joints. The outputs are linear in the joint coordinates, the outputs are squared, and there is an energy for non-zero outputs. For each link the length is fixed, so the joint coordinates satisfy (x₁ − x₂)² + (y₁ − y₂)² + (z₁ − z₂)² = l².

(Figure: weights of a top-level unit and of a hidden unit, shown as positive and negative weights over the coordinates of joints 4 and 5, together with the biases of the top-level units and their mean total input from the layer below.)

Superimposing constraints A unit in the second layer could represent a single constraint. But it can model the data just as well by representing a linear combination of constraints.

Dealing with missing inputs
The network learns the constraints even if 10% of the inputs are missing. First fill in the missing inputs randomly. Then use the back-propagated energy derivatives to slowly change the filled-in values until they fit in with the learned constraints. Why don't the corrupted inputs interfere with the learning of the constraints? The energy function has a small slope when a constraint is violated by a lot, so a badly violated constraint does not adapt: don't learn when things don't make sense.
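The fill-in procedure is just gradient descent in dataspace, restricted to the missing components. A sketch using a toy energy model of the same shape as before (sizes, learning rate, and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy energy model (illustrative sizes; positive scales keep the example tame).
W1 = rng.normal(0, 0.5, size=(4, 3))
w2 = np.abs(rng.normal(0, 0.5, size=3))

def energy(d):
    return w2 @ np.tanh(d @ W1)

def grad_wrt_data(d):
    h = np.tanh(d @ W1)
    return W1 @ (w2 * (1.0 - h ** 2))

d = np.array([0.2, -0.1, 0.0, 0.0])
missing = np.array([False, False, True, True])
d[missing] = rng.normal(size=missing.sum())   # first, fill in randomly
e_start = energy(d)
for _ in range(200):                          # then slowly lower the energy,
    d[missing] -= 0.05 * grad_wrt_data(d)[missing]  # moving only the filled-in values
e_end = energy(d)
```

Only the missing components move; the observed inputs stay clamped, exactly as in the slide's procedure.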

Learning constraints from natural images (Yee-Whye Teh) We used 16x16 image patches and a single layer of 768 hidden units (3 x over-complete). Confabulations are produced from data by adding random momentum once and simulating dynamics for 30 steps. Weights are updated every 100 examples. A small amount of weight decay helps.

A random subset of 768 basis functions

The distribution of all 768 learned basis functions

How to learn a topographic map
The outputs of the linear filters are squared and locally pooled. This makes it cheaper to put filters that are violated at the same time next to each other. (Figure: globally connected linear filters on the image, under locally connected pooled squared filters; the first violation in a pool is costly, a second simultaneous violation is cheaper.)
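The pooling can be sketched in a few lines. The filter count, patch size, pooling window, and the square-root pooled cost below are illustrative assumptions; the square root is one simple sublinear cost that makes a second violation inside an already-active pool cheaper than the first:

```python
import numpy as np

rng = np.random.default_rng(5)

n_filters, patch_dim = 12, 16
F = rng.normal(0, 0.1, size=(n_filters, patch_dim))  # globally connected linear filters
x = rng.normal(size=patch_dim)                        # an image patch

s = (F @ x) ** 2   # squared filter outputs
# Local pooling along the (topographically ordered) filter index:
# each pooled unit sums the squared outputs of a window of 3 neighbours.
pool = np.array([s[max(0, i - 1):i + 2].sum() for i in range(n_filters)])
# A sublinear cost of pooled violations: since sqrt(a + b) <= sqrt(a) + sqrt(b),
# a second violation landing in an active pool adds less energy than the first.
energy = np.sqrt(pool).sum()
```

Under such a cost, rearranging filters so that co-violated ones share a pool lowers the total energy, which is what produces the topographic map.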

Density models
Causal models:
- Tractable posterior (mixture models, sparse Bayes nets, factor analysis): compute the exact posterior.
- Intractable posterior (densely connected DAGs): Markov Chain Monte Carlo, or minimize the variational free energy.
Energy-Based Models:
- Stochastic hidden units: full Boltzmann Machine (full MCMC); Restricted Boltzmann Machine (minimize contrastive divergence).
- Deterministic hidden units: hybrid MCMC and minimize contrastive divergence, or fix the features as in CRFs so it is tractable.

THE END

Independence relationships of hidden variables in three types of model that have one hidden layer
- Causal model: hidden states unconditional on the data are independent (generation is easy); conditional on the data they are dependent (explaining away). We can use an almost complementary prior to reduce this dependency so that variational inference works.
- Product of experts (RBM): unconditional hidden states are dependent (rejecting away); conditional on the data they are independent (inference is easy).
- Square ICA: unconditional hidden states are independent (by definition); conditional on the data they are independent (the posterior collapses to a single point).

Faster mixing chains
Hybrid Monte Carlo can only take small steps because the energy surface is curved. With a single layer of hidden units, it is possible to use alternating parallel Gibbs sampling instead. Step 1: each student-t hidden unit picks a variance from the posterior distribution over variances given the violation produced by the current datavector; if the violation is big, it picks a big variance. This is equivalent to picking a Gaussian from an infinite mixture of Gaussians (because that's what a student-t is). With the variances fixed, each hidden unit defines a one-dimensional Gaussian in the dataspace. Step 2: pick a visible vector from the product of all the one-dimensional Gaussians.
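The infinite-mixture-of-Gaussians view of the student-t can be demonstrated directly: drawing a precision from a Gamma distribution and then sampling from the resulting Gaussian yields exactly t-distributed samples. A sketch (the degrees of freedom and sample count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)

def t_via_gaussian_mixture(nu, n):
    """Sample a student-t by first picking a variance, then a Gaussian.

    Step 1 of the Gibbs sampler picks a precision from a Gamma
    distribution (equivalently, a variance); step 2 then samples from
    the Gaussian that this variance defines.
    """
    precision = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)
    return rng.normal(0.0, 1.0 / np.sqrt(precision))

samples = t_via_gaussian_mixture(nu=3.0, n=10000)
```

The heavy tails come entirely from the occasional very small precision (large variance), which is exactly the "big violation picks a big variance" behaviour described above.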

Pros and cons of Gibbs sampling
Advantages: much faster mixing, and it can be extended to use a pooled second layer (Max Welling). Disadvantages: it can only be used in deep networks by learning the hidden layers (or pairs of layers) greedily. But maybe this is OK: it scales better than contrastive backpropagation.

Over-complete ICA using a causal model
What if we have more independent sources than data components? (Independent ≠ orthogonal.) The data no longer specifies a unique vector of source activities; it specifies a distribution over them. This also happens if we have sensor noise in the square case. The posterior over sources is non-Gaussian because the prior is non-Gaussian, so we need to approximate the posterior: MCMC samples, MAP (plus a Gaussian around the MAP?), or variational methods.

Over-complete ICA using an energy-based model
Causal over-complete models preserve the unconditional independence of the sources and abandon the conditional independence. Energy-based over-complete models preserve the conditional independence (which makes perception fast) and abandon the unconditional independence. Over-complete EBMs are easy if we use contrastive divergence to deal with the intractable partition function.