CSC321: Introduction to Neural Networks and Machine Learning Lecture 18 Learning Boltzmann Machines Geoffrey Hinton.


The goal of learning

Maximize the product of the probabilities that the Boltzmann machine assigns to the vectors in the training set.
– This is equivalent to maximizing the sum of the log probabilities of the training vectors.
– It is also equivalent to maximizing the probability that we will observe those vectors on the visible units if we take random samples after the whole network has reached thermal equilibrium with no external input.
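In symbols (this notation is not on the slide: p(v) is the equilibrium probability the machine assigns to visible vector v, and D is the training set), the two equivalent objectives are

\[ \max_{W} \prod_{\mathbf{v} \in D} p(\mathbf{v}) \;\Longleftrightarrow\; \max_{W} \sum_{\mathbf{v} \in D} \log p(\mathbf{v}) \]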

Why the learning could be difficult

[Figure: a chain of units with visible units at the two ends and hidden units in the middle, connected by weights w1, w2, w3, w4, w5.]

Consider a chain of units with visible units at the ends.
– If the training set is (1,0) and (0,1) we want the product of all the weights to be negative.
– So to know how to change w1 or w5 we must know w3.

A very surprising fact

Everything that one weight needs to know about the other weights and the data is contained in the difference of two correlations:

\[ \frac{\partial \log p(\mathbf{v})}{\partial w_{ij}} \;=\; \langle s_i s_j \rangle_{\mathbf{v}} \;-\; \langle s_i s_j \rangle_{\text{model}} \]

– The left-hand side is the derivative of the log probability of one training vector, v.
– The first term is the expected value of the product of states at thermal equilibrium when the training vector is clamped on the visible units.
– The second term is the expected value of the product of states at thermal equilibrium when nothing is clamped.

The batch learning algorithm

Positive phase
– Clamp a datavector on the visible units.
– Let the hidden units reach thermal equilibrium at a temperature of 1 (we may use annealing to speed this up).
– Sample <s_i s_j> for all pairs of units.
– Repeat for all datavectors in the training set.

Negative phase
– Do not clamp any of the units.
– Let the whole network reach thermal equilibrium at a temperature of 1 (where do we start?).
– Sample <s_i s_j> for all pairs of units.
– Repeat many times to get good estimates.

Weight updates
– Update each weight by an amount proportional to the difference in <s_i s_j> between the two phases.
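A minimal numpy sketch of one such batch update, assuming a symmetric weight matrix with zero diagonal, binary units, no biases, and a fixed number of Gibbs sweeps standing in for "reaching thermal equilibrium". The function names, the sweep counts, and the random restarts in the negative phase are my own choices, not from the slide.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweeps(states, W, clamped, n_sweeps, rng):
    # Sequential Gibbs updates at temperature 1; `clamped` marks units whose states stay fixed.
    for _ in range(n_sweeps):
        for i in range(len(states)):
            if clamped[i]:
                continue
            p_on = sigmoid(W[i] @ states)      # total input from all other units (zero diagonal)
            states[i] = 1.0 if rng.random() < p_on else 0.0
    return states

def boltzmann_batch_update(W, data, n_visible, n_hidden, lr=0.01,
                           n_equil=50, n_neg_samples=100, rng=None):
    # One batch update: lr * ( <s_i s_j>_data - <s_i s_j>_model ).
    if rng is None:
        rng = np.random.default_rng(0)
    n = n_visible + n_hidden

    # Positive phase: clamp each datavector, let the hidden units settle, then sample s_i s_j.
    pos = np.zeros((n, n))
    clamp_vis = np.array([True] * n_visible + [False] * n_hidden)
    for v in data:
        s = np.concatenate([v, rng.integers(0, 2, n_hidden).astype(float)])
        s = gibbs_sweeps(s, W, clamp_vis, n_equil, rng)
        pos += np.outer(s, s)
    pos /= len(data)

    # Negative phase: clamp nothing, let the whole network settle, average over many runs.
    neg = np.zeros((n, n))
    no_clamp = np.zeros(n, dtype=bool)
    for _ in range(n_neg_samples):
        s = rng.integers(0, 2, n).astype(float)
        s = gibbs_sweeps(s, W, no_clamp, n_equil, rng)
        neg += np.outer(s, s)
    neg /= n_neg_samples

    dW = lr * (pos - neg)
    np.fill_diagonal(dW, 0.0)                  # no self-connections
    return W + dW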

Why is the derivative so simple?

The probability of a global configuration at thermal equilibrium is an exponential function of its energy.
– So settling to equilibrium makes the log probability a linear function of the energy.
The energy is a linear function of the weights and states.
The process of settling to thermal equilibrium propagates information about the weights.
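Spelled out with the standard Boltzmann machine energy (bias terms omitted here for brevity):

\[ p(\mathbf{s}) = \frac{e^{-E(\mathbf{s})}}{\sum_{\mathbf{s}'} e^{-E(\mathbf{s}')}}, \qquad E(\mathbf{s}) = -\sum_{i<j} w_{ij}\, s_i s_j \]

so log p is linear in the energy, the energy is linear in each weight, and \(\partial E / \partial w_{ij} = -s_i s_j\). That is why only the pairwise correlations of states appear in the gradient.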

Why do we need the negative phase?

The positive phase finds hidden configurations that work well with v and lowers their energies.
The negative phase finds the joint configurations that are the best competitors and raises their energies.
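The two phases act on the two parts of the probability of a visible vector (writing u, g for the visible and hidden configurations summed over in the normalizer; this notation is an assumption, not from this slide):

\[ p(\mathbf{v}) = \frac{\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}}{\sum_{\mathbf{u}} \sum_{\mathbf{g}} e^{-E(\mathbf{u},\mathbf{g})}} \]

The positive phase lowers the energies that dominate the numerator; the negative phase raises the energies that dominate the denominator, the partition function.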

Restricted Boltzmann Machines

[Figure: a bipartite net with a layer of hidden units j above a layer of visible units i, and connections only between the two layers.]

We restrict the connectivity to make inference and learning easier.
– Only one layer of hidden units.
– No connections between hidden units.
In an RBM it only takes one step to reach thermal equilibrium when the visible units are clamped.
– So we can quickly get the exact value of <v_i h_j> with a datavector clamped on the visible units.
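A short sketch of that exact computation, assuming a binary RBM with weight matrix W and hidden biases b_hid (the function name and argument layout are mine):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def exact_data_statistics(V, W, b_hid):
    # V: (n_cases, n_visible) binary datavectors; W: (n_visible, n_hidden); b_hid: (n_hidden,)
    # With the visibles clamped, the hidden units are conditionally independent,
    # so p(h_j = 1 | v) = sigmoid(sum_i v_i w_ij + b_j) is exact and no sampling is needed.
    p_hid = sigmoid(V @ W + b_hid)       # (n_cases, n_hidden)
    return V.T @ p_hid / len(V)          # (n_visible, n_hidden) matrix of <v_i h_j>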

A picture of the Boltzmann machine learning algorithm for an RBM

[Figure: an alternating Gibbs chain over visible units i and hidden units j at t = 0, t = 1, t = 2, ..., t = infinity; the state at t = infinity is a "fantasy".]

Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.
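With the superscripts marking the time step of this alternating chain, the maximum-likelihood update contrasts the statistics at its two ends (a standard way to write the rule; the proportionality constant is the learning rate):

\[ \Delta w_{ij} \;\propto\; \langle v_i h_j \rangle^{0} - \langle v_i h_j \rangle^{\infty} \]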

A surprising short-cut

[Figure: the same chain truncated after one full update: the data at t = 0 and the reconstruction at t = 1.]

Start with a training vector on the visible units.
– Update all the hidden units in parallel.
– Update all the visible units in parallel to get a "reconstruction".
– Update the hidden units again.
This is not following the gradient of the log likelihood. But it works very well.
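A minimal numpy sketch of this one-step short-cut (contrastive divergence, CD-1) for a binary RBM. The learning rate, the use of probabilities rather than samples for the reconstruction statistics, and the bias handling are my own assumptions, not prescribed by the slide.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, b_vis, b_hid, lr=0.1, rng=None):
    # One CD-1 update on a minibatch V of shape (n_cases, n_visible).
    if rng is None:
        rng = np.random.default_rng(0)

    # t = 0: drive the hidden units from the data and sample binary hidden states.
    p_h0 = sigmoid(V @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)

    # t = 1: "reconstruction" of the visible units, then the hidden units once more.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    p_h1 = sigmoid(p_v1 @ W + b_hid)

    # Contrast the pairwise statistics at t = 0 and t = 1 (not the true gradient, but it works well).
    n = len(V)
    W = W + lr * (V.T @ p_h0 - p_v1.T @ p_h1) / n
    b_vis = b_vis + lr * (V - p_v1).mean(axis=0)
    b_hid = b_hid + lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid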

Why does the shortcut work?

If we start at the data, the Markov chain wanders away from the data and towards things that it likes more. We can see what direction it is wandering in after only a few steps. It is a big waste of time to let it go all the way to equilibrium.
– All we need to do is lower the probability of the "confabulations" it produces and raise the probability of the data. Then it will stop wandering away.
The learning cancels out once the confabulations and the data have the same distribution.
We need to worry about regions of the data-space that the model likes but which are very far from any data.
– These regions cause the normalization term to be big and we cannot sense them if we use the shortcut.