CSC321 2007 Lecture 23: Sigmoid Belief Nets and the wake-sleep algorithm Geoffrey Hinton.


CSC321 2007 Lecture 23: Sigmoid Belief Nets and the wake-sleep algorithm Geoffrey Hinton

Bayes Nets: Directed Acyclic Graphical models
The model generates data by picking states for each node using a probability distribution that depends on the values of the node's parents. The model defines a probability distribution over all the nodes. This can be used to define a distribution over the leaf nodes.
[Figure: a hidden cause node with a directed connection to a visible effect node.]

Ways to define the conditional probabilities
For nodes that have discrete values, we could use conditional probability tables: each row corresponds to one configuration of all the relevant parents, each column to one state of the node, and the probabilities p in a row sum to 1. For nodes that have real values, we could let the parents define the parameters of a Gaussian.
[Figure: a conditional probability table for a multinomial variable with N discrete states, each with its own probability, and a Gaussian variable whose mean and variance are determined by the state of the parent.]
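To make the two options concrete, here is a tiny sketch (my own illustration with made-up numbers, not code from the lecture) of sampling one child node from each kind of conditional distribution:

import numpy as np

rng = np.random.default_rng(0)

# Discrete child: a conditional probability table, one row per configuration
# of the parent; each row sums to 1 over the child's N states (here N = 3).
cpt = np.array([[0.7, 0.2, 0.1],   # parent = 0
                [0.1, 0.3, 0.6]])  # parent = 1

def sample_discrete(parent_state):
    return rng.choice(3, p=cpt[parent_state])

# Real-valued child: the parent's state selects the mean and variance of a Gaussian.
means = np.array([0.0, 2.0])
variances = np.array([1.0, 0.25])

def sample_gaussian(parent_state):
    return rng.normal(means[parent_state], np.sqrt(variances[parent_state]))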

Sigmoid belief nets
If the nodes have binary states, we could use a sigmoid to determine the probability of a node being on as a function of the states of its parents:
$p(s_i = 1) = \frac{1}{1 + \exp(-b_i - \sum_j s_j w_{ji})}$
where j indexes the parents of unit i. This uses the same type of stochastic units as Boltzmann machines, but the directed connections make it into a very different type of model.
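As an illustration, a minimal sketch (variable names and array layout are my own assumptions, not code from the course) of sampling a layer of these stochastic binary units given the states of their parents:

import numpy as np

rng = np.random.default_rng(0)

def sample_sbn_layer(parent_states, weights, biases):
    """Sample binary child states; weights[j, i] is the directed weight
    from parent unit j to child unit i."""
    p_on = 1.0 / (1.0 + np.exp(-(biases + parent_states @ weights)))
    return (rng.random(p_on.shape) < p_on).astype(float)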

What is easy and what is hard in a DAG?
It is easy to generate an unbiased example at the leaf nodes. It is typically hard to compute the posterior distribution over all possible configurations of hidden causes. It is also hard to compute the probability of an observed vector. Given samples from the posterior, it is easy to learn the conditional probabilities that define the model.
[Figure: the same hidden cause / visible effect network as before.]

Explaining away
Even if two hidden causes are independent, they can become dependent when we observe an effect that they can both influence.
– If we learn that there was an earthquake, it reduces the probability that the house jumped because of a truck.
[Figure: two hidden causes, "truck hits house" and "earthquake", both pointing to the observed effect "house jumps".]

The learning rule for sigmoid belief nets
Suppose we could "observe" the states of all the hidden units when the net was generating the observed data.
– E.g. generate randomly from the net and ignore all the times when it does not generate data in the training set.
– Keep n examples of the hidden states for each datavector in the training set.
For each node, maximize the log probability of its "observed" state given the observed states of its parents.
[Figure: a parent unit j connected to unit i by the weight w_ji.]
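For concreteness, a small sketch of that maximum-likelihood update, using the standard gradient for a logistic unit: with the parent and child states fully observed, the derivative of the log probability of child i's state with respect to w_ji is s_j (s_i − p_i), where p_i is the sigmoid output. The function name and array layout below are my own:

import numpy as np

def sbn_weight_gradient(parent_states, child_states, weights, biases):
    """Gradient of log p(child_states | parent_states) w.r.t. weights[j, i];
    equals s_j * (s_i - p_i) for each weight."""
    p = 1.0 / (1.0 + np.exp(-(biases + parent_states @ weights)))
    return np.outer(parent_states, child_states - p)

# One step of gradient ascent on the log probability:
# weights += learning_rate * sbn_weight_gradient(parents, children, weights, biases)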

An apparently crazy idea
It's hard to learn complicated models like sigmoid belief nets because it's hard to infer (or sample from) the posterior distribution over hidden configurations. Crazy idea: do the inference wrong.
– Maybe learning will still work.
– This turns out to be true for SBNs.
At each hidden layer, we assume the posterior over hidden configurations factorizes into a product of distributions for each separate hidden unit.

The wake-sleep algorithm
Wake phase: use the recognition weights to perform a bottom-up pass.
– Train the generative weights to reconstruct activities in each layer from the layer above.
Sleep phase: use the generative weights to generate samples from the model.
– Train the recognition weights to reconstruct activities in each layer from the layer below.
[Figure: a stack of layers with data at the bottom and hidden layers h1, h2, h3 above it; recognition connections go up, generative connections go down.]
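Below is a compact, illustrative sketch of one wake-sleep update for a net with two hidden layers. The layer names, the factorized prior over the top layer, the learning rate, and the omission of biases are all simplifying assumptions of mine, not details from the lecture:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

def wake_sleep_step(v, R1, R2, G1, G2, lr=0.01):
    """One wake-sleep update for layers data -> h1 -> h2.
    R1, R2 are recognition weights (bottom-up); G1, G2 are generative
    weights (top-down). Biases are omitted for brevity."""
    # Wake phase: bottom-up pass with the recognition weights, then train the
    # generative weights to reconstruct each layer from the layer above.
    h1 = sample(sigmoid(v @ R1))
    h2 = sample(sigmoid(h1 @ R2))
    G2 += lr * np.outer(h2, h1 - sigmoid(h2 @ G2))   # h2 should reconstruct h1
    G1 += lr * np.outer(h1, v - sigmoid(h1 @ G1))    # h1 should reconstruct v

    # Sleep phase: generate a fantasy top-down, then train the recognition
    # weights to reconstruct each layer from the layer below.
    h2_f = sample(0.5 * np.ones(R2.shape[1]))        # simple factorized prior (assumed)
    h1_f = sample(sigmoid(h2_f @ G2))
    v_f = sample(sigmoid(h1_f @ G1))
    R2 += lr * np.outer(h1_f, h2_f - sigmoid(h1_f @ R2))
    R1 += lr * np.outer(v_f, h1_f - sigmoid(v_f @ R1))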

The flaws in the wake-sleep algorithm
The recognition weights are trained to invert the generative model in parts of the space where there is no data.
– This is wasteful.
The recognition weights do not follow the gradient of the log probability of the data, nor the gradient of a bound on this probability.
– This leads to incorrect mode averaging.
The posterior over the top hidden layer is very far from independent because the independent prior cannot eliminate explaining-away effects.

Mode averaging
If we generate from the model, half the instances of a 1 at the data layer will be caused by a (1,0) at the hidden layer and half will be caused by a (0,1).
– So the recognition weights will learn to produce (0.5, 0.5).
– This represents a distribution that puts half its mass on the very improbable hidden configurations (0,0) and (1,1). It's much better to just pick one mode.
This is the best recognition model you can get if you assume that the posterior over hidden states factorizes.
[Figure: three histograms over the four hidden configurations, labelled "true posterior", "mode averaging", and "a better solution" (all the mass on one mode).]
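A quick check of the arithmetic behind that claim (my own worked example): with two independent Bernoulli(0.5) hidden units, the factorized recognition distribution is
$Q(s_1, s_2) = 0.5 \times 0.5 = 0.25$ for each of $(0,0),\ (0,1),\ (1,0),\ (1,1)$,
so $Q(0,0) + Q(1,1) = 0.5$: half the mass sits on configurations the true posterior makes (almost) impossible. A recognition model that commits to a single mode, e.g. $Q(1,0) = 1$, wastes none of its mass in this way.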

Why it's hard to learn sigmoid belief nets one layer at a time
To learn W, we need the posterior distribution in the first hidden layer.
Problem 1: The posterior is typically intractable because of "explaining away".
Problem 2: The posterior depends on the prior as well as the likelihood.
– So to learn W, we need to know the weights in the higher layers, even if we are only approximating the posterior. All the weights interact.
Problem 3: We need to integrate over all possible configurations of the higher variables to get the prior for the first hidden layer. Yuk!
[Figure: the data layer connected to the first layer of hidden variables by the likelihood term W, with the higher layers supplying the prior.]

Using complementary priors to eliminate explaining away
A "complementary" prior is defined as one that exactly cancels the correlations created by explaining away, so that the posterior factorizes.
– Under what conditions do complementary priors exist?
– Is there a simple way to compute the product of the likelihood term and the prior term from the data? Yes! In one kind of sigmoid belief net, we can simply use the transpose of the generative weights.
[Figure: the data layer connected to the hidden variables, with the likelihood term coming from below and the prior term coming from above.]

An example of a complementary prior
The distribution generated by this infinite DAG with replicated weights is the equilibrium distribution for a compatible pair of conditional distributions: p(v|h) and p(h|v).
– An ancestral pass of the DAG is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium.
– So this infinite DAG defines the same distribution as an RBM.
[Figure: an infinite stack of alternating layers v0, h0, v1, h1, v2, h2, etc., all with the same replicated weights.]

Inference in a DAG with replicated weights
The variables in h0 are conditionally independent given v0.
– Inference is trivial: we just multiply v0 by W transpose.
– The model above h0 implements a complementary prior, so multiplying v0 by W transpose gives the product of the likelihood term and the prior term.
Inference in the DAG is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium starting at the data.
[Figure: the same infinite stack of layers v0, h0, v1, h1, v2, h2, etc.]
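In code, that trivial inference step is just a matrix multiply followed by the sigmoid. A minimal sketch (the shape convention for W and the function name are my assumptions):

import numpy as np

def infer_h0(v0, W, hidden_bias):
    """Posterior p(h0 = 1 | v0): multiply v0 by W transpose and squash.
    Here W has shape [num_hidden, num_visible] and is the generative
    weight matrix from h0 to v0 (an assumed convention)."""
    return 1.0 / (1.0 + np.exp(-(v0 @ W.T + hidden_bias)))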

A picture of the Boltzmann machine learning algorithm for an RBM
Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.
[Figure: the alternating Gibbs chain at t = 0, t = 1, t = 2, ..., t = infinity, with visible units i and hidden units j; the state at t = infinity is a "fantasy".]
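A minimal sketch of that alternating Gibbs chain (helper names are mine and biases are omitted for brevity):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

def alternating_gibbs(v, W, steps):
    """Alternating Gibbs chain for an RBM; W[i, j] connects visible unit i
    to hidden unit j."""
    h = sample(sigmoid(v @ W))           # t = 0: hidden units driven by the data
    for _ in range(steps):               # steps -> infinity gives a "fantasy"
        v = sample(sigmoid(h @ W.T))     # update all visible units in parallel
        h = sample(sigmoid(v @ W))       # update all hidden units in parallel
    return v, h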

The learning rule for a logistic DAG is:
$\Delta w_{ji} \propto s_j (s_i - p_i)$
With replicated weights, summing this rule over every copy of the weight in the infinite net gives a telescoping series:
$s_j^0 (s_i^0 - s_i^1) + s_i^1 (s_j^0 - s_j^1) + s_j^1 (s_i^1 - s_i^2) + \cdots = s_j^0 s_i^0 - s_j^\infty s_i^\infty$
which is the Boltzmann machine learning rule.
[Figure: the infinite stack of layers v0, h0, v1, h1, v2, h2, etc., with the same weights at every level.]

Another explanation of the contrastive divergence learning procedure
Think of an RBM as an infinite sigmoid belief net with tied weights. If we start at the data, alternating Gibbs sampling computes samples from the posterior distribution in each hidden layer of the infinite net. In the deeper layers the derivatives w.r.t. the weights are very small.
– Contrastive divergence just ignores these small derivatives in the deeper layers of the infinite net.
– It's silly to compute the derivatives exactly when you know the weights are going to change a lot.
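A minimal sketch of the resulting CD-1 update, which truncates the chain after a single reconstruction instead of running it to equilibrium (helper names are mine, biases are omitted):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

def cd1_update(v0, W, lr=0.01):
    """One CD-1 weight update for an RBM with visible-to-hidden weights W."""
    h0 = sample(sigmoid(v0 @ W))         # hidden states driven by the data
    v1 = sample(sigmoid(h0 @ W.T))       # one-step reconstruction
    h1 = sigmoid(v1 @ W)                 # hidden probabilities for the reconstruction
    # truncated version of <v_i h_j>^0 - <v_i h_j>^infinity
    return W + lr * (np.outer(v0, h0) - np.outer(v1, h1))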

The up-down algorithm: a contrastive divergence version of wake-sleep
Replace the top layer of the DAG by an RBM.
– This eliminates the bad approximations caused by top-level units that are independent in the prior.
– It is nice to have an associative memory at the top.
Replace the ancestral pass in the sleep phase by a top-down pass starting with the state of the RBM produced by the wake phase.
– This makes sure the recognition weights are trained in the vicinity of the data.
– It also reduces mode averaging: if the recognition weights prefer one mode, they will stick with that mode even if the generative weights would be just as happy to generate the data from some other mode.