Products of Experts
Products of Experts, ICANN'99, Geoffrey E. Hinton, 1999.
Training Products of Experts by Minimizing Contrastive Divergence, GCNU TR 2000-004, Geoffrey E. Hinton, 2000.
Rate-coded Restricted Boltzmann Machines for Face Recognition, NIPS 2000, Yee Whye Teh and Geoffrey E. Hinton, 2000.
Recognizing Hand-written Digits Using Hierarchical Products of Experts, NIPS 2000, Guy Mayraz and Geoffrey E. Hinton, 2000.

Introduction (1) Combining models by forming mixtures.
Strengths: It is easy to fit mixtures of tractable models to data using EM or gradient ascent. If the individual models differ a lot, the mixture is likely to be a better fit to the true distribution of the data. If sufficiently many models are included in the mixture, complicated smooth distributions can be approximated arbitrarily accurately.
Weaknesses: Mixtures are very inefficient in high-dimensional spaces. The posterior distribution cannot be sharper than the individual models in the mixture. The individual models must be broadly tuned so that they can cover the full high-dimensional space.

Introduction (2) Products of Experts (PoE) Multiply individual distributions and renormalize. Can produce much sharper distributions than the individual expert models. Each expert model can constrain a different subset of the dimensions in a high-dimensional space and their product will constrain all of the dimensions.
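Written out, the combination rule used in the referenced papers is

$$p(\mathbf{d} \mid \theta_1, \ldots, \theta_n) \;=\; \frac{\prod_m p_m(\mathbf{d} \mid \theta_m)}{\sum_{\mathbf{c}} \prod_m p_m(\mathbf{c} \mid \theta_m)}$$

where $\mathbf{d}$ is a data vector in a discrete space, $\theta_m$ are the parameters of expert $m$, $p_m(\mathbf{d} \mid \theta_m)$ is the (possibly unnormalized) probability of $\mathbf{d}$ under expert $m$, and the denominator sums over all possible vectors $\mathbf{c}$ in the data space.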

Learning by Maximizing Likelihood (1) Fitting the data under a product of experts.
An individual expert must give high probability to the observed data and must waste as little probability as possible on the rest of the data space.
A PoE can fit the data well even if each expert wastes a lot of its probability on inappropriate regions of the data space, provided different experts waste probability in different regions.
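Differentiating the log of the PoE definition above gives the gradient whose estimation is discussed on the next slide; the normalizing term turns into an expectation under the model's own distribution:

$$\frac{\partial \log p(\mathbf{d}\mid\theta_1,\ldots,\theta_n)}{\partial \theta_m} \;=\; \frac{\partial \log p_m(\mathbf{d}\mid\theta_m)}{\partial \theta_m} \;-\; \sum_{\mathbf{c}} p(\mathbf{c}\mid\theta_1,\ldots,\theta_n)\,\frac{\partial \log p_m(\mathbf{c}\mid\theta_m)}{\partial \theta_m}$$

The second term is an average over "fantasy" data distributed according to the PoE itself, which is what makes the gradient hard to estimate.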

Learning by Maximizing Likelihood (2) Fitting the data. The difficulty in estimating this derivative under the PoE is generating correctly distributed fantasy data.
Rejection sampling can be used for discrete data: each expert generates a data vector independently, and this process is repeated until all the experts happen to agree. This is typically very inefficient.
(Presenter's note: try deriving Equation (2).)
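A minimal sketch of the rejection-sampling idea, assuming a hypothetical list of expert objects that each expose a `sample(rng)` method returning a discrete data vector (since all experts must independently produce the same vector, an accepted draw has probability proportional to the product of the experts' probabilities):

```python
import numpy as np

def rejection_sample_poe(experts, max_tries=1_000_000, rng=None):
    """Draw one fantasy vector from a PoE over discrete data by rejection:
    every expert samples independently, and a draw is accepted only when
    all experts happen to produce exactly the same vector."""
    rng = rng or np.random.default_rng()
    for _ in range(max_tries):
        proposals = [expert.sample(rng) for expert in experts]
        first = proposals[0]
        if all(np.array_equal(first, p) for p in proposals[1:]):
            return first  # all experts agree: accepted sample
    raise RuntimeError("No agreement found; rejection sampling is typically very inefficient.")
```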

Learning by Maximizing Likelihood (3) MCMC using Gibbs sampling.
Alternate between parallel updates of the hidden and visible variables. Due to the product formulation, the hidden states of all the experts can always be updated in parallel. If the hidden and visible variables form a bipartite graph, it is also possible to update all of the components of the data vector in parallel.
The samples from the equilibrium distribution generally have very high variance, and this variance swamps the derivative. Since the variance in the samples depends on the parameters of the model, the parameters tend to be repelled from regions of high variance even if the gradient is zero.
(Presenter's note: I do not fully understand the second and third points.)
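A sketch of the alternating parallel updates for the bipartite case (a restricted Boltzmann machine, which appears later in the talk); the weight matrix W and the biases a, b are assumed to be NumPy arrays:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def block_gibbs(v, W, a, b, n_steps, rng=None):
    """Alternate parallel updates of binary hidden and visible units.
    v: (n_visible,) binary vector, W: (n_visible, n_hidden),
    a: visible biases, b: hidden biases."""
    rng = rng or np.random.default_rng()
    for _ in range(n_steps):
        # All hidden units are conditionally independent given v.
        h = (rng.random(W.shape[1]) < sigmoid(b + v @ W)).astype(float)
        # All visible units are conditionally independent given h.
        v = (rng.random(W.shape[0]) < sigmoid(a + W @ h)).astype(float)
    return v
```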

Learning by Contrastive Divergence (1) An equivalence.
Maximizing the log likelihood of the data is equivalent to minimizing the KL divergence between the data distribution, $Q^0$, and the equilibrium distribution over the visible variables, $Q^\infty$, produced by prolonged Gibbs sampling from the model. $KL(Q^0 \,\|\, Q^\infty)$ is just another way of writing the expected negative log likelihood of the data, plus the entropy of $Q^0$, which does not depend on the parameters.
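Writing this out (a short reconstruction of the argument, with $\mathbf{d}$ ranging over visible vectors):

$$KL(Q^0 \,\|\, Q^\infty) \;=\; \sum_{\mathbf{d}} Q^0(\mathbf{d}) \log Q^0(\mathbf{d}) \;-\; \sum_{\mathbf{d}} Q^0(\mathbf{d}) \log Q^\infty(\mathbf{d})$$

The first term is the negative entropy of the data distribution and does not depend on the model parameters; the second term is the expected negative log likelihood, so minimizing the divergence and maximizing the likelihood are the same problem.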

Learning by Contrastive Divergence (2) An alternative objective: contrastive divergence.
Minimize the difference between $KL(Q^0 \,\|\, Q^\infty)$ and $KL(Q^1 \,\|\, Q^\infty)$, where $Q^1$ is the distribution over "one-step" reconstructions of the data vectors, generated by one full step of Gibbs sampling.
Simply run the Markov chain for one full step and update the parameters to reduce the tendency of the chain to wander away from the initial distribution on the first step.
The contrastive divergence can never be negative: $Q^1$ is one step closer to the equilibrium distribution than $Q^0$, so it is guaranteed that $KL(Q^1 \,\|\, Q^\infty) \le KL(Q^0 \,\|\, Q^\infty)$.
For Markov chains in which all transitions have non-zero probability, $Q^0 = Q^1$ implies $Q^0 = Q^\infty$, so the contrastive divergence can only be zero if the model is perfect.

Learning by Contrastive Divergence (3) The contrastive divergence gradient (reconstructed below).
The third term, which comes from the effect of the parameters on $Q^1$, can safely be ignored because it is small and it seldom opposes the resultant of the other two terms.
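A reconstruction of the gradient this slide refers to, following the notation of the contrastive divergence report: the objective is $CD = KL(Q^0\|Q^\infty) - KL(Q^1\|Q^\infty)$, and its exact gradient for expert $m$ has three terms,

$$-\frac{\partial}{\partial \theta_m}\bigl(Q^0\|Q^\infty - Q^1\|Q^\infty\bigr) \;=\; \Bigl\langle \frac{\partial \log p_m(\mathbf{d}\mid\theta_m)}{\partial \theta_m} \Bigr\rangle_{Q^0} \;-\; \Bigl\langle \frac{\partial \log p_m(\hat{\mathbf{d}}\mid\theta_m)}{\partial \theta_m} \Bigr\rangle_{Q^1} \;+\; \frac{\partial Q^1}{\partial \theta_m}\,\frac{\partial\,(Q^1\|Q^\infty)}{\partial Q^1}$$

The first two terms are cheap expectations over the data and over its one-step reconstructions; the third is the term that is ignored, giving the approximate update

$$\Delta \theta_m \;\propto\; \Bigl\langle \frac{\partial \log p_m(\mathbf{d}\mid\theta_m)}{\partial \theta_m} \Bigr\rangle_{Q^0} \;-\; \Bigl\langle \frac{\partial \log p_m(\hat{\mathbf{d}}\mid\theta_m)}{\partial \theta_m} \Bigr\rangle_{Q^1}$$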

Learning by Contrastive Divergence (4) Unbiased sampling of a reconstruction $\hat{\mathbf{d}}$ from $Q^1$.
1. Pick a data vector, $\mathbf{d}$, from the data distribution $Q^0$.
2. Compute, for each expert separately, the posterior probability distribution over its latent variables given the data vector $\mathbf{d}$.
3. Pick a value for each latent variable from its posterior distribution.
4. Given the chosen values of all the latent variables, compute the conditional distribution over all the visible variables by multiplying together the conditional distributions specified by each expert.
5. Pick a value for each visible variable from this conditional distribution. These values constitute the reconstructed data vector, $\hat{\mathbf{d}}$.
(A code sketch of these steps follows.)
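A minimal sketch of this procedure for binary visible units, assuming a hypothetical expert interface with `sample_latents_given_visible(v, rng)` and `visible_log_odds(latents)` methods; for binary units, multiplying the experts' conditional distributions reduces to adding their log odds:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def one_step_reconstruction(v_data, experts, rng=None):
    """One-step reconstruction d_hat ~ Q^1 for a PoE over binary visible units."""
    rng = rng or np.random.default_rng()
    # Steps 2-3: sample each expert's latent variables from its own posterior.
    latents = [e.sample_latents_given_visible(v_data, rng) for e in experts]
    # Step 4: multiply the experts' conditional distributions over the visibles,
    # which for binary units amounts to summing their log odds.
    total_log_odds = sum(e.visible_log_odds(z) for e, z in zip(experts, latents))
    # Step 5: sample each visible variable independently from the product.
    return (rng.random(v_data.shape) < sigmoid(total_log_odds)).astype(float)
```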

A Simple Example (1) PoE's work very well on data distributions that can be factorized into a product of lower-dimensional distributions. Example.

A Simple Example (2) 15 "unigauss" experts, each of which is a mixture of a uniform distribution and a single axis-aligned Gaussian.
In the fitted model, each tight data cluster is represented by the intersection of two Gaussians that are elongated along different axes.
Using a conservative learning rate, the fitting required 2,000 updates of the parameters. Each update of the parameters works as follows (the product in step 3 is written out below):
1. Given a data vector $\mathbf{d}$, calculate the posterior probability of selecting the Gaussian rather than the uniform in each expert, and compute $\partial \log p_m(\mathbf{d}\mid\theta_m)/\partial\theta_m$.
2. For each expert, stochastically select the Gaussian or the uniform according to this posterior.
3. Compute the normalized product of the selected Gaussians and sample from it to obtain a reconstruction $\hat{\mathbf{d}}$.
4. Compute $\partial \log p_m(\hat{\mathbf{d}}\mid\theta_m)/\partial\theta_m$ for each expert and apply the contrastive divergence update.
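For step 3, the product of the selected axis-aligned Gaussians is itself an axis-aligned Gaussian whose precisions add; a standard identity (stated here as a reminder, not taken from the slides) gives, per dimension $i$,

$$\frac{1}{\sigma_i^2} \;=\; \sum_{m \in S} \frac{1}{\sigma_{m,i}^2}, \qquad \mu_i \;=\; \sigma_i^2 \sum_{m \in S} \frac{\mu_{m,i}}{\sigma_{m,i}^2}$$

where $S$ is the set of experts that selected their Gaussian and $(\mu_{m,i}, \sigma_{m,i}^2)$ are the mean and variance of expert $m$ in dimension $i$ (experts that selected the uniform contribute only a constant factor). Sampling the reconstruction then requires only independent Gaussian draws in each dimension.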

Initializing the Experts Separate initialization of the experts: force the experts to differ by giving them different or differently weighted training cases, by training them on different subsets of the data dimensions, or by using different model classes for the different experts.
According to the simulations, however, a PoE initialized this way is far more likely to become trapped in poor local optima. Better solutions are obtained by simply initializing the experts randomly with very vague distributions.

PoE's and Boltzmann Machines (1) Restricted Boltzmann machine (RBM): one visible layer, one hidden layer, and no intralayer connections (Smolensky, 1986).
The probability of generating a visible vector is proportional to the product of the probabilities that the visible vector would be generated by each of the hidden units acting alone (Freund and Haussler, 1992), so an RBM can be considered a PoE with one expert per hidden unit.
When a hidden unit is on, its weights specify how much is added to the log odds of each visible unit being on, so multiplying together the distributions over the visible states specified by different experts is achieved by simply adding log odds.
Exact inference is tractable in an RBM because the states of the hidden units are conditionally independent given the data.
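As a reminder (these conditionals are not written out on the slide), for visible units $v_i$, hidden units $h_j$, weights $w_{ij}$, and biases $a_i, b_j$:

$$p(h_j = 1 \mid \mathbf{v}) = \sigma\Bigl(b_j + \sum_i v_i w_{ij}\Bigr), \qquad p(v_i = 1 \mid \mathbf{h}) = \sigma\Bigl(a_i + \sum_j h_j w_{ij}\Bigr), \qquad \sigma(x) = \frac{1}{1+e^{-x}}$$

The second expression shows the log-odds addition: each active hidden unit $j$ contributes $w_{ij}$ to the log odds of visible unit $i$ being on.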

PoE’s and Boltzmann Machines (2) Model Fitting
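The fitting rule presumably shown on this slide is the contrastive divergence (CD-1) update for an RBM; a sketch under that assumption, using the conditionals above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_gradients(v0, W, a, b, rng=None):
    """One contrastive-divergence step for a binary RBM.
    v0: (batch, n_visible) data, W: (n_visible, n_hidden), a, b: biases.
    Returns approximate gradients (dW, da, db) for gradient ascent."""
    rng = rng or np.random.default_rng()
    # Positive phase: hidden probabilities and a sample given the data.
    ph0 = sigmoid(b + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One full Gibbs step: reconstruct the visibles, then recompute hidden probs.
    pv1 = sigmoid(a + h0 @ W.T)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(b + v1 @ W)
    # <v h>_data - <v h>_reconstruction, averaged over the mini-batch.
    n = v0.shape[0]
    dW = (v0.T @ ph0 - v1.T @ ph1) / n
    da = (v0 - v1).mean(axis=0)
    db = (ph0 - ph1).mean(axis=0)
    return dW, da, db
```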

Learning the Features of Handwritten Digits (1) Setting.
Model: an RBM with 500 hidden units and 256 visible units.
Data: 8,000 16 × 16 real-valued images of handwritten digits from all 10 classes on the USPS Cedar ROM. The pixel intensities were normalized to lie between 0 and 1.
The learning algorithm was modified slightly to handle these real-valued intensities.
Learning procedure: took two days in Matlab on a 500 MHz workstation to perform 658 epochs. In each epoch, the weights were updated 80 times using mini-batches of size 100 that contained 10 exemplars of each digit class. Momentum was used (a sketch of such a loop follows).
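A sketch of the kind of mini-batch loop described here, reusing `cd1_gradients` from the previous sketch; the learning rate, momentum value, and weight initialization are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def train_rbm(data, n_hidden=500, n_epochs=658, batch_size=100,
              lr=0.001, momentum=0.9, rng=None):
    """Mini-batch CD-1 training with momentum for a binary RBM."""
    rng = rng or np.random.default_rng()
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    a = np.zeros(n_visible)
    b = np.zeros(n_hidden)
    vW, va, vb = np.zeros_like(W), np.zeros_like(a), np.zeros_like(b)
    for epoch in range(n_epochs):
        for batch in np.array_split(rng.permutation(data), len(data) // batch_size):
            dW, da, db = cd1_gradients(batch, W, a, b, rng)
            # Momentum: blend the previous update direction with the new gradient.
            vW = momentum * vW + lr * dW
            va = momentum * va + lr * da
            vb = momentum * vb + lr * db
            W += vW; a += va; b += vb
    return W, a, b
```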

Learning the Features of Handwritten Digits (2)

Discrimination of Handwritten Digits (1) Model: a restricted Boltzmann machine with 784 visible units for each digit class, plus a logistic classification network for the final classification.
Data set: the MNIST database. 5,400 training images for each digit and 10,000 test images in total; 4,400 examples per digit for the PoE and 1,000 examples per digit for the logistic classification network. 24 × 24 pixel images.

Discrimination of Handwritten Digits (2) PoE fitting.
The 4,400 examples were divided into 44 mini-batches. One epoch of learning consisted of a pass through all 44 mini-batches in a fixed order, with the weights being updated after each mini-batch. Momentum and weight decay were used.
A three-layer hierarchy of hidden features was learned for each digit model: after training a one-hidden-layer PoE on a set of images, a second PoE was trained using the first PoE's hidden-layer activities as data, and the same procedure was repeated for a third layer (see the sketch below).
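A sketch of this greedy layer-wise scheme, reusing `train_rbm` and `sigmoid` from the earlier sketches; passing hidden probabilities (rather than stochastic binary states) upward is one common choice, made here as an assumption:

```python
def train_stack(data, layer_sizes=(500, 500, 500), **kwargs):
    """Greedily train a hierarchy of PoEs (RBMs): each layer is trained on
    the hidden activities produced by the layer below."""
    layers, layer_input = [], data
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(layer_input, n_hidden=n_hidden, **kwargs)
        layers.append((W, a, b))
        # Use the trained layer's hidden-unit probabilities as the next layer's data.
        layer_input = sigmoid(b + layer_input @ W)
    return layers
```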

Discrimination of Handwritten Digits (3)

Discrimination of Handwritten Digits (4)

Discrimination of Handwritten Digits (5)

Discrimination of Handwritten Digits (6) The discriminative power of the added hidden layers.

Discrimination of Handwritten Digits (7) Dealing with multiple classes.
After training each PoE, the logistic classification network is trained using 10,000 validation examples (1,000 examples for each digit class).
Input: the unnormalized log probabilities (goodness scores) given by the PoE's. With a three-hidden-layer hierarchy, 30 input units are needed (3 per digit class); each input is the goodness score of one hidden layer of one digit model.
Output: 10 output units, one per digit class. (A sketch of such a classifier follows.)
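A minimal sketch of one way to realize such a classifier: a single softmax layer mapping the 30 goodness scores to 10 class probabilities. The weights `W`, biases `c`, and the computation of the goodness scores themselves are assumptions, not taken from the paper:

```python
import numpy as np

def classify(goodness_scores, W, c):
    """goodness_scores: (batch, 30) unnormalized log probabilities, three per
    digit model; W: (30, 10) weights; c: (10,) biases. Returns class probabilities."""
    logits = goodness_scores @ W + c
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)   # softmax over the 10 digit classes
```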

Discrimination of Handwritten Digits (8) Results

Discrimination of Handwritten Digits (9)

How Good Is the Approximation? (1) The approximation in updating the weights is to ignore the term that comes from the change in the distribution $Q^1$.
Simulation: an RBM is used, with "A" denoting the true log-likelihood gradient for a weight and "B" the contrastive divergence approximation. For an individual weight, "B" occasionally differs in sign from "A". When averaged over the training data, however, the vector of parameter updates given by "B" is almost certain to have a positive cosine with the true gradient defined by "A".

How good is the approximation? (2)

How good is the approximation? (3)

Other Types of Expert (1) Beyond binary nodes: replicated units for image data.
Replicate each visible unit so that a pixel corresponds to a whole set of binary visible units that all have identical weights to the hidden units. The number of active units can then approximate a real-valued intensity, and during reconstruction the number of active units is binomially distributed. If the number of replicas is $m$ and the probability of a unit being active is $p_i$, this binomial can be approximated by just adding a little Gaussian noise to $m p_i$ and rounding (a sketch follows). The same trick can be applied to hidden nodes.
Example: face recognition (Yee Whye Teh, NIPS 2000).
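A sketch of the rate-coded approximation; matching the binomial's variance $m p_i (1 - p_i)$ for the added Gaussian noise is a natural choice, assumed here since the slide does not specify it:

```python
import numpy as np

def rate_coded_sample(p, m, rng=None):
    """Approximate a Binomial(m, p) count of active replicas per pixel by
    adding Gaussian noise to m*p and rounding, instead of drawing m coins.
    p: array of activation probabilities, m: number of replicas per unit."""
    rng = rng or np.random.default_rng()
    noisy = m * p + rng.standard_normal(p.shape) * np.sqrt(m * p * (1.0 - p))
    return np.clip(np.round(noisy), 0, m)
```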

Other Types of Expert (2) "Unifac" experts: each expert consists of a mixture of a uniform distribution and a factor analyser with just one factor. It has a binary latent variable that specifies whether to use the uniform or the factor analyser, and a real-valued latent variable that specifies the value of the factor.
Using complex experts: an alternative to using a large number of relatively simple experts is to make each expert as complicated as possible. The experts must retain the ability to compute the exact derivative of the log likelihood of the data w.r.t. their parameters.

Discussion (1) Logarithmic opinion pools.
The geometric mean of a set of probability distributions has the property that its KL divergence from the true distribution, $P$, is smaller than the average of the KL divergences of the individual distributions, $Q_m$ (see the derivation below).
When all of the individual models are identical, $Z = 1$; otherwise $Z < 1$, and the gap between the average divergence and the divergence of the pool is $-\log Z$.
The benefit of combining experts comes from the fact that they make $Z$ small by disagreeing on unobserved data, so that $-\log Z$ is large.
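A short reconstruction of this identity, writing $Q_{GM}$ for the normalized geometric mean of $n$ distributions $Q_m$:

$$Q_{GM}(\mathbf{d}) = \frac{1}{Z}\prod_m Q_m(\mathbf{d})^{1/n}, \qquad Z = \sum_{\mathbf{d}} \prod_m Q_m(\mathbf{d})^{1/n} \le 1$$

$$KL(P \,\|\, Q_{GM}) \;=\; \frac{1}{n}\sum_m KL(P \,\|\, Q_m) + \log Z \;\le\; \frac{1}{n}\sum_m KL(P \,\|\, Q_m)$$

The bound $Z \le 1$ holds because the geometric mean of the values $Q_m(\mathbf{d})$ is at most their arithmetic mean at every $\mathbf{d}$, and the arithmetic means sum to 1 over the data space.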

Discussion (2) Comparison with DAG models.
Inference: in a PoE, inference is trivial. The experts are individually tractable, and the product formulation ensures that the hidden states of different experts are conditionally independent given the data. DAG models suffer from the "explaining away" phenomenon; if a DAG model is densely connected, exact inference is intractable and an approximation method is needed (mean-field methods for sigmoid belief networks, Helmholtz machines).
Data generation: a PoE needs an iterative procedure to generate fantasy data, whereas a DAG model can generate data trivially in one ancestral pass. During learning, however, the difficulty of generating samples from the model is not a major problem.

Discussion (3) Dependence between latent variables.
For generative models that work by first choosing values for the latent variables and then generating a data vector: if the model has a single hidden layer and the latent variables have independent prior distributions, there is a strong tendency for the posterior values of the latent variables to be approximately marginally independent after the model has been fitted to the data.
With PoE's, even though the experts have independent priors, the latent variables of different experts will be marginally dependent. So after the first hidden layer has been learned greedily, there may still be plenty of statistical structure in the latent variables for the second hidden layer to capture.