1
Products of Experts
Products of Experts, ICANN'99, Geoffrey E. Hinton, 1999.
Training Products of Experts by Minimizing Contrastive Divergence, GCNU TR, Geoffrey E. Hinton, 2000.
Rate-coded Restricted Boltzmann Machines for Face Recognition, NIPS'2000, Yee Whye Teh and Geoffrey E. Hinton, 2000.
Recognizing Hand-written Digits Using Hierarchical Products of Experts, NIPS'2000, Guy Mayraz and Geoffrey E. Hinton, 2000.
2
Introduction (1) Model combining by mixtures
Goodness: It is easy to fit mixtures of tractable models to data using EM or gradient ascent. If the individual models differ a lot, the mixture is likely to be a better fit to the true distribution of the data. If sufficiently many models are included in the mixture, complicated smooth distributions can be approximated arbitrarily accurately.
Weakness: Mixtures are very inefficient in high-dimensional spaces. The posterior distribution cannot be sharper than the individual models in the mixture, so the individual models must be broadly tuned to allow them to cover the full-dimensional space.
3
Introduction (2) Products of Experts (PoE)
Multiply individual distributions and renormalize. Can produce much sharper distributions than the individual expert models. Each expert model can constrain a different subset of the dimensions in a high-dimensional space and their product will constrain all of the dimensions.
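A PoE over data vectors d combines n experts multiplicatively; the definition from the paper (up to notation: p_m is expert m with parameters θ_m, and c ranges over all possible data vectors) is:

```latex
p(\mathbf{d} \mid \theta_1, \dots, \theta_n)
  \;=\; \frac{\prod_m p_m(\mathbf{d} \mid \theta_m)}
             {\sum_{\mathbf{c}} \prod_m p_m(\mathbf{c} \mid \theta_m)}
```

The denominator is the renormalization term, and it is this sum over all possible data vectors that makes exact maximum-likelihood learning difficult.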
4
Learning by Maximizing Likelihood (1)
Fitting the data: An individual expert must give high probability to the observed data and must waste as little probability as possible on the rest of the data space. A product of experts, however, can fit the data well even if each expert wastes a lot of its probability on inappropriate regions of the data space, provided different experts waste probability in different regions.
5
Learning by Maximizing Likelihood (2)
Fitting the data: The difficulty in estimating the derivative of the log likelihood under the PoE is generating correctly distributed fantasy data. Rejection sampling can be used for discrete data: each expert generates a data vector independently and this process is repeated until all the experts happen to agree, but this is typically very inefficient. (Note: try deriving Equation (2); see the sketch below.)
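Up to notation, the derivative referred to as Equation (2) is the gradient of the log likelihood of one data vector d with respect to the parameters of expert m:

```latex
\frac{\partial \log p(\mathbf{d} \mid \theta_1, \dots, \theta_n)}{\partial \theta_m}
  \;=\; \frac{\partial \log p_m(\mathbf{d} \mid \theta_m)}{\partial \theta_m}
  \;-\; \sum_{\mathbf{c}} p(\mathbf{c} \mid \theta_1, \dots, \theta_n)\,
        \frac{\partial \log p_m(\mathbf{c} \mid \theta_m)}{\partial \theta_m}
```

The second term is an expectation under the PoE's own distribution over all possible data vectors c, which is why correctly distributed fantasy data is needed to estimate it.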
6
Learning by Maximizing Likelihood (3)
MCMC using Gibbs sampling: Alternate between parallel updates of the hidden and visible variables. Due to the product formulation, the hidden states of all the experts can always be updated in parallel, and if the hidden and visible variables form a bipartite graph, all of the components of the data vector can also be updated in parallel. However, samples from the equilibrium distribution generally have very high variance, and this variance swamps the derivative. Moreover, since the variance in the samples depends on the parameters of the model, the parameters tend to be repelled from regions of high variance even if the gradient is zero. (Note: I do not fully understand the second and third points…)
7
Learning by Contrastive Divergence (1)
An equivalence: Maximizing the log likelihood of the data is equivalent to minimizing the KL divergence between the data distribution, Q^0, and the equilibrium distribution over the visible variables, Q^∞, that the model produces by prolonged Gibbs sampling; Q^0 || Q^∞ is just another way of writing the expected negative log likelihood of the data under the model, up to a constant (the entropy of the data distribution), as shown below.
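Written out (notation as in the paper: Q^0 is the data distribution over visible vectors d, and Q^∞ is the model's equilibrium distribution), the divergence is:

```latex
Q^0 \,\|\, Q^\infty
  \;=\; \sum_{\mathbf{d}} Q^0(\mathbf{d}) \log Q^0(\mathbf{d})
  \;-\; \sum_{\mathbf{d}} Q^0(\mathbf{d}) \log Q^\infty(\mathbf{d})
```

The first term is the negative entropy of the data and does not depend on the model parameters, so minimizing the divergence is the same as maximizing the expected log likelihood.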
8
Learning by Contrastive Divergence (2)
An alternative using contrastive divergence: Minimize the difference between Q^0 || Q^∞ and Q^1 || Q^∞, where Q^1 is the distribution over "one-step" reconstructions of the data vectors, generated by one full step of Gibbs sampling. Simply run the Markov chain for one full step and update the parameters to reduce the tendency of the chain to wander away from the initial distribution on the first step. The contrastive divergence can never be negative: Q^1 is one step closer to the equilibrium distribution than Q^0, so it is guaranteed that Q^0 || Q^∞ ≥ Q^1 || Q^∞. For Markov chains in which all transitions have non-zero probability, Q^1 = Q^0 implies Q^0 = Q^∞, so the contrastive divergence can only be zero if the model is perfect.
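In symbols, the objective being minimized (a sketch following the paper's notation) is:

```latex
\mathrm{CD} \;=\; \bigl(Q^0 \,\|\, Q^\infty\bigr) \;-\; \bigl(Q^1 \,\|\, Q^\infty\bigr) \;\ge\; 0
```

One full Gibbs step can only move a distribution closer to (or keep it as far from) equilibrium, so the second divergence never exceeds the first; CD is zero exactly when Q^1 = Q^0.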
9
Learning by Contrastive Divergence (3)
Contrastive divergence: Differentiating the contrastive divergence with respect to the parameters gives three terms. The third term, which comes from the effect of the parameters on Q^1, can safely be ignored because it is small and it seldom opposes the resultant of the other two terms; the resulting update rule is sketched below.
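Dropping that third term gives the approximate gradient actually used for learning (a sketch in the paper's notation; the angle brackets denote expectations under the indicated distribution):

```latex
\Delta\theta_m \;\propto\;
  \Bigl\langle \frac{\partial \log p_m(\mathbf{d}\mid\theta_m)}{\partial \theta_m} \Bigr\rangle_{Q^0}
  \;-\;
  \Bigl\langle \frac{\partial \log p_m(\hat{\mathbf{d}}\mid\theta_m)}{\partial \theta_m} \Bigr\rangle_{Q^1}
```

Both expectations are easy to estimate: the first from the training data, the second from the one-step reconstructions.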
10
Learning by Contrastive Divergence (4)
Unbiased sampling of a reconstruction from Q^1: Pick a data vector, d, from the distribution of the data, Q^0. Compute, for each expert separately, the posterior probability distribution over its latent variables given the data vector, d. Pick a value for each latent variable from its posterior distribution. Given the chosen values of all the latent variables, compute the conditional distribution over all the visible variables by multiplying together the conditional distributions specified by each expert. Pick a value for each visible variable from this conditional distribution. These values constitute the reconstructed data vector, d̂ (see the sketch below).
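A minimal Python sketch of this reconstruction procedure for the RBM special case discussed later; the function and variable names, the layer sizes, and the random initialization are illustrative, not taken from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_reconstruction(v0, W, b_hid, b_vis, rng):
    """One-step reconstruction for an RBM viewed as a PoE.

    v0    : binary data vector d (length n_visible)
    W     : weight matrix (n_visible x n_hidden)
    b_hid : hidden biases, b_vis : visible biases
    Returns a reconstructed vector drawn from Q^1.
    """
    # Posterior over each expert's (hidden unit's) latent variable given the data.
    p_h = sigmoid(v0 @ W + b_hid)
    h = (rng.random(p_h.shape) < p_h).astype(float)   # sample the latent values

    # Product of the experts' conditional distributions over the visibles:
    # for an RBM this just adds log odds, giving another logistic unit per pixel.
    p_v = sigmoid(h @ W.T + b_vis)
    v1 = (rng.random(p_v.shape) < p_v).astype(float)  # sample the reconstruction
    return v1

# Hypothetical usage with random parameters:
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((256, 500))
b_hid, b_vis = np.zeros(500), np.zeros(256)
v0 = (rng.random(256) < 0.5).astype(float)
v1 = cd1_reconstruction(v0, W, b_hid, b_vis, rng)
```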
11
A Simple Example (1) PoE's work very well on data distributions that can be factorized into a product of lower-dimensional distributions. Example:
12
A Simple Example (2) 15 "unigauss" experts, each of which is a mixture of a uniform distribution and a single axis-aligned Gaussian. In the fitted model, each tight data cluster is represented by the intersection of two Gaussians which are elongated along different axes. Using a conservative learning rate, the fitting required 2,000 updates of the parameters. For each update of the parameters: Given a data vector d, calculate the posterior probability of selecting the Gaussian rather than the uniform in each expert and compute the data-driven term of the update. For each expert, stochastically select the Gaussian or the uniform according to that posterior. Compute the normalized product of the selected Gaussians (see the formula below) and sample from it to obtain a reconstruction d̂. Compute the reconstruction-driven term of the update using d̂.
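The normalized product of the selected axis-aligned Gaussians is itself a Gaussian whose precision is the sum of the individual precisions; per dimension (a standard identity, not spelled out on the slide):

```latex
\frac{1}{\sigma^2} \;=\; \sum_m \frac{1}{\sigma_m^2},
\qquad
\mu \;=\; \sigma^2 \sum_m \frac{\mu_m}{\sigma_m^2}
```

Sampling the reconstruction therefore only requires drawing from this single Gaussian.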
13
Initializing the Experts
Separate initialization of the experts: Force the experts to differ by giving them different or differently weighted training cases, by training them on different subsets of the data dimensions, or by using different model classes for the different experts. According to the simulations, however, such a PoE is far more likely to become trapped in poor local optima. Better solutions are obtained by simply initializing the experts randomly with very vague distributions.
14
PoE’s and Boltzmann Machines (1)
Restricted Boltzmann machine (RBM): one visible layer, one hidden layer, and no intralayer connections (Smolensky, 1986). The probability of generating a visible vector is proportional to the product of the probabilities that the visible vector would be generated by each of the hidden units acting alone (Freund and Haussler, 1992), so an RBM can be considered a PoE with one expert per hidden unit. When a hidden unit is on, its weights specify the log odds that each visible unit is on; multiplying together the distributions over the visible states specified by different experts is achieved by simply adding the log odds. Exact inference is tractable in an RBM because the states of the hidden units are conditionally independent given the data.
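Concretely, the conditional distributions factorize over units (the standard RBM formulas; w_ij are the weights, a_i and b_j the visible and hidden biases):

```latex
p(h_j = 1 \mid \mathbf{v}) = \sigma\Bigl(b_j + \sum_i v_i w_{ij}\Bigr),
\qquad
p(v_i = 1 \mid \mathbf{h}) = \sigma\Bigl(a_i + \sum_j h_j w_{ij}\Bigr),
\qquad
\sigma(x) = \frac{1}{1 + e^{-x}}
```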
15
PoE’s and Boltzmann Machines (2)
Model Fitting
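For the RBM, the contrastive divergence weight update takes a particularly simple form (the rule from the paper, up to notation; s_i and s_j are the binary states of visible unit i and hidden unit j):

```latex
\Delta w_{ij} \;\propto\; \langle s_i s_j \rangle_{Q^0} \;-\; \langle s_i s_j \rangle_{Q^1}
```

The first correlation is measured with the data clamped on the visible units, the second after a one-step reconstruction.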
16
Learning the Features of Handwritten Digits (1)
Setting. Model: an RBM with 500 hidden units and 256 visible units. Data: 8,000 16×16 real-valued images of handwritten digits from all 10 classes on the USPS Cedar ROM; the pixel intensities were normalized to lie between 0 and 1, and a modification of the learning algorithm was used to handle these real-valued intensities. Learning procedure: took two days in Matlab on a 500 MHz workstation to perform 658 epochs. In each epoch, the weights were updated 80 times using mini-batches of size 100 that contained 10 exemplars of each digit class. Momentum was used.
17
Learning the Features of Handwritten Digits (2)
18
Discrimination of Handwritten Digits (1)
Model: a restricted Boltzmann machine with 784 visible units for each digit class, plus a logistic classification network for classification. Data set: the MNIST database; 5,400 training images for each digit and 10,000 test images in total; of the training images, 4,400 per class were used for the PoE and 1,000 per class for the logistic classification network; 28 × 28 pixel images.
19
Discrimination of Handwritten Digits (2)
PoE fitting: The 4,400 examples were divided into 44 mini-batches. One epoch of learning consists of one pass through all 44 mini-batches in a fixed order, with the weights being updated after each mini-batch. Momentum and weight decay are used. A three-layer hierarchy of hidden features is built in each digit model: after training a one-hidden-layer PoE on a set of images, another PoE is trained that takes the first PoE's hidden-layer activities as its visible data, and the same procedure is repeated for a third layer (see the sketch below).
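A schematic Python sketch of this greedy layer-wise procedure; train_rbm is a placeholder for the CD-1 training of a single layer, and all names and layer sizes here are illustrative rather than taken from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden):
    """Placeholder: fit one RBM/PoE layer with CD-1 and return its weights and hidden biases."""
    n_visible = data.shape[1]
    W = 0.01 * np.random.randn(n_visible, n_hidden)
    b_hid = np.zeros(n_hidden)
    # ... CD-1 updates over mini-batches, with momentum and weight decay, go here ...
    return W, b_hid

def train_hierarchy(images, layer_sizes=(500, 500, 500)):
    """Greedily train a stack of PoE layers, each modelling the activities of the layer below."""
    layers, data = [], images
    for n_hidden in layer_sizes:
        W, b_hid = train_rbm(data, n_hidden)
        layers.append((W, b_hid))
        # The hidden-unit activation probabilities become the "visible" data
        # for the next layer in the hierarchy.
        data = sigmoid(data @ W + b_hid)
    return layers
```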
20
Discrimination of Handwritten Digits (3)
21
Discrimination of Handwritten Digits (4)
22
Discrimination of Handwritten Digits (5)
23
Discrimination of Handwritten Digits (6)
The discriminative power of the added hidden layers.
24
Discrimination of Handwritten Digits (7)
Dealing with multiple classes: After training each PoE, the logistic classification network is trained using 10,000 validation examples (1,000 examples for each digit class). Input: the unnormalized log probabilities (goodness scores) given by the PoE's. With three hidden layers per digit model, 30 input units (3 for each digit class) are needed; each input is the goodness score of one hidden layer in one digit model. Output: 10 output units, one per digit class (see the sketch below).
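A minimal sketch of such a classifier in Python: a plain softmax regression over the 30 goodness scores, trained by gradient descent on the cross-entropy loss. The names, learning rate, and training loop are illustrative assumptions, not details from the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_logistic_classifier(scores, labels, n_classes=10, lr=0.1, n_steps=1000):
    """scores: (n_examples, 30) unnormalized log-probability scores from the PoE's.
    labels: (n_examples,) integer digit labels in 0..9."""
    n, d = scores.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(n_steps):
        probs = softmax(scores @ W + b)           # predicted class probabilities
        grad_W = scores.T @ (probs - onehot) / n  # gradient of the cross-entropy loss
        W -= lr * grad_W
        b -= lr * (probs - onehot).mean(axis=0)
    return W, b
```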
25
Discrimination of Handwritten Digits (8)
Results
26
Discrimination of Handwritten Digits (9)
27
How Good Is the Approximation? (1)
Approximation in updating weights: ignore the term that comes from the change in the distribution Q^1. Simulation: an RBM is used; "A" denotes the true log-likelihood gradient and "B" the contrastive divergence update. For an individual weight, "B" occasionally differs in sign from "A". When averaged over the training data, however, the vector of parameter updates given by "B" is almost certain to have a positive cosine with the true gradient defined by "A".
28
How Good Is the Approximation? (2)
29
How Good Is the Approximation? (3)
30
Other Types of Expert (1)
Beyond binary nodes: replicas for image data. Replicate each visible unit so that a pixel corresponds to a whole set of binary visible units that all have identical weights to the hidden units. The number of active units can then approximate a real-valued intensity. During reconstruction, the number of active units is binomially distributed; if the number of replicas is m and the probability of a unit being active is p_i, the distribution can be approximated by just adding a little Gaussian noise to m·p_i and rounding (see the sketch below). The same trick can be applied to hidden nodes. Example: face recognition (Yee Whye Teh, NIPS 2000).
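A small Python sketch of this approximation; the noise standard deviation is chosen here to match the binomial's, which is one natural choice rather than something specified on the slide:

```python
import numpy as np

def rate_coded_sample(p, m, rng):
    """Approximate a Binomial(m, p_i) count of active replicas for each pixel.

    p : array of activation probabilities p_i (one entry per pixel)
    m : number of binary replicas of each visible unit
    Returns integer counts in [0, m] approximating the binomial draw.
    """
    mean = m * p
    std = np.sqrt(m * p * (1.0 - p))                   # binomial standard deviation (an assumption)
    noisy = mean + std * rng.standard_normal(p.shape)  # add a little Gaussian noise
    return np.clip(np.rint(noisy), 0, m)               # round to an integer count

# Hypothetical usage:
rng = np.random.default_rng(0)
counts = rate_coded_sample(np.array([0.1, 0.5, 0.9]), m=10, rng=rng)
```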
31
Other Types of Expert (2)
"Unifac" experts: each expert consists of a mixture of a uniform distribution and a factor analyser with just one factor. It has a binary latent variable that specifies whether to use the uniform or the factor analyser, and a real-valued latent variable that specifies the value of the factor. Using complex experts: an alternative to using a large number of relatively simple experts is to make each expert as complicated as possible, provided the ability to compute the exact derivative of the log likelihood of the data with respect to the parameters of an expert is retained.
32
Discussion (1) Logarithmic opinion pools
The geometric mean of a set of probability distributions has the property that its KL divergence from the true distribution, P, is no larger than the average of the KL divergences of the individual distributions, Q_i. When all of the individual models are identical, Z = 1; otherwise Z < 1, and the difference between the two quantities is -log Z. The benefit of combining experts comes from the fact that they make Z small (log Z very negative) by disagreeing on unobserved data, as the identity below makes explicit.
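Writing the geometric mean of n distributions as R = (1/Z) ∏_i Q_i^{1/n}, a short derivation (stated here up to notation) gives:

```latex
\mathrm{KL}\bigl(P \,\|\, R\bigr)
  \;=\; \frac{1}{n}\sum_i \mathrm{KL}\bigl(P \,\|\, Q_i\bigr) + \log Z,
\qquad
Z = \sum_{\mathbf{x}} \prod_i Q_i(\mathbf{x})^{1/n} \;\le\; 1
```

Since log Z ≤ 0, the geometric mean's divergence from P is at most the average of the individual divergences, with equality only when all the Q_i agree.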
33
Discussion (2) Comparison with DAG models
Inference: In a PoE, inference is trivial. The experts are individually tractable, and the product formulation ensures that the hidden states of different experts are conditionally independent given the data. DAG models, in contrast, suffer from the "explaining away" phenomenon; if the model is densely connected, exact inference is intractable, so an approximation method is needed, such as mean-field theory for sigmoid belief networks or Helmholtz machines.
Data generation: A PoE needs an iterative procedure to generate fantasy data, while a DAG model can do it trivially in one ancestral pass. When learning, however, the difficulty of generating samples from the model is not a major problem.
34
Discussion (3) Dependence between latent variables
For generative models that work by first choosing the latent values and then generating a data vector: if the model has a single hidden layer and the latent variables have independent prior distributions, there is a strong tendency for the posterior values of the latent variables to be approximately marginally independent after the model has been fitted to the data. With PoE's, even though the experts have independent priors, the latent variables in different experts will be marginally dependent, so after the first hidden layer has been learned greedily there may still be lots of statistical structure in the latent variables for the second hidden layer to capture.