Published by Loren Joseph. Modified over 9 years ago.
Using Fast Weights to Improve Persistent Contrastive Divergence
Tijmen Tieleman, Geoffrey Hinton
Department of Computer Science, University of Toronto
ICML 2009
Presented by Jorge Silva, Department of Electrical and Computer Engineering, Duke University
2/17 Problems of interest: Density Estimation and Classification using RBMs
RBM = Restricted Boltzmann Machine: a stochastic version of a Hopfield network (i.e., a recurrent neural network); often used as an associative memory
Can also be seen as a particular case of a Deep Belief Network (DBN)
Why “restricted”? Because we restrict connectivity: no intra-layer connections
(Hinton, 2002; Smolensky, 1986)
[Figure: an RBM with a layer of visible units (the data pattern, a binary vector) and a layer of hidden units (the internal, or hidden, representations); adapted from www.iro.montreal.ca]
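3/17 Notation
Define the following energy function:
$$E(v,h) = -\sum_{i,j} v_i w_{ij} h_j - \sum_i b_i v_i - \sum_j c_j h_j$$
where v is the visible state, h is the hidden state, $v_i$ is the state of the i-th visible unit, $w_{ij}$ is the weight of the i-j connection, $h_j$ is the state of the j-th hidden unit, and $b_i$, $c_j$ are the biases
The joint probability P(v,h) and the marginal P(v) are
$$P(v,h) = \frac{e^{-E(v,h)}}{Z}, \qquad P(v) = \frac{1}{Z}\sum_h e^{-E(v,h)}, \qquad Z = \sum_{v,h} e^{-E(v,h)}$$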
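The notation can be made concrete on a model small enough to enumerate. The sketch below assumes the standard RBM energy E(v,h) = -Σ v_i w_ij h_j - Σ b_i v_i - Σ c_j h_j with binary units; the bias names `b` and `c`, the helper names, and the toy parameter values are illustrative assumptions, not taken from the slides.

```python
import itertools
import math

def energy(v, h, W, b, c):
    """E(v,h) = -sum_ij v_i W_ij h_j - sum_i b_i v_i - sum_j c_j h_j."""
    interaction = sum(v[i] * W[i][j] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    visible_bias = sum(bi * vi for bi, vi in zip(b, v))
    hidden_bias = sum(cj * hj for cj, hj in zip(c, h))
    return -interaction - visible_bias - hidden_bias

def binary_states(n):
    """All 2**n binary vectors of length n."""
    return list(itertools.product([0, 1], repeat=n))

def partition_function(W, b, c):
    """Z = sum over every joint configuration; exponential cost, toy sizes only."""
    return sum(math.exp(-energy(v, h, W, b, c))
               for v in binary_states(len(b))
               for h in binary_states(len(c)))

def marginal_v(v, W, b, c):
    """P(v) = (1/Z) * sum_h exp(-E(v,h))."""
    unnormalized = sum(math.exp(-energy(v, h, W, b, c))
                       for h in binary_states(len(c)))
    return unnormalized / partition_function(W, b, c)

# Hypothetical toy model: 3 visible units, 2 hidden units.
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
b = [0.0, 0.1, -0.1]
c = [0.2, -0.3]
total = sum(marginal_v(v, W, b, c) for v in binary_states(3))
```

Summing `marginal_v` over all visible states should give 1 up to rounding, which is a quick sanity check on the signs in the energy. The enumeration over all configurations is exactly what becomes infeasible for realistically sized RBMs.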
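4/17 Training with gradient descent
Gradient of the training data log-likelihood (using just one datum v for simplicity):
$$\frac{\partial \log P(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{P(h|v)} - \langle v_i h_j \rangle_{P(v,h)}$$
The positive gradient (first term) is easy: $\langle v_i h_j \rangle_{P(h|v)} = v_i \, P(h_j = 1 \mid v)$
But the negative gradient (second term) is intractable: it is an expectation over all joint configurations (v,h) under the model
We can’t even sample from the model exactly, so there is no direct Monte Carlo (MC) approximation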
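For a model with only a handful of units, both gradient terms can be computed exactly, which makes the positive/negative split easy to verify. The sketch below assumes the standard RBM energy with binary units and biases named `b` and `c` (assumed names); it compares the analytic gradient, positive term minus negative term, against a finite-difference estimate of d log P(v) / d w_ij. All parameter values are hypothetical.

```python
import itertools
import math

def states(n):
    return list(itertools.product([0, 1], repeat=n))

def energy(v, h, W, b, c):
    inter = sum(v[i] * W[i][j] * h[j]
                for i in range(len(v)) for j in range(len(h)))
    return -inter - sum(x * y for x, y in zip(b, v)) - sum(x * y for x, y in zip(c, h))

def log_p_v(v, W, b, c):
    """Exact log P(v) by brute-force enumeration (toy sizes only)."""
    num = sum(math.exp(-energy(v, h, W, b, c)) for h in states(len(c)))
    Z = sum(math.exp(-energy(u, h, W, b, c))
            for u in states(len(b)) for h in states(len(c)))
    return math.log(num) - math.log(Z)

def p_hj_given_v(j, v, W, c):
    """P(h_j = 1 | v) = sigmoid(c_j + sum_i v_i W_ij)."""
    x = c[j] + sum(v[i] * W[i][j] for i in range(len(v)))
    return 1.0 / (1.0 + math.exp(-x))

def exact_gradient(i, j, v, W, b, c):
    # Positive term: cheap, depends only on the datum v.
    positive = v[i] * p_hj_given_v(j, v, W, c)
    # Negative term: expectation under the model, needs every visible state.
    negative = sum(math.exp(log_p_v(u, W, b, c)) * u[i] * p_hj_given_v(j, u, W, c)
                   for u in states(len(b)))
    return positive - negative

# Hypothetical toy model and datum.
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
b = [0.0, 0.1, -0.1]
c = [0.2, -0.3]
v = (1, 0, 1)

# Central finite difference on w_01 as an independent check.
eps = 1e-6
W_plus = [row[:] for row in W]
W_minus = [row[:] for row in W]
W_plus[0][1] += eps
W_minus[0][1] -= eps
numeric = (log_p_v(v, W_plus, b, c) - log_p_v(v, W_minus, b, c)) / (2 * eps)
analytic = exact_gradient(0, 1, v, W, b, c)
```

The negative term loops over every visible state, so its cost doubles with each added visible unit; that is the intractability the slide refers to.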
5/17 Contrastive Divergence (CD)
However, we can approximately sample from the model. The existing Contrastive Divergence (CD) algorithm (Hinton, 2002) is one way to do it
CD gets the direction of the gradient approximately right, though not the magnitude
The rough idea behind CD is to:
–start a Markov chain at one of the training points used to estimate the positive gradient
–perform one full Gibbs update, i.e., sample h given v and then v given h
–treat the resulting configuration (h,v) as a sample from the model
What about “Persistent” CD?
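The three CD steps can be sketched for a single training vector as follows. This is a minimal illustration rather than the authors' code: it assumes binary units, biases named `b` and `c`, and the common practice of using hidden-unit probabilities rather than sampled states in the gradient statistics. The function names, toy data, and learning rate are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, c):
    """P(h_j = 1 | v) for every hidden unit j."""
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]

def visible_probs(h, W, b):
    """P(v_i = 1 | h) for every visible unit i."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]

def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def cd1_update(v_data, W, b, c, lr, rng):
    """One CD-1 step on one training vector; updates W, b, c in place."""
    # Step 1: start the chain at the training point (positive statistics).
    ph_data = hidden_probs(v_data, W, c)
    # Step 2: one full Gibbs update, h given v and then v given h.
    h = sample(ph_data, rng)
    v_model = sample(visible_probs(h, W, b), rng)
    # Step 3: treat the result as a model sample (negative statistics).
    ph_model = hidden_probs(v_model, W, c)
    for i in range(len(b)):
        for j in range(len(c)):
            W[i][j] += lr * (v_data[i] * ph_data[j] - v_model[i] * ph_model[j])
        b[i] += lr * (v_data[i] - v_model[i])
    for j in range(len(c)):
        c[j] += lr * (ph_data[j] - ph_model[j])
    return v_model

# Toy usage: 4 visible units, 2 hidden units, two training patterns.
rng = random.Random(0)
W = [[0.0, 0.0] for _ in range(4)]
b = [0.0] * 4
c = [0.0] * 2
data = [[1, 1, 0, 0], [0, 0, 1, 1]]
for epoch in range(50):
    for v in data:
        cd1_update(v, W, b, c, lr=0.1, rng=rng)
```

Because the chain restarts at the data on every call, the negative sample never moves far from the training point, which is one way to see why CD recovers the gradient direction only approximately.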
6/17 Persistent Contrastive Divergence (PCD)
Use a persistent Markov chain that is not reinitialized each time the parameters are changed
The learning rate should be small compared to the mixing rate of the Markov chain
Many persistent chains can be run in parallel; the corresponding (h,v) pairs are called “fantasy particles”
For a fixed amount of computation, RBMs can learn better models using PCD than using CD
Again, PCD is a previously existing algorithm (Neal, 1992; Tieleman, 2008)
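A minimal sketch of one PCD update, under the same assumptions as a textbook binary RBM (biases named `b` and `c`, hidden probabilities used in the statistics). The point to notice is the last line of `pcd_update`: the fantasy particles are advanced by one Gibbs update and carried over to the next call instead of being reset to the data. All names and values here are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, c):
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]

def visible_probs(h, W, b):
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]

def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def gibbs_step(v, W, b, c, rng):
    """One full Gibbs update: h given v, then v given h."""
    h = sample(hidden_probs(v, W, c), rng)
    return sample(visible_probs(h, W, b), rng)

def pcd_update(batch, particles, W, b, c, lr, rng):
    """One PCD step; updates W, b, c and the particle list in place."""
    nv, nh, n, m = len(b), len(c), len(batch), len(particles)
    ph_data = [hidden_probs(v, W, c) for v in batch]
    ph_fant = [hidden_probs(v, W, c) for v in particles]
    # Gradient estimate: data statistics minus fantasy-particle statistics.
    dW = [[sum(v[i] * ph[j] for v, ph in zip(batch, ph_data)) / n
           - sum(v[i] * ph[j] for v, ph in zip(particles, ph_fant)) / m
           for j in range(nh)] for i in range(nv)]
    db = [sum(v[i] for v in batch) / n - sum(v[i] for v in particles) / m
          for i in range(nv)]
    dc = [sum(ph[j] for ph in ph_data) / n - sum(ph[j] for ph in ph_fant) / m
          for j in range(nh)]
    for i in range(nv):
        for j in range(nh):
            W[i][j] += lr * dW[i][j]
        b[i] += lr * db[i]
    for j in range(nh):
        c[j] += lr * dc[j]
    # Persistence: continue each chain from its current state.
    particles[:] = [gibbs_step(v, W, b, c, rng) for v in particles]

# Toy usage with 4 visible units, 2 hidden units, 3 persistent chains.
rng = random.Random(0)
W = [[0.0, 0.0] for _ in range(4)]
b = [0.0] * 4
c = [0.0] * 2
batch = [[1, 1, 0, 0], [0, 0, 1, 1]]
particles = [sample([0.5] * 4, rng) for _ in range(3)]
for step in range(100):
    pcd_update(batch, particles, W, b, c, lr=0.05, rng=rng)
```

With a small learning rate the model changes little between calls, so the particles stay close to samples from the current model; that is the condition the slide states about learning rate versus mixing rate.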
7/17 Contributions and outline
Theoretical: show the interaction between the mixing rates and the weight updates in PCD
Practical: introduce fast weights, in addition to the regular weights. This improves the performance/speed tradeoff
Outline for the rest of the talk:
–Mixing rates vs weight updates
–Fast weights
–PCD algorithm with fast weights (FPCD)
–Experiments
8/17 Mixing rates vs weight updates
Consider M persistent chains
The states (v,h) of the chains define a distribution R consisting of M point masses
Assume M is large enough that we can ignore sampling noise
The weights are updated in the direction of the negative gradient of
$$KL(P \,\|\, Q_\theta) - KL(R \,\|\, Q_\theta)$$
P is the data distribution and $Q_\theta$ is the intractable model distribution (being approximated by R)
$\theta$ is the vector of parameters (weights)
9/17 Mixing rates vs weight updates
Terms in the objective function $KL(P \,\|\, Q_\theta) - KL(R \,\|\, Q_\theta)$:
$KL(P \,\|\, Q_\theta)$: this term is the neg. log-likelihood (minus the fixed entropy of P)
$KL(R \,\|\, Q_\theta)$: this term is being maximized w.r.t. $\theta$
The weight updates increase $KL(R \,\|\, Q_\theta)$ (which is bad), but this is compensated by an increase in the mixing rates, making $KL(R \,\|\, Q_\theta)$ decrease rapidly (which is good)
Essentially, the fantasy particles quickly “rule out” large portions of the search space where Q is negligible
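A short worked derivation may help connect this objective to the PCD update. It assumes the objective is KL(P || Q_θ) - KL(R || Q_θ), that R is held fixed while differentiating, and that θ is a single weight w_ij of an RBM with the usual bilinear energy. These are assumptions of the sketch, not statements from the slides.

```latex
\begin{align*}
KL(P \,\|\, Q_\theta)
  &= -H(P) - \mathbb{E}_{P}\big[\log Q_\theta(v)\big] \\
-\frac{\partial}{\partial \theta}
  \Big[ KL(P \,\|\, Q_\theta) - KL(R \,\|\, Q_\theta) \Big]
  &= \mathbb{E}_{P}\!\left[\frac{\partial \log Q_\theta(v)}{\partial \theta}\right]
   - \mathbb{E}_{R}\!\left[\frac{\partial \log Q_\theta(v)}{\partial \theta}\right] \\
  &= \langle v_i h_j \rangle_{P} - \langle v_i h_j \rangle_{R}
  \qquad (\theta = w_{ij})
\end{align*}
```

The first line is why the first term equals the negative log-likelihood minus the entropy of P. In the last line the intractable model expectation appears once with each sign and cancels, leaving data statistics minus fantasy-particle statistics, with h drawn from P(h | v) in both averages.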
10/17 Fast weights
In addition to the regular weights, the paper introduces fast weights
Fast weights are only used for the fantasy particles; their learning rate is larger and their weight decay is much stronger (weight decay is an L2 penalty on the weights, as in ridge regression)
The role of the fast weights is to make the (combined) energy increase faster in the vicinity of the fantasy particles, making them mix faster
This way, the fantasy particles can escape low-energy local modes; this counteracts the progressive reduction in learning rates, which is otherwise desirable as learning progresses
The learning rate of the fast weights stays constant, but the weights themselves decay fast, so their effect is temporary
(Bharath & Borkar, 1999)
11/17 PCD algorithm with fast weights (FPCD)
[Algorithm listing not recovered from the slide; one step of it is annotated “weight decay”]
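Since the algorithm listing did not survive extraction, the following is a reconstruction sketch based on the description in the surrounding slides, not the authors' listing. Assumptions are labeled in the comments: binary units; biases named `b` and `c`; fast copies kept for all parameters; negative statistics computed under the combined (regular + fast) parameters; and a placeholder decay factor `fast_decay=0.95`. The fast learning rate 1/e is the value quoted on the experiments slide.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(v, W, c):
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]

def visible_probs(h, W, b):
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]

def sample(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]

def axpy(alpha, x, beta, y):
    """Return alpha*x + beta*y for floats or equally shaped nested lists."""
    if isinstance(x, list):
        return [axpy(alpha, xi, beta, yi) for xi, yi in zip(x, y)]
    return alpha * x + beta * y

def suff_stats(vs, theta):
    """Mean of (v_i * P(h_j=1|v), v_i, P(h_j=1|v)) over the vectors in vs."""
    W, b, c = theta
    nv, nh, n = len(b), len(c), len(vs)
    sW = [[0.0] * nh for _ in range(nv)]
    sb = [0.0] * nv
    sc = [0.0] * nh
    for v in vs:
        ph = hidden_probs(v, W, c)
        for i in range(nv):
            sb[i] += v[i] / n
            for j in range(nh):
                sW[i][j] += v[i] * ph[j] / n
        for j in range(nh):
            sc[j] += ph[j] / n
    return [sW, sb, sc]

def fpcd_step(batch, particles, theta, theta_fast, lr,
              lr_fast=1.0 / math.e, fast_decay=0.95, rng=random):
    """One FPCD step; returns new (theta, theta_fast, particles)."""
    pos = suff_stats(batch, theta)                 # data term, regular weights
    combined = axpy(1.0, theta, 1.0, theta_fast)
    neg = suff_stats(particles, combined)          # assumption: combined weights
    grad = axpy(1.0, pos, -1.0, neg)
    theta = axpy(1.0, theta, lr, grad)             # regular update
    # Fast update: strong decay (the "weight decay" step) plus a large step.
    theta_fast = axpy(fast_decay, theta_fast, lr_fast, grad)
    # Fantasy particles are the only place the fast weights are used.
    W, b, c = axpy(1.0, theta, 1.0, theta_fast)
    moved = []
    for v in particles:
        h = sample(hidden_probs(v, W, c), rng)
        moved.append(sample(visible_probs(h, W, b), rng))
    return theta, theta_fast, moved

# Toy usage: 4 visible units, 2 hidden units, 3 fantasy particles.
rng = random.Random(0)
theta = [[[0.0, 0.0] for _ in range(4)], [0.0] * 4, [0.0] * 2]
theta_fast = [[[0.0, 0.0] for _ in range(4)], [0.0] * 4, [0.0] * 2]
batch = [[1, 1, 0, 0], [0, 0, 1, 1]]
particles = [sample([0.5] * 4, rng) for _ in range(3)]
for step in range(100):
    theta, theta_fast, particles = fpcd_step(
        batch, particles, theta, theta_fast, lr=0.05, rng=rng)
```

Both parameter sets receive the same gradient estimate. The fast copy takes a much larger step and then shrinks on every update, so it temporarily raises the energy around the current particles without permanently changing the learned model.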
12/17 Experiments: MNIST dataset
Small-scale task: density estimation using an RBM with 25 hidden units
Larger task: classification using an RBM with 500 hidden units
In classification RBMs, there are two types of visible units: image units and label units. The RBM learns a joint density over both types
In the plots, each point corresponds to 10 runs; in each run, the network was trained for a predetermined amount of time
Performance is measured on a held-out test set
The learning rate for the regular weights decays linearly to zero over the computation time; for the fast weights it is constant and equal to 1/e
(Hinton et al., 2006; Larochelle & Bengio, 2008)
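The two learning-rate schedules described for the MNIST experiments can be written down directly. In this sketch the function name, the initial rate, and the time budget are hypothetical; only the shape of the schedule (linear decay to zero over the allotted time, constant 1/e for the fast weights) comes from the slide.

```python
import math

def regular_learning_rate(initial_lr, elapsed, budget):
    """Linear decay from initial_lr at elapsed=0 down to zero at elapsed=budget."""
    fraction_used = min(max(elapsed / budget, 0.0), 1.0)
    return initial_lr * (1.0 - fraction_used)

# Fast weights: constant learning rate, as stated on the slide.
FAST_LEARNING_RATE = 1.0 / math.e

# Hypothetical run: initial rate 0.1, time budget of 10 units.
schedule = [regular_learning_rate(0.1, t, 10.0) for t in (0.0, 5.0, 10.0)]
```

Near the end of a run the regular learning rate approaches zero, while the fast learning rate does not shrink at all.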
13/17 Experiments: MNIST dataset (fixed RBM size)
14/17 Experiments: MNIST dataset (optimized RBM size)
FPCD: 1200 hidden units
PCD: 700 hidden units
15/17 Experiments: Micro-NORB dataset
Classification task on 96x96 images, downsampled to 32x32
MNORB dimensionality (before downsampling) is 18432, while MNIST is 784
Learning rate decays as 1/t for regular weights
(LeCun et al., 2004)
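The dimensionalities quoted here can be checked with a little arithmetic. The sketch assumes each MNORB example is a stereo pair, i.e. two images per example, which is what makes 18432 = 2 x 96 x 96. It also assumes the simplest form of a 1/t schedule, the initial rate divided by the update count. The function names and the initial rate are hypothetical.

```python
def input_dim(side, views=2):
    """Number of pixels for `views` square images of the given side length."""
    return views * side * side

def inverse_time_lr(initial_lr, t):
    """Learning rate proportional to 1/t, for update count t >= 1."""
    if t < 1:
        raise ValueError("t must be at least 1")
    return initial_lr / t

full_dim = input_dim(96)      # before downsampling
reduced_dim = input_dim(32)   # after downsampling to 32x32
mnist_dim = 28 * 28           # single 28x28 image

rates = [inverse_time_lr(0.1, t) for t in (1, 2, 10)]
```

Under these assumptions downsampling reduces the input size by a factor of 9.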
16/17 Experiments: Micro-NORB dataset
[Plot annotation: non-monotonicity indicates overfitting problems]
17/17 Conclusion
FPCD outperforms PCD, especially when the number of weight updates is small
FPCD allows more flexible learning rate schedules than PCD
Results on the MNORB data indicate that FPCD also outperforms PCD on datasets where overfitting is a concern
Logistic regression on the full 18432-dimensional MNORB dataset had 23% misclassification; the RBM with FPCD achieved 26% on the reduced dataset
Future work: run FPCD for a longer time on an established dataset