Using Fast Weights to Improve Persistent Contrastive Divergence. Tijmen Tieleman and Geoffrey Hinton, Department of Computer Science, University of Toronto. ICML 2009.




Similar presentations
Greedy Layer-Wise Training of Deep Networks

Deep Belief Nets and Restricted Boltzmann Machines
Thomas Trappenberg Autonomous Robotics: Supervised and unsupervised learning.
Deep Learning Bing-Chen Tsai 1/21.
CIAR Second Summer School Tutorial Lecture 2a Learning a Deep Belief Net Geoffrey Hinton.
Stochastic Neural Networks Deep Learning and Neural Nets Spring 2015.
CS590M 2008 Fall: Paper Presentation
Advanced topics.
Stacking RBMs and Auto-encoders for Deep Architectures References:[Bengio, 2009], [Vincent et al., 2008] 2011/03/03 강병곤.
CSCI 347 / CS 4206: Data Mining Module 07: Implementations Topic 03: Linear Models.
CS 678 – Boltzmann Machines. Boltzmann Machine: relaxation net with visible and hidden units; learning algorithm; avoids local minima (and speeds up learning)
Tuomas Sandholm Carnegie Mellon University Computer Science Department
Presented by: Mingyuan Zhou Duke University, ECE September 18, 2009
CIAR Summer School Tutorial Lecture 2b Learning a Deep Belief Net
Deep Learning.
How to do backpropagation in a brain
Deep Belief Networks for Spam Filtering
Expectation Maximization Algorithm
Arizona State University DMML Kernel Methods – Gaussian Processes Presented by Shankar Bhargav.
Restricted Boltzmann Machines and Deep Belief Networks
Computer vision: models, learning and inference Chapter 10 Graphical Models.
CSC321: Introduction to Neural Networks and Machine Learning Lecture 20 Learning features one layer at a time Geoffrey Hinton.
Submitted by: Ankit Bhutani. Supervised by: Prof. Amitabha Mukerjee, Prof. K S Venkatesh.
Deep Boltzman machines Paper by : R. Salakhutdinov, G. Hinton Presenter : Roozbeh Gholizadeh.
Learning Energy-Based Models of High-Dimensional Data Geoffrey Hinton Max Welling Yee-Whye Teh Simon Osindero
6. Experimental Analysis Visible Boltzmann machine with higher-order potentials: Conditional random field (CRF): Exponential random graph model (ERGM):
Deep Boltzmann Machines
Boltzmann Machines and their Extensions S. M. Ali Eslami Nicolas Heess John Winn March 2013 Heriot-Watt University.
CSC 2535: Computation in Neural Networks Lecture 10 Learning Deterministic Energy-Based Models Geoffrey Hinton.
CIAR Second Summer School Tutorial Lecture 1b Contrastive Divergence and Deterministic Energy-Based Models Geoffrey Hinton.
Learning Lateral Connections between Hidden Units Geoffrey Hinton University of Toronto in collaboration with Kejie Bao University of Toronto.
Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient Tijmen Tieleman University of Toronto (Training MRFs using new algorithm.
CSC321: 2011 Introduction to Neural Networks and Machine Learning Lecture 11: Bayesian learning continued Geoffrey Hinton.
Geoffrey Hinton CSC2535: 2013 Lecture 5 Deep Boltzmann Machines.
CSC321: 2011 Introduction to Neural Networks and Machine Learning Lecture 9: Ways of speeding up the learning and preventing overfitting Geoffrey Hinton.
CSC321: Neural Networks Lecture 24 Products of Experts Geoffrey Hinton.
CSC 2535 Lecture 8 Products of Experts Geoffrey Hinton.
Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient Tijmen Tieleman University of Toronto.
CSC2535 Lecture 4 Boltzmann Machines, Sigmoid Belief Nets and Gibbs sampling Geoffrey Hinton.
CSC321: Introduction to Neural Networks and Machine Learning Lecture 18 Learning Boltzmann Machines Geoffrey Hinton.
Linear Models for Classification
CSC321 Introduction to Neural Networks and Machine Learning Lecture 3: Learning in multi-layer networks Geoffrey Hinton.
CIAR Summer School Tutorial Lecture 1b Sigmoid Belief Nets Geoffrey Hinton.
How to learn a generative model of images Geoffrey Hinton Canadian Institute for Advanced Research & University of Toronto.
CSC321: Introduction to Neural Networks and Machine Learning Lecture 19: Learning Restricted Boltzmann Machines Geoffrey Hinton.
Boltzmann Machines: Stochastic Hopfield Machines. Lecture 11e.
Cognitive models for emotion recognition: Big Data and Deep Learning
Convolutional Restricted Boltzmann Machines for Feature Learning Mohammad Norouzi Advisor: Dr. Greg Mori Simon Fraser University 27 Nov
CSC321: Introduction to Neural Networks and Machine Learning Lecture 17: Boltzmann Machines as Probabilistic Models Geoffrey Hinton.
CSC2515 Fall 2008 Introduction to Machine Learning Lecture 8: Deep Belief Nets. All lecture slides will be available as .ppt, .ps, & .htm at
CSC321 Lecture 24 Using Boltzmann machines to initialize backpropagation Geoffrey Hinton.
Deep Belief Network Training: same greedy layer-wise approach. First train the lowest RBM (h0 – h1) using the RBM update algorithm (note h0 is x), then freeze its weights.
CSC Lecture 23: Sigmoid Belief Nets and the wake-sleep algorithm Geoffrey Hinton.
CSC321 Lecture 27 Using Boltzmann machines to initialize backpropagation Geoffrey Hinton.
Neural networks and support vector machines
Welcome deep loria!
Some Slides from 2007 NIPS tutorial by Prof. Geoffrey Hinton
Learning Deep Generative Models by Ruslan Salakhutdinov
Energy models and Deep Belief Networks
CSC321: Neural Networks Lecture 22 Learning features one layer at a time Geoffrey Hinton.
A Practical Guide to Training Restricted Boltzmann Machines
Restricted Boltzmann Machines for Classification
CSC321: Neural Networks Lecture 19: Boltzmann Machines as Probabilistic Models Geoffrey Hinton.
Multimodal Learning with Deep Boltzmann Machines
Deep Learning Qing LU, Siyuan CAO.
Department of Electrical and Computer Engineering
CSC2535: 2011 Lecture 5a More ways to fit energy-based models
CSC 578 Neural Networks and Deep Learning
Presentation transcript:

Using Fast Weights to Improve Persistent Contrastive Divergence. Tijmen Tieleman and Geoffrey Hinton, Department of Computer Science, University of Toronto. ICML 2009. Presented by Jorge Silva, Department of Electrical and Computer Engineering, Duke University.

2/17 Problems of interest: Density Estimation and Classification using RBMs
RBM = Restricted Boltzmann Machine: a stochastic version of a Hopfield network (i.e., a recurrent neural network); often used as an associative memory
Can also be seen as a particular case of a Deep Belief Network (DBN)
Why "restricted"? Because we restrict connectivity: no intra-layer connections (Hinton, 2002; Smolensky, 1986)
[Figure: an RBM with a layer of visible units, which hold the data pattern (a binary vector), and a layer of hidden units, which hold the internal, or hidden, representation]
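To make the bipartite structure concrete, here is a minimal numpy sketch of a binary RBM and its two conditional samplers; the class and variable names (BinaryRBM, W, a, b) are illustrative choices rather than anything prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BinaryRBM:
    """Minimal binary-binary RBM: a visible layer v and a hidden layer h,
    with no intra-layer (lateral) connections."""

    def __init__(self, n_vis, n_hid):
        self.W = 0.01 * rng.standard_normal((n_vis, n_hid))  # w_ij: weight of the i-j connection
        self.a = np.zeros(n_vis)                             # visible biases
        self.b = np.zeros(n_hid)                             # hidden biases

    def sample_h_given_v(self, v):
        p = sigmoid(v @ self.W + self.b)                     # P(h_j = 1 | v)
        return p, (rng.random(p.shape) < p).astype(float)

    def sample_v_given_h(self, h):
        p = sigmoid(h @ self.W.T + self.a)                   # P(v_i = 1 | h)
        return p, (rng.random(p.shape) < p).astype(float)
```

Because the connectivity is bipartite, each conditional factorizes over units, which is what makes the block Gibbs sampling used by the algorithms below cheap.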

3/17 Notation
Define the following energy function:
E(v, h) = - \sum_{i,j} v_i w_{ij} h_j - \sum_i a_i v_i - \sum_j b_j h_j
where v is the visible state, h is the hidden state, v_i is the state of the i-th visible unit, h_j is the state of the j-th hidden unit, w_{ij} is the weight of the i-j connection, and a_i, b_j are the biases.
The joint probability P(v,h) and the marginal P(v) are
P(v, h) = e^{-E(v, h)} / Z  and  P(v) = \sum_h e^{-E(v, h)} / Z,  where Z = \sum_{v, h} e^{-E(v, h)}.
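As a numeric companion to these formulas, a small sketch that evaluates the energy and, for tiny models only, the exact partition function by brute force (function names are illustrative):

```python
import numpy as np
from itertools import product

def energy(v, h, W, a, b):
    # E(v, h) = -sum_ij v_i w_ij h_j - sum_i a_i v_i - sum_j b_j h_j
    return -(v @ W @ h) - (a @ v) - (b @ h)

def exact_log_Z(W, a, b):
    """Brute-force log Z by enumerating every binary (v, h) pair; feasible only for tiny RBMs."""
    n_vis, n_hid = W.shape
    neg_E = np.array([
        -energy(np.array(v, dtype=float), np.array(h, dtype=float), W, a, b)
        for v in product([0, 1], repeat=n_vis)
        for h in product([0, 1], repeat=n_hid)
    ])
    m = neg_E.max()                                  # log-sum-exp for numerical stability
    return m + np.log(np.exp(neg_E - m).sum())

# log P(v, h) = -E(v, h) - log Z; for realistic sizes Z is intractable,
# which is exactly the difficulty the rest of the talk addresses.
```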

4/17 Training with gradient descent
Training data likelihood (using just one datum v for simplicity): maximize log P(v)
\partial \log P(v) / \partial w_{ij} = \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model}
The positive gradient \langle v_i h_j \rangle_{data} is easy: the hidden units are driven by the data and P(h|v) is available in closed form.
But the negative gradient \langle v_i h_j \rangle_{model} is intractable: we can't even sample exactly from the model, so there is no direct MC approximation.
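The positive statistics follow directly from the closed-form conditional; a batched sketch (illustrative names, hidden probabilities used instead of sampled states to reduce variance):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def positive_statistics(V, W, b):
    """<v_i h_j>_data for a batch of binary data vectors V (one row per datum)."""
    H_prob = sigmoid(V @ W + b)           # P(h_j = 1 | v) for each row of V
    return V.T @ H_prob / V.shape[0]      # averaged outer products, shape (n_vis, n_hid)
```

No closed-form counterpart exists for the negative statistics; the next slides are about approximating them by sampling.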

5/17 Contrastive Divergence (CD)
However, we can approximately sample from the model. The existing Contrastive Divergence (CD) algorithm is one way to do it.
CD gets the direction of the gradient approximately right, though not the magnitude.
The rough idea behind CD is to:
–start a Markov chain at one of the training points used to estimate the positive gradient
–perform one Gibbs update, i.e., obtain a new configuration (h, v)
–treat that configuration (h, v) as a sample from the model
What about "Persistent" CD? (Hinton, 2002)
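A minimal sketch of one CD-1 parameter update for a binary RBM; the learning rate and the use of hidden probabilities (rather than samples) in the statistics are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.01):
    """One CD-1 step: start the chain at a training point v0, do one Gibbs update,
    and treat the resulting configuration as the negative ("model") sample."""
    ph0 = sigmoid(v0 @ W + b)                            # positive phase: hidden driven by the data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + a)                          # one Gibbs update away from the data
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))    # <v h>_data - <v h>_reconstruction
    a += lr * (v0 - v1)
    b += lr * (ph0 - ph1)
    return W, a, b
```

Running k Gibbs updates instead of one gives CD-k; the bias of the estimate shrinks as k grows, at a proportional cost in computation.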

6/17 Persistent Contrastive Divergence (PCD)
Use a persistent Markov chain that is not reinitialized each time the parameters are changed.
The learning rate should be small compared to the mixing rate of the Markov chain.
Many persistent chains can be run in parallel; the corresponding (h, v) pairs are called "fantasy particles".
For a fixed amount of computation, RBMs learn better models with PCD than with CD.
Again, PCD is a previously existing algorithm (Neal, 1992; Tieleman, 2008).
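A sketch of a PCD training loop; the key difference from CD-1 above is that the negative-phase chains persist across parameter updates instead of restarting at the data. All hyperparameters and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def pcd_train(data, n_hid, n_chains=100, n_updates=10000, lr=0.01):
    n_vis = data.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hid))
    a, b = np.zeros(n_vis), np.zeros(n_hid)
    V = (rng.random((n_chains, n_vis)) < 0.5).astype(float)      # persistent fantasy particles

    for t in range(n_updates):
        v0 = data[rng.integers(len(data))]                       # one datum for simplicity
        ph_data = sigmoid(v0 @ W + b)                            # positive phase

        # negative phase: advance the persistent chains by one Gibbs update (no reset)
        H = (rng.random((n_chains, n_hid)) < sigmoid(V @ W + b)).astype(float)
        V = (rng.random((n_chains, n_vis)) < sigmoid(H @ W.T + a)).astype(float)
        PH = sigmoid(V @ W + b)

        W += lr * (np.outer(v0, ph_data) - (V.T @ PH) / n_chains)
        a += lr * (v0 - V.mean(axis=0))
        b += lr * (ph_data - PH.mean(axis=0))
    return W, a, b
```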

7/17 Contributions and outline
Theoretical: show the interaction between the mixing rates and the weight updates in PCD.
Practical: introduce fast weights, in addition to the regular weights. This improves the performance/speed tradeoff.
Outline for the rest of the talk:
–Mixing rates vs weight updates
–Fast weights
–PCD algorithm with fast weights (FPCD)
–Experiments

8/17 Mixing rates vs weight updates
Consider M persistent chains. The states (v, h) of the chains define a distribution R consisting of M point masses.
Assume M is large enough that we can ignore sampling noise.
The weights are updated in the direction of the negative gradient of KL(P || Q) - KL(R || Q), where P is the data distribution, Q is the intractable model distribution (being approximated by R), and \theta is the vector of parameters (weights).

9/17 Mixing rates vs weight updates
Terms in the objective function KL(P || Q) - KL(R || Q):
–KL(P || Q) is the neg. log-likelihood (minus the fixed entropy of P)
–KL(R || Q) is the term being maximized w.r.t. \theta
The weight updates increase KL(R || Q) (which is bad, since R is supposed to approximate Q), but this is compensated by an increase in the mixing rates, making KL(R || Q) decrease rapidly (which is good).
Essentially, the fantasy particles quickly "rule out" large portions of the search space where Q is negligible.
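Spelling out the decomposition (standard identities; R is held fixed during a parameter update, so its entropy and the log-partition function drop out of the gradient):

```latex
% Objective whose negative gradient the PCD updates follow (R held fixed):
\mathrm{KL}(P\,\|\,Q) - \mathrm{KL}(R\,\|\,Q)
  = \langle E \rangle_P - \langle E \rangle_R - H(P) + H(R)
% The log-partition terms cancel, and H(P), H(R) do not depend on \theta, so
-\frac{\partial}{\partial \theta}\bigl[\mathrm{KL}(P\|Q) - \mathrm{KL}(R\|Q)\bigr]
  = \Bigl\langle -\frac{\partial E}{\partial \theta} \Bigr\rangle_P
  - \Bigl\langle -\frac{\partial E}{\partial \theta} \Bigr\rangle_R .
% For an RBM weight, -\partial E / \partial w_{ij} = v_i h_j, giving the familiar update
% \Delta w_{ij} \propto \langle v_i h_j \rangle_P - \langle v_i h_j \rangle_R .
```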

10/17 Fast weights
In addition to the regular weights, the paper introduces fast weights.
Fast weights are used only for the fantasy particles; their learning rate is larger and their weight decay is much stronger (weight decay being the analogue of a ridge/L2 penalty).
The role of the fast weights is to make the (combined) energy increase faster in the vicinity of the fantasy particles, making them mix faster.
This way, the fantasy particles can escape low-energy local modes; this counteracts the progressive reduction of the regular learning rate, which is otherwise desirable as learning progresses.
The learning rate of the fast weights stays constant, but the fast weights themselves decay quickly, so their effect is temporary (Bharath & Borkar, 1999).
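The bookkeeping this implies, in a small sketch: both weight sets see the same gradient estimate, but the fast weights have their own (constant, larger) learning rate and a strong multiplicative decay. The decay factor and learning rates here are illustrative placeholders, not the paper's exact settings.

```python
import numpy as np

def weight_step(W, W_fast, pos_stats, neg_stats,
                lr_regular, lr_fast=1.0, fast_decay=0.95):
    """One illustrative update of regular and fast weights.
    pos_stats / neg_stats are <v h^T> estimates from the data and from the fantasy
    particles; the fantasy particles themselves must be sampled using W + W_fast."""
    grad = pos_stats - neg_stats
    W = W + lr_regular * grad                        # slow, persistent change to the model
    W_fast = fast_decay * W_fast + lr_fast * grad    # strong decay: the boost is temporary
    return W, W_fast
```

Only the sum W + W_fast is used when updating the fantasy particles; the regular weights alone define the model being learned and evaluated.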

11/17 PCD algorithm with fast weights (FPCD)
[Figure: pseudocode for the FPCD update, including a weight-decay step.]
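A sketch of an FPCD-style training loop consistent with the preceding description: the positive phase uses the regular weights, the fantasy particles are updated with the combined weights W + W_fast, and the fast weights receive the same gradient but decay strongly. Hyperparameters, the decay constant, and the handling of biases are illustrative simplifications rather than the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def fpcd_train(data, n_hid, n_chains=100, n_updates=10000,
               lr0=0.01, lr_fast=0.02, fast_decay=0.95):
    n_vis = data.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hid))
    W_fast = np.zeros_like(W)                                     # fast weights start at zero
    a, b = np.zeros(n_vis), np.zeros(n_hid)
    V = (rng.random((n_chains, n_vis)) < 0.5).astype(float)       # fantasy particles

    for t in range(n_updates):
        lr = lr0 * (1.0 - t / n_updates)                          # regular rate decays, fast rate stays constant
        v0 = data[rng.integers(len(data))]
        ph_data = sigmoid(v0 @ W + b)                             # positive phase: regular weights only

        Wc = W + W_fast                                           # fantasy particles see the combined weights
        H = (rng.random((n_chains, n_hid)) < sigmoid(V @ Wc + b)).astype(float)
        V = (rng.random((n_chains, n_vis)) < sigmoid(H @ Wc.T + a)).astype(float)
        PH = sigmoid(V @ Wc + b)

        grad_W = np.outer(v0, ph_data) - (V.T @ PH) / n_chains
        W += lr * grad_W
        W_fast = fast_decay * W_fast + lr_fast * grad_W           # strong decay keeps the overshoot temporary
        a += lr * (v0 - V.mean(axis=0))
        b += lr * (ph_data - PH.mean(axis=0))
    return W, a, b
```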

12/17 Experiments: MNIST dataset
Small-scale task: density estimation using an RBM with 25 hidden units.
Larger task: classification using an RBM with 500 hidden units.
In classification RBMs, there are two types of visible units: image units and label units. The RBM learns a joint density over both types.
In the plots, each point corresponds to 10 runs; in each run, the network was trained for a predetermined amount of time.
Performance is measured on a held-out test set.
The learning rate (for regular weights) decays linearly to zero over the computation time; for fast weights it is constant (= 1/e).
(Hinton et al., 2006; Larochelle & Bengio, 2008)
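The two schedules just described, as a small helper (T is the total number of updates; the 1/e constant is taken from the slide, the function names are mine):

```python
import math

def regular_lr(t, T, lr0):
    """Regular-weight learning rate: decays linearly to zero over the run."""
    return lr0 * (1.0 - t / T)

def fast_lr():
    """Fast-weight learning rate: held constant at 1/e for the whole run."""
    return 1.0 / math.e
```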

13/17 Experiments: MNIST dataset (fixed RBM size)

14/17 Experiments: MNIST dataset (optimized RBM size)
FPCD: 1200 hidden units; PCD: 700 hidden units

15/17 Experiments: Micro-NORB dataset
Classification task on 96x96 images, downsampled to 32x32.
MNORB dimensionality (before downsampling) is 18432 (stereo pairs: 2 x 96 x 96), while MNIST's is 784.
The learning rate decays as 1/t for the regular weights.
(LeCun et al., 2004)
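The 1/t schedule used here, for contrast with the linear schedule on MNIST (t0 is an illustrative offset that keeps the rate finite at t = 0):

```python
def regular_lr_norb(t, lr0, t0=1.0):
    """Regular-weight learning rate on Micro-NORB: decays as 1/t."""
    return lr0 * t0 / (t0 + t)
```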

16/17 Experiments: Micro-NORB dataset
[Figure: results plot; the non-monotonicity of the curves indicates overfitting problems.]

17/17 Conclusion
FPCD outperforms PCD, especially when the number of weight updates is small.
FPCD allows more flexible learning-rate schedules than PCD.
Results on the MNORB data also indicate that FPCD does better on datasets where overfitting is a concern.
For reference, logistic regression on the full-dimensional MNORB dataset had 23% misclassification; the RBM trained with FPCD achieved 26% on the downsampled dataset.
Future work: run FPCD for a longer time on an established dataset.