Deep Belief Nets and Restricted Boltzmann Machines

Deep Belief Nets and Restricted Boltzmann Machines
Erte Pan, Wireless Eng. Group
Advisor: Dr. Han
Department of Electrical and Computer Engineering, University of Houston, Houston, TX

Graphical Models
(Figures: a directed and an undirected graphical model, each with a hidden layer j and a visible layer i.)
Generative model: a graphical model captures the causal process by which the observed data were generated, so it is also called a generative model.
Undirected graphical model: links have no directional significance.
- inference (inferring the states of the unobserved variables) is easy
- learning (adjusting the weights between variables so that the network is more likely to generate the observed data) and generation are tricky
Directed graphical model: links have a particular directionality indicated by arrows.
- inference is difficult
- learning and generation are simple

Boltzmann Machine Model
- one visible (input) layer and one hidden layer
- typically binary states for every unit
- stochastic (vs. deterministic)
- recurrent (vs. feed-forward)
- generative (vs. discriminative): estimates the distribution of the observations (say p(image)), whereas traditional discriminative networks only estimate the labels (say p(label|image))
- defines an energy of the network and the probability of a unit's state (the scalar T is referred to as the "temperature")
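
The energy and unit-state probability referred to above were not reproduced in the transcript; a standard reconstruction (with weights w_ij, biases \theta_i, and binary states s_i) is:

E(\mathbf{s}) = -\sum_{i<j} w_{ij}\, s_i s_j - \sum_i \theta_i s_i,
\qquad
p(s_i = 1) = \frac{1}{1 + e^{-\Delta E_i / T}},

where \Delta E_i = \sum_j w_{ij} s_j + \theta_i is the energy gap between unit i being off and on; as T \to 0 the unit becomes deterministic.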

Restricted Boltzmann Machine Model
- a bipartite graph: no intra-layer connections, feed-forward
- the RBM has no temperature factor T; the rest is the same as the BM
- one important feature of the RBM is that the visible units and the hidden units are conditionally independent given the other layer, which will lead to a beautiful result later on
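
Written out, the conditional independence just mentioned is the standard factorization (a reconstruction, not the slide's own equation):

P(\mathbf{h} \mid \mathbf{v}) = \prod_j P(h_j \mid \mathbf{v}),
\qquad
P(\mathbf{v} \mid \mathbf{h}) = \prod_i P(v_i \mid \mathbf{h}).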

Stochastic Search
Why the BM? Traditional networks and the BM use different optimization criteria:
- Traditional: an error criterion. Backpropagation strictly follows the gradient-descent direction; any move that enlarges the error is NOT acceptable, so it is easy to get stuck in local minima.
- BM: associates the network with an "energy". Simulated annealing allows the energy to increase with a certain probability, which helps escape local minima.

Simulated Annealing
Simulated annealing for the BM:
1. Create an initial solution S (the global state of the network); initialize the temperature T >> 1.
2. Repeat until T reaches its lower bound:
   - Repeat until thermal equilibrium is reached at the current T:
     - generate a random transition from S to S'
     - let ΔE = E(S') − E(S)
     - if ΔE < 0 then S = S'
     - else if exp[−ΔE/T] > rand(0,1) then S = S'
   - Reduce the temperature T according to the cooling schedule.
3. Return S.
The exp[−ΔE/T] term allows "thermal disturbance", which facilitates finding the global minimum.
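
A minimal Python sketch of this accept/reject loop; energy(), random_transition(), and the geometric cooling schedule are generic placeholders, not from the slides:

```python
import math
import random

def simulated_annealing(initial_state, energy, random_transition,
                        t_start=10.0, t_min=0.01, cooling=0.95, steps_per_t=100):
    """Generic simulated annealing: accept uphill moves with probability exp(-dE/T)."""
    state = initial_state
    t = t_start
    while t > t_min:
        for _ in range(steps_per_t):              # crude stand-in for "thermal equilibrium at T"
            candidate = random_transition(state)
            d_e = energy(candidate) - energy(state)
            if d_e < 0 or math.exp(-d_e / t) > random.random():
                state = candidate                 # accept the (possibly uphill) move
        t *= cooling                              # geometric cooling schedule
    return state
```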

Restricted Boltzmann Machine
Two ingredients define a Restricted Boltzmann Machine:
- the states of all the units: obtained through a probability distribution
- the weights of the network: obtained through training (Contrastive Divergence)
As mentioned before, the objective of the RBM is to estimate the distribution of the input data, and this goal is fully determined by the weights, given the input.
- the energy defined for the RBM
- the distribution of the visible layer of the RBM (a Boltzmann distribution), where Z is the partition function, defined as the sum of e^{-E(v,h)} over all possible configurations of {v, h}
- the probability that unit i is on (binary state 1), where σ(·) is the logistic/sigmoid function
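
The equations themselves did not survive the transcript; standard reconstructions (with visible biases a_i and hidden biases b_j) are:

E(\mathbf{v},\mathbf{h}) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j,

P(\mathbf{v}) = \frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})},
\qquad
Z = \sum_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})},

P(h_j = 1 \mid \mathbf{v}) = \sigma\!\Big(b_j + \sum_i v_i w_{ij}\Big),
\qquad
P(v_i = 1 \mid \mathbf{h}) = \sigma\!\Big(a_i + \sum_j w_{ij} h_j\Big),
\qquad
\sigma(x) = \frac{1}{1+e^{-x}}.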

Restricted Boltzmann Machine
Training the RBM: maximum-likelihood learning.
- the probability of a vector x under the parameters W (the weights)
- given i.i.d. samples, the objective is to maximize the average log-likelihood, where <·>_0 denotes an average w.r.t. the data distribution
- the gradient is then computed as a difference of expectations, where <·>_∞ denotes an average w.r.t. the model distribution
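
A reconstruction of the quantities referred to above, in the usual energy-based-model form (not the slides' exact notation; for the RBM, E(x;W) is the free energy obtained by summing out h):

p(\mathbf{x}; W) = \frac{e^{-E(\mathbf{x}; W)}}{Z(W)},
\qquad
L(W) = \big\langle \log p(\mathbf{x}; W) \big\rangle_0,
\qquad
\frac{\partial L}{\partial w_{ij}} = \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_\infty,

where the last expression is the gradient for the RBM energy given earlier.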

Restricted Boltzmann Machine
The update of the weights W can then be computed from the two expectation terms:
- the <·>_0 term can be computed using the input samples
- the <·>_∞ term can be estimated by MCMC, but this is very slow and suffers from a large variance in the estimated gradient
Solution: Contrastive Divergence
- maximizing the log probability of the data is the same as minimizing the KL divergence between the data distribution and the model distribution
- define CD_n by running the Markov chain for only a small number n of steps
- use CD_n multiplied by a learning rate as the update of the weights
Note: this update direction is NOT the gradient of ANY function, yet it is successful in applications.
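
A reconstruction of the CD objective and the practical update (following Hinton, 2002; not the slide's exact notation):

CD_n = KL(p_0 \,\|\, p_\infty) - KL(p_n \,\|\, p_\infty),
\qquad
\Delta w_{ij} = \varepsilon \big( \langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_n \big),

where p_n is the distribution obtained after n steps of Gibbs sampling started from the data, and ε is the learning rate.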

Restricted Boltzmann Machine
In the RBM, the previous equations become learning rules for a particular weight between two units; these are obtained by substituting the energy function into the learning rule.
Summarized algorithm for training an RBM (see the numpy sketch below):
- take a training sample v, compute the probabilities of the hidden units, and sample a hidden activation vector h from this probability distribution. Compute the expectation of vh and call it the positive gradient (clamped phase, or positive phase).
- from h, sample a reconstruction v' of the visible units, then resample the hidden activations h' from it. Compute the expectation of v'h' and call it the negative gradient (free phase, or negative phase).
- let the weight update ΔW_ij be the positive gradient minus the negative gradient, times some learning rate.
Simulated annealing was abandoned, perhaps because:
1. heavy computation
2. p(s, W) can be optimized well merely w.r.t. W
3. there should be some relation between Gibbs sampling and simulated annealing
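
A minimal numpy sketch of the CD-1 step just summarized; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid, lr=0.01, rng=np.random.default_rng()):
    """One CD-1 weight update for a binary RBM. v0: (batch, n_vis), W: (n_vis, n_hid)."""
    # positive phase: sample h0 from p(h|v0)
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(v0.dtype)
    # negative phase: reconstruct v1 from h0, then recompute hidden probabilities
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(v0.dtype)
    p_h1 = sigmoid(v1 @ W + b_hid)
    # positive gradient minus negative gradient, averaged over the batch
    batch = v0.shape[0]
    dW = (v0.T @ p_h0 - v1.T @ p_h1) / batch
    db_vis = (v0 - v1).mean(axis=0)
    db_hid = (p_h0 - p_h1).mean(axis=0)
    return W + lr * dW, b_vis + lr * db_vis, b_hid + lr * db_hid
```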

General Deep Belief Nets
Problem with DBNs: since DBNs are directed graphical models, the posterior of the hidden units given the input data is intractable due to the "explaining away" effect.
Solution: complementary priors, which ensure that the posterior over the hidden units satisfies the independence constraints.
Explaining-away example (a sigmoid belief net with two hidden causes, "truck hits house" and "earthquake", and one observed effect, "house jumps"; each cause connects to the effect with a weight of 20, the effect has a bias of −20, and each cause has a bias of −10): a bias of −10 on the earthquake node means that, in the absence of observations, this node is exp{10} times more likely to be off than on.
Posterior over the two causes given that the house jumped:
p(1,1) = .0001, p(1,0) = .4999, p(0,1) = .4999, p(0,0) = .0001
Since p(1,1) is NOT equal to p(1,·) · p(·,1), the posterior is NOT independent.
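
A quick check with the numbers in the table: the marginal posteriors are p(truck = 1) = .0001 + .4999 = .5 and p(quake = 1) = .5, so if the posterior factorized we would have p(1,1) = .5 × .5 = .25 rather than the actual .0001; observing one cause "explains away" the other.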

Explaining Away Effect
(Figure: an infinite directed net with tied weights and alternating layers v0, h0, v1, h1, v2, h2, etc.)
Brief summary of the explaining-away effect: given the observations, the posteriors of the associated hidden variables are actually NOT independent (the probability that one hidden variable is on or off influences the states of the others), even though the hidden variables are assumed independent in their prior. The reason is the non-independence in the likelihood term:
posterior (non-indep.) = prior (indep.) × likelihood (non-indep.)
Eliminating explaining away with complementary priors: add extra hidden layers to create a prior that has exactly the opposite correlations to those in the likelihood term, so that when the likelihood is multiplied by the prior, the posterior becomes factorial.

Complementary Priors
Definition of complementary priors: consider observations x and hidden variables y. For a given likelihood function P(x|y), the prior over y, P(y), is called the complementary prior of P(x|y) provided that P(x,y) = P(x|y)P(y) leads to a posterior P(y|x) that exactly factorizes.
Infinite directed model with tied weights, complementary priors and Gibbs sampling:
- recall that the RBM's conditionals over v and h both factorize
- the definition of the RBM's energy function makes it a model with two sets of conditional independencies (complementary priors for both v and h)
- since we need to estimate the distribution of the data, P(v), we can perform Gibbs sampling alternately from P(v,h) for infinitely many steps (see the sketch below)
- this procedure is analogous to unrolling the single RBM into an infinite directed stack of RBMs with tied weights (due to the complementary priors), where each RBM takes its input from the hidden layer of the RBM below
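
A minimal sketch of this alternating Gibbs chain in numpy, with the same shapes and illustrative names as the CD-1 sketch above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_chain(v, W, b_vis, b_hid, n_steps=1000, rng=np.random.default_rng()):
    """Alternate h ~ p(h|v) and v ~ p(v|h); as n_steps grows, v approaches a sample from P(v)."""
    for _ in range(n_steps):
        p_h = sigmoid(v @ W + b_hid)
        h = (rng.random(p_h.shape) < p_h).astype(v.dtype)
        p_v = sigmoid(h @ W.T + b_vis)
        v = (rng.random(p_v.shape) < p_v).astype(v.dtype)
    return v
```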

DBNs based on RBMs
(Figure: a DBN with layers data, h1, h2, h3; the top pair of layers forms an RBM.)
DBNs based on stacks of RBMs: the top two hidden layers form an undirected associative memory (which can be regarded as a shorthand for an infinite stack), and the remaining hidden layers form a directed acyclic graph.
The red arrows in the figure are NOT part of the generative model; they are only used for inference.

Training Deep Belief Nets
The previous discussion gives the intuition for training a stack of RBMs one layer at a time; Hinton proved that this greedy learning algorithm is efficient.
First, learn all the weights tied together, i.e. learn the whole stack as a single RBM.

Training Deep Belief Nets
Then freeze the bottom layer and relearn all the other layers as a single RBM. Then freeze the bottom two layers and relearn all the other layers as a single RBM, and so on (see the sketch below).
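
A sketch of this greedy layer-wise procedure, reusing the hypothetical sigmoid and cd1_update helpers from the CD-1 sketch above; the whole-batch "epochs" and layer sizes are placeholders:

```python
import numpy as np

def train_dbn_greedy(data, layer_sizes, epochs=10, lr=0.01, rng=np.random.default_rng()):
    """Greedy layer-wise pre-training: train each RBM on the sampled hidden
    activities of the frozen RBM below it (uses sigmoid/cd1_update from above)."""
    weights = []
    v = data
    n_vis = data.shape[1]
    for n_hid in layer_sizes:
        W = 0.01 * rng.standard_normal((n_vis, n_hid))
        b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
        for _ in range(epochs):                   # whole data set as one batch, for brevity
            W, b_vis, b_hid = cd1_update(v, W, b_vis, b_hid, lr=lr, rng=rng)
        weights.append((W, b_vis, b_hid))
        # freeze this layer and propagate the data upward through it
        p_h = sigmoid(v @ W + b_hid)
        v = (rng.random(p_h.shape) < p_h).astype(data.dtype)
        n_vis = n_hid
    return weights
```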

Fine-tuning Deep Belief Nets
Each time we learn a new layer, the inference at the lower layers becomes incorrect, but the variational bound on the log probability of the data improves, as proved by Hinton. Because the inference at the lower layers becomes incorrect, Hinton uses a fine-tuning procedure to adjust the weights, called the wake-sleep algorithm.
Wake-sleep algorithm:
- wake phase: do a bottom-up pass, sampling h with the recognition weights based on the input v for each RBM, and then adjust the generative weights by the RBM learning rule.
- sleep phase: do a top-down pass, starting from a random state of h at the top layer and generating v; then the recognition weights are modified.
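
A much-simplified numpy sketch of one wake-sleep update for a single layer pair, loosely along the lines of the wake-sleep delta rules; the random top-level state, the omitted biases, and all names are simplifying assumptions, not the slides' procedure (sigmoid as defined above):

```python
import numpy as np

def wake_sleep_step(v_data, W_rec, W_gen, lr=0.01, rng=np.random.default_rng()):
    """One simplified wake-sleep update for a single layer pair
    (biases omitted; the top-level prior is replaced by a random h)."""
    # wake phase: recognize h from real data, then improve the generative
    # weights so that h reconstructs the data better
    p_h = sigmoid(v_data @ W_rec)
    h = (rng.random(p_h.shape) < p_h).astype(v_data.dtype)
    v_recon = sigmoid(h @ W_gen)
    W_gen = W_gen + lr * h.T @ (v_data - v_recon)
    # sleep phase: "dream" v from a random h with the generative weights, then
    # improve the recognition weights so they recover the h that produced the dream
    h_dream = (rng.random(h.shape) < 0.5).astype(v_data.dtype)
    p_v_dream = sigmoid(h_dream @ W_gen)
    v_dream = (rng.random(p_v_dream.shape) < p_v_dream).astype(v_data.dtype)
    p_h_rec = sigmoid(v_dream @ W_rec)
    W_rec = W_rec + lr * v_dream.T @ (h_dream - p_h_rec)
    return W_rec, W_gen
```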

Deep Belief Nets
Analogies for the wake-sleep algorithm:
- wake phase: if reality differs from what is imagined, modify the generative weights to make what is imagined as close as possible to reality.
- sleep phase: if the illusions produced by the concepts learned during the wake phase differ from those concepts, modify the recognition weights to make the illusions as close as possible to the concepts.
Open questions on DBNs:
- training vector vs. training set (patch training)
- how to perform unsupervised classification?

Performance of DBNs
A: 2-D coded representation of the handwritten-digit database MNIST produced by PCA.
B: 2-D coded representation of MNIST produced by DBNs.
Results produced by Hinton et al.

Performance of DBNs
A: 2-D coded representation of a document-retrieval dataset produced by LSA.
B: 2-D coded representation of the same data produced by DBNs.
Results produced by Hinton et al.

Convolutional DBNs
Limitations of DBNs:
- unable to process high-dimensional data directly (DBNs flatten 2-D images into vectors before feeding them into the network, so some spatial information is lost)
- even with vectors as input, DBNs cannot be scaled up properly to real image sizes; they are only suitable for small images
- directly extending DBNs to high-dimensional data suffers from inefficient computation (millions of weights to estimate)
Advantages of CDBNs (see the toy sketch after this list):
- feature detectors are shared across all locations in an image, so they form convolution kernels and reduce computation
- max-pooling: shrinks the representation, makes it translation-invariant, and reduces computation
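
A toy numpy illustration of the two ingredients just listed, shared convolution kernels and max-pooling (a generic sketch, not the CDBN's probabilistic max-pooling):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D convolution with a single shared kernel (cross-correlation, no flipping)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def max_pool(feature_map, block=2):
    """Non-overlapping max-pooling over block x block regions."""
    h, w = feature_map.shape
    h, w = h - h % block, w - w % block            # crop to a multiple of the block size
    fm = feature_map[:h, :w].reshape(h // block, block, w // block, block)
    return fm.max(axis=(1, 3))
```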

Architecture of CDBNs
The energy term and the probabilities are defined similarly to the RBM. All units are arranged as 2-D binary images; within one group of the detection layer the weights/convolution kernels are shared, which leads to the convolution operation.
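
The energy itself is not reproduced in the transcript; in the notation of Lee et al. (2009), the binary convolutional RBM energy has the form (a reconstruction, with K shared kernels W^k, hidden-group biases b_k, and a visible bias c):

E(\mathbf{v},\mathbf{h}) = -\sum_{k=1}^{K} \sum_{i,j} h^{k}_{ij}\,(\tilde{W}^{k} * \mathbf{v})_{ij}
 - \sum_{k=1}^{K} b_k \sum_{i,j} h^{k}_{ij}
 - c \sum_{i,j} v_{ij},

where * denotes convolution and \tilde{W}^k is the kernel flipped horizontally and vertically.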

CDBNs
Training of CDBNs is done by optimizing the network's energy under a sparsity regularization (imposed through max-pooling). This yields an updating strategy for the weights and biases similar to Contrastive Divergence. The sparsity constraints also give rise to a simple inference procedure for the network.

Performance of CDBNs
Hierarchical representations of the Caltech-101 object-classification database learned by CDBNs. Top: first-layer CDBN output. Bottom: second-layer CDBN output.
Results produced by Andrew Y. Ng et al.

References
Review:
- Learning deep architectures for AI, Y. Bengio, 2009.
Foundations:
- A fast learning algorithm for deep belief nets, Hinton, 2006.
- Reducing the dimensionality of data with neural networks, Hinton, 2006.
- A practical guide to training restricted Boltzmann machines, Hinton, 2010.
- On contrastive divergence learning, Hinton, 2005.
- On the convergence properties of contrastive divergence, Tieleman, 2010.
- Training products of experts by minimizing contrastive divergence, Hinton, 2002.
- Learning multiple layers of representation, Hinton, 2007.
Applications:
- Sparse deep belief net model for visual area V2, H. Lee, 2008.
- Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations, H. Lee, 2009.
- Unsupervised learning of invariant feature hierarchies with applications to object recognition, Y. LeCun, 2007.