CSC321: Neural Networks Lecture 19: Boltzmann Machines as Probabilistic Models Geoffrey Hinton

Modeling binary data Given a training set of binary vectors, fit a model that will assign a probability to other binary vectors. This is useful for deciding whether other binary vectors come from the same distribution, which can be used for monitoring complex systems to detect unusual behavior. If we have models of several different distributions, it can be used to compute the posterior probability that a particular distribution produced the observed data.

A naïve model for binary data For each component, j, compute its probability, $p_j$, of being on in the training set. Model the probability of a binary test vector alpha as the product of the probabilities of each of its components:

$$p(\mathbf{s}^\alpha) = \prod_j p_j^{\,s_j^\alpha} (1 - p_j)^{\,1 - s_j^\alpha}$$

where $s_j^\alpha$ is component $j$ of vector alpha, so each factor contributes $p_j$ if component $j$ of alpha is on and $1 - p_j$ if it is off.
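As a concrete illustration (not from the slides), here is a minimal Python sketch of the naïve model; the function names fit_naive and log_prob_naive are my own:

```python
import numpy as np

def fit_naive(train):
    # p_j = fraction of training vectors in which component j is on
    return train.mean(axis=0)

def log_prob_naive(p, s):
    # log p(s) = sum_j [ s_j * log p_j + (1 - s_j) * log(1 - p_j) ]
    # i.e. each component contributes p_j if it is on and (1 - p_j) if it is off
    return float(np.sum(s * np.log(p) + (1 - s) * np.log(1 - p)))

train = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 0, 1]])
p = fit_naive(train)                              # [2/3, 1/3, 2/3]
print(log_prob_naive(p, np.array([1, 0, 1])))     # log-probability of a test vector
```

In practice one would smooth the estimates (e.g. with Laplace smoothing) so that no $p_j$ is exactly 0 or 1, which would make some log terms undefined.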

A mixture of naïve models Assume that the data was generated by first picking a particular naïve model and then generating a binary vector from this naïve model. This is just like the mixture of Gaussians, but for binary data.
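A sketch of the corresponding mixture (again my own code, not the course's): each of K naïve models has its own component probabilities, and Bayes' rule gives the posterior over which model generated a given vector.

```python
import numpy as np
from scipy.special import logsumexp

def log_prob_mixture(log_pis, ps, s):
    """log p(s) under a mixture of K naive models.

    log_pis: log mixing proportions, shape (K,)
    ps:      per-model component probabilities, shape (K, D)
    s:       binary test vector, shape (D,)
    """
    per_model = np.sum(s * np.log(ps) + (1 - s) * np.log(1 - ps), axis=1)
    return logsumexp(log_pis + per_model)

def posterior_over_models(log_pis, ps, s):
    # Bayes rule: p(model k | s) is proportional to pi_k * p(s | model k)
    per_model = np.sum(s * np.log(ps) + (1 - s) * np.log(1 - ps), axis=1)
    log_joint = log_pis + per_model
    return np.exp(log_joint - logsumexp(log_joint))
```

The mixing proportions and per-model probabilities would typically be fitted with EM.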

Limitations of mixture models Mixture models assume that the whole of each data vector was generated by exactly one of the models in the mixture. This makes it easy to compute the posterior distribution over models when given a data vector. But it cannot deal with data in which there are several things going on at once. [Figure: a mixture of 10 models and a mixture of 100 models.]

Dealing with compositional structure Consider a dataset in which each image contains N different things: A distributed representation requires a number of neurons that is linear in N. A localist representation (i.e. a mixture model) requires a number of neurons that is exponential in N. Mixtures require one model for each possible combination. Distributed representations are generally much harder to fit to data, but they are the only reasonable solution. Boltzmann machines use distributed representations to model binary data.

How a Boltzmann Machine models data It is not a causal generative model (like a mixture model) in which we first pick the hidden states and then pick the visible states given the hidden ones. Instead, everything is defined in terms of energies of joint configurations of the visible and hidden units.

The Energy of a joint configuration The energy of the joint configuration with alpha on the visible units and beta on the hidden units is defined by

$$-E(\mathbf{v}^\alpha, \mathbf{h}^\beta) = \sum_i s_i^{\alpha\beta} b_i + \sum_{i<j} s_i^{\alpha\beta} s_j^{\alpha\beta} w_{ij}$$

where $s_i^{\alpha\beta}$ is the binary state of unit $i$ in joint configuration $\alpha, \beta$, $b_i$ is the bias of unit $i$, $w_{ij}$ is the weight between units $i$ and $j$, and the sum over $i<j$ indexes every non-identical pair of $i$ and $j$ once.
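A minimal sketch of this energy function in Python (my own code; it assumes a symmetric weight matrix with zero diagonal, so halving the quadratic form counts each pair $i < j$ exactly once):

```python
import numpy as np

def neg_energy(s, W, b):
    """-E of a joint configuration.

    s: binary states of all (visible and hidden) units, shape (N,)
    W: symmetric weights with zero diagonal, shape (N, N)
    b: biases, shape (N,)
    """
    # 0.5 * s W s equals the sum over non-identical pairs i < j of s_i s_j w_ij
    return float(s @ b + 0.5 * (s @ W @ s))
```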

Using energies to define probabilities The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations:

$$p(\mathbf{v}^\alpha, \mathbf{h}^\beta) = \frac{e^{-E(\mathbf{v}^\alpha, \mathbf{h}^\beta)}}{\sum_{\gamma,\delta} e^{-E(\mathbf{v}^\gamma, \mathbf{h}^\delta)}}$$

where the denominator is the partition function. The probability of a configuration alpha of the visible units is the sum of the probabilities of all the joint configurations that contain it:

$$p(\mathbf{v}^\alpha) = \frac{\sum_\beta e^{-E(\mathbf{v}^\alpha, \mathbf{h}^\beta)}}{\sum_{\gamma,\delta} e^{-E(\mathbf{v}^\gamma, \mathbf{h}^\delta)}}$$

An example of how weights define a distribution Consider a network with two visible units v1, v2 and two hidden units h1, h2, with weights w(v1,h1) = +2, w(v2,h2) = +1, w(h1,h2) = -1, and no biases. Enumerating all 16 joint configurations:

v1 v2 h1 h2    -E    e^{-E}    p(v,h)
 1  1  1  1     2    7.39      .186
 1  1  1  0     2    7.39      .186
 1  1  0  1     1    2.72      .069
 1  1  0  0     0    1.00      .025
 1  0  1  1     1    2.72      .069
 1  0  1  0     2    7.39      .186
 1  0  0  1     0    1.00      .025
 1  0  0  0     0    1.00      .025
 0  1  1  1     0    1.00      .025
 0  1  1  0     0    1.00      .025
 0  1  0  1     1    2.72      .069
 0  1  0  0     0    1.00      .025
 0  0  1  1    -1    0.37      .009
 0  0  1  0     0    1.00      .025
 0  0  0  1     0    1.00      .025
 0  0  0  0     0    1.00      .025
                     total = 39.70

Summing over the hidden units gives the marginal probabilities of the visible vectors: p(1,1) = 0.466, p(1,0) = 0.305, p(0,1) = 0.144, p(0,0) = 0.084.
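The numbers in the table can be checked by brute force. The following sketch (my own code) enumerates all 16 joint configurations, computes the partition function, and marginalizes out the hidden units:

```python
import itertools
import numpy as np

# Units ordered as (v1, v2, h1, h2); weights taken from the example, no biases.
W = np.zeros((4, 4))
W[0, 2] = W[2, 0] = 2.0    # v1 -- h1
W[1, 3] = W[3, 1] = 1.0    # v2 -- h2
W[2, 3] = W[3, 2] = -1.0   # h1 -- h2

def neg_energy(config):
    s = np.array(config, dtype=float)
    return 0.5 * s @ W @ s                             # no bias terms in this example

configs = list(itertools.product([1, 0], repeat=4))
boltz = {c: np.exp(neg_energy(c)) for c in configs}    # e^{-E} for every configuration
Z = sum(boltz.values())                                # partition function

p_joint = {c: w / Z for c, w in boltz.items()}         # p(v, h)
p_visible = {}
for (v1, v2, h1, h2), p in p_joint.items():            # marginalize out h1, h2
    p_visible[(v1, v2)] = p_visible.get((v1, v2), 0.0) + p

print(f"Z = {Z:.2f}")                                  # Z = 39.69
for v, p in p_visible.items():
    print(v, f"{p:.3f}")
# (1, 1) 0.466   (1, 0) 0.305   (0, 1) 0.144   (0, 0) 0.085
```

The marginal for (0, 0) comes out as 0.085 rather than the table's .084 only because the table sums entries that were already rounded to three decimals; likewise the exact partition function is 39.69 rather than 39.70.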

Getting a sample from the model If there are more than a few hidden units, we cannot compute the normalizing term (the partition function) because it has exponentially many terms. So use Markov Chain Monte Carlo to get samples from the model: Start at a random global configuration Keep picking units at random and allowing them to stochastically update their states based on their energy gaps. Use simulated annealing to reduce the time required to approach thermal equilibrium. At thermal equilibrium, the probability of a global configuration is given by the Boltzmann distribution.
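A sketch of this sampling procedure (my own code, using the energy convention above; the annealing schedule is an arbitrary choice, not one given in the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_from_model(W, b, num_sweeps=2000):
    """Pick units at random and stochastically update them from their energy gaps."""
    n = len(b)
    s = rng.integers(0, 2, size=n).astype(float)        # random global configuration
    temps = np.linspace(5.0, 1.0, num_sweeps)            # simulated annealing: hot -> T = 1
    for T in temps:
        for i in rng.permutation(n):
            # Energy gap: how much -E increases if unit i turns on, everything else fixed.
            gap = b[i] + W[i] @ s - W[i, i] * s[i]
            p_on = 1.0 / (1.0 + np.exp(-gap / T))        # logistic of gap / temperature
            s[i] = 1.0 if rng.random() < p_on else 0.0
    return s    # at thermal equilibrium (T = 1) this follows the Boltzmann distribution
```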

Getting a sample from the posterior distribution over distributed representations for a given data vector The number of possible hidden configurations is exponential, so we need MCMC to sample from the posterior. It is just the same as getting a sample from the model, except that we keep the visible units clamped to the given data vector: only the hidden units are allowed to change states. Samples from the posterior are required for learning the weights.
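Sampling from the posterior is the same sketch with the visible units clamped. The layout below (visible units first, then hidden) is an assumption of mine, not something the slides specify:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_posterior(W, b, v_clamped, num_hidden, num_sweeps=2000):
    """MCMC over the hidden units with the visible units held at the data vector."""
    n = len(b)
    v = np.asarray(v_clamped, dtype=float)               # clamped visible states
    s = np.concatenate([v, rng.integers(0, 2, size=num_hidden).astype(float)])
    hidden = np.arange(n - num_hidden, n)                # only hidden units may change state
    for _ in range(num_sweeps):
        for i in rng.permutation(hidden):
            gap = b[i] + W[i] @ s - W[i, i] * s[i]       # energy gap for unit i
            s[i] = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-gap)) else 0.0
    return s[n - num_hidden:]                            # a sample of hidden states given the data
```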