Training Products of Experts by Minimizing Contrastive Divergence (Geoffrey E. Hinton). Presented by Frank Wood.

Presentation transcript:

Training Products of Experts by Minimizing Contrastive Divergence. Geoffrey E. Hinton. Presented by Frank Wood.

Goal
Learn parameters for probability distribution models of high-dimensional data (images, population firing rates, securities data, NLP data, etc.).
– Mixture model: use EM to learn the parameters.
– Product of Experts: use contrastive divergence to learn the parameters.

Take Home
Contrastive divergence is a general MCMC gradient ascent learning algorithm particularly well suited to learning Product of Experts (PoE) and energy-based (Gibbs distributions, etc.) model parameters. The general algorithm (sketched in code below):
– Repeat until "convergence":
  – Draw samples from the current model, starting the Markov chain from the training data.
  – Compute the expected gradient of the log probability w.r.t. all model parameters over both the samples and the training data.
  – Update the model parameters according to the gradient.
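A minimal sketch of this generic loop, under stated assumptions: `grad_log_f(data, theta)` and `sample_from_model(start, theta, n_steps)` are hypothetical stand-ins for the model-specific pieces (the gradient of the experts' log unnormalized density, and a few steps of Gibbs/Metropolis sampling started at the given points).

```python
import numpy as np

def train_cd(data, theta, grad_log_f, sample_from_model,
             lr=0.01, n_epochs=100, n_mcmc_steps=1):
    """Generic contrastive divergence training loop (sketch, not the slides' code)."""
    for epoch in range(n_epochs):
        # Positive phase: gradient of the log unnormalized probability at the data.
        pos_grad = grad_log_f(data, theta).mean(axis=0)

        # Negative phase: run the Markov chain a few steps, starting at the data,
        # and evaluate the same gradient at the resulting samples.
        samples = sample_from_model(data, theta, n_mcmc_steps)
        neg_grad = grad_log_f(samples, theta).mean(axis=0)

        # Gradient ascent on the (approximate) log likelihood.
        theta = theta + lr * (pos_grad - neg_grad)
    return theta
```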

Sampling – Critical to Understanding
– Uniform: rand(), e.g. a linear congruential generator, x(n) = a * x(n-1) + b mod M.
– Normal: randn(), via the Box-Muller transform, which maps x1, x2 ~ U(0,1) to y1, y2 ~ N(0,1):
  – y1 = sqrt(-2 ln(x1)) cos(2 pi x2)
  – y2 = sqrt(-2 ln(x1)) sin(2 pi x2)
– Binomial(p): if (rand() < p).
– More complicated distributions:
  – Mixture model: sample a component from a multinomial (CDF + uniform), then sample from that component's Gaussian.
  – Product of Experts: Metropolis and/or Gibbs sampling.
(A sketch of the uniform-to-normal transform and of mixture sampling follows.)
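A small sketch of two of the samplers listed above (assumed forms, using numpy's uniform generator as the only primitive); the mixture parameters in the usage example are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def box_muller(n):
    """Turn pairs of U(0,1) draws into pairs of independent N(0,1) draws."""
    x1, x2 = rng.random(n), rng.random(n)
    y1 = np.sqrt(-2.0 * np.log(x1)) * np.cos(2.0 * np.pi * x2)
    y2 = np.sqrt(-2.0 * np.log(x1)) * np.sin(2.0 * np.pi * x2)
    return y1, y2

def sample_gaussian_mixture(n, weights, means, stds):
    """Ancestral sampling: pick a component via its CDF, then sample that Gaussian."""
    components = rng.choice(len(weights), size=n, p=weights)   # multinomial draw
    z, _ = box_muller(n)                                       # standard normals
    return means[components] + stds[components] * z

# Usage example: a 3-component 1-D mixture.
samples = sample_gaussian_mixture(
    1000,
    weights=np.array([0.5, 0.3, 0.2]),
    means=np.array([-2.0, 0.0, 3.0]),
    stds=np.array([0.5, 1.0, 0.3]),
)
```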

The Flavor of Metropolis Sampling
– Given some target distribution p(x), a random starting point x_0, and a symmetric proposal distribution q(x' | x).
– At each step, sample a proposal x' from q(x' | x_t) and calculate the ratio of densities r = p(x') / p(x_t).
– With probability min(1, r), accept the proposal; otherwise keep the current point.
– Given sufficiently many iterations, the chain yields samples from the target distribution.
– Only need to know the distribution up to a constant of proportionality! (A minimal sketch follows.)
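A minimal sketch of this sampler for an unnormalized 1-D density `unnorm_p` (an assumption; any function proportional to the target will do), using a symmetric Gaussian random-walk proposal.

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis(unnorm_p, x0, n_samples, step=1.0):
    """Random-walk Metropolis sampler for a 1-D unnormalized density."""
    x = x0
    samples = []
    for _ in range(n_samples):
        x_prop = x + step * rng.normal()          # symmetric proposal
        r = unnorm_p(x_prop) / unnorm_p(x)        # density ratio; no normalizer needed
        if rng.random() < min(1.0, r):            # accept with probability min(1, r)
            x = x_prop
        samples.append(x)
    return np.array(samples)

# Usage example: sample from an unnormalized standard normal.
draws = metropolis(lambda x: np.exp(-0.5 * x**2), x0=0.0, n_samples=5000)
```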

Contrastive Divergence (Final Result!)
– The update rule (written out below) involves the model parameters, the training data (the empirical distribution), and samples drawn from the model.
– Law of Large Numbers: compute the required expectations using samples.
– Now you know how to do it; next, let's see why this works!
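For reference, the update rule in Hinton's paper has the following form, where P^0 denotes the training-data (empirical) distribution, P^1 the distribution of samples after one step of the Markov chain started at the data, and f_m is expert m with parameters theta_m.

```latex
% Contrastive divergence parameter update for a Product of Experts
% (form as given in Hinton, "Training Products of Experts by Minimizing
% Contrastive Divergence").
\[
\Delta \theta_m \;\propto\;
\left\langle \frac{\partial \log f_m(\mathbf{d}\,|\,\theta_m)}{\partial \theta_m} \right\rangle_{P^0}
\;-\;
\left\langle \frac{\partial \log f_m(\hat{\mathbf{d}}\,|\,\theta_m)}{\partial \theta_m} \right\rangle_{P^1}
\]
```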

But First: The last vestige of concreteness.
Looking towards the future:
– Take each expert f to be a Student-t distribution.
– Then (for instance): dot product → projection → 1-D marginal (one standard form is written out below).
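One commonly used form of a Student-t expert on a linear projection of the data, as in the products-of-Student-t literature; the exact constants on the original slide may differ, and the parameterization (projection weights w_m and exponent alpha_m) is an assumption here.

```latex
% Student-t expert on a 1-D projection of the datum d: the dot product
% w_m^T d projects d onto one direction, and the expert is a heavy-tailed
% 1-D marginal on that projection.
\[
f_m(\mathbf{d}\,|\,\theta_m) \;=\;
\left( 1 + \tfrac{1}{2}\,(\mathbf{w}_m^{\top}\mathbf{d})^2 \right)^{-\alpha_m},
\qquad \theta_m = \{\mathbf{w}_m, \alpha_m\}
\]
```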

Maximizing the training data log likelihood
– We want the parameters that maximize the likelihood of the training data, assuming the d's are drawn independently from p(), with p() written in the standard PoE form and the likelihood taken over all training data (see the reconstruction below).
– Differentiate w.r.t. all parameters and perform gradient ascent to find the optimal parameters.
– The derivation is somewhat nasty.
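The quantities referenced on this slide, reconstructed from Hinton's paper: the standard PoE form (a product of unnormalized experts divided by a normalizing sum over the data space) and the log likelihood summed over all training data.

```latex
% Standard PoE form.
\[
p(\mathbf{d}\,|\,\theta_1,\ldots,\theta_n) \;=\;
\frac{\prod_m f_m(\mathbf{d}\,|\,\theta_m)}
     {\sum_{\mathbf{c}} \prod_m f_m(\mathbf{c}\,|\,\theta_m)}
\]
% Objective over all training data D = {d_1, ..., d_N}, assuming the d's are
% drawn independently.
\[
\theta^{*} \;=\; \arg\max_{\theta}\; \sum_{i=1}^{N} \log p(\mathbf{d}_i\,|\,\theta_1,\ldots,\theta_n)
\]
```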

Maximizing the training data log likelihood (derivation): remember this equivalence!

Maximizing the training data log likelihood (derivation, continued).

Maximizing the training data log likelihood (derivation, continued), using log(x)' = x'/x.

Maximizing the training data log likelihood (derivation, continued), using log(x)' = x'/x.

Maximizing the training data log likelihood (derivation, continued).

Maximizing the training data log likelihood: Phew! We're done! So we arrive at the gradient reconstructed below.
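The end result of this derivation, as stated in Hinton's paper: the derivative of the log likelihood of one training vector d with respect to the parameters of expert m.

```latex
% Derivative of the PoE log likelihood of a single training vector d.
\[
\frac{\partial \log p(\mathbf{d}\,|\,\theta_1,\ldots,\theta_n)}{\partial \theta_m}
\;=\;
\frac{\partial \log f_m(\mathbf{d}\,|\,\theta_m)}{\partial \theta_m}
\;-\;
\sum_{\mathbf{c}} p(\mathbf{c}\,|\,\theta_1,\ldots,\theta_n)\,
\frac{\partial \log f_m(\mathbf{c}\,|\,\theta_m)}{\partial \theta_m}
\]
% Averaged over the training set, the first term is an expectation under the
% data distribution and the second an expectation under the model distribution.
```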

Equilibrium Is Hard to Achieve
– With the gradient above we can now train our PoE model. But… there's a problem:
– The second, model-expectation term is computationally infeasible to obtain exactly (especially in an inner gradient ascent loop).
– The sampling Markov chain must converge to the target (equilibrium) distribution, and often this takes a very long time!

Solution: Contrastive Divergence!
– Now we don't have to run the sampling Markov chain to convergence; instead we can stop after 1 iteration (or, more typically, perhaps a few iterations).
– Why does this work? It attempts to minimize the ways that the model distorts the data.
(A small end-to-end CD-1 sketch for the Student-t PoE follows.)
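A minimal sketch (not from the slides) tying the pieces together: the Student-t experts assumed earlier, a single Metropolis step started at the training data as the "negative" sample, and the contrastive divergence update on the projection weights W. The fixed exponent, step sizes, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
ALPHA = 1.5  # assumed fixed Student-t exponent shared by all experts

def log_unnorm_p(D, W):
    """Sum over experts of log f_m(d) for each row d of D; W is (n_dim, n_experts)."""
    return -ALPHA * np.log1p(0.5 * (D @ W) ** 2).sum(axis=1)

def grad_w(D, W):
    """Gradient w.r.t. W of the mean over rows of D of sum_m log f_m(d)."""
    Y = D @ W                                   # projections, shape (n, n_experts)
    G = -ALPHA * Y / (1.0 + 0.5 * Y ** 2)       # d log f_m / d y_m
    return D.T @ G / D.shape[0]

def cd1_step(D, W, lr=0.01, prop_std=0.1):
    """One CD-1 update: data statistics minus statistics after one Metropolis step."""
    # Negative phase: one Metropolis step per data vector, started at the data.
    D_prop = D + prop_std * rng.standard_normal(D.shape)
    accept = np.log(rng.random(D.shape[0])) < (log_unnorm_p(D_prop, W) - log_unnorm_p(D, W))
    D_neg = np.where(accept[:, None], D_prop, D)
    # CD update: <gradient>_data - <gradient>_one-step samples.
    return W + lr * (grad_w(D, W) - grad_w(D_neg, W))
```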

Equivalence of argmax log P() and argmin KL()
– Maximizing the training data log likelihood is equivalent to minimizing the KL divergence from the data distribution P^0 to the model's equilibrium distribution P^∞ (written out below).
– This is what we got out of the nasty derivation!
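The equivalence written out. Because the entropy of P^0 does not depend on the parameters, maximizing the average log likelihood and minimizing this KL divergence pick out the same parameters.

```latex
% P^0: data (empirical) distribution; P^infty_theta: the model's equilibrium
% distribution.
\[
\arg\max_{\theta} \; \big\langle \log p(\mathbf{d}\,|\,\theta) \big\rangle_{P^0}
\;=\;
\arg\min_{\theta} \; \mathrm{KL}\!\left(P^0 \,\|\, P^{\infty}_{\theta}\right),
\qquad
\mathrm{KL}\!\left(P^0 \,\|\, P^{\infty}_{\theta}\right)
= \sum_{\mathbf{d}} P^0(\mathbf{d}) \log \frac{P^0(\mathbf{d})}{p(\mathbf{d}\,|\,\theta)}
\]
```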

Contrastive Divergence
– We want to "update the parameters to reduce the tendency of the chain to wander away from the initial distribution on the first step".
– This leads to minimizing the difference of two KL divergences (written out below) rather than the single KL divergence above.
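The contrastive divergence objective from Hinton's paper, where P^1 is the distribution obtained by running one full step of the Markov chain starting from the data distribution P^0.

```latex
% CD is nonnegative because one step of the Markov chain cannot increase the
% KL divergence to the equilibrium distribution.
\[
\mathrm{CD} \;=\;
\mathrm{KL}\!\left(P^0 \,\|\, P^{\infty}_{\theta}\right)
\;-\;
\mathrm{KL}\!\left(P^1 \,\|\, P^{\infty}_{\theta}\right)
\]
```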

Contrastive Divergence (Final Result!)
– The update rule (reconstructed earlier) compares expectations over the training data (the empirical distribution) with expectations over samples from the model, for each expert's parameters.
– Law of Large Numbers: compute the expectations using samples.
– Now you know how to do it and why it works!