CSC321: 2011 Introduction to Neural Networks and Machine Learning Lecture 10: The Bayesian way to fit models Geoffrey Hinton

The Bayesian framework

The Bayesian framework assumes that we always have a prior distribution for everything.
– The prior may be very vague.
– When we see some data, we combine our prior distribution with a likelihood term to get a posterior distribution.
– The likelihood term takes into account how probable the observed data is given the parameters of the model.
– It favors parameter settings that make the data likely, so it fights the prior. With enough data, the likelihood term always wins.

A coin tossing example

Suppose we know nothing about coins except that each tossing event produces a head with some unknown probability p and a tail with probability 1-p. Our model of a coin has one parameter, p.

Suppose we observe 100 tosses and there are 53 heads. What is p?

The frequentist answer: pick the value of p that makes the observation of 53 heads and 47 tails most probable. The probability of a particular sequence containing 53 heads and 47 tails is

$$P(D \mid p) = p^{53}(1-p)^{47}$$

Setting $\frac{d}{dp}\log P(D \mid p) = \frac{53}{p} - \frac{47}{1-p} = 0$ gives $p = \frac{53}{100} = 0.53$.
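As a minimal numerical check (not part of the original slides, and assuming NumPy is available), the sketch below evaluates this log likelihood on a grid of p values and confirms that the maximum sits at p ≈ 0.53:

```python
import numpy as np

# Log likelihood of one particular sequence with 53 heads and 47 tails:
# log P(D | p) = 53*log(p) + 47*log(1 - p).
heads, tails = 53, 47
p_grid = np.linspace(0.001, 0.999, 999)      # avoid log(0) at the endpoints
log_lik = heads * np.log(p_grid) + tails * np.log(1 - p_grid)

p_mle = p_grid[np.argmax(log_lik)]
print(p_mle)                                  # ~0.53, i.e. heads / (heads + tails)
```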

Some problems with picking the parameters that are most likely to generate the data

What if we only tossed the coin once and we got 1 head?
– Is p=1 a sensible answer?
– Surely p=0.5 is a much better answer.

Is it reasonable to give a single answer?
– If we don't have much data, we are unsure about p.
– Our computations of probabilities will work much better if we take this uncertainty into account.

Using a distribution over parameter values

Start with a prior distribution over p. In this case we used a uniform distribution.

Multiply the prior probability of each parameter value by the probability of observing a head given that value.

Then scale up all of the probability densities so that their integral comes to 1. This gives the posterior distribution.

[Figure: probability density plotted against p, with the area under each curve labelled.]

Let's do it again: suppose we get a tail

Start with a prior distribution over p.

Multiply the prior probability of each parameter value by the probability of observing a tail given that value.

Then renormalize to get the posterior distribution. Look how sensible it is!

[Figure: probability density plotted against p, with the area under the curve labelled.]
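The sketch below (my own illustration, not from the slides, assuming NumPy) carries out these grid-based updates: start from a uniform prior over p, multiply by the likelihood of each observation, and renormalize. After one head the peak is at p = 1; after the subsequent tail it moves back to p = 0.5:

```python
import numpy as np

p = np.linspace(0.0, 1.0, 101)     # discretized values of the coin parameter
dp = p[1] - p[0]

def update(density, likelihood):
    """Multiply a density over p by the likelihood of an observation, then renormalize."""
    post = density * likelihood
    return post / (post.sum() * dp)

prior = np.ones_like(p)            # uniform prior over p
prior = prior / (prior.sum() * dp) # normalize so the density integrates to (about) 1

post_after_head = update(prior, p)                # P(head | p) = p
post_after_tail = update(post_after_head, 1 - p)  # P(tail | p) = 1 - p

print(p[np.argmax(post_after_head)])   # 1.0: one head alone pushes the peak to p = 1
print(p[np.argmax(post_after_tail)])   # 0.5: one head and one tail peak at p = 0.5
```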

Let's do it another 98 times

After 53 heads and 47 tails we get a very sensible posterior distribution that has its peak at 0.53 (assuming a uniform prior).

[Figure: probability density plotted against p, with the area under the curve labelled.]
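Repeating the same grid update for all 100 tosses gives a posterior proportional to $p^{53}(1-p)^{47}$. The sketch below (not from the slides; assumes NumPy) computes it in the log domain and confirms the peak at 0.53:

```python
import numpy as np

p = np.linspace(0.0, 1.0, 1001)
heads, tails = 53, 47

# Work in the log domain, then exponentiate and renormalize on the grid.
with np.errstate(divide="ignore"):                        # log(0) at the endpoints is fine
    log_post = heads * np.log(p) + tails * np.log(1 - p)  # a uniform prior only adds a constant
post = np.exp(log_post - log_post.max())
post /= post.sum() * (p[1] - p[0])                        # normalize to a density

print(p[np.argmax(post)])   # 0.53, the peak after 53 heads and 47 tails
```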

Bayes Theorem

The joint probability of the data and a weight vector can be written using either conditional probability:

$$p(D)\,p(W \mid D) \;=\; p(D, W) \;=\; p(W)\,p(D \mid W)$$

Solving for the posterior:

$$p(W \mid D) = \frac{p(W)\,p(D \mid W)}{p(D)}$$

where $p(W \mid D)$ is the posterior probability of weight vector W given training data D, $p(W)$ is the prior probability of weight vector W, and $p(D \mid W)$ is the probability of the observed data given W.

A cheap trick to avoid computing the posterior probabilities of all weight vectors

Suppose we just try to find the most probable weight vector.
– We can do this by starting with a random weight vector and then adjusting it in the direction that improves p(W | D).

It is easier to work in the log domain. If we want to minimize a cost we use negative log probabilities:

$$\text{Cost} = -\log p(W \mid D) = -\log p(W) - \log p(D \mid W) + \log p(D)$$
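As a rough sketch of "adjusting the weight vector in the direction that improves p(W | D)" (my own illustration, not from the slides: the toy 1-D linear model, the noise level sigma and the prior width tau are all assumptions), gradient descent on the negative log posterior looks like this; note that $\log p(D)$ does not depend on W and can be dropped:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data from a 1-D linear model d = 2*x + noise.  The data, the noise level
# sigma and the prior width tau are assumptions made only for this illustration.
x = rng.normal(size=50)
d = 2.0 * x + rng.normal(scale=0.5, size=50)
sigma, tau = 0.5, 1.0

def neg_log_posterior(w):
    nll = np.sum((d - w * x) ** 2) / (2 * sigma ** 2)  # -log p(D | w), constants dropped
    nlp = w ** 2 / (2 * tau ** 2)                      # -log p(w), Gaussian prior on w
    return nll + nlp                                   # log p(D) is independent of w

def grad(w):
    return -np.sum((d - w * x) * x) / sigma ** 2 + w / tau ** 2

w = 0.0                          # start from an arbitrary weight
for _ in range(200):             # adjust w in the direction that improves p(w | D)
    w -= 0.001 * grad(w)

print(w, neg_log_posterior(w))   # w ends up close to 2
```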

Why we maximize sums of log probs

We want to maximize the product of the probabilities of the outputs on all the different training cases.
– Assume the output errors on different training cases, c, are independent, so

$$p(D \mid W) = \prod_c p(\text{output}_c \mid W)$$

Because the log function is monotonic, it does not change where the maxima are. So we can maximize sums of log probabilities:

$$\log p(D \mid W) = \sum_c \log p(\text{output}_c \mid W)$$
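There is also a practical reason to prefer the sum of logs, shown by this small check (mine, not from the slides): the raw product of many per-case probabilities underflows to zero in floating point, while the sum of their logs stays well behaved, and the monotonic log leaves the maximizing parameters unchanged:

```python
import numpy as np

probs = np.full(2000, 0.1)     # 2000 training cases, each given probability 0.1

print(np.prod(probs))          # 0.0 -- the raw product underflows
print(np.log(probs).sum())     # about -4605.2, still perfectly usable
```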

An even cheaper trick

Suppose we completely ignore the prior over weight vectors.
– This is equivalent to giving all possible weight vectors the same prior probability density.

Then all we have to do is to maximize:

$$\log p(D \mid W) = \sum_c \log p(\text{output}_c \mid W)$$

This is called maximum likelihood learning. It is very widely used for fitting models in statistics.

Supervised Maximum Likelihood Learning

Minimizing the squared residuals is equivalent to maximizing the log probability of the correct answer under a Gaussian centered at the model's guess.

With d = the correct answer and y = the model's estimate of the most probable value,

$$p(d \mid y) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(d-y)^2}{2\sigma^2}}$$

$$-\log p(d \mid y) = \log\!\left(\sqrt{2\pi}\,\sigma\right) + \frac{(d-y)^2}{2\sigma^2}$$

Supervised Maximum Likelihood Learning

Finding a set of weights, W, that minimizes the squared errors is exactly the same as finding a W that maximizes the log probability that the model would produce the desired outputs on all the training cases.
– We implicitly assume that zero-mean Gaussian noise is added to the model's actual output.
– We do not need to know the variance of the noise because we are assuming it's the same in all cases. So it just scales the squared error.
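A quick numerical check of this claim (mine, not from the slides; the particular values of d, y and sigma are arbitrary): the negative log probability of d under a Gaussian centered at y differs from the scaled squared error only by a constant that does not depend on y, so minimizing one minimizes the other:

```python
import numpy as np

sigma = 1.5                       # assumed noise standard deviation, the same for every case
d = 3.0                           # the correct answer
ys = np.linspace(-2.0, 6.0, 5)    # a few candidate model outputs

def neg_log_gaussian(d, y, sigma):
    # -log N(d; y, sigma^2) = log(sqrt(2*pi)*sigma) + (d - y)^2 / (2*sigma^2)
    return 0.5 * np.log(2 * np.pi * sigma ** 2) + (d - y) ** 2 / (2 * sigma ** 2)

squared_term = (d - ys) ** 2 / (2 * sigma ** 2)
print(neg_log_gaussian(d, ys, sigma) - squared_term)   # the same constant for every y
```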