Bayesian Learning & Estimation Theory

Maximum likelihood estimation
Example: for a Gaussian likelihood, $P(x \mid \theta) = \mathcal{N}(x \mid \mu, \sigma^2)$ with $\theta = (\mu, \sigma^2)$.
Objective of regression: minimize the error $E(\mathbf{w}) = \tfrac{1}{2}\sum_n \big(t_n - y(x_n, \mathbf{w})\big)^2$.
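As an aside (not part of the slides), here is a minimal sketch of ML estimation for a Gaussian: the estimates are the sample mean and the biased sample variance. All names and numbers are illustrative.

```python
import numpy as np

# Illustrative data drawn from a Gaussian with "unknown" parameters.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)

# ML estimates for N(x | mu, sigma^2): sample mean and biased sample variance.
mu_ml = x.mean()
sigma2_ml = ((x - mu_ml) ** 2).mean()
print(mu_ml, sigma2_ml)   # close to 2.0 and 1.5**2 = 2.25
```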

A probabilistic view of linear regression
Model each target as the regression function plus Gaussian noise with precision $\beta = 1/\sigma^2$: $p(t \mid x, \mathbf{w}) = \mathcal{N}\big(t \mid y(x, \mathbf{w}), \beta^{-1}\big)$.
Compare the negative log-likelihood to the error function $E(\mathbf{w}) = \tfrac{1}{2}\sum_n \big(t_n - y(x_n, \mathbf{w})\big)^2$.
Since $\arg\min_{\mathbf{w}} E(\mathbf{w}) = \arg\max_{\mathbf{w}} p(\mathbf{t} \mid \mathbf{x}, \mathbf{w})$, least-squares regression is equivalent to ML estimation of $\mathbf{w}$.
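A minimal sketch of this equivalence, assuming a model that is linear in the parameters, $y(x, \mathbf{w}) = \mathbf{w}^\top \boldsymbol{\phi}(x)$, with a hypothetical polynomial feature map: the least-squares solution is the ML estimate of $\mathbf{w}$.

```python
import numpy as np

def poly_features(x, M):
    """Design matrix with columns 1, x, x^2, ..., x^M (hypothetical helper)."""
    return np.vander(x, M + 1, increasing=True)

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)  # noisy targets

Phi = poly_features(x, M=3)
# Minimizing E(w) = 1/2 sum_n (t_n - w . phi(x_n))^2 is the same as maximizing
# the Gaussian likelihood, so the least-squares solution is w_ML.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
```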

Bayesian learning
View the data $D$ and the parameter $\theta$ as random variables (for regression, $D = (\mathbf{x}, \mathbf{t})$ and $\theta = \mathbf{w}$).
The data induce a distribution over the parameter: $P(\theta \mid D) = P(D, \theta)/P(D) \propto P(D, \theta)$.
Substituting $P(D, \theta) = P(D \mid \theta)\, P(\theta)$, we obtain Bayes' theorem: $P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)$, i.e. Posterior $\propto$ Likelihood $\times$ Prior.
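Not from the slides: a tiny numeric illustration of Posterior $\propto$ Likelihood $\times$ Prior, discretizing a coin's head-probability $\theta$ onto a grid (all names and counts are hypothetical).

```python
import numpy as np

theta = np.linspace(0.01, 0.99, 99)             # grid of candidate parameter values
prior = np.full_like(theta, 1.0 / theta.size)   # flat prior P(theta)

heads, tails = 7, 3                              # observed data D
likelihood = theta**heads * (1 - theta)**tails   # P(D | theta)

posterior = likelihood * prior                   # Bayes' theorem, up to a constant
posterior /= posterior.sum()                     # normalizing by P(D)
print(theta[np.argmax(posterior)])               # MAP estimate, about 0.7
```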

Bayesian prediction
Predictions (e.g., predicting $t$ from $x$ using data $D$) are mediated through the parameter: $P(\text{prediction} \mid D) = \int P(\text{prediction} \mid \theta)\, P(\theta \mid D)\, d\theta$.
Maximum a posteriori (MAP) estimation: $\theta_{\mathrm{MAP}} = \arg\max_\theta P(\theta \mid D)$, with the approximation $P(\text{prediction} \mid D) \approx P(\text{prediction} \mid \theta_{\mathrm{MAP}})$.
This is accurate when $P(\theta \mid D)$ is concentrated around $\theta_{\mathrm{MAP}}$.

A probabilistic view of regularized regression
Regularized error: $E(\mathbf{w}) = \tfrac{1}{2}\sum_n \big(t_n - y(x_n, \mathbf{w})\big)^2 + \tfrac{\lambda}{2}\sum_m w_m^2$
Prior: the weights are IID Gaussian, $p(\mathbf{w}) = \prod_m \tfrac{1}{\sqrt{2\pi\lambda^{-1}}} \exp\{-\lambda w_m^2 / 2\}$
The two terms of $E(\mathbf{w})$ correspond (up to constants) to $-\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w})$ and $-\ln p(\mathbf{w})$, so $\arg\min_{\mathbf{w}} E(\mathbf{w}) = \arg\max_{\mathbf{w}} p(\mathbf{t} \mid \mathbf{x}, \mathbf{w})\, p(\mathbf{w})$: regularized regression is equivalent to MAP estimation of $\mathbf{w}$.
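A hedged sketch of the corresponding MAP solution for a linear-in-the-parameters model: the minimizer of the regularized error above is the ridge-regression solution. The poly_features helper is the hypothetical one from the earlier sketch.

```python
import numpy as np

def ridge_map(Phi, t, lam):
    """Minimizer of E(w) = 1/2 ||t - Phi w||^2 + lam/2 ||w||^2,
    i.e. the MAP / regularized-regression estimate with regularizer lam (the slide's lambda)."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Example usage with the Phi and t built earlier:
# w_map = ridge_map(Phi, t, lam=1e-3)
```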

Bayesian linear regression
Likelihood: $p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}) = \prod_n \mathcal{N}\big(t_n \mid y(x_n, \mathbf{w}), \beta^{-1}\big)$, where $\beta$ specifies the precision of the data noise.
Prior: $p(\mathbf{w}) = \prod_{m=0}^{M} \mathcal{N}(w_m \mid 0, \alpha^{-1})$, where $\alpha$ specifies the precision of the weights.
Posterior: $p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}) \propto p(\mathbf{t} \mid \mathbf{x}, \mathbf{w})\, p(\mathbf{w})$, an $(M+1)$-dimensional Gaussian density.
Prediction: $p(t \mid x, D)$ is obtained by integrating out $\mathbf{w}$; computed using linear algebra (see textbook).
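A self-contained sketch of the linear algebra the slide defers to the textbook (Bishop, PRML, Ch. 3), assuming features $\boldsymbol{\phi}(x)$ collected in a design matrix $\Phi$; function and variable names are my own.

```python
import numpy as np

def bayes_linreg_posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) for prior N(w | 0, alpha^-1 I) and
    Gaussian observation noise with precision beta (cf. PRML Ch. 3)."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def predictive(phi_x, m_N, S_N, beta):
    """Predictive mean and variance of t at a new input with feature vector phi_x."""
    mean = phi_x @ m_N
    var = 1.0 / beta + phi_x @ S_N @ phi_x
    return mean, var
```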

Example: $y(x) = w_0 + w_1 x$
[Figure: sequential Bayesian learning. Columns show the likelihood, the prior/posterior over $(w_0, w_1)$, and samples of $y(x)$ drawn from the posterior, after no data, the 1st point, the 2nd point, ..., the 20th point.]

Example: $y(x) = w_0 + w_1 x + \dots + w_M x^M$
$M = 9$, $\alpha = 5 \times 10^{-3}$: gives a reasonable range of functions.
$\beta = 11.1$: known precision of the noise.
[Figure: mean and one standard deviation of the predictive distribution.]
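Plugging the slide's settings into the sketches above (poly_features, bayes_linreg_posterior, predictive, x and t are the hypothetical names defined in the earlier sketches):

```python
import numpy as np

# Degree-9 polynomial features with the slide's hyperparameters.
Phi = poly_features(x, M=9)
m_N, S_N = bayes_linreg_posterior(Phi, t, alpha=5e-3, beta=11.1)

phi_new = poly_features(np.array([0.5]), M=9)[0]
mean, var = predictive(phi_new, m_N, S_N, beta=11.1)   # predictive mean and variance; std dev = sqrt(var)
```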

Example: $y(x) = w_0 + w_1 \phi_1(x) + \dots + w_M \phi_M(x)$
Gaussian basis functions: $\phi_j(x) = \exp\{-(x - \mu_j)^2 / (2 s^2)\}$
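Not from the slides: a minimal sketch of a Gaussian-basis design matrix; the centres and width s are illustrative choices.

```python
import numpy as np

def gaussian_features(x, centres, s):
    """Design matrix [1, phi_1(x), ..., phi_M(x)] with
    phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    phi = np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * s**2))
    return np.hstack([np.ones((x.shape[0], 1)), phi])

# Example: nine basis functions spread over [0, 1].
# Phi = gaussian_features(x, centres=np.linspace(0, 1, 9), s=0.1)
```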

How are we doing on the pass sequence?
Least squares regression…
[Figure: least-squares fit to the hand-labeled horizontal coordinate, t.]
Choosing a particular M and w seems wrong – we should hedge our bets.
Cross-validation reduced the training data, so the red line isn’t as accurate as it should be.
The red line doesn’t reveal different levels of uncertainty in predictions.

How are we doing on the pass sequence?
The same concerns, revisited with Bayesian regression:
Choosing a particular M and w seems wrong – we should hedge our bets.
Cross-validation reduced the training data, so the red line isn’t as accurate as it should be.
The red line doesn’t reveal different levels of uncertainty in predictions.
[Figure: Bayesian regression on the hand-labeled horizontal coordinate, t.]

Estimation theory
Given a predictive distribution $p(t \mid x)$, how do we estimate a single value for $t$?
Example: in the pass sequence, Cupid must aim at and hit the man in the white shirt, without hitting the man in the striped shirt.
Define $L(t, t^*)$ as the loss incurred by estimating $t^*$ when the true value is $t$.
Assuming $p(t \mid x)$ is correct, the expected loss is $\mathbb{E}[L] = \int_t L(t, t^*)\, p(t \mid x)\, dt$.
The minimum-loss estimate is found by minimizing $\mathbb{E}[L]$ with respect to $t^*$.

Squared loss
A common choice: $L(t, t^*) = (t - t^*)^2$, so $\mathbb{E}[L] = \int_t (t - t^*)^2\, p(t \mid x)\, dt$. (Not appropriate for Cupid’s problem.)
To minimize $\mathbb{E}[L]$, set its derivative to zero: $d\mathbb{E}[L]/dt^* = -2\int_t (t - t^*)\, p(t \mid x)\, dt = 0$, i.e. $-\int_t t\, p(t \mid x)\, dt + t^* = 0$.
Minimum mean squared error (MMSE) estimate: $t^* = \mathbb{E}[t \mid x] = \int_t t\, p(t \mid x)\, dt$.
For regression: $t^* = y(x, \mathbf{w})$.
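A small numeric sketch (not from the slides) of why squared loss is inappropriate for Cupid's problem: when $p(t \mid x)$ is bimodal, the MMSE estimate, being the mean, falls between the two modes and hits neither man. The density below is hypothetical.

```python
import numpy as np

t = np.linspace(0, 320, 2000)
# Hypothetical bimodal predictive density: two men, near t = 100 and t = 250.
p = 0.6 * np.exp(-(t - 100)**2 / (2 * 10.0**2)) + 0.4 * np.exp(-(t - 250)**2 / (2 * 10.0**2))
p /= np.trapz(p, t)

t_mmse = np.trapz(t * p, t)   # posterior mean = MMSE estimate
print(t_mmse)                 # about 160: between the modes, far from both men
```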

Other loss functions
[Figure: the absolute loss $|t - t^*|$ and the squared loss $(t - t^*)^2$ as functions of $t - t^*$.]

Absolute loss
[Figure: seven samples $t_1, \dots, t_7$ on an axis, with the estimate $t^*$ lying between $t_6$ and $t_7$.]
$L = |t^* - t_1| + |t^* - t_2| + |t^* - t_3| + |t^* - t_4| + |t^* - t_5| + |t^* - t_6| + |t^* - t_7|$
Consider moving $t^*$ to the left by $\epsilon$: $L$ decreases by $6\epsilon$ and increases by $\epsilon$.
Changes in $L$ are balanced when $t^* = t_4$, so the median of $t$ under $p(t \mid x)$ minimizes the absolute loss.
Important: the median is invariant to monotonic transformations of $t$.
[Figure: a skewed density with its mean and median marked.]
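A quick numeric check (hypothetical sample values) that the median minimizes the summed absolute loss, while the mean minimizes the summed squared loss:

```python
import numpy as np

t = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 20.0])   # t_7 is an outlier
grid = np.linspace(0.0, 25.0, 2501)                  # candidate values of t*

abs_loss = np.abs(grid[:, None] - t[None, :]).sum(axis=1)
sq_loss = ((grid[:, None] - t[None, :]) ** 2).sum(axis=1)

print(grid[np.argmin(abs_loss)], np.median(t))   # both 4.0: the median
print(grid[np.argmin(sq_loss)], t.mean())        # about 5.86: the mean
```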

D-dimensional estimation
Suppose $\mathbf{t}$ is $D$-dimensional, $\mathbf{t} = (t_1, \dots, t_D)$. Example: 2-dimensional tracking.
Approach 1: minimum marginal loss estimation – find each $t_d^*$ that minimizes $\int L(t_d, t_d^*)\, p(t_d \mid x)\, dt_d$.
Approach 2: minimum joint loss estimation – define a joint loss $L(\mathbf{t}, \mathbf{t}^*)$ and find the $\mathbf{t}^*$ that minimizes $\int L(\mathbf{t}, \mathbf{t}^*)\, p(\mathbf{t} \mid x)\, d\mathbf{t}$.

Questions?

How are we doing on the pass sequence?
Bayesian regression and estimation enable us to track the man in the striped shirt based on labeled data. Can we track the man in the white shirt?
[Figure: feature x = fraction of pixels in each column with intensity > 0.9, over horizontal locations 0–320; 1st moment: x = 224; hand-labeled horizontal coordinate: t = 290. The man in the white shirt is occluded.]

How are we doing on the pass sequence?
Bayesian regression and estimation enable us to track the man in the striped shirt based on labeled data. Can we track the man in the white shirt? Not very well: regression fails to identify that there really are two classes of solution.
[Figure: hand-labeled horizontal coordinate, t, versus feature, x.]