ECE 8443 – Pattern Recognition
LECTURE 06: MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION
Objectives: Bias in ML Estimates, Bayesian Estimation, Example
Resources: D.H.S.: Chapter 3 (Part 2); Wiki: Maximum Likelihood; M.Y.: Maximum Likelihood Tutorial; J.O.S.: Bayesian Parameter Estimation; J.H.: Euro Coin

ECE 8443: Lecture 06, Slide 1: Gaussian Case: Unknown Mean (Review)

Consider the case where only the mean, $\theta = \mu$, is unknown. The ML solution must satisfy:

$$\sum_{k=1}^{n} \nabla_\theta \ln p(x_k|\theta) = 0$$

which implies:

$$\sum_{k=1}^{n} \frac{1}{\sigma^2}\left(x_k - \hat{\mu}\right) = 0$$

because:

$$\ln p(x_k|\mu) = -\frac{1}{2}\ln\!\left(2\pi\sigma^2\right) - \frac{(x_k-\mu)^2}{2\sigma^2} \quad\Rightarrow\quad \nabla_\mu \ln p(x_k|\mu) = \frac{x_k-\mu}{\sigma^2}$$

ECE 8443: Lecture 06, Slide 2: Gaussian Case: Unknown Mean (Review)

Substituting into the expression for the total likelihood:

$$\sum_{k=1}^{n} \frac{1}{\sigma^2}\left(x_k - \hat{\mu}\right) = 0$$

Rearranging terms:

$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k$$

Significance: the ML estimate of the mean is simply the sample mean of the training data.
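As a sanity check (not part of the original slides), the following minimal sketch maximizes the Gaussian log likelihood over the unknown mean numerically and confirms it matches the sample mean. The data and parameter values are invented for illustration.

```python
# Minimal sketch (assumed data): the ML estimate of a Gaussian mean with
# known variance equals the sample mean.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
sigma = 2.0                                      # known standard deviation
x = rng.normal(loc=5.0, scale=sigma, size=1000)  # samples with true mean 5

def neg_log_likelihood(mu):
    # Negative Gaussian log likelihood as a function of the unknown mean.
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + (x - mu)**2 / (2 * sigma**2))

mu_ml = minimize_scalar(neg_log_likelihood).x
print(mu_ml, x.mean())   # the two agree to numerical precision
```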

ECE 8443: Lecture 06, Slide 3: Gaussian Case: Unknown Mean and Variance (Review)

Let $\theta = [\mu, \sigma^2]^t$. The log likelihood of a SINGLE point is:

$$\ln p(x_k|\theta) = -\frac{1}{2}\ln\!\left(2\pi\theta_2\right) - \frac{(x_k-\theta_1)^2}{2\theta_2}$$

The full likelihood leads to:

$$\nabla_\theta \ln p(x_k|\theta) = \begin{bmatrix} \dfrac{x_k-\theta_1}{\theta_2} \\[1.5ex] -\dfrac{1}{2\theta_2} + \dfrac{(x_k-\theta_1)^2}{2\theta_2^2} \end{bmatrix}, \qquad \sum_{k=1}^{n}\nabla_\theta \ln p(x_k|\theta) = 0$$

ECE 8443: Lecture 06, Slide 4: Gaussian Case: Unknown Mean and Variance (Review)

This leads to these equations:

$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}\left(x_k-\hat{\mu}\right)^2$$

In the multivariate case:

$$\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n} \mathbf{x}_k, \qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{k=1}^{n}\left(\mathbf{x}_k-\hat{\boldsymbol{\mu}}\right)\left(\mathbf{x}_k-\hat{\boldsymbol{\mu}}\right)^t$$

The true covariance is the expected value of the matrix $(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^t$, so the ML estimate simply replaces the expectation with a sample average, which is a familiar result.
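A brief numerical check of the multivariate equations (synthetic data, invented parameters, not from the lecture): fit the ML estimates and compare with the true values. Note the $1/n$ normalization, which differs from np.cov's default $1/(n-1)$.

```python
# Minimal sketch (assumed setup): multivariate ML estimates are the sample
# mean and the biased (1/n) sample covariance.
import numpy as np

rng = np.random.default_rng(1)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=5000)  # n x d data

n = X.shape[0]
mu_hat = X.mean(axis=0)                   # (1/n) sum x_k
centered = X - mu_hat
Sigma_hat = (centered.T @ centered) / n   # (1/n) sum (x_k - mu)(x_k - mu)^t

print(mu_hat)      # close to mu_true
print(Sigma_hat)   # close to Sigma_true
```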

ECE 8443: Lecture 06, Slide 5: Convergence of the Mean (Review)

Does the maximum likelihood estimate of the variance converge to the true value of the variance? Let's start with a few simple results we will need later.

Expected value of the ML estimate of the mean:

$$E[\hat{\mu}] = E\!\left[\frac{1}{n}\sum_{k=1}^{n} x_k\right] = \frac{1}{n}\sum_{k=1}^{n} E[x_k] = \frac{1}{n}\left(n\mu\right) = \mu$$

ECE 8443: Lecture 06, Slide 6: Variance of the ML Estimate of the Mean (Review)

The variance of the estimate is:

$$\mathrm{Var}[\hat{\mu}] = E[\hat{\mu}^2] - \left(E[\hat{\mu}]\right)^2 = E\!\left[\frac{1}{n^2}\sum_{j=1}^{n}\sum_{k=1}^{n} x_j x_k\right] - \mu^2$$

The expected value of $x_j x_k$ will be $\mu^2$ for $j \neq k$, since the two random variables are independent. The expected value of $x_k^2$ will be $\mu^2 + \sigma^2$. Hence, in the summation above, we have $n^2 - n$ terms with expected value $\mu^2$ and $n$ terms with expected value $\mu^2 + \sigma^2$. Thus,

$$\mathrm{Var}[\hat{\mu}] = \frac{1}{n^2}\left[(n^2-n)\mu^2 + n\left(\mu^2+\sigma^2\right)\right] - \mu^2 = \frac{\sigma^2}{n}$$

which implies:

$$E[\hat{\mu}^2] = \mu^2 + \frac{\sigma^2}{n}$$

We see that the variance of the estimate goes to zero as $n$ goes to infinity, and our estimate converges to the true mean (the error goes to zero).
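The $\sigma^2/n$ result is easy to verify empirically. This sketch (all values invented for illustration) repeats the estimation many times for each sample size and compares the empirical variance of the sample mean against the theory.

```python
# Minimal sketch (assumed values): empirically check Var[mu_hat] = sigma^2/n.
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, trials = 0.0, 3.0, 20000

for n in (10, 100, 1000):
    # Each row is one experiment of n samples; take the mean of each row.
    means = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
    print(n, means.var(), sigma**2 / n)   # empirical vs. theoretical
```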

ECE 8443: Lecture 06, Slide 7: Variance Relationships

Now we can combine these results. Recall our expression for the ML estimate of the variance:

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \hat{\mu}\right)^2$$

We will need one more result:

$$\mathrm{Var}[x_i] = E[x_i^2] - \left(E[x_i]\right)^2 = \sigma^2$$

Note that this implies:

$$E[x_i^2] = \sigma^2 + \mu^2$$

ECE 8443: Lecture 06, Slide 8: Covariance Expansion

One more intermediate term to derive:

$$E[x_i\hat{\mu}] = E\!\left[x_i\,\frac{1}{n}\sum_{j=1}^{n} x_j\right] = \frac{1}{n}\sum_{j=1}^{n} E[x_i x_j]$$

Expand the covariance and simplify:

$$E[x_i\hat{\mu}] = \frac{1}{n}\left[(n-1)\mu^2 + \left(\mu^2+\sigma^2\right)\right] = \mu^2 + \frac{\sigma^2}{n}$$
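This intermediate term can also be checked by Monte Carlo simulation. The sketch below (parameters invented for illustration) estimates $E[x_i\hat{\mu}]$ empirically and compares it against $\mu^2 + \sigma^2/n$.

```python
# Minimal sketch (assumed values): Monte Carlo check of the intermediate
# result E[x_i * mu_hat] = mu^2 + sigma^2/n used in the covariance expansion.
import numpy as np

rng = np.random.default_rng(8)
mu, sigma, n, trials = 2.0, 1.5, 8, 400000
x = rng.normal(mu, sigma, size=(trials, n))
mu_hat = x.mean(axis=1)

print((x[:, 0] * mu_hat).mean())   # empirical E[x_1 * mu_hat]
print(mu**2 + sigma**2 / n)        # theoretical value: 4.28125
```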

ECE 8443: Lecture 06, Slide 9: Biased Variance Estimate

Expand the expectation of the ML variance estimate:

$$E[\hat{\sigma}^2] = E\!\left[\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\hat{\mu}\right)^2\right] = \frac{1}{n}\sum_{i=1}^{n}\left(E[x_i^2] - 2E[x_i\hat{\mu}] + E[\hat{\mu}^2]\right)$$

Substitute our previously derived expressions for the second and third terms:

$$E[\hat{\sigma}^2] = \frac{1}{n}\sum_{i=1}^{n}\left[\left(\sigma^2+\mu^2\right) - 2\left(\mu^2+\frac{\sigma^2}{n}\right) + \left(\mu^2+\frac{\sigma^2}{n}\right)\right] = \frac{n-1}{n}\,\sigma^2$$

ECE 8443: Lecture 06, Slide 10: Expectation Simplification

Therefore, the ML estimate is biased:

$$E[\hat{\sigma}^2] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2$$

However, the ML estimate is asymptotically unbiased and converges (its mean squared error goes to zero). An unbiased estimator is:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \hat{\mu}\right)^2$$

These are related by:

$$\hat{\sigma}^2 = \frac{n-1}{n}\,s^2$$

See Burl, AJWills and AWM for excellent examples and explanations of the details of this derivation.
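The bias is easy to see numerically, especially for small $n$. This sketch (invented values) compares the $1/n$ and $1/(n-1)$ estimators averaged over many trials.

```python
# Minimal sketch (assumed values): the ML variance estimate (1/n sum) has
# expectation ((n-1)/n) sigma^2, while the 1/(n-1) estimator is unbiased.
import numpy as np

rng = np.random.default_rng(3)
sigma, n, trials = 2.0, 5, 200000
x = rng.normal(0.0, sigma, size=(trials, n))

var_ml = x.var(axis=1, ddof=0)        # 1/n      -> biased ML estimate
var_unbiased = x.var(axis=1, ddof=1)  # 1/(n-1)  -> unbiased estimate

print(var_ml.mean(), (n - 1) / n * sigma**2)   # ~3.2 for sigma^2 = 4, n = 5
print(var_unbiased.mean(), sigma**2)           # ~4.0
```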

ECE 8443: Lecture 06, Slide 11: Introduction to Bayesian Parameter Estimation

In Chapter 2, we learned how to design an optimal classifier if we knew the prior probabilities, $P(\omega_i)$, and class-conditional densities, $p(x|\omega_i)$. Bayes: treat the parameters as random variables having some known prior distribution. Observation of samples converts this to a posterior. Bayesian learning: sharpen the a posteriori density, causing it to peak near the true value. Supervised vs. unsupervised: do we know the class assignments of the training data? Bayesian estimation and ML estimation produce very similar results in many cases. Reduces statistical inference (prior knowledge or beliefs about the world) to probabilities.

ECE 8443: Lecture 06, Slide 12: Class-Conditional Densities

Posterior probabilities, $P(\omega_i|x)$, are central to Bayesian classification. Bayes formula allows us to compute $P(\omega_i|x)$ from the priors, $P(\omega_i)$, and the likelihood, $p(x|\omega_i)$. But what if the priors and class-conditional densities are unknown? The answer is that we can compute the posterior, $P(\omega_i|x)$, using all of the information at our disposal (e.g., training data). For a training set, $D$, Bayes formula becomes:

$$P(\omega_i|x,D) = \frac{p(x|\omega_i,D)\,P(\omega_i|D)}{\sum_{j=1}^{c} p(x|\omega_j,D)\,P(\omega_j|D)}$$

We assume priors are known: $P(\omega_i|D) = P(\omega_i)$. Also, assume functional independence: the samples $D_i$ have no influence on $p(x|\omega_j,D_j)$ for $i \neq j$. This gives:

$$P(\omega_i|x,D) = \frac{p(x|\omega_i,D_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(x|\omega_j,D_j)\,P(\omega_j)}$$
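A minimal sketch of the resulting classifier (all data and parameters here are invented for illustration): fit each class-conditional density $p(x|\omega_i, D_i)$ from its own training set by ML, then plug the estimates into Bayes formula with known priors.

```python
# Minimal sketch (assumed data): posteriors P(w_i|x, D) from per-class
# Gaussians fit to separate training sets D_i, with priors assumed known.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
D1 = rng.normal(0.0, 1.0, 200)   # training data for class 1
D2 = rng.normal(3.0, 1.5, 200)   # training data for class 2
priors = np.array([0.5, 0.5])    # assumed known P(w_i)

# ML fits to each class-conditional density p(x|w_i, D_i)
params = [(D.mean(), D.std()) for D in (D1, D2)]

def posterior(x):
    likes = np.array([norm.pdf(x, m, s) for m, s in params])
    joint = priors * likes
    return joint / joint.sum()   # Bayes formula with estimated densities

print(posterior(1.2))   # [P(w_1|x, D), P(w_2|x, D)] at x = 1.2
```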

ECE 8443: Lecture 06, Slide 13: The Parameter Distribution

Assume the parametric form of the evidence, $p(x)$, is known: $p(x|\theta)$. Any information we have about $\theta$ prior to collecting samples is contained in a known prior density $p(\theta)$. Observation of samples converts this to a posterior, $p(\theta|D)$, which we hope is peaked around the true value of $\theta$. Our goal is to estimate the parameter vector $\theta$ and, through it, the class-conditional density:

$$p(x|D) = \int p(x|\theta)\,p(\theta|D)\,d\theta$$

This equation links the class-conditional density to the posterior, $p(\theta|D)$. But numerical solutions are typically required! We can write the joint distribution as a product:

$$p(D|\theta) = \prod_{k=1}^{n} p(x_k|\theta)$$

because the samples are drawn independently.
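To make the "numerical solutions" point concrete, here is a sketch that evaluates the integral above on a grid. The Gaussian prior $N(0, 3^2)$ and all data values are assumptions chosen only for illustration.

```python
# Minimal sketch (assumed prior and data): approximate the predictive density
# p(x|D) = integral of p(x|mu) p(mu|D) d(mu) on a discrete grid.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
sigma = 1.0
D = rng.normal(2.0, sigma, size=10)   # observed samples

mu_grid = np.linspace(-3.0, 7.0, 2001)
dmu = mu_grid[1] - mu_grid[0]

# Posterior p(mu|D) proportional to p(D|mu) p(mu); normalize on the grid.
log_post = norm.logpdf(mu_grid, 0.0, 3.0) \
         + np.sum(norm.logpdf(D[:, None], mu_grid, sigma), axis=0)
post = np.exp(log_post - log_post.max())
post /= post.sum() * dmu

# Predictive density p(x|D), evaluated at a few test points.
x_grid = np.linspace(-2.0, 6.0, 9)
p_x_given_D = np.array([np.sum(norm.pdf(xv, mu_grid, sigma) * post) * dmu
                        for xv in x_grid])
print(np.c_[x_grid, p_x_given_D])
```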

ECE 8443: Lecture 06, Slide 14: Univariate Gaussian Case

Case: only the mean is unknown, $p(x|\mu) \sim N(\mu, \sigma^2)$. Known prior density:

$$p(\mu) \sim N(\mu_0, \sigma_0^2)$$

Rationale: once a value of $\mu$ is known, the density for $x$ is completely known. Using Bayes formula:

$$p(\mu|D) = \frac{p(D|\mu)\,p(\mu)}{\int p(D|\mu)\,p(\mu)\,d\mu} = \alpha \prod_{k=1}^{n} p(x_k|\mu)\,p(\mu)$$

$\alpha$ is a normalization factor that depends on the data, $D$.

ECE 8443: Lecture 06, Slide 15: Univariate Gaussian Case

Applying our Gaussian assumptions:

$$p(\mu|D) = \alpha\,\frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\!\left[-\frac{1}{2}\left(\frac{\mu-\mu_0}{\sigma_0}\right)^2\right]\prod_{k=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{1}{2}\left(\frac{x_k-\mu}{\sigma}\right)^2\right]$$

ECE 8443: Lecture 06, Slide 16: Univariate Gaussian Case (Cont.)

Now we need to work this into a simpler form:

$$p(\mu|D) = \alpha' \exp\!\left[-\frac{1}{2}\left(\sum_{k=1}^{n}\left(\frac{x_k-\mu}{\sigma}\right)^2 + \left(\frac{\mu-\mu_0}{\sigma_0}\right)^2\right)\right]$$

$$= \alpha'' \exp\!\left[-\frac{1}{2}\left[\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n}x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]\right]$$

ECE 8443: Lecture 06, Slide 17: Univariate Gaussian Case (Cont.)

$p(\mu|D)$ is an exponential of a quadratic function, which makes it a normal distribution. Because this is true for any $n$, it is referred to as a reproducing density. $p(\mu)$ is referred to as a conjugate prior. Write $p(\mu|D) \sim N(\mu_n, \sigma_n^2)$:

$$p(\mu|D) = \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{1}{2}\left(\frac{\mu-\mu_n}{\sigma_n}\right)^2\right]$$

Expand the quadratic term:

$$p(\mu|D) = \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{1}{2}\left(\frac{\mu^2}{\sigma_n^2} - \frac{2\mu_n\mu}{\sigma_n^2} + \frac{\mu_n^2}{\sigma_n^2}\right)\right]$$

Equate coefficients of our two functions:

$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}, \qquad \frac{\mu_n}{\sigma_n^2} = \frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}$$
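The reproducing-density claim can be verified numerically: multiply a Gaussian prior by a Gaussian likelihood on a grid, renormalize, and the result matches the closed-form normal posterior. All data and parameter values below are assumptions for illustration.

```python
# Minimal sketch (assumed values): grid check that a Gaussian prior times a
# Gaussian likelihood yields exactly the N(mu_n, sigma_n^2) posterior.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
sigma, mu0, sigma0 = 1.0, 0.0, 2.0
x = rng.normal(1.5, sigma, size=20)

grid = np.linspace(-4.0, 4.0, 8001)
log_post = norm.logpdf(grid, mu0, sigma0) \
         + np.sum(norm.logpdf(x[:, None], grid, sigma), axis=0)
post = np.exp(log_post - log_post.max())
post /= post.sum() * (grid[1] - grid[0])   # normalize on the grid

# Closed-form posterior parameters from the equated coefficients above.
n, xbar = len(x), x.mean()
mu_n = (n * sigma0**2 * xbar + sigma**2 * mu0) / (n * sigma0**2 + sigma**2)
sigma_n = np.sqrt(sigma0**2 * sigma**2 / (n * sigma0**2 + sigma**2))

# Maximum pointwise difference should be ~0 (discretization error only).
print(np.abs(post - norm.pdf(grid, mu_n, sigma_n)).max())
```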

ECE 8443: Lecture 06, Slide 18: Univariate Gaussian Case (Cont.)

Rearrange terms so that the dependencies on $\mu$ are clear. Associate terms related to $\mu^2$ and $\mu$:

$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}, \qquad \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\,\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}, \qquad \text{where } \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$$

There is actually a third equation involving terms not related to $\mu$, but we can ignore it since it is not a function of $\mu$ and is a complicated equation to solve.

ECE 8443: Lecture 06, Slide 19: Univariate Gaussian Case (Cont.)

Two equations and two unknowns. Solve for $\mu_n$ and $\sigma_n^2$. First, solve for $\sigma_n^2$:

$$\sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$

Next, solve for $\mu_n$:

$$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\right)\hat{\mu}_n + \left(\frac{\sigma^2}{n\sigma_0^2+\sigma^2}\right)\mu_0$$

Summarizing: the posterior mean, $\mu_n$, is a weighted combination of the sample mean, $\hat{\mu}_n$, and the prior mean, $\mu_0$.
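These closed forms translate directly into a few lines of code. The sketch below (invented data and prior) shows $\mu_n$ moving from the prior mean toward the sample mean, and $\sigma_n^2$ shrinking toward $\sigma^2/n$, as $n$ grows.

```python
# Minimal sketch (assumed values): closed-form Bayesian update for a Gaussian
# mean with known variance sigma^2 and prior N(mu0, sigma0^2).
import numpy as np

def bayes_update_mean(x, sigma2, mu0, sigma0_2):
    n = len(x)
    xbar = np.mean(x)
    sigma_n2 = (sigma0_2 * sigma2) / (n * sigma0_2 + sigma2)
    mu_n = (n * sigma0_2 / (n * sigma0_2 + sigma2)) * xbar \
         + (sigma2 / (n * sigma0_2 + sigma2)) * mu0
    return mu_n, sigma_n2

rng = np.random.default_rng(5)
x = rng.normal(4.0, 1.0, size=50)   # true mean 4, sigma^2 = 1
for n in (1, 5, 50):
    print(n, bayes_update_mean(x[:n], 1.0, 0.0, 1.0))
# mu_n moves from the prior mean (0) toward the sample mean;
# sigma_n^2 approaches sigma^2/n.
```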

ECE 8443: Lecture 06, Slide 20: Bayesian Learning

$\mu_n$ represents our best guess after $n$ samples. $\sigma_n^2$ represents our uncertainty about this guess. $\sigma_n^2$ approaches $\sigma^2/n$ for large $n$: each additional observation decreases our uncertainty. The posterior, $p(\mu|D)$, becomes more sharply peaked as $n$ grows large. This is known as Bayesian learning.

ECE 8443: Lecture 06, Slide 21: "The Euro Coin"

Getting ahead a bit, let's see how we can put these ideas to work on a simple example due to David MacKay, and explained by Jon Hamaker.
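The slide defers the details to MacKay's writeup. As a hedged illustration of the same Bayesian machinery applied to a coin, here is a Beta-Bernoulli sketch; the uniform Beta(1,1) prior is an assumption, and the counts used are the figures commonly quoted for this example (140 heads in 250 spins), not taken from the slide itself.

```python
# Minimal sketch (assumed prior and counts): Beta-Bernoulli posterior for the
# heads probability of the Euro coin. With a Beta(1,1) prior and (h, t)
# observed heads/tails, the posterior is Beta(1 + h, 1 + t).
from scipy.stats import beta

heads, tails = 140, 110            # figures usually cited for this example
posterior = beta(1 + heads, 1 + tails)

print(posterior.mean())            # posterior mean of P(heads), ~0.56
print(posterior.interval(0.95))    # 95% credible interval; does it cover 0.5?
```

Just as with the Gaussian mean, the posterior sharpens as data accumulate, and the question "is the coin fair?" becomes a question about where the posterior mass lies relative to 0.5.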

ECE 8443: Lecture 06, Slide 22: Summary

Review of maximum likelihood parameter estimation in the Gaussian case, with an emphasis on convergence and bias of the estimates. Introduction of Bayesian parameter estimation. The role of the class-conditional distribution in a Bayesian estimate. Estimation of the posterior and probability density function assuming the only unknown parameter is the mean, and the conditional density of the "features" given the mean, $p(x|\mu)$, can be modeled as a Gaussian distribution.