CS479/679 Pattern Recognition Dr. George Bebis

Parameter Estimation: Bayesian Estimation Chapter 3 (Duda et al.) – Sections 3.3-3.7

Bayesian Estimation Assumes that the parameters θ are random variables with some known a priori distribution p(θ). Estimates a full distribution over the parameters rather than making a point estimate as ML does. The BE solution might not be of the parametric form assumed.

The Role of Training Examples in Computing P(ωi/x) If p(x/ωi) and P(ωi) are known, Bayes’ rule allows us to compute the posterior probabilities P(ωi/x). To emphasize the role of the training examples D, we introduce them explicitly into the computation of the posterior probabilities, writing P(ωi/x,D):
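Written out, the two forms of Bayes’ rule referred to above (the standard expressions from Duda et al., Ch. 3, presumably what the slide’s equations showed):

```latex
P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j)\, P(\omega_j)}
\qquad\longrightarrow\qquad
P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D)\, P(\omega_i \mid D)}{\sum_{j=1}^{c} p(x \mid \omega_j, D)\, P(\omega_j \mid D)}
```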

The Role of Training Examples (cont’d) Applying the chain rule to the joint density and marginalizing over the classes in the denominator gives P(ωi/x,D) in terms of p(x/ωi,D) and P(ωi/D); each class-conditional density p(x/ωi,D) is then estimated using only the samples from class i, i.e., p(x/ωi,Di).

The Role of Training Examples (cont’d) The training examples Di can help us determine both the class-conditional densities and the prior probabilities. For simplicity, we replace P(ωi/D) with P(ωi), so that only the class-conditional densities need to be estimated from the data:
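With this simplification, the posterior takes the form below (the standard expression from Duda et al., Ch. 3.3; Di denotes the training samples from class ωi):

```latex
P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j, D_j)\, P(\omega_j)}
```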

Bayesian Estimation (BE) We need to estimate p(x/ωi,Di) for every class ωi. If the samples in Dj (j ≠ i) give no information about θi, this reduces to c independent problems of the following form: “Given D, estimate p(x/D)”

BE Approach Estimate p(x/D) by integrating the joint density over the parameters: p(x/D) = ∫ p(x/θ,D) p(θ/D) dθ. Since p(x/θ,D) = p(x/θ) (x no longer depends on D once θ is given), we obtain the important equation below; it links p(x/D) with p(θ/D).
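The resulting equation, in its standard form (Duda et al., Ch. 3.3):

```latex
p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta
```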

BE Main Steps (1) Compute p(θ/D). (2) Compute p(x/D). The expressions for both steps are given below, where the likelihood p(D/θ) factorizes over the independently drawn samples.
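The two steps written out, as given in Duda et al. (Ch. 3.3); the product form of p(D/θ) follows from the samples being drawn independently:

```latex
\text{(1)}\quad p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta},
\qquad
\text{(2)}\quad p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta,
\qquad
\text{where}\ \ p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)
```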

Interpretation of BE Solution Suppose p(θ/D) peaks sharply at some value θ̂. Then p(x/D) can be approximated by p(x/θ̂), i.e., the best estimate is obtained by simply setting θ = θ̂ (assuming that p(x/θ) is smooth around the peak).

Interpretation of BE Solution (cont’d) If we are less certain about the exact value of θ, we should consider a weighted average of p(x/θ) over the possible values of θ. The samples exert their influence on p(x/D) through p(θ/D).

Relation to ML Solution If p(D/θ) peaks sharply at θ̂, then p(θ/D) will, in general, also peak sharply at θ̂; in that case the BE solution is close to the ML solution, i.e., p(x/D) ≈ p(x/θ̂).

Case 1: Univariate Gaussian, Unknown μ Assume p(x/μ) ~ N(μ, σ²) and p(μ) ~ N(μ0, σ0²), where σ², μ0, and σ0² are known; D = {x1, x2, …, xn} (independently drawn). (1) Compute p(μ/D):

Case 1: Univariate Gaussian, Unknown μ (cont’d) It can be shown that p(μ/D) is a constant times an exponential in μ, i.e., again a Gaussian density, and that it peaks at μn (see below).
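The closed-form result for this case (the standard expressions from Duda et al., Sec. 3.4), with μ̂n denoting the sample mean:

```latex
p(\mu \mid D) \sim N(\mu_n, \sigma_n^2), \qquad
\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0, \qquad
\sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}, \qquad
\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k
```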

Case 1: Univariate Gaussian, Unknown μ (cont’d) μn is a weighted average of the sample mean μ̂n (the ML estimate) and the prior mean μ0, i.e., it lies between them. As n → ∞, μn approaches the ML estimate μ̂n and σn² → 0: more samples imply less uncertainty about μ!

Case 1: Univariate Gaussian, Unknown μ (cont’d)

Case 1: Univariate Gaussian, Unknown μ (cont’d) [Figure: Bayesian learning, showing p(μ/D) for increasing numbers of samples.]

Case 1: Univariate Gaussian, Unknown μ (cont’d) (2) Compute p(x/D) = ∫ p(x/μ) p(μ/D) dμ; after the integration (the remaining factor is not dependent on μ), p(x/D) is again Gaussian, as shown below. As the number of samples increases, p(x/D) converges to p(x/μ).
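The result of the integration (standard form; Duda et al., Sec. 3.4):

```latex
p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu \ \sim\ N(\mu_n,\ \sigma^2 + \sigma_n^2)
```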

Case 2: Multivariate Gaussian, Unknown μ Assume p(x/μ) ~ N(μ, Σ) and p(μ) ~ N(μ0, Σ0), where Σ, μ0, and Σ0 are known; D = {x1, x2, …, xn} (independently drawn). (1) Compute p(μ/D):

Case 2: Multivariate Gaussian, Unknown μ (cont’d) Substituting the expressions for p(xk/μ) and p(μ) shows that p(μ/D) is again Gaussian, p(μ/D) ~ N(μn, Σn), where the parameters are given below.
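The standard multivariate expressions (Duda et al., Sec. 3.4), with μ̂n the sample mean; they reduce to the univariate formulas when d = 1:

```latex
\mu_n = \Sigma_0 \left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1} \hat{\mu}_n
      + \tfrac{1}{n}\Sigma \left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1} \mu_0,
\qquad
\Sigma_n = \Sigma_0 \left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1} \tfrac{1}{n}\Sigma,
\qquad
\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k
```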

Case 2: Multivariate Gaussian (cont’d) (2) Compute p(x/D):
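As in the univariate case, the predictive density is Gaussian (standard result):

```latex
p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu \ \sim\ N(\mu_n,\ \Sigma + \Sigma_n)
```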

Recursive Bayes Learning Develop an incremental learning algorithm: write Dn = {x1, x2, …, xn-1, xn}, so that Dn = Dn-1 ∪ {xn}, and rewrite the likelihood as p(Dn/θ) = p(xn/θ) p(Dn-1/θ).

Recursive Bayes Learning (cont’d) Then, p(θ/Dn) can be written recursively in terms of p(θ/Dn-1) for n = 1, 2, …, starting from p(θ/D0) = p(θ), as shown below.
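The recursive update in its standard form (Duda et al., Sec. 3.5):

```latex
p(\theta \mid D^n) = \frac{p(x_n \mid \theta)\, p(\theta \mid D^{n-1})}{\int p(x_n \mid \theta)\, p(\theta \mid D^{n-1})\, d\theta},
\qquad n = 1, 2, \ldots, \qquad p(\theta \mid D^0) = p(\theta)
```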

Recursive Bayes Learning - Example

Recursive Bayesian Learning (cont’d) After the fourth sample (x4 = 8) is incorporated, the posterior p(θ/D4) is obtained; in general, the same recursive update is applied as each new sample arrives.

Recursive Bayesian Learning (cont’d) [Figure: p(θ/Dn) over the iterations, from the prior p(θ/D0) to p(θ/D4), which peaks at the ML estimate.] The ML estimate is the single value of θ at which p(θ/D4) peaks; the Bayesian estimate instead averages p(x/θ) over the entire posterior rather than using only the peak. (A numerical sketch follows.)
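A numerical sketch of the recursion. It assumes the textbook example from Duda et al. (Sec. 3.5), which this slide appears to follow: p(x/θ) uniform on [0, θ], prior p(θ) uniform on [0, 10], and samples 4, 7, 2, 8 (so x4 = 8); if the lecture used different numbers, only the data and prior lines change.

```python
import numpy as np

# Assumed example (Duda et al., Sec. 3.5): p(x|theta) = U(0, theta), prior p(theta) = U(0, 10).
theta = np.linspace(0.01, 10.0, 2000)            # grid over the parameter
posterior = np.ones_like(theta) / 10.0           # p(theta | D^0) = prior

data = [4.0, 7.0, 2.0, 8.0]                      # x1..x4 (x4 = 8, as on the slide)
for x in data:
    likelihood = np.where(theta >= x, 1.0 / theta, 0.0)  # p(x | theta) for U(0, theta)
    posterior = likelihood * posterior                    # numerator of the recursive update
    posterior /= np.trapz(posterior, theta)               # normalize by the integral

theta_peak = theta[np.argmax(posterior)]         # peak of p(theta | D^4)
print(f"posterior peaks at theta = {theta_peak:.2f}")     # ~8, the largest sample
```

Under these assumptions the final posterior is proportional to 1/θ⁴ on [8, 10], so it peaks at the largest sample, matching the slide’s point that p(θ/D4) peaks at the ML estimate.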

Multiple Peaks For most choices of p(x/θ), p(θ/Dn) will peak strongly at some θ̂ given enough samples; in this case p(x/D) ≈ p(x/θ̂). There might be cases, however, where θ cannot be determined uniquely from p(x/θ); in this case, p(θ/Dn) will contain multiple peaks, and p(x/D) must be obtained by integration: p(x/D) = ∫ p(x/θ) p(θ/D) dθ.

ML vs Bayesian Estimation Number of training data: the two methods are equivalent given an infinite amount of training data (and a prior that does not exclude the true solution); for small training sets, they give different results in most cases. Computational complexity: ML uses differential calculus or gradient search to maximize the likelihood, whereas Bayesian estimation requires complex multidimensional integration techniques.

ML vs Bayesian Estimation (cont’d) Solution complexity: ML solutions are easier to interpret (they must be of the assumed parametric form), whereas a Bayesian estimation solution might not be of the parametric form assumed. Prior distribution: if the prior p(θ) is uniform, the posterior is simply a normalized version of the likelihood, so Bayesian estimation essentially reduces to ML; in general, the two methods give different solutions.

ML vs Bayesian Estimation (cont’d) General comments: There are strong theoretical and methodological arguments supporting Bayesian estimation. In practice, ML estimation is simpler and can lead to comparable performance.

Computational Complexity ML estimation (dimensionality: d, # training data: n, # classes: c). Learning complexity, per class (assuming n > d): sample mean O(dn); sample covariance O(d²n) (it has d(d+1)/2 distinct entries); inverse and log-determinant of the covariance O(d³); prior probability O(n); remaining constant terms O(1). The total is dominated by the O(d²n) covariance estimate, and these computations must be repeated c times! (A sketch with these costs annotated follows.)
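A minimal numpy sketch of the per-class ML learning step, with the dominant costs noted in comments; the function name and structure are illustrative, not taken from the lecture:

```python
import numpy as np

def ml_learn_gaussian(X):
    """ML estimates for one class from an (n, d) matrix X of training samples."""
    n, d = X.shape
    mu = X.mean(axis=0)                      # sample mean: O(dn)
    Xc = X - mu
    sigma = (Xc.T @ Xc) / n                  # sample covariance (d(d+1)/2 distinct entries): O(d^2 n)
    sigma_inv = np.linalg.inv(sigma)         # matrix inverse: O(d^3)
    _, logdet = np.linalg.slogdet(sigma)     # log-determinant: O(d^3)
    return mu, sigma_inv, logdet

# These steps are repeated once per class; for n > d the O(d^2 n) covariance estimate dominates.
```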

Computational Complexity (dimensionality: d, # training data: n, # classes: c). Classification complexity, per class: evaluating the quadratic (Mahalanobis) term of the discriminant costs O(d²); adding the precomputed remaining terms is O(1). These computations must be repeated c times, and the class with the maximum discriminant is selected. Bayesian estimation: higher learning complexity, same classification complexity. (A matching classification sketch follows.)
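A matching classification sketch using the standard Gaussian discriminant; it is a hypothetical helper built on the learning sketch above, not lecture code:

```python
import numpy as np

def classify(x, class_models, log_priors):
    """Assign x to the class with the largest Gaussian discriminant g_i(x).

    class_models is a list of (mu, sigma_inv, logdet) tuples, e.g. as produced
    by the ml_learn_gaussian() sketch above; log_priors holds log P(omega_i)."""
    scores = []
    for (mu, sigma_inv, logdet), log_prior in zip(class_models, log_priors):
        diff = x - mu
        maha = diff @ sigma_inv @ diff                    # quadratic (Mahalanobis) term: O(d^2)
        scores.append(-0.5 * maha - 0.5 * logdet + log_prior)  # remaining terms: O(1)
    return int(np.argmax(scores))                         # repeated c times, then take the max
```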

Main Sources of Error in Classifier Design Bayes error: the error due to overlapping densities p(x/ωi). Model error: the error due to choosing an incorrect model. Estimation error: the error due to incorrectly estimated parameters (e.g., due to a small number of training examples).

Overfitting When the number of training examples is inadequate, the solution obtained might not be optimal. Consider the problem of fitting a curve to some data: the points were selected from a parabola (plus noise); a 10th-degree polynomial fits the training data perfectly but does not generalize well. A greater error on the training data might improve generalization! We need more training examples than model parameters! (A small sketch of this example follows.)
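A small runnable sketch of the curve-fitting example above; the data values and noise level are made up for illustration, since the lecture’s actual numbers are not given in the transcript:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative data: a few noisy samples from a parabola (assumed setup).
x_train = np.linspace(-1, 1, 11)
y_train = 2 * x_train**2 - 1 + 0.1 * rng.standard_normal(x_train.size)
x_test = np.linspace(-1, 1, 101)
y_test = 2 * x_test**2 - 1

for degree in (2, 10):
    coeffs = np.polyfit(x_train, y_train, degree)               # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
# The degree-10 fit drives the training error to ~0 but typically generalizes worse.
```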

Overfitting (cont’d) Control model complexity: assume a diagonal covariance matrix (i.e., uncorrelated features), or use the same covariance matrix for all classes and pool the data. Shrinkage technique: shrink the individual covariance matrices toward a common covariance matrix, and shrink the common covariance matrix toward the identity matrix (see the expressions below).
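The standard shrinkage expressions, as given in Duda et al. (Ch. 3), for 0 < α < 1 and 0 < β < 1, where ni is the number of samples in class i and Σ is the common (pooled) covariance:

```latex
\Sigma_i(\alpha) = \frac{(1-\alpha)\, n_i \Sigma_i + \alpha\, n \Sigma}{(1-\alpha)\, n_i + \alpha\, n},
\qquad
\Sigma(\beta) = (1-\beta)\, \Sigma + \beta\, I
```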