Pattern Classification, Chapter 3
Pattern Classification
All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.

Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (part 1)
• Introduction
• Maximum-Likelihood Estimation
• Example of a Specific Case
• The Gaussian Case: unknown $\mu$ and $\sigma$
• Bias
• Appendix: ML Problem Statement

• Introduction
  • Data availability in a Bayesian framework
    • We could design an optimal classifier if we knew:
      • $P(\omega_i)$ (priors)
      • $P(x \mid \omega_i)$ (class-conditional densities)
      Unfortunately, we rarely have this complete information!
  • Design a classifier from a training sample
    • No problem with prior estimation
    • Samples are often too small for class-conditional estimation (large dimension of feature space!)

• A priori information about the problem
  • Normality of $P(x \mid \omega_i)$: $P(x \mid \omega_i) \sim N(\mu_i, \Sigma_i)$
    • Characterized by 2 parameters: the mean $\mu_i$ and the covariance $\Sigma_i$
• Estimation techniques
  • Maximum-Likelihood (ML) and Bayesian estimation
  • Results are nearly identical, but the approaches are different

• Parameters in ML estimation are fixed but unknown!
• Best parameters are obtained by maximizing the probability of obtaining the samples observed
• Bayesian methods view the parameters as random variables having some known distribution
• In either approach, we use $P(\omega_i \mid x)$ for our classification rule!

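The classification rule based on $P(\omega_i \mid x)$ can be sketched in a few lines. This is only an illustration, assuming univariate Gaussian class-conditional densities whose parameters are already known; the function names and the two-class numbers are hypothetical, not taken from the slides.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Univariate normal density N(mu, sigma2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def posteriors(x, priors, params):
    """Return P(w_i | x) for every class using Bayes' rule.

    priors: list of P(w_i); params: list of (mu_i, sigma2_i) pairs.
    """
    joint = [p * gaussian_pdf(x, mu, s2) for p, (mu, s2) in zip(priors, params)]
    evidence = sum(joint)  # P(x), the normalizing constant
    return [j / evidence for j in joint]

def classify(x, priors, params):
    """Return the index of the class with the largest posterior."""
    post = posteriors(x, priors, params)
    return max(range(len(post)), key=post.__getitem__)

# Hypothetical two-class problem with equal priors (illustrative values only).
priors = [0.5, 0.5]
params = [(0.0, 1.0), (4.0, 1.0)]
```

In practice the `params` entries are exactly what ML or Bayesian estimation has to supply from the training sample.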
• Maximum-Likelihood Estimation
  • Has good convergence properties as the sample size increases
  • Often simpler than alternative techniques
• General principle
  • Assume we have $c$ classes and $P(x \mid \omega_j) \sim N(\mu_j, \Sigma_j)$
  • $P(x \mid \omega_j) \equiv P(x \mid \omega_j, \theta_j)$, where $\theta_j = (\mu_j, \Sigma_j)$ collects the components of the mean vector and covariance matrix of class $\omega_j$

• Use the information provided by the training samples to estimate $\theta = (\theta_1, \theta_2, \dots, \theta_c)$, where each $\theta_i$ ($i = 1, 2, \dots, c$) is associated with one category
• Suppose that $D$ contains $n$ samples, $x_1, x_2, \dots, x_n$, drawn independently, so that
  $$P(D \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta)$$
• The ML estimate of $\theta$ is, by definition, the value $\hat{\theta}$ that maximizes $P(D \mid \theta)$
  "It is the value of $\theta$ that best agrees with the actually observed training sample"

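• Optimal estimation
  • Let $\theta = (\theta_1, \theta_2, \dots, \theta_p)^t$ and let $\nabla_\theta$ be the gradient operator
    $$\nabla_\theta = \left[ \frac{\partial}{\partial \theta_1}, \frac{\partial}{\partial \theta_2}, \dots, \frac{\partial}{\partial \theta_p} \right]^t$$
  • We define $l(\theta)$ as the log-likelihood function: $l(\theta) = \ln P(D \mid \theta)$
  • New problem statement: determine the $\theta$ that maximizes the log-likelihood
    $$\hat{\theta} = \arg\max_{\theta} \, l(\theta)$$
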
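To make the maximization of $l(\theta)$ concrete, here is a minimal sketch that evaluates the log-likelihood on a grid of candidate values. It assumes a univariate Gaussian with known variance and unknown mean; `log_likelihood`, the sample values, and the grid range are all illustrative assumptions rather than anything from the slides.

```python
import math

def log_likelihood(mu, samples, sigma2=1.0):
    """l(mu) = ln P(D | mu) for i.i.d. samples from N(mu, sigma2), sigma2 known."""
    n = len(samples)
    const = -0.5 * n * math.log(2.0 * math.pi * sigma2)
    return const - sum((x - mu) ** 2 for x in samples) / (2.0 * sigma2)

# Hypothetical training sample D (illustrative values only).
samples = [1.2, 0.7, 2.1, 1.6, 0.9]

# Coarse grid search over candidate values of mu in [0, 3], step 0.01.
grid = [i / 100.0 for i in range(0, 301)]
mu_hat = max(grid, key=lambda m: log_likelihood(m, samples))

# For comparison: the arithmetic average of the sample.
sample_mean = sum(samples) / len(samples)
```

The grid maximizer lands on the grid point closest to the sample average. A grid search is used here only for transparency; the slides go on to solve the problem analytically through the gradient.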
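Since the samples are independent, the gradient of the log-likelihood is a sum over the samples:
$$\nabla_\theta l = \sum_{k=1}^{n} \nabla_\theta \ln P(x_k \mid \theta)$$
The set of necessary conditions for an optimum is:
$$\nabla_\theta l = 0$$
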
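• Example of a specific case: unknown $\mu$
  • $P(x_k \mid \mu) \sim N(\mu, \Sigma)$ (samples are drawn from a multivariate normal population)
    $$\ln P(x_k \mid \mu) = -\frac{1}{2} \ln\left[ (2\pi)^d \, |\Sigma| \right] - \frac{1}{2} (x_k - \mu)^t \Sigma^{-1} (x_k - \mu)$$
    $$\nabla_\mu \ln P(x_k \mid \mu) = \Sigma^{-1} (x_k - \mu)$$
  • Here $\theta = \mu$, therefore the ML estimate for $\mu$ must satisfy:
    $$\sum_{k=1}^{n} \Sigma^{-1} (x_k - \hat{\mu}) = 0$$
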
Multiplying by $\Sigma$ and rearranging, we obtain:
$$\hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k$$
Just the arithmetic average of the training samples!
Conclusion: if $P(x_k \mid \omega_j)$ ($j = 1, 2, \dots, c$) is assumed to be Gaussian in a $d$-dimensional feature space, then we can estimate the vector $\theta = (\theta_1, \theta_2, \dots, \theta_c)^t$ and perform an optimal classification!

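The sample-average estimate of the mean vector is easy to check numerically. The sketch below uses plain Python lists for a hypothetical 2-D sample; `ml_mean` and the data are illustrative names and values, not from the slides.

```python
def ml_mean(samples):
    """ML estimate of the mean vector: component-wise average of the samples."""
    n = len(samples)
    d = len(samples[0])
    return [sum(x[i] for x in samples) / n for i in range(d)]

# Hypothetical 2-D training sample (illustrative values only).
samples = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0], [2.0, 2.0]]
mu_hat = ml_mean(samples)

# The residuals x_k - mu_hat sum to zero in every coordinate. This is the
# necessary condition on mu_hat after multiplying through by Sigma.
residual_sum = [sum(x[i] - mu_hat[i] for x in samples) for i in range(len(mu_hat))]
```

Note that the covariance $\Sigma$ does not need to be known to compute this estimate, since it cancels out of the condition.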
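• ML Estimation:
  • Gaussian Case: unknown $\mu$ and $\sigma$ (univariate), with $\theta = (\theta_1, \theta_2) = (\mu, \sigma^2)$
    $$l = \ln P(x_k \mid \theta) = -\frac{1}{2} \ln (2\pi\theta_2) - \frac{1}{2\theta_2} (x_k - \theta_1)^2$$
    $$\nabla_\theta l = \begin{pmatrix} \dfrac{\partial}{\partial \theta_1} \ln P(x_k \mid \theta) \\[2ex] \dfrac{\partial}{\partial \theta_2} \ln P(x_k \mid \theta) \end{pmatrix} = \begin{pmatrix} \dfrac{1}{\theta_2} (x_k - \theta_1) \\[2ex] -\dfrac{1}{2\theta_2} + \dfrac{(x_k - \theta_1)^2}{2\theta_2^2} \end{pmatrix}$$
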
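Summation over the $n$ samples and setting the gradient to zero gives:
$$\sum_{k=1}^{n} \frac{1}{\hat{\theta}_2} (x_k - \hat{\theta}_1) = 0 \qquad (1)$$
$$-\sum_{k=1}^{n} \frac{1}{\hat{\theta}_2} + \sum_{k=1}^{n} \frac{(x_k - \hat{\theta}_1)^2}{\hat{\theta}_2^2} = 0 \qquad (2)$$
Combining (1) and (2), one obtains:
$$\hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})^2$$
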
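As a quick numerical companion to the closed-form result, the following sketch computes the two ML estimates for a univariate sample. The function name `ml_gaussian` and the data are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def ml_gaussian(samples):
    """ML estimates (mu_hat, sigma2_hat) for a univariate Gaussian.

    mu_hat is the sample average; sigma2_hat divides by n (not n - 1).
    """
    n = len(samples)
    mu_hat = sum(samples) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in samples) / n
    return mu_hat, sigma2_hat

# Hypothetical sample: the values sum to 40 and the squared deviations to 32.
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_hat, sigma2_hat = ml_gaussian(samples)
```

With these eight values the estimates come out to a mean of 5 and a variance of 4.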
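• Bias
  • The ML estimate for $\sigma^2$ is biased:
    $$E\left[ \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})^2 \right] = \frac{n-1}{n} \, \sigma^2 \neq \sigma^2$$
  • An elementary unbiased estimator for $\Sigma$ is the sample covariance matrix:
    $$C = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \hat{\mu})(x_k - \hat{\mu})^t$$
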
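The $(n-1)/n$ factor can be verified exactly on a toy problem by enumerating every possible sample instead of simulating. The sketch below is a univariate illustration under an assumed discrete population with three equally likely values; the bias factor does not depend on the population being Gaussian, only on the samples being independent draws with finite variance. All names and values are hypothetical.

```python
from itertools import product

def ml_variance(xs):
    """Variance estimate that divides by n (the ML form)."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / n

def unbiased_variance(xs):
    """Variance estimate that divides by n - 1."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

population = [0.0, 1.0, 2.0]         # hypothetical; each value equally likely
n = 3                                # sample size
true_var = ml_variance(population)   # variance of the distribution itself

# Every ordered sample of size n drawn with replacement is equally likely,
# so averaging over all of them gives the exact expectation of each estimator.
all_samples = list(product(population, repeat=n))
exp_ml = sum(ml_variance(s) for s in all_samples) / len(all_samples)
exp_unbiased = sum(unbiased_variance(s) for s in all_samples) / len(all_samples)
```

Here `exp_ml` equals `(n - 1) / n * true_var` and `exp_unbiased` equals `true_var`, up to floating-point rounding.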
• Appendix: ML Problem Statement
  • Let $D = \{x_1, x_2, \dots, x_n\}$, with $|D| = n$
    $$P(x_1, \dots, x_n \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta)$$
  Our goal is to determine $\hat{\theta}$ (the value of $\theta$ that makes this sample the most representative!)

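[Figure: the training set $D$ ($|D| = n$, samples $x_1, x_2, \dots, x_n$) partitioned into class subsets $D_1, \dots, D_k, \dots, D_c$; the samples $x_j$ in each subset follow that class's density, e.g. $P(x_j \mid \omega_1) \sim N(\mu_j, \Sigma_j)$ for $D_1$ and $P(x_j \mid \omega_k)$ for $D_k$.]
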
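$\theta = (\theta_1, \theta_2, \dots, \theta_c)$
Problem: find $\hat{\theta}$ such that:
$$P(D \mid \hat{\theta}) = \max_{\theta} P(D \mid \theta) = \max_{\theta} P(x_1, \dots, x_n \mid \theta) = \max_{\theta} \prod_{k=1}^{n} P(x_k \mid \theta)$$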
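One way to read this problem statement in code is to solve a separate ML problem for each class. The sketch below assumes labelled univariate samples and assumes that each $\theta_j$ is estimated only from the samples of its own class; `fit_per_class`, the labels, and the data are hypothetical.

```python
from collections import defaultdict

def fit_per_class(samples, labels):
    """Return {label: (mu_hat, sigma2_hat)} using ML estimates within each class."""
    groups = defaultdict(list)
    for x, y in zip(samples, labels):
        groups[y].append(x)          # build the subsets D_1, ..., D_c
    theta = {}
    for y, xs in groups.items():
        n = len(xs)
        mu = sum(xs) / n
        sigma2 = sum((x - mu) ** 2 for x in xs) / n   # ML form: divide by n
        theta[y] = (mu, sigma2)
    return theta

# Hypothetical labelled training set (illustrative values only).
samples = [1.0, 3.0, 2.0, 10.0, 12.0, 14.0]
labels = ["w1", "w1", "w1", "w2", "w2", "w2"]
theta_hat = fit_per_class(samples, labels)
```

The resulting per-class parameters can then be plugged into the class-conditional densities used by the classification rule.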