Pattern Classification
All materials in these slides* were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart, and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.
* Some slides are modified
Maximum-Likelihood and Bayesian Parameter Estimation (Part 2)
Reading Assignment
R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., John Wiley & Sons, 2001. Chapter 3, Sections 4, 5, and 7.
Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (Part 2)
- Bayesian Parameter Estimation: Gaussian Case
- Bayesian Parameter Estimation: General Estimation
- Problems of Dimensionality
- Computational Complexity
The Parameter Distribution
Goal: estimate $p(x)$ by computing $p(x \mid D)$. The desired class-conditional density $p(x \mid D)$ is linked to the posterior density $p(\theta \mid D)$ by

$$p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta$$
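When this integral has no closed form it can be approximated numerically. A minimal grid-based sketch (the function and its argument names are hypothetical, not from the book):

```python
import numpy as np

def predictive_density(x, theta_grid, posterior, likelihood):
    """Approximate p(x|D) = integral p(x|theta) p(theta|D) dtheta,
    with the posterior p(theta|D) tabulated on a grid of theta values."""
    return np.trapz(likelihood(x, theta_grid) * posterior, theta_grid)
```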
Bayesian Parameter Estimation: Gaussian Case
The univariate case: $p(x \mid \mu) \sim N(\mu, \sigma^2)$, where $\mu$ is the only unknown parameter and the prior is $p(\mu) \sim N(\mu_0, \sigma_0^2)$ ($\mu_0$ and $\sigma_0$ are known!). Therefore, we need to find:

$$p(\mu \mid D)$$
Let $D = \{x_1, x_2, \ldots, x_n\}$. By Bayes' formula and the independence of the samples:

$$p(\mu \mid D) = \frac{p(D \mid \mu)\, p(\mu)}{\int p(D \mid \mu)\, p(\mu)\, d\mu} = \alpha \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu),$$

where $\alpha$ is a normalization factor that does not depend on $\mu$.
Reproducing density: because both the likelihood and the prior are Gaussian, the posterior is again Gaussian, $p(\mu \mid D) \sim N(\mu_n, \sigma_n^2)$, with

$$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\bar{x}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0, \qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2},$$

where $\bar{x}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$ is the sample mean.
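A minimal NumPy sketch of these update equations (the function name and argument names are illustrative, not from the book):

```python
import numpy as np

def posterior_params(x, mu0, sigma0_sq, sigma_sq):
    """Posterior N(mu_n, sigma_n_sq) over the Gaussian mean mu,
    given samples x, prior N(mu0, sigma0_sq), and known data
    variance sigma_sq."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()
    denom = n * sigma0_sq + sigma_sq
    mu_n = (n * sigma0_sq * xbar + sigma_sq * mu0) / denom
    sigma_n_sq = sigma0_sq * sigma_sq / denom
    return mu_n, sigma_n_sq
```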
The univariate case: $p(x \mid D)$. With $p(\mu \mid D)$ computed, $p(x \mid D)$ remains to be computed:

$$p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu,$$

which is equal to a Gaussian, $p(x \mid D) \sim N(\mu_n, \sigma^2 + \sigma_n^2)$: the predictive mean is $\mu_n$, and the known variance $\sigma^2$ is inflated by the remaining uncertainty $\sigma_n^2$ about $\mu$.
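Continuing the sketch above (reusing posterior_params), the predictive density is a Gaussian pdf with the widened variance; scipy.stats.norm supplies the pdf, and the sample numbers are made up:

```python
import numpy as np
from scipy.stats import norm

def predictive_pdf(x, mu_n, sigma_n_sq, sigma_sq):
    """p(x | D) ~ N(mu_n, sigma_sq + sigma_n_sq)."""
    return norm.pdf(x, loc=mu_n, scale=np.sqrt(sigma_sq + sigma_n_sq))

# Made-up example: prior N(0, 1), known data variance 2.0.
mu_n, sigma_n_sq = posterior_params([1.2, 0.7, 1.9, 1.1],
                                    mu0=0.0, sigma0_sq=1.0, sigma_sq=2.0)
density_at_1 = predictive_pdf(1.0, mu_n, sigma_n_sq, sigma_sq=2.0)
```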
It provides the desired class-conditional density $p(x \mid D_j, \omega_j)$. Therefore, using $p(x \mid D_j, \omega_j)$ together with the priors $P(\omega_j)$ and Bayes' formula, we obtain the Bayesian classification rule:

$$\text{choose } \omega_j \text{ that maximizes } p(x \mid \omega_j, D_j)\, P(\omega_j).$$
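A sketch of the resulting classifier for univariate Gaussian classes; the parameter layout and names are illustrative assumptions:

```python
import numpy as np

def classify(x, class_params, class_priors):
    """Choose the class j maximizing p(x | w_j, D_j) * P(w_j).
    class_params[j] = (mu_n, sigma_n_sq, sigma_sq) for class j."""
    scores = []
    for (mu_n, sigma_n_sq, sigma_sq), prior in zip(class_params, class_priors):
        var = sigma_sq + sigma_n_sq  # predictive variance of p(x | w_j, D_j)
        pdf = np.exp(-(x - mu_n) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
        scores.append(pdf * prior)
    return int(np.argmax(scores))
```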
The multivariate case: $p(x \mid \mu) \sim N(\mu, \Sigma)$ with $\Sigma$ known and prior $p(\mu) \sim N(\mu_0, \Sigma_0)$. The posterior is again Gaussian, $p(\mu \mid D) \sim N(\mu_n, \Sigma_n)$, with

$$\mu_n = \Sigma_0 \left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1} \bar{x}_n + \tfrac{1}{n}\Sigma \left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1} \mu_0, \qquad \Sigma_n = \Sigma_0 \left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1} \tfrac{1}{n}\Sigma,$$

and the predictive density is $p(x \mid D) \sim N(\mu_n, \Sigma + \Sigma_n)$.
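The matrix form of these updates in NumPy; np.linalg.solve is used instead of an explicit matrix inverse, and the names are illustrative:

```python
import numpy as np

def posterior_params_mv(X, mu0, Sigma0, Sigma):
    """Posterior N(mu_n, Sigma_n) over the mean, for p(x|mu) ~ N(mu, Sigma)
    with Sigma known and prior p(mu) ~ N(mu0, Sigma0).  X has shape (n, d)."""
    n = X.shape[0]
    xbar = X.mean(axis=0)
    A = Sigma0 + Sigma / n  # Sigma0 + (1/n) Sigma
    mu_n = Sigma0 @ np.linalg.solve(A, xbar) + (Sigma / n) @ np.linalg.solve(A, mu0)
    Sigma_n = Sigma0 @ np.linalg.solve(A, Sigma / n)
    return mu_n, Sigma_n
```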
Bayesian Parameter Estimation: General Theory
The computation of $p(x \mid D)$ can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are:
- The form of $p(x \mid \theta)$ is assumed known, but the value of $\theta$ is not known exactly.
- Our knowledge about $\theta$ is assumed to be contained in a known prior density $p(\theta)$.
- The rest of our knowledge about $\theta$ is contained in a set $D$ of $n$ samples $x_1, x_2, \ldots, x_n$ drawn independently according to $p(x)$.
The basic problem is: "compute the posterior density $p(\theta \mid D)$", then "derive $p(x \mid D)$". Using Bayes' formula, we have:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta},$$

and by the independence assumption:

$$p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta).$$
Bayesian Parameter Estimation: Incremental Learning
To indicate explicitly the number of samples in a set for a single category, we shall write $D^n = \{x_1, \ldots, x_n\}$. Then, if $n > 1$:

$$p(D^n \mid \theta) = p(x_n \mid \theta)\, p(D^{n-1} \mid \theta).$$

Using Bayes' formula, we get the recursion

$$p(\theta \mid D^n) = \frac{p(x_n \mid \theta)\, p(\theta \mid D^{n-1})}{\int p(x_n \mid \theta)\, p(\theta \mid D^{n-1})\, d\theta},$$

with $p(\theta \mid D^0) = p(\theta)$.
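A grid-based sketch of this recursion for a scalar parameter; the discretization and trapezoidal normalization are illustrative choices, not the book's method:

```python
import numpy as np

def incremental_posterior(samples, theta_grid, prior, likelihood):
    """Update p(theta | D^n) one sample at a time on a discrete grid.
    likelihood(x, theta_grid) must return p(x | theta) on the grid."""
    post = prior / np.trapz(prior, theta_grid)   # p(theta | D^0) = p(theta)
    for x in samples:
        post = likelihood(x, theta_grid) * post  # p(x_n|theta) p(theta|D^{n-1})
        post /= np.trapz(post, theta_grid)       # normalize
    return post
```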
Problems of Dimensionality
Problems involving 50 or 100 features (binary valued) are common. Classification accuracy depends upon the dimensionality and the amount of training data.
Case of two classes, multivariate normal with the same covariance: $p(x \mid \omega_j) \sim N(\mu_j, \Sigma)$, $j = 1, 2$, with equal priors. The Bayes error is

$$P(e) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\, du, \qquad r^2 = (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2),$$

where $r^2$ is the squared Mahalanobis distance between the class means; $P(e)$ decreases as $r$ grows.
If the features are independent, then $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2)$ and

$$r^2 = \sum_{i=1}^{d} \left(\frac{\mu_{1i} - \mu_{2i}}{\sigma_i}\right)^2.$$

The most useful features are the ones for which the difference between the means is large relative to the standard deviation. Each independent feature can only increase $r$, yet it has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance: we have the wrong model!
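A short sketch computing $r$ and the corresponding error for independent features; math.erfc expresses the Gaussian tail integral, and the sample numbers are made up:

```python
import math

def bayes_error_indep(mu1, mu2, sigma):
    """r^2 = sum_i ((mu1_i - mu2_i) / sigma_i)^2, and
    P(e) = (1/sqrt(2 pi)) * integral from r/2 to inf of exp(-u^2/2) du."""
    r = math.sqrt(sum(((a - b) / s) ** 2
                      for a, b, s in zip(mu1, mu2, sigma)))
    return 0.5 * math.erfc(r / (2.0 * math.sqrt(2.0)))

# Adding a feature with well-separated means increases r^2 and lowers P(e):
print(bayes_error_indep([0, 0], [1, 2], [1, 1]))            # r^2 = 5
print(bayes_error_indep([0, 0, 0], [1, 2, 1], [1, 1, 1]))   # r^2 = 6
```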
Computational Complexity
Our design methodology is affected by the computational difficulty.
"Big oh" notation: $f(x) = O(h(x))$, read "big oh of $h(x)$", if there exist constants $c$ and $x_0$ such that $|f(x)| \le c\,|h(x)|$ for all $x > x_0$. (An upper bound: $f(x)$ grows no worse than $h(x)$ for sufficiently large $x$!)
Example: for $f(x) = 2 + 3x + 4x^2$ and $g(x) = x^2$, we have $f(x) = O(x^2)$.
"Big oh" is not unique! For the same $f$: $f(x) = O(x^2)$, $f(x) = O(x^3)$, and $f(x) = O(x^4)$ all hold.
"Big theta" notation: $f(x) = \Theta(h(x))$ if there exist constants $x_0$, $c_1$, and $c_2$ such that $c_1 h(x) \le f(x) \le c_2 h(x)$ for all $x > x_0$. For the example above, $f(x) = \Theta(x^2)$ but $f(x) \ne \Theta(x^3)$.
Complexity of the ML Estimation
Gaussian priors in $d$ dimensions, and a classifier with $n$ training samples for each of $c$ classes. For each category, we have to compute the discriminant function

$$g(x) = -\frac{1}{2}(x - \hat{\mu})^T \hat{\Sigma}^{-1} (x - \hat{\mu}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln |\hat{\Sigma}| + \ln P(\omega).$$

Estimating $\hat{\mu}$ costs $O(dn)$ and estimating $\hat{\Sigma}$ costs $O(d^2 n)$, which dominates: total per class $= O(d^2 n)$. Total for $c$ classes $= O(c\,d^2 n) = O(d^2 n)$, since $c$ is a small constant. The cost increases when $d$ and $n$ are large!
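A sketch of where those costs arise when fitting one class by maximum likelihood (NumPy; illustrative only):

```python
import numpy as np

def ml_fit_gaussian(X):
    """ML estimates for one class; X has shape (n, d)."""
    n, d = X.shape
    mu_hat = X.mean(axis=0)                   # O(dn)
    Xc = X - mu_hat
    Sigma_hat = Xc.T @ Xc / n                 # O(d^2 n): dominant learning cost
    Sigma_inv = np.linalg.inv(Sigma_hat)      # O(d^3), computed once
    _, logdet = np.linalg.slogdet(Sigma_hat)  # O(d^3), computed once
    return mu_hat, Sigma_inv, logdet
```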
Overfitting
Frequently, the number of available samples is inadequate. How should we proceed? Possible solutions (the last two are sketched in code below):
- Reduce the dimensionality: redesign the feature extractor, or use a subset of the features.
- Assume all $c$ classes share the same covariance matrix, and pool the available data.
- Assume statistical independence of the features: all off-diagonal elements of the covariance matrix are zero.
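Minimal sketches of the last two options, under illustrative assumptions about the data layout:

```python
import numpy as np

def pooled_covariance(class_data):
    """Share one covariance across classes: center each class's
    samples on its own mean, then pool everything."""
    centered = [X - X.mean(axis=0) for X in class_data]
    Z = np.vstack(centered)
    return Z.T @ Z / Z.shape[0]

def diagonal_covariance(X):
    """Assume independent features: keep only per-feature variances
    (all off-diagonal covariance terms set to zero)."""
    return np.diag(X.var(axis=0))
```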
Overfitting
A simpler classifier can perform better than one that overfits the data. Why? The training data (black dots) were selected from a quadratic function plus Gaussian noise. The 10th-degree polynomial shown fits the data perfectly, but we desire instead the second-order function $f(x)$, since it would lead to better predictions for new samples.