0 Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000 with the permission of the authors and the publisher

1 Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (part 1) Introduction Maximum-Likelihood Estimation Example of a Specific Case Gaussian Case: unknown μ and σ Bias Appendix: ML Problem Statement

2 Data availability in a Bayesian framework: to design an optimal classifier we need P(ωi) (the priors) and P(x | ωi) (the class-conditional densities). Unfortunately, we rarely have this complete information! Instead we design the classifier from a training sample. The priors are easy to estimate, but the samples are often too small to estimate the class-conditional densities (large dimension of feature space!). 1 Introduction

3 Normality of P(x | ωi): assume P(x | ωi) ~ N(μi, Σi), so each class-conditional density is characterized by two parameters. Estimation techniques: Maximum-Likelihood (ML) and Bayesian estimation. The results are nearly identical, but the approaches are conceptually different. 1 A priori information about the problem

4 ML Estimation: parameters are fixed but unknown! Obtain the best parameters by maximizing the probability of the observed samples: θ̂ = argmax_θ P(D | θ). Bayesian methods instead view the parameters as random variables with some known prior distribution and compute the POSTERIOR distribution over them. In either approach, the classification rule uses P(ωi | x). 1 ML vs Bayesian Methods

5 Good convergence properties as the sample size increases, and often simpler than alternative techniques. General principle: assume we have c classes and P(x | ωj) ~ N(μj, Σj), i.e. P(x | ωj) ≡ P(x | ωj, θj) where θj = (μj, Σj). 2 Maximum-Likelihood Estimation

6 Use the training samples to estimate θ = (θ1, θ2, …, θc), where θi is the parameter vector associated with category i (i = 1, 2, …, c). Suppose that D contains n samples, {x1, x2, …, xn}. The ML estimate of θ is, by definition, the value θ̂ that maximizes P(D | θ): "it is the value of θ that best agrees with the actually observed training sample." 2 Details of ML Estimation

8  = (  1,  2, …,  p ) t   = gradient operator l(  ) = ln P(D |  ) is log-likelihood function New problem statement: determine  that maximizes log-likelihood 2 Optimal Estimation

9 ∇θ l = 0 is a necessary condition for an optimum, but it is not sufficient (it may be a local maximum, a saddle point, …); check the second-order conditions. 2 Necessary Conditions for an Optimum

10 Specific case: unknown μ. P(xk | μ) ~ N(μ, Σ) (samples drawn from a multivariate normal population), so θ = μ. The ML estimate for μ must satisfy ∑k Σ⁻¹(xk − μ̂) = 0. 2 Specific case: unknown μ

11 Multiplying by Σ and rearranging gives μ̂ = (1/n) ∑k xk: just the arithmetic average of the training samples! Conclusion: if P(xk | ωj) (j = 1, 2, …, c) is d-dimensional Gaussian, we estimate θ = (μ1, μ2, …, μc)^t and use the resulting densities to perform optimal classification. 2 Specific case: unknown μ (cont'd)

12 Gaussian Case: unknown μ and σ. θ = (θ1, θ2) = (μ, σ²). 2 ML Estimation (unknown μ and σ)

13 Combining (1) and (2), we obtain the ML estimates μ̂ = (1/n) ∑k xk and σ̂² = (1/n) ∑k (xk − μ̂)². 2 Results …
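A minimal NumPy sketch of these closed-form estimates for the univariate case (the function name is just illustrative):

```python
import numpy as np

def ml_gaussian(x):
    """Closed-form ML estimates for a univariate Gaussian sample x."""
    mu_hat = x.mean()                        # (1): arithmetic average of the samples
    sigma2_hat = np.mean((x - mu_hat) ** 2)  # (2): average squared deviation (divides by n)
    return mu_hat, sigma2_hat
```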

14 The ML estimate for σ² is biased: its expected value is ((n − 1)/n) σ² ≠ σ². An elementary unbiased estimator for Σ is the sample covariance C = (1/(n − 1)) ∑k (xk − μ̂)(xk − μ̂)^t. 2 Bias
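A small simulation (synthetic data, deliberately tiny n = 5 so the effect is visible) illustrating that the ML variance estimate underestimates σ² by the factor (n − 1)/n, while dividing by n − 1 does not:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials, true_var = 5, 100_000, 1.0

samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
mu_hat = samples.mean(axis=1, keepdims=True)

sq_dev = ((samples - mu_hat) ** 2).sum(axis=1)
biased   = np.mean(sq_dev / n)        # ML estimate, averaged over trials
unbiased = np.mean(sq_dev / (n - 1))  # sample variance, averaged over trials

print(biased, unbiased)  # biased is close to (n-1)/n * 1.0 = 0.8, unbiased is close to 1.0
```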

15 Let D = {x1, x2, …, xn} with |D| = n. Assuming the samples are drawn independently, P(x1, …, xn | θ) = ∏_{k=1}^{n} P(xk | θ). Goal: determine θ̂, the value of θ that makes this sample the most representative. 2 ML Problem Statement

16 [Figure: the n training samples are grouped into per-class sets D1, …, Dk, …, Dc; the samples in each Dk are drawn from the corresponding density P(x | θk) ~ N(μk, Σk).] 2

17 θ = (θ1, θ2, …, θc). Find θ̂ such that θ̂ maximizes P(D | θ) = ∏_{k=1}^{n} P(xk | θ). 2 Problem Statement

18 Bayesian Decision Theory, Chapter 2: Minimum-Error-Rate Classification; Classifiers, Discriminant Functions, Decision Surfaces; The Normal Density

19 Minimum-Error-Rate Classification: here the actions are decisions on classes. If we take action αi and the true state of nature is ωj, the decision is correct if i = j and in error otherwise. We seek a decision rule that minimizes the probability of error (the error rate).

20 Zero-one loss function: λ(αi | ωj) = 0 if i = j and 1 if i ≠ j. The conditional risk is then R(αi | x) = ∑j λ(αi | ωj) P(ωj | x) = 1 − P(ωi | x): "the risk corresponding to this loss function is the average probability of error."

21 Since R(αi | x) = 1 − P(ωi | x), minimizing the risk means maximizing P(ωi | x). Minimum error rate rule: decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i. Minimum Error Rate
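As a sketch of the rule (the class-conditional density callables and priors are placeholders to be supplied by the user): since the evidence P(x) is the same for every class, comparing P(x | ωi) P(ωi) is enough to find the maximum posterior.

```python
import numpy as np

def decide(x, class_conditionals, priors):
    """Minimum-error-rate rule: pick the class with the largest posterior.

    class_conditionals: list of callables p(x | w_i); priors: list of P(w_i).
    The evidence p(x) is a common factor, so comparing p(x | w_i) * P(w_i) suffices.
    """
    scores = np.array([p(x) * pw for p, pw in zip(class_conditionals, priors)])
    return int(np.argmax(scores))
```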

22 Decision boundary for the zero-one loss: decide ω1 if P(ω1 | x) > P(ω2 | x), which is equivalent to deciding ω1 when the likelihood ratio P(x | ω1)/P(x | ω2) exceeds the threshold P(ω2)/P(ω1).

24 Classifiers, Discriminant Functions and Decision Surfaces. The multi-category case: use a set of discriminant functions gi(x), i = 1, …, c. The classifier assigns feature vector x to class ωi if gi(x) > gj(x) for all j ≠ i.

26 Max Discriminant. Let gi(x) = −R(αi | x) (the maximum discriminant corresponds to the minimum risk!). For minimum error rate, use gi(x) = P(ωi | x) (the maximum discriminant corresponds to the maximum posterior!). Equivalently, gi(x) ∝ P(x | ωi) P(ωi), or gi(x) = ln P(x | ωi) + ln P(ωi) (ln: natural logarithm).
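A sketch of the log form (the log-density callables are assumed supplied by the user). Taking logs leaves the argmax unchanged because ln is monotonic, and it avoids numerical underflow when the densities are tiny, e.g. for high-dimensional Gaussians.

```python
import numpy as np

def classify(x, log_densities, priors):
    """Pick argmax_i g_i(x) with g_i(x) = ln p(x | w_i) + ln P(w_i)."""
    g = np.array([logp(x) + np.log(pw) for logp, pw in zip(log_densities, priors)])
    return int(np.argmax(g))
```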

27 Decision Regions. Divide the feature space into c decision regions: if gi(x) > gj(x) for all j ≠ i, then x is in Ri (Ri means assign x to ωi). Two-category case: the classifier is a "dichotomizer" if it has exactly two discriminant functions g1 and g2. Let g(x) ≡ g1(x) − g2(x); decide ω1 if g(x) > 0, otherwise decide ω2.
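For two categories the rule collapses to the sign of a single function (a sketch; g1 and g2 are whatever discriminants were chosen above):

```python
def dichotomize(x, g1, g2):
    """Two-category rule: decide class 1 when g(x) = g1(x) - g2(x) > 0, else class 2."""
    return 1 if g1(x) - g2(x) > 0 else 2
```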

28 Computing g(x): g(x) = P(ω1 | x) − P(ω2 | x), or equivalently g(x) = ln [P(x | ω1)/P(x | ω2)] + ln [P(ω1)/P(ω2)].

30 Univariate Normal Density: a continuous density that is analytically tractable. Many processes are asymptotically Gaussian; handwritten characters and speech sounds can be viewed as an ideal or prototype corrupted by a random process (central limit theorem). P(x) = (1/(√(2π) σ)) exp[−½ ((x − μ)/σ)²], where μ = mean (or expected value) of x and σ² = expected squared deviation, or variance.
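A one-line NumPy version of the density, for reference:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Univariate normal density: 1/(sqrt(2*pi)*sigma) * exp(-0.5*((x-mu)/sigma)**2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
```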

32 Multivariate Normal Density. The multivariate normal density in d dimensions is P(x) = (1/((2π)^{d/2} |Σ|^{1/2})) exp[−½ (x − μ)^t Σ^{-1} (x − μ)], where: x = (x1, x2, …, xd)^t (t stands for the transpose), μ = (μ1, μ2, …, μd)^t is the mean vector, Σ is the d×d covariance matrix, and |Σ| and Σ^{-1} are its determinant and inverse respectively.
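A direct NumPy transcription of the formula (a sketch: it assumes Σ is nonsingular and inverts it explicitly, which is fine for illustration, though a Cholesky solve is preferable in practice):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density in d dimensions.

    x, mu: length-d vectors; Sigma: d x d covariance matrix (assumed nonsingular).
    """
    d = len(mu)
    diff = x - mu
    maha = diff @ np.linalg.inv(Sigma) @ diff              # squared Mahalanobis distance
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm
```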

33 Bayesian Decision Theory III. Chapter 2 (Sections 2.6, 2.9): Discriminant Functions for the Normal Density; Bayes Decision Theory – Discrete Features

34 Discriminant Functions for the Normal Density. Recall that minimum error-rate classification is achieved by the discriminant function gi(x) = ln P(x | ωi) + ln P(ωi). For the multivariate normal this becomes gi(x) = −½ (x − μi)^t Σi^{-1} (x − μi) − (d/2) ln 2π − ½ ln |Σi| + ln P(ωi).

35 Linear Discriminant Function. Special case: independent variables with constant variance, Σi = σ²·I (I: the identity matrix). The discriminant reduces to gi(x) = wi^t x + wi0, where wi = μi/σ² and wi0 = −(μi^t μi)/(2σ²) + ln P(ωi); wi0 is the threshold (bias) for the i-th category.
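A sketch of the resulting classifier for this special case (the class means, σ², and priors are assumed given, e.g. obtained by the ML estimation of the previous chapter):

```python
import numpy as np

def linear_discriminant_classify(x, means, sigma2, priors):
    """g_i(x) = w_i . x + w_i0 for the case Sigma_i = sigma^2 * I.

    means: list of class mean vectors mu_i; priors: list of P(w_i).
    """
    g = []
    for mu, pw in zip(means, priors):
        w = mu / sigma2                               # w_i = mu_i / sigma^2
        w0 = -(mu @ mu) / (2 * sigma2) + np.log(pw)   # threshold for category i
        g.append(w @ x + w0)
    return int(np.argmax(g))
```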

36 Linear Machine. A classifier that uses linear discriminant functions is called a "linear machine". The decision surfaces for a linear machine are pieces of hyperplanes defined by gi(x) = gj(x).

38 In this case the hyperplane separating Ri and Rj is always orthogonal to the line linking the means.

41 Case Σi = Σ: the covariance matrices of all classes are identical but otherwise arbitrary. The discriminant gi(x) = −½ (x − μi)^t Σ^{-1} (x − μi) + ln P(ωi) is again linear in x, but the hyperplane separating Ri and Rj is generally not orthogonal to the line between the means.
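A sketch of the shared-covariance case (squared Mahalanobis distance to each class mean plus the log prior; the means, common Σ, and priors are assumed given):

```python
import numpy as np

def shared_cov_classify(x, means, Sigma, priors):
    """Case Sigma_i = Sigma: g_i(x) = -0.5*(x-mu_i)^t Sigma^{-1} (x-mu_i) + ln P(w_i)."""
    Sigma_inv = np.linalg.inv(Sigma)
    g = [-0.5 * (x - mu) @ Sigma_inv @ (x - mu) + np.log(pw)
         for mu, pw in zip(means, priors)]
    return int(np.argmax(g))
```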

44 Case Σi = arbitrary: the covariance matrices are different for each category, and the discriminant functions are quadratic in x. The resulting decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, or hyperhyperboloids.
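A sketch of the general case, written as the full Gaussian log-density per class plus the log prior; the constant −(d/2) ln 2π is dropped because it is shared by all classes (parameters are assumed given):

```python
import numpy as np

def quadratic_classify(x, means, covs, priors):
    """Case of arbitrary Sigma_i: quadratic discriminant per class."""
    g = []
    for mu, S, pw in zip(means, covs, priors):
        diff = x - mu
        g.append(-0.5 * diff @ np.linalg.inv(S) @ diff
                 - 0.5 * np.log(np.linalg.det(S))
                 + np.log(pw))
    return int(np.argmax(g))
```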

47 Bayes Decision Theory – Discrete Features. The components of x are binary or integer valued; x can take only one of m discrete values v1, v2, …, vm. Case of independent binary features in a two-category problem: let x = [x1, x2, …, xd]^t where each xi is either 0 or 1, with probabilities pi = P(xi = 1 | ω1) and qi = P(xi = 1 | ω2).

48 The discriminant function in this case is linear: g(x) = ∑_{i=1}^{d} wi xi + w0, with weights wi = ln [pi(1 − qi) / (qi(1 − pi))] and bias w0 = ∑_{i=1}^{d} ln [(1 − pi)/(1 − qi)] + ln [P(ω1)/P(ω2)]; decide ω1 if g(x) > 0 and ω2 otherwise.
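A NumPy sketch of this linear discriminant (p, q, and the priors are assumed estimated beforehand; note the weights blow up if any pi or qi equals 0 or 1, so some smoothing is needed in practice):

```python
import numpy as np

def binary_feature_discriminant(x, p, q, prior1, prior2):
    """g(x) for independent binary features: decide class 1 if g(x) > 0.

    p[i] = P(x_i = 1 | w1), q[i] = P(x_i = 1 | w2); x is a 0/1 vector.
    """
    w = np.log(p * (1 - q) / (q * (1 - p)))                        # per-feature weights
    w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(prior1 / prior2)
    return float(w @ x + w0)
```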