Pattern Classification. Chapter 2 (Part 1): Bayesian Decision Theory (Sections 2.1-2.2) Introduction Bayesian Decision Theory–Continuous Features.


Pattern Classification

Chapter 2 (Part 1): Bayesian Decision Theory (Sections 2.1-2.2)
Introduction
Bayesian Decision Theory – Continuous Features

Introduction
The sea bass / salmon example
State of nature, prior: the state of nature is a random variable
The catch of salmon and sea bass is equiprobable: P(ω1) = P(ω2) (uniform priors)
P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity)

Decision rule with only the prior information:
Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2
Use of the class-conditional information:
P(x | ω1) and P(x | ω2) describe the difference in lightness between populations of sea bass and salmon

Posterior, likelihood, evidence
P(ωj | x) = P(x | ωj) P(ωj) / P(x)   (Bayes rule)
Posterior = (Likelihood × Prior) / Evidence
where, in the case of two categories, the evidence is
P(x) = P(x | ω1) P(ω1) + P(x | ω2) P(ω2)

Intuitive decision rule given the posterior probabilities:
Given x:
if P(ω1 | x) > P(ω2 | x), decide the true state of nature is ω1
if P(ω1 | x) < P(ω2 | x), decide the true state of nature is ω2
Why do this? Whenever we observe a particular x, the probability of error is:
P(error | x) = P(ω1 | x) if we decide ω2
P(error | x) = P(ω2 | x) if we decide ω1

Minimizing the probability of error
Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2
Therefore: P(error | x) = min [P(ω1 | x), P(ω2 | x)]   (Bayes decision)
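
As a minimal sketch of this rule (not from the slides; the Gaussian class-conditional densities and priors are placeholder choices): form the joint P(x | ωj) P(ωj), normalize by the evidence P(x) to obtain the posteriors, and decide for the larger one.

    import numpy as np
    from scipy.stats import norm

    prior = np.array([0.5, 0.5])   # uniform priors, as in the sea bass / salmon example

    def likelihood(x):
        # p(x | w1) and p(x | w2); Gaussian forms with placeholder parameters
        return np.array([norm.pdf(x, loc=4.0, scale=1.0),
                         norm.pdf(x, loc=6.0, scale=1.0)])

    def bayes_decision(x):
        joint = likelihood(x) * prior            # p(x | wj) * P(wj)
        evidence = joint.sum()                   # P(x), the normalizer
        posterior = joint / evidence             # P(wj | x) by Bayes rule
        decision = int(np.argmax(posterior)) + 1 # class with the larger posterior
        p_error = posterior.min()                # P(error | x) = min posterior (two classes)
        return decision, posterior, p_error

    print(bayes_decision(4.8))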

Bayesian Decision Theory – Continuous Features
Generalization of the preceding ideas:
Use of more than one feature
Use of more than two states of nature
Allowing actions, not only deciding on the state of nature
Introducing a loss function that is more general than the probability of error
Allowing actions other than classification primarily allows the possibility of rejection: refusing to make a decision in close or bad cases
The loss function states how costly each action taken is

Bayesian Decision Theory – Continuous Features
Let {ω1, ω2, …, ωc} be the set of c states of nature (or "classes")
Let {α1, α2, …, αa} be the set of a possible actions
Let λ(αi | ωj) be the loss incurred for taking action αi when the state of nature is ωj

What is the expected loss for action αi?
Conditional risk = expected loss. For any given x, the expected loss is
R(αi | x) = Σj λ(αi | ωj) P(ωj | x),  summing over j = 1, …, c

Overall risk
R = the expected loss associated with the decision rule α(x): R = ∫ R(α(x) | x) p(x) dx
Minimizing R requires minimizing the conditional risk R(αi | x) for every x: for each x, select the action αi, i = 1, …, a, with the smallest conditional risk

Select the action αi for which R(αi | x) is minimum
Then the overall risk R is minimized, and R in this case is called the Bayes risk: the best performance that can be achieved!
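
As a sketch of how this framework computes in practice (the loss entries and posteriors below are placeholder values, not from the slides): with a loss matrix λ(αi | ωj) and the posteriors P(ωj | x), the conditional risks are a matrix-vector product and the Bayes action is the one with the smallest risk.

    import numpy as np

    # Loss matrix: rows are actions a_i, columns are states of nature w_j (placeholder values).
    loss = np.array([[0.0, 10.0],    # lambda(a1 | w1), lambda(a1 | w2)
                     [1.0,  0.0]])   # lambda(a2 | w1), lambda(a2 | w2)

    posterior = np.array([0.7, 0.3])          # P(w1 | x), P(w2 | x) for some observed x

    cond_risk = loss @ posterior              # R(a_i | x) = sum_j lambda(a_i | w_j) P(w_j | x)
    best_action = int(np.argmin(cond_risk))   # Bayes decision: minimum conditional risk
    print(cond_risk, "-> take action", best_action + 1)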

Two-Category Classification
α1: deciding ω1; α2: deciding ω2
λij = λ(αi | ωj): loss incurred for deciding ωi when the true state of nature is ωj
Conditional risk:
R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)

Our rule is the following:
If R(α1 | x) < R(α2 | x), action α1: "decide ω1" is taken
This results in the equivalent rule, decide ω1 if:
(λ21 − λ11) P(x | ω1) P(ω1) > (λ12 − λ22) P(x | ω2) P(ω2)
and decide ω2 otherwise

Two-Category Decision Theory: Chopping Machine
α1 = chop; α2 = DO NOT chop
ω1 = NO hand in machine; ω2 = hand in machine
λ11 = λ(α1 | ω1) = $…, λ12 = λ(α1 | ω2) = $…, λ21 = λ(α2 | ω1) = $…, λ22 = λ(α2 | ω2) = $…
Therefore our rule becomes:
(λ21 − λ11) P(x | ω1) P(ω1) > (λ12 − λ22) P(x | ω2) P(ω2)
0.01 P(x | ω1) P(ω1) > P(x | ω2) P(ω2)

Our rule is the following:
If R(α1 | x) < R(α2 | x), action α1: "decide ω1" is taken
This results in the equivalent rule, decide ω1 if:
(λ21 − λ11) P(x | ω1) P(ω1) > (λ12 − λ22) P(x | ω2) P(ω2)
and decide ω2 otherwise

0.01 P(x | ω1) P(ω1) > P(x | ω2) P(ω2): chop only when the evidence for "no hand" outweighs the evidence for "hand in machine" by a factor of 100
α1 = chop; α2 = DO NOT chop
ω1 = NO hand in machine; ω2 = hand in machine

Exercise
Select the optimal decision where:
Ω = {ω1, ω2}
P(x | ω1) ~ N(2, 0.5)   (normal distribution)
P(x | ω2) ~ N(1.5, 0.2)
P(ω1) = 2/3, P(ω2) = 1/3
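
A quick numerical sketch of the exercise, assuming the second parameter of N(·, ·) is the variance (if the slides intend the standard deviation, change the scale arguments accordingly): compare P(x | ω1) P(ω1) with P(x | ω2) P(ω2) over a grid of x and read off where the decision flips.

    import numpy as np
    from scipy.stats import norm

    p1, p2 = 2/3, 1/3

    def lik1(x):
        return norm.pdf(x, loc=2.0, scale=np.sqrt(0.5))   # assumed variance 0.5

    def lik2(x):
        return norm.pdf(x, loc=1.5, scale=np.sqrt(0.2))   # assumed variance 0.2

    x = np.linspace(0.0, 4.0, 2001)
    g = lik1(x) * p1 - lik2(x) * p2        # decide w1 where g(x) > 0, w2 where g(x) < 0
    crossings = x[np.where(np.diff(np.sign(g)) != 0)]
    print("decision boundaries near x =", np.round(crossings, 3))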

Chapter 2 (Part 2): Bayesian Decision Theory (Sections 2.3-2.5)
Minimum-Error-Rate Classification
Classifiers, Discriminant Functions and Decision Surfaces
The Normal Density

Minimum-Error-Rate Classification
Actions are decisions on classes
If action αi is taken and the true state of nature is ωj, then the decision is correct if i = j and in error if i ≠ j
Seek a decision rule that minimizes the probability of error, which is the error rate

Introduction of the zero-one loss function:
λ(αi | ωj) = 0 if i = j, and 1 if i ≠ j,  for i, j = 1, …, c
Therefore, the conditional risk for each action is:
R(αi | x) = Σ_{j ≠ i} P(ωj | x) = 1 − P(ωi | x)
"The risk corresponding to this loss function is the average (or expected) probability of error"

Minimizing the risk requires maximizing P(ωi | x) (since R(αi | x) = 1 − P(ωi | x))
For minimum error rate:
Decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i

Two-Category Classification
α1: deciding ω1; α2: deciding ω2
λij = λ(αi | ωj): loss incurred for deciding ωi when the true state of nature is ωj
Conditional risk:
R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)

Our rule is the following:
If R(α1 | x) < R(α2 | x), action α1: "decide ω1" is taken
This results in the equivalent rule, decide ω1 if:
(λ21 − λ11) P(x | ω1) P(ω1) > (λ12 − λ22) P(x | ω2) P(ω2)
and decide ω2 otherwise

Likelihood ratio:
The preceding rule is equivalent to the following rule:
If P(x | ω1) / P(x | ω2) > [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]
then take action α1 (decide ω1)
Otherwise take action α2 (decide ω2)
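
A small sketch of the likelihood-ratio form of the rule; the losses, priors, and Gaussian class-conditional densities below are placeholder values chosen only for illustration.

    import numpy as np
    from scipy.stats import norm

    # Placeholder losses and priors.
    l11, l12, l21, l22 = 0.0, 2.0, 1.0, 0.0
    prior1, prior2 = 0.6, 0.4

    def decide(x):
        ratio = norm.pdf(x, loc=0.0, scale=1.0) / norm.pdf(x, loc=1.0, scale=1.0)  # p(x|w1)/p(x|w2)
        threshold = ((l12 - l22) / (l21 - l11)) * (prior2 / prior1)
        return "w1" if ratio > threshold else "w2"

    print(decide(0.2), decide(1.5))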

Regions of decision and the zero-one loss function, therefore:
the threshold is θλ = [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]
If λ is the zero-one loss function (λ11 = λ22 = 0, λ12 = λ21 = 1), the threshold reduces to θa = P(ω2) / P(ω1)

Classifiers, Discriminant Functions and Decision Surfaces
The multi-category case
Set of discriminant functions gi(x), i = 1, …, c
The classifier assigns a feature vector x to class ωi if: gi(x) > gj(x) for all j ≠ i

Let gi(x) = −R(αi | x) (maximum discriminant corresponds to minimum risk!)
For the minimum error rate, we take gi(x) = P(ωi | x) (maximum discriminant corresponds to maximum posterior!)
Equivalently, gi(x) = P(x | ωi) P(ωi)
or gi(x) = ln P(x | ωi) + ln P(ωi)   (ln: natural logarithm)

Feature space divided into c decision regions:
if gi(x) > gj(x) for all j ≠ i, then x is in Ri (Ri means assign x to ωi)
The two-category case
A classifier is a "dichotomizer" that has two discriminant functions g1 and g2
Let g(x) ≡ g1(x) − g2(x)
Decide ω1 if g(x) > 0; otherwise decide ω2

The computation of g(x):
g(x) = P(ω1 | x) − P(ω2 | x)
or equivalently g(x) = ln [P(x | ω1) / P(x | ω2)] + ln [P(ω1) / P(ω2)]

On to higher dimensions!

The Normal Density
Univariate density:
Analytically tractable, continuous density
A lot of processes are asymptotically Gaussian; handwritten characters and speech sounds can be viewed as an ideal prototype corrupted by a random process (central limit theorem)
p(x) = (1 / (√(2π) σ)) exp[ −½ ((x − μ) / σ)² ]
where:
μ = mean (or expected value) of x
σ² = expected squared deviation, or variance

Multivariate density
The multivariate normal density in d dimensions is:
p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp[ −½ (x − μ)^t Σ^-1 (x − μ) ]
where:
x = (x1, x2, …, xd)^t   (t stands for the transpose vector form)
μ = (μ1, μ2, …, μd)^t is the mean vector
Σ is the d×d covariance matrix
|Σ| and Σ^-1 are its determinant and inverse, respectively
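
A direct translation of this density into NumPy, as a sketch (in practice scipy.stats.multivariate_normal.pdf computes the same quantity):

    import numpy as np

    def mvn_pdf(x, mu, sigma):
        """Multivariate normal density p(x) for mean vector mu and covariance matrix sigma."""
        d = len(mu)
        diff = x - mu
        norm_const = 1.0 / (np.power(2 * np.pi, d / 2) * np.sqrt(np.linalg.det(sigma)))
        mahalanobis_sq = diff @ np.linalg.inv(sigma) @ diff   # (x - mu)^t Sigma^-1 (x - mu)
        return norm_const * np.exp(-0.5 * mahalanobis_sq)

    mu = np.array([0.0, 0.0])
    sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
    print(mvn_pdf(np.array([0.5, -0.2]), mu, sigma))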

Chapter 2 (Part 3): Bayesian Decision Theory (Sections 2.6, 2.9)
Discriminant Functions for the Normal Density
Bayes Decision Theory – Discrete Features

Discriminant Functions for the Normal Density
We saw that minimum-error-rate classification can be achieved by the discriminant function
gi(x) = ln P(x | ωi) + ln P(ωi)
In the case of a multivariate normal density this becomes:
gi(x) = −½ (x − μi)^t Σi^-1 (x − μi) − (d/2) ln 2π − ½ ln |Σi| + ln P(ωi)

Case Σi = σ²I   (I stands for the identity matrix)
gi(x) = −‖x − μi‖² / (2σ²) + ln P(ωi)
Expanding and dropping the x^t x term, which is the same for every class, gives a linear discriminant:
gi(x) = wi^t x + wi0, with wi = μi / σ² and wi0 = −μi^t μi / (2σ²) + ln P(ωi)

A classifier that uses linear discriminant functions is called a "linear machine"
The decision surfaces for a linear machine are pieces of hyperplanes defined by: gi(x) = gj(x)
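
A sketch of such a linear machine for the Σi = σ²I case (the class means, σ², and priors below are placeholder values): each class contributes one linear discriminant, and the classifier assigns x to the class with the largest value.

    import numpy as np

    # Placeholder class means, shared isotropic variance sigma^2, and priors.
    means = np.array([[0.0, 0.0], [3.0, 1.0], [1.0, 4.0]])
    sigma2 = 1.5
    priors = np.array([0.5, 0.3, 0.2])

    # Linear discriminants g_i(x) = w_i^t x + w_i0 for the Sigma_i = sigma^2 I case.
    W = means / sigma2                                              # w_i = mu_i / sigma^2
    w0 = -np.sum(means * means, axis=1) / (2 * sigma2) + np.log(priors)

    def classify(x):
        g = W @ x + w0                 # all g_i(x) at once
        return int(np.argmax(g)) + 1   # assign x to the class with the largest discriminant

    print(classify(np.array([2.0, 2.0])))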

The hyperplane separating Ri and Rj is given by w^t (x − x0) = 0, with
w = μi − μj and x0 = ½(μi + μj) − [σ² / ‖μi − μj‖²] ln [P(ωi) / P(ωj)] (μi − μj)
It is always orthogonal to the line linking the means!

Case Σi = Σ   (the covariance matrices of all classes are identical but otherwise arbitrary!)
gi(x) = −½ (x − μi)^t Σ^-1 (x − μi) + ln P(ωi), which again yields a linear discriminant
The hyperplane separating Ri and Rj satisfies w^t (x − x0) = 0 with w = Σ^-1 (μi − μj)
Here the hyperplane separating Ri and Rj is generally not orthogonal to the line between the means!

Case Σi = arbitrary
The covariance matrices are different for each category
gi(x) = x^t Wi x + wi^t x + wi0, with Wi = −½ Σi^-1, wi = Σi^-1 μi, and wi0 = −½ μi^t Σi^-1 μi − ½ ln |Σi| + ln P(ωi)
Here the separating surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, or hyperhyperboloids
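
A sketch of the general arbitrary-Σi (quadratic) discriminant, evaluated directly from gi(x) = ln P(x | ωi) + ln P(ωi); the two sets of class parameters are placeholders.

    import numpy as np

    def log_gaussian(x, mu, sigma):
        """ln p(x | w_i) for a multivariate normal with mean mu and covariance sigma."""
        d = len(mu)
        diff = x - mu
        return (-0.5 * diff @ np.linalg.inv(sigma) @ diff
                - 0.5 * d * np.log(2 * np.pi)
                - 0.5 * np.log(np.linalg.det(sigma)))

    # Placeholder parameters for two categories with different covariance matrices.
    params = [
        (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),   # (mu_1, Sigma_1, P(w1))
        (np.array([2.0, 2.0]), np.array([[2.0, 0.5], [0.5, 1.0]]), 0.5),   # (mu_2, Sigma_2, P(w2))
    ]

    def classify(x):
        g = [log_gaussian(x, mu, sigma) + np.log(prior) for mu, sigma, prior in params]
        return int(np.argmax(g)) + 1   # the surfaces between the g_i are hyperquadrics

    print(classify(np.array([1.0, 0.5])))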

Bayes Decision Theory – Discrete Features
Components of x are binary or integer valued; x can take only one of m discrete values v1, v2, …, vm
Case of independent binary features in a two-category problem:
Let x = [x1, x2, …, xd]^t where each xi is either 0 or 1, with probabilities:
pi = P(xi = 1 | ω1)
qi = P(xi = 1 | ω2)

The discriminant function in this case is:
g(x) = Σ_{i=1..d} wi xi + w0
where wi = ln [ pi (1 − qi) / (qi (1 − pi)) ], i = 1, …, d
and w0 = Σ_{i=1..d} ln [ (1 − pi) / (1 − qi) ] + ln [ P(ω1) / P(ω2) ]
Decide ω1 if g(x) > 0 and ω2 otherwise
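
A sketch of this discriminant for independent binary features (the pi, qi, and priors are placeholder values):

    import numpy as np

    # Placeholder per-feature probabilities P(x_i = 1 | w1) and P(x_i = 1 | w2), plus priors.
    p = np.array([0.8, 0.6, 0.3])
    q = np.array([0.2, 0.5, 0.7])
    prior1, prior2 = 0.5, 0.5

    w = np.log(p * (1 - q) / (q * (1 - p)))                       # per-feature weights w_i
    w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(prior1 / prior2)

    def decide(x):
        g = w @ x + w0               # linear discriminant over the binary features
        return "w1" if g > 0 else "w2"

    print(decide(np.array([1, 0, 0])), decide(np.array([0, 1, 1])))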