METU Informatics Institute Min 720 Pattern Classification with Bio-Medical Applications PART 2: Statistical Pattern Classification: Optimal Classification.

Presentation transcript:

METU Informatics Institute Min 720 Pattern Classification with Bio-Medical Applications PART 2: Statistical Pattern Classification: Optimal Classification with Bayes Rule

Statistical Approach to P.R. Dimension of the feature space: d, so a sample is a feature vector x = (x1, ..., xd). Set of different states of nature (categories): w1, ..., wc. Set of possible actions (decisions): α1, ..., αa; here, a decision might include a 'reject option'. A discriminant function gi(x) is associated with each region Ri; decision rule: decide wi if gi(x) > gj(x) for all j ≠ i.

A Pattern Classifier. Our aim now is to define these discriminant functions so that they minimize (or optimize) a chosen criterion.
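As a concrete picture of this structure, here is a minimal Python sketch of a discriminant-based classifier; the two linear discriminants at the end are made-up placeholders, not anything from the lecture:

```python
import numpy as np

def classify(x, discriminants):
    """Assign x to the category whose discriminant g_i(x) is largest.

    discriminants: one callable g_i(x) -> float per category.
    Returns the index of the winning category.
    """
    scores = [g(x) for g in discriminants]
    return int(np.argmax(scores))

# Purely illustrative discriminants for a 2-category, 2-feature problem.
g1 = lambda x: 2.0 * x[0] - x[1] + 1.0
g2 = lambda x: -x[0] + 0.5 * x[1]
print(classify(np.array([1.0, 2.0]), [g1, g2]))  # -> 0 (category 1)
```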

Parametric Approach to Classification. 'Bayes Decision Theory' is used for minimum-error / minimum-risk pattern classifier design. Here it is assumed that a sample drawn from a class wi is a random vector represented by a multivariate probability density function p(x|wi), the 'class-conditional density function'.

p(x|wi), i = 1, ..., c (c is the no. of classes). We also know the a-priori probabilities P(wi). Then we can talk about a decision rule that minimizes the probability of error. Suppose we have the observation x. This observation changes the a-priori assumption into the a-posteriori probability P(wi|x), which can be found by the Bayes rule: P(wi|x) = p(x|wi) P(wi) / p(x).

p(x) can be found by the total probability rule: when the wi's are disjoint, p(x) = ∑_i p(x|wi) P(wi). Decision rule: choose the category with the highest a-posteriori probability, calculated as above using the Bayes rule.
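A short sketch of this computation, assuming the class-conditional densities are available as Python callables and the priors as numbers (names are illustrative):

```python
import numpy as np

def posteriors(x, class_cond_densities, priors):
    """P(w_i | x) by the Bayes rule.

    class_cond_densities: list of callables, the i-th returns p(x | w_i)
    priors: list of a-priori probabilities P(w_i) summing to 1
    """
    joint = np.array([p(x) for p in class_cond_densities]) * np.array(priors)
    evidence = joint.sum()              # p(x) by the total probability rule
    return joint / evidence             # a-posteriori probabilities

def bayes_decision(x, class_cond_densities, priors):
    """Choose the category with the highest a-posteriori probability."""
    return int(np.argmax(posteriors(x, class_cond_densities, priors)))
```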

Then decide w1 if P(w1|x) > P(w2|x), and w2 otherwise. Decision boundary: P(w1|x) = P(w2|x); or, in general, decision boundaries are where gi(x) = gj(x), between regions Ri and Rj.

Single feature: the decision boundary is a point; 2 features: a curve; 3 features: a surface; more than 3: a hypersurface. Sometimes it is easier to work with logarithms: since the logarithmic function is monotonically increasing, the log of a discriminant function gives the same decisions.

2-Category Case: assign x to w1 if P(w1|x) > P(w2|x), and to w2 otherwise. But this is the same as comparing p(x|w1) P(w1) with p(x|w2) P(w2); by throwing away the common p(x)'s we end up with: decide w1 if p(x|w1) P(w1) > p(x|w2) P(w2), which is the same as the likelihood-ratio test: decide w1 if p(x|w1) / p(x|w2) > P(w2) / P(w1).
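The same rule as a tiny Python sketch of the likelihood-ratio test (density callables and priors are assumed inputs):

```python
def likelihood_ratio_decision(x, p1, p2, P1, P2):
    """Decide w1 if p(x|w1)/p(x|w2) > P(w2)/P(w1), otherwise decide w2."""
    return "w1" if p1(x) / p2(x) > P2 / P1 else "w2"
```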

Example: a single-feature, 2-category problem with Gaussian class-conditional densities: diagnosis of diabetes using the sugar count x. w1: state of being healthy; w2: state of being sick (diabetes). The decision rule: decide w1 if p(x|w1) P(w1) > p(x|w2) P(w2), and w2 otherwise.

Assume now specific values for the two means, the variances and the priors, and suppose we measured a given sugar count x. To assign the unknown sample to the correct category, find the likelihood ratio p(x|w1)/p(x|w2) for the measured x, compare it with the threshold P(w2)/P(w1), and assign x to w1 if the ratio exceeds the threshold, otherwise to w2.
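The slide's actual numbers were shown graphically, so here is a sketch with assumed values (means 100 and 140, common standard deviation 15, priors 0.7 and 0.3) purely to illustrate the mechanics:

```python
from scipy.stats import norm

# Assumed parameters, not the ones from the original example.
p_healthy = lambda x: norm.pdf(x, loc=100.0, scale=15.0)   # p(x | w1)
p_sick    = lambda x: norm.pdf(x, loc=140.0, scale=15.0)   # p(x | w2)
P_healthy, P_sick = 0.7, 0.3                               # a-priori probabilities

x = 125.0                                                  # measured sugar count
ratio = p_healthy(x) / p_sick(x)                           # likelihood ratio
threshold = P_sick / P_healthy
print("assign to", "w1 (healthy)" if ratio > threshold else "w2 (diabetes)")
```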

Example: a discrete problem. Consider a 2-feature, 3-category case where the class-conditional densities p(x|w1), p(x|w2), p(x|w3) and the priors P(w1), P(w2), P(w3) are given. Find the decision boundaries and regions.

Remember now that for the 2-class case we decide w1 if the likelihood ratio p(x|w1)/p(x|w2) > P(w2)/P(w1), and w2 otherwise. Error probabilities and a simple proof of minimum error: consider again a 2-class, 1-d problem. Let us show that the probability of error is minimized if the decision boundary is the point xB where p(x|w1) P(w1) = p(x|w2) P(w2) (the intersection point), rather than any arbitrary point.

Then the probability of error, P(error) = ∫_R2 p(x|w1) P(w1) dx + ∫_R1 p(x|w2) P(w2) dx, is minimum. It can very easily be seen that P(error) is minimum if the boundary is placed at the intersection point, since any other boundary only adds extra area to one of the two error integrals.

Minimum Risk Classification. The risk associated with an incorrect decision might be more important than the probability of error, so our decision criterion might be modified to minimize the average risk of making an incorrect decision. We define the conditional risk (expected loss) for decision αi when x occurs as R(αi|x) = ∑_j λ(αi|wj) P(wj|x), where λ(αi|wj) is the conditional loss associated with decision αi when the true class is wj. It is assumed that λ(αi|wj) is known. The decision rule: decide on αi if R(αi|x) < R(αj|x) for all j ≠ i. The discriminant function here can be defined as gi(x) = -R(αi|x).
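A minimal sketch of this rule, assuming the loss matrix and the a-posteriori probabilities are already available (array names are illustrative):

```python
import numpy as np

def min_risk_decision(post, loss):
    """post: a-posteriori probabilities P(w_j | x), shape (c,)
    loss: loss[i, j] = lambda(alpha_i | w_j), shape (number of actions, c)
    Returns the action index minimizing the conditional risk R(alpha_i | x)."""
    risks = loss @ post        # R(alpha_i|x) = sum_j lambda(alpha_i|w_j) P(w_j|x)
    return int(np.argmin(risks))

# With the zero-one loss the rule reduces to the minimum-error decision:
post = np.array([0.2, 0.5, 0.3])
print(min_risk_decision(post, 1.0 - np.eye(3)))   # -> 1, the most probable class
```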

We can show that the minimum-error decision is a special case of the above rule when the zero-one loss is used: λ(αi|wj) = 0 if i = j and 1 if i ≠ j. Then R(αi|x) = ∑_{j≠i} P(wj|x) = 1 - P(wi|x), so the rule is: decide wi if P(wi|x) > P(wj|x) for all j ≠ i.

For the 2-category case, the minimum-risk classifier becomes: decide w1 if (λ21 - λ11) p(x|w1) P(w1) > (λ12 - λ22) p(x|w2) P(w2); otherwise, decide w2. This is the same as the likelihood-ratio rule if λ11 = λ22 and λ12 = λ21.

Discriminant Functions so far. For minimum error: gi(x) = P(wi|x), or equivalently gi(x) = p(x|wi) P(wi). For minimum risk: gi(x) = -R(αi|x), where R(αi|x) = ∑_j λ(αi|wj) P(wj|x).

Bayes (Maximum Likelihood) Decision: the most general optimal solution. It provides a performance limit (you cannot do better with any other rule), so it is useful for comparison with other classifiers.

Special Cases of Discriminant Functions. Multivariate Gaussian (Normal) Density: the general density form is p(x|wi) = (1 / ((2π)^(d/2) |Σi|^(1/2))) exp(-(1/2) (x - Mi)^T Σi^{-1} (x - Mi)). Here x is the feature vector of size d, Mi is the d-element mean vector, and Σi is the d×d covariance matrix; its diagonal element σkk is the variance of feature xk. Σi is symmetric, and σkl = 0 when xk and xl are statistically independent.
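A small sketch that evaluates this density directly from a mean vector and covariance matrix (equivalently one could call scipy.stats.multivariate_normal.pdf):

```python
import numpy as np

def gaussian_density(x, M, Sigma):
    """Multivariate normal density with mean vector M and covariance matrix Sigma."""
    d = len(M)
    diff = x - M
    quad = diff @ np.linalg.inv(Sigma) @ diff              # (x-M)^T Sigma^-1 (x-M)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-0.5 * quad) / norm_const)
```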

General shape: hyperellipsoids, i.e. the surfaces where (x - Mi)^T Σi^{-1} (x - Mi) is constant; this quadratic form is the (squared) Mahalanobis distance, and |Σi| is the determinant of Σi. 2-d problem: x = (x1, x2)^T, Mi = (m1, m2)^T. If σ12 = σ21 = 0 (statistically independent features), then the major axes of the ellipsoids are parallel to the coordinate axes.

If, in addition, σ11 = σ22, the equal-density curves become circular; in general they are hyperellipsoids. From now on ln p(x|wi) is used for gi(x) because of its ease of manipulation: it is a quadratic function of x, as will be shown.

gi(x) = ln [p(x|wi) P(wi)] = -(1/2) (x - Mi)^T Σi^{-1} (x - Mi) - (d/2) ln 2π - (1/2) ln |Σi| + ln P(wi), where the quadratic form (x - Mi)^T Σi^{-1} (x - Mi) is a scalar. Then, on the decision boundary, gi(x) = gj(x).

The decision boundary function gi(x) = gj(x) is hyperquadratic in general. Example in 2-d: with the means and covariance matrices of the two classes substituted in, the above boundary becomes an explicit quadratic equation in x1 and x2.
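A hedged sketch of that discriminant in code (the constant -(d/2) ln 2π is dropped since it is common to all classes):

```python
import numpy as np

def quadratic_discriminant(x, M, Sigma, prior):
    """g_i(x) = -1/2 (x-M)^T Sigma^-1 (x-M) - 1/2 ln|Sigma| + ln P(w_i)."""
    diff = x - M
    quad = diff @ np.linalg.inv(Sigma) @ diff
    return -0.5 * quad - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)

# The decision boundary between classes i and j is the set of x where
# quadratic_discriminant(x, Mi, Si, Pi) == quadratic_discriminant(x, Mj, Sj, Pj).
```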

General form of the hyperquadratic boundary in 2-d. The special cases of the Gaussian: assume Σi = σ²I for all i, where I is the unit (identity) matrix.

Then gi(x) = -||x - Mi||² / (2σ²) + ln P(wi), where ||x - Mi|| is the Euclidean distance between X and Mi (the remaining terms are not a function of X, so they can be removed). Now assume in addition equal priors; then the rule compares only the Euclidean distances between X and the means Mi, and the decision boundary is linear!

Decision rule: assign the unknown sample to the category of the closest mean. The boundary between two categories is the perpendicular bisector of the line joining their means; with unequal priors it moves towards the less probable category.

Minimum Distance Classifier: classify an unknown sample X into the category with the closest mean! It is optimum when the densities are Gaussian with equal variance (Σi = σ²I) and the a-priori probabilities are equal. The boundary is piecewise linear in the case of more than 2 categories.
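A minimal minimum-distance classifier in Python (mean vectors assumed known):

```python
import numpy as np

def min_distance_classify(x, means):
    """Assign x to the category whose mean vector is closest in Euclidean distance."""
    dists = [np.linalg.norm(x - M) for M in means]
    return int(np.argmin(dists))

means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
print(min_distance_classify(np.array([1.0, 0.5]), means))   # -> 0
```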

Another special case: Σi = Σ for all i (the covariance matrices are the same). Samples then fall in clusters of equal size and shape. It can be shown that gi(x) reduces (dropping terms common to all classes) to -(1/2) (x - Mi)^T Σ^{-1} (x - Mi) + ln P(wi); the quantity (x - Mi)^T Σ^{-1} (x - Mi) between the unknown sample x and Mi is called the Mahalanobis distance.

Then, if P(wi) = P(wj), the decision rule becomes: decide wj if (the Mahalanobis distance of the unknown sample to Mi) > (the Mahalanobis distance of the unknown sample to Mj). If P(wi) ≠ P(wj), the boundary moves toward the less probable category.
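A short sketch of this equal-covariance rule, with the prior term included so the shift toward the less probable class is visible:

```python
import numpy as np

def mahalanobis_sq(x, M, Sigma_inv):
    """Squared Mahalanobis distance (x-M)^T Sigma^-1 (x-M)."""
    diff = x - M
    return diff @ Sigma_inv @ diff

def equal_covariance_classify(x, means, Sigma, priors):
    """Decide the class minimizing 0.5 * Mahalanobis^2 - ln P(w_i)."""
    Sigma_inv = np.linalg.inv(Sigma)
    scores = [0.5 * mahalanobis_sq(x, M, Sigma_inv) - np.log(P)
              for M, P in zip(means, priors)]
    return int(np.argmin(scores))
```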

Binary Random Variables. Discrete features: the features can take only discrete values, so integrals are replaced by summations. Binary features: each xi is 0 or 1. Assume the binary features are statistically independent; then P(X|wj) = ∏_i P(xi|wj), where each xi is binary.

Binary Random Variables. Example: a bit-matrix for machine-printed characters, where each cell is a pixel. Here, each pixel may be taken as a feature. For the above problem we have pij = P(xi = 1 | wj), the probability that pixel i is 1 for letter A, B, ...

P(xi|wj) = pij^xi (1 - pij)^(1 - xi), defined for xi = 0 or 1 and undefined elsewhere. If statistical independence of the features is assumed, P(X|wj) = ∏_i pij^xi (1 - pij)^(1 - xi). Consider the 2-category problem; assume pi = P(xi = 1 | w1) and qi = P(xi = 1 | w2).

Then the discriminant g(X) = ∑_i wi xi + w0 is a weighted sum of the inputs, where wi = ln[ pi (1 - qi) / (qi (1 - pi)) ] and w0 = ∑_i ln[(1 - pi) / (1 - qi)] + ln[P(w1) / P(w2)]. So if g(X) > 0 decide w1, otherwise decide w2; the decision boundary g(X) = 0 is linear in X.
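A sketch of this linear classifier for independent binary features, with the weights computed exactly as above (pi, qi and the priors are assumed inputs):

```python
import numpy as np

def binary_feature_classifier(x, p, q, P1, P2):
    """Two-category classifier for statistically independent binary features.

    x: binary feature vector (0/1 entries)
    p: p[i] = P(x_i = 1 | w1),  q: q[i] = P(x_i = 1 | w2)
    """
    x, p, q = np.asarray(x), np.asarray(p), np.asarray(q)
    w = np.log(p * (1 - q) / (q * (1 - p)))                   # weights w_i
    w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)  # bias w_0
    g = w @ x + w0                                            # weighted sum of inputs
    return "w1" if g > 0 else "w2"
```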