Intro. ANN & Fuzzy Systems Lecture 15. Pattern Classification (I): Statistical Formulation.


Intro. ANN & Fuzzy Systems Lecture 15. Pattern Classification (I): Statistical Formulation

Outline
Statistical Pattern Recognition
– Maximum Posterior Probability (MAP) Classifier
– Maximum Likelihood (ML) Classifier
– K-Nearest Neighbor Classifier
– MLP Classifier

An Example
Consider classifying eggs into 3 categories with labels: medium, large, or jumbo. The classification will be based on the weight (W) and length (L) of each egg.
Decision rules:
1. If W < 10 g and L < 3 cm, then the egg is medium.
2. If W > 20 g and L > 5 cm, then the egg is jumbo.
3. Otherwise, the egg is large.
Three components in a pattern classifier:
– Category (target) label
– Features
– Decision rule
(Figure: eggs plotted in the weight (W) vs. length (L) plane, partitioned by the decision rules.)
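A minimal Python sketch of these decision rules; the thresholds come from the slide, while the function name and test values are only illustrative:

def classify_egg(weight_g, length_cm):
    """Rule-based egg classifier using the slide's three decision rules."""
    if weight_g < 10 and length_cm < 3:
        return "medium"
    if weight_g > 20 and length_cm > 5:
        return "jumbo"
    return "large"

print(classify_egg(8, 2.5))   # medium
print(classify_egg(25, 6.0))  # jumbo
print(classify_egg(15, 4.0))  # large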

Statistical Pattern Classification
The objective of statistical pattern classification is to derive an optimal decision rule from a set of training samples. The decision rule is optimal in the sense that it minimizes a cost function, called the expected risk, of making a classification decision. This is a learning problem!
Assumptions:
1. Features are given. The feature selection problem needs to be solved separately. Training samples are randomly chosen from a population.
2. Target labels are given. Assume each sample is assigned a specific, unique label by nature, and that the labels of the training samples are known.

Pattern Classification Problem
Let X be the feature space, and C = {c(i), 1 ≤ i ≤ M} be the set of M class labels. For each x ∈ X, it is assumed that nature assigns a class label t(x) ∈ C according to some probabilistic rule.
Randomly draw a feature vector x from X:
– P(c(i)) = P(x ∈ c(i)) is the a priori probability that t(x) = c(i) without referring to x.
– P(c(i)|x) = P(x ∈ c(i) | x) is the a posteriori probability that t(x) = c(i) given the value of x.
– P(x|c(i)) = P(x | x ∈ c(i)) is the conditional probability (a.k.a. likelihood function) that x assumes its value given that it is drawn from class c(i).
– P(x) is the marginal probability that x assumes its value without referring to which class it belongs to.
Using Bayes' rule, we have P(x|c(i)) P(c(i)) = P(c(i)|x) P(x). Also, P(x) = Σ_{i=1}^{M} P(x|c(i)) P(c(i)).
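As a quick illustration of these quantities, a small Python sketch that turns assumed priors and likelihood values into posteriors via Bayes' rule; all numbers are made up for illustration:

import numpy as np

prior = np.array([0.6, 0.4])        # P(c(i)), assumed known, M = 2 classes
likelihood = np.array([0.2, 0.5])   # P(x|c(i)) evaluated at one observed x

marginal = np.sum(likelihood * prior)        # P(x) = sum_i P(x|c(i)) P(c(i))
posterior = likelihood * prior / marginal    # P(c(i)|x) by Bayes' rule

print(posterior, posterior.sum())  # posteriors sum to 1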

Decision Function and Probability of Mis-Classification
Given a sample x ∈ X, the objective of statistical pattern classification is to design a decision rule g(x) ∈ C that assigns a label to x. If g(x) = t(x), the naturally assigned class label, then it is a correct classification; otherwise, it is a mis-classification.
Define a 0-1 loss function: L(g(x), t(x)) = 0 if g(x) = t(x), and 1 otherwise.
Given that g(x) = c(i*), the expected loss is E[L | x] = Σ_{i ≠ i*} P(c(i)|x). Hence the probability of mis-classification for the specific decision g(x) = c(i*) is 1 − P(c(i*)|x).
Clearly, to minimize the probability of mis-classification for a given x, the best choice is to choose g(x) = c(i*) if P(c(i*)|x) > P(c(i)|x) for all i ≠ i*.
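For reference, the standard derivation behind the equations reconstructed above can be written out as follows (a textbook Bayes decision theory step, not copied verbatim from the slide):

\begin{aligned}
E[L \mid x,\, g(x) = c(i^*)]
  &= \sum_{i=1}^{M} L\bigl(c(i^*), c(i)\bigr)\, P(c(i)\mid x)
   = \sum_{i \neq i^*} P(c(i)\mid x) \\
  &= 1 - P(c(i^*)\mid x)
   = \Pr\bigl(\text{mis-classification} \mid x,\, g(x) = c(i^*)\bigr).
\end{aligned}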

MAP: Maximum A Posteriori Classifier
The MAP classifier stipulates that the classifier that minimizes the probability of mis-classification should choose g(x) = c(i*) if P(c(i*)|x) > P(c(i)|x) for all i ≠ i*. This is an optimal decision rule. Unfortunately, in real-world applications, it is often difficult to estimate P(c(i)|x) directly.
Fortunately, to derive the optimal MAP decision rule, one can instead estimate a discriminant function G_i(x) such that, for any x ∈ X and i ≠ i*,
G_{i*}(x) > G_i(x) iff P(c(i*)|x) > P(c(i)|x).
G_i(x) can be an approximation of P(c(i)|x) or any function satisfying the above relationship.
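A minimal sketch of the MAP decision rule, assuming the discriminant values G_i(x) (here, estimated posteriors) have already been computed; the function and class names are illustrative:

import numpy as np

def map_decision(discriminants, labels):
    """Pick the class whose discriminant G_i(x) is largest (MAP rule)."""
    return labels[int(np.argmax(discriminants))]

# G_i(x) for one sample, e.g. estimated posteriors P(c(i)|x)
G = np.array([0.1, 0.7, 0.2])
print(map_decision(G, ["medium", "large", "jumbo"]))  # -> "large"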

Maximum Likelihood Classifier
Using Bayes' rule, p(c(i)|x) = p(x|c(i)) p(c(i)) / p(x). Hence the MAP decision rule can be expressed as: g(x) = c(i*) if p(c(i*)) p(x|c(i*)) > p(c(i)) p(x|c(i)) for all i ≠ i*.
If the a priori probabilities are unknown, we may assume p(c(i)) = 1/M. In that case, maximizing the likelihood p(x|c(i)) is equivalent to maximizing the posterior p(c(i)|x).
The likelihood function p(x|c(i)) may assume a uni-variate Gaussian model, that is, p(x|c(i)) ~ N(μ_i, σ_i²). The parameters μ_i and σ_i can be estimated using the samples in {x | t(x) = c(i)}.
The a priori probability p(c(i)) can be estimated as the fraction of training samples that belong to class c(i): p(c(i)) ≈ n_i / n, where n_i is the number of training samples labeled c(i) and n is the total number of training samples.
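A small Python sketch of such a Gaussian ML/MAP classifier for a one-dimensional feature; the training values and class names are invented for illustration and are not from the lecture:

import numpy as np

train = {
    "medium": np.array([8.0, 9.1, 7.5, 8.8]),     # e.g. egg weights in grams
    "jumbo":  np.array([22.0, 24.5, 23.1, 25.0]),
}

# ML estimates of the per-class Gaussian parameters (mu_i, sigma_i)
params = {c: (x.mean(), x.std()) for c, x in train.items()}
# Priors estimated as relative frequencies n_i / n
n_total = sum(len(x) for x in train.values())
priors = {c: len(x) / n_total for c, x in train.items()}

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def classify(x):
    # MAP with Gaussian likelihoods; with equal priors this reduces to ML.
    return max(params, key=lambda c: gaussian_pdf(x, *params[c]) * priors[c])

print(classify(9.0))   # expected: "medium"
print(classify(23.0))  # expected: "jumbo"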

Nearest-Neighbor Classifier
Let {y(1), …, y(n)} ⊂ X be n samples that have already been classified. Given a new sample x, the NN decision rule chooses g(x) = c(i) if the nearest stored sample, y(j*) with j* = arg min_j ||x − y(j)||, is labeled with c(i).
As n → ∞, the probability of mis-classification of the NN classifier is at most twice the probability of mis-classification of the optimal (MAP) classifier.
The k-Nearest Neighbor classifier examines the k nearest classified samples and assigns x to the class of the majority of them.
Implementation problem: it requires large storage to store ALL the training samples.
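A brute-force k-NN sketch along these lines; the stored samples, labels, and query points are illustrative only:

import numpy as np
from collections import Counter

def knn_classify(x, samples, labels, k=3):
    """Classify x by majority vote among its k nearest stored samples."""
    dists = np.linalg.norm(samples - x, axis=1)   # ||x - y(j)||
    nearest = np.argsort(dists)[:k]               # indices of the k nearest
    votes = Counter(labels[j] for j in nearest)
    return votes.most_common(1)[0][0]

samples = np.array([[8, 2.5], [9, 2.8], [22, 5.5], [24, 6.0]])  # (W, L) pairs
labels = ["medium", "medium", "jumbo", "jumbo"]
print(knn_classify(np.array([8.5, 2.6]), samples, labels, k=1))  # NN rule
print(knn_classify(np.array([23.0, 5.8]), samples, labels, k=3))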

MLP Classifier
Each output of an MLP is used to approximate the a posteriori probability P(c(i)|x) directly. The classification decision then amounts to assigning the feature vector to the class whose corresponding MLP output is maximum.
During training, the classification labels (1-of-N encoding) are presented as target values (rather than the true, but unknown, a posteriori probabilities).
Denote y_i(x, W) to be the i-th output of the MLP, and t_i(x) to be the corresponding target value (0 or 1) during training. Then y_i(x, W) will approximate E[t_i(x) | x] = P(c(i)|x).
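A minimal sketch of this idea using scikit-learn's MLPClassifier (an assumption of this example, not something the lecture specifies); the toy (W, L) data are invented, and the softmax outputs are read as estimates of P(c(i)|x):

import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy training data: (W, L) features with class labels, as in the egg example.
X_train = np.array([[8, 2.5], [9, 2.8], [15, 4.0], [16, 4.2], [22, 5.5], [24, 6.0]])
y_train = ["medium", "medium", "large", "large", "jumbo", "jumbo"]

# scikit-learn handles the 1-of-N target encoding internally; the network's
# outputs can then be interpreted as approximations of the posteriors.
mlp = MLPClassifier(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
mlp.fit(X_train, y_train)

x = np.array([[23.0, 5.8]])
posteriors = mlp.predict_proba(x)                   # estimates of P(c(i)|x)
print(posteriors, mlp.classes_[np.argmax(posteriors)])  # pick the max output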