1 E. Fatemizadeh Statistical Pattern Recognition

2 PATTERN RECOGNITION  Typical application areas: machine vision, character recognition (OCR), computer-aided diagnosis, speech recognition, face recognition, biometrics, image database retrieval, data mining, bioinformatics.  The task: assign unknown objects (patterns) to the correct class. This is known as classification.

3  Features: measurable quantities obtained from the patterns; the classification task is based on their respective values.  Feature vectors: a number of features x1, ..., xℓ constitute the feature vector x = [x1, ..., xℓ]^T ∈ R^ℓ. Feature vectors are treated as random vectors.

4 An example:

5  The classifier consists of a set of functions whose values, computed at x, determine the class to which the corresponding pattern belongs.  Classification system overview: patterns → sensor → feature generation → feature selection → classifier design → system evaluation.

6  Supervised vs. unsupervised pattern recognition: the two major directions.  Supervised: patterns whose class is known a priori are used for training.  Unsupervised: the number of classes is (in general) unknown and no training patterns are available.

7 CLASSIFIERS BASED ON BAYES DECISION THEORY  Statistical nature of feature vectors: x = [x1, ..., xℓ]^T is treated as a random vector.  Assign the pattern represented by feature vector x to the most probable of the available classes ω1, ..., ωM; that is, assign x to the ωi for which P(ωi|x) is maximum.

8  Computation of a posteriori probabilities  Assume known a priori probabilities P(ω1), ..., P(ωM) and the class-conditional pdfs p(x|ωi), i = 1, ..., M; p(x|ωi) is also known as the likelihood of ωi with respect to x.

9  The Bayes rule (M = 2): P(ωi|x) = p(x|ωi)P(ωi) / p(x), where p(x) = Σ_{i=1,2} p(x|ωi)P(ωi).

10  The Bayes classification rule (for two classes, M = 2)  Given x, classify it according to the rule: if P(ω1|x) > P(ω2|x), decide ω1; if P(ω1|x) < P(ω2|x), decide ω2.  Equivalently: classify according to the rule p(x|ω1)P(ω1) ≷ p(x|ω2)P(ω2).  For equiprobable classes (P(ω1) = P(ω2) = 1/2) the test becomes p(x|ω1) ≷ p(x|ω2).
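A minimal Python sketch of this two-class rule (my illustration, not from the slides), assuming univariate Gaussian class-conditional densities; the function names, means, variances, and priors are all hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density, standing in for the class-conditional p(x|w_i)."""
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def bayes_classify(x, priors, means, variances):
    """Two-class Bayes rule: decide the class with the larger P(w_i) p(x|w_i)."""
    scores = [priors[i] * gaussian_pdf(x, means[i], variances[i]) for i in (0, 1)]
    return int(np.argmax(scores))  # 0 -> omega_1, 1 -> omega_2

# For equiprobable classes the rule reduces to comparing the likelihoods.
print(bayes_classify(0.4, priors=[0.5, 0.5], means=[0.0, 1.0], variances=[0.25, 0.25]))
```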

11 (Figure: the weighted class-conditional pdfs p(x|ω1)P(ω1) and p(x|ω2)P(ω2) of a scalar feature x; the threshold x0 partitions the feature line into regions R1 and R2.)

12  Equivalently, in words: divide the feature space into two regions, R1 (decide ω1) and R2 (decide ω2).  Probability of error: the total shaded area, P_e = ∫_{R2} p(x|ω1)P(ω1) dx + ∫_{R1} p(x|ω2)P(ω2) dx.  The Bayesian classifier is OPTIMAL with respect to minimizing the classification error probability.

13  Indeed: moving the threshold away from x0, the total shaded area INCREASES by the extra “grey” area.

14 Minimizing Classification Error Probability  Our claim: the Bayesian classifier is optimal with respect to minimizing the classification error probability.

15 Minimizing Classification Error Probability  R1 and R2 form a partition of the feature space (their union is the whole space). Hence P_e = P(x ∈ R2, ω1) + P(x ∈ R1, ω2) = ∫_{R2} p(x|ω1)P(ω1) dx + ∫_{R1} p(x|ω2)P(ω2) dx, which is minimized when each x is assigned to the region whose class gives the larger P(ωi)p(x|ωi).

16  The Bayes classification rule for many (M > 2) classes:  Given x, classify it to ωi if P(ωi|x) > P(ωj|x) for all j ≠ i.  Such a choice also minimizes the classification error probability.  Minimizing the average risk  For each wrong decision a penalty term is assigned, since some decisions are more sensitive (costly) than others.

17  For M = 2, define the loss matrix L = [λ11 λ12; λ21 λ22]: λ21 is the penalty term for deciding class ω2 although the pattern belongs to ω1, etc.  Risk with respect to ω1: r1 = λ11 ∫_{R1} p(x|ω1) dx + λ12 ∫_{R2} p(x|ω1) dx.

18  Risk with respect to ω2: r2 = λ21 ∫_{R1} p(x|ω2) dx + λ22 ∫_{R2} p(x|ω2) dx.  Average risk: r = r1 P(ω1) + r2 P(ω2), i.e., the probabilities of wrong decisions, weighted by the penalty terms.

19  Choose R1 and R2 so that r is minimized.  Then assign x to ω1 if ℓ1 ≡ λ11 p(x|ω1)P(ω1) + λ21 p(x|ω2)P(ω2) < ℓ2 ≡ λ12 p(x|ω1)P(ω1) + λ22 p(x|ω2)P(ω2).  Equivalently: assign x to ω1 (ω2) if ℓ12 ≡ p(x|ω1)/p(x|ω2) is greater (less) than [P(ω2)/P(ω1)]·[(λ21 − λ22)/(λ12 − λ11)]; ℓ12 is the likelihood ratio.

20  If λ11 = λ22 = 0: assign x to ω1 (ω2) if λ21 p(x|ω2)P(ω2) < (>) λ12 p(x|ω1)P(ω1), i.e., a likelihood-ratio test against the threshold λ21 P(ω2) / (λ12 P(ω1)).

21  An example:

22  Then the threshold value x0 for minimum error probability satisfies p(x0|ω1)P(ω1) = p(x0|ω2)P(ω2), while the threshold x̂0 for minimum r satisfies λ12 p(x̂0|ω1)P(ω1) = λ21 p(x̂0|ω2)P(ω2).

23  Thus x̂0 moves to the left of x0 (WHY?).
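A sketch of the minimum-risk (likelihood-ratio) test of slides 19-22, again for hypothetical univariate Gaussian classes; the loss values and all names are assumptions. With λ21 > λ12, deciding ω1 requires a larger likelihood ratio, so region R2 grows and the threshold on x moves toward the ω1 mean, which is the shift slide 23 asks about.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def minimum_risk_classify(x, priors, means, variances, loss):
    """Decide omega_1 if the likelihood ratio p(x|w1)/p(x|w2) exceeds
    P(w2)(l21 - l22) / (P(w1)(l12 - l11))."""
    (l11, l12), (l21, l22) = loss
    ratio = gaussian_pdf(x, means[0], variances[0]) / gaussian_pdf(x, means[1], variances[1])
    threshold = priors[1] * (l21 - l22) / (priors[0] * (l12 - l11))
    return 0 if ratio > threshold else 1

# lambda_21 > lambda_12: errors on omega_2 are penalized more heavily.
loss = [[0.0, 0.5], [1.0, 0.0]]
print(minimum_risk_classify(0.45, [0.5, 0.5], [0.0, 1.0], [0.04, 0.04], loss))
```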

24 Minimax Criterion  Goal: minimize the maximum possible overall risk, i.e., protect against the worst case when no prior knowledge of P(ω1), P(ω2) is available.

25 Minimax Criterion  With some simplification:

26 Minimax Criterion  A simple case (λ11 = λ22 = 0, λ12 = λ21 = 1): the minimax regions equalize the two conditional error probabilities, ∫_{R2} p(x|ω1) dx = ∫_{R1} p(x|ω2) dx.
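A numerical sketch of this 0/1-loss minimax case (my own illustration, not from the slides): for two assumed univariate Gaussian class-conditional densities, bisection finds the threshold at which the two conditional error probabilities are equal, so the overall risk no longer depends on the unknown priors.

```python
import math

def norm_cdf(x, mean, std):
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def minimax_threshold(m1, s1, m2, s2, lo=-10.0, hi=10.0, iters=60):
    """Find t with P(x > t | w1) = P(x < t | w2), assuming m1 < m2 (0/1 loss)."""
    def gap(t):
        return (1.0 - norm_cdf(t, m1, s1)) - norm_cdf(t, m2, s2)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:      # error on omega_1 still larger: move the threshold right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(minimax_threshold(0.0, 1.0, 2.0, 0.5))
```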

27 DISCRIMINANT FUNCTIONS AND DECISION SURFACES  If regions Ri, Rj are contiguous, g(x) ≡ P(ωi|x) − P(ωj|x) = 0 is the surface separating them. On one side g(x) is positive (+), on the other negative (−). It is known as the decision surface.

28  For monotonically increasing f(·), the rule remains the same if we use gi(x) ≡ f(P(ωi|x)); gi(x) is a discriminant function.  In general, discriminant functions can be defined independently of the Bayesian rule. They lead to suboptimal solutions, yet, if chosen appropriately, they can be computationally more tractable.

29 BAYESIAN CLASSIFIER FOR NORMAL DISTRIBUTIONS  Multivariate Gaussian pdf: p(x|ωi) = (2π)^{−ℓ/2} |Σi|^{−1/2} exp(−(1/2)(x − μi)^T Σi^{−1}(x − μi)), where μi = E[x] is the mean vector and the last one, Σi = E[(x − μi)(x − μi)^T], is called the covariance matrix.

30  ln(·) is monotonically increasing. Define: gi(x) ≡ ln[p(x|ωi)P(ωi)] = ln p(x|ωi) + ln P(ωi), i.e., gi(x) = −(1/2)(x − μi)^T Σi^{−1}(x − μi) + ln P(ωi) + ci, with ci = −(ℓ/2) ln 2π − (1/2) ln|Σi|.  Example:

31  That is, gi(x) is a quadratic function of x, and the decision surfaces gi(x) − gj(x) = 0 are quadrics: ellipsoids, parabolas, hyperbolas, pairs of lines. For example:

32  Decision hyperplanes  The quadratic terms come from x^T Σi^{−1} x. If ALL Σi are equal (Σi = Σ), the quadratic terms are not of interest: they are the same in every gi(x) and are not involved in the comparisons. Then, equivalently, we can write gi(x) = wi^T x + wi0, with wi = Σ^{−1} μi and wi0 = ln P(ωi) − (1/2) μi^T Σ^{−1} μi: the discriminant functions are LINEAR.

33  Let, in addition, Σ = σ²I. Then the decision hyperplane gij(x) ≡ gi(x) − gj(x) = 0 can be written as w^T(x − x0) = 0, with w = μi − μj and x0 = (1/2)(μi + μj) − σ² ln[P(ωi)/P(ωj)] (μi − μj)/‖μi − μj‖².

34  Non-diagonal Σ:  Decision hyperplane w^T(x − x0) = 0, now with w = Σ^{−1}(μi − μj) and x0 = (1/2)(μi + μj) − ln[P(ωi)/P(ωj)] (μi − μj)/‖μi − μj‖²_{Σ^{−1}}, where ‖x‖_{Σ^{−1}} ≡ (x^T Σ^{−1} x)^{1/2}. The hyperplane is no longer orthogonal to μi − μj but to Σ^{−1}(μi − μj).
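A sketch of the Gaussian discriminant functions of slides 29-34 (hypothetical data and names; the constant −(ℓ/2) ln 2π is dropped since it cancels in comparisons). With a common covariance matrix, comparing these gi(x) reduces to the linear case above.

```python
import numpy as np

def gaussian_discriminant(x, mean, cov, prior):
    """g_i(x) = -0.5 (x - mu_i)^T Sigma_i^{-1} (x - mu_i) - 0.5 ln|Sigma_i| + ln P(w_i)."""
    d = x - mean
    inv = np.linalg.inv(cov)
    return -0.5 * d @ inv @ d - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior)

def classify(x, means, covs, priors):
    scores = [gaussian_discriminant(x, m, c, p) for m, c, p in zip(means, covs, priors)]
    return int(np.argmax(scores))

# Two classes with a common covariance: the decision surface is a hyperplane.
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs = [np.eye(2), np.eye(2)]
print(classify(np.array([0.8, 1.3]), means, covs, priors=[0.5, 0.5]))
```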

35  Minimum Distance Classifiers  For equiprobable classes with a common covariance, maximizing gi(x) amounts to assigning x to the class whose mean is closest:  Euclidean distance (Σ = σ²I): smaller dE = ‖x − μi‖.  Mahalanobis distance (general common Σ): smaller dM = ((x − μi)^T Σ^{−1}(x − μi))^{1/2}.
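A sketch of the two minimum-distance rules (illustrative names and numbers, not from the slides): Euclidean distance for Σ = σ²I, Mahalanobis distance for a common non-diagonal Σ.

```python
import numpy as np

def min_distance_classify(x, means, cov=None):
    """Assign x to the class with the nearest mean; Euclidean if cov is None,
    otherwise Mahalanobis with the common covariance matrix cov."""
    if cov is None:
        dists = [np.linalg.norm(x - m) for m in means]
    else:
        inv = np.linalg.inv(cov)
        dists = [np.sqrt((x - m) @ inv @ (x - m)) for m in means]
    return int(np.argmin(dists))

means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
x = np.array([1.0, 2.2])
print(min_distance_classify(x, means))        # Euclidean distance
print(min_distance_classify(x, means, cov))   # Mahalanobis distance
```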

36 More Geometric Analysis

37

38  Example:

39 Statistical Error Analysis  Minimum attainable classification error (Bayes):

40 ESTIMATION OF UNKNOWN PROBABILITY DENSITY FUNCTIONS  Maximum Likelihood: given X = {x1, ..., xN}, with the samples drawn independently from p(x; θ), the ML estimate is θ̂_ML = arg max_θ ∏_{k=1}^{N} p(xk; θ) = arg max_θ Σ_{k=1}^{N} ln p(xk; θ).

41  Setting the gradient of the log-likelihood to zero gives the ML estimate: ∂/∂θ [Σ_{k=1}^{N} ln p(xk; θ)] = 0.

42

43

44  Example:
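The slide's own example is not in the transcript; as an illustration of the same idea, here is a sketch of ML estimation for a multivariate Gaussian, where the estimates are the sample mean and the 1/N sample covariance (synthetic data, hypothetical numbers).

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw N i.i.d. samples from a Gaussian whose parameters we pretend not to know.
true_mean = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(true_mean, true_cov, size=5000)

# ML estimates: sample mean and the (1/N, slightly biased) sample covariance.
mu_ml = X.mean(axis=0)
Sigma_ml = (X - mu_ml).T @ (X - mu_ml) / X.shape[0]

print(mu_ml)      # close to true_mean
print(Sigma_ml)   # close to true_cov
```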

45  Maximum A Posteriori (MAP) Probability Estimation  In the ML method, θ was considered an unknown but fixed parameter.  Here we look at θ as a random vector described by a pdf p(θ), assumed to be known.  Given X = {x1, ..., xN}, compute the maximum of p(θ|X).  From the Bayes theorem: p(θ|X) = p(X|θ)p(θ) / p(X).

46  The method: θ̂_MAP = arg max_θ p(θ|X), or, equivalently, the solution of ∂/∂θ [ln p(θ) + Σ_{k=1}^{N} ln p(xk|θ)] = 0.

47

48  Example:
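Again the slide's example itself is elided; as one common instance, here is a sketch of the MAP estimate of the mean of a univariate Gaussian with known variance and a Gaussian prior on the mean (all names and numbers are assumptions).

```python
import numpy as np

def map_gaussian_mean(x, sigma2, mu0, sigma0_2):
    """Maximize ln p(mu) + sum_k ln p(x_k|mu) for x_k ~ N(mu, sigma2), mu ~ N(mu0, sigma0_2):
    the result is a precision-weighted average of the prior mean and the data."""
    n = len(x)
    return (mu0 / sigma0_2 + x.sum() / sigma2) / (1.0 / sigma0_2 + n / sigma2)

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=20)   # sigma2 = 1 assumed known
print(map_gaussian_mean(x, sigma2=1.0, mu0=0.0, sigma0_2=0.5))
# With few samples the estimate is pulled toward mu0; as N grows it approaches the sample mean (ML).
```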

49  Bayesian Inference  Instead of a single point estimate of θ, estimate the pdf of x directly: p(x|X) = ∫ p(x|θ) p(θ|X) dθ, with p(θ|X) = p(X|θ)p(θ) / ∫ p(X|θ)p(θ) dθ.

50  For the one-dimensional Gaussian case with known σ² and prior p(μ) = N(μ0, σ0²), the posterior is p(μ|X) = N(μN, σN²), with μN = (N σ0² x̄ + σ² μ0)/(N σ0² + σ²) and σN² = σ0² σ² / (N σ0² + σ²), where x̄ is the sample mean.
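A short sketch of these posterior updates (hypothetical data; the formulas are the standard ones quoted above), showing the posterior sharpening as N grows, which is what slide 51 refers to.

```python
import numpy as np

def gaussian_mean_posterior(x, sigma2, mu0, sigma0_2):
    """p(mu|X) = N(mu_N, sigma_N^2) for known sigma2 and prior N(mu0, sigma0_2)."""
    n, xbar = len(x), float(np.mean(x))
    denom = n * sigma0_2 + sigma2
    mu_n = (n * sigma0_2 * xbar + sigma2 * mu0) / denom
    sigma_n_2 = sigma0_2 * sigma2 / denom
    return mu_n, sigma_n_2

rng = np.random.default_rng(2)
for n in (1, 10, 100, 1000):
    x = rng.normal(loc=2.0, scale=1.0, size=n)
    print(n, gaussian_mean_posterior(x, sigma2=1.0, mu0=0.0, sigma0_2=1.0))
```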

51  The above is a sequence of Gaussians: as N → ∞, μN → x̄ and σN² → 0, so p(μ|X) concentrates around the sample mean.  Maximum Entropy  Entropy: H = −∫ p(x) ln p(x) dx; the ME estimate maximizes H subject to the available constraints.

52  Example: p(x) is nonzero in the interval [x1, x2] and zero otherwise. Compute the ME pdf.  The constraint: ∫_{x1}^{x2} p(x) dx = 1.  Lagrange multipliers lead to the uniform pdf, p(x) = 1/(x2 − x1) for x1 ≤ x ≤ x2 and 0 otherwise (worked out below).
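A worked version of the derivation (a standard calculation, reconstructed here because the slide's own steps are not in the transcript):

```latex
\begin{aligned}
&H_L = -\int_{x_1}^{x_2} p(x)\ln p(x)\,dx
      + \lambda\Big(\int_{x_1}^{x_2} p(x)\,dx - 1\Big),\\
&\frac{\partial H_L}{\partial p(x)} = -\ln p(x) - 1 + \lambda = 0
      \;\Rightarrow\; \hat p(x) = e^{\lambda - 1},\\
&\int_{x_1}^{x_2} e^{\lambda - 1}\,dx = 1
      \;\Rightarrow\; e^{\lambda - 1} = \frac{1}{x_2 - x_1},\qquad
\hat p(x) = \begin{cases} \dfrac{1}{x_2 - x_1}, & x_1 \le x \le x_2,\\ 0, & \text{otherwise.} \end{cases}
\end{aligned}
```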

53  Mixture Models: p(x) = Σ_{j=1}^{J} Pj p(x|j), with Σ_j Pj = 1 and ∫ p(x|j) dx = 1.  Assume parametric modeling, i.e., p(x|j; θ).  The goal is to estimate θ and P1, ..., PJ given a set X = {x1, ..., xN}.  We ignore the details!

54  Nonparametric Estimation  No parametric model is assumed; the pdf is estimated directly from the data around each point x, p̂(x) ≈ kN/(N h), based on the kN out of N samples falling in a bin of length h around x.

55  Parzen Windows  Divide the multidimensional space into hypercubes of side h.

56  Define φ(xi) = 1 if |xij| ≤ 1/2 for j = 1, ..., ℓ, and 0 otherwise; that is, it is 1 inside a unit-side hypercube centered at 0.  The problem: estimate p(x). The Parzen estimate is p̂(x) = (1/h^ℓ)(1/N) Σ_{i=1}^{N} φ((xi − x)/h).  Parzen windows, kernels, potential functions.

57  Mean value: E[p̂(x)] = (1/h^ℓ) ∫ φ((x' − x)/h) p(x') dx'. As h → 0, (1/h^ℓ)φ((x' − x)/h) tends to a delta function centered at x, so E[p̂(x)] → p(x): hence unbiased in the limit.

58  Variance: the smaller the h, the higher the variance. (Figures: Parzen estimates for h = 0.1, N = 1000 and h = 0.8, N = 1000.)

59  (Figure: Parzen estimate for h = 0.1, N = 10000.) The higher the N, the better the accuracy.

60  If, as N → ∞, h → 0 while N h^ℓ → ∞, the estimate is asymptotically unbiased and consistent.  The method in practice: choose h as a compromise between low bias (small h) and low variance (large h).  Remember: the estimated densities p̂(x|ωi) are then plugged into the Bayes rule in place of the true ones.
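A sketch of Parzen-window estimation in one dimension (my illustration; it uses a smooth Gaussian kernel instead of the unit hypercube defined on slide 56, and all names and numbers are assumptions), reproducing the qualitative h/N behaviour shown on slides 58-59.

```python
import numpy as np

def parzen_estimate(x_grid, samples, h):
    """1-D Parzen estimate p_hat(x) = (1/(N h)) sum_i phi((x - x_i)/h), Gaussian phi."""
    u = (x_grid[:, None] - samples[None, :]) / h
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return phi.sum(axis=1) / (len(samples) * h)

rng = np.random.default_rng(3)
samples = rng.normal(0.0, 1.0, size=1000)
x_grid = np.linspace(-4.0, 4.0, 9)

# Small h: low bias but high variance; large h: smoother but more biased.
for h in (0.1, 0.8):
    print(h, np.round(parzen_estimate(x_grid, samples, h), 3))
```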

61  CURSE OF DIMENSIONALITY  In all the methods so far, we saw that the higher the number of points, N, the better the resulting estimate.  If, in the one-dimensional space, an interval is adequately filled (for good estimation) with N points, then in the two-dimensional space the corresponding square will require N² points, and in the ℓ-dimensional space the ℓ-dimensional cube will require N^ℓ points.  The exponential increase in the number of necessary points is known as the curse of dimensionality. This is a major problem one is confronted with in high-dimensional spaces.

62  NAIVE BAYES CLASSIFIER  Let x ∈ R^ℓ; the goal is to estimate p(x|ωi), i = 1, 2, ..., M. For a “good” estimate of the pdf one would need, say, N^ℓ points.  Assume x1, x2, ..., xℓ mutually independent. Then: p(x|ωi) = ∏_{j=1}^{ℓ} p(xj|ωi).  In this case, one would require, roughly, N points for each one-dimensional pdf; thus, a number of points of the order N·ℓ would suffice.  It turns out that the naive Bayes classifier works reasonably well even in cases that violate the independence assumption.
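A sketch of a Gaussian naive Bayes classifier (hypothetical data and names): each p(xj|ωi) is a one-dimensional Gaussian estimated from the training samples, and the log-likelihoods are summed under the independence assumption.

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Per-class, per-feature 1-D Gaussian fits plus class priors."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(X))
    return params

def predict_naive_bayes(x, params):
    best, best_score = None, -np.inf
    for c, (mu, var, prior) in params.items():
        # log P(w_i) + sum_j log N(x_j; mu_ij, var_ij)  (independence assumption)
        score = np.log(prior) - 0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)
        if score > best_score:
            best, best_score = c, score
    return best

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 1.0, (200, 5)), rng.normal(1.5, 1.0, (200, 5))])
y = np.array([0] * 200 + [1] * 200)
print(predict_naive_bayes(np.full(5, 1.2), fit_naive_bayes(X, y)))
```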

63  k Nearest Neighbor Density Estimation  In Parzen: the volume is constant; the number of points inside the volume varies.  Now: keep the number of points k constant and let the volume vary, p̂(x) = k/(N·V(x)).
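A one-dimensional sketch of k-NN density estimation (illustrative names and numbers): the "volume" around x is the length of the smallest interval containing the k nearest samples.

```python
import numpy as np

def knn_density_1d(x_grid, samples, k):
    """p_hat(x) = k / (N * V(x)), with V(x) = 2 * r_k(x),
    r_k(x) being the distance from x to its k-th nearest sample."""
    n = len(samples)
    dists = np.abs(x_grid[:, None] - samples[None, :])
    r_k = np.sort(dists, axis=1)[:, k - 1]
    return k / (n * 2.0 * r_k)

rng = np.random.default_rng(5)
samples = rng.normal(0.0, 1.0, size=2000)
x_grid = np.linspace(-3.0, 3.0, 7)
print(np.round(knn_density_1d(x_grid, samples, k=50), 3))
```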

64

65  The Nearest Neighbor Rule  Choose k out of the N training vectors: identify the k nearest ones to x.  Out of these k, identify ki that belong to class ωi and assign x to the class with the maximum ki.  The simplest version: k = 1.  For large N this is not bad. It can be shown that, if P_B is the optimal Bayesian error probability, then: P_B ≤ P_NN ≤ P_B(2 − (M/(M − 1)) P_B) ≤ 2 P_B.

66  For the kNN rule the error also lies between P_B and a bound that tightens as k grows.  For small P_B: P_NN ≅ 2 P_B and P_3NN ≅ P_B + 3(P_B)².
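A sketch of the k-NN classification rule itself (hypothetical two-class data; names and numbers are assumptions): find the k nearest training vectors and vote.

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=1):
    """k-NN rule: assign x to the class most represented among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return int(values[np.argmax(counts)])

rng = np.random.default_rng(6)
X_train = np.vstack([rng.normal(0.0, 1.0, (500, 2)), rng.normal(2.0, 1.0, (500, 2))])
y_train = np.array([0] * 500 + [1] * 500)
print(knn_classify(np.array([1.2, 0.9]), X_train, y_train, k=3))
```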

67  Voronoi tessellation: the 1-NN classifier partitions the feature space into cells; each training point's cell contains all points x that are closer to it than to any other training vector.