1
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000 with the permission of the authors and the publisher
2
Chapter 4: Nonparametric Techniques (Sections 1-6)
1. Introduction
2. Density Estimation
3. Parzen Windows
4. k_n-Nearest-Neighbor Estimation
5. The Nearest-Neighbor Rule
6. Metrics and Nearest-Neighbor Classification
3
Pattern Classification, Ch4 2
1. Introduction
All parametric densities are unimodal (have a single local maximum), whereas many practical problems involve multimodal densities.
Nonparametric procedures can be used with arbitrary distributions and without the assumption that the forms of the underlying densities are known.
There are two types of nonparametric methods:
- Estimate the density functions p(x | ω_j) without assuming a model: Parzen windows
- Bypass the density functions and directly estimate the posteriors P(ω_j | x): k-nearest neighbor (kNN)
4
Pattern Classification, Ch4 3
2. Density Estimation
Basic idea: the probability that a vector x will fall in a region R is
P = ∫_R p(x') dx'   (1)
P is a smoothed (averaged) version of the density function p(x).
For n samples drawn independently, the number k of points falling in R follows a binomial distribution:
P_k = (n choose k) P^k (1 - P)^(n-k)   (2)
and the expected value of k is
E[k] = nP   (3)
5
Pattern Classification, Ch4 4
Maximum likelihood estimation of P: the binomial likelihood (2) is maximized for
P̂ = k/n
Therefore, the ratio k/n is a good estimate for the probability P and hence for the density function p.
Because p(x) is continuous and the region R is so small that p does not vary significantly within it:
∫_R p(x') dx' ≅ p(x) · V   (4)
where x is a point within R and V is the volume enclosed by R.
6
Pattern Classification, Ch4 5
Combining equations (1), (3) and (4) yields the basic density estimate:
p(x) ≅ (k/n) / V
where k is the number of the n samples falling in R and V is the volume of R.
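As a quick illustration (not part of the original slides), the following NumPy sketch draws samples from a known density and compares (k/n)/V with the true value at a point; the sample size, interval width, and variable names are illustrative choices.

```python
# Minimal sketch of p(x) ~= (k/n)/V: draw n samples from a standard normal,
# count how many fall in a small interval of width V centered at x, and
# compare the estimate with the true density.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
samples = rng.standard_normal(n)

x = 0.5          # point at which we estimate the density
V = 0.05         # "volume" (length) of the region R around x

k = np.sum(np.abs(samples - x) <= V / 2)   # samples falling in R
p_hat = (k / n) / V                        # estimate (k/n)/V

p_true = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
print(f"estimate {p_hat:.4f}  vs  true {p_true:.4f}")
```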
7
Pattern Classification, Ch4 6
Theoretically, if an unlimited number of samples is available, we can estimate the density at x by forming a sequence of regions R_1, R_2, ... containing x: the first contains one sample, the second two, etc.
Let V_n be the volume of R_n, k_n the number of samples falling in R_n, and p_n(x) the nth estimate of p(x):
p_n(x) = (k_n / n) / V_n   (7)
Three conditions must hold for p_n(x) to converge to p(x):
lim (n→∞) V_n = 0,   lim (n→∞) k_n = ∞,   lim (n→∞) k_n / n = 0
There are two different ways to satisfy these conditions:
1. Shrink an initial region by specifying the volume, e.g. V_n = 1/√n, and show that p_n(x) converges to p(x). This is the Parzen-window estimation method.
2. Specify k_n as a function of n, e.g. k_n = √n, and grow the volume V_n until it encloses the k_n nearest neighbors of x. This is the kNN estimation method.
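A small numeric sketch (plain Python, illustrative values of n) of the two schedules just described, V_n = 1/√n for Parzen windows and k_n = √n for kNN, showing how they behave as n grows.

```python
# Parzen schedule: V_n = 1/sqrt(n) shrinks to 0 while n*V_n = sqrt(n) grows,
# so the cell keeps capturing more samples.
# kNN schedule:    k_n = sqrt(n) grows without bound while k_n/n -> 0.
import math

for n in (10, 1_000, 100_000, 10_000_000):
    V_n = 1 / math.sqrt(n)        # Parzen-window volume schedule
    k_n = math.sqrt(n)            # kNN neighbor-count schedule
    print(f"n={n:>9}  V_n={V_n:.5f}  n*V_n={n*V_n:10.1f}  "
          f"k_n={k_n:8.1f}  k_n/n={k_n/n:.5f}")
```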
8
Pattern Classification, Ch4 7
9
Pattern Classification, Ch4 8
3. Parzen Windows
Parzen-window approach to estimating densities: assume, for example, that the region R_n is a d-dimensional hypercube with edge length h_n, so that its volume is
V_n = h_n^d
Define the window function
φ(u) = 1 if |u_j| ≤ 1/2 for all j = 1, ..., d; 0 otherwise
Then φ((x - x_i)/h_n) is equal to unity if x_i falls within the hypercube of volume V_n centered at x, and equal to zero otherwise.
10
Pattern Classification, Ch4 9
The number of samples falling in this hypercube is:
k_n = Σ (i=1 to n) φ((x - x_i) / h_n)
Substituting k_n into equation (7) (p_n(x) = (k_n/n)/V_n) we obtain:
p_n(x) = (1/n) Σ (i=1 to n) (1/V_n) φ((x - x_i) / h_n)
p_n(x) estimates p(x) as an average of window functions of x and the samples x_i (i = 1, ..., n). These window functions can be quite general!
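A minimal sketch of this hypercube estimator (function names and sample data are illustrative, not from the book):

```python
import numpy as np

def hypercube_window(u):
    """phi(u): 1 if the point lies inside the unit hypercube centered at 0."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_estimate(x, samples, h_n):
    """p_n(x) = (1/n) * sum_i (1/V_n) * phi((x - x_i) / h_n)."""
    n, d = samples.shape
    V_n = h_n ** d                          # hypercube volume
    u = (x - samples) / h_n                 # scaled offsets, shape (n, d)
    return hypercube_window(u).sum() / (n * V_n)

# Usage: estimate a 2-D density at the origin from 500 Gaussian samples.
rng = np.random.default_rng(1)
samples = rng.standard_normal((500, 2))
print(parzen_estimate(np.array([0.0, 0.0]), samples, h_n=0.5))
```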
11
Pattern Classification, Ch4 10
Illustration – effect of the window function on the behavior of the Parzen-window method
Case where p(x) ~ N(0,1)
Let φ(u) = (1/√(2π)) exp(-u²/2) and h_n = h_1/√n (n > 1), where h_1 is a parameter at our disposal.
Thus:
p_n(x) = (1/n) Σ (i=1 to n) (1/h_n) φ((x - x_i) / h_n)
is an average of normal densities centered at the samples x_i.
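A brief sketch of the Gaussian-window estimate with h_n = h_1/√n (illustrative names and values):

```python
import numpy as np

def parzen_gaussian(x, samples, h1):
    """1-D Parzen estimate with a Gaussian window and h_n = h1 / sqrt(n)."""
    n = len(samples)
    h_n = h1 / np.sqrt(n)
    u = (x - samples) / h_n
    phi = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)    # N(0,1) window
    return phi.sum() / (n * h_n)

# Usage: samples from N(0,1); the estimate at x = 0 should approach
# 1/sqrt(2*pi) ~= 0.399 as n grows.
rng = np.random.default_rng(2)
samples = rng.standard_normal(1000)
print(parzen_gaussian(0.0, samples, h1=1.0))
```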
12
Pattern Classification, Ch4 11
Numerical 1-D results (see figure on the next slide): the results depend on n and h_1.
For n = 1 and h_1 = 1, the estimate is a single normal density centered at the one sample.
For n = 10 and h_1 = 0.1, the contributions of the individual samples are clearly observable!
13
Pattern Classification, Ch4 12
14
Pattern Classification, Ch4 13
15
Pattern Classification, Ch4 14
Case where p(x) = λ_1·U(a,b) + λ_2·T(c,d) (a mixture of a uniform and a triangle density)
16
Pattern Classification, Ch4 15
Classification example
In classifiers based on Parzen-window estimation:
We estimate the densities p(x | ω_j) for each category and classify a test point by the label corresponding to the maximum posterior (unequal priors for multiple classes can be included).
The decision region for a Parzen-window classifier depends on the choice of window function, as illustrated in the following figure.
For good estimates, n must usually be large, much greater than for parametric models.
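A minimal sketch of such a classifier, assuming Gaussian windows and illustrative names: estimate p(x | ω_j) for each class with a Parzen window and pick the class with the largest prior-weighted density.

```python
import numpy as np

def parzen_density(x, samples, h):
    """1-D Parzen estimate with a Gaussian window of width h."""
    u = (x - samples) / h
    phi = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
    return phi.sum() / (len(samples) * h)

def parzen_classify(x, class_samples, priors, h=0.5):
    """Pick argmax_j P(omega_j) * p(x | omega_j), densities via Parzen windows."""
    scores = [prior * parzen_density(x, s, h)
              for s, prior in zip(class_samples, priors)]
    return int(np.argmax(scores))

# Usage: two 1-D classes with equal priors.
rng = np.random.default_rng(3)
class_samples = [rng.normal(-2, 1, 200), rng.normal(+2, 1, 200)]
print(parzen_classify(0.7, class_samples, priors=[0.5, 0.5]))
```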
17
Pattern Classification, Ch4 16
18
Pattern Classification, Ch4 17
19
Pattern Classification, Ch4 18
4. k_n-Nearest-Neighbor Estimation
Rather than trying to find the "best" Parzen window function:
Let the cell volume be a function of the training data.
Center a cell about x and let it grow until it captures k_n samples (k_n = f(n)).
These samples are called the k_n nearest neighbors of x.
Two possibilities can occur:
1. The density is high near x, so the cell will be small, which provides good resolution.
2. The density is low, so the cell will grow until higher-density regions are reached.
We can obtain a family of estimates by setting k_n = k_1·√n and choosing different values for k_1 (a parameter at our disposal).
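A short sketch of the k_n-nearest-neighbor density estimate for 1-D data (function and variable names are illustrative): grow an interval around x until it contains k_n samples, then apply p_n(x) = (k_n/n)/V_n.

```python
import numpy as np

def knn_density(x, samples, k1=1.0):
    """1-D k_n-NN density estimate with k_n = k1 * sqrt(n)."""
    n = len(samples)
    k_n = max(1, int(round(k1 * np.sqrt(n))))
    # Distance to the k_n-th nearest sample defines the cell radius.
    dists = np.sort(np.abs(samples - x))
    radius = dists[k_n - 1]
    V_n = 2 * radius                     # length of the interval [x - r, x + r]
    return (k_n / n) / V_n

# Usage: samples from N(0,1); estimate near the mode (true value ~0.399).
rng = np.random.default_rng(4)
samples = rng.standard_normal(2000)
print(knn_density(0.0, samples, k1=1.0))
```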
20
Pattern Classification, Ch4 19
21
Pattern Classification, Ch4 20
22
Pattern Classification, Ch4 21
23
Pattern Classification, Ch4 22
Estimation of a-posteriori probabilities
Goal: estimate P(ω_i | x) from a set of n labeled samples.
Place a cell of volume V around x and capture k samples.
If k_i of the k samples turn out to be labeled ω_i, then the joint estimate is:
p_n(x, ω_i) = (k_i/n) / V
An estimate for P_n(ω_i | x) is:
P_n(ω_i | x) = p_n(x, ω_i) / Σ_j p_n(x, ω_j) = k_i / k
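A tiny sketch of this posterior estimate (hypothetical names, Euclidean distances assumed): capture the k nearest labeled samples around x and report the label fractions k_i/k.

```python
import numpy as np

def knn_posteriors(x, samples, labels, k, num_classes):
    """Estimate P(omega_i | x) as k_i / k from the k nearest labeled samples."""
    dists = np.linalg.norm(samples - x, axis=1)     # Euclidean distances
    nearest = np.argsort(dists)[:k]                 # indices of the k closest
    counts = np.bincount(labels[nearest], minlength=num_classes)
    return counts / k                               # fractions k_i / k

# Usage: two classes in 2-D.
rng = np.random.default_rng(5)
samples = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
print(knn_posteriors(np.array([1.5, 1.5]), samples, labels, k=7, num_classes=2))
```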
24
Pattern Classification, Ch4 23
k_i/k is the fraction of the samples within the cell that are labeled ω_i.
For a minimum error rate, the most frequently represented category within the cell is selected.
If k is large and the cell is sufficiently small, the performance will approach the best possible.
25
Pattern Classification, Ch4 24
5. The Nearest-Neighbor Rule
Let D_n = {x_1, x_2, ..., x_n} be a set of n labeled prototypes.
Let x' ∈ D_n be the prototype closest to a test point x; the nearest-neighbor rule for classifying x is to assign it the label associated with x'.
The nearest-neighbor rule leads to an error rate greater than the minimum possible, the Bayes rate.
However, if the number of prototypes is large (unlimited), the error rate of the nearest-neighbor classifier is never worse than twice the Bayes rate (this can be demonstrated!).
As n → ∞, it is always possible to find x' sufficiently close to x so that:
P(ω_i | x') ≅ P(ω_i | x)
If P(ω_m | x) ≅ 1, then the nearest-neighbor selection is almost always the same as the Bayes selection.
26
25 Pattern Classification, Ch 4
27
Pattern Classification, Ch4 26
The k-nearest-neighbor rule
Goal: classify x by assigning it the label most frequently represented among its k nearest samples, i.e. use a voting scheme.
Usually k is chosen odd so that there are no voting ties.
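A minimal sketch of the voting rule (illustrative names, Euclidean metric assumed): find the k nearest prototypes and return the majority label.

```python
import numpy as np
from collections import Counter

def knn_classify(x, prototypes, labels, k=3):
    """k-nearest-neighbor rule: majority vote among the k closest prototypes."""
    dists = np.linalg.norm(prototypes - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                  # indices of the k closest
    votes = Counter(labels[nearest])
    return votes.most_common(1)[0][0]                # most frequent label

# Usage: four labeled prototypes in 2-D, k = 3.
prototypes = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0]])
labels = np.array([1, 2, 2, 3])
print(knn_classify(np.array([1.0, 0.9]), prototypes, labels, k=3))
```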
28
Pattern Classification, Ch4 27
29
Pattern Classification, Ch4 28
30
Pattern Classification, Ch4 29
31
Pattern Classification, Ch4 30
Example: k = 3 (an odd value) and x = (0.10, 0.25)^t

Prototypes      Labels
(0.15, 0.35)    ω_1
(0.10, 0.28)    ω_2
(0.09, 0.30)    ω_5
(0.12, 0.20)    ω_2

The three vectors closest to x, with their labels, are:
{(0.10, 0.28, ω_2); (0.09, 0.30, ω_5); (0.12, 0.20, ω_2)}
The voting scheme assigns the label ω_2 to x, since ω_2 is the most frequently represented among the three nearest neighbors.
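A short check of this example (plain NumPy, purely illustrative): compute the Euclidean distances from x to each prototype and take the majority label among the three nearest.

```python
import numpy as np
from collections import Counter

x = np.array([0.10, 0.25])
prototypes = np.array([[0.15, 0.35], [0.10, 0.28], [0.09, 0.30], [0.12, 0.20]])
labels = ["w1", "w2", "w5", "w2"]       # omega_1, omega_2, omega_5, omega_2

dists = np.linalg.norm(prototypes - x, axis=1)
order = np.argsort(dists)               # indices sorted by distance
for i in order:
    print(prototypes[i], labels[i], f"d = {dists[i]:.3f}")

nearest3 = [labels[i] for i in order[:3]]
print("vote:", Counter(nearest3).most_common(1)[0][0])   # -> w2
```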
32
Pattern Classification, Ch4 31
6. Metrics and Nearest-Neighbor Classification
kNN uses a metric (distance function) between two vectors, typically the Euclidean distance.
A distance function D(·, ·) has the properties of a metric:
non-negativity: D(a, b) ≥ 0
reflexivity: D(a, b) = 0 if and only if a = b
symmetry: D(a, b) = D(b, a)
triangle inequality: D(a, b) + D(b, c) ≥ D(a, c)
33
Pattern Classification, Ch4 32
34
Pattern Classification, Ch4 33
The Minkowski Metric (Distance)
L_k(a, b) = ( Σ (i=1 to d) |a_i - b_i|^k )^(1/k)
L_1 is the Manhattan or city-block distance.
L_2 is the Euclidean distance.
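A small sketch of the Minkowski metric L_k and its L_1/L_2 special cases (the function name is illustrative):

```python
import numpy as np

def minkowski(a, b, k):
    """L_k(a, b) = (sum_i |a_i - b_i|^k)^(1/k)."""
    return np.sum(np.abs(a - b) ** k) ** (1.0 / k)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(minkowski(a, b, 1))   # L_1 (city block) -> 7.0
print(minkowski(a, b, 2))   # L_2 (Euclidean)  -> 5.0
```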
35
Pattern Classification, Ch4 34