Lecture 3 Nonparametric density estimation and classification

Lecture 3 Nonparametric density estimation and classification

- Histogram
- The box kernel (Parzen window)
- k-nearest neighbor

Density estimation Classification can be based on estimating the density of each class. From a set of observed random vectors {x1, x2, ..., xn}, we estimate p(x). The probability that a vector x drawn from p(x) falls into a region R of the sample space is

P = ∫_R p(x′) dx′.

When n vectors are observed from the distribution, the probability that exactly k of them fall into R is binomial:

P(k) = C(n, k) · P^k · (1 − P)^(n−k).

Density estimation According to the properties of the binomial distribution,

E[k/n] = P,  Var(k/n) = P(1 − P)/n.

As n increases, the variance diminishes, and k/n becomes a good estimator of P.

Density estimation When a big enough sample is available, we can use a small region R such that p(x) varies very little within it. Let V be the volume of R. Since p(x) is nearly constant over R,

P = ∫_R p(x′) dx′ ≈ p(x) · V.

Since we also have P ≈ k/n,

p(x) ≈ (k/n) / V.

As n increases and V decreases, the estimate becomes more accurate.
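As a sanity check on the (k/n)/V estimate, the sketch below estimates a standard normal density at x = 0; the sample size and interval width are arbitrary choices of mine, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)                    # n draws from a standard normal p(x)

# Region R: a small interval of width h centered at 0, so its volume V = h.
h = 0.2
k = np.count_nonzero(np.abs(x) < h / 2)   # number of samples falling into R
p_hat = (k / n) / h                       # p(0) ~ (k/n)/V

# The true value is p(0) = 1/sqrt(2*pi) ~ 0.3989, and p_hat lands close to it.
```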

Density estimation Asymptotic considerations: construct regions R1, R2, R3, ... with a growing number of samples. Let Vn be the volumes, kn the number of samples included, and pn(x) the nth estimate of p(x):

pn(x) = (kn/n) / Vn.

Three conditions must be met for pn(x) to converge to p(x):

lim Vn = 0,  lim kn = ∞,  lim kn/n = 0  (as n → ∞).

Density estimation How do we obtain such a sequence R1, R2, R3, ...? Two general approaches:

(1) Specify Vn as a function of n, for example Vn = V1/√n, and show that kn and kn/n conform to the three conditions. This is kernel density estimation.

(2) Specify kn as a function of n, for example kn = √n, use the Vn such that kn samples are contained in the neighborhood, and show that Vn conforms to the conditions. This is the kn-nearest-neighbor method.

Histogram The histogram is close to, but not truly, density estimation. It doesn't try to estimate p(x) at every x; rather, it partitions the sample space into bins and only approximates the density at the center of each bin. It can be viewed as kernel density estimation with a box kernel, sampled at the bin centers.

Histogram For bin bj, the histogram density of the ith class is defined as

p̂i(x) = (kij / ni) / Vj  for x in bj,

where kij is the number of class-i samples falling into bj, ni is the class-i sample size, and Vj is the volume of bj. Within each bin, the density is assumed to be constant. It is a legitimate density function: positive, and it integrates to one.
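A minimal 1-D sketch of the histogram estimate (the function name and bin count are my own choices):

```python
import numpy as np

def histogram_density(x, samples, n_bins=20):
    """Histogram density estimate: (count in x's bin / n) / bin width."""
    counts, edges = np.histogram(samples, bins=n_bins)
    widths = np.diff(edges)
    # Index of the bin containing x (clipped so the rightmost edge is included).
    j = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(counts) - 1)
    return counts[j] / (len(samples) * widths[j])
```

The estimate is constant within each bin, and summing count/n over all bins gives one, matching the slide's claim that it is a legitimate density.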

Histogram The histogram density estimate is influenced by:

- the starting position of the bins;
- the orientation of the bins in more than one dimension;
- the artifact of discontinuity at bin boundaries.

Since the bins are of equal size, when the dimension is high a huge number of bins are needed, and with a limited amount of data most of them are empty.

Parzen window Emanuel Parzen, 1962. The original version uses a rectangular (box) kernel; some use "Parzen window" to refer to kernel density estimation in general. Define a window function

φ(u) = 1 if |uj| ≤ 1/2 for j = 1, ..., d; 0 otherwise.

This is a unit hypercube centered at the origin. Given the volume Vn of a d-dimensional hypercube, the edge length hn satisfies Vn = hn^d.

Parzen window With hn we can define the kernel: φ((x − xi)/hn) equals 1 if xi falls within the hypercube of volume Vn centered at x, and 0 otherwise. The number of samples in the hypercube is

kn = Σi φ((x − xi)/hn),

and the estimate of p(x) is

pn(x) = (1/n) Σi (1/Vn) φ((x − xi)/hn),

where n is the sample size.

Parzen window Is pn(x) a legitimate density function? It needs to satisfy (1) nonnegativity and (2) integration to one. This can be achieved by requiring the window function itself to satisfy these conditions:

φ(u) ≥ 0  and  ∫ φ(u) du = 1.

Define the function δn(x) = (1/Vn) φ(x/hn). Then pn(x) can be written as

pn(x) = (1/n) Σi δn(x − xi).
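A sketch of the box-window estimator, valid in any dimension d (function name mine; samples are stored one row per observation):

```python
import numpy as np

def parzen_box(x, samples, h):
    """Box-kernel Parzen estimate: p_n(x) = (1/n) sum_i (1/h^d) phi((x - x_i)/h),
    where phi is 1 inside the unit hypercube and 0 outside."""
    samples = np.atleast_2d(samples)            # shape (n, d)
    n, d = samples.shape
    u = (np.asarray(x) - samples) / h
    inside = np.all(np.abs(u) <= 0.5, axis=1)   # phi(u) for each sample
    return inside.sum() / (n * h ** d)
```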

Parzen window The window function can be generalized. Notice that any density function satisfies our requirements (nonnegative, integrates to one), so any density can serve as the window. Then pn(x) is a superposition of n density functions, one centered at each sample.

Parzen window We want the mean of pn(x) to converge to the truth p(x). The expected value of the estimate is

E[pn(x)] = ∫ (1/Vn) φ((x − v)/hn) p(v) dv,

an average of the true density around x. It is the convolution of the true density and the window function: a "blurred" version of the truth. When Vn → 0, δn approaches a Dirac delta function and E[pn(x)] → p(x).
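Since any density can serve as the window, a Gaussian window is a common choice; a 1-D sketch (names and the bandwidth below are mine):

```python
import numpy as np

def parzen_gauss(x, samples, h):
    """Gaussian-window Parzen estimate:
    p_n(x) = (1/(n*h)) * sum_i N((x - x_i)/h; 0, 1)."""
    u = (x - np.asarray(samples)) / h
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)   # standard normal window
    return phi.sum() / (len(samples) * h)
```

This makes the "blurring" concrete: for standard normal data, E[pn(x)] is the convolution of N(0, 1) with N(0, h²), i.e. an N(0, 1 + h²) density, slightly flatter than the truth.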

Parzen window Example (figure): Parzen window estimates when the true density p(x) is standard normal.

Parzen window classification A classifier based on Parzen windows is straightforward:

- Estimate the density of each class using Parzen windows.
- Construct a Bayes classifier using the estimated densities.
- Classify a test object based on the posterior probabilities and the loss function.

The decision boundary of the classifier depends on the choice of window function and window size.
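A sketch of such a classifier under 0-1 loss, where the Bayes decision reduces to maximizing prior × estimated class density; the Gaussian window and all names are my choices, and priors are estimated by class frequencies:

```python
import numpy as np

def parzen_density(x, samples, h):
    """1-D Gaussian-window Parzen density estimate."""
    u = (x - np.asarray(samples)) / h
    return (np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)).sum() / (len(samples) * h)

def parzen_classify(x, class_samples, h):
    """class_samples: list of 1-D sample arrays, one per class.
    0-1 loss -> return the index of the class maximizing prior * density."""
    n_total = sum(len(s) for s in class_samples)
    scores = [len(s) / n_total * parzen_density(x, s, h) for s in class_samples]
    return int(np.argmax(scores))
```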

KNN estimation To estimate p(x), we grow a cell centered at x until it captures kn samples, where kn is a function of n. Those samples are the kn nearest neighbors of x. The density estimate is as discussed:

pn(x) = (kn/n) / Vn.

If kn = √n, then Vn ≈ V1/√n, where V1 is determined by the nature of the data rather than fixed in advance.
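A 1-D sketch of growing the cell until it holds k samples (function name mine; the cell is an interval, so its "volume" is its length):

```python
import numpy as np

def knn_density(x, samples, k):
    """kNN density estimate at x: grow an interval around x until it
    captures k samples, then return (k/n) / V with V = 2 * r_k,
    where r_k is the distance to the k-th nearest neighbor."""
    d = np.sort(np.abs(np.asarray(samples) - x))
    V = 2.0 * d[k - 1]
    return k / (len(samples) * V)
```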

KNN classifier Although KNN density estimation is similar to the Parzen window, for classification it is used in a simpler way: directly estimate the posterior probability from n labeled samples. A cell with volume V captures k samples: k1 in class 1, k2 in class 2, and so on. The joint probability is estimated by

pn(x, ωi) = (ki/n) / V.

Then

Pn(ωi | x) = pn(x, ωi) / Σj pn(x, ωj) = ki / k.

KNN classifier The estimate of the posterior probability is simply the fraction of the samples within the cell that belong to a specific class. Bayes decision is used again to minimize the error rate. Notice that there is no computation to be done in the model-learning step: when a test point is presented, the class frequencies of the training data around it are used for classification.
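The voting rule above can be sketched as follows (names mine; distances are Euclidean):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X, y, k):
    """Label x by majority vote among its k nearest training samples.
    There is no training step: all work happens at query time."""
    dists = np.linalg.norm(np.asarray(X) - np.asarray(x), axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y[i] for i in nearest).most_common(1)[0][0]
```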

KNN classifier Nonetheless, the rule is capable of drawing class boundaries. With the 1-nearest-neighbor rule, the feature space is partitioned into a Voronoi tessellation.

KNN error KNN doesn't reach the Bayes error rate. Here's why. Suppose the true posterior probabilities are known, with P(ω1|x) > P(ω2|x). The Bayes decision rule always chooses class 1. But will KNN always do that? No: KNN is influenced by sampling variation. Treating each neighbor's label as an independent draw that is class 1 with probability P(ω1|x), KNN chooses class 1 with probability

Σ_{j > k/2} C(k, j) P(ω1|x)^j (1 − P(ω1|x))^(k−j).

The larger the k, the smaller the error.
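The binomial sum above is easy to evaluate directly (function name mine; k is assumed odd so there are no ties):

```python
from math import comb

def p_knn_agrees_with_bayes(p1, k):
    """Probability that a k-NN majority vote (k odd) picks class 1 when
    each neighbor's label is independently class 1 with probability p1."""
    return sum(comb(k, j) * p1 ** j * (1 - p1) ** (k - j)
               for j in range(k // 2 + 1, k + 1))
```

For example, with P(ω1|x) = 0.9 the vote agrees with the Bayes decision with probability 0.9 for k = 1 and 0.972 for k = 3, illustrating that the error shrinks as k grows.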

KNN error With c classes: when one class posterior is close to 1, the Bayes error is small, and so is the KNN error. When all classes are almost equally likely, both Bayes and KNN have error rates close to 1 − 1/c. In between, the asymptotic nearest-neighbor error rate P is bounded in terms of the Bayes error rate P*:

P* ≤ P ≤ P* (2 − (c/(c−1)) P*).