Principles of Density Estimation

LECTURE 14: NONPARAMETRIC TECHNIQUES
Objectives: Density Estimation, Parzen Windows, k-Nearest Neighbor, Properties of Metrics
Resources: DHS: Chapter 4 (Part 1); DHS: Chapter 4 (Part 2); EB&OS: Parzen Windows; MS: Nonparametric Density; KT: K-Nearest Neighbor Rule; SOSS: KNN Tutorial

Principles of Density Estimation
Supervised learning assumes we know the form of the underlying density function, which is often not true in real applications.
The classical parametric densities are unimodal (have a single local maximum), such as a Gaussian distribution, whereas many practical problems involve multimodal densities. (We discussed mixture distributions as a way to deal with this.)
Nonparametric procedures can be used with arbitrary distributions and without the assumption that the forms of the underlying densities are known. They are data-driven (estimated from the data) and of course suffer the same challenges with respect to overfitting.
There are two types of nonparametric methods:
  Estimating the class-conditional densities, p(x|ωj).
  Bypassing density estimation and going directly to the a posteriori probability, P(ωj|x). Examples of this approach include many forms of neural networks.
The basic idea in density estimation is that a vector, x, falls in a region R with probability:
  P = ∫R p(x′) dx′
P is a smoothed or averaged version of the density function p(x).

Principles of Density Estimation (Cont.)
Suppose n samples are drawn independently and identically distributed (i.i.d.) according to p(x). The probability that exactly k of these n samples fall in R is given by the binomial law:
  P(k) = C(n, k) P^k (1 − P)^(n−k)
and the expected value for k is E[k] = nP.
The ML estimate of P, the value that maximizes P(k), is P̂ = k/n. Therefore, the ratio k/n is a good estimate for the probability P and hence for the density function p(x).
Assume p(x) is continuous and that the region R is so small that p(x) does not vary significantly within it. We can write:
  ∫R p(x′) dx′ ≈ p(x)V
where x is a point within R and V is the volume enclosed by R, and therefore:
  p(x) ≈ (k/n)/V
[Figure: A demonstration of nonparametric density estimation. The true probability was chosen to be 0.7. The curves vary as a function of the number of samples, n. The binomial distribution peaks strongly at the true probability.]
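As a numerical illustration (not part of the original slides), the following Python sketch draws i.i.d. samples and shows how the ratio k/n concentrates around the true region probability P. The choice of a standard normal density and a region of probability ≈ 0.7 mirrors the value used in the slide's figure; the sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# True density: standard normal. The region R = [-1.04, 1.04] carries
# probability mass ~0.7, matching the figure on this slide.
a, b = -1.04, 1.04
true_P = 0.7

for n in (10, 100, 1000, 10000):
    x = rng.standard_normal(n)          # n i.i.d. samples from p(x)
    k = np.sum((x >= a) & (x <= b))     # number of samples that fall in R
    print(f"n = {n:6d}   k/n = {k / n:.3f}   (true P ≈ {true_P})")
```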

Principles of Density Estimation (Cont.)
In practice, however, V cannot be allowed to become arbitrarily small, since the number of samples is always limited: at some point no samples are contained in V, and we cannot obtain convergence this way. One therefore has to accept a certain amount of variance in the ratio k/n and a certain amount of averaging of the density p(x).
Theoretically, if an unlimited number of samples is available, we can circumvent this difficulty. To estimate the density at x, we form a sequence of regions R1, R2, … containing x: the first region is used with one sample, the second with two samples, and so on.
Let Vn be the volume of Rn, kn the number of samples falling in Rn, and pn(x) the nth estimate for p(x):
  pn(x) = (kn/n)/Vn
Three conditions must hold if we want pn(x) to converge to p(x):
  lim(n→∞) Vn = 0
  lim(n→∞) kn = ∞
  lim(n→∞) kn/n = 0
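A minimal sketch (not from the original slides) of this limiting behavior, assuming a standard normal p(x) and the shrinking-volume choice Vn = 1/√n, which satisfies all three conditions; it evaluates pn(x) = (kn/n)/Vn at a single point:

```python
import numpy as np

rng = np.random.default_rng(1)
x0 = 0.0
true_p = 1.0 / np.sqrt(2.0 * np.pi)     # p(0) for N(0,1), about 0.3989

for n in (100, 1000, 10000, 100000):
    samples = rng.standard_normal(n)
    V_n = 1.0 / np.sqrt(n)              # shrinking volume: V_n -> 0, k_n -> inf, k_n/n -> 0
    half = V_n / 2.0                    # region R_n = [x0 - V_n/2, x0 + V_n/2]
    k_n = np.sum(np.abs(samples - x0) <= half)
    p_n = (k_n / n) / V_n
    print(f"n = {n:6d}   p_n(0) = {p_n:.4f}   (true p(0) ≈ {true_p:.4f})")
```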

Principles of Density Estimation (Cont.)
There are two different ways of obtaining sequences of regions that satisfy these conditions:
  Shrink an initial region by specifying the volume Vn as a function of n, such as Vn = V1/√n, and show that pn(x) converges to p(x). This is called the Parzen-window estimation method.
  Specify kn as some function of n, such as kn = √n; the volume Vn is grown until it encloses the kn nearest neighbors of x. This is called the kn-nearest-neighbor estimation method.

Parzen Windows
The Parzen-window approach to density estimation assumes that the region Rn is a d-dimensional hypercube with edge length hn and volume:
  Vn = hn^d
Define the window function:
  φ(u) = 1 if |uj| ≤ 1/2 for j = 1, …, d; 0 otherwise
so that φ((x − xi)/hn) is equal to unity if xi falls within the hypercube of volume Vn centered at x, and equal to zero otherwise.
The number of samples in this hypercube is:
  kn = Σ(i=1..n) φ((x − xi)/hn)
The estimate for pn(x) is:
  pn(x) = (1/n) Σ(i=1..n) (1/Vn) φ((x − xi)/hn)
pn(x) estimates p(x) as an average of functions of x and the samples {xi}, i = 1, …, n. This is reminiscent of the Sampling Theorem. These window functions, φ(u), can be quite general.
Consider an example in which we estimate p(x) using normal distributions:
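A minimal sketch of the hypercube Parzen estimate in one dimension (d = 1); the N(0,1) training data, the window width h1 = 1, and the evaluation points are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def parzen_hypercube(x, samples, h_n):
    """Parzen-window estimate p_n(x) with a box (hypercube) window of edge h_n, for d = 1."""
    u = (x - samples) / h_n
    phi = (np.abs(u) <= 0.5).astype(float)   # window: 1 inside the cube, 0 outside
    V_n = h_n                                # V_n = h_n^d with d = 1
    return phi.sum() / (len(samples) * V_n)  # p_n(x) = (1/n) sum_i (1/V_n) phi((x - x_i)/h_n)

rng = np.random.default_rng(2)
samples = rng.standard_normal(1000)          # assumed training data ~ N(0,1)
h_n = 1.0 / np.sqrt(len(samples))            # e.g. h_n = h_1 / sqrt(n) with h_1 = 1

for x in (-1.0, 0.0, 1.0):
    print(f"p_n({x:+.1f}) = {parzen_hypercube(x, samples, h_n):.3f}")
```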

Example of a Parzen Window (Gaussian Kernels)
Consider the case where p(x) ~ N(0,1).
Let φ(u) = (1/√(2π)) exp(−u²/2) and hn = h1/√n (n > 1), where h1 is a known parameter.
Thus:
  pn(x) = (1/n) Σ(i=1..n) (1/hn) φ((x − xi)/hn)
which is an average of normal densities centered at the samples xi.
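A minimal Python sketch of this Gaussian-kernel Parzen estimate; the value h1 = 1 and the evaluation points are illustrative assumptions.

```python
import numpy as np

def parzen_gaussian(x, samples, h_n):
    """p_n(x) = (1/n) sum_i (1/h_n) phi((x - x_i)/h_n) with a Gaussian window phi."""
    u = (x - samples) / h_n
    phi = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return np.mean(phi / h_n)

rng = np.random.default_rng(3)
n = 1000
samples = rng.standard_normal(n)   # training data drawn from p(x) = N(0,1)
h_1 = 1.0                          # assumed value of the known parameter h_1
h_n = h_1 / np.sqrt(n)             # h_n = h_1 / sqrt(n)

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    true_p = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
    print(f"x = {x:+.1f}   p_n(x) = {parzen_gaussian(x, samples, h_n):.3f}   true p(x) = {true_p:.3f}")
```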

K-Nearest Neighbor Estimation
Goal: a solution to the problem of the unknown "best" window function.
Approach: estimate the density using the actual data points. Let the cell volume be a function of the training data: center a cell about x and let it grow until it captures kn samples, where kn = f(n). These samples are called the kn nearest neighbors of x.
Two possibilities can occur:
  If the density is high near x, the cell will be small, which provides good resolution.
  If the density is low, the cell will grow large, but it stops soon after it enters regions of higher density.
The estimate is again pn(x) = (kn/n)/Vn, where Vn is the volume of the cell that captures the kn nearest neighbors. We can obtain a family of estimates by setting kn = k1√n and choosing different values for k1.
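A minimal sketch of the kn-nearest-neighbor density estimate in one dimension, with kn = √n; the choice k1 = 1 and the N(0,1) data are illustrative assumptions.

```python
import numpy as np

def knn_density(x, samples, k_n):
    """k-NN density estimate: grow the cell about x until it holds k_n samples (d = 1)."""
    dists = np.sort(np.abs(samples - x))
    r = dists[k_n - 1]                 # radius of the smallest interval holding k_n samples
    V_n = 2.0 * r                      # cell volume in one dimension
    return (k_n / len(samples)) / V_n  # p_n(x) = (k_n / n) / V_n

rng = np.random.default_rng(4)
n = 10000
samples = rng.standard_normal(n)       # training data ~ N(0,1)
k_n = int(np.sqrt(n))                  # k_n = k_1 * sqrt(n) with k_1 = 1

for x in (-2.0, 0.0, 2.0):
    true_p = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
    print(f"x = {x:+.1f}   p_n(x) = {knn_density(x, samples, k_n):.3f}   true p(x) = {true_p:.3f}")
```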

Estimation of A Posteriori Probabilities
Goal: estimate P(ωi|x) from a set of n labeled samples.
Place a cell of volume V around x and capture k samples. If ki of these k samples turn out to be labeled ωi, then:
  pn(x, ωi) = (ki/n)/V
A reasonable estimate for P(ωi|x) is:
  Pn(ωi|x) = pn(x, ωi) / Σ(j=1..c) pn(x, ωj) = ki/k
That is, ki/k is the fraction of the samples within the cell that are labeled ωi.
For minimum error rate, the most frequently represented category within the cell is selected. If k is large and the cell sufficiently small, the performance will approach the best possible.
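A minimal sketch of this posterior estimate, using a fixed-radius cell around a query point; the two-class Gaussian data and the cell radius are illustrative assumptions, not part of the original slides.

```python
import numpy as np

rng = np.random.default_rng(5)

# Two-class 1-D problem with equal priors: class 0 ~ N(-1, 1), class 1 ~ N(+1, 1).
n = 5000
labels = rng.integers(0, 2, size=n)
x_data = rng.standard_normal(n) + np.where(labels == 1, 1.0, -1.0)

def posterior_in_cell(x, radius):
    """Estimate P(omega_i | x) as k_i / k using the samples inside a cell of half-width `radius`."""
    in_cell = np.abs(x_data - x) <= radius
    k = in_cell.sum()
    if k == 0:
        return None                    # cell captured no samples; no estimate available
    k_i = np.bincount(labels[in_cell], minlength=2)
    return k_i / k                     # P_n(omega_i | x) = k_i / k

for x in (-1.0, 0.0, 1.0):
    est = posterior_in_cell(x, radius=0.2)
    print(f"x = {x:+.1f}   [P(omega_0|x), P(omega_1|x)] ≈ {est}")
```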

The Nearest-Neighbor Rule
Let Dn = {x1, x2, …, xn} be a set of n labeled prototypes, and let x′ ∈ Dn be the prototype closest to a test point x. The nearest-neighbor rule for classifying x is to assign it the label associated with x′.
The nearest-neighbor rule leads to an error rate greater than the minimum possible, the Bayes rate (see the textbook for the derivation). However, if the number of prototypes is large (unlimited), the error rate of the nearest-neighbor classifier is never worse than twice the Bayes rate.
As n → ∞, it is always possible to find an x′ sufficiently close to x so that:
  P(ωi | x′) ≈ P(ωi | x)
The rule partitions the feature space into a Voronoi tessellation, and the individual decision regions are called Voronoi cells.
For large data sets, this approach can be very effective, but it is not computationally efficient.
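A minimal sketch of the nearest-neighbor rule; the toy prototype set and test point are illustrative assumptions.

```python
import numpy as np

def nearest_neighbor_classify(x, prototypes, labels):
    """Assign x the label of the closest prototype (Euclidean distance)."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    return labels[np.argmin(dists)]

# Toy 2-D prototype set D_n with class labels.
prototypes = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
labels = np.array([0, 0, 1, 1])

x_test = np.array([0.8, 0.7])
print("predicted label:", nearest_neighbor_classify(x_test, prototypes, labels))
```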

The K-Nearest-Neighbor Rule
The k-nearest-neighbor rule is a straightforward modification of the nearest-neighbor rule that builds on the concept of majority voting: the query starts at the test point, x, and grows a spherical region until it encloses k training samples. The point is then labeled by a majority vote of the class assignments of those k samples.
For very large data sets, the performance approaches the Bayes error rate.
The computational complexity of this approach can be high. Each distance calculation is O(d), and thus the naive search is O(dn²). A parallel implementation exists that is O(1) in time and O(n) in space. Tree-structured searches can gain further efficiency, and the training set can be "pruned" to eliminate "useless" prototypes (complexity: O(d³ n^(d/2) ln n)).
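A minimal sketch of the k-nearest-neighbor rule with majority voting; the toy data, k = 3, and Euclidean distance are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def knn_classify(x, prototypes, labels, k=3):
    """Label x by a majority vote over its k nearest prototypes (Euclidean distance)."""
    dists = np.linalg.norm(prototypes - x, axis=1)   # each distance calculation is O(d)
    nearest = np.argsort(dists)[:k]                  # indices of the k nearest samples
    votes = Counter(labels[nearest].tolist())
    return votes.most_common(1)[0][0]

prototypes = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                       [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
labels = np.array([0, 0, 0, 1, 1, 1])

print("predicted label:", knn_classify(np.array([0.6, 0.6]), prototypes, labels, k=3))
```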

Properties of Metrics
A metric D(·, ·) must satisfy four properties:
  Nonnegativity: D(a, b) ≥ 0
  Reflexivity: D(a, b) = 0 if and only if a = b
  Symmetry: D(a, b) = D(b, a)
  Triangle inequality: D(a, b) + D(b, c) ≥ D(a, c)
Euclidean distance:
  D(a, b) = ( Σ(k=1..d) (ak − bk)² )^(1/2)
The Minkowski metric is a generalization of the Euclidean distance:
  Lk(a, b) = ( Σ(i=1..d) |ai − bi|^k )^(1/k)
and is often referred to as the Lk norm.
The L1 norm is often called the city block distance because it gives the shortest path between a and b, each segment of which is parallel to a coordinate axis.
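A minimal sketch of the Minkowski (Lk) metric, compared for a few values of k; the example vectors are illustrative assumptions.

```python
import numpy as np

def minkowski(a, b, k):
    """L_k metric: (sum_i |a_i - b_i|^k)^(1/k); k = 1 is city block, k = 2 is Euclidean."""
    return np.sum(np.abs(a - b) ** k) ** (1.0 / k)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

for k in (1, 2, 4):
    print(f"L_{k}(a, b) = {minkowski(a, b, k):.3f}")
# L_1 = 7 (city block), L_2 = 5 (Euclidean); larger k approaches the max-coordinate distance.
```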

Summary
Motivated nonparametric density estimation.
Introduced Parzen windows.
Introduced k-nearest-neighbor approaches.
Discussed properties of a good distance metric.