Outline: Maximum Likelihood, Maximum A-Posteriori (MAP) Estimation

Outline
- Maximum Likelihood
- Maximum A-Posteriori (MAP) Estimation
- Bayesian Parameter Estimation
- Example: The Gaussian Case
- Recursive Bayesian Incremental Learning
- Problems of Dimensionality
- Nonparametric Techniques
- Density Estimation
- Histogram Approach
- Parzen-window method

Bayes' Decision Rule (Minimizes the probability of error)
Choose w1 if P(w1|x) > P(w2|x); choose w2 otherwise. Equivalently: choose w1 if p(x|w1)P(w1) > p(x|w2)P(w2), and w2 otherwise. The conditional error is P(error|x) = min[P(w1|x), P(w2|x)].
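
A minimal sketch of this two-class rule in Python (not from the slides; the priors and class-conditional densities below are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

# Illustrative class-conditional densities and priors (assumed for this sketch).
priors = {"w1": 0.6, "w2": 0.4}
densities = {"w1": norm(loc=0.0, scale=1.0), "w2": norm(loc=2.0, scale=1.0)}

def bayes_decide(x):
    """Choose w1 if p(x|w1)P(w1) > p(x|w2)P(w2), otherwise w2."""
    g1 = densities["w1"].pdf(x) * priors["w1"]
    g2 = densities["w2"].pdf(x) * priors["w2"]
    return "w1" if g1 > g2 else "w2"

def prob_error(x):
    """P(error|x) = min[P(w1|x), P(w2|x)]."""
    g1 = densities["w1"].pdf(x) * priors["w1"]
    g2 = densities["w2"].pdf(x) * priors["w2"]
    return min(g1, g2) / (g1 + g2)   # normalize the smaller joint term

print(bayes_decide(0.7), prob_error(0.7))
```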

Normal Density - Multivariate Case
The general multivariate normal density (MND) in d dimensions is written as
p(x) = (2π)^{-d/2} |Σ|^{-1/2} exp[ -(1/2) (x - μ)^T Σ^{-1} (x - μ) ].
It can be shown that μ = E[x] and Σ = E[(x - μ)(x - μ)^T], which means for the components μ_i = E[x_i] and σ_ij = E[(x_i - μ_i)(x_j - μ_j)].

Maximum Likelihood and Bayesian Parameter Estimation
To design an optimal classifier we need P(wi) and p(x|wi), but usually we do not know them. Solution: use training data to estimate the unknown probabilities. Estimation of the class-conditional densities is a difficult task.

Maximum Likelihood and Bayesian Parameter Estimation
Supervised learning: we get to see samples from each of the classes "separately" (called tagged or labeled samples). Tagged samples are "expensive", so we need to learn the distributions as efficiently as possible. Two methods: parametric (easier) and non-parametric (harder).

Learning From Observed Data
(Figure: a 2x2 diagram contrasting hidden vs. observed variables and unsupervised vs. supervised learning.)

Maximum Likelihood and Bayesian Parameter Estimation
Program for parametric methods:
- Assume specific parametric distributions with parameters θ.
- Estimate the parameters from training data.
- Replace the true class-conditional density with its approximation and apply the Bayesian framework for decision making.

Maximum Likelihood and Bayesian Parameter Estimation
Suppose we can assume that the relevant (class-conditional) densities have some parametric form, that is, p(x|w) = p(x|θ), where θ is a parameter (vector). Examples of parameterized densities:
- Binomial: x(n) has m 1's and n-m 0's, with P(x|θ) = C(n,m) θ^m (1-θ)^{n-m}.
- Exponential: each data point x is distributed according to p(x|θ) = θ e^{-θx}, x ≥ 0.

Maximum Likelihood and Bayesian Parameter Estimation (cont.)
Two procedures for parameter estimation will be considered:
- Maximum likelihood estimation: choose the parameter value that makes the data most probable (i.e., maximizes the probability of obtaining the sample that has actually been observed).
- Bayesian learning: define a prior probability on the model space and compute the posterior. Additional samples sharpen the posterior density, which peaks near the true values of the parameters.

Sampling Model
It is assumed that a sample set D with n independently generated samples is available. The sample set is partitioned into separate sample sets for each class, D_1, ..., D_c. A generic sample set will simply be denoted by D. Each class-conditional density is assumed to have a known parametric form and is uniquely specified by a parameter (vector) θ. Samples in each set are assumed to be independent and identically distributed (i.i.d.) according to the true probability law p(x|θ).

Log-Likelihood Function and Score Function
The sample sets are assumed to be functionally independent, i.e., the training set D_j contains no information about θ_i for j ≠ i. The i.i.d. assumption implies that p(D|θ) = ∏_{k=1}^n p(x_k|θ). Let D = {x_1, ..., x_n} be a generic sample of size n. Log-likelihood function: l(θ) = ln p(D|θ) = Σ_{k=1}^n ln p(x_k|θ). The log-likelihood is numerically identical to the logarithm of the probability density of the sample, but it is interpreted as a function of the parameter θ for the given (fixed) sample.

Log-Likelihood Illustration
Assume that all the points in D are drawn from some (one-dimensional) normal distribution with known variance and unknown mean.

Log-Likelihood Function and Score Function (cont.)
Maximum likelihood estimator (MLE): θ̂ = argmax_θ l(θ) (tacitly assuming that such a maximum exists!). Score function: the gradient of the log-likelihood, ∇_θ l(θ) = Σ_{k=1}^n ∇_θ ln p(x_k|θ). Necessary condition for the MLE (if it is not on the border of the parameter domain): ∇_θ l(θ̂) = 0.
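
As a hedged illustration (not from the slides), the sketch below evaluates the log-likelihood of a one-dimensional Gaussian with known variance on a grid of candidate means and picks the maximizer; it should coincide with the sample mean derived in closed form on the following slides. The synthetic data and σ are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0                                        # known standard deviation (assumed)
data = rng.normal(loc=1.5, scale=sigma, size=50)   # synthetic i.i.d. sample

def log_likelihood(mu, x, sigma):
    """l(mu) = sum_k ln p(x_k | mu) for a Gaussian with known sigma."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu) ** 2 / (2 * sigma**2))

mus = np.linspace(-3, 5, 2001)                     # grid of candidate parameter values
ll = np.array([log_likelihood(m, data, sigma) for m in mus])
mu_mle_grid = mus[np.argmax(ll)]                   # numerical maximizer of l(mu)

print(mu_mle_grid, data.mean())                    # grid maximizer ~ sample mean
```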

Maximum A Posteriori
Maximum a posteriori (MAP): find the value of θ that maximizes l(θ) + ln p(θ), where p(θ) is a prior probability over the parameter values. A MAP estimator finds the peak, or mode, of the posterior. Drawback of MAP: after an arbitrary nonlinear transformation of the parameter space the density changes, and the MAP solution will no longer be correct (MAP is not invariant under reparameterization).

Maximum A-Posteriori (MAP) Estimation
The "most likely value" of the parameter is given by θ̂_MAP = argmax_θ p(θ|D).

Maximum A-Posteriori (MAP) Estimation
By Bayes' rule, p(θ|D) = p(D|θ) p(θ) / p(D) = [∏_{k=1}^n p(x_k|θ)] p(θ) / p(D), since the data are i.i.d. We can disregard the normalizing factor p(D) when looking for the maximum.

MAP - continued
So, the estimate we are looking for is θ̂_MAP = argmax_θ [∏_{k=1}^n p(x_k|θ)] p(θ) = argmax_θ [Σ_{k=1}^n ln p(x_k|θ) + ln p(θ)].

The Gaussian Case: Unknown Mean
Suppose that the samples are drawn from a multivariate normal population with mean μ and covariance matrix Σ. Consider first the case where only the mean μ is unknown. For a sample point x_k we have
ln p(x_k|μ) = -(1/2) ln[(2π)^d |Σ|] - (1/2) (x_k - μ)^T Σ^{-1} (x_k - μ)
and
∇_μ ln p(x_k|μ) = Σ^{-1} (x_k - μ).
The maximum likelihood estimate for μ must therefore satisfy Σ_{k=1}^n Σ^{-1} (x_k - μ̂) = 0.

The Gaussian Case: Unknown Mean
Multiplying by Σ and rearranging, we obtain μ̂ = (1/n) Σ_{k=1}^n x_k. The MLE for the unknown population mean is just the arithmetic average of the training samples (the sample mean). Geometrically, if we think of the n samples as a cloud of points, the sample mean is the centroid of the cloud.

The Gaussian Case: Unknown Mean and Covariance
In the general multivariate normal case, neither the mean nor the covariance matrix is known. Consider first the univariate case with θ_1 = μ and θ_2 = σ^2. The log-likelihood of a single point is
ln p(x_k|θ) = -(1/2) ln(2π θ_2) - (x_k - θ_1)^2 / (2 θ_2),
and its derivative is
∇_θ ln p(x_k|θ) = [ (x_k - θ_1)/θ_2 ,  -1/(2θ_2) + (x_k - θ_1)^2/(2 θ_2^2) ]^T.

The Gaussian Case: Unknown Mean and Covariance
Setting the gradient to zero and using all the sample points, we get the following necessary conditions:
Σ_{k=1}^n (x_k - θ̂_1)/θ̂_2 = 0   and   -Σ_{k=1}^n 1/(2θ̂_2) + Σ_{k=1}^n (x_k - θ̂_1)^2/(2θ̂_2^2) = 0,
where θ̂_1 and θ̂_2 are the MLE estimates for θ_1 and θ_2, respectively. Solving for μ̂ = θ̂_1 and σ̂^2 = θ̂_2, we obtain
μ̂ = (1/n) Σ_{k=1}^n x_k   and   σ̂^2 = (1/n) Σ_{k=1}^n (x_k - μ̂)^2.

The Gaussian Multivariate Case
For the multivariate case it is easy to show that the MLE estimates for μ and Σ are given by
μ̂ = (1/n) Σ_{k=1}^n x_k   and   Σ̂ = (1/n) Σ_{k=1}^n (x_k - μ̂)(x_k - μ̂)^T.
The MLE for the mean vector is the sample mean, and the MLE for the covariance matrix is the arithmetic average of the n matrices (x_k - μ̂)(x_k - μ̂)^T. The MLE for σ^2 is biased; that is, the expected value over all data sets of size n of the sample variance is not equal to the true variance:
E[ (1/n) Σ_{i=1}^n (x_i - x̄)^2 ] = ((n-1)/n) σ^2 ≠ σ^2.

The Gaussian Multivariate Case
Unbiased estimators for μ and Σ are given by
μ̂ = (1/n) Σ_{k=1}^n x_k   and   C = (1/(n-1)) Σ_{k=1}^n (x_k - μ̂)(x_k - μ̂)^T.
C is called the sample covariance matrix; it is absolutely unbiased. The MLE estimate Σ̂ is asymptotically unbiased.
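
As a numerical aside (not part of the slides), the sketch below contrasts the biased MLE covariance (divide by n) with the unbiased sample covariance C (divide by n-1); numpy.cov uses n-1 by default. The synthetic data are an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 3
X = rng.normal(size=(n, d))                      # n samples of dimension d

mu_hat = X.mean(axis=0)                          # MLE of the mean (sample mean)
diff = X - mu_hat
sigma_mle = diff.T @ diff / n                    # biased MLE: divide by n
C = diff.T @ diff / (n - 1)                      # unbiased sample covariance

print(np.allclose(C, np.cov(X, rowvar=False)))   # numpy.cov divides by n-1 by default
print(np.allclose(sigma_mle, C * (n - 1) / n))   # relation between the two estimators
```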

Bayesian Estimation: Class-Conditional Densities
The aim is to find the posteriors P(wi|x), knowing p(x|wi) and P(wi); but these are unknown. How do we find them? Given the sample D, we say that the aim is to find P(wi|x, D). Bayes' formula gives
P(wi|x, D) = p(x|wi, D) P(wi|D) / Σ_{j=1}^c p(x|wj, D) P(wj|D).
We use the information provided by the training samples to determine the class-conditional densities and the prior probabilities. Generally used assumptions: the priors are known or obtainable from a trivial calculation, so P(wi) = P(wi|D); and the training set can be separated into c subsets D_1, ..., D_c.
(Identities used: P(w, x, D) = p(x|w, D) P(w, D) = p(w|x, D) P(x, D); P(w, D) = p(w|D) P(D); P(x, D) = p(x) P(D).)

Bayesian Estimation: Class-Conditional Densities
The samples in D_j have no influence on p(x|wi, D_i) if j ≠ i. Thus we can write p(x|wi, D) = p(x|wi, D_i), and we have c separate problems of the form: use a set D of samples drawn independently according to a fixed but unknown probability distribution p(x) to determine p(x|D).

Bayesian Estimation: General Theory
Bayesian learning considers θ (the parameter vector to be estimated) to be a random variable. Before we observe the data, the parameters are described by a prior p(θ), which is typically very broad. Once we have observed the data, we can use Bayes' formula to find the posterior p(θ|D). Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior. This is Bayesian learning (see figure).

General Theory (cont.)
Density function for x given the training data set D: p(x|D). From the definition of conditional probability densities, p(x, θ|D) = p(x|θ, D) p(θ|D). The first factor is independent of D, since it is just our assumed form for the parameterized density: p(x|θ, D) = p(x|θ). Therefore
p(x|D) = ∫ p(x|θ) p(θ|D) dθ.
Instead of choosing a specific value for θ, the Bayesian approach performs a weighted average over all values of θ. The weighting factor p(θ|D), the posterior of θ, is determined by starting from some assumed prior p(θ).

General Theory (cont.)
Then update it using Bayes' formula to take account of the data set D. Since the samples x_1, ..., x_n are drawn independently,
p(D|θ) = ∏_{k=1}^n p(x_k|θ),
which is the likelihood function. The posterior for θ is
p(θ|D) = p(D|θ) p(θ) / p(D),
where the normalization factor is p(D) = ∫ p(D|θ) p(θ) dθ.

Bayesian Learning – Univariate Normal Distribution
Let us use the Bayesian estimation technique to calculate the a posteriori density p(μ|D) and the desired probability density p(x|D) for the case p(x|μ) ~ N(μ, σ^2). Univariate case: let μ be the only unknown parameter; the variance σ^2 is assumed known.

Bayesian Learning – Univariate Normal Distribution
Prior probability: a normal distribution over μ, p(μ) ~ N(μ_0, σ_0^2). μ_0 encodes our prior knowledge about the true mean, while σ_0^2 measures our prior uncertainty. If μ is drawn from p(μ), then the density for x is completely determined. Letting D = {x_1, ..., x_n}, we use Bayes' formula:
p(μ|D) = α p(D|μ) p(μ) = α ∏_{k=1}^n p(x_k|μ) p(μ),
where α is a normalization factor that does not depend on μ.

Bayesian Learning – Univariate Normal Distribution
Computing the posterior distribution p(μ|D): substituting the normal forms for p(x_k|μ) and p(μ) and collecting the terms in μ gives an expression of the form
p(μ|D) = α'' exp[ -(1/2) ( (n/σ^2 + 1/σ_0^2) μ^2 - 2 ( (1/σ^2) Σ_{k=1}^n x_k + μ_0/σ_0^2 ) μ ) ].

Bayesian Learning – Univariate Normal Distribution
Here factors that do not depend on μ have been absorbed into the constants α, α', and α''. The posterior p(μ|D) is an exponential of a quadratic function of μ, i.e., it is a normal density, and it remains normal for any number of training samples. If we write p(μ|D) ~ N(μ_n, σ_n^2), then identifying the coefficients we get
1/σ_n^2 = n/σ^2 + 1/σ_0^2   and   μ_n/σ_n^2 = (n/σ^2) x̄_n + μ_0/σ_0^2,

Bayesian Learning – Univariate Normal Distribution
where x̄_n = (1/n) Σ_{k=1}^n x_k is the sample mean. Solving explicitly for μ_n and σ_n^2, we obtain
μ_n = ( n σ_0^2 / (n σ_0^2 + σ^2) ) x̄_n + ( σ^2 / (n σ_0^2 + σ^2) ) μ_0   and   σ_n^2 = σ_0^2 σ^2 / (n σ_0^2 + σ^2).
μ_n represents our best guess for μ after observing n samples, and σ_n^2 measures our uncertainty about this guess. σ_n^2 decreases monotonically with n, approaching σ^2/n as n approaches infinity.
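
A small sketch of these update formulas (the prior hyperparameters μ0, σ0 and the synthetic data are assumptions, not values from the lecture):

```python
import numpy as np

def gaussian_posterior(x, sigma, mu0, sigma0):
    """Posterior N(mu_n, sigma_n^2) over the mean, for known variance sigma^2."""
    n = len(x)
    xbar = np.mean(x)
    mu_n = (n * sigma0**2 / (n * sigma0**2 + sigma**2)) * xbar \
         + (sigma**2 / (n * sigma0**2 + sigma**2)) * mu0
    var_n = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)
    return mu_n, var_n

rng = np.random.default_rng(2)
x = rng.normal(loc=2.0, scale=1.0, size=10)      # synthetic data with sigma = 1
for n in (1, 5, 10):
    mu_n, var_n = gaussian_posterior(x[:n], sigma=1.0, mu0=0.0, sigma0=1.0)
    print(n, round(mu_n, 3), round(var_n, 4))    # var_n shrinks as n grows
```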

Bayesian Learning – Univariate Normal Distribution
Each additional observation decreases our uncertainty about the true value of μ. As n increases, p(μ|D) becomes more and more sharply peaked, approaching a Dirac delta function as n approaches infinity. This behavior is known as Bayesian learning.

Bayesian Learning – Univariate Normal Distribution
In general, μ_n is a linear combination of x̄_n and μ_0, with coefficients that are non-negative and sum to 1; thus μ_n lies somewhere between x̄_n and μ_0. If σ_0 ≠ 0, then μ_n → x̄_n as n → ∞. If σ_0 = 0, our a priori certainty that μ = μ_0 is so strong that no number of observations can change our opinion. If σ_0 >> σ, our a priori guess is very uncertain, and we take μ_n ≈ x̄_n. The ratio σ^2/σ_0^2 is called the dogmatism.

Bayesian Learning – Univariate Normal Distribution
The univariate case: the predictive density is
p(x|D) = ∫ p(x|μ) p(μ|D) dμ ~ N(μ_n, σ^2 + σ_n^2),
where μ_n and σ_n^2 are given above.

Bayesian Learning – Univariate Normal Distribution
Since p(x|D) ~ N(μ_n, σ^2 + σ_n^2), to obtain the class-conditional probability p(x|w_i, D_i), whose parametric form is known to be p(x|μ) ~ N(μ, σ^2), we simply replace μ by μ_n and σ^2 by σ^2 + σ_n^2. The conditional mean μ_n is treated as if it were the true mean, and the known variance is increased to account for the additional uncertainty in x resulting from our lack of exact knowledge of the mean μ.

Example (demo-MAP)
We have N points generated by a one-dimensional Gaussian N(μ, σ^2). Since we think that the mean should not be very big, we use as a prior p(μ) ~ N(0, ν^2), where ν is a hyperparameter. The total objective function is
l(μ) + ln p(μ) = Σ_{k=1}^N ln p(x_k|μ) + ln p(μ),
which is maximized to give
μ̂_MAP = Σ_{k=1}^N x_k / (N + σ^2/ν^2).
For ν >> σ the influence of the prior is negligible and the result is the ML estimate. But for a very strong belief in the prior (small ν) the estimate tends to zero. Thus, if few data are available, the prior will bias the estimate towards the prior expected value.
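
A hedged sketch of this demo (the data, σ, and the hyperparameter values for ν below are assumptions): with the zero-mean Gaussian prior, the MAP estimate is the sample sum divided by N + σ^2/ν^2, so it shrinks toward zero when the prior is tight or the data are few.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 1.0
x = rng.normal(loc=2.0, scale=sigma, size=5)        # only a few data points

mu_ml = x.mean()                                    # maximum likelihood estimate
for nu in (0.1, 1.0, 10.0):                         # prior std-dev (hyperparameter)
    mu_map = x.sum() / (len(x) + sigma**2 / nu**2)  # MAP with N(0, nu^2) prior
    print(f"nu={nu:5.1f}  ML={mu_ml:.3f}  MAP={mu_map:.3f}")
# Small nu: strong prior, MAP is pulled toward 0. Large nu: MAP ~ ML.
```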

Recursive Bayesian Incremental Learning
We have seen that p(D|θ) = ∏_{k=1}^n p(x_k|θ). Let us define D^n = {x_1, ..., x_n} (with D^0 the empty set). Then p(D^n|θ) = p(x_n|θ) p(D^{n-1}|θ). Substituting into p(θ|D^n) = p(D^n|θ) p(θ) / p(D^n) and using Bayes' formula, we have, finally,
p(θ|D^n) = p(x_n|θ) p(θ|D^{n-1}) / ∫ p(x_n|θ) p(θ|D^{n-1}) dθ.

Recursive Bayesian Incremental Learning
Repeated use of this equation produces the sequence of densities p(θ), p(θ|x_1), p(θ|x_1, x_2), and so on. This is called the recursive Bayes approach to parameter estimation (also incremental or on-line learning). When this sequence of densities converges to a Dirac delta function centered about the true parameter value, we have Bayesian learning.
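
A sketch of the recursion on a discretized parameter grid (the model, prior, and data are assumptions, and the grid approximation is an illustration, not part of the lecture): each new sample multiplies the current posterior by its likelihood and renormalizes.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
data = rng.normal(loc=1.0, scale=1.0, size=30)       # samples arriving one at a time

theta = np.linspace(-4, 4, 801)                      # grid over the unknown mean
dtheta = theta[1] - theta[0]
posterior = norm(loc=0.0, scale=2.0).pdf(theta)      # broad prior p(theta)
posterior /= posterior.sum() * dtheta

for x_n in data:                                     # recursive Bayes update
    posterior *= norm(loc=theta, scale=1.0).pdf(x_n) # multiply by p(x_n | theta)
    posterior /= posterior.sum() * dtheta            # renormalize

print(theta[np.argmax(posterior)], data.mean())      # posterior peaks near the sample mean
```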

Maximum Likelihood vs. Bayesian
ML and Bayesian estimation are asymptotically equivalent and "consistent": they yield the same class-conditional densities when the size of the training data grows to infinity. ML is typically computationally easier: ML requires (multidimensional) differentiation, Bayesian estimation (multidimensional) integration. ML is often easier to interpret: it returns the single best model (parameter), whereas Bayesian estimation gives a weighted average of models. But for finite training data (and given a reliable prior) Bayesian estimation is more accurate (it uses more of the information). Bayesian estimation with a "flat" prior is essentially ML; with asymmetric and broad priors the methods lead to different solutions.

Problems of Dimensionality: Accuracy, Dimension, and Training Sample Size
Consider two-class multivariate normal distributions with the same covariance Σ. If the priors are equal, the Bayes error rate is given by
P(e) = (1/√(2π)) ∫_{r/2}^{∞} e^{-u^2/2} du,
where r^2 is the squared Mahalanobis distance between the class means:
r^2 = (μ_1 - μ_2)^T Σ^{-1} (μ_1 - μ_2).
Thus the probability of error decreases as r increases. In the conditionally independent case, Σ = diag(σ_1^2, ..., σ_d^2) and
r^2 = Σ_{i=1}^d ( (μ_{i1} - μ_{i2}) / σ_i )^2.
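
A sketch that evaluates these formulas numerically under the equal-prior, equal-covariance assumption above (the class means and shared covariance below are illustrative assumptions); the tail integral is computed with the standard normal survival function.

```python
import numpy as np
from scipy.stats import norm

mu1 = np.array([0.0, 0.0])
mu2 = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])           # shared covariance (assumed)

diff = mu1 - mu2
r2 = diff @ np.linalg.solve(Sigma, diff)             # squared Mahalanobis distance
r = np.sqrt(r2)
p_error = norm.sf(r / 2)                             # tail integral from r/2 to infinity

print(r, p_error)                                    # the error drops as r grows
```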

Problems of Dimensionality
While classification accuracy can improve as the dimensionality (and the amount of training data) grows, beyond a certain point the inclusion of additional features leads to worse rather than better performance: the computational complexity grows and the problem of overfitting arises.

Occam's Razor
"Pluralitas non est ponenda sine neccesitate," or "plurality should not be posited without necessity." The words are those of the medieval English philosopher and Franciscan monk William of Occam (ca. 1285-1349). Decisions based on overly complex models often lead to lower accuracy of the classifier.

Outline
- Nonparametric Techniques
  - Density Estimation
  - Histogram Approach
  - Parzen-window method
  - k_n-Nearest-Neighbor Estimation
- Component Analysis and Discriminants
  - Principal Components Analysis
  - Fisher Linear Discriminant
  - MDA

NONPARAMETRIC TECHNIQUES
So far we have treated supervised learning under the assumption that the forms of the underlying density functions are known. However, the common parametric forms rarely fit the densities actually encountered in practice; in particular, classical parametric densities are unimodal, whereas many practical problems involve multimodal densities. We now examine nonparametric procedures that can be used with arbitrary distributions and without the assumption that the forms of the underlying densities are known.

NONPARAMETRIC TECHNIQUES
There are several types of nonparametric methods:
- Procedures for estimating the density functions from sample patterns. If these estimates are satisfactory, they can be substituted for the true densities when designing the classifier.
- Procedures for directly estimating the a posteriori probabilities.
- The nearest-neighbor rule, which bypasses probability estimation and goes directly to decision functions.

Histogram Approach
The conceptually simplest method of estimating a p.d.f. is the histogram. The range of each component x_s of the vector x is divided into a fixed number m of equal intervals. The resulting boxes (bins), of identical volume V, are then inspected, and the number of points falling into each bin is counted. Suppose we have n_i samples x_j, j = 1, ..., n_i, from class w_i, and let the number of sample points in the j-th bin b_j be k_j. The histogram estimate p̂(x) of the density function

Histogram Approach
is defined as p̂(x) = k_j / (n_i V) for x in bin b_j, so p̂(x) is constant over every bin b_j. Let us verify that p̂ is a density function:
∫ p̂(x) dx = Σ_j [k_j / (n_i V)] V = (1/n_i) Σ_j k_j = 1.
We can choose the number m of bins and their starting points. The choice of starting points is not critical, but m is important: it plays the role of a smoothing parameter. Too large an m makes the histogram spiky; with too small an m we lose the true form of the density function.

The Histogram Method: Example
Assume (one-dimensional) data: some points were sampled from a combination of two Gaussians. (Figure: histogram estimate with 3 bins.)

The Histogram Method: Example
(Figures: histogram estimates of the same data with 7 bins and with 11 bins.)
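
A sketch reproducing this kind of experiment (the two-Gaussian mixture and the bin counts are assumptions): numpy.histogram with density=True returns exactly the k_j/(n·V) estimate, so the effect of the smoothing parameter m can be seen directly.

```python
import numpy as np

rng = np.random.default_rng(5)
# Mixture of two univariate Gaussians (illustrative parameters).
data = np.concatenate([rng.normal(-2.0, 0.8, 300), rng.normal(1.5, 0.6, 200)])

for m in (3, 7, 11):                                            # number of bins (smoothing parameter)
    heights, edges = np.histogram(data, bins=m, density=True)   # heights are k_j / (n * V)
    print(m, np.round(heights, 3))
# Few bins oversmooth the two modes; many bins make the estimate spiky.
```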

Histogram Approach
The histogram p.d.f. estimator is very effective, and we can compute it online: all we need to do is update the counters k_j at run time, so we do not need to keep all the data, which could be huge. But its usefulness is limited to low-dimensional vectors x, because the number of bins N_b grows exponentially with the dimensionality d: N_b = m^d. This is the so-called "curse of dimensionality".

DENSITY ESTIMATION
To estimate the density at x, we form a sequence of regions R_1, R_2, .... The probability for x to fall into a region R is P = ∫_R p(x') dx'. Suppose we have n i.i.d. samples x_1, ..., x_n drawn according to p(x). The probability that exactly k of them fall in R is
P_k = C(n, k) P^k (1 - P)^{n-k},
and the expected value for k is E[k] = nP, with variance Var[k] = nP(1 - P). The fraction of samples falling into R, k/n, is also a random variable, with E[k/n] = P and Var[k/n] = P(1 - P)/n. As n grows, the variance shrinks and k/n becomes a better estimator for P.

DENSITY ESTIMATION
P_k peaks sharply about its mean, so k/n is a good estimate of P. For a small enough region R, P = ∫_R p(x') dx' ≈ p(x) V, where x is within R and V is the volume enclosed by R. Thus
p(x) ≈ k / (n V).   (*)

Three Conditions for DENSITY ESTIMATION
Let us take a growing sequence of samples, n = 1, 2, 3, .... We take regions R_n with decreasing volumes V_1 > V_2 > V_3 > .... Let k_n be the number of samples falling in R_n, and let p_n(x) = k_n / (n V_n) be the nth estimate for p(x). If p_n(x) is to converge to p(x), three conditions are required:
- lim_{n→∞} V_n = 0: the resolution should be as fine as possible (to reduce smoothing).
- lim_{n→∞} k_n = ∞: otherwise the region R_n will not contain an infinite number of points, k_n/n will not converge to P, and we will get p(x) = 0.
- lim_{n→∞} k_n/n = 0: to guarantee convergence of (*).

Parzen Window and KNN
How do we obtain the sequence R_1, R_2, ...? There are two common approaches to obtaining sequences of regions that satisfy the above conditions:
- Shrink an initial region by specifying the volume V_n as some function of n, such as V_n = V_1/√n, and show that k_n and k_n/n behave properly, i.e., that p_n(x) converges to p(x). This is the Parzen-window (or kernel) method.
- Specify k_n as some function of n, such as k_n = √n. Here the volume V_n is grown until it encloses k_n neighbors of x. This is the k_n-nearest-neighbor method (sketched in code below).
Both of these methods do converge, although it is difficult to make meaningful statements about their finite-sample behavior.
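
A hedged sketch of the k_n-nearest-neighbor estimate in one dimension, with k_n = √n and synthetic N(0,1) data (both assumptions): the interval around x is grown until it contains k_n samples, and p_n(x) = k_n/(n·V_n).

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.normal(loc=0.0, scale=1.0, size=400)
n = len(data)
k = int(np.sqrt(n))                                   # k_n = sqrt(n)

def knn_density(x):
    """Grow an interval around x until it holds k samples; V = twice the k-th distance."""
    dists = np.sort(np.abs(data - x))
    volume = 2.0 * dists[k - 1]                       # length of the enclosing interval
    return k / (n * volume)

for x in (-2.0, 0.0, 2.0):
    print(x, round(knn_density(x), 3))                # should roughly track N(0, 1)
```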

PARZEN WINDOWS
Assume that the region R_n is a d-dimensional hypercube. If h_n is the length of an edge of that hypercube, then its volume is given by V_n = h_n^d. Define the following window function:
φ(u) = 1 if |u_j| ≤ 1/2 for j = 1, ..., d, and 0 otherwise;
φ(u) defines a unit hypercube centered at the origin. Consequently φ((x - x_i)/h_n) = 1 if x_i falls within the hypercube of volume V_n centered at x, and is zero otherwise. The number of samples in this hypercube is given by
k_n = Σ_{i=1}^n φ((x - x_i)/h_n).

PARZEN WINDOWS (cont.)
Since p_n(x) = k_n / (n V_n), we obtain
p_n(x) = (1/n) Σ_{i=1}^n (1/V_n) φ((x - x_i)/h_n).
Rather than limiting ourselves to the hypercube window, we can use a more general class of window functions. Thus p_n(x) is an average of functions of x and the samples x_i: the window function is being used for interpolation, and each sample contributes to the estimate in accordance with its distance from x. p_n(x) must be nonnegative and integrate to 1.

PARZEN WINDOWS (cont.)
This can be assured by requiring the window function itself to be a density function, i.e., φ(u) ≥ 0 and ∫ φ(u) du = 1.
Effect of the window size h_n on p_n(x): define the function δ_n(x) = (1/V_n) φ(x/h_n); then we can write p_n(x) as the average
p_n(x) = (1/n) Σ_{i=1}^n δ_n(x - x_i).
Since V_n = h_n^d, h_n affects both the amplitude and the width of δ_n(x).

PARZEN WINDOWS (cont.)
(Figure: examples of two-dimensional, circularly symmetric normal Parzen windows for three different values of h.)
If h_n is very large, the amplitude of δ_n is small, and x must be far from x_i before δ_n(x - x_i) changes much from δ_n(0).

PARZEN WINDOWS (cont.)
In this case p_n(x) is the superposition of n broad, slowly varying functions and is a very smooth, "out-of-focus" estimate of p(x). If h_n is very small, the peak value of δ_n(x - x_i) is large and occurs near x = x_i. In this case p_n(x) is the superposition of n sharp pulses centered at the samples: an erratic, "noisy" estimate. As h_n approaches zero, δ_n(x - x_i) approaches a Dirac delta function centered at x_i, and p_n(x) approaches a superposition of delta functions centered at the samples.

PARZEN WINDOWS (cont.)
(Figure: three Parzen-window density estimates based on the same set of five samples, using the windows from the previous figure.)
The choice of h_n (or V_n) has an important effect on p_n(x). If V_n is too large, the estimate suffers from too little resolution; if V_n is too small, it suffers from too much statistical variability. With a limited number of samples, we must seek some acceptable compromise.

PARZEN WINDOWS (cont.)
If we have an unlimited number of samples, we can let V_n slowly approach zero as n increases and have p_n(x) converge to the unknown density p(x).
Example 1: p(x) is a zero-mean, unit-variance, univariate normal density. Let the window function be of the same form, φ(u) = (1/√(2π)) e^{-u^2/2}, and let h_n = h_1/√n, where h_1 is a parameter. Then p_n(x) is an average of normal densities centered at the samples:
p_n(x) = (1/n) Σ_{i=1}^n (1/h_n) φ((x - x_i)/h_n).
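
A sketch of this estimator on synthetic data (the sample size and the h_1 values are assumptions): the estimate is an average of normal windows centered at the samples, with h_n = h_1/√n.

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(loc=0.0, scale=1.0, size=100)       # samples from N(0, 1)

def parzen_gaussian(x, samples, h1):
    """p_n(x) = (1/n) * sum_i (1/h_n) * phi((x - x_i)/h_n) with a Gaussian window."""
    h_n = h1 / np.sqrt(len(samples))
    u = (x[:, None] - samples[None, :]) / h_n
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return phi.mean(axis=1) / h_n

xs = np.linspace(-4, 4, 9)
for h1 in (0.25, 1.0, 4.0):                           # window-width parameter
    print(h1, np.round(parzen_gaussian(xs, data, h1), 3))
# Small h1: spiky, noisy estimate. Large h1: oversmoothed, "out-of-focus" estimate.
```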