1 Outline
Maximum Likelihood
Maximum A-Posteriori (MAP) Estimation
Bayesian Parameter Estimation
Example: The Gaussian Case
Recursive Bayesian Incremental Learning
Problems of Dimensionality
Nonparametric Techniques
Density Estimation
Histogram Approach
Parzen-window method

2 Bayes' Decision Rule (Minimizes the probability of error)
Choose w1 if P(w1|x) > P(w2|x); otherwise choose w2. Equivalently, choose w1 if p(x|w1)P(w1) > p(x|w2)P(w2), and w2 otherwise. The resulting conditional error is P(error|x) = min[P(w1|x), P(w2|x)].
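Not part of the original slides: a minimal Python sketch of this two-class rule. The specific densities and priors (1-D Gaussians with P(w1) = 0.6) are illustrative assumptions only.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Assumed example setup: two classes with known 1-D Gaussian
# class-conditional densities and known priors.
P1, P2 = 0.6, 0.4
mu1, mu2, sigma = 0.0, 2.0, 1.0

def bayes_decision(x):
    """Choose w1 if p(x|w1)P(w1) > p(x|w2)P(w2), otherwise w2."""
    g1 = gauss_pdf(x, mu1, sigma) * P1
    g2 = gauss_pdf(x, mu2, sigma) * P2
    return 1 if g1 > g2 else 2

def error_given_x(x):
    """P(error|x) = min[P(w1|x), P(w2|x)]."""
    g1 = gauss_pdf(x, mu1, sigma) * P1
    g2 = gauss_pdf(x, mu2, sigma) * P2
    return min(g1, g2) / (g1 + g2)

print(bayes_decision(0.5), error_given_x(0.5))
```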

3 Normal Density - Multivariate Case
The general multivariate normal density (MND) in d dimensions is written as p(x) = (2π)^{-d/2} |Σ|^{-1/2} exp[ -(1/2)(x - μ)^T Σ^{-1} (x - μ) ]. It can be shown that E[x] = μ and E[(x - μ)(x - μ)^T] = Σ, which means for the components E[x_i] = μ_i and σ_{ij} = E[(x_i - μ_i)(x_j - μ_j)].

4 Maximum Likelihood and Bayesian Parameter Estimation
To design an optimal classifier we need P(wi) and p(x|wi), but usually we do not know them. The solution is to use training data to estimate the unknown probabilities. Estimation of the class-conditional densities is a difficult task.

5 Maximum Likelihood and Bayesian Parameter Estimation
Supervised learning: we get to see samples from each of the classes "separately" (called tagged or labeled samples). Tagged samples are "expensive", so we need to learn the distributions as efficiently as possible. Two methods: parametric (easier) and non-parametric (harder).

6 Learning From Observed Data
(Figure: learning from observed data — hidden vs. observed labels, unsupervised vs. supervised learning.)

7 Maximum Likelihood and Bayesian Parameter Estimation
Program for parametric methods: assume specific parametric distributions p(x|wi, θi) with parameters θi; estimate the parameters θi from the training data; replace the true class-conditional density with this approximation and apply the Bayesian framework for decision making.

8 Maximum Likelihood and Bayesian Parameter Estimation
Suppose we can assume that the relevant (class-conditional) densities are of some parametric form. That is, p(x|w) = p(x|θ), where θ is a parameter (vector). Examples of parameterized densities: Binomial — a sequence x(n) with m ones and n-m zeros has probability p(x(n)|θ) = θ^m (1-θ)^{n-m}. Exponential — each data point x ≥ 0 is distributed according to p(x|θ) = θ e^{-θx}.

9 Maximum Likelihood and Bayesian Parameter Estimation cont.
Two procedures for parameter estimation will be considered. Maximum likelihood estimation: choose the parameter value that makes the data most probable (i.e., maximizes the probability of obtaining the sample that has actually been observed). Bayesian learning: define a prior probability on the model space and compute the posterior p(θ|D). Additional samples sharpen the posterior density, which peaks near the true values of the parameters.

10 Sampling Model
It is assumed that a sample set D with n independently generated samples is available. The sample set is partitioned into separate sample sets D1, ..., Dc, one for each class. A generic sample set will simply be denoted by D. Each class-conditional density is assumed to have a known parametric form and is uniquely specified by a parameter (vector) θi. Samples in each set are assumed to be independent and identically distributed (i.i.d.) according to the true probability law p(x|wi, θi).

11 Log-Likelihood function and Score Function
The sample sets are assumed to be functionally independent, i.e., the training set Dj contains no information about θi for j ≠ i. The i.i.d. assumption implies that p(D|θ) = prod_{k=1}^{n} p(x_k|θ). Let D = {x1, ..., xn} be a generic sample of size n. Log-likelihood function: l(θ) = ln p(D|θ) = sum_{k=1}^{n} ln p(x_k|θ). The log-likelihood is numerically identical to the logarithm of the probability density of the sample, but is interpreted as a function of the parameter θ for the given sample D.

12 Log-Likelihood Illustration
Assume that all the points in D are drawn from some (one-dimensional) normal distribution with a known variance and an unknown mean.

13 Log-Likelihood function and Score Function cont.
Maximum likelihood estimator (MLE): θ̂ = argmax_θ l(θ) (tacitly assuming that such a maximum exists!). Score function: ∇_θ l(θ) = ∇_θ ln p(D|θ), and hence ∇_θ l(θ) = sum_{k=1}^{n} ∇_θ ln p(x_k|θ). Necessary condition for the MLE (if it is not on the border of the domain): ∇_θ l(θ) = 0.
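As an illustration (not from the original slides), a minimal Python sketch, assuming a 1-D Gaussian with known standard deviation and synthetic data, showing that the zero of the score function coincides with the closed-form maximum of the log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0                                          # known standard deviation (assumed)
data = rng.normal(loc=3.0, scale=sigma, size=200)    # synthetic sample

def log_likelihood(mu):
    # l(mu) = sum_k ln p(x_k | mu) for a Gaussian with known sigma
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (data - mu) ** 2 / (2 * sigma**2))

def score(mu):
    # d l(mu) / d mu = sum_k (x_k - mu) / sigma^2
    return np.sum(data - mu) / sigma**2

mu_grid = np.linspace(0, 6, 601)
mu_numeric = mu_grid[np.argmax([log_likelihood(m) for m in mu_grid])]
mu_closed = data.mean()          # the zero of the score function

print(mu_numeric, mu_closed, score(mu_closed))   # score ~ 0 at the MLE
```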

14 Maximum A-Posteriori (MAP) Estimation
Maximum a posteriori (MAP): find the value of θ that maximizes l(θ) + ln p(θ), where p(θ) is a prior probability of the different parameter values. A MAP estimator finds the peak, or mode, of the posterior. Drawback of MAP: after an arbitrary nonlinear transformation of the parameter space, the density changes, and the MAP solution is no longer the same point (unlike the MLE, it is not invariant to reparameterization).

15 Maximum A-Posteriori (MAP) Estimation
The "most likely value" of θ is given by θ̂_MAP = argmax_θ p(θ|D).

16 Maximum A-Posteriori (MAP) Estimation
p(θ|D) = p(D|θ)p(θ)/p(D) = p(θ) prod_{k=1}^{n} p(x_k|θ) / p(D), since the data are i.i.d. We can disregard the normalizing factor p(D) when looking for the maximum.

17 MAP - continued
So, the estimate we are looking for is θ̂_MAP = argmax_θ [ ln p(θ) + sum_{k=1}^{n} ln p(x_k|θ) ].

18 The Gaussian Case: Unknown Mean
Suppose that the samples are drawn from a multivariate normal population with mean μ and covariance matrix Σ. Consider first the case where only the mean is unknown. For a sample point x_k, we have ln p(x_k|μ) = -(1/2) ln[(2π)^d |Σ|] - (1/2)(x_k - μ)^T Σ^{-1} (x_k - μ) and ∇_μ ln p(x_k|μ) = Σ^{-1}(x_k - μ). The maximum likelihood estimate for μ must satisfy sum_{k=1}^{n} Σ^{-1}(x_k - μ̂) = 0.

19 The Gaussian Case: Unknown Mean
Multiplying by Σ and rearranging, we obtain μ̂ = (1/n) sum_{k=1}^{n} x_k. The MLE of the unknown population mean is just the arithmetic average of the training samples (the sample mean). Geometrically, if we think of the n samples as a cloud of points, the sample mean is the centroid of the cloud.

20 The Gaussian Case: Unknown Mean and Covariance
In the general multivariate normal case, neither the mean μ nor the covariance matrix Σ is known. Consider first the univariate case with θ1 = μ and θ2 = σ^2. The log-likelihood of a single point is ln p(x_k|θ) = -(1/2) ln(2π θ2) - (x_k - θ1)^2 / (2 θ2), and its gradient is ∇_θ ln p(x_k|θ) = [ (x_k - θ1)/θ2 , -1/(2θ2) + (x_k - θ1)^2/(2 θ2^2) ]^T.

21 The Gaussian Case: Unknown Mean and Covariance
Setting the gradient to zero and summing over all the sample points, we get the following necessary conditions: sum_{k=1}^{n} (x_k - θ̂1)/θ̂2 = 0 and -sum_{k=1}^{n} 1/θ̂2 + sum_{k=1}^{n} (x_k - θ̂1)^2/θ̂2^2 = 0, where θ̂1 and θ̂2 are the MLE estimates for μ and σ^2, respectively. Solving for θ̂1 and θ̂2, we obtain μ̂ = (1/n) sum_{k=1}^{n} x_k and σ̂^2 = (1/n) sum_{k=1}^{n} (x_k - μ̂)^2.

22 The Gaussian multivariate case
For the multivariate case, it is easy to show that the MLE estimates for μ and Σ are given by μ̂ = (1/n) sum_{k=1}^{n} x_k and Σ̂ = (1/n) sum_{k=1}^{n} (x_k - μ̂)(x_k - μ̂)^T. The MLE of the mean vector is the sample mean, and the MLE of the covariance matrix is the arithmetic average of the n matrices (x_k - μ̂)(x_k - μ̂)^T. The MLE of σ^2 is biased, i.e., the expected value of the sample variance over all data sets of size n is not equal to the true variance: E[σ̂^2] = ((n-1)/n) σ^2 ≠ σ^2.

23 The Gaussian multivariate case
Unbiased estimators for μ and Σ are given by μ̂ = (1/n) sum_{k=1}^{n} x_k and C = (1/(n-1)) sum_{k=1}^{n} (x_k - μ̂)(x_k - μ̂)^T. C is called the sample covariance matrix. C is absolutely unbiased, while the MLE Σ̂ is only asymptotically unbiased.
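A short sketch (synthetic data; the true mean and covariance are assumed values) contrasting the biased MLE covariance (divide by n) with the unbiased sample covariance C (divide by n-1):

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true = np.array([1.0, -2.0])                       # assumed, for illustration
Sigma_true = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=50)   # n x d samples

n = X.shape[0]
mu_hat = X.mean(axis=0)                    # MLE of the mean (sample mean)
diff = X - mu_hat
Sigma_mle = diff.T @ diff / n              # biased MLE of the covariance
C = diff.T @ diff / (n - 1)                # unbiased sample covariance

print(mu_hat)
print(Sigma_mle)
print(C)        # matches np.cov(X, rowvar=False), which divides by n-1
```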

24 Bayesian Estimation: Class-Conditional Densities
The aim is to find the posteriors P(wi|x), knowing p(x|wi) and P(wi); but these are unknown. How can we find them? Given the sample D, we say that the aim is to find P(wi|x, D). Bayes' formula gives P(wi|x, D) = p(x|wi, D) P(wi|D) / sum_{j=1}^{c} p(x|wj, D) P(wj|D). We use the information provided by the training samples to determine the class-conditional densities and the prior probabilities. Generally used assumptions: the priors are known or obtainable from a trivial calculation, so P(wi) = P(wi|D); the training set can be separated into c subsets D1, ..., Dc. (Derivation aside: P(w,x,D) = p(x|w,D)P(w,D) = p(w|x,D)P(x,D); P(w,D) = p(w|D)P(D); P(x,D) = p(x)P(D).)

25 Bayesian Estimation: Class-Conditional Densities
The samples in Dj have no influence on p(x|wi, Di) if j ≠ i. Thus we can write p(x|wi, D) = p(x|wi, Di). We have c separate problems of the form: use a set D of samples drawn independently according to a fixed but unknown probability distribution p(x) to determine p(x|D).

26 Bayesian Estimation: General Theory
Bayesian learning considers θ (the parameter vector to be estimated) to be a random variable. Before we observe the data, the parameters are described by a prior p(θ), which is typically very broad. Once we have observed the data, we can use Bayes' formula to find the posterior p(θ|D). Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior. This is Bayesian learning (see figure).

27 General Theory cont.
Density function for x given the training data set D: p(x|D). From the definition of conditional probability densities, p(x, θ|D) = p(x|θ, D) p(θ|D). The first factor is independent of D, since it is just our assumed form of the parameterized density: p(x|θ, D) = p(x|θ). Therefore p(x|D) = ∫ p(x|θ) p(θ|D) dθ. Instead of choosing a specific value for θ, the Bayesian approach performs a weighted average over all values of θ. The weighting factor p(θ|D), which is the posterior of θ, is determined by starting from some assumed prior p(θ).

28 General Theory cont.
Then update it using Bayes' formula to take account of the data set D. Since x1, ..., xn are drawn independently, p(D|θ) = prod_{k=1}^{n} p(x_k|θ), which is the likelihood function. The posterior for θ is p(θ|D) = p(D|θ) p(θ) / α, where the normalization factor is α = ∫ p(D|θ) p(θ) dθ.

29 Bayesian Learning – Univariate Normal Distribution
Let us use the Bayesian estimation technique to calculate the a posteriori density p(μ|D) and the desired probability density p(x|D) for the case p(x|μ) ~ N(μ, σ^2). Univariate case: let μ be the only unknown parameter, with the variance σ^2 known.

30 Bayesian Learning – Univariate Normal Distribution
Prior probability: a normal distribution over μ, p(μ) ~ N(μ0, σ0^2). Here μ0 encodes some prior knowledge about the true mean μ, while σ0^2 measures our prior uncertainty. If μ is drawn from p(μ), then the density for x is completely determined. Letting D = {x1, ..., xn}, we use p(μ|D) = p(D|μ) p(μ) / ∫ p(D|μ) p(μ) dμ.

31 Bayesian Learning – Univariate Normal Distribution
Computing the posterior distribution: p(μ|D) = α prod_{k=1}^{n} p(x_k|μ) p(μ), with α a normalizing constant; expanding the Gaussian factors and collecting the terms in μ gives an exponential of a quadratic function of μ.

32 Bayesian Learning – Univariate Normal Distribution
where factors that do not depend on μ have been absorbed into the constants. p(μ|D) is an exponential of a quadratic function of μ, i.e., it is a normal density, and it remains normal for any number of training samples. If we write p(μ|D) ~ N(μn, σn^2), then identifying the coefficients, we get 1/σn^2 = n/σ^2 + 1/σ0^2 and μn/σn^2 = (n/σ^2) x̄_n + μ0/σ0^2.

33 Bayesian Learning – Univariate Normal Distribution
where x̄_n = (1/n) sum_{k=1}^{n} x_k is the sample mean. Solving explicitly for μn and σn^2, we obtain μn = (n σ0^2 / (n σ0^2 + σ^2)) x̄_n + (σ^2 / (n σ0^2 + σ^2)) μ0 and σn^2 = σ0^2 σ^2 / (n σ0^2 + σ^2). μn represents our best guess for μ after observing n samples, and σn^2 measures our uncertainty about this guess. σn^2 decreases monotonically with n, approaching σ^2/n as n approaches infinity.
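A minimal sketch of these closed-form updates, assuming synthetic data, a known σ, and hyperparameters μ0, σ0 chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0                  # known data standard deviation (assumed)
mu0, sigma0 = 0.0, 3.0       # prior N(mu0, sigma0^2) over the mean (assumed)
data = rng.normal(loc=1.5, scale=sigma, size=20)

n = len(data)
xbar = data.mean()
sigma_n2 = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)
mu_n = (n * sigma0**2 * xbar + sigma**2 * mu0) / (n * sigma0**2 + sigma**2)

# Predictive density p(x|D) is N(mu_n, sigma^2 + sigma_n2)
print(mu_n, sigma_n2, sigma**2 + sigma_n2)
```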

34 Bayesian Learning – Univariate Normal Distribution
Each additional observation decreases our uncertainty about the true value of μ. As n increases, p(μ|D) becomes more and more sharply peaked, approaching a Dirac delta function as n approaches infinity. This behavior is known as Bayesian learning.

35 Bayesian Learning – Univariate Normal Distribution
In general, μn is a linear combination of x̄_n and μ0, with coefficients that are non-negative and sum to 1. Thus μn lies somewhere between x̄_n and μ0. If σ0 ≠ 0, μn → x̄_n as n → ∞. If σ0 = 0, our a priori certainty that μ = μ0 is so strong that no number of observations can change our opinion. If σ0 >> σ, the a priori guess is very uncertain, and we take μn ≈ x̄_n. The ratio σ^2/σ0^2 is called the dogmatism.

36 Bayesian Learning – Univariate Normal Distribution
The univariate case: p(x|D) = ∫ p(x|μ) p(μ|D) dμ, where p(x|μ) ~ N(μ, σ^2) and p(μ|D) ~ N(μn, σn^2); the result is p(x|D) ~ N(μn, σ^2 + σn^2).

37 Bayesian Learning – Univariate Normal Distribution
Since p(x|D) ~ N(μn, σ^2 + σn^2), we can write p(x|D) = (2π(σ^2 + σn^2))^{-1/2} exp[ -(x - μn)^2 / (2(σ^2 + σn^2)) ]. To obtain the class-conditional density p(x|wi, Di), whose parametric form is known to be p(x|μ) ~ N(μ, σ^2), we replace μ by μn and σ^2 by σ^2 + σn^2. The conditional mean μn is treated as if it were the true mean, and the known variance is increased to account for the additional uncertainty in x resulting from our lack of exact knowledge of the mean μ.

38 Example (demo-MAP)
We have N points x1, ..., xN generated by a one-dimensional Gaussian, p(x|μ) ~ N(μ, σ^2). Since we think that the mean should not be very big, we use as a prior p(μ) ~ N(0, σ_μ^2), where σ_μ^2 is a hyperparameter. The total objective function is l(μ) + ln p(μ) = -sum_k (x_k - μ)^2 / (2σ^2) - μ^2 / (2σ_μ^2) + const, which is maximized to give μ_MAP = (sum_k x_k) / (N + σ^2/σ_μ^2). For σ_μ^2 >> σ^2 the influence of the prior is negligible and the result is the ML estimate. But for a very strong belief in the prior (σ_μ^2 → 0) the estimate tends to zero. Thus, if few data are available, the prior will bias the estimate towards the prior expected value.
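A minimal sketch of this MAP-versus-ML behavior (not the original demo); the data are synthetic and the values of σ and σ_μ are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 1.0                                        # known data std (assumed)
data = rng.normal(loc=2.0, scale=sigma, size=5)    # few data points
N = len(data)

def map_estimate(sigma_mu):
    # Maximizer of  -sum_k (x_k - mu)^2 / (2 sigma^2)  -  mu^2 / (2 sigma_mu^2)
    return data.sum() / (N + sigma**2 / sigma_mu**2)

print(data.mean())                  # ML estimate
for sigma_mu in (100.0, 1.0, 0.1):  # weak, moderate, strong prior
    print(sigma_mu, map_estimate(sigma_mu))
```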

39 Recursive Bayesian Incremental Learning
We have seen that p(D|θ) = prod_{k=1}^{n} p(x_k|θ). Let us define D^n = {x1, ..., xn}, with D^0 the empty set. Then p(D^n|θ) = p(x_n|θ) p(D^{n-1}|θ). Substituting into Bayes' formula, we have p(θ|D^n) = p(x_n|θ) p(θ|D^{n-1}) / ∫ p(x_n|θ) p(θ|D^{n-1}) dθ. Finally, p(θ|D^0) = p(θ).

40 Recursive Bayesian Incremental Learning
Repeated use of this equation produces the sequence of densities p(θ), p(θ|x1), p(θ|x1, x2), ... This is called the recursive Bayes approach to parameter estimation (also incremental or on-line learning). When this sequence of densities converges to a Dirac delta function centered about the true parameter value, we have Bayesian learning.
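A sketch of recursive Bayesian learning for the Gaussian-mean case of the previous slides, assuming a synthetic stream of observations and an illustrative prior; each new point updates the previous posterior alone:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 1.0                    # known data standard deviation (assumed)
mu_post, var_post = 0.0, 9.0   # prior p(mu) = N(0, 9), assumed for illustration

for x in rng.normal(loc=1.0, scale=sigma, size=10):
    # One-step update: combine N(mu_post, var_post) with the likelihood of x
    var_new = 1.0 / (1.0 / var_post + 1.0 / sigma**2)
    mu_post = var_new * (mu_post / var_post + x / sigma**2)
    var_post = var_new
    print(f"mu_n = {mu_post:.3f}, sigma_n^2 = {var_post:.3f}")
```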

41 Maximum Likelihood vs. Bayesian
ML and Bayesian estimation are asymptotically equivalent and "consistent": they yield the same class-conditional densities when the size of the training data grows to infinity. ML is typically computationally easier: ML requires (multidimensional) differentiation, while Bayesian estimation requires (multidimensional) integration. ML is often easier to interpret: it returns the single best model (parameter), whereas Bayesian estimation gives a weighted average of models. But for finite training data (and given a reliable prior), Bayesian estimation is more accurate (it uses more of the information). Bayesian estimation with a "flat" prior is essentially ML; with asymmetric and broad priors the methods lead to different solutions.

42 Problems of Dimensionality: Accuracy, Dimension, and Training Sample Size
Consider two-class multivariate normal distributions p(x|wj) ~ N(μj, Σ), j = 1, 2, with the same covariance. If the priors are equal, then the Bayes error rate is given by P(error) = (2π)^{-1/2} ∫_{r/2}^{∞} e^{-u^2/2} du, where r^2 is the squared Mahalanobis distance r^2 = (μ1 - μ2)^T Σ^{-1} (μ1 - μ2). Thus the probability of error decreases as r increases. In the conditionally independent case Σ = diag(σ1^2, ..., σd^2) and r^2 = sum_{i=1}^{d} ((μ_{1i} - μ_{2i}) / σ_i)^2.
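A small sketch (with assumed, illustrative parameters) that evaluates this error integral via the complementary error function from Python's standard math module:

```python
import numpy as np
from math import erfc, sqrt

def bayes_error(mu1, mu2, Sigma):
    """Equal-prior Bayes error for two Gaussians sharing a covariance matrix."""
    d = np.asarray(mu1, dtype=float) - np.asarray(mu2, dtype=float)
    r2 = d @ np.linalg.solve(np.asarray(Sigma, dtype=float), d)  # squared Mahalanobis distance
    r = sqrt(r2)
    # (1/sqrt(2*pi)) * integral_{r/2}^{inf} exp(-u^2/2) du = 0.5 * erfc(r / (2*sqrt(2)))
    return 0.5 * erfc(r / (2 * sqrt(2)))

# Illustrative parameters (assumed): the error drops as the class means separate.
Sigma = np.eye(2)
for dist in (1.0, 2.0, 4.0):
    print(dist, bayes_error([0, 0], [dist, 0], Sigma))
```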

43 Problems of Dimensionality
While classification accuracy can improve as the dimensionality (and the amount of training data) grows, beyond a certain point the inclusion of additional features leads to worse rather than better performance: computational complexity grows, and the problem of overfitting arises.

44 Occam's Razor
"Pluralitas non est ponenda sine necessitate", or "plurality should not be posited without necessity." The words are those of the medieval English philosopher and Franciscan monk William of Occam. Decisions based on overly complex models often lead to lower accuracy of the classifier.

45 Outline
Nonparametric Techniques
Density Estimation
Histogram Approach
Parzen-window method
kn-Nearest-Neighbor Estimation
Component Analysis and Discriminants
Principal Components Analysis
Fisher Linear Discriminant
MDA

46 NONPARAMETRIC TECHNIQUES
So far, we have treated supervised learning under the assumption that the forms of the underlying density functions are known. The common parametric forms rarely fit the densities actually encountered in practice. Classical parametric densities are unimodal, whereas many practical problems involve multimodal densities. We now examine nonparametric procedures that can be used with arbitrary distributions and without the assumption that the forms of the underlying densities are known.

47 NONPARAMETRIC TECHNIQUES
There are several types of nonparametric methods: procedures for estimating the density functions from sample patterns (if these estimates are satisfactory, they can be substituted for the true densities when designing the classifier); procedures for directly estimating the a posteriori probabilities P(wi|x); and the nearest-neighbor rule, which bypasses probability estimation and goes directly to decision functions.

48 Histogram Approach
The conceptually simplest method of estimating a p.d.f. is the histogram. The range of each component x_s of the vector x is divided into a fixed number m of equal intervals. The resulting boxes (bins) of identical volume V are formed, and the number of points falling into each bin is counted. Suppose that we have n_i samples x_j, j = 1, ..., n_i, from class w_i, and let the number of points in the j-th bin b_j be k_j. The histogram estimate p̂_i(x) of the density function p(x|w_i)

49 Histogram Approach (cont.)
is defined as p̂_i(x) = k_j / (n_i V) for x in bin b_j, and is constant over every bin b_j. Let us verify that p̂_i is a density function: ∫ p̂_i(x) dx = sum_j (k_j / (n_i V)) V = (1/n_i) sum_j k_j = 1. We can choose the number m of bins and their starting points. The choice of starting points is not critical, but m is important: it plays the role of a smoothing parameter. Too large an m makes the histogram spiky, while for too small an m we lose the true form of the density function.
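A minimal sketch of the histogram estimator k_j / (n V), assuming synthetic one-dimensional data from a two-Gaussian mixture and an illustrative bin count m:

```python
import numpy as np

rng = np.random.default_rng(5)
# Synthetic 1-D data from a mixture of two Gaussians (illustrative only)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(1, 1.0, 700)])

def histogram_density(data, m):
    """Return bin edges and the constant density value k_j / (n * V) per bin."""
    counts, edges = np.histogram(data, bins=m)
    V = edges[1] - edges[0]                 # all bins have identical width
    return edges, counts / (len(data) * V)

edges, p_hat = histogram_density(data, m=11)
print(p_hat.sum() * (edges[1] - edges[0]))  # the estimate integrates to 1
```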

50 The Histogram Method: Example
Assume one-dimensional data: some points were sampled from a combination of two Gaussians. (Figure: the resulting histogram estimate with 3 bins.)

51 The Histogram Method: Example
(Figures: the same data with 7 bins and with 11 bins.)

52 Histogram Approach (cont.)
The histogram p.d.f. estimator is very efficient. It can be computed online: all we need to do is update the counters k_j at run time, so we do not need to keep all the data, which could be huge. But its usefulness is limited to low-dimensional vectors x, because the number of bins N_b grows exponentially with the dimensionality d: N_b = m^d. This is the so-called "curse of dimensionality".

53 DENSITY ESTIMATION
To estimate the density at x, we form a sequence of regions R1, R2, ... The probability for x to fall into a region R is P = ∫_R p(x') dx'. Suppose we have n i.i.d. samples x1, ..., xn drawn according to p(x). The probability that exactly k of them fall in R is the binomial P_k = C(n, k) P^k (1-P)^{n-k}, with expected value E[k] = nP and variance Var[k] = nP(1-P). The fraction of samples falling into R, k/n, is also a random variable, for which E[k/n] = P and Var[k/n] = P(1-P)/n. As n grows, the variance shrinks and k/n becomes a better estimator of P.

54 DENSITY ESTIMATION (cont.)
P_k peaks sharply about its mean, so k/n is a good estimate of P. For a small enough R, P = ∫_R p(x') dx' ≈ p(x) V, where x is within R and V is the volume enclosed by R. Thus p(x) ≈ (k/n) / V. (*)

55 Three Conditions for DENSITY ESTIMATION
Let us take a growing sequence of samples n = 1, 2, 3, ... and regions R_n with decreasing volumes V1 > V2 > V3 > ... Let k_n be the number of samples falling in R_n, and let p_n(x) = (k_n/n) / V_n be the n-th estimate of p(x). If p_n(x) is to converge to p(x), three conditions are required: lim_{n→∞} V_n = 0, so that the resolution is as fine as possible (to reduce smoothing); lim_{n→∞} k_n = ∞, since otherwise the region R_n will not contain an unbounded number of points, k_n/n will not converge to P, and we will get p(x) = 0; and lim_{n→∞} k_n/n = 0, to guarantee convergence of (*).

56 Parzen Window and KNN
How do we obtain the sequence R1, R2, ...? There are two common approaches to obtaining sequences of regions that satisfy the above conditions. (1) Shrink an initial region by specifying the volume V_n as some function of n, such as V_n = V_1/√n, and show that k_n and k_n/n behave properly, i.e., that p_n(x) converges to p(x). This is the Parzen-window (or kernel) method. (2) Specify k_n as some function of n, such as k_n = √n; here the volume V_n is grown until it encloses k_n neighbors of x. This is the k_n-nearest-neighbor method. Both of these methods converge, although it is difficult to make meaningful statements about their finite-sample behavior.
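For the second route, a minimal sketch of a k_n-nearest-neighbor density estimate with k_n = √n, assuming synthetic one-dimensional data; in 1-D the volume V_n is the length of the smallest interval around x containing k_n samples:

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.normal(0.0, 1.0, 400)          # synthetic 1-D sample (assumed)
n = len(data)
k_n = int(round(np.sqrt(n)))              # k_n = sqrt(n)

def knn_density(x):
    # Grow the interval around x until it encloses the k_n nearest samples
    dists = np.sort(np.abs(data - x))
    V_n = 2.0 * dists[k_n - 1]            # length of the enclosing interval
    return (k_n / n) / V_n

for x in (-1.0, 0.0, 2.0):
    print(x, knn_density(x))
```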

57 PARZEN WINDOWS Assume that the region Rn is a d-dimensional hypercube.
If h_n is the length of an edge of that hypercube, then its volume is given by V_n = h_n^d. Define the following window function: φ(u) = 1 if |u_j| ≤ 1/2 for j = 1, ..., d, and 0 otherwise. φ(u) defines a unit hypercube centered at the origin. Consequently, φ((x - x_i)/h_n) = 1 if x_i falls within the hypercube of volume V_n centered at x, and is zero otherwise. The number of samples in this hypercube is given by k_n = sum_{i=1}^{n} φ((x - x_i)/h_n).

58 PARZEN WINDOWS cont.
Since p_n(x) = (k_n/n) / V_n, substituting k_n gives p_n(x) = (1/n) sum_{i=1}^{n} (1/V_n) φ((x - x_i)/h_n). Rather than limiting ourselves to the hypercube window, we can use a more general class of window functions. Thus p_n(x) is an average of functions of x and the samples x_i. The window function is being used for interpolation: each sample contributes to the estimate in accordance with its distance from x. p_n(x) must be nonnegative and integrate to 1.

59 PARZEN WINDOWS cont.
This can be assured by requiring the window function itself to be a density function, i.e., φ(u) ≥ 0 and ∫ φ(u) du = 1. Effect of the window size h_n on p_n(x): define the function δ_n(x) = (1/V_n) φ(x/h_n); then we can write p_n(x) as the average p_n(x) = (1/n) sum_{i=1}^{n} δ_n(x - x_i). Since V_n = h_n^d, h_n affects both the amplitude and the width of δ_n(x).

60 PARZEN WINDOWS cont.
(Figure: examples of two-dimensional circularly symmetric normal Parzen windows for three different values of h.) If h_n is very large, the amplitude of δ_n is small, and x must be far from x_i before δ_n(x - x_i) changes much from δ_n(0).

61 PARZEN WINDOWS cont.
In this case p_n(x) is the superposition of n broad, slowly varying functions and is a very smooth, "out-of-focus" estimate of p(x). If h_n is very small, the peak value of δ_n(x - x_i) is large and occurs near x = x_i. In this case p_n(x) is the superposition of n sharp pulses centered at the samples: an erratic, "noisy" estimate. As h_n approaches zero, δ_n(x - x_i) approaches a Dirac delta function centered at x_i, and p_n(x) approaches a superposition of delta functions centered at the samples.

62 PARZEN WINDOWS cont.
(Figure: three Parzen-window density estimates based on the same set of 5 samples, using the windows from the previous figure.) The choice of h_n (or V_n) has an important effect on p_n(x). If V_n is too large, the estimate suffers from too little resolution; if V_n is too small, the estimate suffers from too much statistical variability. With a limited number of samples, we must seek some acceptable compromise.

63 PARZEN WINDOWS cont.
If we have an unlimited number of samples, then letting V_n slowly approach zero as n increases makes p_n(x) converge to the unknown density p(x). Example 1: p(x) is a zero-mean, unit-variance, univariate normal density. Let the window function be of the same form: φ(u) = (2π)^{-1/2} e^{-u^2/2}. Let h_n = h_1/√n, where h_1 is a parameter. Then p_n(x) is an average of normal densities centered at the samples: p_n(x) = (1/n) sum_{i=1}^{n} (1/h_n) φ((x - x_i)/h_n).
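A minimal sketch of Example 1, assuming synthetic standard-normal samples and an illustrative h_1; it evaluates the Parzen estimate with a Gaussian window and h_n = h_1/√n:

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(0.0, 1.0, 256)          # samples from N(0, 1)
n = len(data)
h1 = 1.0                                  # window-width parameter (assumed)
h_n = h1 / np.sqrt(n)

def phi(u):
    """Gaussian window function (also the true density here)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def parzen_estimate(x):
    """p_n(x) = (1/n) * sum_i (1/h_n) * phi((x - x_i) / h_n)."""
    return np.mean(phi((x - data) / h_n) / h_n)

for x in (-2.0, 0.0, 2.0):
    print(x, parzen_estimate(x), phi(x))   # estimate vs. true density
```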

