1 Outline
Maximum Likelihood
Maximum A-Posteriori (MAP) Estimation
Bayesian Parameter Estimation
Example: The Gaussian Case
Recursive Bayesian Incremental Learning
Problems of Dimensionality
Nonparametric Techniques
Density Estimation
Histogram Approach
Parzen-window method

2 Bayes' Decision Rule (Minimizes the probability of error)
Choose w1 if P(w1|x) > P(w2|x); otherwise choose w2. Equivalently, choose w1 if p(x|w1)P(w1) > p(x|w2)P(w2), and w2 otherwise. The resulting conditional error is P(error|x) = min[P(w1|x), P(w2|x)].
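Not part of the original slides: a minimal Python sketch of this two-class rule. The specific densities and priors (1-D Gaussians with P(w1) = 0.6) are illustrative assumptions only.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Assumed example setup: two classes with known 1-D Gaussian
# class-conditional densities and known priors.
P1, P2 = 0.6, 0.4
mu1, mu2, sigma = 0.0, 2.0, 1.0

def bayes_decision(x):
    """Choose w1 if p(x|w1)P(w1) > p(x|w2)P(w2), otherwise w2."""
    g1 = gauss_pdf(x, mu1, sigma) * P1
    g2 = gauss_pdf(x, mu2, sigma) * P2
    return 1 if g1 > g2 else 2

def error_given_x(x):
    """P(error|x) = min[P(w1|x), P(w2|x)]."""
    g1 = gauss_pdf(x, mu1, sigma) * P1
    g2 = gauss_pdf(x, mu2, sigma) * P2
    return min(g1, g2) / (g1 + g2)

print(bayes_decision(0.5), error_given_x(0.5))
```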

3 Normal Density - Multivariate Case
The general multivariate normal density (MND) in d dimensions is written as p(x) = (2π)^{-d/2} |Σ|^{-1/2} exp[ -(1/2)(x - μ)^T Σ^{-1} (x - μ) ]. It can be shown that E[x] = μ and E[(x - μ)(x - μ)^T] = Σ, which means for the components E[x_i] = μ_i and σ_{ij} = E[(x_i - μ_i)(x_j - μ_j)].

4 Maximum Likelihood and Bayesian Parameter Estimation
To design an optimal classifier we need P(wi) and p(x|wi), but usually we do not know them. The solution is to use training data to estimate the unknown probabilities. Estimation of the class-conditional densities is a difficult task.

5 Maximum Likelihood and Bayesian Parameter Estimation
Supervised learning: we get to see samples from each of the classes "separately" (called tagged or labeled samples). Tagged samples are "expensive", so we need to learn the distributions as efficiently as possible. Two methods: parametric (easier) and non-parametric (harder).

6 Learning From Observed Data
(Figure: learning from observed data — hidden vs. observed labels, unsupervised vs. supervised learning.)

7 Maximum Likelihood and Bayesian Parameter Estimation
Program for parametric methods: assume specific parametric distributions p(x|wi, θi) with parameters θi; estimate the parameters θi from the training data; replace the true class-conditional density with this approximation and apply the Bayesian framework for decision making.

8 Maximum Likelihood and Bayesian Parameter Estimation
Suppose we can assume that the relevant (class-conditional) densities are of some parametric form. That is, p(x|w) = p(x|θ), where θ is a parameter (vector). Examples of parameterized densities: Binomial — a sequence x(n) with m ones and n-m zeros has probability p(x(n)|θ) = θ^m (1-θ)^{n-m}. Exponential — each data point x ≥ 0 is distributed according to p(x|θ) = θ e^{-θx}.

9 Maximum Likelihood and Bayesian Parameter Estimation cont.
Two procedures for parameter estimation will be considered. Maximum likelihood estimation: choose the parameter value that makes the data most probable (i.e., maximizes the probability of obtaining the sample that has actually been observed). Bayesian learning: define a prior probability on the model space and compute the posterior p(θ|D). Additional samples sharpen the posterior density, which peaks near the true values of the parameters.

10 Sampling Model
It is assumed that a sample set D with n independently generated samples is available. The sample set is partitioned into separate sample sets D1, ..., Dc, one for each class. A generic sample set will simply be denoted by D. Each class-conditional density is assumed to have a known parametric form and is uniquely specified by a parameter (vector) θi. Samples in each set are assumed to be independent and identically distributed (i.i.d.) according to the true probability law p(x|wi, θi).

11 Log-Likelihood function and Score Function
The sample sets are assumed to be functionally independent, i.e., the training set Dj contains no information about θi for j ≠ i. The i.i.d. assumption implies that p(D|θ) = prod_{k=1}^{n} p(x_k|θ). Let D = {x1, ..., xn} be a generic sample of size n. Log-likelihood function: l(θ) = ln p(D|θ) = sum_{k=1}^{n} ln p(x_k|θ). The log-likelihood is numerically identical to the logarithm of the probability density of the sample, but is interpreted as a function of the parameter θ for the given sample D.

12 Log-Likelihood Illustration
Assume that all the points in D are drawn from some (one-dimensional) normal distribution with a known variance and an unknown mean.

13 Log-Likelihood function and Score Function cont.
Maximum likelihood estimator (MLE): θ̂ = argmax_θ l(θ) (tacitly assuming that such a maximum exists!). Score function: ∇_θ l(θ) = ∇_θ ln p(D|θ), and hence ∇_θ l(θ) = sum_{k=1}^{n} ∇_θ ln p(x_k|θ). Necessary condition for the MLE (if it is not on the border of the domain): ∇_θ l(θ) = 0.
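As an illustration (not from the original slides), a minimal Python sketch, assuming a 1-D Gaussian with known standard deviation and synthetic data, showing that the zero of the score function coincides with the closed-form maximum of the log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0                                          # known standard deviation (assumed)
data = rng.normal(loc=3.0, scale=sigma, size=200)    # synthetic sample

def log_likelihood(mu):
    # l(mu) = sum_k ln p(x_k | mu) for a Gaussian with known sigma
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (data - mu) ** 2 / (2 * sigma**2))

def score(mu):
    # d l(mu) / d mu = sum_k (x_k - mu) / sigma^2
    return np.sum(data - mu) / sigma**2

mu_grid = np.linspace(0, 6, 601)
mu_numeric = mu_grid[np.argmax([log_likelihood(m) for m in mu_grid])]
mu_closed = data.mean()          # the zero of the score function

print(mu_numeric, mu_closed, score(mu_closed))   # score ~ 0 at the MLE
```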

14 Maximum A-Posteriori (MAP) Estimation
Maximum a posteriori (MAP): find the value of θ that maximizes l(θ) + ln p(θ), where p(θ) is a prior probability of the different parameter values. A MAP estimator finds the peak, or mode, of the posterior. Drawback of MAP: after an arbitrary nonlinear transformation of the parameter space, the density changes, and the MAP solution is no longer the same point (unlike the MLE, it is not invariant to reparameterization).

15 Maximum A-Posteriori (MAP) Estimation
The "most likely value" of θ is given by θ̂_MAP = argmax_θ p(θ|D).

16 Maximum A-Posteriori (MAP) Estimation
p(θ|D) = p(D|θ)p(θ)/p(D) = p(θ) prod_{k=1}^{n} p(x_k|θ) / p(D), since the data are i.i.d. We can disregard the normalizing factor p(D) when looking for the maximum.

17 MAP - continued
So, the estimate we are looking for is θ̂_MAP = argmax_θ [ ln p(θ) + sum_{k=1}^{n} ln p(x_k|θ) ].

18 The Gaussian Case: Unknown Mean
Suppose that the samples are drawn from a multivariate normal population with mean μ and covariance matrix Σ. Consider first the case where only the mean is unknown. For a sample point x_k, we have ln p(x_k|μ) = -(1/2) ln[(2π)^d |Σ|] - (1/2)(x_k - μ)^T Σ^{-1} (x_k - μ) and ∇_μ ln p(x_k|μ) = Σ^{-1}(x_k - μ). The maximum likelihood estimate for μ must satisfy sum_{k=1}^{n} Σ^{-1}(x_k - μ̂) = 0.

19 The Gaussian Case: Unknown Mean
Multiplying by Σ and rearranging, we obtain μ̂ = (1/n) sum_{k=1}^{n} x_k. The MLE of the unknown population mean is just the arithmetic average of the training samples (the sample mean). Geometrically, if we think of the n samples as a cloud of points, the sample mean is the centroid of the cloud.

20 The Gaussian Case: Unknown Mean and Covariance
In the general multivariate normal case, neither the mean μ nor the covariance matrix Σ is known. Consider first the univariate case with θ1 = μ and θ2 = σ^2. The log-likelihood of a single point is ln p(x_k|θ) = -(1/2) ln(2π θ2) - (x_k - θ1)^2 / (2 θ2), and its gradient is ∇_θ ln p(x_k|θ) = [ (x_k - θ1)/θ2 , -1/(2θ2) + (x_k - θ1)^2/(2 θ2^2) ]^T.

21 The Gaussian Case: Unknown Mean and Covariance
Setting the gradient to zero and summing over all the sample points, we get the following necessary conditions: sum_{k=1}^{n} (x_k - θ̂1)/θ̂2 = 0 and -sum_{k=1}^{n} 1/θ̂2 + sum_{k=1}^{n} (x_k - θ̂1)^2/θ̂2^2 = 0, where θ̂1 and θ̂2 are the MLE estimates for μ and σ^2, respectively. Solving for θ̂1 and θ̂2, we obtain μ̂ = (1/n) sum_{k=1}^{n} x_k and σ̂^2 = (1/n) sum_{k=1}^{n} (x_k - μ̂)^2.

22 The Gaussian multivariate case
For the multivariate case, it is easy to show that the MLE estimates for μ and Σ are given by μ̂ = (1/n) sum_{k=1}^{n} x_k and Σ̂ = (1/n) sum_{k=1}^{n} (x_k - μ̂)(x_k - μ̂)^T. The MLE of the mean vector is the sample mean, and the MLE of the covariance matrix is the arithmetic average of the n matrices (x_k - μ̂)(x_k - μ̂)^T. The MLE of σ^2 is biased, i.e., the expected value of the sample variance over all data sets of size n is not equal to the true variance: E[σ̂^2] = ((n-1)/n) σ^2 ≠ σ^2.

23 The Gaussian multivariate case
Unbiased estimators for μ and Σ are given by μ̂ = (1/n) sum_{k=1}^{n} x_k and C = (1/(n-1)) sum_{k=1}^{n} (x_k - μ̂)(x_k - μ̂)^T. C is called the sample covariance matrix. C is absolutely unbiased, while the MLE Σ̂ is only asymptotically unbiased.
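A short sketch (synthetic data; the true mean and covariance are assumed values) contrasting the biased MLE covariance (divide by n) with the unbiased sample covariance C (divide by n-1):

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true = np.array([1.0, -2.0])                       # assumed, for illustration
Sigma_true = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=50)   # n x d samples

n = X.shape[0]
mu_hat = X.mean(axis=0)                    # MLE of the mean (sample mean)
diff = X - mu_hat
Sigma_mle = diff.T @ diff / n              # biased MLE of the covariance
C = diff.T @ diff / (n - 1)                # unbiased sample covariance

print(mu_hat)
print(Sigma_mle)
print(C)        # matches np.cov(X, rowvar=False), which divides by n-1
```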

24 Bayesian Estimation: Class-Conditional Densities
The aim is to find the posteriors P(wi|x), knowing p(x|wi) and P(wi); but these are unknown. How can we find them? Given the sample D, we say that the aim is to find P(wi|x, D). Bayes' formula gives P(wi|x, D) = p(x|wi, D) P(wi|D) / sum_{j=1}^{c} p(x|wj, D) P(wj|D). We use the information provided by the training samples to determine the class-conditional densities and the prior probabilities. Generally used assumptions: the priors are known or obtainable from a trivial calculation, so P(wi) = P(wi|D); the training set can be separated into c subsets D1, ..., Dc. (Derivation aside: P(w,x,D) = p(x|w,D)P(w,D) = p(w|x,D)P(x,D); P(w,D) = p(w|D)P(D); P(x,D) = p(x)P(D).)

25 Bayesian Estimation: Class-Conditional Densities
The samples in Dj have no influence on p(x|wi, Di) if j ≠ i. Thus we can write p(x|wi, D) = p(x|wi, Di). We have c separate problems of the form: use a set D of samples drawn independently according to a fixed but unknown probability distribution p(x) to determine p(x|D).

26 Bayesian Estimation: General Theory
Bayesian learning considers θ (the parameter vector to be estimated) to be a random variable. Before we observe the data, the parameters are described by a prior p(θ), which is typically very broad. Once we have observed the data, we can use Bayes' formula to find the posterior p(θ|D). Since some values of the parameters are more consistent with the data than others, the posterior is narrower than the prior. This is Bayesian learning (see figure).

27 General Theory cont.
Density function for x given the training data set D: p(x|D). From the definition of conditional probability densities, p(x, θ|D) = p(x|θ, D) p(θ|D). The first factor is independent of D, since it is just our assumed form of the parameterized density: p(x|θ, D) = p(x|θ). Therefore p(x|D) = ∫ p(x|θ) p(θ|D) dθ. Instead of choosing a specific value for θ, the Bayesian approach performs a weighted average over all values of θ. The weighting factor p(θ|D), which is the posterior of θ, is determined by starting from some assumed prior p(θ).

28 General Theory cont.
Then update it using Bayes' formula to take account of the data set D. Since x1, ..., xn are drawn independently, p(D|θ) = prod_{k=1}^{n} p(x_k|θ), which is the likelihood function. The posterior for θ is p(θ|D) = p(D|θ) p(θ) / α, where the normalization factor is α = ∫ p(D|θ) p(θ) dθ.

29 Bayesian Learning – Univariate Normal Distribution
Let us use the Bayesian estimation technique to calculate the a posteriori density p(μ|D) and the desired probability density p(x|D) for the case p(x|μ) ~ N(μ, σ^2). Univariate case: let μ be the only unknown parameter, with the variance σ^2 known.

30 Bayesian Learning – Univariate Normal Distribution
Prior probability: a normal distribution over μ, p(μ) ~ N(μ0, σ0^2). Here μ0 encodes some prior knowledge about the true mean μ, while σ0^2 measures our prior uncertainty. If μ is drawn from p(μ), then the density for x is completely determined. Letting D = {x1, ..., xn}, we use p(μ|D) = p(D|μ) p(μ) / ∫ p(D|μ) p(μ) dμ.

31 Bayesian Learning – Univariate Normal Distribution
Computing the posterior distribution: p(μ|D) = α prod_{k=1}^{n} p(x_k|μ) p(μ), with α a normalizing constant; expanding the Gaussian factors and collecting the terms in μ gives an exponential of a quadratic function of μ.

32 Bayesian Learning – Univariate Normal Distribution
where factors that do not depend on μ have been absorbed into the constants. p(μ|D) is an exponential of a quadratic function of μ, i.e., it is a normal density, and it remains normal for any number of training samples. If we write p(μ|D) ~ N(μn, σn^2), then identifying the coefficients, we get 1/σn^2 = n/σ^2 + 1/σ0^2 and μn/σn^2 = (n/σ^2) x̄_n + μ0/σ0^2.

33 Bayesian Learning – Univariate Normal Distribution
where x̄_n = (1/n) sum_{k=1}^{n} x_k is the sample mean. Solving explicitly for μn and σn^2, we obtain μn = (n σ0^2 / (n σ0^2 + σ^2)) x̄_n + (σ^2 / (n σ0^2 + σ^2)) μ0 and σn^2 = σ0^2 σ^2 / (n σ0^2 + σ^2). μn represents our best guess for μ after observing n samples, and σn^2 measures our uncertainty about this guess. σn^2 decreases monotonically with n, approaching σ^2/n as n approaches infinity.
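A minimal sketch of these closed-form updates, assuming synthetic data, a known σ, and hyperparameters μ0, σ0 chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0                  # known data standard deviation (assumed)
mu0, sigma0 = 0.0, 3.0       # prior N(mu0, sigma0^2) over the mean (assumed)
data = rng.normal(loc=1.5, scale=sigma, size=20)

n = len(data)
xbar = data.mean()
sigma_n2 = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)
mu_n = (n * sigma0**2 * xbar + sigma**2 * mu0) / (n * sigma0**2 + sigma**2)

# Predictive density p(x|D) is N(mu_n, sigma^2 + sigma_n2)
print(mu_n, sigma_n2, sigma**2 + sigma_n2)
```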

34 Bayesian Learning – Univariate Normal Distribution
Each additional observation decreases our uncertainty about the true value of μ. As n increases, p(μ|D) becomes more and more sharply peaked, approaching a Dirac delta function as n approaches infinity. This behavior is known as Bayesian learning.

35 Bayesian Learning – Univariate Normal Distribution
In general, μn is a linear combination of x̄_n and μ0, with coefficients that are non-negative and sum to 1. Thus μn lies somewhere between x̄_n and μ0. If σ0 ≠ 0, μn → x̄_n as n → ∞. If σ0 = 0, our a priori certainty that μ = μ0 is so strong that no number of observations can change our opinion. If σ0 >> σ, the a priori guess is very uncertain, and we take μn ≈ x̄_n. The ratio σ^2/σ0^2 is called the dogmatism.

36 Bayesian Learning – Univariate Normal Distribution
The univariate case: p(x|D) = ∫ p(x|μ) p(μ|D) dμ, where p(x|μ) ~ N(μ, σ^2) and p(μ|D) ~ N(μn, σn^2); the result is p(x|D) ~ N(μn, σ^2 + σn^2).

37 Bayesian Learning – Univariate Normal Distribution
Since p(x|D) ~ N(μn, σ^2 + σn^2), we can write p(x|D) = (2π(σ^2 + σn^2))^{-1/2} exp[ -(x - μn)^2 / (2(σ^2 + σn^2)) ]. To obtain the class-conditional density p(x|wi, Di), whose parametric form is known to be p(x|μ) ~ N(μ, σ^2), we replace μ by μn and σ^2 by σ^2 + σn^2. The conditional mean μn is treated as if it were the true mean, and the known variance is increased to account for the additional uncertainty in x resulting from our lack of exact knowledge of the mean μ.

38 Example (demo-MAP)
We have N points x1, ..., xN generated by a one-dimensional Gaussian, p(x|μ) ~ N(μ, σ^2). Since we think that the mean should not be very big, we use as a prior p(μ) ~ N(0, σ_μ^2), where σ_μ^2 is a hyperparameter. The total objective function is l(μ) + ln p(μ) = -sum_k (x_k - μ)^2 / (2σ^2) - μ^2 / (2σ_μ^2) + const, which is maximized to give μ_MAP = (sum_k x_k) / (N + σ^2/σ_μ^2). For σ_μ^2 >> σ^2 the influence of the prior is negligible and the result is the ML estimate. But for a very strong belief in the prior (σ_μ^2 → 0) the estimate tends to zero. Thus, if few data are available, the prior will bias the estimate towards the prior expected value.
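A minimal sketch of this MAP-versus-ML behavior (not the original demo); the data are synthetic and the values of σ and σ_μ are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 1.0                                        # known data std (assumed)
data = rng.normal(loc=2.0, scale=sigma, size=5)    # few data points
N = len(data)

def map_estimate(sigma_mu):
    # Maximizer of  -sum_k (x_k - mu)^2 / (2 sigma^2)  -  mu^2 / (2 sigma_mu^2)
    return data.sum() / (N + sigma**2 / sigma_mu**2)

print(data.mean())                  # ML estimate
for sigma_mu in (100.0, 1.0, 0.1):  # weak, moderate, strong prior
    print(sigma_mu, map_estimate(sigma_mu))
```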

39 Recursive Bayesian Incremental Learning
We have seen that p(D|θ) = prod_{k=1}^{n} p(x_k|θ). Let us define D^n = {x1, ..., xn}, with D^0 the empty set. Then p(D^n|θ) = p(x_n|θ) p(D^{n-1}|θ). Substituting into Bayes' formula, we have p(θ|D^n) = p(x_n|θ) p(θ|D^{n-1}) / ∫ p(x_n|θ) p(θ|D^{n-1}) dθ. Finally, p(θ|D^0) = p(θ).

40 Recursive Bayesian Incremental Learning
Repeated use of this equation produces the sequence of densities p(θ), p(θ|x1), p(θ|x1, x2), ... This is called the recursive Bayes approach to parameter estimation (also incremental or on-line learning). When this sequence of densities converges to a Dirac delta function centered about the true parameter value, we have Bayesian learning.
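A sketch of recursive Bayesian learning for the Gaussian-mean case of the previous slides, assuming a synthetic stream of observations and an illustrative prior; each new point updates the previous posterior alone:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 1.0                    # known data standard deviation (assumed)
mu_post, var_post = 0.0, 9.0   # prior p(mu) = N(0, 9), assumed for illustration

for x in rng.normal(loc=1.0, scale=sigma, size=10):
    # One-step update: combine N(mu_post, var_post) with the likelihood of x
    var_new = 1.0 / (1.0 / var_post + 1.0 / sigma**2)
    mu_post = var_new * (mu_post / var_post + x / sigma**2)
    var_post = var_new
    print(f"mu_n = {mu_post:.3f}, sigma_n^2 = {var_post:.3f}")
```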

41 Maximum Likelihood vs. Bayesian
ML and Bayesian estimation are asymptotically equivalent and "consistent": they yield the same class-conditional densities when the size of the training data grows to infinity. ML is typically computationally easier: ML requires (multidimensional) differentiation, while Bayesian estimation requires (multidimensional) integration. ML is often easier to interpret: it returns the single best model (parameter), whereas Bayesian estimation gives a weighted average of models. But for finite training data (and given a reliable prior), Bayesian estimation is more accurate (it uses more of the information). Bayesian estimation with a "flat" prior is essentially ML; with asymmetric and broad priors the methods lead to different solutions.

42 Problems of Dimensionality: Accuracy, Dimension, and Training Sample Size
Consider two-class multivariate normal distributions p(x|wj) ~ N(μj, Σ), j = 1, 2, with the same covariance. If the priors are equal, then the Bayes error rate is given by P(error) = (2π)^{-1/2} ∫_{r/2}^{∞} e^{-u^2/2} du, where r^2 is the squared Mahalanobis distance r^2 = (μ1 - μ2)^T Σ^{-1} (μ1 - μ2). Thus the probability of error decreases as r increases. In the conditionally independent case Σ = diag(σ1^2, ..., σd^2) and r^2 = sum_{i=1}^{d} ((μ_{1i} - μ_{2i}) / σ_i)^2.
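A small sketch (with assumed, illustrative parameters) that evaluates this error integral via the complementary error function from Python's standard math module:

```python
import numpy as np
from math import erfc, sqrt

def bayes_error(mu1, mu2, Sigma):
    """Equal-prior Bayes error for two Gaussians sharing a covariance matrix."""
    d = np.asarray(mu1, dtype=float) - np.asarray(mu2, dtype=float)
    r2 = d @ np.linalg.solve(np.asarray(Sigma, dtype=float), d)  # squared Mahalanobis distance
    r = sqrt(r2)
    # (1/sqrt(2*pi)) * integral_{r/2}^{inf} exp(-u^2/2) du = 0.5 * erfc(r / (2*sqrt(2)))
    return 0.5 * erfc(r / (2 * sqrt(2)))

# Illustrative parameters (assumed): the error drops as the class means separate.
Sigma = np.eye(2)
for dist in (1.0, 2.0, 4.0):
    print(dist, bayes_error([0, 0], [dist, 0], Sigma))
```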

43 Problems of Dimensionality
While classification accuracy can improve as the dimensionality (and the amount of training data) grows, beyond a certain point the inclusion of additional features leads to worse rather than better performance: computational complexity grows, and the problem of overfitting arises.

44 Occam's Razor
"Pluralitas non est ponenda sine necessitate", or "plurality should not be posited without necessity." The words are those of the medieval English philosopher and Franciscan monk William of Occam. Decisions based on overly complex models often lead to lower accuracy of the classifier.

45 Outline
Nonparametric Techniques
Density Estimation
Histogram Approach
Parzen-window method
kn-Nearest-Neighbor Estimation
Component Analysis and Discriminants
Principal Components Analysis
Fisher Linear Discriminant
MDA

46 NONPARAMETRIC TECHNIQUES
So far, we have treated supervised learning under the assumption that the forms of the underlying density functions are known. The common parametric forms rarely fit the densities actually encountered in practice. Classical parametric densities are unimodal, whereas many practical problems involve multimodal densities. We now examine nonparametric procedures that can be used with arbitrary distributions and without the assumption that the forms of the underlying densities are known.

47 NONPARAMETRIC TECHNIQUES
There are several types of nonparametric methods: procedures for estimating the density functions from sample patterns (if these estimates are satisfactory, they can be substituted for the true densities when designing the classifier); procedures for directly estimating the a posteriori probabilities P(wi|x); and the nearest-neighbor rule, which bypasses probability estimation and goes directly to decision functions.

48 Histogram Approach
The conceptually simplest method of estimating a p.d.f. is the histogram. The range of each component x_s of the vector x is divided into a fixed number m of equal intervals. The resulting boxes (bins) of identical volume V are formed, and the number of points falling into each bin is counted. Suppose that we have n_i samples x_j, j = 1, ..., n_i, from class w_i, and let the number of points in the j-th bin b_j be k_j. The histogram estimate p̂_i(x) of the density function p(x|w_i)

49 Histogram Approach (cont.)
is defined as p̂_i(x) = k_j / (n_i V) for x in bin b_j, and is constant over every bin b_j. Let us verify that p̂_i is a density function: ∫ p̂_i(x) dx = sum_j (k_j / (n_i V)) V = (1/n_i) sum_j k_j = 1. We can choose the number m of bins and their starting points. The choice of starting points is not critical, but m is important: it plays the role of a smoothing parameter. Too large an m makes the histogram spiky, while for too small an m we lose the true form of the density function.
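A minimal sketch of the histogram estimator k_j / (n V), assuming synthetic one-dimensional data from a two-Gaussian mixture and an illustrative bin count m:

```python
import numpy as np

rng = np.random.default_rng(5)
# Synthetic 1-D data from a mixture of two Gaussians (illustrative only)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(1, 1.0, 700)])

def histogram_density(data, m):
    """Return bin edges and the constant density value k_j / (n * V) per bin."""
    counts, edges = np.histogram(data, bins=m)
    V = edges[1] - edges[0]                 # all bins have identical width
    return edges, counts / (len(data) * V)

edges, p_hat = histogram_density(data, m=11)
print(p_hat.sum() * (edges[1] - edges[0]))  # the estimate integrates to 1
```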

50 The Histogram Method: Example
Assume one-dimensional data: some points were sampled from a combination of two Gaussians. (Figure: the resulting histogram estimate with 3 bins.)

51 The Histogram Method: Example
(Figures: the same data with 7 bins and with 11 bins.)

52 Histogram Approach (cont.)
The histogram p.d.f. estimator is very efficient. It can be computed online: all we need to do is update the counters k_j at run time, so we do not need to keep all the data, which could be huge. But its usefulness is limited to low-dimensional vectors x, because the number of bins N_b grows exponentially with the dimensionality d: N_b = m^d. This is the so-called "curse of dimensionality".

53 DENSITY ESTIMATION
To estimate the density at x, we form a sequence of regions R1, R2, ... The probability for x to fall into a region R is P = ∫_R p(x') dx'. Suppose we have n i.i.d. samples x1, ..., xn drawn according to p(x). The probability that exactly k of them fall in R is the binomial P_k = C(n, k) P^k (1-P)^{n-k}, with expected value E[k] = nP and variance Var[k] = nP(1-P). The fraction of samples falling into R, k/n, is also a random variable, for which E[k/n] = P and Var[k/n] = P(1-P)/n. As n grows, the variance shrinks and k/n becomes a better estimator of P.

54 DENSITY ESTIMATION (cont.)
P_k peaks sharply about its mean, so k/n is a good estimate of P. For a small enough R, P = ∫_R p(x') dx' ≈ p(x) V, where x is within R and V is the volume enclosed by R. Thus p(x) ≈ (k/n) / V. (*)

55 Three Conditions for DENSITY ESTIMATION
Let us take a growing sequence of samples n = 1, 2, 3, ... and regions R_n with decreasing volumes V1 > V2 > V3 > ... Let k_n be the number of samples falling in R_n, and let p_n(x) = (k_n/n) / V_n be the n-th estimate of p(x). If p_n(x) is to converge to p(x), three conditions are required: lim_{n→∞} V_n = 0, so that the resolution is as fine as possible (to reduce smoothing); lim_{n→∞} k_n = ∞, since otherwise the region R_n will not contain an unbounded number of points, k_n/n will not converge to P, and we will get p(x) = 0; and lim_{n→∞} k_n/n = 0, to guarantee convergence of (*).

56 Parzen Window and KNN
How do we obtain the sequence R1, R2, ...? There are two common approaches to obtaining sequences of regions that satisfy the above conditions. (1) Shrink an initial region by specifying the volume V_n as some function of n, such as V_n = V_1/√n, and show that k_n and k_n/n behave properly, i.e., that p_n(x) converges to p(x). This is the Parzen-window (or kernel) method. (2) Specify k_n as some function of n, such as k_n = √n; here the volume V_n is grown until it encloses k_n neighbors of x. This is the k_n-nearest-neighbor method. Both of these methods converge, although it is difficult to make meaningful statements about their finite-sample behavior.
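For the second route, a minimal sketch of a k_n-nearest-neighbor density estimate with k_n = √n, assuming synthetic one-dimensional data; in 1-D the volume V_n is the length of the smallest interval around x containing k_n samples:

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.normal(0.0, 1.0, 400)          # synthetic 1-D sample (assumed)
n = len(data)
k_n = int(round(np.sqrt(n)))              # k_n = sqrt(n)

def knn_density(x):
    # Grow the interval around x until it encloses the k_n nearest samples
    dists = np.sort(np.abs(data - x))
    V_n = 2.0 * dists[k_n - 1]            # length of the enclosing interval
    return (k_n / n) / V_n

for x in (-1.0, 0.0, 2.0):
    print(x, knn_density(x))
```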

57 PARZEN WINDOWS Assume that the region Rn is a d-dimensional hypercube.
If h_n is the length of an edge of that hypercube, then its volume is given by V_n = h_n^d. Define the following window function: φ(u) = 1 if |u_j| ≤ 1/2 for j = 1, ..., d, and 0 otherwise. φ(u) defines a unit hypercube centered at the origin. Consequently, φ((x - x_i)/h_n) = 1 if x_i falls within the hypercube of volume V_n centered at x, and is zero otherwise. The number of samples in this hypercube is given by k_n = sum_{i=1}^{n} φ((x - x_i)/h_n).

58 PARZEN WINDOWS cont.
Since p_n(x) = (k_n/n) / V_n, substituting k_n gives p_n(x) = (1/n) sum_{i=1}^{n} (1/V_n) φ((x - x_i)/h_n). Rather than limiting ourselves to the hypercube window, we can use a more general class of window functions. Thus p_n(x) is an average of functions of x and the samples x_i. The window function is being used for interpolation: each sample contributes to the estimate in accordance with its distance from x. p_n(x) must be nonnegative and integrate to 1.

59 PARZEN WINDOWS cont.
This can be assured by requiring the window function itself to be a density function, i.e., φ(u) ≥ 0 and ∫ φ(u) du = 1. Effect of the window size h_n on p_n(x): define the function δ_n(x) = (1/V_n) φ(x/h_n); then we can write p_n(x) as the average p_n(x) = (1/n) sum_{i=1}^{n} δ_n(x - x_i). Since V_n = h_n^d, h_n affects both the amplitude and the width of δ_n(x).

60 PARZEN WINDOWS cont.
(Figure: examples of two-dimensional circularly symmetric normal Parzen windows for three different values of h.) If h_n is very large, the amplitude of δ_n is small, and x must be far from x_i before δ_n(x - x_i) changes much from δ_n(0).

61 PARZEN WINDOWS cont.
In this case p_n(x) is the superposition of n broad, slowly varying functions and is a very smooth, "out-of-focus" estimate of p(x). If h_n is very small, the peak value of δ_n(x - x_i) is large and occurs near x = x_i. In this case p_n(x) is the superposition of n sharp pulses centered at the samples: an erratic, "noisy" estimate. As h_n approaches zero, δ_n(x - x_i) approaches a Dirac delta function centered at x_i, and p_n(x) approaches a superposition of delta functions centered at the samples.

62 PARZEN WINDOWS cont.
(Figure: three Parzen-window density estimates based on the same set of 5 samples, using the windows from the previous figure.) The choice of h_n (or V_n) has an important effect on p_n(x). If V_n is too large, the estimate suffers from too little resolution; if V_n is too small, the estimate suffers from too much statistical variability. With a limited number of samples, we must seek some acceptable compromise.

63 PARZEN WINDOWS cont.
If we have an unlimited number of samples, then letting V_n slowly approach zero as n increases makes p_n(x) converge to the unknown density p(x). Example 1: p(x) is a zero-mean, unit-variance, univariate normal density. Let the window function be of the same form: φ(u) = (2π)^{-1/2} e^{-u^2/2}. Let h_n = h_1/√n, where h_1 is a parameter. Then p_n(x) is an average of normal densities centered at the samples: p_n(x) = (1/n) sum_{i=1}^{n} (1/h_n) φ((x - x_i)/h_n).
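A minimal sketch of Example 1, assuming synthetic standard-normal samples and an illustrative h_1; it evaluates the Parzen estimate with a Gaussian window and h_n = h_1/√n:

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(0.0, 1.0, 256)          # samples from N(0, 1)
n = len(data)
h1 = 1.0                                  # window-width parameter (assumed)
h_n = h1 / np.sqrt(n)

def phi(u):
    """Gaussian window function (also the true density here)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def parzen_estimate(x):
    """p_n(x) = (1/n) * sum_i (1/h_n) * phi((x - x_i) / h_n)."""
    return np.mean(phi((x - data) / h_n) / h_n)

for x in (-2.0, 0.0, 2.0):
    print(x, parzen_estimate(x), phi(x))   # estimate vs. true density
```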

