Pattern Recognition and Image Analysis – Dr. Manal Helal – Fall 2014 – Lecture 4
Parameter Estimation
Course Outline
– Model information complete: Bayes Decision Theory ("Optimal" rules)
– Model information incomplete:
  – Supervised learning
    – Parametric approach: plug-in rules
    – Nonparametric approach: density estimation, geometric rules (K-NN, MLP)
  – Unsupervised learning
    – Parametric approach: mixture resolving
    – Nonparametric approach: cluster analysis (hard, fuzzy)
We Know Very Little
The ultimate goal of all sciences is to arrive at quantitative models that describe nature with sufficient accuracy; in short, to calculate nature. These calculations have the general form answer = f(question), or output = f(input). The known laws of nature are rather limited (although expanding) and cannot answer all such questions directly. Chemists, materials scientists, engineers, or biologists who want to ask questions such as the biological effect of a new molecular entity or the properties of a new material composition will need to estimate some parameters from what they already know.
To Expand Our Knowledge
Situation 1: The model function f is theoretically or empirically known. The output quantity of interest can then be calculated directly.
Situation 2: The structural form of the function f is known, but not the values of its parameters. These parameter values can then be estimated statistically from experimental data by curve-fitting methods.
Situation 3: Even the structural form of the function f is unknown. As an approximation, f may be modeled by a machine learning technique on the basis of experimental data.
Situation 2 – 2D
Input x produces output y linearly according to the line equation y = f(x) = b1 + b2·x. The parameters b1 and b2 of this line (the y-intercept and the slope) can be estimated geometrically, or by curve fitting using the least-squares method.
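As an illustration only (not part of the original slides), here is a minimal least-squares fit of b1 and b2 in Python, assuming numpy is available and using synthetic noisy data:

```python
import numpy as np

# Hypothetical noisy samples of y = b1 + b2*x with b1 = 1.0, b2 = 2.5
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.5 * x + rng.normal(scale=0.5, size=x.size)

# Least squares: solve for the coefficients using the design matrix [1, x]
A = np.column_stack([np.ones_like(x), x])
(b1, b2), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"estimated intercept b1 = {b1:.3f}, slope b2 = {b2:.3f}")
```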
Situation 2 – Higher Dimensions
Chikazoe J., Lee H., Kriegeskorte N., & Anderson A., "Population coding of affect across stimuli, modalities and individuals", Nature Neuroscience 17, 1114–1122 (2014). Received 19 January 2014; accepted 23 May 2014; published online 22 June 2014.
Figure (Chikazoe et al., 2014): (a) ROIs were determined on the basis of anatomical gray matter masks. (b) The 128 visual scene stimuli were arranged using MDS such that pairwise distances reflected neural response-pattern similarity. Color code indicates feature magnitude scores for low-level visual features in EVC (top), animacy in VTC (middle), and subjective valence in OFC (bottom) for the same stimuli. Examples i, ii, iii, iv and v traverse the primary dimension in each feature space, with pictures illustrating visual features (for example, luminance) (top), animacy (middle), and valence (bottom).
Limitations
With more than 10 dimensions, the data become difficult to visualise without transformations. Situation 3 is the most difficult.
Parametric Distributions
Parametric distributions are probability distributions that can be described using a finite set of parameters.
1. Choose a distribution model for your data from a parametric family of probability distributions.
2. Adjust the parameters to fit the data.
3. Perform further analyses by computing summary statistics, evaluating the probability density function (pdf) and cumulative distribution function (cdf), and assessing the fit of the distribution to your data.
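A small sketch of this three-step workflow, assuming Python with scipy.stats and a synthetic sample (the normal family and all values below are illustrative assumptions, not from the slides):

```python
import numpy as np
from scipy import stats

# Hypothetical sample, assumed to be roughly normal
rng = np.random.default_rng(1)
sample = rng.normal(loc=10.0, scale=2.0, size=500)

# Steps 1-2: choose a parametric family (normal) and fit its parameters
mu, sigma = stats.norm.fit(sample)

# Step 3: summary statistics and pdf/cdf evaluation at points of interest
print(f"fitted mean = {mu:.2f}, fitted std = {sigma:.2f}")
print("pdf at x=10:", stats.norm.pdf(10.0, loc=mu, scale=sigma))
print("cdf at x=12:", stats.norm.cdf(12.0, loc=mu, scale=sigma))
```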
Parameter Estimation
Seen from a distance, without prior knowledge of how many girls and boys there are, each student we observe is either a Boy or a Girl. We denote by θ the (unknown) probability P(B). Estimation task: given a sequence of students x[1], x[2], ..., x[M], we want to estimate the probabilities P(B) = θ and P(G) = 1 - θ.
Consider instances x[1], x[2], ..., x[M] such that:
– the set of values that x can take is known,
– each is sampled from the same distribution,
– each is sampled independently of the rest.
Here we focus on multinomial distributions: only finitely many possible values for x. Special case: binomial, with values B(oy) and G(irl).
Two major approaches
Maximum-likelihood method and Bayesian method. In both cases we use P(ωi | x) for our classification rule! The results are nearly identical, but the approaches are different.
Maximum-Likelihood vs. Bayesian
Maximum likelihood: parameters are fixed but unknown. The best parameters are obtained by maximizing the probability of obtaining the samples observed.
Bayes: parameters are random variables having some known prior distribution. The best parameters are obtained by estimating them given the new data, which converts the prior into a posterior density.
Maximum-Likelihood Estimation
Principle: choose the parameters that maximize the likelihood function. This is one of the most commonly used estimators in statistics. It is intuitively appealing, has good convergence properties as the sample size increases, and is simpler than many alternative techniques.
Major assumptions
Suppose we have a set D = {x1, ..., xn} of independent and identically distributed (i.i.d.) samples drawn from the density p(x|θ). We would like to use the training samples in D to estimate the unknown parameter vector θ. Define the likelihood function of θ with respect to D as
L(θ|D) = p(D|θ) = ∏_{k=1}^{n} p(xk|θ).
The maximum likelihood estimate (MLE) of θ is, by definition, the value that maximizes L(θ|D) and can be computed as
θ̂ = argmax_θ L(θ|D).
It is often easier to work with the logarithm of the likelihood function (log-likelihood function), which gives
θ̂ = argmax_θ log L(θ|D) = argmax_θ ∑_{k=1}^{n} log p(xk|θ).
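A numerical sketch of this definition (not from the slides), assuming a Gaussian model p(x|θ) with θ = (μ, σ) and scipy available; it maximizes the log-likelihood by minimizing its negative on synthetic data:

```python
import numpy as np
from scipy import optimize, stats

# Hypothetical i.i.d. sample from an unknown normal density p(x|theta), theta = (mu, sigma)
rng = np.random.default_rng(2)
D = rng.normal(loc=5.0, scale=1.5, size=200)

def neg_log_likelihood(theta, data):
    mu, log_sigma = theta              # optimize log(sigma) so that sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

# Maximize log L(theta|D) by minimizing the negative log-likelihood
result = optimize.minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(D,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(f"MLE: mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
```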
The Binomial Likelihood Function
How good is a particular θ? It depends on how likely it is to generate the observed data:
L(θ:D) = P(D|θ) = ∏_m P(x[m]|θ)
The likelihood for the sequence B, G, G, B, B is
L(θ:D) = θ · (1 - θ) · (1 - θ) · θ · θ
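To make the idea concrete, a short sketch (not from the slides) that evaluates this likelihood over a grid of θ values for the sequence B, G, G, B, B:

```python
import numpy as np

# Likelihood of the observed sequence B, G, G, B, B as a function of theta = P(B)
sequence = ["B", "G", "G", "B", "B"]
thetas = np.linspace(0.0, 1.0, 101)

def likelihood(theta, seq):
    # Product of P(x[m]|theta) over the i.i.d. observations
    probs = [theta if x == "B" else 1.0 - theta for x in seq]
    return np.prod(probs)

L = np.array([likelihood(t, sequence) for t in thetas])
print("theta maximizing L:", thetas[np.argmax(L)])   # ~0.6 = 3 boys / 5 students
```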
Example: MLE in Binomial Data
It can be shown that the MLE for the probability of boys is
θ̂ = N_B / (N_B + N_G),
which coincides with what one would expect. Example: (N_B, N_G) = (3, 2), so the MLE estimate is 3/5 = 0.6.
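For completeness, the standard derivation behind this result (not spelled out on the slide) sets the derivative of the log-likelihood to zero:

```latex
\log L(\theta) = N_B \log\theta + N_G \log(1-\theta), \qquad
\frac{d}{d\theta}\log L(\theta) = \frac{N_B}{\theta} - \frac{N_G}{1-\theta} = 0
\;\Longrightarrow\;
\hat{\theta} = \frac{N_B}{N_B + N_G}.
```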
From Binomial to Multinomial
For example, suppose X can take the values 1, 2, ..., K. We want to learn the parameters θ1, θ2, ..., θK. Observations: N1, N2, ..., NK, the number of times each outcome is observed.
Likelihood function: L(θ:D) = ∏_{k=1}^{K} θk^{Nk}
MLE: θ̂k = Nk / ∑_l Nl
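A tiny illustrative sketch of the multinomial MLE as normalized counts (the counts below are made up):

```python
import numpy as np

# Hypothetical outcome counts N_1, ..., N_K for K = 4 outcomes
counts = np.array([12, 7, 25, 6])

# MLE for a multinomial: normalize the counts to relative frequencies
theta_hat = counts / counts.sum()
print(theta_hat, theta_hat.sum())   # the estimates sum to 1
```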
Maximum Likelihood Estimation
If the number of parameters is p, i.e., θ = (θ1, ..., θp)^T, define the gradient operator ∇_θ = (∂/∂θ1, ..., ∂/∂θp)^T. Then the MLE of θ should satisfy the necessary conditions
∇_θ log L(θ|D) = ∑_{k=1}^{n} ∇_θ log p(xk|θ) = 0.
Maximum Likelihood Estimation: Properties of MLEs
– The MLE is the parameter point for which the observed sample is the most likely.
– The procedure based on partial derivatives may yield several local extrema; each solution should be checked individually to identify the global optimum.
– Boundary conditions must also be checked separately for extrema.
– Invariance property: if θ̂ is the MLE of θ, then for any function f(θ), the MLE of f(θ) is f(θ̂).
Bayes Parameter Estimation
Bayesian Estimation
Assumes that the parameters are random variables that have some known a-priori distribution p(θ). Estimates a distribution rather than making point estimates like ML. The BE solution might not be of the parametric form assumed.
Bayesian Estimation (BE)
We need to estimate p(x|ωi, Di) for every class ωi. If the samples in Dj give no information about θi (for j ≠ i), we need to solve c independent problems of the following form: "Given D, estimate p(x|D)".
BE Approach
Estimate p(x|D) as follows:
p(x|D) = ∫ p(x, θ|D) dθ = ∫ p(x|θ) p(θ|D) dθ
since the distribution is known completely given θ, i.e., p(x|θ, D) = p(x|θ). This is an important equation: it links p(x|D) with p(θ|D).
BE Main Steps
(1) Compute p(θ|D):
p(θ|D) = p(D|θ) p(θ) / ∫ p(D|θ) p(θ) dθ, with p(D|θ) = ∏_{k=1}^{n} p(xk|θ)
(2) Compute p(x|D):
p(x|D) = ∫ p(x|θ) p(θ|D) dθ
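As a concrete conjugate-prior illustration of these two steps (my own example, not from the slides): for the boy/girl setting with a Beta(a, b) prior on θ = P(B), step (1) gives a Beta posterior in closed form and step (2) reduces to the posterior mean.

```python
import numpy as np
from scipy import stats

# Hypothetical observed sequence and a Beta(1, 1) (uniform) prior on theta = P(B)
sequence = ["B", "G", "G", "B", "B"]
a_prior, b_prior = 1.0, 1.0
n_B = sum(x == "B" for x in sequence)
n_G = len(sequence) - n_B

# Step (1): the posterior p(theta|D) is Beta(a + n_B, b + n_G) by conjugacy
a_post, b_post = a_prior + n_B, b_prior + n_G
posterior = stats.beta(a_post, b_post)

# Step (2): p(x = B|D) = integral of theta * p(theta|D) dtheta = posterior mean
p_boy_given_D = a_post / (a_post + b_post)
print("posterior mean of theta:", posterior.mean())     # same value
print("P(next student is a boy | D):", p_boy_given_D)   # 4/7 here, vs. MLE 3/5
```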
Interpretation of the BE Solution
If we are less certain about the exact value of θ, we should consider a weighted average of p(x|θ) over the possible values of θ. The Bayesian estimation approach estimates a distribution for p(x|D) rather than making point estimates like ML.
Relation to the ML Solution
Suppose p(θ|D) peaks very sharply at θ = θ̂. Then p(x|D) can be approximated as
p(x|D) ≈ p(x|θ̂),
i.e., the best estimate is obtained by setting θ = θ̂. This coincides with the ML solution (i.e., p(D|θ) peaks at θ̂ too), since p(θ|D) ∝ p(D|θ) p(θ).
Interpretation of Bayesian Estimation
Given a large number of samples, p(θ|Dn) will have a very strong peak at θ̂; in this case p(x|Dn) ≈ p(x|θ̂). There are cases where p(θ|Dn) contains more than one peak (i.e., more than one θ explains the data); in this case the solution should be obtained by integrating p(x|θ) over p(θ|Dn).
ML vs. Bayesian Estimation
Number of training data
– The two methods are equivalent given an infinite amount of training data (and prior distributions that do not exclude the true solution).
– For small training sets, they give different results in most cases.
Computational complexity
– ML uses differential calculus or gradient search to maximize the likelihood.
– Bayesian estimation requires complex multidimensional integration techniques.
ML vs. Bayesian Estimation (cont'd)
Solution complexity
– ML solutions are easier to interpret: they must be of the assumed parametric form and yield an estimate of θ based on the samples in D (a different sample set would give rise to a different estimate).
– A Bayesian estimation solution might not be of the parametric form assumed, but it takes the sampling variability into account.
– Bayes assumes that we do not know the true value of θ; instead of taking a single estimate, we take a weighted average of the densities p(x|θ), weighted by the distribution p(θ|D).
Prior distribution
– If the prior distribution p(θ) is uniform, Bayesian estimation solutions are equivalent to ML solutions.
– Otherwise, the two methods will give different solutions.
ML vs Bayesian Estimation (cont’d) General comments – There are strong theoretical and methodological arguments supporting Bayesian estimation. – In practice, ML estimation is simpler and can lead to comparable performance.
Computational Complexity
ML estimation, learning complexity (Gaussian case; dimensionality d, number of training data n, number of classes c, with n > d): estimating the sample mean is O(dn), estimating the sample covariance matrix is O(d^2 n), estimating the prior is O(n), and inverting the covariance matrix is O(d^3), which is at most O(d^2 n) since n > d. The overall learning complexity is therefore O(d^2 n), and these computations must be repeated c times, once per class.
Computational Complexity
Classification complexity (same notation): evaluating the quadratic term of each class discriminant costs O(d^2) and the remaining terms cost O(1); these computations must be repeated c times, taking the maximum over classes. Bayesian estimation has higher learning complexity but the same classification complexity.
Comparison of MLEs and Bayes Estimates
If there is much data (a strongly peaked p(θ|D)) and the prior p(θ) is uniform, then the Bayes estimate and the MLE are equivalent.
Goodness-of-fit
To measure how well a fitted distribution resembles the sample data (goodness-of-fit), we can use the Kolmogorov-Smirnov test statistic. It is defined as the maximum absolute difference between the cumulative distribution function estimated from the sample and the one calculated from the fitted distribution. After estimating the parameters for different distributions, we can compute the Kolmogorov-Smirnov statistic for each distribution and choose the one with the smallest value as the best fit to our sample.
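A sketch of this model-selection procedure, assuming scipy.stats and a synthetic gamma-distributed sample (all names and numbers below are illustrative):

```python
import numpy as np
from scipy import stats

# Hypothetical sample that is actually gamma-distributed
rng = np.random.default_rng(3)
sample = rng.gamma(shape=4.0, scale=4.0, size=400)

# Fit two candidate parametric families
mu, sigma = stats.norm.fit(sample)
a, loc, scale = stats.gamma.fit(sample, floc=0)

# Kolmogorov-Smirnov statistic: max |empirical CDF - fitted CDF|
ks_norm = stats.kstest(sample, "norm", args=(mu, sigma)).statistic
ks_gamma = stats.kstest(sample, "gamma", args=(a, loc, scale)).statistic
print(f"KS statistic, normal fit: {ks_norm:.3f}")
print(f"KS statistic, gamma fit:  {ks_gamma:.3f}")   # smaller value -> better fit
```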
Maximum Likelihood Estimation Examples
Figure 1: Histograms of samples and estimated densities for different distributions. (a) Random sample from N(10, 2^2): true pdf is N(10, 4), estimated pdf is N(9.98, 4.05). (b) Random sample from 0.5 N(10, 0.4^2) + 0.5 N(11, 0.5^2): true pdf is 0.5 N(10, 0.16) + 0.5 N(11, 0.25), estimated pdf is N(10.50, 0.47). (c) Random sample from Gamma(4, 4): true pdf is Gamma(4, 4), estimated pdfs are N(16.1, 67.4) and Gamma(3.8, 4.2). (d) Cumulative distribution functions for the example in (c).
Main Sources of Error in Classifier Design
To apply these results to multiple classes, separate the training samples into c subsets D1, ..., Dc, with the samples in Di belonging to class ωi, and then estimate each density p(x|ωi, Di) separately.
Bayes error – the error due to overlapping class-conditional densities p(x|ωi).
Model error – the error due to choosing an incorrect model.
Estimation error – the error due to incorrectly estimated parameters (e.g., due to a small number of training examples).
Overfitting
When the number of training examples is inadequate, the solution obtained might not be optimal. Consider the problem of fitting a curve to some data:
– The points were sampled from a parabola (plus noise).
– A 10th-degree polynomial fits the training data perfectly but does not generalize well.
A greater error on the training data might improve generalization. We need more training examples than model parameters!
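A minimal sketch of this effect (synthetic data, illustrative only): fitting polynomials of degree 2 and degree 10 to 11 noisy samples of a parabola and comparing training and test error.

```python
import numpy as np

# Hypothetical data: points sampled from a parabola plus noise
rng = np.random.default_rng(4)
x_train = np.linspace(-1, 1, 11)
y_train = 2.0 * x_train**2 - x_train + rng.normal(scale=0.1, size=x_train.size)
x_test = np.linspace(-1, 1, 101)
y_test = 2.0 * x_test**2 - x_test

for degree in (2, 10):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # The degree-10 fit drives the training error toward zero but typically generalizes worse
    print(f"degree {degree:2d}: train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
```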
Overfitting (cont'd)
Ways to control model complexity:
– Assume a diagonal covariance matrix (i.e., uncorrelated features).
– Use the same covariance matrix for all classes and pool the data.
– Use the shrinkage technique:
  Shrink the common covariance matrix toward the identity matrix: Σ(β) = (1 - β) Σ + β I, for 0 < β < 1.
  Shrink individual covariance matrices toward the common covariance: Σi(α) = [(1 - α) ni Σi + α n Σ] / [(1 - α) ni + α n], for 0 < α < 1.
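A minimal numpy sketch of the two shrinkage formulas above (the function names and the example matrix are mine, and the blend weights α, β would in practice be chosen, e.g., by cross-validation):

```python
import numpy as np

def shrink_to_identity(cov, beta):
    """Blend a covariance matrix toward the identity matrix (0 < beta < 1)."""
    d = cov.shape[0]
    return (1 - beta) * cov + beta * np.eye(d)

def shrink_to_common(cov_i, cov_common, n_i, n, alpha):
    """Blend a class covariance toward the pooled covariance (0 < alpha < 1)."""
    return ((1 - alpha) * n_i * cov_i + alpha * n * cov_common) / ((1 - alpha) * n_i + alpha * n)

# Hypothetical example: a nearly singular 3x3 covariance regularized by shrinkage
cov = np.array([[1.0, 0.99, 0.0], [0.99, 1.0, 0.0], [0.0, 0.0, 1e-6]])
print("condition number before:", np.linalg.cond(cov))
print("condition number after: ", np.linalg.cond(shrink_to_identity(cov, 0.1)))
```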