Lecture 3: Inferences using Least-Squares

Abstraction: a vector of N random variables, x, with joint probability density p(x), expectation x̄, and covariance C_x. (Shown as 2-D here, but actually N-dimensional.)

The multivariate normal distribution p(x) = (2π)^{-N/2} |C_x|^{-1/2} exp{ -½ (x − x̄)^T C_x^{-1} (x − x̄) } has expectation x̄ and covariance C_x, and is normalized to unit area.

examples

[Five example slides: contour plots of the bivariate normal p(x,y) for a fixed expectation x̄ and several different choices of covariance C_x; the numerical values of x̄ and C_x are shown beside each plot.]

Remember this from last lecture? The marginal distributions: p(x_1) = ∫ p(x_1, x_2) dx_2 is the distribution of x_1 (irrespective of x_2), and p(x_2) = ∫ p(x_1, x_2) dx_1 is the distribution of x_2 (irrespective of x_1).

p(y) = ∫ p(x,y) dx

p(x) = ∫ p(x,y) dy

Remember p(x,y) = p(x|y) p(y) = p(y|x) p(x) from the last lecture? We can compute p(x|y) and p(y|x) as follows: p(x|y) = p(x,y) / p(y) and p(y|x) = p(x,y) / p(x).

[Plots of p(x,y), p(x|y), and p(y|x).]

Any linear function of a normal distribution is a normal distribution. If p(x) = (2π)^{-N/2} |C_x|^{-1/2} exp{ -½ (x − x̄)^T C_x^{-1} (x − x̄) } and y = Mx, then p(y) = (2π)^{-N/2} |C_y|^{-1/2} exp{ -½ (y − ȳ)^T C_y^{-1} (y − ȳ) }, with ȳ = M x̄ and C_y = M C_x M^T. Memorize!
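
(A quick numerical check of this rule, sketched in Python/NumPy; the particular x̄, C_x, and M below are made up for illustration.)

    import numpy as np

    rng = np.random.default_rng(0)

    # made-up mean and covariance for a 2-vector x, and a linear map M
    xbar = np.array([2.0, 1.0])
    Cx = np.array([[1.0, 0.5],
                   [0.5, 2.0]])
    M = np.array([[1.0, -1.0],
                  [0.5,  1.0]])

    # theoretical mean and covariance of y = M x
    ybar_theory = M @ xbar
    Cy_theory = M @ Cx @ M.T

    # sample x from N(xbar, Cx), map through M, and compare
    x = rng.multivariate_normal(xbar, Cx, size=100_000)
    y = x @ M.T
    print(ybar_theory, y.mean(axis=0))      # should agree closely
    print(Cy_theory)
    print(np.cov(y, rowvar=False))          # should agree closely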

Do you remember this from a previous lecture? If d = Gm, then the standard least-squares solution is m^est = [G^T G]^{-1} G^T d, and the rule for error propagation gives C_m = σ_d² [G^T G]^{-1}.

Example: all the data are assumed to have the same true value, m_1, and each is measured with the same variance, σ_d². The model d = Gm has G = [1, 1, 1, …, 1]^T, a single column of N ones. Then G^T G = N, so [G^T G]^{-1} = N^{-1}, and G^T d = Σ_i d_i, giving m^est = [G^T G]^{-1} G^T d = (Σ_i d_i) / N and C_m = σ_d² / N.

m_1^est = (Σ_i d_i) / N … the traditional formula for the mean! The estimated mean has variance C_m = σ_d² / N = σ_m², so σ_m = σ_d / √N. The estimated mean is a normally distributed random variable, and the width of its distribution, σ_m, decreases with the square root of the number of measurements.
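
(A small NumPy sketch checking σ_m = σ_d / √N by repeating the experiment many times; the true value, σ_d, and N below are arbitrary choices.)

    import numpy as np

    rng = np.random.default_rng(1)

    m_true, sigma_d, N = 10.0, 2.0, 25   # made-up true value, data std dev, sample size
    trials = 20_000                      # number of repeated experiments

    # each row is one experiment: N measurements of the same true value
    d = m_true + sigma_d * rng.standard_normal((trials, N))
    m_est = d.mean(axis=1)               # the least-squares estimate = sample mean

    print(m_est.std())           # empirical sigma_m
    print(sigma_d / np.sqrt(N))  # predicted sigma_m = sigma_d / sqrt(N)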

Accuracy grows only slowly with N. [Plots of the distribution of the estimated mean for N = 1, 10, 100, 1000.]

Estimating the variance of the data: what σ_d² do you use in this formula?

Prior estimates of σ_d are based on knowledge of the limits of your measuring technique … my ruler has only mm tics, so I'm going to assume that σ_d = 0.5 mm; the manufacturer claims that the instrument is accurate to 0.1%, so since my typical measurement is 25, I'll assume σ_d = 0.025.

Posterior estimate of the error, based on the error measured with respect to the best fit: σ_d² = (1/N) Σ_i (d_i^obs − d_i^pre)² = (1/N) Σ_i e_i².

Straight-line fit: the i-th equation is a + b x_i = y_i, i.e. Gm = d with the i-th row of G equal to [1, x_i], m = [a, b]^T, and d = [y_1, y_2, …, y_N]^T. Then m^est = [G^T G]^{-1} G^T d is normally distributed with covariance C_m = σ_d² [G^T G]^{-1}.
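
(A minimal NumPy sketch of this straight-line fit, including the posterior estimate of σ_d from the residuals; the synthetic intercept, slope, and noise level are made up.)

    import numpy as np

    rng = np.random.default_rng(2)

    # synthetic straight-line data: d = a + b*x + noise
    N, a_true, b_true, sigma_d = 100, 1.0, 0.5, 0.3
    x = np.linspace(0.0, 10.0, N)
    d = a_true + b_true * x + sigma_d * rng.standard_normal(N)

    # G has rows [1, x_i]; m = [a, b]
    G = np.column_stack([np.ones(N), x])

    # m_est = [G^T G]^{-1} G^T d (solve rather than forming the inverse explicitly)
    GTG = G.T @ G
    m_est = np.linalg.solve(GTG, G.T @ d)

    # posterior estimate of the data variance from the residuals
    e = d - G @ m_est
    sigma_d_post2 = e @ e / N

    # covariance of the estimate: C_m = sigma_d^2 [G^T G]^{-1}
    C_m = sigma_d_post2 * np.linalg.inv(GTG)

    print("a, b estimates:", m_est)
    print("posterior sigma_d:", np.sqrt(sigma_d_post2))
    print("std errors:", np.sqrt(np.diag(C_m)))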

p(m) = p(a, b) = p(intercept, slope) [shown as a 2-D distribution over intercept and slope]

How probable is a dataset?

N data d are all drawn from the same distribution p(d). The probable-ness of a single measurement d_i is p(d_i), so the probable-ness of the whole dataset is p(d_1) × p(d_2) × … × p(d_N) = Π_i p(d_i). Then L = ln Π_i p(d_i) = Σ_i ln p(d_i) is called the "likelihood" of the data.

Now imagine that the distribution p(d) is known up to a vector m of unknown parameters. Write p(d; m), with the semicolon as a reminder that it's not a joint probability. Then L is a function of m: L(m) = Σ_i ln p(d_i; m).

The Principle of Maximum Likelihood: choose the m that maximizes L(m), so that the dataset that was in fact observed is the most probable one that could have been observed. The best choice of parameters m is the one that makes the dataset likely.
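
(A toy NumPy illustration of the principle, not from the lecture: for normally distributed data with known σ_d and an unknown mean m, evaluate L(m) = Σ_i ln p(d_i; m) on a grid of candidate values and pick the maximizer; it lands on the sample mean, as the earlier slides suggest it should.)

    import numpy as np

    rng = np.random.default_rng(3)

    # toy data: N measurements with unknown mean m and known sigma_d
    sigma_d, m_true, N = 1.5, 4.0, 50
    d = m_true + sigma_d * rng.standard_normal(N)

    def log_likelihood(m):
        # L(m) = sum_i ln p(d_i; m) for a normal distribution with std sigma_d
        return np.sum(-0.5 * np.log(2 * np.pi * sigma_d**2)
                      - 0.5 * (d - m)**2 / sigma_d**2)

    m_grid = np.linspace(0.0, 8.0, 2001)
    L = np.array([log_likelihood(m) for m in m_grid])

    print("ML estimate:", m_grid[np.argmax(L)])
    print("sample mean:", d.mean())   # the two agree to grid resolution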

The multivariate normal distribution for data, d: p(d) = (2π)^{-N/2} |C_d|^{-1/2} exp{ -½ (d − d̄)^T C_d^{-1} (d − d̄) }. Let's assume that the expectation d̄ is given by a general linear model, d̄ = Gm, and that the covariance C_d is known (prior covariance).

Then we have a distribution p(d; m) with unknown parameters, m: p(d; m) = (2π)^{-N/2} |C_d|^{-1/2} exp{ -½ (d − Gm)^T C_d^{-1} (d − Gm) }. We can now apply the principle of maximum likelihood to estimate the unknown parameters m.

Find the m that maximizes L(m) = ln p(d; m), with p(d; m) = (2π)^{-N/2} |C_d|^{-1/2} exp{ -½ (d − Gm)^T C_d^{-1} (d − Gm) }.

L(m) = ln p(d; m) = −½ N ln(2π) − ½ ln|C_d| − ½ (d − Gm)^T C_d^{-1} (d − Gm). The first two terms do not contain m, so the principle of maximum likelihood says: maximize −½ (d − Gm)^T C_d^{-1} (d − Gm), or equivalently minimize (d − Gm)^T C_d^{-1} (d − Gm).

Special case of uncorrelated data with equal variance, C_d = σ_d² I: minimize σ_d^{-2} (d − Gm)^T (d − Gm) with respect to m, which is the same as minimizing (d − Gm)^T (d − Gm) with respect to m. This is the Principle of Least Squares.

But back to the general case … what formula for m does the rule "minimize (d − Gm)^T C_d^{-1} (d − Gm)" imply?

Answer (after a lot of algebra): m = [G^T C_d^{-1} G]^{-1} G^T C_d^{-1} d, and then by the usual rules of error propagation, C_m = [G^T C_d^{-1} G]^{-1}.

This special case (diagonal C_d) is often called Weighted Least Squares. Note that the total error is E = e^T C_d^{-1} e = Σ_i σ_i^{-2} e_i²: each individual error is weighted by the reciprocal of its variance, so errors involving data with SMALL variance get MORE weight.

Example: fitting a straight line; 100 data, where the first 50 have a different σ_d than the last 50.

Equal variance: left 50: σ_d = 5; right 50: σ_d = 5.

Left has smaller variance: first 50: σ_d = 5; last 50: σ_d = 100.

Right has smaller variance: first 50: σ_d = 100; last 50: σ_d = 5.
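
(A minimal NumPy sketch of this last case, with σ_d = 100 for the first 50 points and σ_d = 5 for the last 50; the true intercept and slope are made up.)

    import numpy as np

    rng = np.random.default_rng(4)

    # straight line d = a + b*x, 100 points, two different noise levels
    N, a_true, b_true = 100, 2.0, 1.0
    x = np.linspace(0.0, 10.0, N)
    sigma = np.where(np.arange(N) < 50, 100.0, 5.0)   # first 50 noisy, last 50 precise
    d = a_true + b_true * x + sigma * rng.standard_normal(N)

    G = np.column_stack([np.ones(N), x])

    # weighted least squares: m = [G^T Cd^{-1} G]^{-1} G^T Cd^{-1} d, Cd diagonal
    Cd_inv = np.diag(1.0 / sigma**2)
    A = G.T @ Cd_inv @ G
    m_wls = np.linalg.solve(A, G.T @ Cd_inv @ d)
    C_m = np.linalg.inv(A)                 # covariance of the estimate

    # ordinary (unweighted) least squares for comparison
    m_ols = np.linalg.solve(G.T @ G, G.T @ d)

    print("weighted LS:", m_wls, "+/-", np.sqrt(np.diag(C_m)))
    print("ordinary LS:", m_ols)    # pulled around by the noisy half of the data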

What can go wrong in least squares, m = [G^T G]^{-1} G^T d: the matrix G^T G is singular, so [G^T G]^{-1} does not exist.

Example: a straight-line fit, with d = [d_1, d_2, d_3, …, d_N]^T and the rows of G equal to [1, x_i]. Then G^T G = [ N, Σ_i x_i ; Σ_i x_i, Σ_i x_i² ] and det(G^T G) = N Σ_i x_i² − [Σ_i x_i]², so [G^T G]^{-1} fails to exist when the determinant is zero.

det(G^T G) = N Σ_i x_i² − [Σ_i x_i]² = 0 in two cases. N = 1, only one measurement (x, d): N Σ_i x_i² − [Σ_i x_i]² = x² − x² = 0; you can't fit a straight line to only one point. N > 1, all data measured at the same x: N Σ_i x_i² − [Σ_i x_i]² = N² x² − N² x² = 0; measuring the same point over and over doesn't help.
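
(A quick NumPy illustration of the second case, all measurements taken at the same x; the numbers are arbitrary.)

    import numpy as np

    # 10 measurements, all taken at the same x value
    x = np.full(10, 3.0)
    G = np.column_stack([np.ones_like(x), x])

    GTG = G.T @ G
    print(np.linalg.det(GTG))          # 0 (up to round-off): the matrix is singular
    print(np.linalg.matrix_rank(GTG))  # rank 1, not 2, so [G^T G]^{-1} does not exist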

This sort of ‘missing measurement’ might be difficult to recognize in a complicated problem but it happens all the time …

Example - Tomography

In this method, you try to plaster the subject with X-ray beams made at every possible position and direction, but you can easily wind up missing some small region, leaving no data coverage there.

What to do? Introduce prior information: assumptions about the behavior of the unknowns that 'fill in' the data gaps.

Examples of prior information. The unknowns: are close to some already-known value (the density of the mantle is close to 3000 kg/m³); vary smoothly with time or with geographical position (ocean currents have length scales of tens of km); or obey some physical law embodied in a PDE (water is incompressible and thus its velocity satisfies div(v) = 0).

Are you only fooling yourself? It depends … are your assumptions good ones?

Application of the maximum likelihood method to this problem. So, let's have a foray into the world of probability.

Overall Strategy:
1. Represent the observed data as a probability distribution.
2. Represent prior information as a probability distribution.
3. Represent the relationship between data and model parameters as a probability distribution.
4. Combine the three distributions in a way that embodies combining the information that they contain.
5. Apply maximum likelihood to the combined distribution.

How do we combine distributions in a way that embodies combining the information that they contain? Short answer: multiply them. [Figure: p_1(x) says x is between x_1 and x_3, p_2(x) says x is between x_2 and x_4, and their product p_T(x) says x is between x_2 and x_3.]
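
(A small numerical sketch of "multiply them" for two 1-D normal distributions with made-up means and widths; the renormalized product is again normal, narrower than either factor, with the familiar precision-weighted mean.)

    import numpy as np

    def gaussian(x, mean, sigma):
        return np.exp(-0.5 * ((x - mean) / sigma)**2) / (sigma * np.sqrt(2 * np.pi))

    x = np.linspace(-5.0, 10.0, 20001)
    dx = x[1] - x[0]

    p1 = gaussian(x, 1.0, 2.0)    # "x is probably near 1, give or take 2"
    p2 = gaussian(x, 4.0, 1.0)    # "x is probably near 4, give or take 1"

    pT = p1 * p2                  # combine the information by multiplying...
    pT /= pT.sum() * dx           # ...and renormalize to unit area

    mean_T = (x * pT).sum() * dx
    std_T = np.sqrt(((x - mean_T)**2 * pT).sum() * dx)
    print(mean_T, std_T)

    # analytic check: the product of two normals is normal, with
    # precision-weighted mean and combined (smaller) variance
    w1, w2 = 1 / 2.0**2, 1 / 1.0**2
    print((w1 * 1.0 + w2 * 4.0) / (w1 + w2), np.sqrt(1 / (w1 + w2)))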

Overall strategy, step 1: represent the observed data as a Normal probability distribution, p_A(d) ∝ exp{ -½ (d − d^obs)^T C_d^{-1} (d − d^obs) }. In the absence of any other information, the best estimate of the mean of the data is the observed data itself, and C_d is the prior covariance of the data. (I don't feel like typing the normalization.)

Overall strategy, step 2: represent prior information as a Normal probability distribution, p_A(m) ∝ exp{ -½ (m − m_A)^T C_m^{-1} (m − m_A) }. Here m_A is the prior estimate of the model, your best guess as to what it would be in the absence of any observations, and the prior covariance of the model, C_m, quantifies how good you think your prior estimate is …

Example: one observation, d^obs = 0.8 ± 0.4, and one model parameter, with m_A = 1.0 ± 1.25.

[Plots of p_A(d) and p_A(m), marking m_A = 1 and d^obs = 0.8.]

Overall strategy, step 3: represent the relationship between data and model parameters as a probability distribution, p_T(d,m) ∝ exp{ -½ (d − Gm)^T C_G^{-1} (d − Gm) }, where Gm = d is the linear theory relating the data, d, to the model parameters, m, and the prior covariance of the theory, C_G, quantifies how good you think your linear theory is.

Example theory: d = m, but only accurate to ±0.2.

[Plot of p_T(d,m) for the theory d = m, with m_A = 1 and d^obs marked.]

Overall strategy, step 4: combine the three distributions in a way that embodies combining the information that they contain: p(m,d) = p_A(d) p_A(m) p_T(m,d) ∝ exp{ -½ [ (d − d^obs)^T C_d^{-1} (d − d^obs) + (m − m_A)^T C_m^{-1} (m − m_A) + (d − Gm)^T C_G^{-1} (d − Gm) ] }. A bit of a mess, but it can be simplified …

p(d,m) = p_A(d) p_A(m) p_T(d,m)

Overall strategy, step 5: apply maximum likelihood to the combined distribution, p(d,m) = p_A(d) p_A(m) p_T(m,d).

[Plot of p(d,m); the maximum likelihood point gives m^est and d^pre.]

Special case of an exact theory: the covariance C_G is very small, in the limit C_G → 0. After projecting p(d,m) to p(m) by integrating over all d, p(m) ∝ exp{ -½ [ (Gm − d^obs)^T C_d^{-1} (Gm − d^obs) + (m − m_A)^T C_m^{-1} (m − m_A) ] }.

Maximizing p(m) is equivalent to minimizing (Gm − d^obs)^T C_d^{-1} (Gm − d^obs) + (m − m_A)^T C_m^{-1} (m − m_A), that is, the weighted "prediction error" plus the weighted "distance of the model from its prior value".

The solution, calculated via the usual messy minimization process, is m^est = m_A + M [ d^obs − G m_A ], where M = [G^T C_d^{-1} G + C_m^{-1}]^{-1} G^T C_d^{-1}. Don't memorize, but be prepared to use.
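
(A minimal NumPy sketch of using this formula on a made-up straight-line problem with a weak prior on the two model parameters; all numbers are illustrative.)

    import numpy as np

    rng = np.random.default_rng(5)

    # made-up straight-line problem with a prior on the model parameters
    N = 20
    x = np.linspace(0.0, 10.0, N)
    G = np.column_stack([np.ones(N), x])

    m_true = np.array([1.0, 0.5])
    sigma_d = 0.5
    d_obs = G @ m_true + sigma_d * rng.standard_normal(N)

    Cd_inv = np.eye(N) / sigma_d**2      # prior data covariance (known)
    m_A = np.array([0.0, 0.0])           # prior model estimate (a guess)
    Cm_inv = np.eye(2) / 10.0**2         # weak prior: sigma_m = 10 on each parameter

    # M = [G^T Cd^{-1} G + Cm^{-1}]^{-1} G^T Cd^{-1}
    M = np.linalg.solve(G.T @ Cd_inv @ G + Cm_inv, G.T @ Cd_inv)

    # m_est = m_A + M [ d_obs - G m_A ]
    m_est = m_A + M @ (d_obs - G @ m_A)
    print(m_est)    # close to ordinary least squares, since the prior is weak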

Interesting interpretation: m^est − m_A = M [ d^obs − G m_A ], i.e. (estimated model minus its prior) equals M times (observed data minus the prediction of the prior model); the linear connection between the two is a generalized form of least squares.

Special uncorrelated case, C_m = σ_m² I and C_d = σ_d² I: M = [G^T C_d^{-1} G + C_m^{-1}]^{-1} G^T C_d^{-1} = [ G^T G + (σ_d / σ_m)² I ]^{-1} G^T. This formula is sometimes called "damped least squares", with "damping factor" ε = σ_d / σ_m.

Damped least squares makes the process of avoiding the singular matrices associated with insufficient data trivially easy: you just add ε² I to G^T G before computing the inverse.

G T G  G T G +  2 I this process regularizes the matrix, so its inverse always exists its interpretation is : in the absence of relevant data, assume the model parameter has its prior value

Are you only fooling yourself? It depends … is the assumption, that you know the prior value, a good one?