1
Lecture 8: Advanced Topics in Least Squares - Part Two
2
Concerning the Homework: the dirty little secret of data analysis
3
You often spend more time futzing with reading files that are in inscrutable formats than on the intellectually interesting side of data analysis.
4
Sample MATLAB code, a routine to read a text file:

cs = importdata('test.txt','=');    % choose a non-occurring delimiter to force each
                                    % complete line into one cell; returns a "cellstr"
                                    % data type: an array of strings
Ns = length(cs);
mag = zeros(Ns,1);
Nm = 0;
for i = 1:Ns
    s = char(cs(i));                % convert a "cellstr" element to a string
    smag = s(48:50);
    stype = s(51:52);
    if( stype == 'Mb' )
        Nm = Nm+1;
        mag(Nm,1) = str2num(smag);  % convert the string to a number
    end
end
mag = mag(1:Nm);
5
What can go wrong in least squares

m = [ G^T G ]^{-1} G^T d

the matrix G^T G is singular, so its inverse [ G^T G ]^{-1} does not exist
6
EXAMPLE: a straight-line fit

[ d_1 ]   [ 1   x_1 ]
[ d_2 ]   [ 1   x_2 ]   [ m_1 ]
[ d_3 ] = [ 1   x_3 ]   [ m_2 ]
[  …  ]   [ …    …  ]
[ d_N ]   [ 1   x_N ]

G^T G = [ N         Σ_i x_i   ]
        [ Σ_i x_i   Σ_i x_i^2 ]

det(G^T G) = N Σ_i x_i^2 − [ Σ_i x_i ]^2

[ G^T G ]^{-1} is singular when the determinant is zero.
7
det(G^T G) = N Σ_i x_i^2 − [ Σ_i x_i ]^2 = 0 in two cases:

N = 1, only one measurement (x, d):
N Σ_i x_i^2 − [ Σ_i x_i ]^2 = x^2 − x^2 = 0
you can't fit a straight line to only one point

N > 1, but all data measured at the same x:
N Σ_i x_i^2 − [ Σ_i x_i ]^2 = N^2 x^2 − N^2 x^2 = 0
measuring the same point over and over doesn't help
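A quick numerical check of the second case, sketched in MATLAB; the x values below are made up purely for illustration:

x = [2; 2; 2; 2];            % N > 1, but every datum measured at the same x
G = [ones(size(x)), x];      % straight-line data kernel, d = G m
GTG = G' * G;
disp(det(GTG));              % prints 0: G'G is singular
disp(rank(GTG));             % prints 1, not 2: only one combination of m1, m2 is resolved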
8
Another example: sums and differences. The data are N_s sums, s_i = m_1 + m_2, and N_d differences, d_i = m_1 − m_2, of two unknowns m_1 and m_2:

[ s_1     ]   [ 1    1 ]
[ s_2     ]   [ 1    1 ]
[  …      ]   [ …    … ]   [ m_1 ]
[ d_1     ] = [ 1   −1 ]   [ m_2 ]
[  …      ]   [ …    … ]
[ d_{N_d} ]   [ 1   −1 ]

G^T G = [ N_s + N_d   N_s − N_d ]
        [ N_s − N_d   N_s + N_d ]

det(G^T G) = [ N_s + N_d ]^2 − [ N_s − N_d ]^2
           = [ N_s^2 + N_d^2 + 2 N_s N_d ] − [ N_s^2 + N_d^2 − 2 N_s N_d ]
           = 4 N_s N_d

which is zero when N_s = 0 or N_d = 0, that is, when there are only measurements of one kind.
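The same check in MATLAB, with hypothetical counts N_s = 3 and N_d = 0 (sums only):

Ns = 3; Nd = 0;                       % only sums, no differences
G  = [ones(Ns,1),  ones(Ns,1); ...    % rows [1  1] for the sums
      ones(Nd,1), -ones(Nd,1)];       % rows [1 -1] for the differences
GTG = G' * G;                         % equals [Ns+Nd, Ns-Nd; Ns-Nd, Ns+Nd]
disp(det(GTG));                       % prints 0, matching det(G'G) = 4*Ns*Nd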
9
This sort of ‘missing measurement’ might be difficult to recognize in a complicated problem but it happens all the time …
10
Example - Tomography
11
In this method, you try to plaster the subject with X-ray beams made at every possible position and direction, but you can easily wind up missing some small region with no data coverage.
12
What to do? Introduce prior information: assumptions about the behavior of the unknowns that 'fill in' the data gaps.
13
Examples of Prior Information. The unknowns:
- are close to some already-known value (the density of the mantle is close to 3000 kg/m^3)
- vary smoothly with time or with geographical position (ocean currents have length scales of tens of km)
- obey some physical law embodied in a PDE (water is incompressible and thus its velocity satisfies div(v) = 0)
14
Are you only fooling yourself? It depends … are your assumptions good ones?
15
Application of the Maximum Likelihood Method to this problem. So let's have a foray into the world of probability …
16
Overall Strategy
1. Represent the observed data as a probability distribution
2. Represent prior information as a probability distribution
3. Represent the relationship between data and model parameters as a probability distribution
4. Combine the three distributions in a way that embodies combining the information that they contain
5. Apply maximum likelihood to the combined distribution
17
How to combine distributions in a way that embodies combining the information that they contain … Short answer: multiply them. But let's step through a more well-reasoned analysis of why we should do that.

(figure: two distributions p_1(x) and p_2(x), and their combination p_T(x), each plotted against x)
18
How to quantify the information in a distribution p(x)? Information compared to what? Compared to a distribution p_N(x) that represents the state of complete ignorance. Example: p_N(x) = a uniform distribution. The information content should be a scalar quantity, Q.
19
Q = ∫ ln[ p(x)/p_N(x) ] p(x) dx

Q is the expected value of ln[ p(x)/p_N(x) ].

Properties:
- Q = 0 when p(x) = p_N(x)
- Q ≥ 0 always (using that the limit of p ln(p) as p → 0 is 0)
- Q is invariant under a change of variables x → y
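A small numerical illustration of Q in MATLAB; the Normal p(x) and the interval [−6, 6] are assumed purely for the sketch:

x  = linspace(-6, 6, 2001);             % grid wide enough to hold essentially all of p(x)
pN = ones(size(x)) / 12;                % uniform null distribution on [-6, 6]
p  = exp(-0.5*x.^2) / sqrt(2*pi);       % a Normal p(x) with zero mean and unit variance
Q  = trapz(x, p .* log(p ./ pN));       % numerical version of the integral for Q
disp(Q);                                % about 1.07, and in any case >= 0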
20
Combining distributions p_A(x) and p_B(x). Desired properties of the combination:
- p_A(x) combined with p_B(x) is the same as p_B(x) combined with p_A(x)
- p_A(x) combined with [ p_B(x) combined with p_C(x) ] is the same as [ p_A(x) combined with p_B(x) ] combined with p_C(x)
- Q of [ p_A(x) combined with p_N(x) ] ≥ Q_A
21
p_AB(x) = p_A(x) p_B(x) / p_N(x)

When p_N(x) is the uniform distribution … combining is just multiplying. But note that for points on the surface of a sphere, the null distribution p(θ, φ), with θ the colatitude and φ the longitude, would not be uniform, but rather proportional to sin(θ).
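A numerical sketch of combining-by-multiplying in MATLAB, with p_N uniform; the two Normal distributions and their means and widths are made up for the sketch:

x   = linspace(-4, 6, 2001);
pA  = exp(-0.5*((x - 1.0)/1.0).^2);         % Normal centered at 1.0 with width 1.0
pB  = exp(-0.5*((x - 2.0)/0.5).^2);         % Normal centered at 2.0 with width 0.5
pAB = pA .* pB;                             % combine by multiplying (uniform pN)
pAB = pAB / trapz(x, pAB);                  % renormalize the product
disp(trapz(x, x .* pAB));                   % mean about 1.8: pulled toward the narrower pB

The combined distribution also comes out narrower than either input, reflecting the added information.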
22
Overall Strategy
1. Represent the observed data as a Normal probability distribution

p_A(d) ∝ exp{ -½ (d − d^obs)^T C_d^{-1} (d − d^obs) }

In the absence of any other information, the best estimate of the mean of the data is the observed data itself. C_d is the prior covariance of the data. (I don't feel like typing the normalization.)
23
Overall Strategy
2. Represent prior information as a Normal probability distribution

p_A(m) ∝ exp{ -½ (m − m_A)^T C_m^{-1} (m − m_A) }

m_A is the prior estimate of the model, your best guess as to what it would be in the absence of any observations. C_m, the prior covariance of the model, quantifies how good you think your prior estimate is.
24
Example: one observation, d^obs = 0.8 ± 0.4, and one model parameter with prior m_A = 1.0 ± 1.25.
25
(figure: the distributions p_A(d) and p_A(m) for this example, plotted on axes running from 0 to 2 and centered on d^obs = 0.8 and m_A = 1)
26
Overall Strategy
3. Represent the relationship between data and model parameters as a probability distribution

p_T(d,m) ∝ exp{ -½ (d − Gm)^T C_G^{-1} (d − Gm) }

Gm = d is the linear theory relating the data, d, to the model parameters, m. C_G, the prior covariance of the theory, quantifies how good you think your linear theory is.
27
Example theory: d = m, but only accurate to ± 0.2.
28
(figure: the distribution p_T(d,m) for the theory d = m, plotted on the same 0-to-2 axes, with d^obs = 0.8 and m_A = 1 marked)
29
Overall Strategy
4. Combine the three distributions in a way that embodies combining the information that they contain

p(d,m) = p_A(d) p_A(m) p_T(d,m)
       ∝ exp{ -½ [ (d − d^obs)^T C_d^{-1} (d − d^obs) + (m − m_A)^T C_m^{-1} (m − m_A) + (d − Gm)^T C_G^{-1} (d − Gm) ] }

a bit of a mess, but it can be simplified …
30
(figure: the combined distribution p(d,m) = p_A(d) p_A(m) p_T(d,m), plotted on the same 0-to-2 axes)
31
Overall Strategy
5. Apply maximum likelihood to the combined distribution, p(d,m) = p_A(d) p_A(m) p_T(d,m)

There are two distinct ways we could do this:
- find the (d,m) combination that maximizes the joint probability distribution p(d,m), or
- find the m that maximizes the individual (marginal) probability distribution p(m) = ∫ p(d,m) dd.

These do not necessarily give the same value for m.
32
(figure: the maximum likelihood point of the joint distribution p(d,m) = p_A(d) p_A(m) p_T(d,m), located at the pair (d^pre, m^est))
33
(figure: the maximum likelihood point m^est of the marginal distribution p(m) = ∫ p(d,m) dd)
34
Special case of an exact theory: in the limit C_G → 0,

exp{ -½ (d − Gm)^T C_G^{-1} (d − Gm) } → δ(d − Gm)

where δ is the Dirac delta function, with the property ∫ f(x) δ(x − y) dx = f(y). Then

p(m) = ∫ p(d,m) dd = p_A(m) ∫ p_A(d) δ(d − Gm) dd = p_A(m) p_A(d = Gm)

so for normal distributions

p(m) ∝ exp{ -½ [ (Gm − d^obs)^T C_d^{-1} (Gm − d^obs) + (m − m_A)^T C_m^{-1} (m − m_A) ] }
35
Special case of an exact theory: maximizing p(m) is equivalent to minimizing

(Gm − d^obs)^T C_d^{-1} (Gm − d^obs) + (m − m_A)^T C_m^{-1} (m − m_A)

that is, the weighted "prediction error" plus the weighted "distance of the model from its prior value".
36
The solution, calculated via the usual messy minimization process:

m^est = m_A + M [ d^obs − G m_A ]   where   M = [ G^T C_d^{-1} G + C_m^{-1} ]^{-1} G^T C_d^{-1}

Don't memorize, but be prepared to use it (e.g. in the homework).
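A minimal MATLAB sketch of this formula, applied to the earlier one-parameter example (d^obs = 0.8 ± 0.4, prior m_A = 1.0 ± 1.25), with the theory d = m treated here as exact:

G    = 1;                                 % theory d = m, so G is just 1
dobs = 0.8;   Cd = 0.4^2;                 % observed datum and its prior variance
mA   = 1.0;   Cm = 1.25^2;                % prior model value and its prior variance
Cdi  = inv(Cd);   Cmi = inv(Cm);
M    = (G'*Cdi*G + Cmi) \ (G'*Cdi);
mest = mA + M*(dobs - G*mA);
disp(mest);                               % about 0.82: between prior and data, closer to the
                                          % data because they are the more precise of the two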
37
An interesting interpretation:

m^est − m_A = M [ d^obs − G m_A ]

The left side is the estimated model minus its prior; the bracketed quantity is the observed data minus the prediction of the prior model; and M is the linear connection between the two.
38
Special case of no prior information, C_m → ∞:

M = [ G^T C_d^{-1} G + C_m^{-1} ]^{-1} G^T C_d^{-1} → [ G^T C_d^{-1} G ]^{-1} G^T C_d^{-1}

m^est = m_A + [ G^T C_d^{-1} G ]^{-1} G^T C_d^{-1} [ d^obs − G m_A ]
      = m_A + [ G^T C_d^{-1} G ]^{-1} G^T C_d^{-1} d^obs − [ G^T C_d^{-1} G ]^{-1} G^T C_d^{-1} G m_A
      = m_A + [ G^T C_d^{-1} G ]^{-1} G^T C_d^{-1} d^obs − m_A
      = [ G^T C_d^{-1} G ]^{-1} G^T C_d^{-1} d^obs

which recovers weighted least squares.
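A quick numerical check in MATLAB that a very broad prior (C_m^{-1} ≈ 0) reproduces weighted least squares; the kernel, data, and variances below are made up for the sketch:

G    = [1 0; 0 1; 1 1];                   % hypothetical kernel: 3 data, 2 parameters
dobs = [1.0; 2.1; 2.9];                   % hypothetical observations
Cdi  = eye(3) / 0.1^2;                    % inverse data covariance, sigma_d = 0.1
mA   = [0; 0];
Cmi  = eye(2) * 1e-12;                    % nearly zero Cm^{-1}, i.e. Cm -> infinity
M    = (G'*Cdi*G + Cmi) \ (G'*Cdi);
mest = mA + M*(dobs - G*mA);
mwls = (G'*Cdi*G) \ (G'*Cdi*dobs);        % ordinary weighted least squares
disp([mest, mwls]);                       % the two columns are essentially identical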
39
Special case of infinitely accurate prior information, C_m → 0:

M = [ G^T C_d^{-1} G + C_m^{-1} ]^{-1} G^T C_d^{-1} → 0

m^est = m_A + 0 = m_A

which recovers the prior value of m.
40
Special uncorrelated case: C_m = σ_m^2 I and C_d = σ_d^2 I

M = [ G^T C_d^{-1} G + C_m^{-1} ]^{-1} G^T C_d^{-1}
  = [ σ_d^{-2} G^T G + σ_m^{-2} I ]^{-1} G^T σ_d^{-2}
  = [ G^T G + (σ_d/σ_m)^2 I ]^{-1} G^T

This formula is sometimes called "damped least squares", with "damping factor" ε = σ_d/σ_m.
41
Damped least squares makes the process of avoiding the singular matrices associated with insufficient data trivially easy: you just add ε^2 I to G^T G before computing the inverse.
42
G^T G → G^T G + ε^2 I

This process regularizes the matrix, so its inverse always exists. Its interpretation is: in the absence of relevant data, assume the model parameter has its prior value.
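A minimal damped least squares sketch in MATLAB, reusing the same-x data that made G^T G singular earlier; the observations and the damping factor are made up:

x    = [2; 2; 2; 2];                          % every datum at the same x, so G'G is singular
G    = [ones(size(x)), x];
dobs = [1.1; 0.9; 1.0; 1.2];                  % hypothetical observations
eps2 = 0.1^2;                                 % damping factor squared, (sigma_d/sigma_m)^2
mest = (G'*G + eps2*eye(2)) \ (G'*dobs);      % damped least squares (implicitly mA = 0)
disp(mest);                                   % finite answer even though G'G alone is singular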
43
Are you only fooling yourself? It depends … is the assumption that you know the prior value a good one?
44
Smoothness Constraints: e.g. the model is smooth when its second derivative is small,

d^2 m_i / dx^2 ∝ m_{i-1} − 2 m_i + m_{i+1}

(assuming the data are organized according to one spatial variable)
45
The matrix D approximates the second derivative: d^2 m/dx^2 ∝ D m

D = [ 1  -2   1   0   0   0   …
      0   1  -2   1   0   0   …
                  …
      …   0   0   0   1  -2   1 ]
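One way to build such a D in MATLAB; the number of model parameters below is chosen just for illustration:

nm = 6;                                 % hypothetical number of model parameters
D  = zeros(nm-2, nm);
for k = 1:nm-2
    D(k, k:k+2) = [1 -2 1];             % row approximating d2m/dx2 at interior point k+1
end
disp(D);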
46
Choosing a smooth solution is thus equivalent to minimizing

(Dm)^T (Dm) = m^T (D^T D) m

Comparing this to the (m − m_A)^T C_m^{-1} (m − m_A) minimization implied by the general solution

m^est = m_A + M [ d^obs − G m_A ]   where   M = [ G^T C_d^{-1} G + C_m^{-1} ]^{-1} G^T C_d^{-1}

indicates that, to implement smoothness, we should make the choices

m_A = 0   and   C_m^{-1} = D^T D
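A minimal sketch of implementing these choices in MATLAB; the kernel, data, and σ_d below are hypothetical, with the 3rd model parameter left unobserved so that the smoothness prior has to fill the gap:

nm   = 6;                                    % number of model parameters
D    = zeros(nm-2, nm);                      % second-difference matrix, as above
for k = 1:nm-2
    D(k, k:k+2) = [1 -2 1];
end
G    = eye(nm);  G(3,:) = [];                % every parameter observed directly except the 3rd
dobs = [1.0; 1.2; 1.6; 1.8; 2.0];            % hypothetical observations
Cdi  = eye(5) / 0.1^2;                       % inverse data covariance, sigma_d = 0.1
mest = (G'*Cdi*G + D'*D) \ (G'*Cdi*dobs);    % general solution with mA = 0, Cm^{-1} = D'D
disp(mest);                                  % the unobserved 3rd parameter is filled in smoothly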