Lecture 5: Probability and Statistics
Please read Doug Martinson's Chapter 3, "Statistics", available on Courseworks.
Abstraction: a vector of N random variables, $\mathbf{x}$, with joint probability density $p(\mathbf{x})$, expectation $\bar{\mathbf{x}}$, and covariance $C_x$. [Figure: a cloud of probability in the $(x_1, x_2)$ plane.] Shown as 2D here, but actually N-dimensional.
The multivariate normal distribution
$$p(\mathbf{x}) = (2\pi)^{-N/2}\,|C_x|^{-1/2}\,\exp\{-\tfrac{1}{2}(\mathbf{x}-\bar{\mathbf{x}})^T C_x^{-1}(\mathbf{x}-\bar{\mathbf{x}})\}$$
has expectation $\bar{\mathbf{x}}$ and covariance $C_x$, and is normalized to unit area.
Special (uncorrelated) case: $C_x = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_N^2)$. Note that $|C_x| = \sigma_1^2\,\sigma_2^2 \cdots \sigma_N^2$ and $(\mathbf{x}-\bar{\mathbf{x}})^T C_x^{-1}(\mathbf{x}-\bar{\mathbf{x}}) = \sum_i (x_i-\bar{x}_i)^2/\sigma_i^2$, so
$$p(\mathbf{x}) = \prod_i (2\pi)^{-1/2}\,\sigma_i^{-1}\,\exp\{-(x_i-\bar{x}_i)^2 / 2\sigma_i^2\}$$
which is the product of N individual one-variable normal distributions.
How would you show that this distribution,
$$p(\mathbf{x}) = (2\pi)^{-N/2}\,|C_x|^{-1/2}\,\exp\{-\tfrac{1}{2}(\mathbf{x}-\bar{\mathbf{x}})^T C_x^{-1}(\mathbf{x}-\bar{\mathbf{x}})\},$$
really has expectation $\bar{\mathbf{x}}$ and covariance $C_x$?
How would you prove this? Do you remember how to transform an integral from $\mathbf{x}$ to $\mathbf{y}$?
$$\int \cdots \int p(\mathbf{x})\,d^N x = \int \cdots \int \; ? \;\, d^N y$$
Given $\mathbf{y}(\mathbf{x})$, then
$$\int \cdots \int p(\mathbf{x})\,d^N x = \int \cdots \int \underbrace{p[\mathbf{x}(\mathbf{y})]\,|d\mathbf{x}/d\mathbf{y}|}_{p(\mathbf{y})}\,d^N y$$
where $|d\mathbf{x}/d\mathbf{y}|$ is the Jacobian determinant, that is, the determinant of the matrix $J_{ij}$ whose elements are $\partial x_i/\partial y_j$.
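Here's a quick numerical sanity check of this rule in one dimension (a sketch, not lecture code; the example transform $y = e^x$ is an assumption chosen for convenience):

```python
# Numerical check of the change-of-variables rule in 1D (a sketch).
# Take p(x) standard normal and y = exp(x), so x(y) = log(y) and |dx/dy| = 1/y.
import numpy as np

def p_x(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def p_y(y):
    return p_x(np.log(y)) / y   # p[x(y)] * |dx/dy|

y = np.linspace(1e-6, 100.0, 1_000_001)
print(np.trapz(p_y(y), y))      # approximately 1.0: the area is preserved
```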
Here's how you prove the expectation. Insert $p(\mathbf{x})$ into the usual formula for the expectation:
$$E(\mathbf{x}) = (2\pi)^{-N/2}\,|C_x|^{-1/2} \int \cdots \int \mathbf{x}\,\exp\{-\tfrac{1}{2}(\mathbf{x}-\bar{\mathbf{x}})^T C_x^{-1}(\mathbf{x}-\bar{\mathbf{x}})\}\,d^N x$$
Now use the transformation $\mathbf{y} = C_x^{-1/2}(\mathbf{x}-\bar{\mathbf{x}})$, noting that the Jacobian determinant is $|C_x|^{1/2}$:
$$E(\mathbf{x}) = (2\pi)^{-N/2} \int \cdots \int (\bar{\mathbf{x}} + C_x^{1/2}\mathbf{y})\,\exp\{-\tfrac{1}{2}\mathbf{y}^T\mathbf{y}\}\,d^N y = \bar{\mathbf{x}} \int \cdots \int (2\pi)^{-N/2}\exp\{-\tfrac{1}{2}\mathbf{y}^T\mathbf{y}\}\,d^N y + (2\pi)^{-N/2}\,C_x^{1/2} \int \cdots \int \mathbf{y}\,\exp\{-\tfrac{1}{2}\mathbf{y}^T\mathbf{y}\}\,d^N y$$
The first integral is the area under an N-dimensional Gaussian, which is just unity. The second integral contains an odd function of $\mathbf{y}$ times an even function, and so is zero. Thus $E(\mathbf{x}) = \bar{\mathbf{x}}$.
I've never tried to prove the covariance… but how much harder could it be?
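Not hard at all, numerically: a Monte Carlo sketch (with example values of $\bar{\mathbf{x}}$ and $C_x$ assumed for illustration) confirms both the expectation and the covariance:

```python
# Monte Carlo check that the multivariate normal really has mean xbar and
# covariance Cx (example values assumed for illustration).
import numpy as np

rng = np.random.default_rng(0)
xbar = np.array([1.0, 2.0])                  # assumed mean
Cx = np.array([[2.0, 0.8],
               [0.8, 1.0]])                  # assumed covariance (symmetric, positive definite)

x = rng.multivariate_normal(xbar, Cx, size=1_000_000)
print(x.mean(axis=0))                        # close to xbar
print(np.cov(x, rowvar=False))               # close to Cx
```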
Examples
[Figures: five examples of two-dimensional normal distributions $p(x,y)$, each for a different mean $\bar{\mathbf{x}}$ and covariance matrix $C_x$.]
Remember this from last lecture? $p(x_1) = \int p(x_1,x_2)\,dx_2$ is the distribution of $x_1$ (irrespective of $x_2$), and $p(x_2) = \int p(x_1,x_2)\,dx_1$ is the distribution of $x_2$ (irrespective of $x_1$). [Figure: $p(x_1,x_2)$ with its two marginal distributions.]
[Figures: $p(x,y)$ with its marginals.] $p(y) = \int p(x,y)\,dx$ and $p(x) = \int p(x,y)\,dy$.
Remember $p(x,y) = p(x|y)\,p(y) = p(y|x)\,p(x)$ from the last lecture? We can compute $p(x|y)$ and $p(y|x)$ as follows: $p(x|y) = p(x,y)/p(y)$ and $p(y|x) = p(x,y)/p(x)$.
[Figures: the joint $p(x,y)$ and the two conditionals $p(x|y)$ and $p(y|x)$.]
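A small numerical sketch of these formulas on a gridded joint density (the grid and the example Gaussian joint are assumptions for illustration):

```python
# Marginals and conditionals from a gridded joint density p(x, y) -- a sketch.
import numpy as np

x = np.linspace(-4, 4, 401)
y = np.linspace(-4, 4, 401)
dx = x[1] - x[0]
dy = y[1] - y[0]
X, Y = np.meshgrid(x, y, indexing="ij")

# example correlated Gaussian joint (an assumption; any joint density works)
pxy = np.exp(-(X**2 - 1.2 * X * Y + Y**2))
pxy /= pxy.sum() * dx * dy                   # normalize to unit area

px = pxy.sum(axis=1) * dy                    # p(x) = integral of p(x,y) dy
py = pxy.sum(axis=0) * dx                    # p(y) = integral of p(x,y) dx

p_x_given_y = pxy / py[np.newaxis, :]        # p(x|y) = p(x,y) / p(y)
p_y_given_x = pxy / px[:, np.newaxis]        # p(y|x) = p(x,y) / p(x)

# each conditional is itself a normalized density in its own variable
print(p_x_given_y[:, 200].sum() * dx)        # approximately 1.0
print(p_y_given_x[200, :].sum() * dy)        # approximately 1.0
```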
Any linear function of a normal distribution is a normal distribution. If
$$p(\mathbf{x}) = (2\pi)^{-N/2}\,|C_x|^{-1/2}\,\exp\{-\tfrac{1}{2}(\mathbf{x}-\bar{\mathbf{x}})^T C_x^{-1}(\mathbf{x}-\bar{\mathbf{x}})\}$$
and $\mathbf{y} = M\mathbf{x}$, then
$$p(\mathbf{y}) = (2\pi)^{-N/2}\,|C_y|^{-1/2}\,\exp\{-\tfrac{1}{2}(\mathbf{y}-\bar{\mathbf{y}})^T C_y^{-1}(\mathbf{y}-\bar{\mathbf{y}})\}$$
with $\bar{\mathbf{y}} = M\bar{\mathbf{x}}$ and $C_y = M C_x M^T$.
The proof needs the rules $[AB]^{-1} = B^{-1}A^{-1}$, $|AB| = |A|\,|B|$, and $|A^{-1}| = |A|^{-1}$. Start from
$$p(\mathbf{x}) = (2\pi)^{-N/2}\,|C_x|^{-1/2}\,\exp\{-\tfrac{1}{2}(\mathbf{x}-\bar{\mathbf{x}})^T C_x^{-1}(\mathbf{x}-\bar{\mathbf{x}})\}$$
The transformation is $p(\mathbf{y}) = p[\mathbf{x}(\mathbf{y})]\,|d\mathbf{x}/d\mathbf{y}|$. Substitute in $\mathbf{x} = M^{-1}\mathbf{y}$ and the Jacobian determinant $|d\mathbf{x}/d\mathbf{y}| = |M^{-1}|$, inserting two identity matrices $M^T M^{-T} = I$ and $M^{-1}M = I$ into the exponent:
$$p[\mathbf{x}(\mathbf{y})]\,|d\mathbf{x}/d\mathbf{y}| = (2\pi)^{-N/2}\,|C_x|^{-1/2}\,|M^{-1}|\,\exp\{-\tfrac{1}{2}(\mathbf{x}-\bar{\mathbf{x}})^T M^T M^{-T} C_x^{-1} M^{-1} M\,(\mathbf{x}-\bar{\mathbf{x}})\}$$
Then, writing $|M^{-1}| = |M|^{-1/2}\,|M^T|^{-1/2}$ (since $|M^T| = |M|$) and regrouping,
$$= (2\pi)^{-N/2}\,\underbrace{|M|^{-1/2}\,|C_x|^{-1/2}\,|M^T|^{-1/2}}_{|C_y|^{-1/2}}\,\exp\{-\tfrac{1}{2}\,[M(\mathbf{x}-\bar{\mathbf{x}})]^T\,[M C_x M^T]^{-1}\,[M(\mathbf{x}-\bar{\mathbf{x}})]\}$$
Since $\mathbf{y}-\bar{\mathbf{y}} = M(\mathbf{x}-\bar{\mathbf{x}})$ and $C_y = M C_x M^T$, this is
$$p(\mathbf{y}) = (2\pi)^{-N/2}\,|C_y|^{-1/2}\,\exp\{-\tfrac{1}{2}(\mathbf{y}-\bar{\mathbf{y}})^T C_y^{-1}(\mathbf{y}-\bar{\mathbf{y}})\}$$
Note that these rules work for the multivariate normal distribution: if $\mathbf{y}$ is linearly related to $\mathbf{x}$, $\mathbf{y} = M\mathbf{x}$, then $\bar{\mathbf{y}} = M\bar{\mathbf{x}}$ (rule for means) and $C_y = M C_x M^T$ (rule for propagating error).
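Both rules are easy to check by simulation; here's a minimal sketch, assuming example values of $M$, $\bar{\mathbf{x}}$, and $C_x$:

```python
# Check the rules ybar = M xbar and Cy = M Cx M^T by direct simulation (sketch).
import numpy as np

rng = np.random.default_rng(1)
xbar = np.array([0.5, -1.0])
Cx = np.array([[1.0, 0.3],
               [0.3, 0.5]])
M = np.array([[2.0, 1.0],
              [0.0, 3.0]])                    # assumed linear map y = M x

x = rng.multivariate_normal(xbar, Cx, size=1_000_000)
y = x @ M.T

print(y.mean(axis=0), M @ xbar)               # agree: rule for means
print(np.cov(y, rowvar=False))                # agrees with M Cx M^T below
print(M @ Cx @ M.T)                           # rule for propagating error
```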
Do you remember this from a previous lecture? If $\mathbf{d} = G\mathbf{m}$, then the standard least-squares solution is $\mathbf{m}^{est} = [G^T G]^{-1} G^T \mathbf{d}$.
Let's suppose the data, $\mathbf{d}$, are uncorrelated and that they all have the same variance, $C_d = \sigma_d^2 I$. To compute the variance of $\mathbf{m}^{est}$, note that $\mathbf{m}^{est} = [G^TG]^{-1}G^T\mathbf{d}$ is a linear rule of the form $\mathbf{m} = M\mathbf{d}$, with $M = [G^TG]^{-1}G^T$, so we can apply the rule $C_m = M C_d M^T$.
With $M = [G^TG]^{-1}G^T$:
$$C_m = M C_d M^T = \{[G^TG]^{-1}G^T\}\,\sigma_d^2 I\,\{[G^TG]^{-1}G^T\}^T = \sigma_d^2\,[G^TG]^{-1}G^T G\,[G^TG]^{-1T} = \sigma_d^2\,[G^TG]^{-1T} = \sigma_d^2\,[G^TG]^{-1}$$
($G^TG$ is a symmetric matrix, so its inverse is symmetric, too.) Memorize!
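A sketch that checks this formula against repeated synthetic experiments (the straight-line $G$, noise level, and true model are assumptions for illustration):

```python
# Verify Cm = sigma_d^2 * inv(G^T G) by repeating a synthetic experiment (sketch).
import numpy as np

rng = np.random.default_rng(2)
sigma_d = 0.5
x = np.linspace(0.0, 10.0, 50)
G = np.column_stack([np.ones_like(x), x])     # assumed straight-line G
m_true = np.array([1.0, 2.0])

# analytic covariance of the least-squares estimate
Cm = sigma_d**2 * np.linalg.inv(G.T @ G)

# empirical covariance over many repeated experiments
ests = []
for _ in range(20_000):
    d = G @ m_true + sigma_d * rng.standard_normal(len(x))
    ests.append(np.linalg.solve(G.T @ G, G.T @ d))   # m_est = [G^T G]^-1 G^T d
ests = np.array(ests)

print(Cm)
print(np.cov(ests, rowvar=False))             # close to the analytic Cm
```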
Example: all the data are assumed to have the same true value, $m_1$, and each is measured with the same variance, $\sigma_d^2$:
$$\begin{bmatrix} d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_N \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} [\,m_1\,]$$
Here $G$ is the column of ones, so $G^TG = N$, $[G^TG]^{-1} = N^{-1}$, and $G^T\mathbf{d} = \sum_i d_i$. Thus $m^{est} = [G^TG]^{-1}G^T\mathbf{d} = (\sum_i d_i)/N$ and $C_m = \sigma_d^2/N$.
$m_1^{est} = (\sum_i d_i)/N$ … the traditional formula for the mean! The estimated mean has variance $C_m = \sigma_d^2/N = \sigma_m^2$; note then that $\sigma_m = \sigma_d/\sqrt{N}$. The estimated mean is a normally-distributed random variable, and the width of this distribution, $\sigma_m$, decreases with the square root of the number of measurements.
Accuracy grows only slowly with N. [Figure: distributions of the estimated mean for N = 1, 10, 100, 1000.]
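A minimal sketch reproducing this behavior, assuming example values for the true mean and $\sigma_d$:

```python
# The sample mean has standard deviation sigma_d / sqrt(N) -- a sketch.
import numpy as np

rng = np.random.default_rng(3)
sigma_d = 2.0
m_true = 5.0

for N in (1, 10, 100, 1000):
    means = rng.normal(m_true, sigma_d, size=(50_000, N)).mean(axis=1)
    print(N, means.std(), sigma_d / np.sqrt(N))   # empirical vs predicted sigma_m
```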
Another example: fitting a straight line, with all the data assumed to have the same variance, $\sigma_d^2$:
$$\begin{bmatrix} d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_N \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix} \begin{bmatrix} m_1 \\ m_2 \end{bmatrix}$$
with
$$G^TG = \begin{bmatrix} N & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{bmatrix} \qquad C_m = \sigma_d^2\,[G^TG]^{-1} = \frac{\sigma_d^2}{N\sum_i x_i^2 - [\sum_i x_i]^2}\begin{bmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & N \end{bmatrix}$$
From
$$C_m = \sigma_d^2\,[G^TG]^{-1} = \frac{\sigma_d^2}{N\sum_i x_i^2 - [\sum_i x_i]^2}\begin{bmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & N \end{bmatrix}$$
the variances are
$$\sigma_{intercept}^2 = \frac{\sigma_d^2\,\sum_i x_i^2}{N\sum_i x_i^2 - [\sum_i x_i]^2} \qquad \sigma_{slope}^2 = \frac{N\,\sigma_d^2}{N\sum_i x_i^2 - [\sum_i x_i]^2}$$
where $\sigma_{intercept}$ is the standard error of the intercept. The 95% confidence intervals are: intercept $m_1^{est} \pm 2\sigma_{intercept}$; slope $m_2^{est} \pm 2\sigma_{slope}$.
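Putting the pieces together, a sketch of a straight-line fit that reports these 95% confidence intervals (the synthetic data and prior $\sigma_d$ are assumptions):

```python
# Straight-line fit with 95% confidence intervals on intercept and slope (sketch).
import numpy as np

rng = np.random.default_rng(4)
sigma_d = 0.5
x = np.linspace(0.0, 10.0, 30)
d = 1.0 + 2.0 * x + sigma_d * rng.standard_normal(len(x))   # synthetic data

G = np.column_stack([np.ones_like(x), x])
m_est = np.linalg.solve(G.T @ G, G.T @ d)
Cm = sigma_d**2 * np.linalg.inv(G.T @ G)

sig_intercept, sig_slope = np.sqrt(np.diag(Cm))
print(f"intercept: {m_est[0]:.3f} +/- {2 * sig_intercept:.3f} (95%)")
print(f"slope:     {m_est[1]:.3f} +/- {2 * sig_slope:.3f} (95%)")
```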
Beware! These 95% confidence intervals (intercept $m_1 \pm 2\sigma_{intercept}$, slope $m_2 \pm 2\sigma_{slope}$) are probabilities of $m_1$ irrespective of the value of $m_2$, and of $m_2$ irrespective of the value of $m_1$, not the joint probability of $m_1$ and $m_2$ taken together.
[Figures: contours of $p(m_1,m_2)$ with the boxes $m_1^{est} \pm 2\sigma_1$ and $m_2^{est} \pm 2\sigma_2$.] The probability that $m_1$ is in its box is 95%; the probability that $m_2$ is in its box is 95%; but the probability that both $m_1$ and $m_2$ are in the joint box is less than 95%.
Intercept and slope are uncorrelated only when $\sum_i x_i = 0$, that is, when the mean of the $x$'s is zero, which occurs when the data straddle the origin (remember this discussion from a few lectures ago?):
$$C_m = \sigma_d^2\,[G^TG]^{-1} = \frac{\sigma_d^2}{N\sum_i x_i^2 - [\sum_i x_i]^2}\begin{bmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & N \end{bmatrix}$$
What $\sigma_d^2$ do you use in these formulas?
Prior estimates of $\sigma_d$ are based on knowledge of the limits of your measuring technique … my ruler has only mm tics, so I'm going to assume that $\sigma_d = 0.5$ mm; the manufacturer claims that the instrument is accurate to 0.1%, so since my typical measurement is 25, I'll assume $\sigma_d = 0.025$.
The posterior estimate of the error is based on the error measured with respect to the best fit:
$$\sigma_d^2 = \frac{1}{N}\sum_i (d_i^{obs} - d_i^{pre})^2 = \frac{1}{N}\sum_i e_i^2$$
Dangerous … because it assumes that the model ("a straight line") accurately represents the behavior of the data. Maybe the data really followed an exponential curve.
One refinement to the formula
$$\sigma_d^2 = \frac{1}{N}\sum_i (d_i^{obs} - d_i^{pre})^2$$
has to do with the appearance of N, the number of data. If there were only two data, then the best-fitting straight line would have no error at all. If there were only three data, then the best-fitting straight line would likely have just a little error. [Figures: straight-line fits through two points and through three points.]
Therefore the formula
$$\sigma_d^2 = \frac{1}{N}\sum_i (d_i^{obs} - d_i^{pre})^2$$
very likely underestimates the error. An improved formula would replace N with $N-2$:
$$\sigma_d^2 = \frac{1}{N-2}\sum_i (d_i^{obs} - d_i^{pre})^2$$
where the "2" is chosen because two points exactly define a straight line.
More generally, if there are M model parameters, then the formula would be
$$\sigma_d^2 = \frac{1}{N-M}\sum_i (d_i^{obs} - d_i^{pre})^2$$
The quantity $N-M$ is often called the number of degrees of freedom.
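A sketch of the posterior error estimate with the degrees-of-freedom correction, on synthetic straight-line data (so $M = 2$; the data are assumed for illustration):

```python
# Posterior estimate of sigma_d^2 using N - M degrees of freedom (sketch).
import numpy as np

rng = np.random.default_rng(5)
sigma_true = 0.5
x = np.linspace(0.0, 10.0, 30)
d_obs = 1.0 + 2.0 * x + sigma_true * rng.standard_normal(len(x))

G = np.column_stack([np.ones_like(x), x])    # M = 2 model parameters
N, M = G.shape
m_est = np.linalg.solve(G.T @ G, G.T @ d_obs)
e = d_obs - G @ m_est                        # residuals d_obs - d_pre

var_naive = (e @ e) / N                      # tends to underestimate sigma_d^2
var_post = (e @ e) / (N - M)                 # improved: divide by degrees of freedom
print(var_naive, var_post, sigma_true**2)
```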