Lecture 5 Probability and Statistics

Please read Doug Martinson's Chapter 3, 'Statistics', available on Courseworks.

Abstraction: a vector of N random variables, $\mathbf{x}$, with joint probability density $p(\mathbf{x})$, expectation $\bar{\mathbf{x}}$, and covariance $C_x$. [Figure: shown in 2-D $(x_1, x_2)$ here, but actually N-dimensional.]

The multivariate normal distribution
$$p(\mathbf{x}) = (2\pi)^{-N/2}\, |C_x|^{-1/2} \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x}-\bar{\mathbf{x}})^T C_x^{-1} (\mathbf{x}-\bar{\mathbf{x}}) \right\}$$
has expectation $\bar{\mathbf{x}}$ and covariance $C_x$, and is normalized to unit area.
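A minimal numerical sketch of this formula (Python/numpy; the mean, covariance, and evaluation point are arbitrary test values, and mvn_pdf is just an illustrative name):

```python
import numpy as np

def mvn_pdf(x, xbar, Cx):
    """Evaluate the multivariate normal density p(x) from the formula above."""
    x, xbar = np.asarray(x, float), np.asarray(xbar, float)
    N = len(xbar)
    r = x - xbar
    norm = (2 * np.pi) ** (-N / 2) * np.linalg.det(Cx) ** -0.5
    return norm * np.exp(-0.5 * r @ np.linalg.solve(Cx, r))

# Example: a 2-D normal with correlated components
xbar = np.array([1.0, 2.0])
Cx = np.array([[2.0, 0.8],
               [0.8, 1.0]])
print(mvn_pdf([1.5, 2.5], xbar, Cx))
```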

Special case: uncorrelated data, for which $C_x$ is diagonal,
$$C_x = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_N^2 \end{pmatrix}$$
Note that $|C_x| = \sigma_1^2 \sigma_2^2 \cdots \sigma_N^2$ and $(\mathbf{x}-\bar{\mathbf{x}})^T C_x^{-1} (\mathbf{x}-\bar{\mathbf{x}}) = \sum_i (x_i-\bar{x}_i)^2/\sigma_i^2$, so
$$p(\mathbf{x}) = \prod_i (2\pi)^{-1/2}\, \sigma_i^{-1} \exp\!\left\{ -(x_i-\bar{x}_i)^2 / 2\sigma_i^2 \right\}$$
which is the product of N individual one-variable normal distributions.
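A quick check of this factorization, again with arbitrary test values:

```python
import numpy as np

# Diagonal (uncorrelated) covariance: the N-dimensional density should
# equal the product of one-variable normal densities.
sig = np.array([0.5, 1.0, 2.0])                 # sigma_1, sigma_2, sigma_3
Cx = np.diag(sig ** 2)
xbar = np.array([0.0, 1.0, -1.0])
x = np.array([0.3, 0.7, 1.5])
r = x - xbar
N = len(x)

joint = ((2 * np.pi) ** (-N / 2) * np.linalg.det(Cx) ** -0.5
         * np.exp(-0.5 * r @ np.linalg.solve(Cx, r)))
product = np.prod((2 * np.pi) ** -0.5 / sig * np.exp(-r ** 2 / (2 * sig ** 2)))
print(joint, product)                            # the two agree
```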

How would you show that this distribution,
$$p(\mathbf{x}) = (2\pi)^{-N/2}\, |C_x|^{-1/2} \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x}-\bar{\mathbf{x}})^T C_x^{-1} (\mathbf{x}-\bar{\mathbf{x}}) \right\},$$
really has expectation $\bar{\mathbf{x}}$ and covariance $C_x$?

How would you prove this? Do you remember how to transform an integral from $\mathbf{x}$ to $\mathbf{y}$?
$$\int\!\cdots\!\int p(\mathbf{x})\, d^N x = \int\!\cdots\!\int \;?\;\, d^N y$$

Given $\mathbf{y}(\mathbf{x})$, then
$$\int\!\cdots\!\int p(\mathbf{x})\, d^N x = \int\!\cdots\!\int p[\mathbf{x}(\mathbf{y})]\, \left|\frac{d\mathbf{x}}{d\mathbf{y}}\right| d^N y = \int\!\cdots\!\int p(\mathbf{y})\, d^N y$$
where $|d\mathbf{x}/d\mathbf{y}|$ is the Jacobian determinant, that is, the determinant of the matrix $J_{ij}$ whose elements are $dx_i/dy_j$.
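A one-dimensional numerical illustration of the rule, using the (arbitrary) substitution x = y^3 so that |dx/dy| = 3y^2:

```python
import numpy as np

# Integrate a standard normal density two ways: directly over x, and after
# the change of variables x = y**3 with Jacobian |dx/dy| = 3*y**2.
p = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

x = np.linspace(-8, 8, 200001)
direct = np.trapz(p(x), x)

y = np.linspace(-2, 2, 200001)
transformed = np.trapz(p(y ** 3) * 3 * y ** 2, y)

print(direct, transformed)   # both are ~1
```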

Here's how you prove the expectation. Insert $p(\mathbf{x})$ into the usual formula for expectation:
$$E(\mathbf{x}) = (2\pi)^{-N/2}\, |C_x|^{-1/2} \int\!\cdots\!\int \mathbf{x}\, \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x}-\bar{\mathbf{x}})^T C_x^{-1} (\mathbf{x}-\bar{\mathbf{x}}) \right\} d^N x$$
Now use the transformation $\mathbf{y} = C_x^{-1/2}(\mathbf{x}-\bar{\mathbf{x}})$, noting that the Jacobian determinant is $|C_x|^{1/2}$:
$$E(\mathbf{x}) = (2\pi)^{-N/2} \int\!\cdots\!\int (\bar{\mathbf{x}} + C_x^{1/2}\mathbf{y})\, \exp\!\left\{ -\tfrac{1}{2} \mathbf{y}^T\mathbf{y} \right\} d^N y = \bar{\mathbf{x}} \int\!\cdots\!\int (2\pi)^{-N/2} \exp\!\left\{ -\tfrac{1}{2} \mathbf{y}^T\mathbf{y} \right\} d^N y + (2\pi)^{-N/2}\, C_x^{1/2} \int\!\cdots\!\int \mathbf{y}\, \exp\!\left\{ -\tfrac{1}{2} \mathbf{y}^T\mathbf{y} \right\} d^N y$$
The first integral is the area under an N-dimensional Gaussian, which is just unity. The second integral contains an odd function of $\mathbf{y}$ times an even function, and so is zero; thus $E(\mathbf{x}) = \bar{\mathbf{x}} \cdot 1 = \bar{\mathbf{x}}$.

I've never tried to prove the covariance ... but how much harder could it be?
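Short of a proof, a Monte Carlo check is easy; the sample size, mean, and covariance below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
xbar = np.array([1.0, 2.0])
Cx = np.array([[2.0, 0.8],
               [0.8, 1.0]])

# Draw many samples from the multivariate normal and compare the sample
# mean and sample covariance with xbar and Cx.
X = rng.multivariate_normal(xbar, Cx, size=100000)
print(X.mean(axis=0))           # ~ xbar
print(np.cov(X, rowvar=False))  # ~ Cx
```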

examples

[Five slides of figures: plots of the bivariate normal $p(x_1,x_2)$ for a fixed mean ($\bar{x} = 2$) and several different choices of the covariance matrix $C_x$.]

Remember this from the last lecture? The marginal distributions of a joint $p(x_1,x_2)$:
$$p(x_1) = \int p(x_1,x_2)\, dx_2 \quad \text{the distribution of } x_1 \text{ (irrespective of } x_2\text{)}$$
$$p(x_2) = \int p(x_1,x_2)\, dx_1 \quad \text{the distribution of } x_2 \text{ (irrespective of } x_1\text{)}$$
[Figure: the joint $p(x_1,x_2)$ and its two marginals.]

[Figure: the joint $p(x,y)$ and its marginal $p(y) = \int p(x,y)\, dx$.]

[Figure: the joint $p(x,y)$ and its marginal $p(x) = \int p(x,y)\, dy$.]

Remember $p(x,y) = p(x|y)\,p(y) = p(y|x)\,p(x)$ from the last lecture? We can compute $p(x|y)$ and $p(y|x)$ as follows:
$$p(x|y) = p(x,y)/p(y) \qquad p(y|x) = p(x,y)/p(x)$$

[Figures: the joint $p(x,y)$ and the two conditionals $p(x|y)$ and $p(y|x)$.]
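A gridded sketch of the marginal and conditional relations; the correlated bivariate normal, grid, and parameter values are arbitrary test choices:

```python
import numpy as np

# Tabulate a correlated bivariate normal p(x,y) on a grid, then form the
# marginals by summation (integration) and the conditionals by division.
x = np.linspace(-5, 5, 401)
y = np.linspace(-5, 5, 401)
dx = x[1] - x[0]
dy = y[1] - y[0]
X, Y = np.meshgrid(x, y, indexing="ij")

rho, sx, sy = 0.6, 1.0, 1.5
pxy = (np.exp(-((X / sx) ** 2 - 2 * rho * (X / sx) * (Y / sy) + (Y / sy) ** 2)
              / (2 * (1 - rho ** 2)))
       / (2 * np.pi * sx * sy * np.sqrt(1 - rho ** 2)))

px = pxy.sum(axis=1) * dy               # p(x) = integral of p(x,y) dy
py = pxy.sum(axis=0) * dx               # p(y) = integral of p(x,y) dx
p_x_given_y = pxy / py[np.newaxis, :]   # p(x|y) = p(x,y) / p(y)
p_y_given_x = pxy / px[:, np.newaxis]   # p(y|x) = p(x,y) / p(x)

print(np.trapz(px, x), np.trapz(p_x_given_y[:, 200], x))  # both ~1
```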

Any linear function of a normal distribution is a normal distribution. If
$$p(\mathbf{x}) = (2\pi)^{-N/2}\, |C_x|^{-1/2} \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x}-\bar{\mathbf{x}})^T C_x^{-1} (\mathbf{x}-\bar{\mathbf{x}}) \right\}$$
and $\mathbf{y} = M\mathbf{x}$, then
$$p(\mathbf{y}) = (2\pi)^{-N/2}\, |C_y|^{-1/2} \exp\!\left\{ -\tfrac{1}{2} (\mathbf{y}-\bar{\mathbf{y}})^T C_y^{-1} (\mathbf{y}-\bar{\mathbf{y}}) \right\}$$
with $\bar{\mathbf{y}} = M\bar{\mathbf{x}}$ and $C_y = M C_x M^T$.

The proof needs the rules $[AB]^{-1} = B^{-1}A^{-1}$, $|AB| = |A||B|$, and $|A^{-1}| = |A|^{-1}$. Start from
$$p(\mathbf{x}) = (2\pi)^{-N/2}\, |C_x|^{-1/2} \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x}-\bar{\mathbf{x}})^T C_x^{-1} (\mathbf{x}-\bar{\mathbf{x}}) \right\}$$
The transformation is $p(\mathbf{y}) = p[\mathbf{x}(\mathbf{y})]\,|d\mathbf{x}/d\mathbf{y}|$. Substitute $\mathbf{x} = M^{-1}\mathbf{y}$ and the Jacobian determinant $|d\mathbf{x}/d\mathbf{y}| = |M^{-1}|$:
$$p[\mathbf{x}(\mathbf{y})]\,|d\mathbf{x}/d\mathbf{y}| = (2\pi)^{-N/2}\, |C_x|^{-1/2}\, |M^{-1}|\, \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x}-\bar{\mathbf{x}})^T\, M^T M^{-T}\, C_x^{-1}\, M^{-1} M\, (\mathbf{x}-\bar{\mathbf{x}}) \right\}$$
where the inserted factors $M^T M^{-T}$ and $M^{-1} M$ are both the identity.

Grouping the determinants, $|M^{-1}|\,|C_x|^{-1/2} = |M|^{-1/2}\,|C_x|^{-1/2}\,|M^T|^{-1/2} = |M C_x M^T|^{-1/2} = |C_y|^{-1/2}$, and grouping the exponent,
$$(\mathbf{x}-\bar{\mathbf{x}})^T M^T\, [M C_x M^T]^{-1}\, M (\mathbf{x}-\bar{\mathbf{x}}) = [M(\mathbf{x}-\bar{\mathbf{x}})]^T\, C_y^{-1}\, [M(\mathbf{x}-\bar{\mathbf{x}})] = (\mathbf{y}-\bar{\mathbf{y}})^T C_y^{-1} (\mathbf{y}-\bar{\mathbf{y}})$$
so
$$p(\mathbf{y}) = (2\pi)^{-N/2}\, |C_y|^{-1/2} \exp\!\left\{ -\tfrac{1}{2} (\mathbf{y}-\bar{\mathbf{y}})^T C_y^{-1} (\mathbf{y}-\bar{\mathbf{y}}) \right\}$$

Note that these rules hold for the multivariate normal distribution when $\mathbf{y}$ is linearly related to $\mathbf{x}$, $\mathbf{y} = M\mathbf{x}$:
$$\bar{\mathbf{y}} = M\bar{\mathbf{x}} \quad \text{(rule for means)} \qquad C_y = M C_x M^T \quad \text{(rule for propagating error)}$$
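A sketch of the two rules, with an arbitrary M, mean, and covariance, checked against a sample transformation:

```python
import numpy as np

rng = np.random.default_rng(1)
xbar = np.array([1.0, 2.0, 3.0])
Cx = np.array([[1.0, 0.3, 0.0],
               [0.3, 2.0, 0.5],
               [0.0, 0.5, 1.5]])
M = np.array([[1.0, -1.0, 0.0],
              [0.5,  0.5, 1.0]])

# Rule for means and rule for propagating error
ybar = M @ xbar
Cy = M @ Cx @ M.T

# Check by transforming samples of x
X = rng.multivariate_normal(xbar, Cx, size=200000)
Y = X @ M.T
print(ybar, Y.mean(axis=0))
print(Cy)
print(np.cov(Y, rowvar=False))
```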

Do you remember this from a previous lecture? If $\mathbf{d} = G\mathbf{m}$, then the standard least-squares solution is
$$\mathbf{m}^{est} = [G^T G]^{-1} G^T \mathbf{d}$$

Let's suppose the data, $\mathbf{d}$, are uncorrelated and all have the same variance, $C_d = \sigma_d^2 I$. To compute the variance of $\mathbf{m}^{est}$, note that $\mathbf{m}^{est} = [G^T G]^{-1} G^T \mathbf{d}$ is a linear rule of the form $\mathbf{m} = M\mathbf{d}$, with $M = [G^T G]^{-1} G^T$, so we can apply the rule $C_m = M C_d M^T$.

With $M = [G^T G]^{-1} G^T$:
$$C_m = M C_d M^T = \{[G^T G]^{-1} G^T\}\, \sigma_d^2 I\, \{[G^T G]^{-1} G^T\}^T = \sigma_d^2\, [G^T G]^{-1} G^T G\, [G^T G]^{-T} = \sigma_d^2\, [G^T G]^{-T} = \sigma_d^2\, [G^T G]^{-1}$$
($G^T G$ is a symmetric matrix, so its inverse is symmetric, too.) Memorize!
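A sketch of these formulas on a synthetic linear problem (the G, true model, and sigma_d are made up):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic linear problem d = G m + noise, with uncorrelated data of
# equal variance sigma_d**2.
N, M_params = 50, 3
G = rng.normal(size=(N, M_params))
m_true = np.array([1.0, -2.0, 0.5])
sigma_d = 0.3
d = G @ m_true + sigma_d * rng.normal(size=N)

GTG_inv = np.linalg.inv(G.T @ G)
m_est = GTG_inv @ G.T @ d            # least-squares solution
C_m = sigma_d ** 2 * GTG_inv         # covariance of the estimates

print(m_est)
print(np.sqrt(np.diag(C_m)))         # standard errors of m_est
```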

Example: all the data are assumed to have the same true value, $m_1$, and each is measured with the same variance, $\sigma_d^2$:
$$\begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_N \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} m_1, \qquad G^T G = N \;\Rightarrow\; [G^T G]^{-1} = N^{-1}, \qquad G^T \mathbf{d} = \textstyle\sum_i d_i$$
$$\mathbf{m}^{est} = [G^T G]^{-1} G^T \mathbf{d} = \frac{\sum_i d_i}{N}, \qquad C_m = \sigma_d^2 / N$$

$m_1^{est} = (\sum_i d_i)/N$ ... the traditional formula for the mean! The estimated mean has variance $C_m = \sigma_d^2/N = \sigma_m^2$; note then that $\sigma_m = \sigma_d/\sqrt{N}$. The estimated mean is a normally-distributed random variable, and the width of this distribution, $\sigma_m$, decreases with the square root of the number of measurements.

Accuracy grows only slowly with N. [Figure: the distribution of the estimated mean for N = 1, 10, 100, 1000, narrowing as N increases.]
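A two-line illustration of this slow growth, taking sigma_d = 1 for convenience:

```python
import numpy as np

sigma_d = 1.0
for N in [1, 10, 100, 1000]:
    # sigma_m = sigma_d / sqrt(N): a factor of 10 more data only narrows
    # the distribution of the estimated mean by a factor of ~3.
    print(N, sigma_d / np.sqrt(N))
```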

Another example: fitting a straight line, with all the data assumed to have the same variance, $\sigma_d^2$:
$$\begin{pmatrix} d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_N \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_N \end{pmatrix} \begin{pmatrix} m_1 \\ m_2 \end{pmatrix}, \qquad G^T G = \begin{pmatrix} N & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix}$$
$$C_m = \sigma_d^2\, [G^T G]^{-1} = \frac{\sigma_d^2}{N\sum_i x_i^2 - \left[\sum_i x_i\right]^2} \begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & N \end{pmatrix}$$

From
$$C_m = \sigma_d^2\, [G^T G]^{-1} = \frac{\sigma_d^2}{N\sum_i x_i^2 - \left[\sum_i x_i\right]^2} \begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & N \end{pmatrix}$$
the standard errors of the intercept and slope are
$$\sigma_{intercept}^2 = \frac{\sigma_d^2 \sum_i x_i^2}{N\sum_i x_i^2 - \left[\sum_i x_i\right]^2}, \qquad \sigma_{slope}^2 = \frac{\sigma_d^2\, N}{N\sum_i x_i^2 - \left[\sum_i x_i\right]^2}$$
giving the 95% confidence intervals: intercept $m_1^{est} \pm 2\sigma_{intercept}$ and slope $m_2^{est} \pm 2\sigma_{slope}$.
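A sketch of the straight-line fit and its 2-sigma (~95%) intervals; the x values, true intercept and slope, and sigma_d are made up:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic straight-line data d = m1 + m2*x + noise
x = np.linspace(0.0, 10.0, 30)
m_true = np.array([2.0, 0.7])               # intercept, slope
sigma_d = 0.5
d = m_true[0] + m_true[1] * x + sigma_d * rng.normal(size=x.size)

G = np.column_stack([np.ones_like(x), x])
GTG_inv = np.linalg.inv(G.T @ G)
m_est = GTG_inv @ G.T @ d
C_m = sigma_d ** 2 * GTG_inv

sig_intercept, sig_slope = np.sqrt(np.diag(C_m))
print(f"intercept: {m_est[0]:.3f} +/- {2 * sig_intercept:.3f} (95%)")
print(f"slope:     {m_est[1]:.3f} +/- {2 * sig_slope:.3f} (95%)")
```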

Beware! The 95% confidence intervals, intercept $m_1 \pm 2\sigma_{intercept}$ and slope $m_2 \pm 2\sigma_{slope}$, are probabilities of $m_1$ irrespective of the value of $m_2$, and of $m_2$ irrespective of the value of $m_1$, not the joint probability of $m_1$ and $m_2$ taken together.

[Figure: contours of $p(m_1,m_2)$ with the band $m_2^{est} \pm 2\sigma_2$; the probability that $m_2$ is in this band is 95%.]

[Figure: the same $p(m_1,m_2)$ with the band $m_1^{est} \pm 2\sigma_1$; the probability that $m_1$ is in this band is 95%.]

[Figure: the same $p(m_1,m_2)$ with the box formed by both bands; the probability that both $m_1$ and $m_2$ are in this box is < 95%.]

Intercept and slope are uncorrelated only when $\sum_i x_i = 0$, that is, when the mean of the x's is zero, which occurs when the data straddle the origin. (Remember this discussion from a few lectures ago?)
$$C_m = \sigma_d^2\, [G^T G]^{-1} = \frac{\sigma_d^2}{N\sum_i x_i^2 - \left[\sum_i x_i\right]^2} \begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & N \end{pmatrix}$$
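A small check of this claim: recompute C_m after subtracting the mean of the x's (the same kind of synthetic x values as above, with sigma_d = 0.5):

```python
import numpy as np

sigma_d = 0.5
x = np.linspace(0.0, 10.0, 30)

def model_covariance(x, sigma_d):
    """C_m = sigma_d**2 [G^T G]^-1 for the straight-line G = [1, x]."""
    G = np.column_stack([np.ones_like(x), x])
    return sigma_d ** 2 * np.linalg.inv(G.T @ G)

print(model_covariance(x, sigma_d))             # off-diagonal nonzero
print(model_covariance(x - x.mean(), sigma_d))  # off-diagonal ~0: uncorrelated
```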

What $\sigma_d^2$ do you use in these formulas?

Prior estimates of $\sigma_d$ are based on knowledge of the limits of your measuring technique ... my ruler has only mm tick marks, so I'm going to assume that $\sigma_d = 0.5$ mm; the manufacturer claims that the instrument is accurate to 0.1%, so since my typical measurement is 25, I'll assume $\sigma_d = 0.025$.

Posterior estimate of the error: based on the error measured with respect to the best fit,
$$\sigma_d^2 = \frac{1}{N} \sum_i (d_i^{obs} - d_i^{pre})^2 = \frac{1}{N} \sum_i e_i^2$$

Dangerous ... because it assumes that the model ("a straight line") accurately represents the behavior of the data. Maybe the data really followed an exponential curve ...

One refinement to the formula
$$\sigma_d^2 = \frac{1}{N} \sum_i (d_i^{obs} - d_i^{pre})^2$$
has to do with the appearance of N, the number of data. [Figures: fits to two and to three data points.] If there were only two data, the best-fitting straight line would have no error at all; if there were only three data, the best-fitting straight line would likely have just a little error.

Therefore the formula
$$\sigma_d^2 = \frac{1}{N} \sum_i (d_i^{obs} - d_i^{pre})^2$$
very likely underestimates the error. An improved formula would replace N with N-2,
$$\sigma_d^2 = \frac{1}{N-2} \sum_i (d_i^{obs} - d_i^{pre})^2$$
where the "2" is chosen because two points exactly define a straight line.

More generally, if there are M model parameters, then the formula would be
$$\sigma_d^2 = \frac{1}{N-M} \sum_i (d_i^{obs} - d_i^{pre})^2$$
The quantity N-M is often called the number of degrees of freedom.
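A sketch of the posterior error estimate with N-M degrees of freedom on a synthetic straight-line fit (M = 2; all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(4)

# Straight-line fit (M = 2 model parameters) to N noisy data
N = 30
x = np.linspace(0.0, 10.0, N)
d = 2.0 + 0.7 * x + 0.5 * rng.normal(size=N)

G = np.column_stack([np.ones_like(x), x])
m_est = np.linalg.solve(G.T @ G, G.T @ d)
e = d - G @ m_est                                # residuals d_obs - d_pre

M_params = G.shape[1]
sigma2_naive = (e ** 2).sum() / N                # tends to underestimate
sigma2_post = (e ** 2).sum() / (N - M_params)    # divides by degrees of freedom
C_m = sigma2_post * np.linalg.inv(G.T @ G)

print(sigma2_naive, sigma2_post)
print(np.sqrt(np.diag(C_m)))                     # posterior standard errors
```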