Making Models from Data: A Basic Overview of Parameter Estimation and Inverse Theory, or Four Centuries of Linear Algebra in 10 Equations
Rick Aster, Professor of Geophysics, New Mexico Tech
Prerequisite: basic geometry and linear algebra

The “Forward Problem”

d = G(m)   (1)

Here d represents the data (observations) and m represents a “model” that we wish to find. We will consider d and m to be column vectors of numbers. G is a mathematical operator that maps the model to the data.

A Linear Forward Problem

d = Gm   (2)

The forward problem is linear if it can be written so that d is given as the product of a matrix G with m (even when problems are nonlinear, we can frequently solve them via linearized steps, so this theory is generally very useful). In physical problems, G encodes a law of nature that we wish to satisfy. An example that we will see later is the mathematics of seismic ray path travel times in a tomography problem, in which case m represents the (unknown) seismic velocity or slowness structure of a medium. In linear problems, each element of d is a linear combination of the elements of m; i.e., d is produced by the basic operation of matrix-vector multiplication, and the elements of d in (2) are simply

d_j = G_{j1} m_1 + G_{j2} m_2 + … + G_{jn} m_n   (3a)
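As a concrete illustration (not part of the original slides), here is a minimal numpy sketch of the forward problem (2)–(3a); the matrix G and model m are made up purely to show the matrix-vector product and do not correspond to any physical problem:

```python
import numpy as np

# Hypothetical 3 x 2 forward operator G and 2-element model m (values invented
# for illustration only).
G = np.array([[1.0, 2.0],
              [0.5, 1.5],
              [2.0, 0.0]])
m = np.array([3.0, -1.0])

# Forward problem (2): d = G m
d = G @ m

# Equation (3a): each d_j is a linear combination of the elements of m
d_check = np.array([sum(G[j, k] * m[k] for k in range(G.shape[1]))
                    for j in range(G.shape[0])])

print(d)                         # [1. 0. 6.]
print(np.allclose(d, d_check))   # True
```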

A Linear Inverse Problem

m = G^{-g} d   (4)

The essence of obtaining, or estimating, m when given d for a linear inverse problem is to find (or produce equivalent operations for) an inverse matrix, G^{-g}, that multiplies d. The elements of m are then given by linear combinations of the p elements of d:

m_k = G^{-g}_{k1} d_1 + G^{-g}_{k2} d_2 + … + G^{-g}_{kp} d_p   (3b)

The inverse matrix, if it is to be effective, must “undo” the effect of G, so that, ideally, the model estimate m_est is close to or equal to what we want to find, the true value of the Earth properties, m_true. So, we seek an inverse matrix such that

m_est = G^{-g} d = G^{-g} G m_true ≈ m_true

The Simple Case

If G is a square (n by n) matrix and is nonsingular, then there is a unique inverse solution, specified by G^{-g} = G^{-1}, where G^{-1} is the familiar inverse matrix of basic linear algebra; G^{-1} G = I (where I is the n by n identity matrix). The forward problem in this case is just a classic “n equations in n unknowns” system that has the unique solution m = G^{-1} d, and m_est = m_true.
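A minimal sketch (not from the slides) of this square, nonsingular case, using a small invented 2 x 2 system; in practice one calls a linear solver rather than forming G^{-1} explicitly:

```python
import numpy as np

# Hypothetical square, nonsingular 2 x 2 forward operator and "true" model
# (numbers invented for illustration only).
G = np.array([[2.0, 1.0],
              [1.0, 3.0]])
m_true = np.array([1.0, 2.0])

# Forward problem: generate exact (noise-free) data
d = G @ m_true

# Inverse problem: n equations in n unknowns has the unique solution m = G^{-1} d.
# np.linalg.solve is preferred over explicitly inverting G.
m_est = np.linalg.solve(G, d)

print(m_est)                       # [1. 2.]
print(np.allclose(m_est, m_true))  # True -> m_est equals m_true in this case
```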

The Less Simple Case

Things get considerably more interesting when G is singular (so that G^{-1} does not exist), and especially when G is not a square matrix. In linear regression, the p by n matrix G has more rows than columns, so the forward problem d = Gm represents more linear constraint equations (p) than there are elements (n) in m. Systems with p > n will commonly be inconsistent, meaning that they have no exact solution (i.e., no m exists that satisfies d = Gm exactly). A simple example that we will look at shortly arises when one wishes to find the intercept and slope of a line that “best” goes through p > 2 points that are not exactly collinear. When there is no exact solution, we can try to find a “best” m to approximately satisfy the forward problem.

Least Squares

So, how do we define a “best” solution m for a linear regression problem that has no exact solution? We first define a metric (or misfit measure) that we wish to minimize. The most common approach is least squares, where we seek the solution m that minimizes the length of the residual vector, r, between the observed data d and the predicted data Gm. We thus seek m such that

||r||_2 = ||d − Gm||_2   (5)

is as small as it can possibly be, where the double bars indicate the Euclidean or 2-norm length,

||r||_2 = (r_1^2 + r_2^2 + … + r_p^2)^{1/2}   (6)

Because we are minimizing the sum of the squared misfit terms, this approach is called least squares.
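A small numpy sketch (mine, not from the slides) of the misfit measure (5)–(6) for an arbitrary d, G, and trial m; all numbers are invented:

```python
import numpy as np

def residual_norm(d, G, m):
    """2-norm of the residual r = d - G m, as in equations (5)-(6)."""
    r = d - G @ m
    return np.sqrt(np.sum(r**2))   # same as np.linalg.norm(r)

# Invented numbers, purely to exercise the function.
G = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
d = np.array([1.1, 1.9, 3.2])
m_trial = np.array([0.0, 1.0])

print(residual_norm(d, G, m_trial))  # misfit of this trial model
```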

As an example, we will set up the linear regression problem for fitting a line to an arbitrary set of data points (by the way, it is called linear regression because the mathematics is linear, not because we might be fitting a line!):

a) Set up the linear forward problem. We want the p data points, collected at points (x, y) = (x_i, d_i), to fit a line specified by the unknown parameters m_1 and m_2. Let m_1 be the y-intercept and m_2 be the slope. The forward constraint equation for each data point is thus d_i = m_1 + m_2 x_i, and there will be p of them, one for each data point. Note that the elements of m are just the coefficients of a first-degree polynomial (a line), so it is easy to see how the forward problem could be generalized to a polynomial of arbitrary degree; we would just have more elements in m.

b) Next, we put this forward problem into the form of (2). Because the data points d_i are a linear combination of the unknown model parameters, the system is linear and we can write it as a matrix-vector equation d = Gm. It is easy to see from the form of the constraint equations that the elements of G are simply G_{i1} = 1 and G_{i2} = x_i for the p matrix rows, i = 1, …, p.

c) Solve for the least squares solution that minimizes ||r||_2. One can of course find a solution formula for this ancient problem by a variety of methods (e.g., using differential calculus). A typical solution form for the least-squares linear regression to a line is shown at right (and implemented in the sketch below), where a is the slope (our m_2), b is the intercept (our m_1), and the n (our p) data points are the y_i, collected at the points x_i: …ear_regression.html
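A minimal sketch (mine, with invented data) of steps a)–c): building the p x 2 matrix G with columns of ones and x_i, and evaluating the classic closed-form slope/intercept formulas for the least-squares line:

```python
import numpy as np

# Invented data points (x_i, d_i); not from the slides.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
d = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
p = len(x)

# Step b): design matrix with G_{i1} = 1 (intercept column) and G_{i2} = x_i (slope column)
G = np.column_stack([np.ones(p), x])

# Step c): classic closed-form formulas for the best-fit line d = m1 + m2 * x
m2 = (p * np.sum(x * d) - np.sum(x) * np.sum(d)) / (p * np.sum(x**2) - np.sum(x)**2)  # slope
m1 = np.mean(d) - m2 * np.mean(x)                                                     # intercept

r = d - G @ np.array([m1, m2])     # residual vector
print(m1, m2, np.linalg.norm(r))   # intercept, slope, 2-norm misfit
```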

That is a pretty impressive and somewhat complicated-looking (and not very insightful) formula. If we wanted to regress to a higher-degree polynomial, you might naturally wonder how complicated such formulas would get! We will next generalize the solution for all linear regression problems into a simple linear algebra formula, deriving in the process the iconic normal equations. The derivation that we will show is taken from Appendix A of Parameter Estimation and Inverse Problems (Aster, Borchers, and Thurber; Elsevier, 2012).

To derive our solution to an arbitrary linear regression problem we will use only simple geometry and linear algebra. We begin with the insightful observation that the matrix-vector product Gm can be envisioned as a linear combination of the column vectors of the p by n matrix G, with the coefficients specified by the elements of m. The subspace of p-dimensional space spanned by the columns of G is called the range of G (abbreviated R(G)). The forward problem is then

d = m_1 G_{·,1} + m_2 G_{·,2} + … + m_n G_{·,n}   (7)

where G_{·,j} denotes the j-th column of G. Because d has p elements, but G has only n < p columns, any such linear combination can span only a subspace of the space, R^p, of all p-dimensional vectors. To span the entire space, we would need p linearly independent vectors (linearly independent meaning that none can be constructed as a linear combination of the others).
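A small numpy check (not from the slides) of equation (7): Gm is the same vector as the m_j-weighted sum of the columns of G, here with an invented 3 x 2 example:

```python
import numpy as np

# Invented p x n matrix (p = 3, n = 2) and model vector, for illustration only.
G = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
m = np.array([2.0, -1.0])

# Row view: ordinary matrix-vector product
d_row_view = G @ m

# Column view, equation (7): a linear combination of the columns of G
d_col_view = m[0] * G[:, 0] + m[1] * G[:, 1]

print(np.allclose(d_row_view, d_col_view))  # True
# Any such d necessarily lies in R(G), here a 2-dimensional subspace of R^3.
```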

Because we have an insufficient number of basis vectors in the columns of G, we cannot fit an arbitrary d with any m in the forward problem here. The only particular case where we can fit d exactly is if d fortuitously lives precisely in the range of G (this is the special case when, for example, all points lie exactly on a line in the linear regression to a line problem discussed earlier).

Because we cannot fit d exactly, we will instead seek the m such that

Gm = proj_{R(G)} d   (8)

by which we mean that we want Gm to be the projection of d onto the (at most n-dimensional) subspace R(G). We will call the m that satisfies this condition m_ls. The figure (at right) depicts the geometric relationship between d and Gm for m = m_ls, for p = 3 and a 2-dimensional R(G).

A key observation is that the vector r = d − Gm_ls must be perpendicular to R(G), and thus perpendicular to Gm_ls. This difference, r, is the part of the p-dimensional d that we can never fit (again because R(G) is only a subspace of R^p). The figure (at right) again depicts the geometric relationship between d and Gm for m = m_ls, for p = 3 and a 2-dimensional R(G).

Since r = d − Gm_ls is perpendicular to R(G), it is perpendicular (normal) to every column of G. Thus, all dot products between the columns of G and r are zero, so that

G^T (d − Gm_ls) = 0   (9)

Because of this normality, the n equations of (9) are called the normal equations. If G^T G is nonsingular, rearranging terms in (9) and left-multiplying both sides by (G^T G)^{-1} gives a straightforward solution for m_ls:

m_ls = (G^T G)^{-1} G^T d   (10)

If the columns of G are linearly independent, it can be shown that (G^T G)^{-1} always exists, and that (10) provides a unique least squares solution. We have thus found a general least-squares inverse for G; it is G^{-g} = (G^T G)^{-1} G^T. We will next apply this solution to a basic seismic tomography problem (a code sketch of (10) follows below).
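A minimal numpy sketch (mine) of equation (10), applied to the earlier invented line-fit data; it verifies that the normal-equations solution matches the classic slope/intercept formulas and that the residual is normal to the columns of G:

```python
import numpy as np

# Invented line-fit data, as in the earlier sketch.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
d = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
G = np.column_stack([np.ones(len(x)), x])   # G_{i1} = 1, G_{i2} = x_i

# Equation (10): m_ls = (G^T G)^{-1} G^T d
# (np.linalg.solve on the 2 x 2 system G^T G m = G^T d avoids forming the inverse.)
m_ls = np.linalg.solve(G.T @ G, G.T @ d)
print(m_ls)   # [intercept, slope] -- matches the classic formulas

# Check the normal equations (9): the residual is orthogonal to every column of G.
r = d - G @ m_ls
print(np.allclose(G.T @ r, 0.0))  # True (to floating-point precision)
```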

A Simple Tomography Problem

Consider a four-block model of a square region, 200 m on a side, where we take travel time data, t_1, …, t_5, for seismic waves traveling in five directions as shown at right. The slowness (the reciprocal of the seismic velocity) in each block is parameterized as S_11, S_12, S_21, S_22, as depicted in the figure. We parameterize the model in terms of slowness instead of velocity because it results in a linear system of equations, since distance times slowness equals time. In this example we neglect refraction and approximate the ray paths by straight lines, shown in grey at right.

A Simple Tomography Problem

Each travel time measurement has a forward model constraint equation associated with it, in which the factors of 100 are the side lengths of the blocks in meters. For example:

t_1 = 100 S_11 + 100 S_12   (11)

where the slownesses are specified in seconds/meter and the times are in seconds. The complete system of p = 5 constraint equations in n = 4 unknowns can be written as

t = G s, where t = (t_1, …, t_5)^T and s = (S_11, S_12, S_21, S_22)^T   (12)

where the elements G_ij of the 5 by 4 matrix G are specified by the ray path geometry of the experiment, and we have rearranged the unknown slownesses into a vector as shown in (12).

A Simple Tomography Problem

1) Find the elements of G.

2) Solve for a least-squares solution s using the normal equations (10) and MATLAB or some other linear algebra package, given the observed travel times t_1 = … s, t_2 = … s, t_3 = … s, t_4 = … s, t_5 = … s. Note that there is some random noise in these times, so the system of equations (12) is inconsistent (it has no exact solution). A solution template with placeholder values appears after this list.

3) Convert your slownesses to velocities; where is the region seismically faster or slower?

4) Calculate the residual vector r and its length using (6). How well does your model actually fit the data on average?
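The following is only a solution template (mine, not the original exercise solution). The ray geometry assumed in G (two row rays, two column rays, one diagonal) and the synthetic travel times are invented placeholders and are not necessarily the configuration in the figure; replace them with the G and the times read from the slide:

```python
import numpy as np

# HYPOTHETICAL ray-path matrix: path length (m) of each ray through blocks
# (S11, S12, S21, S22). Assumed geometry only -- replace with the real G.
L = 100.0
G = np.array([[L,          L,   0.0, 0.0       ],   # ray 1: top row
              [0.0,        0.0, L,   L         ],   # ray 2: bottom row
              [L,          0.0, L,   0.0       ],   # ray 3: left column
              [0.0,        L,   0.0, L         ],   # ray 4: right column
              [L * 2**0.5, 0.0, 0.0, L * 2**0.5]])  # ray 5: diagonal

# Synthetic travel times (s) from an invented slowness model plus noise;
# replace t with the observed times listed on the slide.
rng = np.random.default_rng(0)
s_true = 1.0 / np.array([2000.0, 2200.0, 1900.0, 2100.0])   # slowness in s/m
t = G @ s_true + rng.normal(0.0, 1e-4, size=5)

# Least-squares slownesses via the normal equations (10): s = (G^T G)^{-1} G^T t
s_ls = np.linalg.solve(G.T @ G, G.T @ t)

v = 1.0 / s_ls                # convert slowness (s/m) back to velocity (m/s)
r = t - G @ s_ls              # residual vector
print(v.reshape(2, 2))        # estimated velocities on the 2 x 2 block grid
print(np.linalg.norm(r))      # 2-norm misfit, equation (6)
```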

Final Thoughts (1)

The normal equations provide a simple and general solution for linear regression problems. However, for large problems, and/or when G^T G is singular (in which case the system of equations does not have a unique solution) or nearly singular (in which case the solutions will be extremely sensitive to noise or minor changes in the data), more advanced techniques are required to produce useful solutions. The singular value decomposition (SVD) provides a general, efficient, and stable methodology for solving least squares problems. It straightforwardly produces solutions that are both least squares and of minimum model length ||m||_2 in cases where there is no unique solution. Details on the SVD and its applications can be found in Aster et al. (2012) and in many other references.
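As an illustration (not from the slides) of the SVD route: numpy's lstsq and pinv are SVD-based and return the minimum-length least squares solution even when G^T G is singular. The rank-deficient G below is invented:

```python
import numpy as np

# Invented rank-deficient G: the second column is twice the first, so G^T G is
# singular and the normal equations alone cannot pick a unique solution.
G = np.array([[1.0, 2.0],
              [1.0, 2.0],
              [2.0, 4.0]])
d = np.array([1.0, 1.1, 2.1])

# SVD-based least squares: minimizes ||d - G m||_2, and among those minimizers
# returns the one with minimum ||m||_2.
m_svd, residual, rank, singular_values = np.linalg.lstsq(G, d, rcond=None)

print(rank)                    # 1 -- confirms the rank deficiency
print(m_svd)                   # minimum-length least squares solution
print(np.linalg.pinv(G) @ d)   # the pseudoinverse gives the same answer
```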

Final Thoughts (2)

What we have derived here is a solution for “ordinary least squares”. If data errors are estimated, it is easy to modify this theory to obtain expressions for “weighted least squares”. If the data errors are estimated to be Gaussian and statistically independent, this is as simple as weighting each constraint equation (the datum and the associated row of G) by the reciprocal of its standard deviation and then using the normal equations as before. Least squares is also notoriously susceptible to outliers (wild data points that are way off-trend). A powerful way to reduce the influence of outliers is to use robust estimation methods, where the misfit measure is less affected by a handful of out-of-place data points. The most common way to do this is to minimize the 1-norm misfit measure

||r||_1 = |r_1| + |r_2| + … + |r_p|   (13)

where the bars indicate absolute value, instead of the 2-norm measure (6). Minimizing the 1-norm misfit is a bit more complicated, but it can be done with a modified iterative algorithm called iteratively reweighted least squares (IRLS). Again, Aster et al. (2012) and other standard references show how to do this.
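A compact sketch (mine, under the assumptions noted in the comments) of the two ideas above: weighting rows by reciprocal standard deviations, and a simple IRLS loop that approximately minimizes the 1-norm misfit (13). It is a teaching sketch with invented data, not a production robust-regression routine:

```python
import numpy as np

def weighted_least_squares(G, d, sigma):
    """Weight each row by 1/sigma_i, then solve the normal equations (10)."""
    W = np.diag(1.0 / sigma)
    Gw, dw = W @ G, W @ d
    return np.linalg.solve(Gw.T @ Gw, Gw.T @ dw)

def irls_1norm(G, d, n_iter=50, eps=1e-8):
    """Approximately minimize ||d - G m||_1 by iteratively reweighted least squares."""
    m = np.linalg.solve(G.T @ G, G.T @ d)          # start from the 2-norm solution
    for _ in range(n_iter):
        r = d - G @ m
        w = 1.0 / np.maximum(np.abs(r), eps)       # 1-norm reweighting
        m = np.linalg.solve(G.T @ (w[:, None] * G), G.T @ (w * d))
    return m

# Invented line-fit data with one gross outlier in the last point.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
d = np.array([1.0, 3.0, 5.1, 7.0, 30.0])
G = np.column_stack([np.ones_like(x), x])

print(np.linalg.solve(G.T @ G, G.T @ d))   # ordinary least squares: pulled by the outlier
print(irls_1norm(G, d))                    # robust 1-norm fit: near intercept 1, slope 2
```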