Making Models from Data
A Basic Overview of Parameter Estimation and Inverse Theory
(or: Four Centuries of Linear Algebra in 10 Equations)
Rick Aster, Professor of Geophysics, New Mexico Tech
Prerequisite: basic geometry and linear algebra
The “Forward Problem”

d = G(m) (1)

Here d represents the data (observations) and m represents a “model” that we wish to find. We will consider d and m to be column vectors of numbers. G is a mathematical operator that maps the model to the data.
A Linear Forward Problem

d = Gm (2)

The forward problem is linear if it can be written so that d is given as the product of a matrix G with m (even when problems are nonlinear, we can frequently solve them via linearized steps, so this theory is very generally useful). In physical problems, G encodes a law of nature that we wish to satisfy. An example that we'll see later is the mathematics of seismic ray-path travel times in a tomography problem, in which case m represents the (unknown) seismic velocity or slowness structure of a medium. In linear problems, each element of d is calculated as a linear combination of the elements of m, i.e., by the basic operation of matrix-vector multiplication, where the elements of d in (2) are simply

d_j = G_{j1} m_1 + G_{j2} m_2 + … + G_{jn} m_n (3a)
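A minimal NumPy sketch (the small G and m below are invented purely for illustration) showing that the forward problem (2) and its element-wise form (3a) are the same matrix-vector product:

```python
import numpy as np

# A made-up example of the linear forward problem d = G m (equation 2).
# G maps a 3-element model m to 2 predicted data values; each d_j is the
# row-by-row linear combination written out in equation (3a).
G = np.array([[1.0, 2.0, 0.5],
              [0.0, 1.0, 3.0]])
m = np.array([2.0, -1.0, 4.0])

d = G @ m                                   # matrix-vector product, equation (2)
d_manual = np.array([sum(G[j, k] * m[k] for k in range(G.shape[1]))
                     for j in range(G.shape[0])])   # term-by-term version of (3a)

print(d)          # [ 2. 11.]
print(d_manual)   # identical result
```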
A Linear Inverse Problem

m = G^{-g} d (4)

The essence of obtaining, or estimating, m when given d for a linear inverse problem is to find (or produce equivalent operations for) an inverse matrix G^{-g} that multiplies d. The elements of m are then given by linear combinations of the p elements of d:

m_k = G^{-g}_{k1} d_1 + G^{-g}_{k2} d_2 + … + G^{-g}_{kp} d_p (3b)

The inverse matrix, if it is to be effective, must “undo” the effect of G, so that, ideally, the model estimate m_est is close to or equal to what we want to find, the true value of the Earth properties, m_true. So, we seek an inverse matrix such that

m_est = G^{-g} d = G^{-g} G m_true ≈ m_true
The Simple Case

If G is a square (n by n) matrix and is nonsingular, then there is a unique inverse solution, specified by G^{-g} = G^{-1}, where G^{-1} is the familiar inverse matrix of basic linear algebra; G^{-1} G = I (where I is the n by n identity matrix). The forward problem in this case is just a classic “n equations in n unknowns” system that has the unique solution m = G^{-1} d, and m_est = m_true.
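A toy NumPy illustration of the square, nonsingular case (the 2-by-2 G and the model values are arbitrary, chosen only so the example runs):

```python
import numpy as np

# n equations in n unknowns with a nonsingular G: the inverse solution is unique.
G = np.array([[2.0, 1.0],
              [1.0, 3.0]])
m_true = np.array([1.0, -2.0])
d = G @ m_true                          # forward problem

m_est = np.linalg.solve(G, d)           # solves G m = d (preferred over forming inv(G))
print(np.allclose(m_est, m_true))       # True: m_est recovers m_true exactly
```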
The Less Simple Case

Things get considerably more interesting when G is singular (so that G^{-1} doesn't exist), and especially if G is not a square matrix. In linear regression, the p by n matrix G has more rows than columns, so the forward problem d = Gm represents more linear constraint equations (p) than there are elements (n) in m. Systems with p > n will commonly be inconsistent, meaning that they do not have exact solutions (i.e., no m exists that satisfies d = Gm exactly). A simple example, which we will look at shortly, arises when one wishes to find the intercept and slope of a line that “best” goes through p > 2 points that are not exactly collinear. When there is no exact solution, we can try to find a “best” m to satisfy the forward problem.
Least Squares

So, how do we define the “best” solution m for a linear regression problem that has no exact solution? We first define a metric (or misfit measure) that we wish to minimize. The most common approach is least squares, where we seek the m that minimizes the length of the residual vector r between the observed data d and the predicted data Gm. We thus seek m such that

||r||_2 = ||d − Gm||_2 (5)

is as small as it can possibly be, where the bars (|| ||) indicate the Euclidean or 2-norm length, e.g.,

||r||_2 = (r_1^2 + r_2^2 + … + r_p^2)^{1/2} (6)

Because we are minimizing the sum of the squared misfit terms, this approach is called least squares.
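A small NumPy sketch of evaluating the misfit (5)–(6) for a trial model (G, d, and m here are invented for illustration, and the trial m is not necessarily the minimizer):

```python
import numpy as np

# Evaluate the least-squares misfit (5)-(6) for a candidate model m.
G = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
d = np.array([1.1, 1.9, 3.2])
m = np.array([0.0, 1.0])                     # a trial model

r = d - G @ m                                # residual between observed and predicted data
norm_by_formula = np.sqrt(np.sum(r**2))      # equation (6) written out
norm_by_numpy = np.linalg.norm(r)            # the same 2-norm via NumPy
print(norm_by_formula, norm_by_numpy)
```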
As an example, we will set up the linear regression problem for fitting a line to an arbitrary set of data points (by the way, it is called linear regression because the mathematics is linear, not because we might be fitting a line!). A short numerical sketch of these steps follows below.

a) Set up the linear forward problem. We want the p data points, collected at points (x, y) = (x_i, d_i), to fit a line specified by the unknown parameters m_1 and m_2. Let m_1 be the y-intercept and m_2 be the slope. The forward constraint equation for each data point is thus d_i = m_1 + m_2 x_i, and there will be p of them, one for each data point. Note that the elements of m are just the coefficients of a first-degree polynomial (a line), so it is easy to see how the forward problem could be generalized to a polynomial of arbitrary complexity; we would just have more elements in m.

b) Next, we put this forward problem into the form d = G(m). Because the data points d_i are a linear combination of the unknown model parameters, the system is linear and we can write it as a matrix-vector equation d = Gm. It is easy to see from the form of the constraint equations that the elements of G are simply G_{i1} = 1 and G_{i2} = x_i for the p matrix rows, i = 1, …, p.

c) Solve for the least squares solution that minimizes ||r||_2. One can of course find a solution formula for this ancient problem, which can be solved by a variety of methods (e.g., using differential calculus). A typical solution form for the least-squares linear regression to a line is shown at right, where a is the slope (our m_2), b is the intercept (our m_1), and the n (our p) data points are the y_i, collected at the points x_i: http://www.analyzemath.com/statistics/linear_regression.html
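A minimal NumPy sketch of steps (a)–(c), assuming a handful of made-up (x_i, d_i) points; np.linalg.lstsq is used here as one convenient way to obtain the least-squares fit (an explicit normal-equations version is sketched later):

```python
import numpy as np

# Line-fit setup from steps (a)-(b): a column of ones multiplies the intercept m1,
# a column of x values multiplies the slope m2. The data points are made up.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
d = np.array([1.1, 2.9, 5.2, 6.8, 9.1])          # roughly d = 1 + 2x plus noise

G = np.column_stack([np.ones_like(x), x])        # row i of G is [1, x_i]

# Step (c): lstsq minimizes ||d - G m||_2 for us.
m_ls, *_ = np.linalg.lstsq(G, d, rcond=None)
print("intercept m1 =", m_ls[0], "slope m2 =", m_ls[1])
```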
That is a pretty impressive and somewhat complicated-looking (and not very insightful) formula. If we wanted to regress to a higher-degree polynomial, you might naturally wonder how complicated such formulas would get! We will next generalize the solution for all linear regression problems into a simple linear algebra formula, deriving in the process the iconic normal equations. The derivation that we will show is taken from Appendix A of Parameter Estimation and Inverse Problems (Aster, Borchers, and Thurber; Elsevier, 2012).
To derive our solution to an arbitrary linear regression problem, we will use only simple geometry and linear algebra. We begin with the insightful observation that the matrix-vector product Gm can be envisioned as a linear combination of the column vectors of the p by n matrix G, with the coefficients specified by the elements of m. The subspace of p-dimensional space spanned by the columns of G is called the range of G, abbreviated R(G). The forward problem is then

d = m_1 G_{·,1} + m_2 G_{·,2} + … + m_n G_{·,n} (7)

where G_{·,j} denotes the jth column of G. Because d has p elements, but G has only n < p columns, any such linear combination lies only in a subspace of the space, R^p, of all p-dimensional vectors. To span the entire space, we would need p linearly independent vectors (i.e., vectors such that none can be constructed as a linear combination of the others).
Because we have an insufficient number of basis vectors in the columns of G, we cannot fit an arbitrary d with any m in the forward problem. The only case where we can fit d exactly is when d fortuitously lies precisely in the range of G (this is the special case when, for example, all points lie exactly on a line in the line-fitting regression problem discussed earlier).
Because we cannot fit d exactly, we will instead seek the m such that

Gm = proj_{R(G)} d (8)

by which we mean that we want Gm to be the projection of d onto the (at most n-dimensional) subspace R(G). We will call the m that satisfies this condition m_ls.

(Figure: the geometric relationship between d and Gm for m = m_ls, depicted for p = 3 and a 2-dimensional R(G).)
A key observation is that the vector r = d − Gm_ls must be perpendicular to R(G), and thus perpendicular to Gm_ls. This difference, r, is the component of the p-dimensional d that we can never fit (again, because R(G) is only a subspace of R^p).

(Figure: the geometric relationship between d and Gm for m = m_ls, depicted for p = 3 and a 2-dimensional R(G).)
Since r = d − Gm_ls is perpendicular to R(G), it is perpendicular (normal) to every column of G. Thus, all dot products between the columns of G and r are zero, so that

G^T (d − Gm_ls) = 0 (9)

Because of this normality, the n equations of (9) are called the normal equations. If G^T G is nonsingular, rearranging terms and left-multiplying both sides by (G^T G)^{-1} gives a straightforward solution for m_ls:

m_ls = (G^T G)^{-1} G^T d (10)

If the columns of G are linearly independent, it can be shown that (G^T G)^{-1} always exists, and that (10) provides a unique least squares solution. We have thus found a general least-squares inverse for G: it is (G^T G)^{-1} G^T. We will next apply this solution to solve a basic seismic tomography problem.
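A NumPy sketch of (9) and (10), reusing the made-up line-fit data from the earlier sketch (any G with linearly independent columns would do); solving the system G^T G m = G^T d is preferred over explicitly forming the inverse:

```python
import numpy as np

# Normal-equations solution (10) for the line-fit example.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
d = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
G = np.column_stack([np.ones_like(x), x])

m_ls = np.linalg.solve(G.T @ G, G.T @ d)     # equation (10)
r = d - G @ m_ls                             # residual, perpendicular to R(G)
print(m_ls)
print(G.T @ r)                               # ~[0, 0]: the normal equations (9) are satisfied
```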
A Simple Tomography Problem

Consider a four-block (2 × 2) model of a square region, 200 m on a side, where we take travel-time data t_1, …, t_5 for seismic waves traveling in five directions, as shown at right. The slowness (the reciprocal of the seismic velocity) in each block is parameterized as S_{11}, S_{12}, S_{21}, S_{22}, as depicted in the figure. We parameterize the model in terms of slowness instead of velocity because it results in a linear system of equations, since distance times slowness equals time. In this example we neglect refraction and approximate the ray paths by straight lines, shown in grey at right.
A Simple Tomography Problem

Each travel-time measurement has a forward-model constraint equation associated with it, where the factors of 100 are the side lengths of the blocks in meters. For example,

t_1 = 100 S_{11} + 100 S_{12} (11)

where the slownesses are specified in seconds/meter and the time is in seconds. The complete system of p = 5 constraint equations in n = 4 unknowns can be written as

t = G s (12)

where the elements G_{ij} are specified by the ray-path geometry of the experiment and we have rearranged the unknown slownesses into the vector s = (S_{11}, S_{12}, S_{21}, S_{22})^T, as shown in (12).
A Simple Tomography Problem

1) Find the elements of G.

2) Solve for a least-squares solution s using the normal equations (10) and MATLAB or some other linear algebra package, given

t_1 = 0.1783 s, t_2 = 0.1896 s, t_3 = 0.2008 s, t_4 = 0.1535 s, t_5 = 0.2523 s

Note that there is some random noise in these times, so the system of equations (12) is inconsistent (it has no exact solution).

3) Convert your slownesses to velocities; where is the region seismically faster or slower?

4) Calculate the residual vector r = t − Gs and its length using (6). How well does your model actually fit the data on average?

A skeleton for these steps is sketched below.
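The NumPy skeleton below sketches steps 2–4. Only the first row of G comes from equation (11); the remaining rows are purely illustrative placeholders, since the actual ray-path lengths must be read from the figure as part of step 1 and substituted before the results mean anything.

```python
import numpy as np

# Skeleton for the tomography exercise. Row 1 of G is from equation (11):
# t1 = 100*S11 + 100*S12. Rows 2-5 are ILLUSTRATIVE placeholders; replace them
# with the true path length (in meters) of each ray through the four blocks,
# with the model ordered as s = (S11, S12, S21, S22).
G = np.array([[100.0, 100.0,   0.0,   0.0],                     # from (11)
              [  0.0,   0.0, 100.0, 100.0],                     # placeholder
              [100.0,   0.0, 100.0,   0.0],                     # placeholder
              [  0.0, 100.0,   0.0, 100.0],                     # placeholder
              [100.0 * 2**0.5, 0.0, 0.0, 100.0 * 2**0.5]])      # placeholder

t = np.array([0.1783, 0.1896, 0.2008, 0.1535, 0.2523])          # travel times (s)

s_ls = np.linalg.solve(G.T @ G, G.T @ t)   # step 2: normal equations (10)
v = 1.0 / s_ls                             # step 3: slowness (s/m) -> velocity (m/s)
r = t - G @ s_ls                           # step 4: residual vector
print("slowness (s/m):", s_ls)
print("velocity (m/s):", v)
print("||r||_2 =", np.linalg.norm(r))      # residual length, equation (6)
```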
Final Thoughts (1)

The normal equations provide a simple and general solution for linear regression problems. However, for large problems, and/or when G^T G is singular (in which case the system of equations does not have a unique solution) or nearly singular (in which case the solutions will be extremely sensitive to noise or minor changes in the data), more advanced techniques are required to produce useful solutions. The singular value decomposition (SVD) provides a general, efficient, and stable methodology for solving least squares problems. It straightforwardly produces solutions that are both least squares and have minimum model length ||m||_2 in cases where there is no unique solution. Details on the SVD and its applications can be found in Aster et al. (2012) and in many other references.
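A short NumPy sketch of an SVD-based least-squares solution, reusing the made-up line-fit data from the earlier sketches; the commented step is where small singular values could be truncated to stabilize a nearly singular problem:

```python
import numpy as np

# Least-squares solution built from the singular value decomposition of G.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
d = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
G = np.column_stack([np.ones_like(x), x])

U, s, Vt = np.linalg.svd(G, full_matrices=False)
# (Small entries of s could be dropped here to regularize a near-singular G.)
m_svd = Vt.T @ ((U.T @ d) / s)          # pseudoinverse of G applied to d
print(m_svd)
print(np.linalg.pinv(G) @ d)            # same result via NumPy's built-in pinv
```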
Final Thoughts (2)

What we have derived here is the solution for “ordinary least squares.” If data errors are estimated, it is easy to modify this theory to obtain expressions for “weighted least squares.” If the data errors are estimated to be Gaussian and statistically independent, this is as simple as weighting each constraint equation (the data and associated G elements) by the reciprocal of its standard deviation and using the normal equations as before.

Least squares is notoriously susceptible to outliers (wild data points) that are far off-trend. A powerful way to reduce the influence of outliers is to use robust estimation methods, where the misfit measure is less affected by a handful of out-of-place data points. The most common way to do this is to minimize the 1-norm misfit measure

||r||_1 = |r_1| + |r_2| + … + |r_p| (13)

where the bars indicate absolute value, instead of the 2-norm measure (6). Minimizing the 1-norm misfit is a bit more complicated, but it can be done with a modified iterative algorithm called iteratively reweighted least squares (a minimal sketch follows below). Again, Aster et al. (2012) and other standard references show how to do this.
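A minimal iteratively reweighted least squares (IRLS) sketch for the 1-norm misfit (13), again using the made-up line-fit data but with one deliberate outlier; the weights 1/|r_i| are the standard IRLS choice for the 1-norm, and eps is a small guard against division by zero:

```python
import numpy as np

# IRLS for the 1-norm misfit: each iteration solves a weighted version of the
# normal equations with weights 1/|r_i|.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
d = np.array([1.1, 2.9, 5.2, 20.0, 9.1])          # the point at x = 3 is an outlier
G = np.column_stack([np.ones_like(x), x])

m = np.linalg.solve(G.T @ G, G.T @ d)             # start from ordinary least squares
eps = 1e-8
for _ in range(50):
    r = d - G @ m
    W = np.diag(1.0 / np.maximum(np.abs(r), eps)) # reweight by 1/|residual|
    m = np.linalg.solve(G.T @ W @ G, G.T @ W @ d)

print(m)   # intercept and slope, much less affected by the outlier than ordinary least squares
```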