Newton's Method applied to a scalar function

Newton's method for minimizing f(x): given a twice differentiable function f(x) and an initial solution x_0, generate a sequence of solutions x_1, x_2, … and stop if the sequence converges to a solution with ∇f(x) = 0.
1. Solve ∇^2 f(x_k) Δx = -∇f(x_k)
2. Let x_{k+1} = x_k + Δx
3. Let k = k + 1
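
As an illustration (not from the original slides), here is a minimal MATLAB sketch of these three steps; the test function, starting point, and tolerance are arbitrary choices for demonstration:

% Newton's method for minimizing a twice-differentiable f(x) (sketch).
% Hypothetical test function: f(x) = exp(x1) + x1^2 + x2^2.
grad = @(x) [exp(x(1)) + 2*x(1); 2*x(2)];        % gradient of f
hess = @(x) [exp(x(1)) + 2, 0; 0, 2];            % Hessian of f
x = [1; 1];                                      % initial solution x0
for k = 1:50
    g = grad(x);
    if norm(g) < 1e-10                           % stop when gradient ~ 0
        break;
    end
    dx = -hess(x) \ g;                           % step 1: solve H*dx = -g
    x = x + dx;                                  % step 2: x_{k+1} = x_k + dx
end                                              % step 3: k = k + 1 (loop)
disp(x)                                          % x1 solves exp(x1) + 2*x1 = 0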

Newton's Method applied to LS

Newton's method is not directly applicable to most nonlinear regression and inverse problems (the numbers of model parameters and data points are not equal, and there is no exact solution to G(m) = d). Instead we use it to minimize a nonlinear LS problem, e.g. fit a vector of n parameters to a data vector d:

f(m) = ∑_{i=1}^{m} [(G(m)_i - d_i)/σ_i]^2

Let f_i(m) = (G(m)_i - d_i)/σ_i, i = 1, 2, …, m, and F(m) = [f_1(m) … f_m(m)]^T, so that

f(m) = ∑_{i=1}^{m} f_i(m)^2,   ∇f(m) = ∑_{i=1}^{m} ∇[f_i(m)^2].

NM: solve -∇f(m_k) ≈ ∇^2 f(m_k) Δm.

LHS: -∇f(m_k)_j = -∑_{i=1}^{m} 2 f_i(m_k) ∂f_i(m_k)/∂m_j, i.e. -∇f(m_k) = -2 J(m_k)^T F(m_k).

RHS: ∇^2 f(m_k) Δm = [2 J(m_k)^T J(m_k) + Q(m_k)] Δm, where Q(m) = 2 ∑_{i=1}^{m} f_i(m) ∇^2 f_i(m).

With H(m) = 2 J(m)^T J(m) + Q(m), the Newton step is H(m_k) Δm = -2 J(m_k)^T F(m_k), i.e.

Δm = -H(m_k)^-1 ∇f(m_k)   (eq. 9.19)

Here f_i(m) = (G(m)_i - d_i)/σ_i, i = 1, 2, …, m, and F(m) = [f_1(m) … f_m(m)]^T.

Gauss-Newton (GN) method

In ∇^2 f(m_k) Δm = H(m_k) Δm = [2 J(m_k)^T J(m_k) + Q(m_k)] Δm, GN ignores the term Q(m) = 2 ∑_{i=1}^{m} f_i(m) ∇^2 f_i(m), so that ∇^2 f(m) ≈ 2 J(m)^T J(m); this assumes the residuals f_i(m) will be reasonably small as we approach m*. That is, solving -∇f(m_k) ≈ ∇^2 f(m_k) Δm with ∇f(m)_j = ∑_{i=1}^{m} 2 f_i(m) ∂f_i(m)/∂m_j gives

J(m_k)^T J(m_k) Δm = -J(m_k)^T F(m_k),

where f_i(m) = (G(m)_i - d_i)/σ_i, i = 1, 2, …, m, and F(m) = [f_1(m) … f_m(m)]^T.
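
A minimal MATLAB sketch of a Gauss-Newton loop for a hypothetical two-parameter exponential fit (the model, data, and tolerance below are invented for illustration, not taken from the slides):

% Gauss-Newton for nonlinear LS: fit d ~ m1*exp(-m2*t), sigma = 1 assumed.
t = (0:9)';
d = 2*exp(-0.4*t);                               % noise-free synthetic data
F = @(m) m(1)*exp(-m(2)*t) - d;                  % residual vector F(m)
J = @(m) [exp(-m(2)*t), -m(1)*t.*exp(-m(2)*t)];  % Jacobian of F
m = [1; 1];                                      % initial model
for k = 1:20
    Jk = J(m); Fk = F(m);
    dm = -(Jk'*Jk) \ (Jk'*Fk);                   % solve J'J dm = -J'F
    m = m + dm;
    if norm(dm) < 1e-10, break; end
end
disp(m)                                          % should approach [2; 0.4]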

Newton's Method applied to LS

The Levenberg-Marquardt (LM) method uses

[J(m_k)^T J(m_k) + λI] Δm = -J(m_k)^T F(m_k).

As λ -> 0 this reduces to GN; as λ becomes large it approaches steepest descent (SD), which steps down-gradient most rapidly and provides slow but certain convergence. Which value of λ should be used? Small values when GN is working well, larger values in problem areas; start with a small value of λ and then adjust.
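
A MATLAB sketch of one way to adjust λ, using the same hypothetical exponential-fit problem as in the Gauss-Newton sketch above; the halving/ten-fold update rule is one common heuristic, not necessarily the one intended in the slides:

% Levenberg-Marquardt with a simple lambda-adjustment heuristic (sketch).
t = (0:9)';
d = 2*exp(-0.4*t);                               % noise-free synthetic data
F = @(m) m(1)*exp(-m(2)*t) - d;                  % residual vector
J = @(m) [exp(-m(2)*t), -m(1)*t.*exp(-m(2)*t)];  % Jacobian of F
m = [1; 1];
lambda = 1e-3;                                   % start with a small lambda
for k = 1:200
    Jk = J(m); Fk = F(m);
    dm = -(Jk'*Jk + lambda*eye(2)) \ (Jk'*Fk);   % LM step
    if norm(F(m + dm)) < norm(Fk)                % step reduced the misfit
        m = m + dm;
        lambda = lambda/2;                       % behave more like GN
        if norm(dm) < 1e-10, break; end
    else
        lambda = lambda*10;                      % behave more like SD
    end
end
disp(m)                                          % should approach [2; 0.4]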

Steepest descent (SD)

Consider the problem of minimizing a quadratic function f(x) = 0.5 x^T A x - b^T x (equivalent to solving Ax = b), where b is in R^n and A is an n x n symmetric positive definite matrix. The gradient of f(x) is Ax - b, so

x_{k+1} = x_k - α_k (A x_k - b),

where α_k is chosen to minimize f(x) along the direction of the negative gradient.
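
A MATLAB sketch of steepest descent with the exact step length for a small, hypothetical SPD system (the formula for α_k used here is the one given on the next slide):

% Steepest descent on f(x) = 0.5*x'*A*x - b'*x with exact line search.
A = [3 1; 1 2];                    % hypothetical SPD matrix
b = [1; 1];
x = zeros(2,1);
for k = 1:1000
    r = b - A*x;                   % negative gradient -(A*x - b)
    if norm(r) < 1e-12, break; end
    alpha = (r'*r)/(r'*A*r);       % exact minimizer of f along r
    x = x + alpha*r;               % x_{k+1} = x_k - alpha*(A*x_k - b)
end
disp([x, A\b])                     % compare iterate with direct solution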

Thus, for the quadratic case, the exact step length is α_k = (r_k^T r_k)/(r_k^T A r_k), where r_k = b - A x_k is the negative gradient (the residual).

Alternative: choose α_k by a numerical line search.

Example of slow convergence for SD

Rosenbrock function: f(x_1, x_2) = (1 - x_1)^2 + 100 (x_2 - x_1^2)^2.
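
To see the slow convergence numerically, here is a MATLAB sketch of steepest descent with a simple backtracking (Armijo) line search on the Rosenbrock function; the starting point (-1.2, 1), iteration cap, and line-search constants are conventional but arbitrary choices, and the method will typically exhaust the iteration budget long before reaching the minimum at (1, 1):

% Steepest descent on the Rosenbrock function (illustrates slow convergence).
f = @(x) (1 - x(1))^2 + 100*(x(2) - x(1)^2)^2;
g = @(x) [-2*(1 - x(1)) - 400*x(1)*(x(2) - x(1)^2);
          200*(x(2) - x(1)^2)];
x = [-1.2; 1];                                       % standard starting point
for k = 1:20000
    d = -g(x);
    if norm(d) < 1e-6, break; end
    alpha = 1;
    while f(x + alpha*d) > f(x) - 1e-4*alpha*(d'*d)  % backtracking line search
        alpha = alpha/2;
    end
    x = x + alpha*d;
end
fprintf('%d iterations, x = (%g, %g)\n', k, x(1), x(2));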

Statistics of iterative methods

For a linear transformation, Cov(Ad) = A Cov(d) A^T (assuming d has a multivariate normal distribution), so for linear LS

Cov(m_L2) = (G^T G)^-1 G^T Cov(d) G (G^T G)^-1,   and if Cov(d) = σ^2 I:   Cov(m_L2) = σ^2 (G^T G)^-1.

However, for nonlinear regression we do not have a linear relationship between the data and the estimated model parameters, so we cannot use these formulas. Instead, linearize about m*:

F(m* + Δm) ≈ F(m*) + J(m*) Δm,   Cov(m*) ≈ (J(m*)^T J(m*))^-1

(not exact due to the linearization, so confidence intervals may not be accurate). With unknown data standard deviations, set σ_i = 1, let r_i = G(m*)_i - d_i, and use

s = [∑_{i=1}^{m} r_i^2 / (m - n)]^{1/2},   Cov(m*) = s^2 (J(m*)^T J(m*))^-1,

which can be used to establish confidence intervals (χ^2).
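
A MATLAB sketch of these covariance and confidence-interval formulas, applied to a hypothetical exponential fit; mstar below is simply assumed to be the converged LS solution, and the 1.96 factor is the usual normal approximation for 95% intervals:

% Approximate covariance and 95% confidence intervals after a nonlinear fit.
t = (0:9)';
d = 2*exp(-0.4*t) + 0.05*randn(size(t));   % hypothetical noisy data
Gfun = @(m) m(1)*exp(-m(2)*t);             % forward model G(m)
Jfun = @(m) [exp(-m(2)*t), -m(1)*t.*exp(-m(2)*t)];
mstar = [2; 0.4];                          % assumed converged solution m*
r = Gfun(mstar) - d;                       % residuals r_i = G(m*)_i - d_i
mm = numel(d); n = numel(mstar);
s2 = (r'*r)/(mm - n);                      % s^2 = sum(r_i^2)/(m - n)
Jstar = Jfun(mstar);
C = s2*inv(Jstar'*Jstar);                  % Cov(m*) = s^2 (J'J)^{-1}
ci = 1.96*sqrt(diag(C));                   % approximate 95% intervals
fprintf('m%d = %g +/- %g\n', [1:n; mstar'; ci']);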

Implementation Issues
1. Explicit (analytical) expressions for derivatives
2. Finite difference approximation for derivatives (sketch below)
3. When to stop iterating? ∇f(m) ≈ 0, ||m_{k+1} - m_k|| ≈ 0, |f(m_{k+1}) - f(m_k)| ≈ 0
4. Multistart method to optimize globally
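
For item 2, a minimal MATLAB sketch of a forward-difference Jacobian; the function name fdjac and the use of a single step size h for all parameters are hypothetical choices:

% Forward-difference approximation of the Jacobian, saved as fdjac.m.
function J = fdjac(Gfun, m, h)
    g0 = Gfun(m);                      % forward model at the current m
    J = zeros(numel(g0), numel(m));
    for j = 1:numel(m)
        mp = m;
        mp(j) = mp(j) + h;             % perturb one parameter at a time
        J(:, j) = (Gfun(mp) - g0)/h;   % forward-difference column j
    end
end

A call such as J = fdjac(@(m) forward_model(m), m, 1e-7) then stands in for the analytical Jacobian, at the cost of one extra forward-model evaluation per parameter.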

Iterative Methods

The SVD becomes impractical when the matrix has thousands of rows and columns. E.g., for a 256 x 256 cell tomography model with 100,000 ray paths, where each ray hits < 1% of the cells: ~50 GB to store the system matrix, U ~ 80 GB, V ~ 35 GB. This is a waste of storage when the system matrix is sparse. Iterative methods do not store the system matrix; they provide an approximate solution rather than the exact solution Gaussian elimination would give.
Definition of an iterative method: from a starting point x_0, take steps x_1, x_2, …, which hopefully converge toward the right solution x.
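
A rough back-of-the-envelope check of these numbers in MATLAB, assuming 8-byte doubles for the dense factors and about 16 bytes (value plus index) per stored entry of a sparse G:

% Storage estimates for the 256 x 256 cell, 100,000-ray example (rough).
ncells = 256*256; nrays = 1e5; fillfrac = 0.01;   % ~1% of cells hit per ray
dense_G = nrays*ncells*8/1e9;                     % ~52 GB dense system matrix
dense_U = nrays^2*8/1e9;                          % ~80 GB for U
dense_V = ncells^2*8/1e9;                         % ~34 GB for V
sparse_G = nrays*ncells*fillfrac*16/1e9;          % ~1 GB stored sparse
fprintf('G %.0f GB, U %.0f GB, V %.0f GB dense; ~%.1f GB sparse\n', ...
        dense_G, dense_U, dense_V, sparse_G);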

Iterative Methods

Kaczmarz's algorithm: each of the m rows of G, G_{i.} m = d_i, defines an (n-1)-dimensional hyperplane in R^n.
1) Project the initial solution m^(0) onto the hyperplane defined by the first row of G
2) Project m^(1) onto the hyperplane defined by the second row of G
3) … until projections have been done onto all hyperplanes
4) Start a new cycle of projections, until convergence
If Gm = d has a unique solution, Kaczmarz's algorithm will converge toward it.
If there are several solutions, it will converge toward the solution closest to m^(0).
If m^(0) = 0, we obtain the minimum-length solution.
If there is no exact solution, convergence fails; the iterates bounce around near an approximate solution.
If the hyperplanes are nearly orthogonal, convergence is fast; if they are nearly parallel, convergence is slow.

Kaczmarz's algorithm:

Let G_{i.} be the ith row of G, and consider the hyperplane defined by G_{i+1.} m = d_{i+1}. Since G_{i+1.}^T is perpendicular to this hyperplane, the update to m^(i) from the constraint due to row i+1 of G will be proportional to G_{i+1.}^T:

m^(i+1) = m^(i) + β G_{i+1.}^T

Since G_{i+1.} m^(i+1) = d_{i+1}, we have

G_{i+1.} (m^(i) + β G_{i+1.}^T) = d_{i+1}
G_{i+1.} m^(i) - d_{i+1} = -β G_{i+1.} G_{i+1.}^T
β = -[G_{i+1.} m^(i) - d_{i+1}] / [G_{i+1.} G_{i+1.}^T]

Kaczmarz's algorithm:
1) Let m^(0) = 0
2) For i = 0, 1, …, m-1, let m^(i+1) = m^(i) - G_{i+1.}^T [G_{i+1.} m^(i) - d_{i+1}] / [G_{i+1.} G_{i+1.}^T]
3) If the solution has not yet converged, go back to step 2

Kaczmarz's Method: [figure] successive projections between the two lines y = x - 1 and y = 1 in the (x, y) plane.

Specialized for tomography

Kaczmarz: m^(i+1) = m^(i) - G_{i+1.}^T [G_{i+1.} m^(i) - d_{i+1}] / [G_{i+1.} G_{i+1.}^T]

Thus we are always adding a multiple of a row of G to the current solution. The quantity [G_{i+1.} m^(i) - d_{i+1}] / [G_{i+1.} G_{i+1.}^T] is a normalized error in equation i+1, and the correction is spread over the elements of m appearing in equation i+1.

ART algorithm: an often-used approximation in ART is to replace all non-zero elements in row i+1 of G with ones, thus smearing the needed traveltime correction equally over all cells in ray path i+1.

SIRT algorithm: ART with higher accuracy; the corrections from all equations in a cycle are accumulated, and each cell is then updated with its average correction.

Kaczmarz's Algorithm

function x=kac(A,b,tolx,maxiter)
% Kaczmarz's algorithm for Ax=b (each row of A defines a hyperplane).
[m,n]=size(A);
AP=A';                          % columns of AP are the rows of A
x=zeros(n,1);
iter=0;
n2=zeros(m,1);
for i=1:m
  n2(i)=norm(AP(:,i),2)^2;      % squared norm of each row of A
end
while (iter <= maxiter)
  iter=iter+1;
  newx=x;
  for i=1:m
    % project onto the hyperplane defined by row i
    newx=newx-((newx'*AP(:,i)-b(i))/(n2(i)))*AP(:,i);
  end
  if (norm(newx-x)/(1+norm(x)) < tolx)
    x=newx;
    return;
  end
  x=newx;
end
disp('Max iterations exceeded.');
return;

SIRT Algorithm

function x=sirt(A,b,tolx,maxiter)
% SIRT: like ART, but each cell is updated with the average correction
% from all rays that pass through it.
alpha=1.0;
[m,n]=size(A);
A1=(A>0);                  % ray/cell incidence matrix
AP=A';
A1P=A1';
x=zeros(n,1);
iter=0;
N=zeros(m,1);              % number of cells hit by each ray
L=zeros(m,1);              % length of each ray
NRAYS=zeros(n,1);          % number of rays through each cell
for i=1:m
  N(i)=sum(A1P(:,i));
  L(i)=sum(AP(:,i));
end
for i=1:n
  NRAYS(i)=sum(A1(:,i));
end
while (1==1)
  iter=iter+1;
  if (iter > maxiter)
    disp('Max iterations exceeded.');
    return;
  end
  newx=x;
  deltax=zeros(n,1);
  for i=1:m
    q=A1P(:,i)'*newx;
    delta=b(i)/L(i)-q/N(i);          % normalized traveltime error for ray i
    deltax=deltax+delta*A1P(:,i);    % accumulate corrections
  end
  newx=newx+alpha*deltax./NRAYS;     % average correction per cell
  if (norm(newx-x)/(1+norm(x)) < tolx)
    x=newx;
    return;
  end
  x=newx;
end

ART Algorithm

function x=art(A,b,tolx,maxiter)
% ART: Kaczmarz-type sweeps where each nonzero element of a row is treated
% as one, so the correction is spread equally over the cells in the ray path.
alpha=1.0;
[m,n]=size(A);
A1=(A>0);                  % ray/cell incidence matrix
AP=A';
A1P=A1';
x=zeros(n,1);
iter=0;
N=zeros(m,1);              % number of cells hit by each ray
L=zeros(m,1);              % length of each ray
for i=1:m
  N(i)=sum(A1(i,:));
  L(i)=sum(A(i,:));
end
while (1==1)
  iter=iter+1;
  if (iter > maxiter)
    disp('Max iterations exceeded.');
    return;
  end
  newx=x;
  for i=1:m
    q=A1P(:,i)'*newx;
    delta=b(i)/L(i)-q/N(i);          % normalized traveltime error for ray i
    newx=newx+alpha*delta*A1P(:,i);  % spread correction over ray path i
  end
  if (norm(newx-x)/(1+norm(x)) < tolx)
    x=newx;
    return;
  end
  x=newx;
end

[Figure] True model and reconstructions from Kaczmarz, ART, and SIRT (all similar).

Conjugate Gradients Method

For a symmetric, positive definite system of equations Ax = b, minimizing φ(x) = 1/2 x^T A x - b^T x is equivalent to solving the system, since ∇φ(x) = Ax - b = 0 gives Ax = b.

CG method: construct a basis p_0, p_1, …, p_{n-1} such that p_i^T A p_j = 0 when i ≠ j. Such a basis is mutually conjugate with respect to A. We only walk once in each direction and minimize along it, writing x = ∑_{i=0}^{n-1} α_i p_i, so a maximum of n steps is required!

φ(α) = 1/2 [∑_i α_i p_i]^T A [∑_i α_i p_i] - b^T [∑_i α_i p_i]
     = 1/2 ∑_i ∑_j α_i α_j p_i^T A p_j - b^T [∑_i α_i p_i]   (summations run from 0 to n-1)

Conjugate Gradients Method

p_i^T A p_j = 0 when i ≠ j: such a basis is mutually conjugate with respect to A (the p_i are said to be 'A-orthogonal').

φ(α) = 1/2 ∑_i ∑_j α_i α_j p_i^T A p_j - b^T [∑_i α_i p_i]
     = 1/2 ∑_i α_i^2 p_i^T A p_i - b^T [∑_i α_i p_i]
     = 1/2 ∑_i (α_i^2 p_i^T A p_i - 2 α_i b^T p_i),   i.e. n independent terms.

Thus we can minimize φ(α) by minimizing the ith term, α_i^2 p_i^T A p_i - 2 α_i b^T p_i, for each i: differentiating with respect to α_i and setting the derivative to zero gives

α_i = b^T p_i / (p_i^T A p_i).

That is, IF we have a mutually conjugate basis, it is easy to minimize φ(α).

Conjugate Gradients Method

CG constructs sequences of x_i, r_i = b - A x_i, and p_i.
Start: x_0 = 0, r_0 = b, p_0 = r_0, α_0 = r_0^T r_0 / (p_0^T A p_0).
Assume at the kth iteration we have x_0, x_1, …, x_k; r_0, r_1, …, r_k; p_0, p_1, …, p_k; α_0, α_1, …, α_k.
Assume the first k+1 basis vectors p_i are mutually conjugate with respect to A, the first k+1 residuals r_i are mutually orthogonal, and r_i^T p_j = 0 for i ≠ j.
Let x_{k+1} = x_k + α_k p_k and r_{k+1} = r_k - α_k A p_k, which updates the residual correctly, since
r_{k+1} = b - A x_{k+1} = b - A(x_k + α_k p_k) = (b - A x_k) - α_k A p_k = r_k - α_k A p_k.

Conjugate Gradients Method

Let x_{k+1} = x_k + α_k p_k and r_{k+1} = r_k - α_k A p_k, with

β_{k+1} = r_{k+1}^T r_{k+1} / (r_k^T r_k),   p_{k+1} = r_{k+1} + β_{k+1} p_k,   b^T p_k = r_k^T r_k.

Now we need proof of the assumptions:
1) r_{k+1} is orthogonal to r_i for i < k
2) r_{k+1}^T r_k = 0
3) r_{k+1} is orthogonal to p_i for i ≤ k
4) p_{k+1}^T A p_i = 0 for i < k
5) for i = k: p_{k+1}^T A p_k = 0, i.e. CG generates a mutually conjugate basis.

Conjugate Gradients Method

Thus we have shown that CG generates a sequence of mutually conjugate basis vectors. In theory, the method will find an exact solution in at most n iterations.

Given a positive definite, symmetric system of equations Ax = b and an initial solution x_0, let β_0 = 0, p_{-1} = 0, r_0 = b - A x_0, k = 0:
1. If k > 0, let β_k = r_k^T r_k / (r_{k-1}^T r_{k-1})
2. Let p_k = r_k + β_k p_{k-1}
3. Let α_k = r_k^T r_k / (p_k^T A p_k)
4. Let x_{k+1} = x_k + α_k p_k
5. Let r_{k+1} = r_k - α_k A p_k
6. Let k = k + 1
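
A direct MATLAB transcription of this algorithm (a sketch; the file name cg_sketch.m, the relative-residual stopping test, and the iteration cap are added for practicality):

% Conjugate gradients for a symmetric positive definite system Ax = b.
function x = cg_sketch(A, b, tol, maxiter)
    x = zeros(size(b));
    r = b - A*x;                       % r_0
    p = zeros(size(b));                % p_{-1} = 0
    beta = 0;                          % beta_0 = 0
    for k = 0:maxiter
        if k > 0
            beta = (r'*r)/(rold'*rold);
        end
        p = r + beta*p;
        Ap = A*p;
        alpha = (r'*r)/(p'*Ap);
        x = x + alpha*p;
        rold = r;                      % keep r_k for the next beta
        r = r - alpha*Ap;
        if norm(r) < tol*norm(b), return; end
    end
end

For an SPD matrix, x = cg_sketch(A, b, 1e-12, length(b)) should agree with A\b to within the tolerance.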

Conjugate Gradients Least Squares Method

CG can only be applied to positive definite systems of equations, so it is not applicable to general LS problems. Instead, we can apply the CGLS method to min ||Gm - d||_2 by working with the normal equations

G^T G m = G^T d.

Conjugate Gradients Least Squares Method

For G^T G m = G^T d, the residual of the normal equations is

r_k = G^T d - G^T G m_k = G^T (d - G m_k) = G^T s_k,   where s_k = d - G m_k, and
s_{k+1} = d - G m_{k+1} = d - G(m_k + α_k p_k) = (d - G m_k) - α_k G p_k = s_k - α_k G p_k.

Given a system of equations Gm = d, let k = 0, m_0 = 0, p_{-1} = 0, β_0 = 0, s_0 = d - G m_0, r_0 = G^T s_0:
1. If k > 0, let β_k = r_k^T r_k / (r_{k-1}^T r_{k-1})
2. Let p_k = r_k + β_k p_{k-1}
3. Let α_k = r_k^T r_k / ([G p_k]^T [G p_k])
4. Let m_{k+1} = m_k + α_k p_k
5. Let s_{k+1} = s_k - α_k G p_k
6. Let r_{k+1} = G^T s_{k+1}
7. Let k = k + 1
We never compute G^T G, only the products G p_k and G^T s_{k+1}.
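
A MATLAB transcription of the CGLS iteration above (a sketch; the file name cgls_sketch.m, the stopping test on ||r_k||, and the iteration cap are additions):

% CGLS for min ||G*m - d||_2; only products with G and G' are formed.
function m = cgls_sketch(G, d, tol, maxiter)
    m = zeros(size(G,2), 1);
    s = d - G*m;                       % s_0 = d - G*m_0
    r = G'*s;                          % r_0 = G'*s_0
    p = zeros(size(m));                % p_{-1} = 0
    beta = 0;                          % beta_0 = 0
    for k = 0:maxiter
        if k > 0
            beta = (r'*r)/(rold'*rold);
        end
        p = r + beta*p;
        Gp = G*p;
        alpha = (r'*r)/(Gp'*Gp);
        m = m + alpha*p;
        s = s - alpha*Gp;
        rold = r;                      % keep r_k for the next beta
        r = G'*s;                      % r_{k+1} = G'*s_{k+1}
        if norm(r) < tol, return; end
    end
end

Each iteration costs one product with G and one with G', so the method is well suited to large sparse problems like the tomography example.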