Newton’s Method applied to a scalar function

Newton’s Method applied to a scalar function
Newton’s method for minimizing f(x): given a twice differentiable function f(x) and an initial solution x_0, generate a sequence of solutions x_1, x_2, … and stop if the sequence converges to a solution with ∇f(x) = 0.
1. Solve -∇f(x_k) ≈ ∇²f(x_k) Δx for the step Δx
2. Let x_{k+1} = x_k + Δx
3. Let k = k + 1
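A minimal sketch of this iteration in Python/NumPy, assuming the caller supplies the gradient and Hessian; the names newton_minimize, grad, hess and the quadratic test function are illustrative, not from the slides:

```python
import numpy as np

def newton_minimize(grad_f, hess_f, x0, tol=1e-8, max_iter=50):
    """Newton's method: solve hess f(x_k) dx = -grad f(x_k), step, repeat."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:           # stop once the gradient is ~ 0
            break
        dx = np.linalg.solve(hess_f(x), -g)   # Newton step from the Hessian system
        x = x + dx
    return x

# Example: minimize f(x, y) = (x - 2)^2 + 10*(y + 1)^2
grad = lambda x: np.array([2.0 * (x[0] - 2.0), 20.0 * (x[1] + 1.0)])
hess = lambda x: np.diag([2.0, 20.0])
print(newton_minimize(grad, hess, [0.0, 0.0]))   # -> [2, -1] in one step (quadratic f)
```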

Newton’s Method applied to LS
Newton’s method is not directly applicable to most nonlinear regression and inverse problems (the number of model parameters is not equal to the number of data points, and there is no exact solution to G(m) = d). Instead we use Newton’s method to minimize a nonlinear LS problem, e.g. fit a vector of n parameters to a data vector d:
f(m) = Σ_{i=1..m} [(G(m)_i - d_i)/σ_i]²
Let f_i(m) = (G(m)_i - d_i)/σ_i, i = 1, 2, …, m, and F(m) = [f_1(m) … f_m(m)]^T, so that
f(m) = Σ_{i=1..m} f_i(m)²
∇f(m) = Σ_{i=1..m} ∇[f_i(m)²]

NM: Solve -∇f(m_k) ≈ ∇²f(m_k) Δm
LHS: -∇f(m_k)_j = -Σ_{i=1..m} 2 f_i(m_k) ∇f_i(m_k)_j, i.e. -∇f(m_k) = -2 J(m_k)^T F(m_k), where J is the Jacobian with J_{ij} = ∂f_i/∂m_j.
RHS: ∇²f(m_k) Δm = [2 J(m_k)^T J(m_k) + Q(m_k)] Δm, where Q(m) = 2 Σ_{i=1..m} f_i(m) ∇²f_i(m).
Equating the two sides with H(m) = 2 J(m)^T J(m) + Q(m):
-2 J(m_k)^T F(m_k) = H(m_k) Δm
Δm = -H(m_k)⁻¹ 2 J(m_k)^T F(m_k) = -H(m_k)⁻¹ ∇f(m_k)   (eq. 9.19)
with f_i(m) = (G(m)_i - d_i)/σ_i, i = 1, 2, …, m, and F(m) = [f_1(m) … f_m(m)]^T.

Gauss-Newton (GN) method
∇²f(m_k) Δm = H(m_k) Δm = [2 J(m_k)^T J(m_k) + Q(m_k)] Δm
GN ignores Q(m) = 2 Σ f_i(m) ∇²f_i(m), i.e. it uses ∇²f(m) ≈ 2 J(m)^T J(m), assuming the f_i(m) will be reasonably small as we approach m*. That is, substituting this and ∇f(m) = 2 J(m)^T F(m) into "Solve -∇f(m_k) ≈ ∇²f(m_k) Δm" (the factors of 2 cancel) gives
J(m_k)^T J(m_k) Δm = -J(m_k)^T F(m_k)
with f_i(m) = (G(m)_i - d_i)/σ_i, i = 1, 2, …, m, and F(m) = [f_1(m) … f_m(m)]^T.
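A sketch of the GN iteration for a user-supplied residual vector F(m) and Jacobian J(m); the gauss_newton name and the exponential-fit toy problem below are illustrative assumptions, not from the slides:

```python
import numpy as np

def gauss_newton(F, J, m0, tol=1e-10, max_iter=100):
    """Gauss-Newton: solve J^T J dm = -J^T F at each iterate (Q(m) ignored)."""
    m = np.asarray(m0, dtype=float)
    for _ in range(max_iter):
        Fk, Jk = F(m), J(m)
        dm = np.linalg.solve(Jk.T @ Jk, -Jk.T @ Fk)
        m = m + dm
        if np.linalg.norm(dm) < tol:
            break
    return m

# Toy problem: fit y = m1 * exp(m2 * t) to data (sigma_i = 1 for all points)
t = np.linspace(0.0, 1.0, 20)
d = 2.0 * np.exp(-1.5 * t) + 0.01 * np.random.default_rng(0).standard_normal(t.size)
F = lambda m: m[0] * np.exp(m[1] * t) - d                      # f_i(m) = (G(m)_i - d_i) / sigma_i
J = lambda m: np.column_stack([np.exp(m[1] * t),               # d f_i / d m1
                               m[0] * t * np.exp(m[1] * t)])   # d f_i / d m2
print(gauss_newton(F, J, [1.0, -1.0]))                         # approaches [2, -1.5]
```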

Newton’s Method applied to LS
The Levenberg-Marquardt (LM) method uses
[J(m_k)^T J(m_k) + λ I] Δm = -J(m_k)^T F(m_k)
As λ → 0 this approaches the GN step; as λ becomes large it approaches steepest descent (SD), which steps down-gradient most rapidly. SD provides slow but certain convergence. Which value of λ to use? Use small values when GN is working well, and switch to larger values in problem areas. Start with a small value of λ, then adjust.
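A minimal sketch of the LM update with the simple λ adjustment described above (shrink λ after a successful step, grow it after a failed one); the levenberg_marquardt name and the factor-of-10 adjustment are assumptions for illustration. It can be tried on the F and J defined in the Gauss-Newton sketch:

```python
import numpy as np

def levenberg_marquardt(F, J, m0, lam=1e-3, tol=1e-10, max_iter=200):
    """LM: solve (J^T J + lam*I) dm = -J^T F; adjust lam based on whether the step helps."""
    m = np.asarray(m0, dtype=float)
    cost = lambda m: np.sum(F(m) ** 2)
    for _ in range(max_iter):
        Fk, Jk = F(m), J(m)
        dm = np.linalg.solve(Jk.T @ Jk + lam * np.eye(m.size), -Jk.T @ Fk)
        if cost(m + dm) < cost(m):     # GN-like step reduced the misfit: accept, trust GN more
            m, lam = m + dm, lam / 10.0
        else:                          # step failed: lean toward steepest descent
            lam *= 10.0
        if np.linalg.norm(dm) < tol:
            break
    return m
```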

Statistics of iterative methods
For multivariate normally distributed data d, Cov(Ad) = A Cov(d) A^T, so for the linear LS solution
Cov(m_L2) = (G^T G)⁻¹ G^T Cov(d) G (G^T G)⁻¹
and if Cov(d) = σ² I, Cov(m_L2) = σ² (G^T G)⁻¹.
However, we do not have a linear relationship between the data and the estimated model parameters for nonlinear regression, so we cannot use these formulas directly. Instead, linearize about m*:
F(m* + Δm) ≈ F(m*) + J(m*) Δm
Cov(m*) ≈ (J(m*)^T J(m*))⁻¹
This is not exact, due to the linearization, so the confidence intervals may not be accurate.
If the σ_i are unknown, set σ_i = 1, compute the residuals r_i = G(m*)_i - d_i and
s = [Σ r_i² / (m - n)]^(1/2), Cov(m*) = s² (J(m*)^T J(m*))⁻¹
and use this to establish confidence intervals and a χ² goodness-of-fit measure.
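A sketch of this covariance and confidence-interval computation, assuming the Jacobian J_star and the residual vector r at the solution m* are already available; the helper name and the 1.96 factor (approximate 95% intervals under normality) are assumptions:

```python
import numpy as np

def covariance_and_ci(J_star, r, n_params, z=1.96):
    """Cov(m*) = s^2 (J^T J)^-1 and +/- z * sqrt(diag(Cov)) confidence half-widths."""
    m_data = len(r)
    s2 = np.sum(r ** 2) / (m_data - n_params)      # s^2 = sum(r_i^2) / (m - n)
    cov = s2 * np.linalg.inv(J_star.T @ J_star)    # linearized (approximate) covariance
    half_width = z * np.sqrt(np.diag(cov))
    return cov, half_width

# Usage (names assumed): J_star = J(m_star); r = G(m_star) - d
# cov, ci = covariance_and_ci(J_star, r, n_params=m_star.size)
```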

Implementation Issues
1. Explicit (analytical) expressions for the derivatives
2. Finite-difference approximation of the derivatives (see the sketch below)
3. When to stop iterating? ∇f(m) ≈ 0, ||m_{k+1} - m_k|| ≈ 0, |f(m_{k+1}) - f(m_k)| ≈ 0 (see the stopping-criterion equations in the text)
4. Multistart method to optimize globally
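A sketch of item 2, a forward finite-difference approximation to the Jacobian of F(m); the step size sqrt(machine epsilon) is a common heuristic and an assumption here:

```python
import numpy as np

def fd_jacobian(F, m, h=None):
    """Forward differences: J[i, j] ~ (F(m + h*e_j)_i - F(m)_i) / h."""
    m = np.asarray(m, dtype=float)
    h = h or np.sqrt(np.finfo(float).eps)
    F0 = np.atleast_1d(F(m))
    J = np.zeros((F0.size, m.size))
    for j in range(m.size):
        m_step = m.copy()
        m_step[j] += h
        J[:, j] = (np.atleast_1d(F(m_step)) - F0) / h
    return J
```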

Iterative Methods
The SVD is impractical when the matrix has thousands of rows and columns. E.g., for a 256 x 256 cell tomography model with 100,000 ray paths, where each ray hits fewer than 1% of the cells, the system matrix takes ~50 GB to store, U ~80 GB and V ~35 GB. This is a waste of storage when the system matrix is sparse. Iterative methods do not need to store the system matrix as a dense array; they provide an approximate solution, rather than the exact one from Gaussian elimination.
Definition of an iterative method: from a starting point x_0, take steps x_1, x_2, …, which hopefully converge toward the right solution x.

Iterative Methods
Kaczmarz’s algorithm: each of the m rows of G, G_{i·} m = d_i, defines an (n-1)-dimensional hyperplane in R^n.
1) Project the initial solution m(0) onto the hyperplane defined by the first row of G
2) Project m(1) onto the hyperplane defined by the second row of G
3) … until projections have been done onto all hyperplanes
4) Start a new cycle of projections, repeating until convergence
If Gm = d has a unique solution, Kaczmarz’s algorithm will converge toward it. If there are several solutions, it will converge toward the solution closest to m(0); if m(0) = 0, we obtain the minimum-length solution. If there is no exact solution, convergence fails and the iterates bounce around near an approximate solution. If the hyperplanes are nearly orthogonal, convergence is fast; if the hyperplanes are nearly parallel, convergence is slow.
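A minimal sketch of these projection cycles for a dense G (the kaczmarz name, the fixed cycle count, and the small 2 x 2 example are illustrative assumptions):

```python
import numpy as np

def kaczmarz(G, d, m0=None, n_cycles=100):
    """Cycle through the rows of G, projecting m onto the hyperplane G[i] . m = d[i]."""
    m = np.zeros(G.shape[1]) if m0 is None else np.asarray(m0, dtype=float)
    for _ in range(n_cycles):
        for i in range(G.shape[0]):
            gi = G[i]
            m = m + (d[i] - gi @ m) / (gi @ gi) * gi   # orthogonal projection onto hyperplane i
    return m

# Consistent system with a unique solution; starting from m(0) = 0
G = np.array([[1.0, 2.0], [3.0, -1.0]])
d = np.array([5.0, 1.0])
print(kaczmarz(G, d))   # ~ [1, 2]
```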

Conjugate Gradients Method
For a symmetric, positive definite system of equations Ax = b, minimize the quadratic
φ(x) = 1/2 x^T A x - b^T x
∇φ(x) = Ax - b = 0, or Ax = b.
CG method: construct a basis p_0, p_1, …, p_{n-1} such that p_i^T A p_j = 0 when i ≠ j. Such a basis is mutually conjugate with respect to A. Only walk once in each direction and minimize: writing x = Σ_{i=0..n-1} α_i p_i, a maximum of n steps is required!
φ(α) = 1/2 [Σ_i α_i p_i]^T A [Σ_j α_j p_j] - b^T [Σ_i α_i p_i]
     = 1/2 Σ_i Σ_j α_i α_j p_i^T A p_j - b^T [Σ_i α_i p_i]   (summations over 0, …, n-1)

Conjugate Gradients Method
p_i^T A p_j = 0 when i ≠ j: such a basis is mutually conjugate with respect to A (the p_i are said to be ‘A-orthogonal’). Then
φ(α) = 1/2 Σ_i Σ_j α_i α_j p_i^T A p_j - b^T [Σ_i α_i p_i]
     = 1/2 Σ_i α_i² p_i^T A p_i - b^T [Σ_i α_i p_i]
     = 1/2 Σ_i (α_i² p_i^T A p_i - 2 α_i b^T p_i)   (n independent terms)
Thus φ(α) is minimized by minimizing each i-th term α_i² p_i^T A p_i - 2 α_i b^T p_i: differentiate with respect to α_i and set the derivative to zero:
α_i = b^T p_i / (p_i^T A p_i)
i.e., IF we have a mutually conjugate basis, it is easy to minimize φ(α).

Conjugate Gradients Method
CG constructs sequences of solutions x_i, residuals r_i = b - A x_i, and search directions p_i.
Start: x_0 = 0, r_0 = b, p_0 = r_0, α_0 = r_0^T r_0 / (p_0^T A p_0).
Assume that at the k-th iteration we have x_0, x_1, …, x_k; r_0, r_1, …, r_k; p_0, p_1, …, p_k; α_0, α_1, …, α_k.
Assume the first k+1 basis vectors p_i are mutually conjugate with respect to A, the first k+1 residuals r_i are mutually orthogonal, and r_i^T p_j = 0 for i ≠ j.
Let x_{k+1} = x_k + α_k p_k and r_{k+1} = r_k - α_k A p_k, which updates the residual correctly, since:
r_{k+1} = b - A x_{k+1} = b - A(x_k + α_k p_k) = (b - A x_k) - α_k A p_k = r_k - α_k A p_k

Conjugate Gradients Method
Let x_{k+1} = x_k + α_k p_k and r_{k+1} = r_k - α_k A p_k, with
β_{k+1} = r_{k+1}^T r_{k+1} / (r_k^T r_k)
p_{k+1} = r_{k+1} + β_{k+1} p_k
b^T p_k = r_k^T r_k (so α_k = r_k^T r_k / (p_k^T A p_k))
Now we need proof of the assumptions:
1) r_{k+1} is orthogonal to r_i for i < k
2) r_{k+1}^T r_k = 0
3) r_{k+1} is orthogonal to p_i for i ≤ k
4) p_{k+1}^T A p_i = 0 for i < k
5) i = k: p_{k+1}^T A p_k = 0, i.e. CG generates a mutually conjugate basis

Conjugate Gradients Method
Thus we have shown that CG generates a sequence of mutually conjugate basis vectors. In theory, the method will find an exact solution in n iterations.
Given a positive definite, symmetric system of equations Ax = b and an initial solution x_0, let β_0 = 0, p_{-1} = 0, r_0 = b - A x_0, k = 0:
1. If k > 0, let β_k = r_k^T r_k / (r_{k-1}^T r_{k-1})
2. Let p_k = r_k + β_k p_{k-1}
3. Let α_k = r_k^T r_k / (p_k^T A p_k)
4. Let x_{k+1} = x_k + α_k p_k
5. Let r_{k+1} = r_k - α_k A p_k
6. Let k = k + 1
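A direct transcription of steps 1-6 into Python/NumPy; the early stopping test on r_k^T r_k is an added assumption, not part of the algorithm above:

```python
import numpy as np

def conjugate_gradients(A, b, x0=None, tol=1e-12, max_iter=None):
    """CG for a symmetric positive definite system Ax = b (steps 1-6 above)."""
    n = len(b)
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x
    p = np.zeros(n)                    # p_{-1} = 0
    rr_prev = None                     # r_{k-1}^T r_{k-1}
    for k in range(max_iter or n):
        rr = r @ r
        if rr < tol:                   # added stopping test (assumption)
            break
        beta = 0.0 if k == 0 else rr / rr_prev
        p = r + beta * p
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rr_prev = rr
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradients(A, b))       # matches np.linalg.solve(A, b)
```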

Conjugate Gradients Least Squares Method
CG can only be applied to positive definite systems of equations, and is thus not applicable to general LS problems. Instead, we can apply the CGLS method to min ||Gm - d||_2 via the normal equations
G^T G m = G^T d

Conjugate Gradients Least Squares Method
Apply CG to G^T G m = G^T d without ever forming G^T G. With s_k = d - G m_k,
r_k = G^T d - G^T G m_k = G^T (d - G m_k) = G^T s_k
s_{k+1} = d - G m_{k+1} = d - G(m_k + α_k p_k) = (d - G m_k) - α_k G p_k = s_k - α_k G p_k
Given a system of equations Gm = d, let k = 0, m_0 = 0, p_{-1} = 0, β_0 = 0, s_0 = d - G m_0 = d, r_0 = G^T s_0:
1. If k > 0, let β_k = r_k^T r_k / (r_{k-1}^T r_{k-1})
2. Let p_k = r_k + β_k p_{k-1}
3. Let α_k = r_k^T r_k / ([G p_k]^T [G p_k])
4. Let m_{k+1} = m_k + α_k p_k
5. Let s_{k+1} = s_k - α_k G p_k
6. Let r_{k+1} = G^T s_{k+1}
7. Let k = k + 1
We never compute G^T G, only the products G p_k and G^T s_{k+1}.
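A sketch of steps 1-7, again with an added stopping test; the small overdetermined example is illustrative and can be checked against np.linalg.lstsq:

```python
import numpy as np

def cgls(G, d, tol=1e-12, max_iter=None):
    """CGLS: CG on G^T G m = G^T d using only products with G and G^T."""
    m = np.zeros(G.shape[1])
    s = d - G @ m                      # s_0 = d
    r = G.T @ s                        # r_0 = G^T s_0
    p = np.zeros_like(m)               # p_{-1} = 0
    rr_prev = None
    for k in range(max_iter or G.shape[1]):
        rr = r @ r
        if rr < tol:                   # added stopping test (assumption)
            break
        beta = 0.0 if k == 0 else rr / rr_prev
        p = r + beta * p
        Gp = G @ p
        alpha = rr / (Gp @ Gp)
        m = m + alpha * p
        s = s - alpha * Gp
        r = G.T @ s                    # r_{k+1} = G^T s_{k+1}
        rr_prev = rr
    return m

G = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
d = np.array([1.0, 2.0, 2.0])
print(cgls(G, d))                      # matches np.linalg.lstsq(G, d, rcond=None)[0]
```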

L1 Regression
LS (L2) is strongly affected by outliers. If outliers are due to incorrect measurements, the inversion should minimize their effect on the estimated model. The sensitivity of LS to outliers reflects the rapid fall-off of the tails of the normal distribution. In contrast, the exponential distribution has a longer tail, implying that the probability of realizing data far from the mean is higher: a few data points several σ from the mean are much more probable if drawn from an exponential rather than from a normal distribution. Therefore methods based on exponential distributions are able to handle outliers better than methods based on normal distributions. Such methods are said to be robust.

L1 Regression
min Σ |d_i - (Gm)_i| / σ_i = min ||d_w - G_w m||_1
This is more robust to outliers because the misfit is not squared.
Example: repeating a measurement m times, with a single scalar model parameter m:
[1 1 … 1]^T m = [d_1 d_2 … d_m]^T
m_L2 = (G^T G)⁻¹ G^T d = (1/m) Σ d_i (the mean)
f(m) = ||d - Gm||_1 = Σ |d_i - m|
is non-differentiable where m = d_i, but convex, so local minima are global minima. Elsewhere,
f′(m) = -Σ sgn(d_i - m), where sgn(x) = +1 if x > 0, -1 if x < 0, 0 if x = 0,
and this is 0 when half the terms are + and half are -, i.e. m_est = median, where 1/2 of the data are < m_est and 1/2 are > m_est.
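A quick numeric illustration of this robustness (the data values are made up for illustration): with one outlier, the L2 estimate (the mean) is pulled far off, while the L1 estimate (the median) barely moves:

```python
import numpy as np

# Repeated measurements of one quantity, with a single bad value (illustrative numbers)
d = np.array([10.1, 9.9, 10.0, 10.2, 9.8, 55.0])
print(np.mean(d))     # L2 estimate: 17.5, dragged toward the outlier
print(np.median(d))   # L1 estimate: ~10.05, essentially unaffected
```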

L1 Regression
Finding min ||d_w - G_w m||_1 is not trivial. Several methods are available, such as IRLS (iteratively reweighted least squares), which solves a series of LS problems converging to the 1-norm solution. With r = d - Gm,
f(m) = ||d - Gm||_1 = ||r||_1 = Σ |r_i|
is non-differentiable if any r_i = 0. At other points:
∂f(m)/∂m_k = -Σ_i G_{i,k} sgn(r_i) = -Σ_i G_{i,k} r_i / |r_i|
∇f(m) = -G^T R r = -G^T R (d - Gm), where R is diagonal with R_{i,i} = 1/|r_i|
Setting ∇f(m) = -G^T R (d - Gm) = 0 gives
G^T R G m = G^T R d
But R depends on m, so this is a nonlinear system: iterate, recomputing R from the current residuals at each step (IRLS!).
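A minimal IRLS sketch for min ||d - G m||_1; the epsilon floor on |r_i| (to avoid dividing by zero) and the fixed iteration count are assumptions, not part of the slide:

```python
import numpy as np

def irls_l1(G, d, n_iter=50, eps=1e-8):
    """IRLS: repeatedly solve G^T R G m = G^T R d with R = diag(1/|r_i|)."""
    m = np.linalg.lstsq(G, d, rcond=None)[0]            # start from the L2 solution
    for _ in range(n_iter):
        r = d - G @ m
        R = np.diag(1.0 / np.maximum(np.abs(r), eps))   # guard against r_i = 0 (assumption)
        m = np.linalg.solve(G.T @ R @ G, G.T @ R @ d)
    return m

# Same outlier-contaminated data as above: the 1-norm fit stays near 10
G = np.ones((6, 1))
d = np.array([10.1, 9.9, 10.0, 10.2, 9.8, 55.0])
print(irls_l1(G, d))   # ~ [10], versus the L2 mean of 17.5
```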

convolution
s(t) = h(t) * f(t) = ∫ h(t - k) f(k) dk = Σ_k h_{t-k} f_k
Assuming h(t) and f(t) are of length 5 and 3, respectively:

| h_1  0    0   |             | s_1 |
| h_2  h_1  0   |   | f_1 |   | s_2 |
| h_3  h_2  h_1 | × | f_2 | = | s_3 |
| h_4  h_3  h_2 |   | f_3 |   | s_4 |
| h_5  h_4  h_3 |             | s_5 |
| 0    h_5  h_4 |             | s_6 |
| 0    0    h_5 |             | s_7 |

Here, a recursive solution is easy.
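A sketch of this matrix form of convolution, with illustrative numeric values for h and f (not from the slides), checked against np.convolve:

```python
import numpy as np

h = np.array([1.0, 2.0, 3.0, 2.0, 1.0])   # length 5 (values assumed for illustration)
f = np.array([0.5, -1.0, 2.0])            # length 3 (values assumed for illustration)

# Build the (5 + 3 - 1) x 3 convolution matrix whose columns are shifted copies of h
H = np.zeros((len(h) + len(f) - 1, len(f)))
for j in range(len(f)):
    H[j:j + len(h), j] = h

print(H @ f)               # s = h * f via the matrix form above
print(np.convolve(h, f))   # same result
```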

convolution
‘Shaping’ filtering: A x = D, where D is the ‘desired’ response and A, D are known. Forming the normal equations A^T A x = A^T D, the matrix A^T A is a product of shifted copies of a_t:

| a_1   a_2   a_3  … |   | a_1  a_0  a_-1 |
| a_0   a_1   a_2  … | × | a_2  a_1  a_0  | = Φ
| a_-1  a_0   a_1  … |   | a_3  a_2  a_1  |
                         |  …    …    …   |

The matrix Φ_ij is formed by the auto-correlation of a_t, with the zero-lag values along the diagonal and the auto-correlations of successively higher lags off the diagonal. Φ_ij is symmetric of order n.

convolution
A^T D becomes

| a_-1  a_-2  a_-3  … |   |  …   |   | c_1  |
| a_0   a_-1  a_-2  … | × | d_-1 | = | c_0  |
| a_1   a_0   a_-1  … |   | d_0  |   | c_-1 |
                          | d_1  |   |  …   |
                          |  …   |

The vector c_i is formed by the cross-correlation of the elements of A and D. Solution: x = (A^T A)⁻¹ A^T D = Φ⁻¹ c.

Example
Find a filter, 3 elements long, that convolved with (2, 1) produces (1, 0, 0, 0):
(2, 1) * (f_1, f_2, f_3) = (1, 0, 0, 0)
Form the convolution matrix A for (2, 1), take D = (1, 0, 0, 0)^T, and solve the normal equations as above: f = (A^T A)⁻¹ A^T D = Φ⁻¹ c.
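This example can be checked numerically; a sketch (variable names and layout are my own) that builds the 4 x 3 convolution matrix for (2, 1) and solves the normal equations:

```python
import numpy as np

a = np.array([2.0, 1.0])              # the known wavelet (2, 1)
D = np.array([1.0, 0.0, 0.0, 0.0])    # desired response

# Convolution matrix: (2, 1) * (f1, f2, f3) has length 2 + 3 - 1 = 4
A = np.zeros((4, 3))
for j in range(3):
    A[j:j + 2, j] = a

f = np.linalg.solve(A.T @ A, A.T @ D)   # normal equations: Phi f = c
print(f)                  # ~ [0.494, -0.235, 0.094]
print(np.convolve(a, f))  # ~ [0.99, 0.02, -0.05, 0.09], the best L2 approximation to (1, 0, 0, 0)
```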