Qualifier Exam in HPC, February 10, 2010
Quasi-Newton methods
Alexandru Cioaca
Quasi-Newton methods (nonlinear systems)
- Nonlinear systems: F(x) = 0, F : R^n -> R^n, F(x) = [ f_i(x_1, ..., x_n) ]^T
- Such systems appear in the simulation of processes (physical, chemical, etc.)
- We need an iterative algorithm to solve nonlinear systems
- Newton's method (not the same as nonlinear least-squares)
Quasi-Newton methods (nonlinear systems)
Standard assumptions:
1. F is continuously differentiable in an open convex set D
2. F' is Lipschitz continuous on D
3. There is x* in D such that F(x*) = 0 and F'(x*) is nonsingular
Newton's method: starting from an initial iterate x_0,
  x_{k+1} = x_k - F'(x_k)^{-1} F(x_k), {x_k} -> x*
until a termination criterion is satisfied
Quasi-Newton methods (nonlinear systems)
Linear model around x_n: M_n(x) = F(x_n) + F'(x_n)(x - x_n)
Setting M_n(x) = 0 gives x_{n+1} = x_n - F'(x_n)^{-1} F(x_n)
In practice the iterates are computed as:
  F'(x_n) s_n = F(x_n)
  x_{n+1} = x_n - s_n
Quasi-Newton methods (nonlinear systems)
Evaluating F'(x_n):
- symbolically
- numerically, with finite differences
- automatic differentiation
Solving the linear system F'(x_n) s_n = F(x_n):
- direct solvers: LU, Cholesky
- iterative methods: GMRES, CG
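A minimal sketch of the Newton iteration above, with the Jacobian approximated by forward finite differences and the linear system solved directly; the function names, step size eps, and tolerance are illustrative assumptions, not part of the original slides.

    import numpy as np

    def fd_jacobian(F, x, eps=1e-7):
        """Approximate F'(x) column by column with forward differences."""
        n = x.size
        Fx = F(x)
        J = np.empty((n, n))
        for j in range(n):
            xp = x.copy()
            xp[j] += eps
            J[:, j] = (F(xp) - Fx) / eps
        return J

    def newton(F, x0, tol=1e-10, max_iter=50):
        """Newton's method: solve F'(x_k) s_k = F(x_k), then x_{k+1} = x_k - s_k."""
        x = x0.astype(float)
        for _ in range(max_iter):
            Fx = F(x)
            if np.linalg.norm(Fx) < tol:
                break
            s = np.linalg.solve(fd_jacobian(F, x), Fx)   # direct (LU-based) solve
            x = x - s
        return x

    # Example: F(x) = [x0^2 + x1^2 - 1, x0 - x1], root near (0.707, 0.707)
    F = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])
    print(newton(F, np.array([1.0, 0.5])))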
Quasi-Newton methods (nonlinear systems)
Computational cost per iteration:
- F(x_k): n scalar function evaluations
- F'(x_k): n^2 scalar function evaluations
- LU: O(2n^3/3)
- Cholesky: O(n^3/3)
- Krylov methods: depends on the condition number
Quasi-Newton methods (nonlinear systems)
- LU and Cholesky are useful when we want to reuse the factorization (quasi-implicit schemes)
- Factorizations are difficult to parallelize and to load-balance
- Cholesky is faster and more stable, but requires an SPD matrix
- For large n (n ~ 10^6), a full factorization becomes impractical
- Krylov methods are built from easily parallelizable kernels (vector updates, inner products, matrix-vector products)
- CG is faster and more stable, but requires an SPD matrix
Quasi-Newton methods (nonlinear systems)
Advantages:
- Under the standard assumptions, Newton's method converges locally and quadratically
- There exists a domain of attraction S which contains the solution
- Once the iterates enter S, they stay in S and eventually converge to x*
- The algorithm is memoryless (self-corrective)
Quasi-Newton methods (nonlinear systems)
Disadvantages:
- Convergence depends on the choice of x_0
- F'(x) has to be evaluated at each iterate x_k
- The computation of F(x_k), F'(x_k), and s_k can be expensive
Quasi-Newton methods (nonlinear systems)
Implicit schemes for ODEs y' = f(t, y):
- Forward Euler: y_{n+1} = y_n + h f(t_n, y_n) (explicit)
- Backward Euler: y_{n+1} = y_n + h f(t_{n+1}, y_{n+1}) (implicit)
- Implicit schemes need the solution of a nonlinear system at each step (also Crank-Nicolson, Runge-Kutta, linear multistep formulas); see the sketch below
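A small sketch of one backward Euler step, where the implicit equation is handed to a nonlinear solver; it reuses the illustrative `newton` helper from the earlier sketch, and the stiff test problem is my own choice of example.

    # Backward Euler step for y' = f(t, y): solve G(y) = y - y_n - h*f(t_{n+1}, y) = 0
    import numpy as np

    def backward_euler_step(f, t_n, y_n, h):
        G = lambda y: y - y_n - h * f(t_n + h, y)
        return newton(G, y_n.copy())          # y_n is a reasonable initial guess

    # Example: stiff scalar ODE y' = -1000*(y - cos(t))
    f = lambda t, y: -1000.0 * (y - np.cos(t))
    y = np.array([1.0])
    for k in range(10):
        y = backward_euler_step(f, 0.01 * k, y, 0.01)
    print(y)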
Quasi-Newton methods (nonlinear systems)
How to circumvent evaluating F'(x_k)?
Broyden's method (direct update):
  B_{k+1} = B_k + (y_k - B_k s_k) s_k^T / (s_k^T s_k)
  x_{k+1} = x_k - B_k^{-1} F(x_k)
Inverse update (Sherman-Morrison formula):
  H_{k+1} = H_k + (s_k - H_k y_k) s_k^T H_k / (s_k^T H_k y_k)
  x_{k+1} = x_k - H_k F(x_k)
with s_k = x_{k+1} - x_k and y_k = F(x_{k+1}) - F(x_k)
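A minimal sketch of Broyden's method using the inverse (Sherman-Morrison) update, so no linear system is solved; the initial choice H_0 = I and the test problem are assumptions for illustration, and convergence still depends on the starting point.

    import numpy as np

    def broyden(F, x0, tol=1e-10, max_iter=100):
        """Broyden's method with the Sherman-Morrison inverse update (no linear solves)."""
        x = x0.astype(float)
        H = np.eye(x.size)                  # initial approximation of F'(x0)^{-1}
        Fx = F(x)
        for _ in range(max_iter):
            if np.linalg.norm(Fx) < tol:
                break
            s = -H @ Fx                     # x_{k+1} = x_k - H_k F(x_k)
            x_new = x + s
            F_new = F(x_new)
            y = F_new - Fx
            Hy = H @ y
            H = H + np.outer(s - Hy, s @ H) / (s @ Hy)   # H_{k+1}
            x, Fx = x_new, F_new
        return x

    F = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])
    print(broyden(F, np.array([1.0, 0.5])))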
Quasi-Newton methods (nonlinear systems)
Advantages:
- No need to compute F'(x_k)
- With the inverse update, no linear system has to be solved
Disadvantages:
- Only superlinear convergence (vs. quadratic for Newton)
- No longer memoryless
Quasi-Newton methods (unconstrained optimization)
Problem: find the global minimizer of a cost function f : R^n -> R, x* = arg min f(x)
If f is differentiable, the problem can be attacked by looking for zeros of the gradient
Quasi-Newton methods (unconstrained optimization)
Descent methods: x_{k+1} = x_k - λ_k P_k ∇f(x_k)
- P_k = I_n: steepest descent
- P_k = ∇²f(x_k)^{-1}: Newton's method
- P_k = B_k^{-1}: quasi-Newton
The angle between P_k ∇f(x_k) and ∇f(x_k) must be less than 90°
B_k has to mimic the behavior of the Hessian
Quasi-Newton methods (unconstrained optimization)
Global convergence:
- Line search
  - step length: backtracking, interpolation
  - sufficient decrease: Wolfe conditions
- Trust regions
A backtracking line-search sketch follows below.
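A minimal backtracking line-search sketch enforcing only the Armijo sufficient-decrease part of the Wolfe conditions (the curvature condition is omitted); the constants rho and c and the quadratic test function are illustrative assumptions.

    import numpy as np

    def backtracking(f, grad_f, x, p, alpha0=1.0, rho=0.5, c=1e-4):
        """Backtracking line search: accept alpha when
           f(x + alpha*p) <= f(x) + c * alpha * grad_f(x)^T p, with p a descent direction."""
        alpha = alpha0
        fx = f(x)
        slope = grad_f(x) @ p               # must be negative for a descent direction
        for _ in range(60):                 # cap the number of step reductions
            if f(x + alpha * p) <= fx + c * alpha * slope:
                break
            alpha *= rho
        return alpha

    # Example on the quadratic f(x) = 0.5*||x||^2 with the steepest-descent direction
    f = lambda x: 0.5 * x @ x
    g = lambda x: x
    x = np.array([2.0, -1.0])
    p = -g(x)
    print(backtracking(f, g, x, p))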
Quasi-Newton methods (unconstrained optimization)
For quasi-Newton, B_k has to resemble ∇²f(x_k). Desired properties of the update:
- single-rank update
- symmetry
- positive definiteness
- inverse update
(standard formulas for each are listed below)
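For reference, the standard secant updates usually matched to these properties, with s_k = x_{k+1} - x_k and y_k = ∇f(x_{k+1}) - ∇f(x_k); the pairing of formulas to the bullet labels is my assumption, not necessarily what appeared on the original slide.

    \begin{align*}
    \text{SR1 (single rank):} \quad
      & B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k} \\
    \text{PSB (symmetric):} \quad
      & B_{k+1} = B_k + \frac{(y_k - B_k s_k) s_k^T + s_k (y_k - B_k s_k)^T}{s_k^T s_k}
        - \frac{(y_k - B_k s_k)^T s_k}{(s_k^T s_k)^2}\, s_k s_k^T \\
    \text{BFGS (positive definite):} \quad
      & B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k} \\
    \text{BFGS inverse update:} \quad
      & H_{k+1} = \left(I - \rho_k s_k y_k^T\right) H_k \left(I - \rho_k y_k s_k^T\right)
        + \rho_k s_k s_k^T, \qquad \rho_k = \frac{1}{y_k^T s_k}
    \end{align*}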
Quasi-Newton methods (unconstrained optimization)
Computation:
- matrix updates, inner products
- DFP, PSB: 3 matrix-vector products
- BFGS: 2 matrix-matrix products
Storage:
- limited-memory versions (L-BFGS)
- store {s_k, y_k} for the last m iterations and recompute the action of H (see the sketch below)
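A minimal sketch of the L-BFGS two-loop recursion, which applies H_k to a gradient using only the stored {s_i, y_i} pairs; the initial scaling H_0 = gamma * I is the usual heuristic and is an assumption here.

    import numpy as np

    def lbfgs_direction(grad, s_list, y_list):
        """L-BFGS two-loop recursion: returns H_k @ grad using the last m pairs
           {s_i, y_i}, without ever forming H_k explicitly."""
        q = grad.copy()
        alphas, rhos = [], []
        for s, y in zip(reversed(s_list), reversed(y_list)):   # newest pair first
            rho = 1.0 / (y @ s)
            alpha = rho * (s @ q)
            q -= alpha * y
            alphas.append(alpha)
            rhos.append(rho)
        # Initial scaling H_0 = gamma * I (common heuristic)
        if s_list:
            gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
        else:
            gamma = 1.0
        r = gamma * q
        for (s, y), alpha, rho in zip(zip(s_list, y_list),     # oldest pair first
                                      reversed(alphas), reversed(rhos)):
            beta = rho * (y @ r)
            r += (alpha - beta) * s
        return r            # the quasi-Newton search direction is -r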
Further improvements
Preconditioning the linear system:
- For faster convergence one may solve K B_k p_k = K F(x_k) instead of B_k p_k = F(x_k)
- If B_k is SPD (and sparse), we can use sparse approximate inverses to generate the preconditioner K
- This preconditioner can be refined on a subspace of B_k using an algebraic multigrid technique
- This requires solving an eigenvalue problem
Further improvements
Model reduction:
- Sometimes the dimension of the system is very large
- Replace it with a smaller model that captures the essence of the original
- An approximation of the model variability can be retrieved from an ensemble of forward simulations
- The covariance matrix of the ensemble gives the reduced subspace
- Again we need to solve an eigenvalue problem (see the sketch below)
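A small sketch of extracting a reduced basis from an ensemble of forward simulations; computing the SVD of the centered snapshots (rather than forming the covariance matrix explicitly) is my own choice for the illustration.

    import numpy as np

    def pod_basis(snapshots, r):
        """Reduced basis from an ensemble: snapshots is (n, N), one column per member.
           The leading left singular vectors of the centered ensemble are the
           eigenvectors of the ensemble covariance matrix."""
        mean = snapshots.mean(axis=1, keepdims=True)
        X = snapshots - mean                        # centered ensemble
        U, svals, _ = np.linalg.svd(X, full_matrices=False)
        energy = svals**2 / np.sum(svals**2)        # fraction of variance per mode
        return U[:, :r], energy[:r]

    # Example: 1000-dimensional states, 30 ensemble members, keep 5 modes
    rng = np.random.default_rng(0)
    ens = rng.standard_normal((1000, 30))
    basis, energy = pod_basis(ens, 5)
    print(basis.shape, energy)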
QR/QL algorithms for symmetric matrices
- Solve the eigenvalue problem
- Iterative algorithm
- Uses a QR/QL factorization at each step (A = Q R, Q unitary, R upper triangular)
  for k = 1, 2, ...
      A_k = Q_k R_k
      A_{k+1} = R_k Q_k
  end
- The diagonal of A_k converges to the eigenvalues of A
QR/QL algorithms for symmetric matrices
- The matrix A is reduced to upper Hessenberg form before starting the iterations
- Householder reflections: U = I - 2 v v^T / (v^T v)
- The reduction is made column by column
- If A is symmetric, it is reduced to tridiagonal form (see the sketch below)
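A minimal sketch of the column-by-column reduction of a symmetric matrix to tridiagonal form with Householder reflections; it applies full dense reflectors for clarity rather than the blocked updates a real implementation would use.

    import numpy as np

    def householder_tridiag(A):
        """Reduce a symmetric matrix to tridiagonal form by similarity transforms
           with Householder reflectors U = I - 2 v v^T / (v^T v)."""
        T = A.astype(float).copy()
        n = T.shape[0]
        for k in range(n - 2):
            x = T[k+1:, k].copy()
            v = x.copy()
            v[0] += np.sign(x[0] if x[0] != 0 else 1.0) * np.linalg.norm(x)
            if np.linalg.norm(v) == 0:
                continue
            v /= np.linalg.norm(v)
            U = np.eye(n)
            U[k+1:, k+1:] -= 2.0 * np.outer(v, v)    # embed the reflector
            T = U @ T @ U                             # U is symmetric and orthogonal
        return T

    A = np.array([[4., 1., 2.], [1., 3., 0.], [2., 0., 1.]])
    T = householder_tridiag(A)
    print(np.round(T, 6))
    print(np.sort(np.linalg.eigvalsh(A)), np.sort(np.linalg.eigvalsh(T)))  # same spectrum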
QR/QL algorithms for symmetric matrices
- Convergence to a triangular form can be slow
- Origin shifts are used to accelerate it:
  for k = 1, 2, ...
      A_k - z_k I = Q_k R_k
      A_{k+1} = R_k Q_k + z_k I
  end
- Wilkinson shift
- QR makes heavy use of matrix-matrix products
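A small sketch of the shifted QR iteration with the Wilkinson shift and simple deflation; it works on the full symmetric matrix for readability, whereas a production code would first reduce it to tridiagonal form as described on the previous slide.

    import numpy as np

    def wilkinson_shift(T):
        """Eigenvalue of the trailing 2x2 block closest to T[-1, -1]."""
        a, b, c = T[-2, -2], T[-2, -1], T[-1, -1]
        d = (a - c) / 2.0
        sign = 1.0 if d >= 0 else -1.0
        return c - sign * b**2 / (abs(d) + np.hypot(d, b))

    def qr_eigenvalues(A, tol=1e-12, max_iter=1000):
        """Shifted QR iteration with deflation for a symmetric matrix."""
        T = A.astype(float).copy()
        eigs = []
        while T.shape[0] > 1:
            for _ in range(max_iter):
                if abs(T[-1, -2]) < tol * (abs(T[-1, -1]) + abs(T[-2, -2])):
                    break
                z = wilkinson_shift(T)
                Q, R = np.linalg.qr(T - z * np.eye(T.shape[0]))   # A_k - z_k I = Q_k R_k
                T = R @ Q + z * np.eye(T.shape[0])                # A_{k+1} = R_k Q_k + z_k I
            eigs.append(T[-1, -1])           # deflate the converged eigenvalue
            T = T[:-1, :-1]
        eigs.append(T[0, 0])
        return np.sort(np.array(eigs))

    A = np.array([[4., 1., 2.], [1., 3., 0.], [2., 0., 1.]])
    print(qr_eigenvalues(A))
    print(np.sort(np.linalg.eigvalsh(A)))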
Alternatives to quasi-Newton
Inexact Newton methods:
- Inner iteration: determine a search direction by solving the linear system to a certain tolerance
- Only Hessian-vector (or Jacobian-vector) products are necessary
- Outer iteration: line search along the search direction
Nonlinear CG:
- The residual is replaced by the gradient of the cost function
- Line search
- Different flavors
A Newton-Krylov sketch follows below.
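A minimal inexact Newton (Newton-Krylov) sketch: the inner GMRES iteration sees the Jacobian only through matrix-free finite-difference products, and the outer iteration takes the full step (no line search); the step size eps and the test problem are assumptions for illustration.

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, gmres

    def newton_gmres(F, x0, tol=1e-8, max_iter=30, eps=1e-7):
        """Inexact Newton: F'(x) s = -F(x) is solved only approximately with GMRES,
           using finite-difference Jacobian-vector products."""
        x = x0.astype(float)
        for _ in range(max_iter):
            Fx = F(x)
            if np.linalg.norm(Fx) < tol:
                break
            def jac_vec(v):
                # F'(x) v  ~  (F(x + eps*v) - F(x)) / eps
                return (F(x + eps * v) - Fx) / eps
            J = LinearOperator((x.size, x.size), matvec=jac_vec, dtype=float)
            s, info = gmres(J, -Fx)          # inner iteration, default GMRES tolerance
            x = x + s                        # outer iteration (no line search here)
        return x

    F = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])
    print(newton_gmres(F, np.array([1.0, 0.5])))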
Alternatives to quasi-Newton
Direct search:
- Does not involve derivatives of the cost function
- Uses a structure called a simplex to search for a decrease in f
- Stops when further progress cannot be achieved
- Can get stuck in a local minimum
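A derivative-free direct search is easy to demonstrate with SciPy's Nelder-Mead simplex implementation; the Rosenbrock test function and starting point are my own choices for the example.

    import numpy as np
    from scipy.optimize import minimize

    # Derivative-free simplex search on the Rosenbrock function (minimizer at (1, 1)).
    rosen = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
    res = minimize(rosen, x0=np.array([-1.2, 1.0]), method='Nelder-Mead')
    print(res.x, res.fun)    # a local search of this kind can also stall at a local minimum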
More alternatives
Monte Carlo:
- Computational method relying on random sampling
- Can be used for optimization (MDO) and for inverse problems, by using random walks
- When we have multiple correlated variables, the correlation matrix is SPD, so we can use Cholesky to factorize it (see the sketch below)
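A small sketch of why the Cholesky factor of an SPD correlation matrix is useful for Monte Carlo: it turns independent samples into correlated ones. The 3x3 correlation matrix below is an assumed example.

    import numpy as np

    # If C = L L^T (Cholesky) and z ~ N(0, I), then x = L z has covariance C.
    C = np.array([[1.0, 0.8, 0.3],
                  [0.8, 1.0, 0.5],
                  [0.3, 0.5, 1.0]])          # SPD correlation matrix
    L = np.linalg.cholesky(C)

    rng = np.random.default_rng(0)
    z = rng.standard_normal((3, 100000))     # independent standard normal samples
    x = L @ z                                 # correlated samples
    print(np.round(np.corrcoef(x), 2))        # empirical correlation, close to C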
Conclusions
- Newton's method is a very powerful method with many applications and uses (solving nonlinear systems, finding minima of cost functions)
- Newton's method can be used together with many other numerical algorithms (factorizations, linear solvers)
- Optimizing and parallelizing matrix-vector and matrix-matrix products, decompositions, and other numerical kernels can have a significant impact on overall performance
Thank you for your time!