Outline Preface Fundamentals of Optimization

Outline Preface Fundamentals of Optimization Unconstrained Optimization Ideas of finding solutions One-Dimensional Search Gradient Methods Newton's Method and Its Variations

Newton’s Method In the above, only the first derivatives (gradients) are used to define a suitable search direction. This is referred to as the incremental search. If direct approaches are considered, the task is to solve $df(x)/dx = 0$, or $\nabla f(x) = 0$ in the multivariable case. Recall that in the one-dimensional case we also mentioned Newton’s method, which solves $f'(x) = 0$ by the iteration $x^{(k+1)} = x^{(k)} - f'(x^{(k)})/f''(x^{(k)})$. In other words, we can use the second derivatives to define the search.

Newton’s Method This is referred to as Newton’s method or the Newton-Raphson method. Note that Newton’s formula is obtained from the quadratic form. Thus, the idea is that, given a starting point, a quadratic approximation of the objective function at this point is obtained. Newton’s formula gives the minimizer of this quadratic function. This minimizer is then used as the next starting point, where a new quadratic approximation and its minimizer are obtained. This procedure is repeated.
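The one-dimensional iteration $x^{(k+1)} = x^{(k)} - f'(x^{(k)})/f''(x^{(k)})$ can be sketched in a few lines. This is only a minimal sketch: the function name `newton_1d`, the stopping rule, and the test objective $f(x) = x^2/2 - \sin x$ are assumptions chosen for illustration.

```python
import math

def newton_1d(df, d2f, x0, tol=1e-10, max_iter=50):
    """Iterate x <- x - f'(x)/f''(x) until the step is smaller than tol."""
    x = x0
    for _ in range(max_iter):
        step = df(x) / d2f(x)   # assumes f''(x) != 0 along the way
        x -= step
        if abs(step) < tol:
            break
    return x

# Illustrative objective (an assumption): f(x) = x**2 / 2 - sin(x)
df = lambda x: x - math.cos(x)      # f'(x)
d2f = lambda x: 1.0 + math.sin(x)   # f''(x)

x_star = newton_1d(df, d2f, x0=0.5)
print(x_star, df(x_star))
```

Since $f''(x) > 0$ near the returned point, the stationary point found here is a local minimizer of the illustrative objective.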

Newton’s Method

Newton’s Method In fact, the approach can be stated as follows. Given an objective function $f(x)$, define $\nabla f(x) = g(x)$ and $\nabla^2 f(x) = H(x)$. Then the search algorithm is $x^{(k+1)} = x^{(k)} - H^{-1}(x^{(k)})\,g(x^{(k)})$. For a minimum we require $H(x^{(k)}) > 0$, where for a matrix $> 0$ means positive definite. This recursive formula is referred to as Newton’s method. Newton’s method indeed performs better than the steepest descent method if the initial point is close to the minimizer.
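The recursion can be recovered from the quadratic approximation of $f$ at $x^{(k)}$. A brief derivation, using the same $g$ and $H$ and assuming $H(x^{(k)})$ is invertible:

```latex
\begin{align*}
q(x) &= f(x^{(k)}) + g(x^{(k)})^T (x - x^{(k)})
        + \tfrac{1}{2}\,(x - x^{(k)})^T H(x^{(k)})\,(x - x^{(k)}) \\
\nabla q(x) &= g(x^{(k)}) + H(x^{(k)})\,(x - x^{(k)}) = 0 \\
\Rightarrow \quad x^{(k+1)} &= x^{(k)} - H^{-1}(x^{(k)})\, g(x^{(k)})
\end{align*}
```

The stationary point of $q$ is its minimizer exactly when $H(x^{(k)}) > 0$.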

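Newton’s Method Example: consider the starting point $x^{(0)} = [3, -1, 0, 1]^T$ for $f(x) = (x_1 + 10x_2)^2 + 5(x_3 - x_4)^2 + (x_2 - 2x_3)^4 + 10(x_1 - x_4)^4$ (Powell’s function). Ans: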

®Copyright of Shun-Feng Su Newton’s Method The search algorithm is $x^{(k+1)} = x^{(k)} - H^{-1}(x^{(k)})\,g(x^{(k)})$. For the first step, $x^{(1)} = x^{(0)} - H^{-1}(x^{(0)})\,g(x^{(0)})$ with $g(x^{(0)}) = [306, -144, -2, -310]^T$ and with $H(x^{(0)})$ and its inverse evaluated at the same point. This gives $x^{(1)} = [1.5873, \ldots]^T$ and $f(x^{(1)}) = 31.8$.

Newton’s Method With a similar process, we get $x^{(2)} = [1.0582, \ldots]^T$ and $f(x^{(2)}) = 6.28$. Conducting the process again gives $x^{(3)} = [0.7037, \ldots]^T$ and $f(x^{(3)}) = 1.24$. Continuing in this way, the approach seems promising for finding the minimizer.
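The iterations of this example can be reproduced numerically. The sketch below takes the objective to be Powell’s function, $f(x) = (x_1 + 10x_2)^2 + 5(x_3 - x_4)^2 + (x_2 - 2x_3)^4 + 10(x_1 - x_4)^4$; this is an assumption, chosen because it reproduces the gradient $g(x^{(0)})$ and the function values quoted in the example. The gradient and Hessian are coded by hand from that formula.

```python
import numpy as np

def f(x):
    x1, x2, x3, x4 = x
    return (x1 + 10*x2)**2 + 5*(x3 - x4)**2 + (x2 - 2*x3)**4 + 10*(x1 - x4)**4

def grad(x):
    x1, x2, x3, x4 = x
    return np.array([
        2*(x1 + 10*x2) + 40*(x1 - x4)**3,
        20*(x1 + 10*x2) + 4*(x2 - 2*x3)**3,
        10*(x3 - x4) - 8*(x2 - 2*x3)**3,
        -10*(x3 - x4) - 40*(x1 - x4)**3,
    ])

def hess(x):
    x1, x2, x3, x4 = x
    a = 12*(x2 - 2*x3)**2    # second-derivative factor of the (x2 - 2*x3)**4 term
    b = 120*(x1 - x4)**2     # second-derivative factor of the 10*(x1 - x4)**4 term
    return np.array([
        [2 + b,  20.0,    0.0,      -b],
        [20.0,   200 + a, -2*a,     0.0],
        [0.0,    -2*a,    10 + 4*a, -10.0],
        [-b,     0.0,     -10.0,    10 + b],
    ])

x = np.array([3.0, -1.0, 0.0, 1.0])
history = [x.copy()]
for k in range(3):
    # Solve H d = -g rather than forming the inverse explicitly
    d = np.linalg.solve(hess(x), -grad(x))
    x = x + d
    history.append(x.copy())
    print(k + 1, x, f(x))
```

Solving the linear system $H d = -g$ gives the same step as multiplying by $H^{-1}$, without computing the inverse.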

Newton’s Method As in the one-variable case, there is no guarantee that Newton’s algorithm heads in the direction of decreasing values of the objective function if $H(x^{(k)})$ is not positive definite. Even if $H(x^{(k)}) > 0$, Newton’s method may not be a descent method (i.e., it is possible that $f(x^{(k+1)}) \ge f(x^{(k)})$). Again, Newton’s method has superior convergence properties when the starting point is near the solution.
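A small one-variable case shows that a full Newton step can increase $f$ even when the second derivative is positive. The objective $f(x) = \sqrt{1 + x^2}$ and the starting point $x^{(0)} = 2$ are assumptions chosen for illustration; for this function the Newton update simplifies to $x^{(k+1)} = -(x^{(k)})^3$.

```python
import math

f = lambda x: math.sqrt(1.0 + x * x)
df = lambda x: x / math.sqrt(1.0 + x * x)   # f'(x)
d2f = lambda x: (1.0 + x * x) ** -1.5       # f''(x), positive for every x

x0 = 2.0
x1 = x0 - df(x0) / d2f(x0)   # pure Newton step; algebraically x1 = -x0**3
print(x1, f(x0), f(x1))
```

Here the step overshoots the minimizer at $x = 0$, so $f(x^{(1)}) > f(x^{(0)})$ although $f''(x^{(0)}) > 0$.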

Newton’s Method When $f(x)$ is a quadratic function, Newton’s method reaches the point where $\nabla f(x) = 0$ (the minimizer) in just one step. Consider the quadratic function $f(x) = \tfrac{1}{2}x^T Q x - b^T x$, where $Q$ is a symmetric, invertible matrix. Then $\nabla f(x) = g(x) = Qx - b$ and $\nabla^2 f(x) = H(x) = Q$. Given an initial $x^{(0)}$, $x^{(1)} = x^{(0)} - H^{-1}(x^{(0)})\,g(x^{(0)}) = x^{(0)} - Q^{-1}(Qx^{(0)} - b) = Q^{-1}b = x^*$.
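The one-step property is easy to check numerically. In the sketch below, the matrix `Q`, the vector `b`, and the starting point are arbitrary illustrative values, with `Q` chosen symmetric positive definite.

```python
import numpy as np

Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])      # symmetric positive definite (illustrative)
b = np.array([1.0, 2.0])

x0 = np.array([10.0, -7.0])     # arbitrary starting point
g0 = Q @ x0 - b                 # gradient at x0
x1 = x0 - np.linalg.solve(Q, g0)   # Newton step with H = Q

x_star = np.linalg.solve(Q, b)  # solution of Qx = b
print(x1, x_star)
```

Changing `x0` does not change `x1`, which always coincides with the solution of $Qx = b$ up to rounding.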

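Newton’s Method Definition: Given a sequence $\{x^{(k)}\}$ that converges to $x^*$ (that is, $\lim_{k\to\infty}\|x^{(k)} - x^*\| = 0$), we say that the order of convergence is $p$, where $p \in \mathbb{R}$, if $0 < \lim_{k\to\infty} \frac{\|x^{(k+1)} - x^*\|}{\|x^{(k)} - x^*\|^p} < \infty$. If $\lim_{k\to\infty} \frac{\|x^{(k+1)} - x^*\|}{\|x^{(k)} - x^*\|^p} = 0$ for all $p > 0$, then we say the order of convergence is $\infty$.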

Newton’s Method Theorem: Suppose that $f \in \mathcal{C}^3$ and $x^*$ is a point such that $\nabla f(x^*) = 0$ and $H(x^*)$ is invertible. Then for all $x^{(0)}$ sufficiently close to $x^*$, Newton’s method is well defined for all $k$ and converges to $x^*$ with order of convergence at least 2. Note that for a quadratic function, where the minimizer is reached in one step, the order of convergence of Newton’s algorithm is $\infty$ for any initial point. Theorem: The order of convergence of the steepest descent algorithm is 1 in the worst case.
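The order of convergence can be estimated from the errors $e_k = |x^{(k)} - x^*|$ through the ratio $\log(e_{k+1}/e_k)/\log(e_k/e_{k-1})$. The sketch below applies this to Newton’s method on the one-variable objective $f(x) = e^x - 2x$, an assumed test function whose minimizer is $x^* = \ln 2$; the estimator itself is a common heuristic and not part of the theorem.

```python
import math

x_star = math.log(2.0)    # minimizer of the illustrative f(x) = exp(x) - 2x
x = 1.0
errors = [abs(x - x_star)]
for _ in range(3):
    # Newton step x <- x - f'(x)/f''(x) with f'(x) = exp(x) - 2, f''(x) = exp(x)
    x = x - (math.exp(x) - 2.0) / math.exp(x)
    errors.append(abs(x - x_star))

# Estimate p from log(e_{k+1}/e_k) / log(e_k/e_{k-1})
orders = [
    math.log(errors[k + 1] / errors[k]) / math.log(errors[k] / errors[k - 1])
    for k in range(1, len(errors) - 1)
]
print(errors, orders)
```

Only a few iterations are used because the errors quickly approach machine precision, after which the ratio is no longer meaningful.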

Convergence of Newton's Method

Convergence of Steepest Descent

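Newton’s Method Theorem: Let $\{x^{(k)}\}$ be the sequence generated by Newton’s method. If $H(x^{(k)}) > 0$ and $g(x^{(k)}) \ne 0$, then the direction $d^{(k)} = -H^{-1}(x^{(k)})\,g(x^{(k)}) = x^{(k+1)} - x^{(k)}$ is a descent direction. A descent direction means that there exists an $\bar{\alpha} > 0$ such that for any $\alpha \in (0, \bar{\alpha}]$, $f(x^{(k)} + \alpha d^{(k)}) < f(x^{(k)})$. Proof: Define $\phi(\alpha) = f(x^{(k)} + \alpha d^{(k)})$. Then $\phi'(\alpha) = \nabla f(x^{(k)} + \alpha d^{(k)})^T d^{(k)}$, so $\phi'(0) = \nabla f(x^{(k)})^T d^{(k)} = -g^T(x^{(k)})\,H^{-1}(x^{(k)})\,g(x^{(k)}) < 0$, because $H^{-1}(x^{(k)})$ is positive definite and $g(x^{(k)}) \ne 0$. In other words, $f(x^{(k)} + \alpha d^{(k)}) < f(x^{(k)})$ for a small $\alpha > 0$.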

Newton’s Method Since the direction $d^{(k)} = -H^{-1}(x^{(k)})\,g(x^{(k)})$ is a descent direction, the following modification of Newton’s method is possible: $x^{(k+1)} = x^{(k)} - \alpha_k H^{-1}(x^{(k)})\,g(x^{(k)})$, where $\alpha_k = \arg\min_{\alpha \ge 0} f(x^{(k)} - \alpha H^{-1}(x^{(k)})\,g(x^{(k)}))$. Similar to the steepest descent method, we perform a line search along the direction $-H^{-1}(x^{(k)})\,g(x^{(k)})$. It can be concluded that this modified Newton’s method has the descent property.
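A sketch of Newton’s method with a line search is given below. Two assumptions are made: the exact $\arg\min$ over $\alpha$ is replaced by a backtracking (Armijo) search, which is a different and inexact rule, and the test objective $f(x) = \sum_i \sqrt{1 + x_i^2}$ is chosen for illustration because its Hessian is positive definite everywhere while the pure Newton step diverges from the starting point used here. The name `newton_line_search` is hypothetical.

```python
import numpy as np

def f(x):
    return np.sum(np.sqrt(1.0 + x**2))

def grad(x):
    return x / np.sqrt(1.0 + x**2)

def hess(x):
    return np.diag((1.0 + x**2) ** -1.5)   # positive definite for every x

def newton_line_search(x, tol=1e-10, max_iter=100):
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = -np.linalg.solve(hess(x), g)   # Newton direction (descent since H > 0)
        alpha = 1.0
        # Backtracking (Armijo) search: an inexact stand-in for arg min over alpha
        while f(x + alpha * d) > f(x) + 1e-4 * alpha * (g @ d) and alpha > 1e-12:
            alpha *= 0.5
        x = x + alpha * d
    return x

x_min = newton_line_search(np.array([2.0, -3.0]))
print(x_min, f(x_min))
```

Starting the search at $\alpha = 1$ keeps the full Newton step whenever it already gives sufficient decrease, so the fast local convergence is retained near the minimizer.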

Newton’s Method A drawback of Newton’s method is that we need to calculate $H$ and then $H^{-1}$, which may be computationally expensive, and finding the inverse may be problematic when the number of variables is large. Another problem is that the Hessian matrix may not be positive definite. We will discuss some approaches to these problems in the following.

Newton’s Method If the Hessian matrix is not positive definite, the search may not be in a descent direction. To overcome this problem, the Levenberg-Marquardt modification is considered: $x^{(k+1)} = x^{(k)} - (H(x^{(k)}) + \mu_k I)^{-1}\,g(x^{(k)})$, where $\mu_k > 0$. The idea is to make the matrix positive definite. Not positive definite means that some eigenvalues of $H(x^{(k)})$ are not positive. Since the eigenvalues of $H(x^{(k)}) + \mu_k I$ are those of $H(x^{(k)})$ shifted by $\mu_k$, adding a sufficiently large $\mu_k$ makes $H(x^{(k)}) + \mu_k I$ positive definite.

Newton’s Method The Levenberg-Marquardt modification of Newton’s method becomes Newton’s method when $\mu_k \to 0$ and becomes a gradient method with a small step size when $\mu_k \to \infty$. In practice, we can start with a small value of $\mu_k$ and then slowly increase it until the iteration is a descent step (i.e., $f(x^{(k+1)}) < f(x^{(k)})$). Homework for prob-4: 9.1 and 9.3.
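The practical rule of starting with a small $\mu_k$ and increasing it until descent is obtained can be sketched as a single step routine. The function name `lm_step`, the growth factor of 10, and the test objective $f(x) = x_1^4/4 - x_1^2/2 + x_2^2/2$ are assumptions for illustration; at the chosen point the Hessian is indefinite.

```python
import numpy as np

def f(x):
    return x[0]**4 / 4 - x[0]**2 / 2 + x[1]**2 / 2

def grad(x):
    return np.array([x[0]**3 - x[0], x[1]])

def hess(x):
    return np.array([[3 * x[0]**2 - 1, 0.0],
                     [0.0, 1.0]])

def lm_step(x, mu=1e-3, factor=10.0, max_tries=30):
    """One Levenberg-Marquardt step: increase mu until f decreases."""
    g, H, I = grad(x), hess(x), np.eye(len(x))
    for _ in range(max_tries):
        try:
            x_new = x - np.linalg.solve(H + mu * I, g)
        except np.linalg.LinAlgError:
            x_new = None                   # H + mu*I singular; try a larger mu
        if x_new is not None and f(x_new) < f(x):
            return x_new, mu
        mu *= factor
    return x, mu                           # no descent found within max_tries

x0 = np.array([0.3, 0.0])                  # Hessian at x0 has a negative eigenvalue
x_new, mu = lm_step(x0)
print(mu, x_new, f(x0), f(x_new))
```

In this run the small values of $\mu_k$ are rejected because the step increases $f$, and the accepted value is large enough to make $H(x^{(0)}) + \mu_k I$ positive definite.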

Conjugate Direction Methods The class of conjugate direction methods can be viewed as intermediate between the steepest descent method and Newton’s method. The conjugate direction methods have the following properties: They solve quadratics of $n$ variables in $n$ steps. The usual implementation does not require the Hessian matrix. No operation (inversion or even storage) on $n \times n$ matrices is required.

Conjugate Direction Methods The conjugate direction methods typically perform better than the steepest descent method, but worse than Newton’s method. The crucial factor in the efficiency of an iterative search algorithm is the direction of search at each iteration. Thus, the conjugate direction methods use so-called conjugate directions as the search directions.

Conjugate Direction Methods Definition: Let $Q$ be a real symmetric matrix. The directions $d^{(0)}, d^{(1)}, \ldots, d^{(m)}$ are Q-conjugate if, for all $i \ne j$, we have $d^{(i)T} Q d^{(j)} = 0$. Lemma: Let $Q$ be a symmetric positive definite $n \times n$ matrix. If the directions $d^{(0)}, d^{(1)}, \ldots, d^{(k)}$ are nonzero and Q-conjugate, then they are linearly independent. Proof: Let $\alpha_0, \ldots, \alpha_k$ be scalars such that $\alpha_0 d^{(0)} + \alpha_1 d^{(1)} + \cdots + \alpha_k d^{(k)} = 0$. Pre-multiplying by $d^{(i)T} Q$ gives $\alpha_i d^{(i)T} Q d^{(i)} = 0$ (the other terms are 0 by Q-conjugacy). Since $Q > 0$ and $d^{(i)} \ne 0$, we have $d^{(i)T} Q d^{(i)} > 0$, so $\alpha_i = 0$ for $i = 0, 1, \ldots, k$. Hence the directions are linearly independent.

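Conjugate Direction Methods Example: $Q = \begin{bmatrix} 3 & 0 & 1 \\ 0 & 4 & 2 \\ 1 & 2 & 3 \end{bmatrix}$ (symmetric positive definite). All leading principal minors are positive: $\Delta_1 = 3$, $\Delta_2 = \det\begin{bmatrix} 3 & 0 \\ 0 & 4 \end{bmatrix} = 12$, $\Delta_3 = \det(Q) = 20$. Let $d^{(0)} = [1, 0, 0]^T$. Find $d^{(1)}$ from $d^{(0)T} Q d^{(1)} = 0$: $3d_1^{(1)} + d_3^{(1)} = 0$; select $d_1^{(1)} = 1$, $d_2^{(1)} = 0$, $d_3^{(1)} = -3$. Find $d^{(2)}$ from $d^{(0)T} Q d^{(2)} = 0$ and $d^{(1)T} Q d^{(2)} = 0$: $3d_1^{(2)} + d_3^{(2)} = 0$ and $-6d_2^{(2)} - 8d_3^{(2)} = 0$, so $d^{(2)} = [1, 4, -3]^T$.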
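The example can be checked numerically. The sketch assumes $Q = \begin{bmatrix} 3 & 0 & 1 \\ 0 & 4 & 2 \\ 1 & 2 & 3 \end{bmatrix}$, which is consistent with the minors 3, 12, 20 and with the conjugacy equations of the example, and uses the directions $[1, 0, 0]^T$, $[1, 0, -3]^T$, $[1, 4, -3]^T$.

```python
import numpy as np

Q = np.array([[3.0, 0.0, 1.0],
              [0.0, 4.0, 2.0],
              [1.0, 2.0, 3.0]])
D = np.array([[1.0, 0.0, 0.0],     # d(0)
              [1.0, 0.0, -3.0],    # d(1)
              [1.0, 4.0, -3.0]])   # d(2)

# Leading principal minors of Q
minors = [np.linalg.det(Q[:k, :k]) for k in (1, 2, 3)]

# Entry (i, j) of G is d(i)^T Q d(j); zero off-diagonal entries mean Q-conjugacy
G = D @ Q @ D.T
print(np.round(minors, 6))
print(G)
```

The matrix `G` is diagonal with positive diagonal entries, and `D` has full rank, in agreement with the lemma on linear independence.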

Conjugate Direction Methods A systematic procedure for finding Q-conjugate vectors is the Gram-Schmidt process (as used for finding an orthonormal basis), as follows. Given a set of linearly independent vectors $p^{(0)}, p^{(1)}, \ldots, p^{(n-1)}$, the Gram-Schmidt process is $d^{(0)} = p^{(0)}$ and $d^{(k+1)} = p^{(k+1)} - \sum_{i=0}^{k} \frac{p^{(k+1)T} Q d^{(i)}}{d^{(i)T} Q d^{(i)}}\, d^{(i)}$. Then $d^{(0)}, d^{(1)}, \ldots, d^{(n-1)}$ are Q-conjugate.
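The Gram-Schmidt construction of Q-conjugate directions can be sketched as follows, using the update $d^{(k+1)} = p^{(k+1)} - \sum_{i \le k} \frac{p^{(k+1)T} Q d^{(i)}}{d^{(i)T} Q d^{(i)}} d^{(i)}$. The function name `q_conjugate_directions`, the matrix `Q`, and the choice of the standard basis for the vectors $p^{(i)}$ are assumptions for illustration.

```python
import numpy as np

def q_conjugate_directions(P, Q):
    """Rows of P are linearly independent vectors p(0), ..., p(n-1).
    Returns an array whose rows are Q-conjugate directions."""
    D = []
    for p in P:
        d = p.astype(float)
        for di in D:
            # Remove the component of p along di in the Q-inner product
            d = d - (p @ Q @ di) / (di @ Q @ di) * di
        D.append(d)
    return np.array(D)

Q = np.array([[3.0, 0.0, 1.0],
              [0.0, 4.0, 2.0],
              [1.0, 2.0, 3.0]])    # symmetric positive definite (illustrative)
P = np.eye(3)                      # standard basis as the starting vectors

D = q_conjugate_directions(P, Q)
G = D @ Q @ D.T                    # should be diagonal
print(D)
print(np.round(G, 12))
```

Different starting vectors give different conjugate sets; the directions are determined only up to the choice of $p^{(0)}, \ldots, p^{(n-1)}$.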

Conjugate Direction Methods Consider the quadratic function $f(x) = \tfrac{1}{2}x^T Q x - b^T x$, where $Q$ is a symmetric positive definite matrix. It is easy to see that the global minimizer satisfies $Qx = b$. Basic conjugate direction algorithm: Given a starting point $x^{(0)}$ and Q-conjugate vectors $d^{(0)}, d^{(1)}, \ldots, d^{(n-1)}$, $x^{(k+1)} = x^{(k)} + \alpha_k d^{(k)}$ with $\alpha_k = -\frac{\nabla f(x^{(k)})^T d^{(k)}}{d^{(k)T} Q d^{(k)}}$, where $\nabla f(x^{(k)}) = Qx^{(k)} - b$.
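The basic conjugate direction algorithm can be sketched directly from the update $x^{(k+1)} = x^{(k)} + \alpha_k d^{(k)}$ with $\alpha_k = -\nabla f(x^{(k)})^T d^{(k)} / (d^{(k)T} Q d^{(k)})$. The function name `conjugate_direction`, the matrix `Q`, the vector `b`, and the starting point are illustrative assumptions; the rows of `D` are Q-conjugate for this `Q`.

```python
import numpy as np

def conjugate_direction(Q, b, x0, D):
    """Minimize 0.5 x^T Q x - b^T x using the Q-conjugate rows of D."""
    x = x0.astype(float)
    for d in D:
        g = Q @ x - b                     # gradient at the current point
        alpha = -(g @ d) / (d @ Q @ d)    # exact minimizer along d
        x = x + alpha * d
    return x

Q = np.array([[3.0, 0.0, 1.0],
              [0.0, 4.0, 2.0],
              [1.0, 2.0, 3.0]])
b = np.array([1.0, 2.0, 3.0])
D = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, -3.0],
              [1.0, 4.0, -3.0]])          # Q-conjugate directions for this Q

x_n = conjugate_direction(Q, b, np.zeros(3), D)
print(x_n, np.linalg.solve(Q, b))
```

After the three directions have been used, the iterate agrees with the solution of $Qx = b$ up to rounding, and only matrix-vector products with $Q$ are needed.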

Conjugate Direction Methods Theorem: For any starting point $x^{(0)}$, the basic conjugate direction algorithm (with Q-conjugate vectors $d^{(0)}, d^{(1)}, \ldots, d^{(n-1)}$) converges to the unique $x^*$ in $n$ steps. Proof: Since $d^{(0)}, d^{(1)}, \ldots, d^{(n-1)}$ are linearly independent, they form a basis, so $x^* - x^{(0)} = \beta_0 d^{(0)} + \beta_1 d^{(1)} + \cdots + \beta_{n-1} d^{(n-1)}$. Pre-multiplying by $d^{(k)T} Q$, for $k = 0, 1, \ldots, n-1$, we have $d^{(k)T} Q (x^* - x^{(0)}) = \beta_k d^{(k)T} Q d^{(k)}$. Then $\beta_k = \frac{d^{(k)T} Q (x^* - x^{(0)})}{d^{(k)T} Q d^{(k)}}$.

Conjugate Direction Methods Since $x^{(i+1)} = x^{(i)} + \alpha_i d^{(i)}$, after $k$ steps $x^{(k)} = x^{(0)} + \alpha_0 d^{(0)} + \cdots + \alpha_{k-1} d^{(k-1)}$. Then $x^* - x^{(0)} = (x^* - x^{(k)}) + (x^{(k)} - x^{(0)})$. Pre-multiplying by $d^{(k)T} Q$ gives $d^{(k)T} Q (x^* - x^{(0)}) = d^{(k)T} Q (x^* - x^{(k)}) + 0$ (the second term vanishes by Q-conjugacy) $= -d^{(k)T} \nabla f(x^{(k)})$ (note that $\nabla f(x^{(k)}) = Qx^{(k)} - b$ and $Qx^* = b$). Then $\beta_k = \alpha_k$ and $x^* = x^{(n)}$.

