1
Unconstrained Optimization
Rong Jin
2
Logistic Regression
- The optimization problem is to find the weights w and the threshold b that maximize the log-likelihood of the training data (written out below).
- How can this be done efficiently?
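The log-likelihood itself appeared only as an equation image on the slide; a standard form for the binary logistic model, assuming labels y_i in {-1, +1}, is

    l(w, b) = \sum_{i=1}^{n} \log \sigma\big( y_i (w^\top x_i + b) \big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}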
3
Gradient Ascent
- Compute the gradient of the log-likelihood.
- Increase the weights w and the threshold b in the gradient direction.
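A minimal sketch of gradient ascent on the logistic log-likelihood; the function names, the fixed step size eta, and the update loop are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent(X, y, eta=0.1, n_iters=1000):
    """Maximize the logistic log-likelihood; y holds +/-1 labels, X is (n, d)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        margin = y * (X @ w + b)              # y_i (w^T x_i + b)
        coeff = y * (1.0 - sigmoid(margin))   # per-example gradient weight
        w += eta * X.T @ coeff / n            # step in the gradient direction
        b += eta * coeff.sum() / n
    return w, b
```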
4
Problem with Gradient Ascent
- It is difficult to find an appropriate step size: too small gives slow convergence, too large gives oscillation or "bubbling".
- Convergence conditions: the Robbins-Monro conditions on the step sizes, together with a "regular" (well-behaved) objective function, ensure convergence (the conditions are stated below).
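The Robbins-Monro conditions constrain the sequence of step sizes \eta_t; the usual statement is

    \sum_{t=1}^{\infty} \eta_t = \infty, \qquad \sum_{t=1}^{\infty} \eta_t^2 < \infty \qquad (\text{e.g. } \eta_t = 1/t)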
5
Newton Method
- Utilizes the second-order derivative.
- Expand the objective function to second order around x_0; the minimum of this quadratic approximation gives the Newton update shown below.
- This is the Newton method for optimization, which is guaranteed to converge when the objective function is convex.
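The expansion and the update were shown as images on the slide; a standard one-dimensional rendering is

    f(x) \approx f(x_0) + f'(x_0)(x - x_0) + \tfrac{1}{2} f''(x_0)(x - x_0)^2

and minimizing the right-hand side over x gives

    x_{new} = x_0 - \frac{f'(x_0)}{f''(x_0)}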
6
Multivariate Newton Method
- The objective function comprises multiple variables. Example: the logistic regression model for text categorization, where thousands of words mean thousands of variables.
- For a multivariate function, the first-order derivative is a vector (the gradient) and the second-order derivative is the Hessian matrix.
- For m variables the Hessian is an m x m matrix; each element is defined as shown below.
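The element-wise definition of the Hessian, shown as an image on the slide, is the standard one:

    H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}, \qquad i, j = 1, \dots, m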
7
Multivariate Newton Method
- Updating equation: see below.
- The Hessian matrix of the logistic regression model can be expensive to compute. Example: text categorization with 10,000 words gives a 10,000 x 10,000 Hessian, i.e., 100 million entries.
- Even worse, we have to compute the inverse of the Hessian matrix, H^{-1}.
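The updating equation is the standard multivariate Newton step; the Hessian expression for the logistic log-likelihood (with respect to the weights w, writing \sigma_i = \sigma(y_i(w^\top x_i + b))) is a common form and is given here as an assumption about what the slide displayed:

    x_{k+1} = x_k - H^{-1} \nabla f(x_k), \qquad H = -\sum_{i=1}^{n} \sigma_i (1 - \sigma_i)\, x_i x_i^\top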
8
Quasi-Newton Method
- Approximate the inverse Hessian H^{-1} with another matrix B.
- B is updated iteratively (BFGS), utilizing derivatives from previous iterations; the update is given below.
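The BFGS update itself was not preserved in the extracted text; the standard inverse-Hessian form, with s_k = x_{k+1} - x_k and y_k = \nabla f(x_{k+1}) - \nabla f(x_k), is

    B_{k+1} = (I - \rho_k s_k y_k^\top)\, B_k\, (I - \rho_k y_k s_k^\top) + \rho_k s_k s_k^\top, \qquad \rho_k = \frac{1}{y_k^\top s_k}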
9
Limited-Memory Quasi-Newton
- Quasi-Newton avoids computing the inverse of the Hessian matrix, but it still requires computing and storing the B matrix, which takes large storage.
- Limited-Memory Quasi-Newton (L-BFGS) avoids even explicitly forming the B matrix: B can be expressed as a product of vectors, and only the most recent vector pairs (roughly 3 to 20) are kept.
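A small sketch of using an off-the-shelf L-BFGS implementation to fit the logistic model. SciPy's minimize solves minimization problems, so we minimize the negative log-likelihood; the toy data and function names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(theta, X, y):
    """theta = [w, b]; y holds +/-1 labels. Returns (value, gradient) for L-BFGS."""
    w, b = theta[:-1], theta[-1]
    margin = y * (X @ w + b)
    sig = 1.0 / (1.0 + np.exp(-margin))
    value = -np.sum(np.log(sig + 1e-12))
    coeff = -y * (1.0 - sig)                 # gradient weights of the negative LL
    grad = np.concatenate([X.T @ coeff, [coeff.sum()]])
    return value, grad

# Toy data: two Gaussian blobs with labels -1 and +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])

res = minimize(neg_log_likelihood, x0=np.zeros(3), args=(X, y),
               jac=True, method="L-BFGS-B")
print(res.x)   # fitted [w1, w2, b]
```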
10
Efficiency

Method                                  Complexity   Number of variables   Convergence rate
Standard Newton method                  O(n^3)       Small                 Very fast
Quasi-Newton method (BFGS)              O(n^2)       Medium                Fast
Limited-memory quasi-Newton (L-BFGS)    O(n)         Large                 Reasonably fast
11
Empirical Study: Learning Conditional Exponential Model

Dataset    Instances    Features
Rule       29,602       246
Lex        42,509       135,182
Summary    24,044       198,467
Shallow    8,625,782    264,142

Dataset    Method                         Iterations    Time (s)
Rule       Gradient ascent                350           4.80
Rule       Limited-memory quasi-Newton    81            1.13
Lex        Gradient ascent                1545          114.21
Lex        Limited-memory quasi-Newton    176           20.02
Summary    Gradient ascent                3321          190.22
Summary    Limited-memory quasi-Newton    69            8.52
Shallow    Gradient ascent                14527         85962.53
Shallow    Limited-memory quasi-Newton    421           2420.30
12
Free Software
- http://www.ece.northwestern.edu/~nocedal/software.html
- L-BFGS
- L-BFGS-B
13
Linear Conjugate Gradient Method
- Consider optimizing the quadratic function f(x) = (1/2) x^T A x - b^T x.
- Conjugate vectors: the set of vectors {p_1, p_2, ..., p_l} is said to be conjugate with respect to a matrix A if p_i^T A p_j = 0 for all i ≠ j.
- Important property: the quadratic function can be optimized by simply optimizing it along each individual direction in the conjugate set.
- Optimal solution: x* = x_0 + sum_k alpha_k p_k, where alpha_k is the minimizer along the k-th conjugate direction (the closed form is given below).
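The minimizer along each conjugate direction has a closed form under exact line search; assuming the quadratic above with residual r_k = b - A x_k, the step along p_k is

    \alpha_k = \frac{p_k^\top r_k}{p_k^\top A p_k}, \qquad x_{k+1} = x_k + \alpha_k p_k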
14
Example
- Minimize the given two-variable quadratic function with matrix A.
- Conjugate directions: first direction x_1 = x_2 = x, second direction x_1 = -x_2 = x.
- Optimizing along the first direction and then along the second direction gives the solution x_1 = x_2 = 1 (a concrete instance is sketched below).
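The specific function was shown only as an image; the following is a hypothetical instance consistent with the slide's directions and stated solution, assuming A = [[2, 1], [1, 2]] and b = (3, 3):

    f(x) = \tfrac{1}{2} x^\top A x - b^\top x = x_1^2 + x_2^2 + x_1 x_2 - 3x_1 - 3x_2

Along p_1 = (1, 1): \alpha_1 = \frac{p_1^\top b}{p_1^\top A p_1} = \frac{6}{6} = 1. Along p_2 = (1, -1): \alpha_2 = \frac{p_2^\top b}{p_2^\top A p_2} = \frac{0}{2} = 0. Hence x^* = 1 \cdot (1, 1) + 0 \cdot (1, -1) = (1, 1), as stated on the slide (for this particular b the second direction happens to contribute nothing).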
15
How to Efficiently Find a Set of Conjugate Directions
- Iterative procedure: given conjugate directions {p_1, p_2, ..., p_{k-1}}, set p_k as shown below.
- Theorem: the direction generated in this step is conjugate to all previous directions {p_1, p_2, ..., p_{k-1}}, i.e., p_k^T A p_i = 0 for i < k.
- Note: computing the k-th direction p_k requires only the previous direction p_{k-1}.
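The construction of p_k appeared as an equation image; a standard form of the conjugate gradient recurrence, written for the linear case with residual r_k = b - A x_k, is

    p_k = r_k + \beta_k p_{k-1}, \qquad \beta_k = \frac{r_k^\top r_k}{r_{k-1}^\top r_{k-1}}

Only p_{k-1} and the current and previous residuals are needed, which is why the storage cost stays low.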
16
Nonlinear Conjugate Gradient
- Even though conjugate gradient is derived for a quadratic objective function, it can be applied directly to other nonlinear functions.
- Several variants: Fletcher-Reeves conjugate gradient (FR-CG) and Polak-Ribiere conjugate gradient (PR-CG); PR-CG is more robust than FR-CG.
- Compared to the Newton method, there is no need to compute or store the Hessian matrix.
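A short sketch of applying an off-the-shelf nonlinear conjugate gradient routine to the same logistic objective; SciPy's "CG" method is a Polak-Ribiere-type nonlinear CG, and the data and function names here are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(theta, X, y):
    """Negative logistic log-likelihood and its gradient; y holds +/-1 labels."""
    w, b = theta[:-1], theta[-1]
    sig = 1.0 / (1.0 + np.exp(-y * (X @ w + b)))
    coeff = -y * (1.0 - sig)
    return -np.sum(np.log(sig + 1e-12)), np.concatenate([X.T @ coeff, [coeff.sum()]])

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])

res = minimize(neg_log_likelihood, x0=np.zeros(3), args=(X, y),
               jac=True, method="CG")   # nonlinear conjugate gradient
print(res.x, res.nit)                   # fitted parameters and iteration count
```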