Unconstrained Optimization
Rong Jin
Logistic Regression
- The optimization problem is to find the weights w and threshold b that maximize the log-likelihood of the training data
- How can we do this efficiently?
Gradient Ascent
- Compute the gradient of the log-likelihood with respect to w and b
- Increase the weights w and the threshold b in the gradient direction:
  w <- w + eta * dl/dw,  b <- b + eta * dl/db
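As a concrete illustration, here is a minimal gradient-ascent sketch for logistic regression; it assumes labels y_i in {0, 1}, a fixed step size eta, and illustrative function and parameter names not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent_logreg(X, y, eta=0.1, n_iters=1000):
    """Maximize the logistic-regression log-likelihood by gradient ascent.

    X: (n, d) feature matrix, y: (n,) labels in {0, 1}.
    eta and n_iters are illustrative defaults, not values from the slides.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)        # predicted P(y = 1 | x)
        grad_w = X.T @ (y - p)        # gradient of the log-likelihood w.r.t. w
        grad_b = np.sum(y - p)        # gradient w.r.t. the threshold b
        w += eta * grad_w             # move *up* the gradient
        b += eta * grad_b
    return w, b
```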
Problems with Gradient Ascent
- Difficult to find an appropriate step size
  - Too small: slow convergence
  - Too large: oscillation or "bubbling" around the optimum
- Convergence conditions
  - Robbins-Monro conditions on the step sizes eta_t:
    sum_t eta_t = infinity,  sum_t eta_t^2 < infinity  (e.g., eta_t = 1/t)
  - Along with a "regular" objective function, these conditions ensure convergence
Newton's Method
- Utilizes the second-order derivative
- Expand the objective function to second order around x_0:
  f(x) ≈ f(x_0) + f'(x_0)(x - x_0) + (1/2) f''(x_0)(x - x_0)^2
- The minimum of this quadratic approximation is
  x = x_0 - f'(x_0) / f''(x_0)
- Newton's method for optimization iterates this update:
  x_{k+1} = x_k - f'(x_k) / f''(x_k)
- Guaranteed to converge when the objective function is convex
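A minimal one-dimensional sketch of this iteration, with illustrative names; the caller supplies the first and second derivatives:

```python
def newton_1d(f_prime, f_double_prime, x0, n_iters=20, tol=1e-10):
    """One-dimensional Newton's method: x <- x - f'(x) / f''(x)."""
    x = x0
    for _ in range(n_iters):
        step = f_prime(x) / f_double_prime(x)
        x -= step
        if abs(step) < tol:           # stop once the update is negligible
            break
    return x

# Example: minimize f(x) = (x - 3)^2 + 1, whose minimizer is x = 3.
x_star = newton_1d(lambda x: 2 * (x - 3), lambda x: 2.0, x0=0.0)
```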
Multivariate Newton Method
- The objective function comprises multiple variables
  - Example: the logistic regression model for text categorization; thousands of words means thousands of variables
- For a multivariate function:
  - The first-order derivative is a gradient vector
  - The second-order derivative is the Hessian matrix
- For m variables the Hessian is an m x m matrix, with each element defined as
  H_ij = ∂^2 f / (∂x_i ∂x_j)
Multivariate Newton Method
- Updating equation:
  x_{k+1} = x_k - H^{-1} ∇f(x_k)
- The Hessian matrix for the logistic regression model can be expensive to compute
  - Example: text categorization with 10,000 words gives a 10,000 x 10,000 Hessian, i.e., 100 million entries
  - Even worse, we have to compute the inverse of the Hessian matrix, H^{-1}
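For logistic regression with labels in {0, 1}, a single Newton step can be written with a linear solve rather than an explicit matrix inverse. The sketch below is one possible formulation, assuming the threshold b is folded into X as a constant feature; for 10,000 features the d x d matrix it builds is exactly the 10,000 x 10,000 Hessian discussed above.

```python
import numpy as np

def newton_step_logreg(X, y, w):
    """One Newton update for logistic-regression weights.

    Gradient of the log-likelihood:  g = X^T (y - p)
    Hessian of the log-likelihood:   H = -X^T diag(p (1 - p)) X
    Newton step: w <- w - H^{-1} g, computed via a linear solve.
    """
    p = 1.0 / (1.0 + np.exp(-(X @ w)))       # predicted probabilities
    g = X.T @ (y - p)                        # gradient (length-d vector)
    weights = p * (1.0 - p)                  # diagonal of diag(p(1-p))
    neg_H = X.T @ (X * weights[:, None])     # -H = X^T diag(p(1-p)) X, a d x d matrix
    return w + np.linalg.solve(neg_H, g)     # equivalent to w - H^{-1} g
```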
Quasi-Newton Method
- Approximate the inverse Hessian H^{-1} with another matrix B
- B is updated iteratively (BFGS):
  B_{k+1} = (I - rho_k s_k y_k^T) B_k (I - rho_k y_k s_k^T) + rho_k s_k s_k^T
  where s_k = x_{k+1} - x_k,  y_k = ∇f(x_{k+1}) - ∇f(x_k),  rho_k = 1 / (y_k^T s_k)
- The update only utilizes derivatives (gradients) from previous iterations
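A sketch of one BFGS update of the inverse-Hessian approximation, directly following the formula above (variable names are illustrative):

```python
import numpy as np

def bfgs_update(B, s, y_vec):
    """BFGS update of the inverse-Hessian approximation B.

    s = x_{k+1} - x_k,  y_vec = grad f(x_{k+1}) - grad f(x_k).
    Returns (I - rho s y^T) B (I - rho y s^T) + rho s s^T with rho = 1 / (y^T s).
    """
    rho = 1.0 / (y_vec @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y_vec)
    return V @ B @ V.T + rho * np.outer(s, s)
```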
Limited-Memory Quasi-Newton
- Quasi-Newton (BFGS)
  - Avoids computing the inverse of the Hessian matrix
  - But it still requires computing and storing the B matrix, which needs large storage
- Limited-Memory Quasi-Newton (L-BFGS)
  - Avoids explicitly computing the B matrix at all
  - B can be expressed as a product of vectors
  - Only the most recent vector pairs (typically 3-20) are kept
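The standard way to apply B without ever forming it is the L-BFGS two-loop recursion; the sketch below assumes s_list and y_list hold only the m most recent difference vectors (m around 3-20):

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: compute B @ grad without building B explicitly.

    s_list[i] = x_{i+1} - x_i and y_list[i] = grad f(x_{i+1}) - grad f(x_i)
    for the m most recent iterations. Returns an approximation of
    H^{-1} grad; negate it to obtain a descent direction when minimizing.
    """
    q = grad.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest pair first
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        alphas.append(alpha)
    # Initial inverse-Hessian scaling (a common heuristic choice).
    gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1]) if s_list else 1.0
    r = gamma * q
    for (s, y), alpha in zip(zip(s_list, y_list), reversed(alphas)):  # oldest pair first
        rho = 1.0 / (y @ s)
        beta = rho * (y @ r)
        r += (alpha - beta) * s
    return r
```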
Efficiency

  Method                                  Cost per iteration   Number of variables   Convergence rate
  Standard Newton method                  O(n^3)               Small                 Very fast
  Quasi-Newton method (BFGS)              O(n^2)               Medium                Fast
  Limited-memory quasi-Newton (L-BFGS)    O(n)                 Large                 Reasonably fast
Empirical Study: Learning Conditional Exponential Model

  Dataset    Instances    Features
  Rule       29,…         …
  Lex        42,509       135,182
  Summary    24,044       198,467
  Shallow    8,625,782    264,142

  Iterations and running time (s) on each dataset, comparing the limited-memory quasi-Newton method with gradient ascent.
Free Software
- L-BFGS (…ftware.html)
- L-BFGS-B (…ftware.html)
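In practice these routines are rarely written by hand; for instance, SciPy exposes L-BFGS-B through scipy.optimize.minimize. The snippet below fits the logistic regression model on toy data (the data and variable names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(w, X, y):
    """Negative log-likelihood: maximizing l is minimizing -l."""
    z = X @ w
    return -np.sum(y * z - np.log1p(np.exp(z)))

def neg_gradient(w, X, y):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return -(X.T @ (y - p))

# Toy data just to make the call runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + rng.normal(size=100) > 0).astype(float)

result = minimize(neg_log_likelihood, x0=np.zeros(5), args=(X, y),
                  jac=neg_gradient, method="L-BFGS-B")
w_hat = result.x
```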
Linear Conjugate Gradient Method
- Consider optimizing the quadratic function
  f(x) = (1/2) x^T A x - b^T x
- Conjugate vectors
  - The set of vectors {p_1, p_2, ..., p_l} is said to be conjugate with respect to a matrix A if
    p_i^T A p_j = 0 for all i ≠ j
- Important property
  - The quadratic function can be optimized by simply optimizing it along each individual direction in the conjugate set
  - Optimal solution: x* = alpha_1 p_1 + ... + alpha_l p_l, where alpha_k is the minimizer along the k-th conjugate direction
Example
- Minimize the quadratic function f(x_1, x_2) = (1/2) x^T A x - b^T x for the given matrix A
- Conjugate directions: p_1 = (1, 1)^T and p_2 = (1, -1)^T
- Optimization
  - First direction (x_1 = x_2 = x): minimize along p_1
  - Second direction (x_1 = -x_2 = x): minimize along p_2
- Solution: x_1 = x_2 = 1
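Since the slide's own A and b are not reproduced here, the sketch below uses a hypothetical instance, A = [[2, 1], [1, 2]] and b = [3, 3], chosen so that (1, 1) and (1, -1) are conjugate and the minimizer is x_1 = x_2 = 1 as stated:

```python
import numpy as np

# Hypothetical instance: f(x) = 0.5 x^T A x - b^T x with minimizer (1, 1).
A = np.array([[2.0, 1.0], [1.0, 2.0]])
b = np.array([3.0, 3.0])

p1 = np.array([1.0, 1.0])     # first direction:  x1 =  x2
p2 = np.array([1.0, -1.0])    # second direction: x1 = -x2
assert abs(p1 @ A @ p2) < 1e-12   # p1 and p2 are conjugate with respect to A

# Minimize along each conjugate direction in turn, starting from the origin.
x = np.zeros(2)
for p in (p1, p2):
    r = b - A @ x                   # negative gradient at the current point
    alpha = (p @ r) / (p @ A @ p)   # exact line-search step along p
    x = x + alpha * p

print(x)   # -> [1. 1.]
```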
How to Efficiently Find a Set of Conjugate Directions
- Iterative procedure
  - Given conjugate directions {p_1, p_2, ..., p_{k-1}}, set p_k as
    p_k = -r_k + beta_k p_{k-1},  beta_k = (r_k^T A p_{k-1}) / (p_{k-1}^T A p_{k-1}),
    where r_k = A x_k - b is the gradient at the current iterate x_k
- Theorem: the direction generated in the above step is conjugate to all previous directions {p_1, p_2, ..., p_{k-1}}, i.e., p_k^T A p_i = 0 for all i < k
- Note: computing the k-th direction p_k only requires the previous direction p_{k-1}
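Combining this direction update with an exact line search gives the linear conjugate gradient method; a minimal sketch for a symmetric positive-definite A:

```python
import numpy as np

def linear_cg(A, b, x0=None, tol=1e-10, max_iters=None):
    """Minimize f(x) = 0.5 x^T A x - b^T x for symmetric positive-definite A.

    Each new direction is built from the current residual and the
    previous direction only, as in the iterative procedure above.
    """
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x                    # residual = negative gradient
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iters or n):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)    # exact step along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p   # next direction from r and previous p only
        rs_old = rs_new
    return x
```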
Nonlinear Conjugate Gradient
- Even though conjugate gradient is derived for a quadratic objective function, it can be applied directly to other nonlinear functions
- Several variants:
  - Fletcher-Reeves conjugate gradient (FR-CG)
  - Polak-Ribiere conjugate gradient (PR-CG), which is more robust than FR-CG in practice
- Compared to Newton's method:
  - No need to compute the Hessian matrix
  - No need to store the Hessian matrix
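A minimal Polak-Ribiere sketch for minimization, using a fixed step size where a real implementation would run a line search (all names are illustrative); SciPy's minimize also provides a nonlinear conjugate gradient via method='CG'.

```python
import numpy as np

def nonlinear_cg(f_grad, x0, eta=1e-2, n_iters=200):
    """Polak-Ribiere nonlinear conjugate gradient (minimization).

    f_grad returns the gradient of the objective at x; only gradients
    are needed, never the Hessian. eta is a fixed illustrative step size.
    """
    x = x0.copy()
    g = f_grad(x)
    d = -g                                     # start with steepest descent
    for _ in range(n_iters):
        x = x + eta * d                        # (line search omitted for brevity)
        g_new = f_grad(x)
        beta = g_new @ (g_new - g) / (g @ g)   # Polak-Ribiere formula
        beta = max(beta, 0.0)                  # common safeguard: restart if beta < 0
        d = -g_new + beta * d
        g = g_new
    return x
```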