Unconstrained Optimization Rong Jin
Recap: Gradient Ascent/Descent
- Simple algorithm; only requires the first-order derivative.
- Problem: difficulty in determining the step size.
  - Too small a step size: slow convergence.
  - Too large a step size: oscillation or divergence.
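A minimal sketch of the update rule in Python, using a fixed step size so the sensitivity to that choice is visible (the test function and step value are illustrative, not from the slides):

```python
import numpy as np

def gradient_descent(grad, x0, step=0.1, iters=100):
    # Plain gradient descent: only the first-order derivative is needed.
    # The fixed step size is the weak point: too small -> slow progress,
    # too large -> oscillation or divergence.
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - step * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3); the minimizer is x = 3.
print(gradient_descent(lambda x: 2 * (x - 3), x0=[0.0], step=0.1))
```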
Recap: Newton Method
- Univariate Newton method: x_{k+1} = x_k − f'(x_k) / f''(x_k).
- Multivariate Newton method: x_{k+1} = x_k − H(x_k)^{-1} ∇f(x_k), where H is the Hessian matrix.
- Guaranteed to converge when the objective function is convex/concave.
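A minimal Newton iteration in Python, applied to a small quadratic test problem (the test matrix, right-hand side, and starting point are illustrative, not from the slides):

```python
import numpy as np

def newton(grad, hess, x0, iters=20):
    # Multivariate Newton update: x <- x - H(x)^{-1} grad(x).
    # For a strictly convex (or concave, when maximizing) objective the
    # iteration converges, but each step needs the Hessian and a linear solve.
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# Quadratic test problem f(x) = 1/2 x^T A x - b^T x with known minimizer A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
print(newton(grad=lambda x: A @ x - b, hess=lambda x: A, x0=[0.0, 0.0]))
```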
Recap: Quasi-Newton Methods
- Problems with the standard Newton method:
  - Computing the inverse of the Hessian matrix H is expensive: O(n^3).
  - The Hessian matrix H itself can be very large: O(n^2) entries.
- Quasi-Newton method (BFGS):
  - Approximates the inverse of the Hessian H with another matrix B.
  - Avoids the difficulty of computing the inverse of H.
  - However, still problematic when B is large.
- Limited-memory quasi-Newton method (L-BFGS):
  - Stores a small set of vectors instead of the matrix B.
  - Avoids both the difficulty of computing the inverse of H and the difficulty of storing the large matrix B.
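In practice L-BFGS is rarely implemented by hand; below is a sketch of calling SciPy's L-BFGS-B routine on a standard test function (the function and starting point are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock test function and its gradient; L-BFGS-B keeps only a short
# history of update vectors instead of a full n x n inverse-Hessian
# approximation, so memory stays O(n).
def f(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def grad(x):
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
        200 * (x[1] - x[0]**2),
    ])

res = minimize(f, x0=np.zeros(2), jac=grad, method="L-BFGS-B")
print(res.x)   # close to the minimizer (1, 1)
```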
Recap
Method                                        Complexity   Number of variables   Convergence rate
Standard Newton method                        O(n^3)       Small                 Very fast
Quasi-Newton method (BFGS)                    O(n^2)       Medium                Fast
Limited-memory quasi-Newton method (L-BFGS)   O(n)         Large                 Reasonably fast
Empirical Study: Learning Conditional Exponential Model
Limited-memory quasi-Newton method (L-BFGS) vs. gradient ascent.

Dataset    Instances    Features
Rule       29,…         …
Lex        42,509       135,182
Summary    24,044       198,467
Shallow    8,625,782    264,142

Dataset    Iterations    Time (s)
Rule       …             …
Lex        …             …
Summary    …             …
Shallow    …             …
Free Software
- L-BFGS (…ftware.html)
- L-BFGS-B (…ftware.html)
Conjugate Gradient: Another Great Numerical Optimization Method!
Linear Conjugate Gradient Method
- Consider optimizing the quadratic function f(x) = (1/2) x^T A x − b^T x.
- Conjugate vectors: the set of vectors {p_1, p_2, …, p_l} is said to be conjugate with respect to a matrix A if p_i^T A p_j = 0 for all i ≠ j.
- Important property: the quadratic function can be optimized by simply optimizing it along each individual direction in the conjugate set.
- Optimal solution: x* = Σ_k α_k p_k, where α_k is the minimizer along the k-th conjugate direction.
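To see why conjugacy decouples the problem, here is the standard derivation; the notation f(x) = (1/2)xᵀAx − bᵀx and the coefficients α_k are the usual presentation, assumed here in place of the slide's lost formulas:

```latex
% Writing x = \sum_k \alpha_k p_k, the cross terms p_j^\top A p_k (j \neq k)
% vanish by conjugacy, so the objective separates into one-dimensional pieces:
\begin{aligned}
f(x) &= \tfrac{1}{2}\Big(\sum_k \alpha_k p_k\Big)^{\!\top} A \Big(\sum_k \alpha_k p_k\Big)
        - b^{\top}\sum_k \alpha_k p_k \\
     &= \sum_k \Big( \tfrac{1}{2}\alpha_k^2\, p_k^{\top} A p_k
        - \alpha_k\, b^{\top} p_k \Big).
\end{aligned}
% Each \alpha_k can therefore be chosen independently:
\alpha_k = \frac{b^{\top} p_k}{p_k^{\top} A p_k}.
```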
Example
- Minimize the quadratic function of (x_1, x_2) defined by the matrix A.
- Conjugate directions: p_1 = (1, 1) and p_2 = (1, −1).
- Optimization:
  - Along the first direction, set x_1 = x_2 = x and minimize over x.
  - Along the second direction, set x_1 = −x_2 = x and minimize over x.
- Solution: x_1 = x_2 = 1.
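The slide's matrix did not survive extraction, so the numerical check below uses a hypothetical A = [[2, 1], [1, 2]] and b = (3, 3), chosen so that the directions (1, 1) and (1, −1) are conjugate and the minimizer of f(x) = (1/2)xᵀAx − bᵀx is (1, 1), matching the slide's answer:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])   # hypothetical, not from the slide
b = np.array([3.0, 3.0])                 # hypothetical, not from the slide
p1, p2 = np.array([1.0, 1.0]), np.array([1.0, -1.0])

print(p1 @ A @ p2)                       # 0.0 -> the directions are A-conjugate

# Minimize f(x) = 1/2 x^T A x - b^T x along each direction independently:
# the optimal coefficient along p_k is alpha_k = (b . p_k) / (p_k . A p_k).
a1 = (b @ p1) / (p1 @ A @ p1)
a2 = (b @ p2) / (p2 @ A @ p2)
print(a1 * p1 + a2 * p2)                 # [1. 1.] -> the solution x1 = x2 = 1
```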
How to Efficiently Find a Set of Conjugate Directions
- Iterative procedure: given the conjugate directions {p_1, p_2, …, p_{k-1}} and the current residual r_k = b − A x_k, set
    p_k = r_k + β_k p_{k-1},   with β_k = (r_k^T r_k) / (r_{k-1}^T r_{k-1}).
- Theorem: the direction generated in the above step is conjugate to all previous directions {p_1, p_2, …, p_{k-1}}, i.e., p_k^T A p_j = 0 for all j < k.
- Note: computing the k-th direction p_k requires only the previous direction p_{k-1}, not the whole set.
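A minimal linear conjugate gradient loop sketching this recurrence; it reuses the hypothetical A and b from the example above, and the variable names (r for the residual, beta for the mixing coefficient) are mine, not the slide's:

```python
import numpy as np

def linear_cg(A, b, x0, iters=None):
    # Linear conjugate gradient for f(x) = 1/2 x^T A x - b^T x, with A
    # symmetric positive definite.  Each new direction is built from the
    # current residual and only the *previous* direction, yet remains
    # conjugate to all earlier directions.
    x = np.asarray(x0, dtype=float)
    r = b - A @ x                          # negative gradient at x
    p = r.copy()
    for _ in range(iters or len(b)):
        alpha = (r @ r) / (p @ A @ p)      # exact minimizer along p
        x = x + alpha * p
        r_new = r - alpha * (A @ p)
        if np.linalg.norm(r_new) < 1e-10:  # converged
            return x
        beta = (r_new @ r_new) / (r @ r)   # coefficient for the next direction
        p = r_new + beta * p               # next conjugate direction
        r = r_new
    return x

A = np.array([[2.0, 1.0], [1.0, 2.0]])     # hypothetical test problem
b = np.array([3.0, 3.0])
print(linear_cg(A, b, x0=np.zeros(2)))     # converges to (1, 1)
```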
Nonlinear Conjugate Gradient
- Even though conjugate gradient is derived for a quadratic objective function, it can be applied directly to other nonlinear functions.
- Convergence is guaranteed when the objective is convex/concave.
- Variants:
  - Fletcher-Reeves conjugate gradient (FR-CG).
  - Polak-Ribiere conjugate gradient (PR-CG): more robust than FR-CG.
- Compared with the Newton method:
  - Conjugate gradient is a first-order method.
  - It is usually less efficient than the Newton method.
  - However, it is simpler to implement.
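A sketch of running nonlinear CG from SciPy on a non-quadratic function; SciPy's "CG" option is a Polak-Ribiere-type nonlinear conjugate gradient, and the test function here is illustrative:

```python
import numpy as np
from scipy.optimize import minimize

# A smooth, non-quadratic convex test function and its gradient.
def f(x):
    return (x[0] - 1.0)**4 + (x[1] + 2.0)**2

def grad(x):
    return np.array([4.0 * (x[0] - 1.0)**3, 2.0 * (x[1] + 2.0)])

# Only first-order information (the gradient) is required by the method.
res = minimize(f, x0=np.zeros(2), jac=grad, method="CG")
print(res.x)   # approximately (1, -2)
```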
Empirical Study: Learning Conditional Exponential Model
Limited-memory quasi-Newton method (L-BFGS) vs. conjugate gradient (PR).

Dataset    Instances    Features
Rule       29,…         …
Lex        42,509       135,182
Summary    24,044       198,467
Shallow    8,625,782    264,142

Dataset    Iterations    Time (s)
Rule       …             …
Lex        …             …
Summary    …             …
Shallow    …             …
Free Software
- CG+ (…ftware.html)
When Should We Use Which Optimization Technique?
- Use the Newton method if you can find a package.
- Use conjugate gradient if you have to implement it yourself.
- Use gradient ascent/descent if you are lazy.
Logarithm Bound Algorithms
- Goal: maximize the objective f(x).
- Start with a guess x_0.
- For t = 1, 2, …, T:
  - Compute the current objective value f(x_{t-1}).
  - Find a decoupling lower-bound function Δ(x) ≤ f(x) − f(x_{t-1}) (touch point: Δ(x_{t-1}) = 0).
  - Find the optimal solution x_t of Δ(x).
Logarithm Bound Algorithm
- Start with an initial guess x_0.
- Come up with a lower-bound function Δ(x) ≤ f(x) − f(x_0), with touch point Δ(x_0) = 0.
- Find the optimal solution x_1 of Δ(x).
- Repeat the above procedure from x_1, x_2, …
- The iterates converge to the optimal point (see the sketch below).
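A minimal bound-optimization sketch in Python. The objective f(x) = log x − x and the logarithmic lower bound log(x) − log(x_t) ≥ 1 − x_t/x used here are illustrative choices, not the slides' conditional exponential model:

```python
import math

def f(x):
    # Concave objective to maximize; its maximizer is x = 1.
    return math.log(x) - x

def surrogate_argmax(x_t):
    # Lower bound on the improvement, using log(x) - log(x_t) >= 1 - x_t/x:
    #   Delta(x; x_t) = 1 - x_t/x - (x - x_t)  <=  f(x) - f(x_t),
    # with touch point Delta(x_t; x_t) = 0.  Setting dDelta/dx = 0 gives
    # x = sqrt(x_t), so the surrogate is maximized in closed form.
    return math.sqrt(x_t)

x = 0.1                       # initial guess x_0
for t in range(10):
    x = surrogate_argmax(x)   # maximize the bound touched at the current x
    print(t, x, f(x))         # f(x) increases monotonically toward f(1) = -1
```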
Property of Concave Functions
- For any concave function f and any weights λ_i ≥ 0 with Σ_i λ_i = 1:
    f(Σ_i λ_i x_i) ≥ Σ_i λ_i f(x_i).
Important Inequality
- log(x) and −exp(x) are concave functions.
- Therefore, for weights λ_i ≥ 0 with Σ_i λ_i = 1:
    log(Σ_i λ_i x_i) ≥ Σ_i λ_i log(x_i).
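This inequality is what turns a log of a sum into a sum of logs when deriving bound optimizers such as EM. A standard way to write it, where the weights q_i are my notation rather than the slide's:

```latex
% For any weights q_i > 0 with \sum_i q_i = 1 and any positive terms a_i,
% concavity of the logarithm (Jensen's inequality) gives
\log \sum_i a_i
  = \log \sum_i q_i \,\frac{a_i}{q_i}
  \;\ge\; \sum_i q_i \log \frac{a_i}{q_i},
% with equality when q_i \propto a_i, i.e. at the bound's touch point.
```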
Expectation-Maximization Algorithm
- Derive the EM algorithm for the hierarchical mixture model: the input X is routed by a gating function r(x) to one of two expert models m_1(x) and m_2(x), which predict the output y.
- Objective: the log-likelihood of the training data.
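The log-likelihood formula on the slide was an image and is not recoverable; assuming the usual two-expert mixture, with gating probability r(x) selecting expert m_1 and 1 − r(x) selecting expert m_2, it and its EM lower bound would read as follows (the responsibilities q_i are my notation):

```latex
% Log-likelihood of the training data \{(x_i, y_i)\}_{i=1}^N under the
% hierarchical (two-expert) mixture model:
\ell = \sum_{i=1}^{N} \log\Big[ r(x_i)\, m_1(y_i \mid x_i)
       + \big(1 - r(x_i)\big)\, m_2(y_i \mid x_i) \Big].
% Applying the logarithm bound with responsibilities q_i \in (0, 1):
\ell \;\ge\; \sum_{i=1}^{N} \Big[ q_i \log \frac{r(x_i)\, m_1(y_i \mid x_i)}{q_i}
       + (1 - q_i) \log \frac{\big(1 - r(x_i)\big)\, m_2(y_i \mid x_i)}{1 - q_i} \Big].
% E-step: set q_i to the posterior probability that expert 1 generated y_i;
% M-step: maximize the (decoupled) bound over the parameters of r, m_1, m_2.
```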