Sparse and low-rank recovery problems in signal processing and machine learning Jeremy Watt and Aggelos Katsaggelos Northwestern University Department of EECS
Part 3: Accelerated Proximal Gradient Methods
Why learn this?
- More widely applicable than the alternatives: greedy methods are narrow in problem type but broad in scale, while smooth reformulations are broad in problem type but narrow in scale.
- The “Accelerated” part makes the methods very scalable.
- The “Proximal” part is a natural extension of the standard gradient descent scheme.
- Used as a sub-routine in primal-dual approaches.
Contents
- The “Accelerated” part: Nesterov’s optimal gradient step is often used, since the functions dealt with are often convex.
- The “Proximal” part: the standard gradient descent step from the proximal perspective, and its natural extensions to sparse and low-rank problems.
The “Accelerated” part
Gradient descent algorithm
The reciprocal of the Lipschitz constant is typically used as the step length for convex functions, e.g. if f(x) = ½‖Ax − b‖₂² then L = ‖AᵀA‖₂ (the largest eigenvalue of AᵀA).
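To make the step concrete, here is a minimal sketch (not from the slides) of gradient descent on the least-squares objective f(x) = ½‖Ax − b‖₂² with the fixed step length 1/L; the function name and the fixed iteration count are illustrative choices.

```python
import numpy as np

def gradient_descent_ls(A, b, num_iters=500):
    """Gradient descent on f(x) = 0.5*||Ax - b||_2^2 with step length 1/L."""
    L = np.linalg.norm(A.T @ A, 2)      # Lipschitz constant of the gradient: ||A'A||_2
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)        # gradient of 0.5*||Ax - b||_2^2
        x = x - grad / L                # step length = reciprocal of the Lipschitz constant
    return x
```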
Gradient steps toward the optimum zig-zag along the floor of a long, narrow valley.
Gradient steps toward the optimum with a “momentum” term added to cancel out the perpendicular “noise” and prevent zig-zagging [3,4].
Momentum gradient step (classical heavy-ball form):
x^{k+1} = x^k − α ∇f(x^k) + β (x^k − x^{k−1})
− α ∇f(x^k): the standard gradient step
β (x^k − x^{k−1}): the momentum term, which evens out sideways “noise”
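A minimal sketch of the momentum update written above, for a generic gradient function; the momentum coefficient beta = 0.9 and the function name are illustrative assumptions, not choices made in the slides.

```python
import numpy as np

def momentum_descent(grad_f, x0, alpha, beta=0.9, num_iters=500):
    """Heavy-ball update: x_{k+1} = x_k - alpha*grad_f(x_k) + beta*(x_k - x_{k-1})."""
    x_prev = np.array(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(num_iters):
        x_next = x - alpha * grad_f(x) + beta * (x - x_prev)  # gradient step + momentum term
        x_prev, x = x, x_next
    return x
```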
Nesterov’s optimal gradient method [4,5] evens out sideways “noise” and converges an order of magnitude faster than standard gradient descent (O(1/k²) versus O(1/k) in objective error)!
Optimal gradient descent algorithm
The gradient of f is often assumed Lipschitz continuous, but this isn’t required. There are many variations on this theme.
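As one of the many variations mentioned above, here is a minimal sketch of Nesterov's optimal gradient method using the standard t_k extrapolation sequence (the same sequence FISTA reuses later); grad_f and the Lipschitz constant L are assumed to be supplied by the caller.

```python
import numpy as np

def nesterov_gradient(grad_f, L, x0, num_iters=500):
    """Nesterov's optimal gradient method with step length 1/L."""
    x = np.array(x0, dtype=float)
    y = x.copy()                                   # extrapolated ("look-ahead") point
    t = 1.0
    for _ in range(num_iters):
        x_next = y - grad_f(y) / L                 # gradient step taken at y, not x
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)   # momentum/extrapolation step
        x, t = x_next, t_next
    return x
```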
The “Accelerated” piece of the Proximal Gradient Method we’ll see next [1]
- Replacing the standard gradient step in the proximal methods discussed next is what makes them “Accelerated.”
- We’ll stick with the standard gradient step for ease of exposition in the introduction to proximal methods that follows.
- Replacing the gradient step in each of the final Proximal Gradient approaches gives the corresponding Accelerated Proximal Gradient approach.
The “Proximal” part
Projection onto a convex set C: P_C(y) = argmin_{x ∈ C} ‖x − y‖₂². This quadratic “shape” will become familiar.
Gradient step: proximal definition
A 2nd-order “almost” Taylor series expansion of f about x^k, with the Hessian replaced by L·I:
x^{k+1} = argmin_x { f(x^k) + ∇f(x^k)ᵀ(x − x^k) + (L/2)‖x − x^k‖₂² }
Gradient step: proximal definition
A little rearranging. To go from the first line to the second, just throw away terms independent of x and complete the square:
x^{k+1} = argmin_x { ∇f(x^k)ᵀ(x − x^k) + (L/2)‖x − x^k‖₂² } = argmin_x (L/2)‖x − y^k‖₂²,  where y^k = x^k − (1/L)∇f(x^k).
Proximal Gradient step
Adding a constraint set C gives x^{k+1} = argmin_{x ∈ C} (L/2)‖x − y^k‖₂² — a simple projection! It is minimized at x^{k+1} = P_C(y^k).
Proximal Gradient step
x^{k+1} = P_C(y^k) — a simple projection! Again notice y^k = x^k − (1/L)∇f(x^k), so each iteration is just a gradient step followed by a projection back onto C.
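A minimal sketch of the projected gradient iteration just derived; the constraint set C is taken to be the nonnegative orthant purely because its projection is easy to write down, an illustrative assumption rather than a choice made in the slides.

```python
import numpy as np

def projected_gradient(grad_f, L, x0, num_iters=500):
    """Proximal gradient step for a constraint set: gradient step, then project onto C."""
    x = np.array(x0, dtype=float)
    for _ in range(num_iters):
        y = x - grad_f(x) / L          # y^k = x^k - (1/L) * grad f(x^k)
        x = np.maximum(y, 0.0)         # x^{k+1} = P_C(y^k); here C = nonnegative orthant
    return x
```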
Extension to L-1 regularized problems
Convex L-1 regularized problems: minimize_x f(x) + λ‖x‖₁ with f convex and smooth, e.g. the Lasso: minimize_x ½‖Ax − b‖₂² + λ‖x‖₁.
Proximal Gradient step
It’s not clear how to generalize the notion of a gradient step to the L-1 case from the standard perspective. From the proximal perspective, however, it is fairly straightforward: “drag and drop” the L-1 term onto the same quadratic approximation to f:
x^{k+1} = argmin_x { f(x^k) + ∇f(x^k)ᵀ(x − x^k) + (L/2)‖x − x^k‖₂² + λ‖x‖₁ }
Proximal Gradient step
Same business as before — throw away terms independent of x and complete the square:
x^{k+1} = argmin_x { (L/2)‖x − y^k‖₂² + λ‖x‖₁ },  where y^k = x^k − (1/L)∇f(x^k).
Proximal Gradient step
This has the same shape as the proximal version of the projection step: a quadratic term keeping x close to y^k, with the regularizer standing in for the constraint.
Proximal Gradient step
At this point we expect x^{k+1} = S_τ(y^k) for some simple operator S and some threshold τ.
Shrinkage operator: applied entry-wise, S_τ(y)_i = sign(y_i) · max(|y_i| − τ, 0) — each entry is shrunk toward zero by τ, and entries smaller than τ in magnitude are set exactly to zero.
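A minimal sketch of the shrinkage operator as defined above, applied entry-wise with NumPy.

```python
import numpy as np

def shrink(y, tau):
    """Soft-thresholding: S_tau(y)_i = sign(y_i) * max(|y_i| - tau, 0)."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

# e.g. shrink(np.array([-2.0, -0.3, 0.0, 0.5, 3.0]), 1.0) gives [-1., 0., 0., 0., 2.]
```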
Proximal Gradient step
Same business as before: the minimizer is given exactly by the shrinkage operator,
x^{k+1} = S_{λ/L}(y^k) = S_{λ/L}(x^k − (1/L)∇f(x^k)).
Proximal Gradient Algorithm for the general L-1 regularized problem. Complexity: just like gradient descent.
Iterative Shrinkage Thresholding Algorithm (ISTA) [1]. Complexity: just like gradient descent.
Iterative Shrinkage Thresholding Algorithm (ISTA). With the optimal gradient step it is known as the Fast Iterative Shrinkage Thresholding Algorithm (FISTA) [1]. Complexity: just like gradient descent.
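A minimal sketch of ISTA and FISTA for the Lasso, ½‖Ax − b‖₂² + λ‖x‖₁, following the updates above; the function names and fixed iteration counts are illustrative, and the shrink() operator is the soft-thresholding function sketched earlier (repeated here so the block runs on its own).

```python
import numpy as np

def shrink(y, tau):
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def ista(A, b, lam, num_iters=500):
    """ISTA: gradient step on the smooth term, then shrinkage."""
    L = np.linalg.norm(A.T @ A, 2)               # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        y = x - A.T @ (A @ x - b) / L            # gradient step
        x = shrink(y, lam / L)                   # proximal (shrinkage) step
    return x

def fista(A, b, lam, num_iters=500):
    """FISTA: ISTA with Nesterov's optimal gradient (extrapolation) step."""
    L = np.linalg.norm(A.T @ A, 2)
    x = np.zeros(A.shape[1])
    z = x.copy()
    t = 1.0
    for _ in range(num_iters):
        x_next = shrink(z - A.T @ (A @ z - b) / L, lam / L)   # prox-gradient step at z
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0
        z = x_next + ((t - 1.0) / t_next) * (x_next - x)      # Nesterov extrapolation
        x, t = x_next, t_next
    return x
```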
Extension to nuclear-norm regularized problems
Problem archetype for the proximal gradient method: minimize_X f(X) + λ‖X‖_* with f convex and smooth.
Problem archetype
“Drag and drop” the nuclear norm onto the same quadratic approximation to f:
X^{k+1} = argmin_X { f(X^k) + ⟨∇f(X^k), X − X^k⟩ + (L/2)‖X − X^k‖_F² + λ‖X‖_* }
Problem archetype
Same shape as the proximal version of the projection step:
X^{k+1} = argmin_X { (L/2)‖X − Y^k‖_F² + λ‖X‖_* },  where Y^k = X^k − (1/L)∇f(X^k).
Problem archetype
As usual, we expect the minimizer X* to be given by some simple thresholding operation applied to Y^k.
What is X*? Well, if the SVD of Y, Y = UΣVᵀ, is written in outer-product form as Y = Σᵢ σᵢ uᵢ vᵢᵀ …
What is X*? … then, since the nuclear norm is the sum of the singular values and the Frobenius norm is invariant to the orthogonal factors U and V, the problem decouples along the singular values of Y.
What is X*? Could it, in analogy to ISTA, be given by soft-thresholding the singular values of Y? i.e.
X* = Σᵢ max(σᵢ − λ/L, 0) uᵢ vᵢᵀ ?
Yes, it is [2].
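A minimal sketch of the resulting singular value thresholding operator: take the SVD of Y, soft-threshold the singular values, and rebuild the matrix.

```python
import numpy as np

def svt(Y, tau):
    """Singular value thresholding: soft-threshold the singular values of Y."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```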
Example: RPCA with no Gaussian noise
minimize_{X,E} ‖X‖_* + λ‖E‖₁  subject to  X + E = D (the data matrix)
Moral to continue to drive home: reformulating is half the battle in optimization. This is an example use of the Quadratic Penalty Method: replace the equality constraint with a quadratic penalty,
minimize_{X,E} ‖X‖_* + λ‖E‖₁ + (μ/2)‖X + E − D‖_F².
RPCA: reformulation via the Quadratic Penalty Method. Perform alternating minimization in X and E using proximal gradient steps. Moral to continue to drive home: reformulating is half the battle in optimization.
Both of these subproblems are in the bag: the X-step is a nuclear-norm regularized problem handled by singular value thresholding, and the E-step is an L-1 regularized problem handled by the shrinkage operator (a sketch follows below).
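A minimal sketch of the alternating proximal-gradient scheme on the quadratic-penalty objective ‖X‖_* + λ‖E‖₁ + (μ/2)‖X + E − D‖_F², reusing the two operators above (repeated so the block runs on its own); the penalty weight mu, the fixed iteration count, and the function names are illustrative assumptions, not prescriptions from the slides.

```python
import numpy as np

def shrink(Y, tau):                      # entry-wise soft-thresholding
    return np.sign(Y) * np.maximum(np.abs(Y) - tau, 0.0)

def svt(Y, tau):                         # singular value thresholding
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def rpca_quadratic_penalty(D, lam, mu=1.0, num_iters=200):
    """Alternate proximal-gradient steps in X (low-rank) and E (sparse)."""
    X = np.zeros_like(D, dtype=float)
    E = np.zeros_like(D, dtype=float)
    for _ in range(num_iters):
        # X-step: gradient step on the quadratic penalty (Lipschitz constant mu),
        # then singular value thresholding with threshold 1/mu
        X = svt(X - (X + E - D), 1.0 / mu)
        # E-step: gradient step on the quadratic penalty, then shrinkage with lam/mu
        E = shrink(E - (X + E - D), lam / mu)
    return X, E
```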
Demo: Proximal Gradient (PG) versus Accelerated Proximal Gradient (APG)
Where to go from here
- Primal-dual methods: top-of-the-line algorithms in the field for a wide array of large-scale sparse/low-rank problems; often employ proximal methods; rapid convergence to “reasonable” solutions, often good enough for sparse/low-rank problems.
- Dual ascent [7], Augmented Lagrangian [8], and the Alternating Direction Method of Multipliers (ADMM) [9].
References
[1] Beck, Amir, and Marc Teboulle. “A fast iterative shrinkage-thresholding algorithm for linear inverse problems.” SIAM Journal on Imaging Sciences 2.1 (2009): 183-202.
[2] Cai, Jian-Feng, Emmanuel J. Candès, and Zuowei Shen. “A singular value thresholding algorithm for matrix completion.” SIAM Journal on Optimization 20.4 (2010): 1956-1982.
[3] Qian, Ning. “On the momentum term in gradient descent learning algorithms.” Neural Networks 12.1 (1999): 145-151.
[4] Candès, Emmanuel J. Class notes for “Math 301: Advanced topics in convex optimization,” available online at http://www-stat.stanford.edu/~candes/math301/index.html
[5] Nesterov, Yurii. Introductory Lectures on Convex Optimization: A Basic Course. Vol. 87. Springer, 2004.
[6] Figures made with GeoGebra, a free tool for producing geometric figures, available online at http://www.geogebra.org/cms/en/
References (continued)
[7] Cai, Jian-Feng, Emmanuel J. Candès, and Zuowei Shen. “A singular value thresholding algorithm for matrix completion.” SIAM Journal on Optimization 20.4 (2010): 1956-1982.
[8] Lin, Zhouchen, Minming Chen, and Yi Ma. “The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices.” arXiv preprint arXiv:1009.5055 (2010).
[9] Boyd, Stephen, et al. “Distributed optimization and statistical learning via the alternating direction method of multipliers.” Foundations and Trends in Machine Learning 3.1 (2011): 1-122.