1
Sparse and low-rank recovery problems in signal processing and machine learning
Jeremy Watt and Aggelos Katsaggelos Northwestern University Department of EECS
2
Part 3: Accelerated Proximal Gradient Methods
3
Why learn this? More widely applicable than
Greedy methods: narrow in problem type, broad in scale
Smooth reformulations: broad in problem type, narrow in scale
The “Accelerated” part makes the methods very scalable
The “Proximal” part is a natural extension of the standard gradient descent scheme
Used as a sub-routine in primal-dual approaches
4
Contents
The “Accelerated” part
Nesterov’s optimal gradient step is often used, since the functions dealt with are typically convex
The “Proximal” part
Standard gradient descent: proximal definition
Natural extensions to sparse and low-rank problems
5
The “Accelerated” part
6
Gradient descent algorithm
The reciprocal of the Lipschitz constant is typically used as the step-length for convex functions, e.g. if $f(x) = \tfrac{1}{2}\|Ax - b\|_2^2$ then $L = \|A^\top A\|_2$ (the largest singular value of $A^\top A$).
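As a concrete illustration (not from the slides), here is a minimal NumPy sketch of gradient descent with this $1/L$ step-length rule for the least-squares objective above; the function name and iteration count are my own choices.

```python
import numpy as np

def gradient_descent_ls(A, b, iters=200):
    """Minimize 0.5 * ||Ax - b||^2 with the fixed step-length 1/L."""
    L = np.linalg.norm(A.T @ A, 2)   # Lipschitz constant of the gradient (spectral norm)
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)     # gradient of the least-squares objective
        x = x - grad / L             # step-length = reciprocal of the Lipschitz constant
    return x
```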
7
Gradient steps towards the optimum
in the valley of a long narrow tube
8
Gradient steps towards the optimum
with a “momentum” term added to cancel out the perpendicular “noise” and prevent zig-zagging [3,4]
9
standard gradient
10
momentum term
11
evens out sideways “noise”
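The update being built up across these slides can be written, in one common (“heavy-ball”) form — the exact weights on the slides may differ:

$$x_{k+1} \;=\; x_k \;-\; \underbrace{\alpha\,\nabla f(x_k)}_{\text{standard gradient}} \;+\; \underbrace{\beta\,(x_k - x_{k-1})}_{\text{momentum term}},$$

where the momentum term averages successive steps and so evens out the sideways “noise.”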
12
Nesterov’s optimal gradient method [4,5]
evens out sideways “noise”
13
Nesterov’s optimal gradient method [4,5]
evens out sideways “noise”; an order of magnitude faster than standard gradient descent (an $O(1/k^2)$ rather than $O(1/k)$ convergence rate)!
14
Optimal gradient descent algorithm
The gradient of f is often assumed Lipschitz continuous, but this isn’t required. There are many variations on this theme.
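One standard way to write the accelerated iteration (a sketch; the slides’ variant may use different momentum weights), assuming a step-length of $1/L$:

$$\begin{aligned} y_k &= x_k + \frac{k-1}{k+2}\,(x_k - x_{k-1}),\\ x_{k+1} &= y_k - \frac{1}{L}\,\nabla f(y_k). \end{aligned}$$

The extrapolation point $y_k$ plays the role of the momentum term from the previous slides.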
15
The “Accelerated” piece of the Proximal Gradient Method we’ll see next [1]
Replace the standard gradient step in the proximal methods discussed next to make them “Accelerated.” We’ll stick with the standard gradient step for ease of exposition in the introduction to proximal methods. Replacing the gradient step in each of the final Proximal Gradient approaches gives the corresponding Accelerated Proximal Gradient approach.
16
The “Proximal” part
17
Projection onto a convex set
This will become a familiar shape
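For reference, the projection of a point $y$ onto a convex set $C$ is

$$P_C(y) \;=\; \arg\min_{x \in C}\; \tfrac{1}{2}\,\|x - y\|_2^2,$$

and this quadratic “shape” is the one that reappears in the proximal steps below.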
18
Gradient step: proximal definition
2nd order “almost” Taylor series expansion
19
Gradient step: proximal definition
a little rearranging. To go from the first line to the second, just throw away the terms independent of x and complete the square.
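Written out, the proximal definition of the gradient step referred to here is (a reconstruction from the surrounding text):

$$x_{k+1} \;=\; \arg\min_{x}\; f(x_k) + \nabla f(x_k)^\top (x - x_k) + \frac{L}{2}\,\|x - x_k\|_2^2 \;=\; \arg\min_{x}\; \frac{L}{2}\,\Big\|\, x - \Big(x_k - \tfrac{1}{L}\nabla f(x_k)\Big) \Big\|_2^2,$$

where the second form follows by discarding terms independent of $x$ and completing the square.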
20
Proximal Gradient step
a simple projection! minimized at $x_{k+1} = x_k - \tfrac{1}{L}\nabla f(x_k)$, the standard gradient step
21
Proximal Gradient step
a simple projection! again notice the same quadratic “shape” as the projection onto a convex set
22
Extension to L-1 regularized problems
23
Convex L-1 regularized problems
e.g. the Lasso
24
Proximal Gradient step
It’s not clear how to generalize the notion of a gradient step to the L-1 case from the standard perspective. From the proximal perspective, however, it is fairly straightforward: use the same quadratic approximation to f.
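Concretely, the step being described is (a reconstruction, with $\lambda$ denoting the L-1 regularization weight):

$$x_{k+1} \;=\; \arg\min_{x}\; f(x_k) + \nabla f(x_k)^\top (x - x_k) + \frac{L}{2}\,\|x - x_k\|_2^2 + \lambda\,\|x\|_1 \;=\; \arg\min_{x}\; \frac{L}{2}\,\Big\|\, x - \Big(x_k - \tfrac{1}{L}\nabla f(x_k)\Big) \Big\|_2^2 + \lambda\,\|x\|_1.$$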
25
Proximal Gradient step
same business as before
26
Proximal Gradient step
same shape as proximal version of projection
27
Proximal Gradient step
at this point, expect the minimizer to be a shrinkage (soft-thresholding) of $x_k - \tfrac{1}{L}\nabla f(x_k)$ for some threshold
28
Shrinkage operator
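The shrinkage (soft-thresholding) operator with threshold $\tau$ is defined elementwise as

$$\big[\mathcal{S}_{\tau}(y)\big]_i \;=\; \operatorname{sign}(y_i)\,\max\big(|y_i| - \tau,\; 0\big),$$

so the L-1 proximal gradient step above becomes $x_{k+1} = \mathcal{S}_{\lambda/L}\big(x_k - \tfrac{1}{L}\nabla f(x_k)\big)$.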
29
Proximal Gradient step
same business as before
30
Proximal Gradient Algorithm for general L-1 Regularized problem
Complexity: just like gradient descent.
31
Iterative Shrinkage Thresholding Algorithm (ISTA) [1]
Complexity: just like gradient descent.
32
Iterative Shrinkage Thresholding Algorithm (ISTA)
With the optimal gradient step, this is known as the Fast Iterative Shrinkage Thresholding Algorithm (FISTA) [1]. Complexity: just like gradient descent.
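A minimal NumPy sketch of ISTA/FISTA for the Lasso, assuming the objective $\tfrac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1$; the function names and fixed iteration count are my own choices, not from the slides.

```python
import numpy as np

def soft_threshold(v, tau):
    """Elementwise shrinkage: sign(v) * max(|v| - tau, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista(A, b, lam, iters=500):
    """Sketch of FISTA for min_x 0.5*||Ax - b||^2 + lam*||x||_1.
    Dropping the extrapolation (keeping y = x) recovers plain ISTA."""
    L = np.linalg.norm(A.T @ A, 2)       # Lipschitz constant of the smooth part
    x_prev = np.zeros(A.shape[1])
    y, t = x_prev.copy(), 1.0
    for _ in range(iters):
        grad = A.T @ (A @ y - b)                       # gradient of the smooth part at y
        x = soft_threshold(y - grad / L, lam / L)      # proximal (shrinkage) step
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x + ((t - 1.0) / t_next) * (x - x_prev)    # Nesterov extrapolation
        x_prev, t = x, t_next
    return x_prev
```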
33
Extension to nuclear-norm regularized problems
34
Problem archetype proximal gradient
35
Problem archetype: same quadratic approximation to f
36
Problem archetype: same shape as the proximal version of the projection
37
Problem archetype: as usual, expect the minimizer to be given by a thresholding operation
38
What is X*? Well, if the SVD can be written in outer-product form as
39
What is X*? then since and
40
What is X*? Could it, in analogy to ISTA, be given by soft-thresholding the singular values of Y?
41
What is X*? Could it, in analogy to ISTA, be given by soft-thresholding the singular values of Y?
42
Yes, it is [2]
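A small NumPy sketch of the resulting singular value thresholding (SVT) step, in the spirit of Cai, Candès, and Shen [2]; the helper name is my own.

```python
import numpy as np

def singular_value_threshold(Y, tau):
    """Soft-threshold the singular values of Y (the proximal operator of tau * nuclear norm)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```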
43
Example: RPCA w/no Gaussian noise
Moral to continue to drive home: reformulating is half the battle in optimization. This is an example use of the Quadratic Penalty Method
44
Example: RPCA w/no Gaussian noise
Quadratic penalty method Moral to continue to drive home: reformulating is half the battle in optimization.
45
RPCA: reformulation via Quadratic Penalty Method
Perform alternating minimization in X and E using proximal gradient steps Moral to continue to drive home: reformulating is half the battle in optimization.
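A rough sketch of these alternating proximal updates, assuming a quadratic-penalty objective of the form $\|X\|_* + \lambda\|E\|_1 + \tfrac{\mu}{2}\|Y - X - E\|_F^2$ (the exact penalty form and parameter names here are my reading of the slides, not a verbatim transcription):

```python
import numpy as np

def rpca_quadratic_penalty(Y, lam, mu, iters=100):
    """Sketch: alternate proximal steps on X (nuclear norm) and E (L-1 norm) for the
    assumed penalty objective ||X||_* + lam*||E||_1 + (mu/2)*||Y - X - E||_F^2."""
    shrink = lambda V, tau: np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)
    X, E = np.zeros_like(Y), np.zeros_like(Y)
    for _ in range(iters):
        # X-step: singular value thresholding of the residual Y - E
        U, s, Vt = np.linalg.svd(Y - E, full_matrices=False)
        X = U @ np.diag(shrink(s, 1.0 / mu)) @ Vt
        # E-step: elementwise shrinkage of the residual Y - X
        E = shrink(Y - X, lam / mu)
    return X, E
```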
46
Both of these are in the bag
47
Demo: PG versus APG
48
Where to go from here: primal-dual methods
Top-of-the-line algorithms in the field for a wide array of large-scale sparse/low-rank problems
Often employ proximal methods
Rapid convergence to “reasonable” solutions, often good enough for sparse/low-rank problems
Dual ascent [7], Augmented Lagrangian [8] / Alternating Direction Method of Multipliers [9]
49
References
[1] Beck, Amir, and Marc Teboulle. "A fast iterative shrinkage-thresholding algorithm for linear inverse problems." SIAM Journal on Imaging Sciences 2.1 (2009).
[2] Cai, Jian-Feng, Emmanuel J. Candès, and Zuowei Shen. "A singular value thresholding algorithm for matrix completion." SIAM Journal on Optimization 20.4 (2010).
[3] Qian, Ning. "On the momentum term in gradient descent learning algorithms." Neural Networks 12.1 (1999).
[4] Candès, Emmanuel J. Class notes for "Math 301: Advanced Topics in Convex Optimization," available online.
[5] Nesterov, Yurii. Introductory Lectures on Convex Optimization: A Basic Course. Vol. 87. Springer, 2004.
[6] Figures made with GeoGebra, a free tool for producing geometric figures, available for download online.
50
References
[7] Cai, Jian-Feng, Emmanuel J. Candès, and Zuowei Shen. "A singular value thresholding algorithm for matrix completion." SIAM Journal on Optimization 20.4 (2010).
[8] Lin, Zhouchen, Minming Chen, and Yi Ma. "The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices." arXiv preprint (2010).
[9] Boyd, Stephen, et al. "Distributed optimization and statistical learning via the alternating direction method of multipliers." Foundations and Trends in Machine Learning 3.1 (2011).