1
Sparse and low-rank recovery problems in signal processing and machine learning
Jeremy Watt and Aggelos Katsaggelos Northwestern University Department of EECS
2
Part 3: Accelerated Proximal Gradient Methods
3
Why learn this? More widely applicable than
Greedy methods: narrow in problem type, broad in scale
Smooth reformulations: broad in problem type, narrow in scale
The “Accelerated” part makes the methods very scalable
The “Proximal” part is a natural extension of the standard gradient descent scheme
Used as a sub-routine in primal-dual approaches
4
Contents
The “Accelerated” part
Nesterov’s optimal gradient step is often used, since the functions dealt with are typically convex
The “Proximal” part
Standard gradient descent: proximal definition
Natural extensions to sparse and low-rank problems
5
The “Accelerated” part
6
Gradient descent algorithm
The reciprocal of the Lipschitz constant is typically used as the step-length for convex functions, e.g. if $f(x) = \tfrac{1}{2}\|Ax - b\|_2^2$ then $L = \|A^\top A\|_2$ (the largest singular value of $A^\top A$).
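As a concrete illustration (not from the slides), here is a minimal NumPy sketch of gradient descent with this $1/L$ step-length rule for the least-squares objective above; the function name and iteration count are my own choices.

```python
import numpy as np

def gradient_descent_ls(A, b, iters=200):
    """Minimize 0.5 * ||Ax - b||^2 with the fixed step-length 1/L."""
    L = np.linalg.norm(A.T @ A, 2)   # Lipschitz constant of the gradient (spectral norm)
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)     # gradient of the least-squares objective
        x = x - grad / L             # step-length = reciprocal of the Lipschitz constant
    return x
```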
7
Gradient steps towards the optimum
in the valley of a long narrow tube
8
Gradient steps towards the optimum
with a “momentum” term added to cancel out the perpendicular “noise” and prevent zig-zagging [3,4]
9
standard gradient
10
momentum term
11
evens out sideways “noise”
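The update being built up across these slides can be written, in one common (“heavy-ball”) form — the exact weights on the slides may differ:

$$x_{k+1} \;=\; x_k \;-\; \underbrace{\alpha\,\nabla f(x_k)}_{\text{standard gradient}} \;+\; \underbrace{\beta\,(x_k - x_{k-1})}_{\text{momentum term}},$$

where the momentum term averages successive steps and so evens out the sideways “noise.”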
12
Nesterov’s optimal gradient method [4,5]
evens out sideways “noise”
13
Nesterov’s optimal gradient method [4,5]
evens out sideways “noise”; an order of magnitude faster than standard gradient descent (an $O(1/k^2)$ rather than $O(1/k)$ convergence rate)!
14
Optimal gradient descent algorithm
The gradient of f is often assumed Lipschitz continuous, but this isn’t required. There are many variations on this theme.
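One standard way to write the accelerated iteration (a sketch; the slides’ variant may use different momentum weights), assuming a step-length of $1/L$:

$$\begin{aligned} y_k &= x_k + \frac{k-1}{k+2}\,(x_k - x_{k-1}),\\ x_{k+1} &= y_k - \frac{1}{L}\,\nabla f(y_k). \end{aligned}$$

The extrapolation point $y_k$ plays the role of the momentum term from the previous slides.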
15
The “Accelerated” piece of the Proximal Gradient Method we’ll see next [1]
Replace the standard gradient step in the proximal methods discussed next to make them “Accelerated.” We’ll stick with the standard gradient step for ease of exposition in the introduction to proximal methods. Replacing the gradient step in each of the final Proximal Gradient approaches gives the corresponding Accelerated Proximal Gradient approach.
16
The “Proximal” part
17
Projection onto a convex set
This will become a familiar shape
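For reference, the projection of a point $y$ onto a convex set $C$ is

$$P_C(y) \;=\; \arg\min_{x \in C}\; \tfrac{1}{2}\,\|x - y\|_2^2,$$

and this quadratic “shape” is the one that reappears in the proximal steps below.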
18
Gradient step: proximal definition
2nd order “almost” Taylor series expansion
19
Gradient step: proximal definition
a little rearranging. To go from the first line to the second, just throw away the terms independent of x and complete the square.
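Written out, the proximal definition of the gradient step referred to here is (a reconstruction from the surrounding text):

$$x_{k+1} \;=\; \arg\min_{x}\; f(x_k) + \nabla f(x_k)^\top (x - x_k) + \frac{L}{2}\,\|x - x_k\|_2^2 \;=\; \arg\min_{x}\; \frac{L}{2}\,\Big\|\, x - \Big(x_k - \tfrac{1}{L}\nabla f(x_k)\Big) \Big\|_2^2,$$

where the second form follows by discarding terms independent of $x$ and completing the square.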
20
Proximal Gradient step
a simple projection! minimized at $x_{k+1} = x_k - \tfrac{1}{L}\nabla f(x_k)$, the standard gradient step
21
Proximal Gradient step
a simple projection! again notice the same quadratic “shape” as the projection onto a convex set
22
Extension to L-1 regularized problems
23
Convex L-1 regularized problems
e.g. the Lasso
24
Proximal Gradient step
It’s not clear how to generalize the notion of a gradient step to the L-1 case from the standard perspective. From the proximal perspective, however, it is fairly straightforward: use the same quadratic approximation to f.
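Concretely, the step being described is (a reconstruction, with $\lambda$ denoting the L-1 regularization weight):

$$x_{k+1} \;=\; \arg\min_{x}\; f(x_k) + \nabla f(x_k)^\top (x - x_k) + \frac{L}{2}\,\|x - x_k\|_2^2 + \lambda\,\|x\|_1 \;=\; \arg\min_{x}\; \frac{L}{2}\,\Big\|\, x - \Big(x_k - \tfrac{1}{L}\nabla f(x_k)\Big) \Big\|_2^2 + \lambda\,\|x\|_1.$$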
25
Proximal Gradient step
same business as before
26
Proximal Gradient step
same shape as proximal version of projection
27
Proximal Gradient step
at this point, expect the minimizer to be a shrinkage (soft-thresholding) of $x_k - \tfrac{1}{L}\nabla f(x_k)$ for some threshold
28
Shrinkage operator
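The shrinkage (soft-thresholding) operator with threshold $\tau$ is defined elementwise as

$$\big[\mathcal{S}_{\tau}(y)\big]_i \;=\; \operatorname{sign}(y_i)\,\max\big(|y_i| - \tau,\; 0\big),$$

so the L-1 proximal gradient step above becomes $x_{k+1} = \mathcal{S}_{\lambda/L}\big(x_k - \tfrac{1}{L}\nabla f(x_k)\big)$.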
29
Proximal Gradient step
same business as before
30
Proximal Gradient Algorithm for general L-1 Regularized problem
Complexity: just like gradient descent.
31
Iterative Shrinkage Thresholding Algorithm (ISTA) [1]
Complexity: just like gradient descent.
32
Iterative Shrinkage Thresholding Algorithm (ISTA)
With the optimal gradient step, this is known as the Fast Iterative Shrinkage Thresholding Algorithm (FISTA) [1]. Complexity: just like gradient descent.
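A minimal NumPy sketch of ISTA/FISTA for the Lasso, assuming the objective $\tfrac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1$; the function names and fixed iteration count are my own choices, not from the slides.

```python
import numpy as np

def soft_threshold(v, tau):
    """Elementwise shrinkage: sign(v) * max(|v| - tau, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista(A, b, lam, iters=500):
    """Sketch of FISTA for min_x 0.5*||Ax - b||^2 + lam*||x||_1.
    Dropping the extrapolation (keeping y = x) recovers plain ISTA."""
    L = np.linalg.norm(A.T @ A, 2)       # Lipschitz constant of the smooth part
    x_prev = np.zeros(A.shape[1])
    y, t = x_prev.copy(), 1.0
    for _ in range(iters):
        grad = A.T @ (A @ y - b)                       # gradient of the smooth part at y
        x = soft_threshold(y - grad / L, lam / L)      # proximal (shrinkage) step
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x + ((t - 1.0) / t_next) * (x - x_prev)    # Nesterov extrapolation
        x_prev, t = x, t_next
    return x_prev
```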
33
Extension to nuclear-norm regularized problems
34
Problem archetype proximal gradient
35
Problem archetype: same quadratic approximation to f
36
Problem archetype: same shape as the proximal version of the projection
37
Problem archetype: as usual, expect the minimizer to be given by a thresholding operation
38
What is X*? Well, if the SVD can be written in outer-product form as
39
What is X*? then since and
40
What is X*? Could it, in analogy to ISTA, be given by soft-thresholding the singular values of Y?
41
What is X*? Could it, in analogy to ISTA, be given by soft-thresholding the singular values of Y?
42
Yes, it is [2]
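A small NumPy sketch of the resulting singular value thresholding (SVT) step, in the spirit of Cai, Candès, and Shen [2]; the helper name is my own.

```python
import numpy as np

def singular_value_threshold(Y, tau):
    """Soft-threshold the singular values of Y (the proximal operator of tau * nuclear norm)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```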
43
Example: RPCA w/no Gaussian noise
Moral to continue to drive home: reformulating is half the battle in optimization. This is an example use of the Quadratic Penalty Method
44
Example: RPCA w/no Gaussian noise
Quadratic penalty method Moral to continue to drive home: reformulating is half the battle in optimization.
45
RPCA: reformulation via Quadratic Penalty Method
Perform alternating minimization in X and E using proximal gradient steps Moral to continue to drive home: reformulating is half the battle in optimization.
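A rough sketch of these alternating proximal updates, assuming a quadratic-penalty objective of the form $\|X\|_* + \lambda\|E\|_1 + \tfrac{\mu}{2}\|Y - X - E\|_F^2$ (the exact penalty form and parameter names here are my reading of the slides, not a verbatim transcription):

```python
import numpy as np

def rpca_quadratic_penalty(Y, lam, mu, iters=100):
    """Sketch: alternate proximal steps on X (nuclear norm) and E (L-1 norm) for the
    assumed penalty objective ||X||_* + lam*||E||_1 + (mu/2)*||Y - X - E||_F^2."""
    shrink = lambda V, tau: np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)
    X, E = np.zeros_like(Y), np.zeros_like(Y)
    for _ in range(iters):
        # X-step: singular value thresholding of the residual Y - E
        U, s, Vt = np.linalg.svd(Y - E, full_matrices=False)
        X = U @ np.diag(shrink(s, 1.0 / mu)) @ Vt
        # E-step: elementwise shrinkage of the residual Y - X
        E = shrink(Y - X, lam / mu)
    return X, E
```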
46
Both of these are in the bag
47
Demo: PG versus APG
48
Where to go from here: primal-dual methods
Top-of-the-line algorithms in the field for a wide array of large-scale sparse/low-rank problems
Often employ proximal methods
Rapid convergence to “reasonable” solutions, often good enough for sparse/low-rank problems
Dual ascent [7], Augmented Lagrangian [8] / Alternating Direction Method of Multipliers [9]
49
References
[1] Beck, Amir, and Marc Teboulle. "A fast iterative shrinkage-thresholding algorithm for linear inverse problems." SIAM Journal on Imaging Sciences 2.1 (2009).
[2] Cai, Jian-Feng, Emmanuel J. Candès, and Zuowei Shen. "A singular value thresholding algorithm for matrix completion." SIAM Journal on Optimization 20.4 (2010).
[3] Qian, Ning. "On the momentum term in gradient descent learning algorithms." Neural Networks 12.1 (1999).
[4] Candès, Emmanuel J. Class notes for "Math 301: Advanced Topics in Convex Optimization," available online.
[5] Nesterov, Yurii. Introductory Lectures on Convex Optimization: A Basic Course. Vol. 87. Springer, 2004.
[6] Figures made with GeoGebra, a free tool for producing geometric figures, available for download online.
50
References
[7] Cai, Jian-Feng, Emmanuel J. Candès, and Zuowei Shen. "A singular value thresholding algorithm for matrix completion." SIAM Journal on Optimization 20.4 (2010).
[8] Lin, Zhouchen, Minming Chen, and Yi Ma. "The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices." arXiv preprint (2010).
[9] Boyd, Stephen, et al. "Distributed optimization and statistical learning via the alternating direction method of multipliers." Foundations and Trends in Machine Learning 3.1 (2011).