Sparse and low-rank recovery problems in signal processing and machine learning Jeremy Watt and Aggelos Katsaggelos Northwestern University Department of EECS
Part 3: Accelerated Proximal Gradient Methods
Why learn this?
- More widely applicable than the alternatives: greedy methods are narrow in problem type but broad in scale, while smooth reformulations are broad in problem type but narrow in scale.
- The “Accelerated” part makes the methods very scalable.
- The “Proximal” part is a natural extension of the standard gradient descent scheme.
- Used as a sub-routine in primal-dual approaches.
Contents
- The “Accelerated” part: Nesterov’s optimal gradient step is often used, since the functions dealt with are often convex.
- The “Proximal” part: the standard gradient descent step from the proximal perspective, and its natural extensions to sparse and low-rank problems.
The “Accelerated” part
Gradient descent algorithm
The reciprocal of the Lipschitz constant is typically used as the step length for convex functions, e.g. if f(x) = ½‖Ax − b‖₂² then L = ‖AᵀA‖₂ (the largest eigenvalue of AᵀA).
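To make the step concrete, here is a minimal sketch (not from the slides) of gradient descent on the least-squares objective f(x) = ½‖Ax − b‖₂² with the fixed step length 1/L; the function name and the fixed iteration count are illustrative choices.

```python
import numpy as np

def gradient_descent_ls(A, b, num_iters=500):
    """Gradient descent on f(x) = 0.5*||Ax - b||_2^2 with step length 1/L."""
    L = np.linalg.norm(A.T @ A, 2)      # Lipschitz constant of the gradient: ||A'A||_2
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)        # gradient of 0.5*||Ax - b||_2^2
        x = x - grad / L                # step length = reciprocal of the Lipschitz constant
    return x
```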
Gradient steps toward the optimum zig-zag along the floor of a long, narrow valley.
Gradient steps toward the optimum with a “momentum” term added to cancel out the perpendicular “noise” and prevent zig-zagging [3,4].
Momentum gradient step (classical heavy-ball form):
x^{k+1} = x^k − α ∇f(x^k) + β (x^k − x^{k−1})
− α ∇f(x^k): the standard gradient step
β (x^k − x^{k−1}): the momentum term, which evens out sideways “noise”
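A minimal sketch of the momentum update written above, for a generic gradient function; the momentum coefficient beta = 0.9 and the function name are illustrative assumptions, not choices made in the slides.

```python
import numpy as np

def momentum_descent(grad_f, x0, alpha, beta=0.9, num_iters=500):
    """Heavy-ball update: x_{k+1} = x_k - alpha*grad_f(x_k) + beta*(x_k - x_{k-1})."""
    x_prev = np.array(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(num_iters):
        x_next = x - alpha * grad_f(x) + beta * (x - x_prev)  # gradient step + momentum term
        x_prev, x = x, x_next
    return x
```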
Nesterov’s optimal gradient method [4,5] evens out sideways “noise” and converges an order of magnitude faster than standard gradient descent (O(1/k²) versus O(1/k) in objective error)!
Optimal gradient descent algorithm
The gradient of f is often assumed Lipschitz continuous, but this isn’t required. There are many variations on this theme.
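As one of the many variations mentioned above, here is a minimal sketch of Nesterov's optimal gradient method using the standard t_k extrapolation sequence (the same sequence FISTA reuses later); grad_f and the Lipschitz constant L are assumed to be supplied by the caller.

```python
import numpy as np

def nesterov_gradient(grad_f, L, x0, num_iters=500):
    """Nesterov's optimal gradient method with step length 1/L."""
    x = np.array(x0, dtype=float)
    y = x.copy()                                   # extrapolated ("look-ahead") point
    t = 1.0
    for _ in range(num_iters):
        x_next = y - grad_f(y) / L                 # gradient step taken at y, not x
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)   # momentum/extrapolation step
        x, t = x_next, t_next
    return x
```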
The “Accelerated” piece of the Proximal Gradient Method we’ll see next [1]
- Replacing the standard gradient step in the proximal methods discussed next is what makes them “Accelerated.”
- We’ll stick with the standard gradient step for ease of exposition in the introduction to proximal methods that follows.
- Replacing the gradient step in each of the final Proximal Gradient approaches gives the corresponding Accelerated Proximal Gradient approach.
The “Proximal” part
Projection onto a convex set C: P_C(y) = argmin_{x ∈ C} ‖x − y‖₂². This quadratic “shape” will become familiar.
Gradient step: proximal definition
A 2nd-order “almost” Taylor series expansion of f about x^k, with the Hessian replaced by L·I:
x^{k+1} = argmin_x { f(x^k) + ∇f(x^k)ᵀ(x − x^k) + (L/2)‖x − x^k‖₂² }
Gradient step: proximal definition
A little rearranging. To go from the first line to the second, just throw away terms independent of x and complete the square:
x^{k+1} = argmin_x { ∇f(x^k)ᵀ(x − x^k) + (L/2)‖x − x^k‖₂² } = argmin_x (L/2)‖x − y^k‖₂²,  where y^k = x^k − (1/L)∇f(x^k).
Proximal Gradient step
Adding a constraint set C gives x^{k+1} = argmin_{x ∈ C} (L/2)‖x − y^k‖₂² — a simple projection! It is minimized at x^{k+1} = P_C(y^k).
Proximal Gradient step
x^{k+1} = P_C(y^k) — a simple projection! Again notice y^k = x^k − (1/L)∇f(x^k), so each iteration is just a gradient step followed by a projection back onto C.
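A minimal sketch of the projected gradient iteration just derived; the constraint set C is taken to be the nonnegative orthant purely because its projection is easy to write down, an illustrative assumption rather than a choice made in the slides.

```python
import numpy as np

def projected_gradient(grad_f, L, x0, num_iters=500):
    """Proximal gradient step for a constraint set: gradient step, then project onto C."""
    x = np.array(x0, dtype=float)
    for _ in range(num_iters):
        y = x - grad_f(x) / L          # y^k = x^k - (1/L) * grad f(x^k)
        x = np.maximum(y, 0.0)         # x^{k+1} = P_C(y^k); here C = nonnegative orthant
    return x
```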
Extension to L-1 regularized problems
Convex L-1 regularized problems: minimize_x f(x) + λ‖x‖₁ with f convex and smooth, e.g. the Lasso: minimize_x ½‖Ax − b‖₂² + λ‖x‖₁.
Proximal Gradient step
It’s not clear how to generalize the notion of a gradient step to the L-1 case from the standard perspective. From the proximal perspective, however, it is fairly straightforward: “drag and drop” the L-1 term onto the same quadratic approximation to f:
x^{k+1} = argmin_x { f(x^k) + ∇f(x^k)ᵀ(x − x^k) + (L/2)‖x − x^k‖₂² + λ‖x‖₁ }
Proximal Gradient step
Same business as before — throw away terms independent of x and complete the square:
x^{k+1} = argmin_x { (L/2)‖x − y^k‖₂² + λ‖x‖₁ },  where y^k = x^k − (1/L)∇f(x^k).
Proximal Gradient step
This has the same shape as the proximal version of the projection step: a quadratic term keeping x close to y^k, with the regularizer standing in for the constraint.
Proximal Gradient step
At this point we expect x^{k+1} = S_τ(y^k) for some simple operator S and some threshold τ.
Shrinkage operator: applied entry-wise, S_τ(y)_i = sign(y_i) · max(|y_i| − τ, 0) — each entry is shrunk toward zero by τ, and entries smaller than τ in magnitude are set exactly to zero.
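A minimal sketch of the shrinkage operator as defined above, applied entry-wise with NumPy.

```python
import numpy as np

def shrink(y, tau):
    """Soft-thresholding: S_tau(y)_i = sign(y_i) * max(|y_i| - tau, 0)."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

# e.g. shrink(np.array([-2.0, -0.3, 0.0, 0.5, 3.0]), 1.0) gives [-1., 0., 0., 0., 2.]
```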
Proximal Gradient step
Same business as before: the minimizer is given exactly by the shrinkage operator,
x^{k+1} = S_{λ/L}(y^k) = S_{λ/L}(x^k − (1/L)∇f(x^k)).
Proximal Gradient Algorithm for the general L-1 regularized problem. Complexity: just like gradient descent.
Iterative Shrinkage Thresholding Algorithm (ISTA) [1]. Complexity: just like gradient descent.
Iterative Shrinkage Thresholding Algorithm (ISTA). With the optimal gradient step it is known as the Fast Iterative Shrinkage Thresholding Algorithm (FISTA) [1]. Complexity: just like gradient descent.
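A minimal sketch of ISTA and FISTA for the Lasso, ½‖Ax − b‖₂² + λ‖x‖₁, following the updates above; the function names and fixed iteration counts are illustrative, and the shrink() operator is the soft-thresholding function sketched earlier (repeated here so the block runs on its own).

```python
import numpy as np

def shrink(y, tau):
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def ista(A, b, lam, num_iters=500):
    """ISTA: gradient step on the smooth term, then shrinkage."""
    L = np.linalg.norm(A.T @ A, 2)               # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        y = x - A.T @ (A @ x - b) / L            # gradient step
        x = shrink(y, lam / L)                   # proximal (shrinkage) step
    return x

def fista(A, b, lam, num_iters=500):
    """FISTA: ISTA with Nesterov's optimal gradient (extrapolation) step."""
    L = np.linalg.norm(A.T @ A, 2)
    x = np.zeros(A.shape[1])
    z = x.copy()
    t = 1.0
    for _ in range(num_iters):
        x_next = shrink(z - A.T @ (A @ z - b) / L, lam / L)   # prox-gradient step at z
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0
        z = x_next + ((t - 1.0) / t_next) * (x_next - x)      # Nesterov extrapolation
        x, t = x_next, t_next
    return x
```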
Extension to nuclear-norm regularized problems
Problem archetype for the proximal gradient method: minimize_X f(X) + λ‖X‖_* with f convex and smooth.
Problem archetype
“Drag and drop” the nuclear norm onto the same quadratic approximation to f:
X^{k+1} = argmin_X { f(X^k) + ⟨∇f(X^k), X − X^k⟩ + (L/2)‖X − X^k‖_F² + λ‖X‖_* }
Problem archetype
Same shape as the proximal version of the projection step:
X^{k+1} = argmin_X { (L/2)‖X − Y^k‖_F² + λ‖X‖_* },  where Y^k = X^k − (1/L)∇f(X^k).
Problem archetype
As usual, we expect the minimizer X* to be given by some simple thresholding operation applied to Y^k.
What is X*? Well, if the SVD of Y, Y = UΣVᵀ, is written in outer-product form as Y = Σᵢ σᵢ uᵢ vᵢᵀ …
What is X*? … then, since the nuclear norm is the sum of the singular values and the Frobenius norm is invariant to the orthogonal factors U and V, the problem decouples along the singular values of Y.
What is X*? Could it, in analogy to ISTA, be given by soft-thresholding the singular values of Y? i.e.
X* = Σᵢ max(σᵢ − λ/L, 0) uᵢ vᵢᵀ ?
Yes, it is [2].
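A minimal sketch of the resulting singular value thresholding operator: take the SVD of Y, soft-threshold the singular values, and rebuild the matrix.

```python
import numpy as np

def svt(Y, tau):
    """Singular value thresholding: soft-threshold the singular values of Y."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```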
Example: RPCA with no Gaussian noise
minimize_{X,E} ‖X‖_* + λ‖E‖₁  subject to  X + E = D (the data matrix)
Moral to continue to drive home: reformulating is half the battle in optimization. This is an example use of the Quadratic Penalty Method: replace the equality constraint with a quadratic penalty,
minimize_{X,E} ‖X‖_* + λ‖E‖₁ + (μ/2)‖X + E − D‖_F².
RPCA: reformulation via the Quadratic Penalty Method. Perform alternating minimization in X and E using proximal gradient steps. Moral to continue to drive home: reformulating is half the battle in optimization.
Both of these subproblems are in the bag: the X-step is a nuclear-norm regularized problem handled by singular value thresholding, and the E-step is an L-1 regularized problem handled by the shrinkage operator (a sketch follows below).
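A minimal sketch of the alternating proximal-gradient scheme on the quadratic-penalty objective ‖X‖_* + λ‖E‖₁ + (μ/2)‖X + E − D‖_F², reusing the two operators above (repeated so the block runs on its own); the penalty weight mu, the fixed iteration count, and the function names are illustrative assumptions, not prescriptions from the slides.

```python
import numpy as np

def shrink(Y, tau):                      # entry-wise soft-thresholding
    return np.sign(Y) * np.maximum(np.abs(Y) - tau, 0.0)

def svt(Y, tau):                         # singular value thresholding
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def rpca_quadratic_penalty(D, lam, mu=1.0, num_iters=200):
    """Alternate proximal-gradient steps in X (low-rank) and E (sparse)."""
    X = np.zeros_like(D, dtype=float)
    E = np.zeros_like(D, dtype=float)
    for _ in range(num_iters):
        # X-step: gradient step on the quadratic penalty (Lipschitz constant mu),
        # then singular value thresholding with threshold 1/mu
        X = svt(X - (X + E - D), 1.0 / mu)
        # E-step: gradient step on the quadratic penalty, then shrinkage with lam/mu
        E = shrink(E - (X + E - D), lam / mu)
    return X, E
```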
Demo: Proximal Gradient (PG) versus Accelerated Proximal Gradient (APG)
Where to go from here
- Primal-dual methods: top-of-the-line algorithms in the field for a wide array of large-scale sparse/low-rank problems; often employ proximal methods; rapid convergence to “reasonable” solutions, often good enough for sparse/low-rank problems.
- Dual ascent [7], Augmented Lagrangian [8], and the Alternating Direction Method of Multipliers (ADMM) [9].
References
[1] Beck, Amir, and Marc Teboulle. “A fast iterative shrinkage-thresholding algorithm for linear inverse problems.” SIAM Journal on Imaging Sciences 2.1 (2009): 183-202.
[2] Cai, Jian-Feng, Emmanuel J. Candès, and Zuowei Shen. “A singular value thresholding algorithm for matrix completion.” SIAM Journal on Optimization 20.4 (2010): 1956-1982.
[3] Qian, Ning. “On the momentum term in gradient descent learning algorithms.” Neural Networks 12.1 (1999): 145-151.
[4] Candès, Emmanuel J. Class notes for “Math 301: Advanced topics in convex optimization,” available online at http://www-stat.stanford.edu/~candes/math301/index.html
[5] Nesterov, Yurii. Introductory Lectures on Convex Optimization: A Basic Course. Vol. 87. Springer, 2004.
[6] Figures made with GeoGebra, a free tool for producing geometric figures, available online at http://www.geogebra.org/cms/en/
References (continued)
[7] Cai, Jian-Feng, Emmanuel J. Candès, and Zuowei Shen. “A singular value thresholding algorithm for matrix completion.” SIAM Journal on Optimization 20.4 (2010): 1956-1982.
[8] Lin, Zhouchen, Minming Chen, and Yi Ma. “The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices.” arXiv preprint arXiv:1009.5055 (2010).
[9] Boyd, Stephen, et al. “Distributed optimization and statistical learning via the alternating direction method of multipliers.” Foundations and Trends in Machine Learning 3.1 (2011): 1-122.