First Order Methods for Convex Optimization
J. Saketha Nath (IIT Bombay; Microsoft)

Topics
Part I: Optimal methods for unconstrained convex programs
- Smooth objective
- Non-smooth objective
Part II: Optimal methods for constrained convex programs
- Projection based
- Frank-Wolfe based
- Prox-based methods for structured non-smooth programs

Non-Topics
- Step-size schemes
- Bundle methods
- Stochastic methods
- Inexact oracles
- Non-Euclidean extensions (Mirror-friends)

Motivation & Example Applications

Machine Learning Applications
Training data: $\{(x_1, y_1), \ldots, (x_m, y_m)\}$
Goal: construct $f : X \to Y$

Machine Learning Applications
Example: a set of temple images with corresponding architecture labels, e.g. $f(\text{temple image}) = $ "Vijayanagara Style".

Machine Learning Applications
Input data: $\{(x_1, y_1), \ldots, (x_m, y_m)\}$
Goal: construct $f : X \to Y$
Model: $f(x) = w^T \phi(x)$
Algorithm: find simple functions that explain the data,
$$\min_{w \in W}\ \Omega(w) + \sum_{i=1}^{m} l\big(w^T \phi(x_i),\, y_i\big)$$

Typical Program – Machine Learning
$$\min_{w \in W}\ \Omega(w) + \sum_{i=1}^{m} l\big(w^T \phi(x_i),\, y_i\big)$$
- The terms may be smooth or non-smooth (a smooth surrogate for the loss is common)
- $m$, $n$ are large
- The data term is a sum
- The domain $W$ is restricted
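To make this structure concrete, here is a minimal NumPy sketch (not from the talk) of one smooth instantiation, with $\Omega(w) = \frac{\lambda}{2}\|w\|^2$ and squared loss; the random data standing in for $\{(x_i, y_i)\}$ and the feature matrix `Phi` are illustrative assumptions.

```python
import numpy as np

def objective(w, Phi, y, lam):
    """Regularized empirical risk Omega(w) + sum_i l(w^T phi(x_i), y_i),
    with the illustrative choices Omega(w) = 0.5*lam*||w||^2 and
    l(z, y) = (z - y)^2; Phi stacks phi(x_i) as rows."""
    residuals = Phi @ w - y
    return 0.5 * lam * (w @ w) + residuals @ residuals

# Toy usage with random data standing in for {(x_i, y_i)} and phi.
rng = np.random.default_rng(0)
m, n = 200, 50
Phi = rng.standard_normal((m, n))
y = rng.standard_normal(m)
print(objective(np.zeros(n), Phi, y, lam=0.1))
```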

Scale is the issue!
- $m$, $n$, as well as the number of models, may run into millions
- Even a single iteration of IPM/Newton-type methods is infeasible
- "Slower" but "cheaper" methods are the alternative: decomposition-based methods and first order methods

First Order Methods - Overview
- Iterative, use gradient-like information, $O(mn)$ cost per iteration; e.g. gradient method, cutting planes, conjugate gradient
- Very old methods (1950s)
- Number of iterations: sub-linear rate, far slower than IPM (not crucial for ML)
- But (nearly) independent of $n$
- Widely employed in state-of-the-art ML systems
- Choice of variant depends on problem structure

Smooth Unconstrained
$$\min_{w \in \mathbb{R}^n}\ \sum_{i=1}^{m} \big(w^T \phi(x_i) - y_i\big)^2$$

Smooth Convex Functions
- Continuously differentiable
- Gradient is Lipschitz continuous: $\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\|$
- E.g. $g(x) \equiv x^2$ is not Lipschitz continuous over $\mathbb{R}$ but is over $[0,1]$ with $L = 2$; $g(x) \equiv |x|$ is Lipschitz continuous with $L = 1$
Theorem: Let $f$ be convex and twice differentiable. Then $f$ is smooth with constant $L$ $\Leftrightarrow$ $\nabla^2 f(x) \preceq L\, I_n$.
Hence $\min_{w \in \mathbb{R}^n} \sum_{i=1}^{m} (w^T \phi(x_i) - y_i)^2$ is indeed smooth!
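As a sanity check on this claim (my own sketch, not from the slides): for $f(w) = \sum_i (w^T \phi(x_i) - y_i)^2$ the Hessian is $2\Phi^T\Phi$, so $L = 2\lambda_{\max}(\Phi^T\Phi)$, and the Lipschitz inequality for the gradient can be spot-checked numerically on random stand-in data.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 300, 40
Phi = rng.standard_normal((m, n))   # rows play the role of phi(x_i)
y = rng.standard_normal(m)

def grad(w):
    # gradient of f(w) = sum_i (w^T phi(x_i) - y_i)^2
    return 2.0 * Phi.T @ (Phi @ w - y)

# The Hessian is 2 * Phi^T Phi for every w, so L = 2 * lambda_max(Phi^T Phi).
L = 2.0 * np.linalg.eigvalsh(Phi.T @ Phi).max()

# Spot-check ||grad(u) - grad(v)|| <= L * ||u - v|| on random pairs.
for _ in range(5):
    u, v = rng.standard_normal(n), rng.standard_normal(n)
    assert np.linalg.norm(grad(u) - grad(v)) <= L * np.linalg.norm(u - v) + 1e-8
print("Lipschitz constant of the gradient:", L)
```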

Gradient Method [Cauchy 1847]
Move the iterate in the direction of instantaneous decrease:
$$x_{k+1} = x_k - s_k \nabla f(x_k), \qquad s_k > 0$$
Equivalently, regularized minimization of the first order approximation:
$$x_{k+1} = \arg\min_{x \in \mathbb{R}^n}\ f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2 s_k}\|x - x_k\|^2$$
(Setting the gradient of the right-hand side to zero recovers the update above; the slide's figure plots $f$ together with the quadratic model $\phi$ built at $x_k$, whose minimizer is $x_{k+1}$.)
Various step-size schemes:
- Constant ($1/L$)
- Diminishing ($s_k \downarrow 0$, $\sum s_k = \infty$)
- Exact or back-tracking line search
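A minimal sketch of the gradient method, assuming the least-squares objective above with random stand-in data; it supports the constant step $1/L$ and a simple backtracking line search (the exact line-search and diminishing-step variants are omitted).

```python
import numpy as np

def gradient_method(f, grad, x0, L=None, iters=500):
    """x_{k+1} = x_k - s_k * grad f(x_k), with s_k = 1/L when L is known,
    otherwise a simple Armijo-style backtracking line search."""
    x = x0.copy()
    for _ in range(iters):
        g = grad(x)
        if L is not None:
            s = 1.0 / L
        else:
            s = 1.0
            # halve the step until sufficient decrease holds
            while f(x - s * g) > f(x) - 0.5 * s * (g @ g):
                s *= 0.5
        x = x - s * g
    return x

# Illustrative use on f(w) = sum_i (w^T phi(x_i) - y_i)^2 with random data.
rng = np.random.default_rng(2)
Phi, y = rng.standard_normal((100, 20)), rng.standard_normal(100)
f = lambda w: float(np.sum((Phi @ w - y) ** 2))
grad = lambda w: 2.0 * Phi.T @ (Phi @ w - y)
L = 2.0 * np.linalg.norm(Phi, 2) ** 2
print(f(gradient_method(f, grad, np.zeros(20), L=L)))
```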

Convergence Rate – Gradient Method
Theorem [Ne04]: If $f$ is smooth with constant $L$ and $s_k = \frac{1}{L}$, then the gradient method generates $x_k$ such that
$$f(x_k) - f(x^*) \le \frac{2L\,\|x_0 - x^*\|^2}{k+4}.$$
Proof sketch:
- $f(x_{k+1}) \le f(x_k) + \nabla f(x_k)^T (x_{k+1} - x_k) + \frac{L}{2}\|x_{k+1} - x_k\|^2$ (majorization minimization)
- $f(x_{k+1}) \ge f(x_k) + \nabla f(x_k)^T (x_{k+1} - x_k) + \frac{1}{2L}\|\nabla f(x_{k+1}) - \nabla f(x_k)\|^2$
- $\|x_{k+1} - x^*\|^2 \le \|x_k - x^*\|^2 - \frac{1}{L^2}\|\nabla f(x_k)\|^2$
- $\Delta_{k+1} \le \Delta_k - \frac{\Delta_k^2}{2L\, r_0^2}$ (solve the recursion)

Comments on Rate of Convergence
- Sub-linear, very slow compared to IPM
- Applies to conjugate gradient and other traditional variants
- With strong convexity the gradient method achieves a linear rate: $O\!\left(\left(\frac{Q-1}{Q+1}\right)^{2k}\right)$
- Sub-optimal (maybe?):
Theorem [Ne04]: For any $k \le \frac{n-1}{2}$ and any $x_0$, there exists a smooth function $f$ with constant $L$ such that, for any first order method,
$$f(x_k) - f(x^*) \ge \frac{3L\,\|x_0 - x^*\|^2}{32\,(k+1)^2}.$$
Proof sketch: choose a function such that $x_k \in \mathrm{lin}\{\nabla f(x_0), \ldots, \nabla f(x_{k-1})\} \subset \mathbb{R}^{k,n}$.
- The corresponding lower bound in the strongly convex case is $O\!\left(\left(\frac{\sqrt{Q}-1}{\sqrt{Q}+1}\right)^{2k}\right)$

Intuition for Non-optimality
- All variants are descent methods
- Descent is essential for the proof
- This is overkill, leading to restrictive movements
- Try non-descent alternatives!

Accelerated Gradient Method [Ne83, 88, Be09]
$$y_k = x_{k-1} + \frac{k-2}{k+1}\,(x_{k-1} - x_{k-2}) \qquad \text{(extrapolation or momentum)}$$
$$x_k = y_k - s_k \nabla f(y_k) \qquad \text{(usual gradient step)}$$
Uses a two-step history (the slide's figure shows $y_k$ extrapolated beyond $x_{k-1}$ from $x_{k-2}$, with the gradient step taken at $y_k$).
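A sketch of the accelerated update under the same assumptions as before (random stand-in data, constant step $s_k = 1/L$); the momentum coefficient $(k-2)/(k+1)$ follows the formula above.

```python
import numpy as np

def accelerated_gradient(grad, x0, L, iters=500):
    """Accelerated gradient method:
        y_k = x_{k-1} + (k-2)/(k+1) * (x_{k-1} - x_{k-2})   (extrapolation)
        x_k = y_k - (1/L) * grad f(y_k)                     (gradient step)
    """
    x_prev, x = x0.copy(), x0.copy()
    for k in range(1, iters + 1):
        y = x + (k - 2) / (k + 1) * (x - x_prev)
        x_prev, x = x, y - grad(y) / L
    return x

# Illustrative use on a random least-squares problem.
rng = np.random.default_rng(3)
Phi, y_data = rng.standard_normal((100, 20)), rng.standard_normal(100)
grad = lambda w: 2.0 * Phi.T @ (Phi @ w - y_data)
L = 2.0 * np.linalg.norm(Phi, 2) ** 2
w_hat = accelerated_gradient(grad, np.zeros(20), L)
print(np.sum((Phi @ w_hat - y_data) ** 2))
```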

Towards Optimality [Moritz Hardt]
For the quadratic $f(x) = \frac{1}{2} x^T A x - b^T x$ with $x_0 = b$, the gradient method with step $\frac{1}{L}$ gives
$$x_k = x_{k-1} - \frac{1}{L}\big(A x_{k-1} - b\big) = x_0 + \sum_{i=0}^{k}\left(I - \frac{A}{L}\right)^i \frac{b}{L},$$
i.e. a polynomial in $A$ applied to $b$; this achieves only the sub-optimal rate $O\!\left(\left(1 - \frac{1}{Q}\right)^k\right)$.
Lemma [Mo12]: There is a (Chebyshev) polynomial of degree $O\!\left(\sqrt{Q}\,\log\frac{1}{\epsilon}\right)$ such that $p(0) = 1$ and $|p(x)| \le \epsilon$ for all $x \in [\mu, L]$.
Chebyshev polynomials satisfy a recurrence involving the two previous terms, hence we expect
$$x_k = x_{k-1} - s_{k-1}\nabla f(x_{k-1}) + \lambda_{k-1}\nabla f(x_{k-2})$$
to be optimal (acceleration), achieving $O\!\left(\left(1 - \frac{1}{\sqrt{Q}}\right)^k\right)$.

Rate of Convergence – Accelerated Gradient
Theorem [Be09]: If $f$ is smooth with constant $L$ and $s_k = \frac{1}{L}$, then the accelerated gradient method generates $x_k$ such that
$$f(x_k) - f(x^*) \le \frac{2L\,\|x_0 - x^*\|^2}{(k+1)^2}.$$
Indeed optimal!
Proof sketch:
- $f(x_k) \le f(z) + L\,(x_k - y_k)^T (z - x_k) + \frac{L}{2}\|x_k - y_k\|^2 \quad \forall\, z \in \mathbb{R}^n$
- A convex combination of this at $z = x_{k-1}$ and $z = x^*$ leads to
$$\frac{(k+1)^2}{2L}\big(f(x_k) - f^*\big) + \|y_k - x^*\|^2 \le \frac{k^2}{2L}\big(f(x_{k-1}) - f^*\big) + \|y_{k-1} - x^*\|^2 \le \|x_0 - x^*\|^2.$$

A Comparison of the Two Gradient Methods
Test problem [L. Vandenberghe, EE236C notes]:
$$\min_{x \in \mathbb{R}^{1000}}\ \log \sum_{i=1}^{2000} e^{a_i^T x + b_i}$$
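The following sketch reproduces the spirit of this comparison on randomly generated $a_i$, $b_i$ (the slide's actual data are not available), using the safe bound $L = \|A\|_2^2$ on the Lipschitz constant of the gradient of the log-sum-exp objective.

```python
import numpy as np
from scipy.special import logsumexp, softmax

rng = np.random.default_rng(4)
m, n = 2000, 1000
A = rng.standard_normal((m, n)) / np.sqrt(n)   # rows a_i^T (illustrative)
b = rng.standard_normal(m)

f = lambda x: logsumexp(A @ x + b)
grad = lambda x: A.T @ softmax(A @ x + b)
# The Hessian of log-sum-exp composed with an affine map is bounded by A^T A,
# so L = ||A||_2^2 is a valid (if conservative) Lipschitz constant.
L = np.linalg.norm(A, 2) ** 2

def run(accelerated, iters=200):
    x_prev = x = np.zeros(n)
    vals = []
    for k in range(1, iters + 1):
        vals.append(f(x))
        y = x + (k - 2) / (k + 1) * (x - x_prev) if accelerated else x
        x_prev, x = x, y - grad(y) / L
    return vals

plain, accel = run(False), run(True)
print("gradient method:   ", plain[-1])
print("accelerated method:", accel[-1])
```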

Junk the variants other than accelerated gradient? Not necessarily; accelerated gradient is
- less robust than the gradient method [Moritz Hardt]
- prone to accumulating error with inexact oracles [De13]
Who knows what will happen in your application?

Summary of Unconstrained Smooth Convex Programs
Gradient method and friends: $\epsilon \approx O\!\left(\frac{1}{k}\right)$
- Sub-linear and sub-optimal rate
- Additionally, strong convexity gives $\epsilon \approx O\!\left(\left(\frac{Q-1}{Q+1}\right)^{2k}\right)$: sub-optimal but linear rate
Accelerated gradient methods: $\epsilon \approx O\!\left(\frac{1}{k^2}\right)$
- Sub-linear but optimal
- $O(mn)$ computation per iteration
- Additionally, strong convexity gives $\epsilon \approx O\!\left(\left(\frac{\sqrt{Q}-1}{\sqrt{Q}+1}\right)^{2k}\right)$: optimal, but still a linear rate

Non-smooth Unconstrained
$$\min_{w \in \mathbb{R}^n}\ \sum_{i=1}^{m} \big|\, w^\top \phi(x_i) - y_i \,\big|$$

What is first order info?
At a point $(x_0, f(x_0))$ where $f$ is differentiable, the first order information is the tangent $L(x) \equiv f(x_0) + \nabla f(x_0)^T (x - x_0)$.

What is first order info?
At a point $(x_1, f(x_1))$ where $f$ is not differentiable, the canonical form is $L(x) \equiv f(x_1) + g^T (x - x_1)$, where $g$ is defined as a sub-gradient. Multiple $g$ exist such that $L(x) \le f(x)\ \forall x$.

First Order Methods (Non-smooth)
Theorem: Let $f$ be a closed convex function. Then
- At any $x \in \mathrm{ri}(\mathrm{dom}\, f)$, a sub-gradient exists, and the set of all sub-gradients (denoted $\partial f(x)$, the sub-differential set) is closed and convex.
- If $f$ is differentiable at $x \in \mathrm{int}(\mathrm{dom}\, f)$, then the gradient is the only sub-gradient.
Theorem: $x \in \mathbb{R}^n$ minimizes $f$ if and only if $0 \in \partial f(x)$.
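As a quick illustration of the optimality condition (my own scalar example, not from the slides): for $f(x) = |x|$,
$$\partial f(x) = \begin{cases} \{-1\}, & x < 0,\\ [-1, 1], & x = 0,\\ \{+1\}, & x > 0, \end{cases}$$
so $0 \in \partial f(0)$ and $x^* = 0$ is the minimizer, even though $f$ is not differentiable there.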

Sub-gradient Method
Assume an oracle that returns a sub-gradient $g_f(x)$. The sub-gradient method:
$$x_{k+1} = \arg\min_{x \in \mathbb{R}^n}\ f(x_k) + g_f(x_k)^T (x - x_k) + \frac{1}{2 s_k}\|x - x_k\|^2$$
$$\Rightarrow\quad x_{k+1} = x_k - s_k\, g_f(x_k)$$
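A sketch of the sub-gradient method applied to the non-smooth objective $\min_w \sum_i |w^T \phi(x_i) - y_i|$ from a few slides back, with random stand-in data; $\mathrm{sign}(\cdot)$ picks one valid sub-gradient, and the step size anticipates the choice $s_k = R/(L\sqrt{k+1})$ analyzed next. The values of $R$ and the Lipschitz bound are rough illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
Phi, y = rng.standard_normal((100, 20)), rng.standard_normal(100)

f = lambda w: float(np.sum(np.abs(Phi @ w - y)))
# sign(.) selects one valid element of the sub-differential at each point
subgrad = lambda w: Phi.T @ np.sign(Phi @ w - y)

def subgradient_method(f, subgrad, x0, R, Lip, iters=2000):
    """x_{k+1} = x_k - s_k * g_f(x_k) with s_k = R / (Lip * sqrt(k+1)).
    Not a descent method, so the best iterate seen so far is returned."""
    x, best = x0.copy(), x0.copy()
    for k in range(iters):
        x = x - (R / (Lip * np.sqrt(k + 1))) * subgrad(x)
        if f(x) < f(best):
            best = x.copy()
    return best

# Rough bounds: ||g_f(w)|| = ||Phi^T sign(.)|| <= ||Phi||_2 * sqrt(m); R is a guess.
Lip = np.linalg.norm(Phi, 2) * np.sqrt(Phi.shape[0])
w_best = subgradient_method(f, subgrad, np.zeros(20), R=10.0, Lip=Lip)
print(f(w_best))
```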

Can the sub-gradient replace the gradient?
- No majorization minimization
- $-g_f(x)$ is not even a descent direction. E.g. for $f(x_1, x_2) \equiv |x_1| + 2|x_2|$, the vector $(1, 2)$ is a valid sub-gradient at the point $(1, 0)$, yet moving along $-(1, 2)$ increases $f$.
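A three-line numerical check of this example (illustrative):

```python
import numpy as np

f = lambda x: abs(x[0]) + 2 * abs(x[1])
x = np.array([1.0, 0.0])
g = np.array([1.0, 2.0])              # a valid sub-gradient of f at (1, 0)
for t in [0.1, 0.01, 0.001]:
    print(t, f(x - t * g) - f(x))     # positive: f increases along -g
```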

How far can the sub-gradient take us?
Expect slower than $O\!\left(\frac{1}{k}\right)$.

How far can the sub-gradient take us?
Theorem [Ne04]: Let $\|x_0 - x^*\| \le R$ and let $L$ be the Lipschitz constant of $f$ over this ball (it always exists!). Then the sequence generated by sub-gradient descent satisfies
$$\min_{i \in \{1, \ldots, k\}} f(x_i) - f(x^*) \le \frac{L R}{\sqrt{k+1}}.$$
Proof sketch:
- $2 s_k \Delta_k \le r_k^2 - r_{k+1}^2 + s_k^2\,\|g_f(x_k)\|^2$
- LHS $\le \dfrac{r_0^2 + \sum_{i=0}^{k} s_i^2 \|g_f(x_i)\|^2}{2 \sum_{i=0}^{k} s_i} \le \dfrac{R^2 + L^2 \sum_{i=0}^{k} s_i^2}{2 \sum_{i=0}^{k} s_i}$; choose $s_k = \dfrac{R}{L\sqrt{k+1}}$.

Is this optimal?
Theorem [Ne04]: For any $k \le n-1$ and any $x_0$ with $\|x_0 - x^*\| \le R$, there exists a convex $f$ with constant $L$ over the ball such that, for any first order method,
$$f(x_k) - f(x^*) \ge \frac{L R}{2\,(1 + \sqrt{k+1})}.$$
Proof sketch: choose a function such that $x_k \in \mathrm{lin}\{g_f(x_0), \ldots, g_f(x_{k-1})\} \subset \mathbb{R}^{k,n}$.

Summary of Non-smooth Unconstrained
Sub-gradient descent method: $\epsilon \approx O\!\left(\frac{1}{\sqrt{k}}\right)$
- Sub-linear, slower than the smooth case
- But optimal!
- Can do better with additional structure (later)

Summary of Unconstrained Case

Bibliography
[Ne04] Nesterov, Yurii. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004. http://hdl.handle.net/2078.1/116858
[Ne83] Nesterov, Yurii. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, Vol. 27(2), pp. 372-376, 1983.
[Mo12] Moritz Hardt, Guy N. Rothblum and Rocco A. Servedio. Private data release via learning thresholds. SODA 2012, pp. 168-187.
[Be09] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, Vol. 2(1), pp. 183-202, 2009.
[De13] Olivier Devolder, François Glineur and Yurii Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 2013.