A Numerical Analysis Approach to Convex Optimization
Richard Peng, Georgia Tech. Jan 11, 2019.
Based on projects joint with Deeksha Adil (University of Toronto), Rasmus Kyng (Harvard), Sushant Sachdeva (University of Toronto), and Di Wang (Georgia Tech)
Outline Convex (norm) optimization High accuracy vs. approximate Constructing residual problems Applications / extensions
Convex Optimization Minimize f(x) For some convex f
Norm Optimization: Minimize ‖x‖_p subject to Ax = b, where ‖x‖_p = (∑_i |x_i|^p)^{1/p}
p = 1 / ∞: complete for linear programs
p = 2: systems of linear equations
LASSO / compressive sensing: p = 1
p = 2: Minimize ‖x‖_2 subject to Ax = b. ‖x‖_2 is the Euclidean distance. Equivalent to solving the linear system A^T A x = A^T b.
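A minimal numpy sketch of this p = 2 case (function name and example data are illustrative, not from the talk): the minimum 2-norm solution of Ax = b comes from one linear system solve.

    import numpy as np

    def min_two_norm(A, b):
        # Minimize ||x||_2 subject to Ax = b, assuming A has full row rank:
        # the optimum is x = A^T (A A^T)^{-1} b, i.e. a single linear system solve.
        y = np.linalg.solve(A @ A.T, b)
        return A.T @ y

    A = np.array([[1.0, 2.0, 0.0],
                  [0.0, 1.0, 1.0]])
    b = np.array([1.0, 2.0])
    x = min_two_norm(A, b)
    print(x, A @ x)  # A @ x reproduces b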
p = 1: Minimize ‖x‖_1 subject to Ax = b. ‖x‖_1 is the Manhattan distance. Equivalent to linear programming.
p = ∞: Minimize ‖x‖_∞ subject to Ax = b. ‖x‖_∞ is the maximum absolute value of an entry. Also equivalent to linear programming.
p = 4: Minimize ‖x‖_4 subject to Ax = b.
Previous results: interior point methods; homotopy methods [Bubeck-Cohen-Lee-Li `18]; accelerated gradient descent [Bullins `18].
Main idea: the p = 4 case has more in common with p = 2 than with p = 1 and p = ∞.
Algorithms motivated by the p = 2 case:
Faster p-norm regression: O_p(m^{<1/3} log(𝜖^{-1})) linear system solves, and O_p(m^ω log(𝜖^{-1})) time
p-norm flows in O_p(m^{1+O(1/p)} log(𝜖^{-1})) time
In progress: replace O_p(·) by O(p)
m^? log(𝜖^{-1}) linear system solves:
Interior point: 1/2 for all p
[Bubeck-Cohen-Lee-Li `18]: |1/2 - 1/p|
[Bullins `18 (indep)]: 1/5 for p = 4
Our result: |1/2 - 1/p| / (1 + |1/2 - 1/p|)
(Plot: the exponent as a function of p, comparing BCLL and our result.)
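A quick sanity check of these exponents (a small illustrative script, not from the talk); for p = 4 the new exponent is 0.25/1.25 = 0.2, matching the 1/5 of [Bullins `18]:

    # Exponent ? in m^? log(1/eps) linear system solves, for a few values of p.
    for p in [1.5, 2, 3, 4, 8, 16]:
        bcll = abs(0.5 - 1.0 / p)      # [Bubeck-Cohen-Lee-Li `18]
        ours = bcll / (1 + bcll)       # this talk; interior point is 0.5 for every p
        print(p, bcll, ours)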
Outline Convex (norm) optimization High accuracy vs. approximate Constructing residual problems Applications / extensions
What are we good at optimizing? p = 2, aka, solving linear systems Reasons: Gaussian elimination Approximate Gaussian elimination Easy to correct errors
What are we good at optimizing? p = 2, aka, solving linear systems Approximately minimizing symmetric functions Reasons: errors are allowed Gradient descent (e.g. [Durfee-Lai-Sawlani `18]) Multiplicative weights / regret minimization Iterative reweighted least squares
A constant factor approximation: Minimize ∑_i x_i^4 subject to Ax = b, via Minimize ∑_i w_i x_i^2 subject to Ax = b.
Adjust w_i based on x_i; done if w_i = x_i^2.
Variants of this give an O(1)-approximation:
[Chin-Madry-Miller-P `13]: m^{1/3} iterations for p = 1 and ∞
[Adil-Kyng-P-Sachdeva `19]: m^{|1/2 - 1/p| / (1 + |1/2 - 1/p|)} iterations
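A naive iteratively reweighted least squares sketch of this reweighting idea (illustrative only; the actual algorithms in [Chin-Madry-Miller-P `13] and [Adil-Kyng-P-Sachdeva `19] adjust the weights and steps more carefully):

    import numpy as np

    def irls_fourth_norm(A, b, iters=50, eps=1e-8):
        # min sum_i x_i^4  s.t.  Ax = b, approximated by repeatedly solving
        # min sum_i w_i x_i^2  s.t.  Ax = b  and resetting w_i ~ x_i^2
        # (the fixed point w_i = x_i^2 from the slide).
        m = A.shape[1]
        w = np.ones(m)
        for _ in range(iters):
            winv = 1.0 / w
            # Minimizer of x^T W x s.t. Ax = b is W^{-1} A^T (A W^{-1} A^T)^{-1} b.
            y = np.linalg.solve((A * winv) @ A.T, b)
            x = winv * (A.T @ y)
            w = x * x + eps
        return x

    A = np.array([[1.0, 2.0, 0.0],
                  [0.0, 1.0, 1.0]])
    b = np.array([1.0, 2.0])
    print(irls_fourth_norm(A, b))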
But… this only gives a 2-approximation to min ∑_i x_i^4 subject to Ax = b.
But… Incomplete Cholesky: solve A^T A x = A^T b by pretending A^T A is sparse.
Why are we good at p = 2? Residual problem at x: Minimize ∑_i (x_i + Δ_i)^2 - x_i^2 subject to A(x + Δ) = b.
Simplify: Minimize ∑_i 2 x_i Δ_i + Δ_i^2 s.t. AΔ = 0.
Binary/line search on the linear term: fix ∑_i 2 x_i Δ_i = 2 x^T Δ = ⍺, then Minimize ∑_i Δ_i^2 s.t. AΔ = 0. Another instance of 2-norm minimization!
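A small sketch of this reduction (illustrative names; assumes the combined constraint matrix has full row rank): fixing the linear term to ⍺ turns the residual problem back into constrained 2-norm minimization.

    import numpy as np

    def residual_step(A, x, alpha):
        # Residual problem for f(x) = sum_i x_i^2 at the current x:
        #   min sum_i 2 x_i d_i + d_i^2   s.t.  A d = 0.
        # Fix the linear term 2 x^T d = alpha (its value found by binary/line search);
        # what remains is  min ||d||_2^2  s.t.  C d = rhs  with C = [A; 2 x^T].
        C = np.vstack([A, 2.0 * x])
        rhs = np.concatenate([np.zeros(A.shape[0]), [alpha]])
        return C.T @ np.linalg.solve(C @ C.T, rhs)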
Iterative refinement for Minimize f(x) subject to Ax = b.
Repeatedly: create the residual problem f_res( ); approximately minimize f_res(Δ) s.t. AΔ = 0; update x ← x + 0.1 Δ.
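A skeleton of this loop (illustrative; approx_residual_min stands in for whichever approximate residual solver is used, and x0 is assumed to satisfy A x0 = b):

    def iterative_refinement(f, approx_residual_min, x0, steps=100, step_size=0.1):
        # approx_residual_min(x) returns a Delta with A Delta = 0 that
        # approximately minimizes f_res(Delta) = f(x + Delta) - f(x).
        x = x0
        values = [f(x)]
        for _ in range(steps):
            delta = approx_residual_min(x)
            x = x + step_size * delta      # x <- x + 0.1 Delta
            values.append(f(x))
        return x, values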
Outline Convex (norm) optimization High accuracy vs. approximate Constructing residual problems Applications / extensions
Simplifications: Ignore A; want f_res(Δ) to approximate f(x + Δ) - f(x). Suffices to approximate each coordinate, e.g. (1 + Δ)^4 - 1^4.
Gradient: consider Δ → 0. Then f(x + Δ) - f(x) ≈ ⟨grad f(x), Δ⟩, so we must have f_res(Δ) = ⟨grad f(x), Δ⟩ + h_res(Δ).
(Plot: (1 + Δ)^4 - 1^4 and its gradient term near Δ = 0.)
Higher Order Terms: scale down the step size. (x + Δ)^2 - x^2 = 2Δx + Δ^2, unconditionally, even when Δx < 0!
A factor-κ error in the higher order terms can be absorbed by making 1/κ-sized steps.
Equivalent goal for h_res( ): symmetric around 0, with h_res(Δ) ≅ f(x + Δ) - f(x) - ⟨grad f(x), Δ⟩.
The approximation factor determines the iteration count.
This is the Bregman divergence: the difference between f( ) and its first order approximation.
Bregman Divergence of x^4: (x + Δ)^4 = x^4 + 4Δx^3 + 6Δ^2x^2 + 4Δ^3x + Δ^4, so Div(x + Δ, x) = 6Δ^2x^2 + 4Δ^3x + Δ^4. The 4Δ^3x term is not symmetric in Δ.
Drop the odd term?!? (Plot: Div(1 + Δ, 1) is sandwiched between 0.1(6Δ^2 + Δ^4) and 3(6Δ^2 + Δ^4).)
Prove this? 6Δ^2x^2 + 4Δ^3x + Δ^4 ≅ Δ^2x^2 + Δ^4.
Proof: by the arithmetic-geometric mean inequality (a^2 + b^2 ≥ 2|ab|, applied with a = √5 |Δx| and b = (2/√5) Δ^2): |4Δ^3x| ≤ 5Δ^2x^2 + 0.8Δ^4.
Substituting back into Div(x + Δ, x): Δ^2x^2 + 0.2Δ^4 ≤ 6Δ^2x^2 + 4Δ^3x + Δ^4 ≤ 11Δ^2x^2 + 1.8Δ^4.
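A numeric spot check of this sandwich bound (a quick illustration, not a proof):

    import numpy as np

    rng = np.random.default_rng(0)
    x = 10 * rng.standard_normal(200000)
    d = 10 * rng.standard_normal(200000)
    div = 6 * d**2 * x**2 + 4 * d**3 * x + d**4    # Bregman divergence of x^4
    low = d**2 * x**2 + 0.2 * d**4
    high = 11 * d**2 * x**2 + 1.8 * d**4
    tol = 1e-9 * (np.abs(div) + 1)                 # allow for floating point roundoff
    print(np.all(low <= div + tol), np.all(div <= high + tol))   # True True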
Outline Convex (norm) optimization High accuracy vs. approximate Constructing residual problems Applications / extensions
Overall Algorithm: repeatedly approximate min ⟨Δ, x^{p-1}⟩ + ⟨Δ^2, x^{p-2}⟩ + ‖Δ‖_p^p s.t. AΔ = 0, and adjust x ← x + ⍺Δ.
Gradient term ⟨Δ, x^{p-1}⟩: ternary/line search on its value; it becomes another linear constraint.
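A hedged sketch of the ternary search over the gradient-term value (the inner solver is not shown; g(alpha) stands for the residual objective with the gradient term fixed to alpha, assumed unimodal in alpha):

    def ternary_search(g, lo, hi, iters=60):
        # Ternary search for the minimizer of a unimodal function g on [lo, hi].
        for _ in range(iters):
            a = lo + (hi - lo) / 3
            b = hi - (hi - lo) / 3
            if g(a) < g(b):
                hi = b
            else:
                lo = a
        return (lo + hi) / 2

    # Toy usage with a stand-in unimodal function:
    print(ternary_search(lambda t: (t - 1.7) ** 2, -10.0, 10.0))   # ~1.7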
Inner Problem: min ⟨Δ^2, x^{p-2}⟩ + ‖Δ‖_p^p s.t. AΔ = 0. Symmetric intermediate problems!
Modify multiplicative weights: O_p(m^{|p-2|/|3p-2|} log(𝜖^{-1})) iterations.
Total time can be reduced to O_p(m^ω log(𝜖^{-1})) via lazy updates to inverses.
Difference Constraints: on graphs, B is the edge-vertex incidence matrix: B_{eu} = -1/1 for the endpoints u of edge e, 0 everywhere else. B^T f = b: f has residues b. f = Bɸ: f is a potential flow.
p = 1: B^T f = b is shortest path; f = Bɸ is min-cut.
p = 2: electrical flow; electrical voltage.
p = ∞: max-flow; difference constraints.
1 < p < ∞: p-Laplacians, p-Lipschitz learning.
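A tiny illustrative example of the edge-vertex incidence matrix on a 3-vertex path:

    import numpy as np

    # Path graph 1 - 2 - 3, one row per edge: -1 at the tail, +1 at the head.
    B = np.array([[-1.0, 1.0, 0.0],
                  [0.0, -1.0, 1.0]])

    f = np.array([2.0, 2.0])         # route 2 units along the path
    print(B.T @ f)                   # residues b = [-2, 0, 2]: net inflow at each vertex

    phi = np.array([0.0, 1.0, 2.0])
    print(B @ phi)                   # potential flow f = B phi: edge flows are potential differences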
p-norm flows for large p: the intermediate problem is min (2-norm + ∞-norm). On unit capacity graphs, [Spielman-Teng `04] works! Reason: any tree has small total L∞ stretch. Main technical hurdle: sparsification.
Large p (in progress): overhead O(p); easier once p > m^{1/2}. Limitation: change in the 2nd order derivative; x^{p-2} is stable up to a factor of 1/(p-2). Where things get messy: numerical precision.
Questions / Future directions: sparsification for mixed 2/p norms; 4-norm flows (currently the best is about m^{1.2}); Incomplete Cholesky for p-norm minimization?