A Numerical Analysis Approach to Convex Optimization
Richard Peng, Georgia Tech. Jan 11, 2019.
Based on projects joint with Deeksha Adil (University of Toronto), Rasmus Kyng (Harvard), Sushant Sachdeva (University of Toronto), and Di Wang (Georgia Tech)
Outline Convex (norm) optimization High accuracy vs. approximate Constructing residual problems Applications / extensions
Convex Optimization Minimize f(x) For some convex f
Norm Optimization: Minimize ‖x‖_p subject to Ax = b, where ‖x‖_p = (∑_i |x_i|^p)^{1/p}
p = 1 / ∞: complete for linear programs
p = 2: systems of linear equations
LASSO / compressive sensing: p = 1
p = 2: Minimize ‖x‖_2 subject to Ax = b. ‖x‖_2 is the Euclidean distance. Equivalent to solving the linear system A^T A x = A^T b.
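A minimal numpy sketch of this p = 2 case (function name and example data are illustrative, not from the talk): the minimum 2-norm solution of Ax = b comes from one linear system solve.

    import numpy as np

    def min_two_norm(A, b):
        # Minimize ||x||_2 subject to Ax = b, assuming A has full row rank:
        # the optimum is x = A^T (A A^T)^{-1} b, i.e. a single linear system solve.
        y = np.linalg.solve(A @ A.T, b)
        return A.T @ y

    A = np.array([[1.0, 2.0, 0.0],
                  [0.0, 1.0, 1.0]])
    b = np.array([1.0, 2.0])
    x = min_two_norm(A, b)
    print(x, A @ x)  # A @ x reproduces b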
p = 1: Minimize ‖x‖_1 subject to Ax = b. ‖x‖_1 is the Manhattan distance. Equivalent to linear programming.
p = ∞: Minimize ‖x‖_∞ subject to Ax = b. ‖x‖_∞ is the maximum absolute value of an entry. Also equivalent to linear programming.
p = 4: Minimize ‖x‖_4 subject to Ax = b.
Previous results: interior point methods; homotopy methods [Bubeck-Cohen-Lee-Li `18]; accelerated gradient descent [Bullins `18].
Main idea: the p = 4 case has more in common with p = 2 than with p = 1 and p = ∞.
Algorithms motivated by the p = 2 case:
Faster p-norm regression: O_p(m^{<1/3} log(𝜖^{-1})) linear system solves, and O_p(m^ω log(𝜖^{-1})) time
p-norm flows in O_p(m^{1+O(1/p)} log(𝜖^{-1})) time
In progress: replace O_p(·) by O(p)
m^? log(𝜖^{-1}) linear system solves:
Interior point: 1/2 for all p
[Bubeck-Cohen-Lee-Li `18]: |1/2 - 1/p|
[Bullins `18 (indep)]: 1/5 for p = 4
Our result: |1/2 - 1/p| / (1 + |1/2 - 1/p|)
(Plot: the exponent as a function of p, comparing BCLL and our result.)
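A quick sanity check of these exponents (a small illustrative script, not from the talk); for p = 4 the new exponent is 0.25/1.25 = 0.2, matching the 1/5 of [Bullins `18]:

    # Exponent ? in m^? log(1/eps) linear system solves, for a few values of p.
    for p in [1.5, 2, 3, 4, 8, 16]:
        bcll = abs(0.5 - 1.0 / p)      # [Bubeck-Cohen-Lee-Li `18]
        ours = bcll / (1 + bcll)       # this talk; interior point is 0.5 for every p
        print(p, bcll, ours)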
Outline Convex (norm) optimization High accuracy vs. approximate Constructing residual problems Applications / extensions
What are we good at optimizing? p = 2, aka, solving linear systems Reasons: Gaussian elimination Approximate Gaussian elimination Easy to correct errors
What are we good at optimizing? p = 2, aka, solving linear systems Approximately minimizing symmetric functions Reasons: errors are allowed Gradient descent (e.g. [Durfee-Lai-Sawlani `18]) Multiplicative weights / regret minimization Iterative reweighted least squares
A constant factor approximation: Minimize ∑_i x_i^4 subject to Ax = b, via Minimize ∑_i w_i x_i^2 subject to Ax = b.
Adjust w_i based on x_i; done if w_i = x_i^2.
Variants of this give an O(1)-approximation:
[Chin-Madry-Miller-P `13]: m^{1/3} iterations for p = 1 and ∞
[Adil-Kyng-P-Sachdeva `19]: m^{|1/2 - 1/p| / (1 + |1/2 - 1/p|)} iterations
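A naive iteratively reweighted least squares sketch of this reweighting idea (illustrative only; the actual algorithms in [Chin-Madry-Miller-P `13] and [Adil-Kyng-P-Sachdeva `19] adjust the weights and steps more carefully):

    import numpy as np

    def irls_fourth_norm(A, b, iters=50, eps=1e-8):
        # min sum_i x_i^4  s.t.  Ax = b, approximated by repeatedly solving
        # min sum_i w_i x_i^2  s.t.  Ax = b  and resetting w_i ~ x_i^2
        # (the fixed point w_i = x_i^2 from the slide).
        m = A.shape[1]
        w = np.ones(m)
        for _ in range(iters):
            winv = 1.0 / w
            # Minimizer of x^T W x s.t. Ax = b is W^{-1} A^T (A W^{-1} A^T)^{-1} b.
            y = np.linalg.solve((A * winv) @ A.T, b)
            x = winv * (A.T @ y)
            w = x * x + eps
        return x

    A = np.array([[1.0, 2.0, 0.0],
                  [0.0, 1.0, 1.0]])
    b = np.array([1.0, 2.0])
    print(irls_fourth_norm(A, b))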
But… this only gives a 2-approximation to min ∑_i x_i^4 subject to Ax = b.
But… Incomplete Cholesky: solve A^T A x = A^T b by pretending A^T A is sparse.
Why are we good at p = 2? Residual problem at x: Minimize ∑_i (x_i + Δ_i)^2 - x_i^2 subject to A(x + Δ) = b.
Simplify: Minimize ∑_i 2 x_i Δ_i + Δ_i^2 s.t. AΔ = 0.
Binary/line search on the linear term: fix ∑_i 2 x_i Δ_i = 2 x^T Δ = ⍺, then Minimize ∑_i Δ_i^2 s.t. AΔ = 0. Another instance of 2-norm minimization!
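A small sketch of this reduction (illustrative names; assumes the combined constraint matrix has full row rank): fixing the linear term to ⍺ turns the residual problem back into constrained 2-norm minimization.

    import numpy as np

    def residual_step(A, x, alpha):
        # Residual problem for f(x) = sum_i x_i^2 at the current x:
        #   min sum_i 2 x_i d_i + d_i^2   s.t.  A d = 0.
        # Fix the linear term 2 x^T d = alpha (its value found by binary/line search);
        # what remains is  min ||d||_2^2  s.t.  C d = rhs  with C = [A; 2 x^T].
        C = np.vstack([A, 2.0 * x])
        rhs = np.concatenate([np.zeros(A.shape[0]), [alpha]])
        return C.T @ np.linalg.solve(C @ C.T, rhs)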
Iterative refinement for Minimize f(x) subject to Ax = b.
Repeatedly: create the residual problem f_res( ); approximately minimize f_res(Δ) s.t. AΔ = 0; update x ← x + 0.1 Δ.
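A skeleton of this loop (illustrative; approx_residual_min stands in for whichever approximate residual solver is used, and x0 is assumed to satisfy A x0 = b):

    def iterative_refinement(f, approx_residual_min, x0, steps=100, step_size=0.1):
        # approx_residual_min(x) returns a Delta with A Delta = 0 that
        # approximately minimizes f_res(Delta) = f(x + Delta) - f(x).
        x = x0
        values = [f(x)]
        for _ in range(steps):
            delta = approx_residual_min(x)
            x = x + step_size * delta      # x <- x + 0.1 Delta
            values.append(f(x))
        return x, values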
Outline Convex (norm) optimization High accuracy vs. approximate Constructing residual problems Applications / extensions
Simplifications: Ignore A; want f_res(Δ) to approximate f(x + Δ) - f(x). Suffices to approximate each coordinate, e.g. (1 + Δ)^4 - 1^4.
Gradient: consider Δ → 0. Then f(x + Δ) - f(x) ≈ ⟨grad f(x), Δ⟩, so we must have f_res(Δ) = ⟨grad f(x), Δ⟩ + h_res(Δ).
(Plot: (1 + Δ)^4 - 1^4 and its gradient term near Δ = 0.)
Higher Order Terms: scale down the step size. (x + Δ)^2 - x^2 = 2Δx + Δ^2, unconditionally, even when Δx < 0!
A factor-κ error in the higher order terms can be absorbed by making 1/κ-sized steps.
Equivalent goal for h_res( ): symmetric around 0, with h_res(Δ) ≅ f(x + Δ) - f(x) - ⟨grad f(x), Δ⟩.
The approximation factor determines the iteration count.
This is the Bregman divergence: the difference between f( ) and its first order approximation.
Bregman Divergence of x^4: (x + Δ)^4 = x^4 + 4Δx^3 + 6Δ^2x^2 + 4Δ^3x + Δ^4, so Div(x + Δ, x) = 6Δ^2x^2 + 4Δ^3x + Δ^4. The 4Δ^3x term is not symmetric in Δ.
Drop the odd term?!? (Plot: Div(1 + Δ, 1) is sandwiched between 0.1(6Δ^2 + Δ^4) and 3(6Δ^2 + Δ^4).)
Prove this? 6Δ^2x^2 + 4Δ^3x + Δ^4 ≅ Δ^2x^2 + Δ^4.
Proof: by the arithmetic-geometric mean inequality (a^2 + b^2 ≥ 2|ab|, applied with a = √5 |Δx| and b = (2/√5) Δ^2): |4Δ^3x| ≤ 5Δ^2x^2 + 0.8Δ^4.
Substituting back into Div(x + Δ, x): Δ^2x^2 + 0.2Δ^4 ≤ 6Δ^2x^2 + 4Δ^3x + Δ^4 ≤ 11Δ^2x^2 + 1.8Δ^4.
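A numeric spot check of this sandwich bound (a quick illustration, not a proof):

    import numpy as np

    rng = np.random.default_rng(0)
    x = 10 * rng.standard_normal(200000)
    d = 10 * rng.standard_normal(200000)
    div = 6 * d**2 * x**2 + 4 * d**3 * x + d**4    # Bregman divergence of x^4
    low = d**2 * x**2 + 0.2 * d**4
    high = 11 * d**2 * x**2 + 1.8 * d**4
    tol = 1e-9 * (np.abs(div) + 1)                 # allow for floating point roundoff
    print(np.all(low <= div + tol), np.all(div <= high + tol))   # True True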
Outline Convex (norm) optimization High accuracy vs. approximate Constructing residual problems Applications / extensions
Overall Algorithm: repeatedly approximate min ⟨Δ, x^{p-1}⟩ + ⟨Δ^2, x^{p-2}⟩ + ‖Δ‖_p^p s.t. AΔ = 0, and adjust x ← x + ⍺Δ.
Gradient term ⟨Δ, x^{p-1}⟩: ternary/line search on its value; it becomes another linear constraint.
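A hedged sketch of the ternary search over the gradient-term value (the inner solver is not shown; g(alpha) stands for the residual objective with the gradient term fixed to alpha, assumed unimodal in alpha):

    def ternary_search(g, lo, hi, iters=60):
        # Ternary search for the minimizer of a unimodal function g on [lo, hi].
        for _ in range(iters):
            a = lo + (hi - lo) / 3
            b = hi - (hi - lo) / 3
            if g(a) < g(b):
                hi = b
            else:
                lo = a
        return (lo + hi) / 2

    # Toy usage with a stand-in unimodal function:
    print(ternary_search(lambda t: (t - 1.7) ** 2, -10.0, 10.0))   # ~1.7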
Inner Problem: min ⟨Δ^2, x^{p-2}⟩ + ‖Δ‖_p^p s.t. AΔ = 0. Symmetric intermediate problems!
Modify multiplicative weights: O_p(m^{|p-2|/|3p-2|} log(𝜖^{-1})) iterations.
Total time can be reduced to O_p(m^ω log(𝜖^{-1})) via lazy updates to inverses.
Difference Constraints: on graphs, B is the edge-vertex incidence matrix: B_{eu} = -1/1 for the endpoints u of edge e, 0 everywhere else. B^T f = b: f has residues b. f = Bɸ: f is a potential flow.
p = 1: B^T f = b is shortest path; f = Bɸ is min-cut.
p = 2: electrical flow; electrical voltage.
p = ∞: max-flow; difference constraints.
1 < p < ∞: p-Laplacians, p-Lipschitz learning.
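A tiny illustrative example of the edge-vertex incidence matrix on a 3-vertex path:

    import numpy as np

    # Path graph 1 - 2 - 3, one row per edge: -1 at the tail, +1 at the head.
    B = np.array([[-1.0, 1.0, 0.0],
                  [0.0, -1.0, 1.0]])

    f = np.array([2.0, 2.0])         # route 2 units along the path
    print(B.T @ f)                   # residues b = [-2, 0, 2]: net inflow at each vertex

    phi = np.array([0.0, 1.0, 2.0])
    print(B @ phi)                   # potential flow f = B phi: edge flows are potential differences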
p-norm flows for large p: the intermediate problem is min (2-norm + ∞-norm). On unit capacity graphs, [Spielman-Teng `04] works! Reason: any tree has small total L∞ stretch. Main technical hurdle: sparsification.
Large p (in progress): overhead O(p); easier once p > m^{1/2}. Limitation: change in the 2nd order derivative; x^{p-2} is stable up to a factor of 1/(p-2). Where things get messy: numerical precision.
Questions / Future directions: sparsification for mixed 2/p norms; 4-norm flows (currently the best is about m^{1.2}); Incomplete Cholesky for p-norm minimization?