A Numerical Analysis Approach to Convex Optimization

1 A Numerical Analysis Approach to Convex Optimization
Richard Peng, Georgia Tech. Jan 11, 2019. Based on joint projects with Deeksha Adil (University of Toronto), Rasmus Kyng (Harvard), Sushant Sachdeva (University of Toronto), and Di Wang (Georgia Tech).

2 Outline Convex (norm) optimization High accuracy vs. approximate
Constructing residual problems Applications / extensions

3 Convex Optimization Minimize f(x), for some convex f

4 Norm Optimization Minimize ‖x‖_p Subject to: Ax = b
‖x‖_p = (∑_i |x_i|^p)^{1/p}. p = 1 / ∞: complete for linear programs. p = 2: systems of linear equations. p = 1: LASSO / compressive sensing.

5 p = 2 Minimize ‖x‖₂ Subject to: Ax = b
‖x‖₂: Euclidean distance. Equivalent to solving a system of linear equations (AAᵀy = b, then x = Aᵀy).
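
For concreteness, here is a minimal numpy sketch of that reduction (the particular A and b below are just a random toy instance, not from the talk): the minimum 2-norm solution of Ax = b comes out of one linear system solve.

    import numpy as np

    # Minimum 2-norm solution of Ax = b: solve A Aᵀ y = b, then set x = Aᵀ y.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 6))   # 3 constraints, 6 variables (toy instance)
    b = rng.standard_normal(3)

    y = np.linalg.solve(A @ A.T, b)   # one linear system solve
    x = A.T @ y

    print(np.allclose(A @ x, b))      # feasible: True
    print(np.linalg.norm(x))          # the minimum possible ‖x‖₂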

6 p = 1 Minimize ‖x‖₁ Subject to: Ax = b
‖x‖₁: Manhattan distance. Equivalent to linear programming.

7 p = ∞ Minimize ‖x‖∞ Subject to: Ax = b
‖x‖∞: maximum absolute value of an entry. Also equivalent to linear programming.

8 p = 4 Minimize ‖x‖₄ Subject to: Ax = b Previous results:
Interior point methods; homotopy methods [Bubeck-Cohen-Lee-Li `18]; accelerated gradient descent [Bullins `18].

9 Main idea The p = 4 case has more in common with p = 2 than with p = 1 and p = ∞

10 Main idea The p = 4 case has more in common with p = 2 than with p = 1 and p = ∞ Algorithms motivated by the p = 2 case: Faster p-norm regression: Op(m^{<1/3} log(1/ε)) linear system solves, and Op(m^ω log(1/ε)) time. p-norm flows in Op(m^{1+O(1/p)} log(1/ε)) time.

11 Main idea The p = 4 case has more in common with p = 2 than with p = 1 and p = ∞ Algorithms motivated by the p = 2 case: Faster p-norm regression: Op(m^{<1/3} log(1/ε)) linear system solves, and Op(m^ω log(1/ε)) time. p-norm flows in Op(m^{1+O(1/p)} log(1/ε)) time. In progress: replace Op() by O(p).

12 m^? log(1/ε) linear system solves
Interior point: exponent 1/2 for all p. [Bubeck-Cohen-Lee-Li `18]: |1/2 − 1/p|. [Bullins `18 (independent)]: 1/5 for p = 4. Our result: |1/2 − 1/p| / (1 + |1/2 − 1/p|). [Plot: exponent vs. p for BCLL and our result.]
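
As a quick illustration of these exponents (a throwaway calculation, not from the slides), note that for p = 4 the new exponent is 0.25 / 1.25 = 1/5, matching [Bullins `18]:

    # Exponent of m in the number of linear system solves, for a few p.
    for p in [3, 4, 8, 16, 100]:
        bcll = abs(0.5 - 1.0 / p)      # [Bubeck-Cohen-Lee-Li `18]
        ours = bcll / (1 + bcll)       # |1/2 - 1/p| / (1 + |1/2 - 1/p|)
        print(f"p = {p:3d}: interior point 0.500, BCLL {bcll:.3f}, ours {ours:.3f}")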

13 Outline Convex (norm) optimization High accuracy vs. approximate
Constructing residual problems Applications / extensions

14 What are we good at optimizing?
p = 2, aka, solving linear systems Reasons: Gaussian elimination Approximate Gaussian elimination Easy to correct errors

15 What are we good at optimizing?
p = 2, aka, solving linear systems Approximately minimizing symmetric functions Reasons: errors are allowed Gradient descent (e.g. [Durfee-Lai-Sawlani `18]) Multiplicative weights / regret minimization Iterative reweighted least squares

16 A constant factor approximation
Minimize ∑ x_i⁴ Subject to: Ax = b   vs.   Minimize ∑ w_i x_i² Subject to: Ax = b

17 A constant factor approximation
Adjust w_i based on x_i: Minimize ∑ x_i⁴ Subject to: Ax = b vs. Minimize ∑ w_i x_i² Subject to: Ax = b. Done if w_i = x_i². Variants of this give an O(1)-approximation: [Chin-Madry-Miller-P `13]: m^{1/3} iterations for p = 1 & ∞; [Adil-Kyng-P-Sachdeva `19]: m^{|1/2 − 1/p| / (1 + |1/2 − 1/p|)} iterations.
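
A minimal numpy sketch of this reweighting loop for p = 4 (the helper name irls_p4 is hypothetical; without the careful step-size and width control of [Chin-Madry-Miller-P `13] and [Adil-Kyng-P-Sachdeva `19], this bare loop is only a heuristic and may oscillate):

    import numpy as np

    def irls_p4(A, b, iters=100, eps=1e-12):
        # Repeatedly solve  min Σ w_i x_i²  s.t.  Ax = b,  then reset w_i = x_i².
        # Weighted least-norm step in closed form: x = W⁻¹Aᵀ (A W⁻¹ Aᵀ)⁻¹ b.
        w = np.ones(A.shape[1])
        for _ in range(iters):
            winv = 1.0 / w
            y = np.linalg.solve((A * winv) @ A.T, b)
            x = winv * (A.T @ y)
            w = np.maximum(x ** 2, eps)   # "done if w_i = x_i²"
        return x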

18 But… a 2-approximation to min ∑ x_i⁴ Subject to: Ax = b

19 But… Incomplete Cholesky: solve AᵀAx = Aᵀb by pretending AᵀA is sparse

20 Why are we good at p = 2? Residual problem at x:
Minimize ∑ (x_i + Δ_i)² − x_i² Subject to: A(x + Δ) = b

21 Why are we good at p = 2? Simplify: Residual problem at x:
Minimize ∑ 2x_iΔ_i + Δ_i² s.t. AΔ = 0. Residual problem at x: Minimize ∑ (x_i + Δ_i)² − x_i² Subject to: A(x + Δ) = b

22 Why are we good at p = 2? Simplify: Residual problem at x:
Minimize ∑ 2x_iΔ_i + Δ_i² s.t. AΔ = 0. Residual problem at x: Minimize ∑ (x_i + Δ_i)² − x_i² Subject to: A(x + Δ) = b. Binary/line search on the linear term: Minimize ∑ Δ_i² s.t. AΔ = 0 and ∑ 2x_iΔ_i = 2xᵀΔ = ⍺. Another instance of 2-norm minimization!

23 Iterative refinement Minimize f(x) Subject to: Ax = b Repeatedly:
Create the residual problem fres( ). Approximately minimize fres(Δ) s.t. AΔ = 0. Update x ← x + Δ.
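
Schematically, the outer loop looks like the following Python sketch (make_residual and approx_min stand in for the constructions on the following slides; all names here are placeholders, not code from the paper):

    def iterative_refinement(x, make_residual, approx_min, rounds):
        # Each round: build the residual problem at x, approximately minimize it
        # over { Δ : AΔ = 0 }, and step. A constant-factor approximate Δ still
        # reduces the remaining error by a constant factor, so O(log(1/ε)) rounds
        # give high accuracy.
        for _ in range(rounds):
            f_res = make_residual(x)    # residual objective at the current x
            delta = approx_min(f_res)   # crude (O(1)-approximate) minimizer
            x = x + delta
        return x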

24 Outline Convex (norm) optimization High accuracy vs. approximate
Constructing residual problems Applications / extensions

25 Simplifications Ignore A: fres(Δ) approximates f(x + Δ) − f(x)
Suffices to approximate each coordinate: (1 + Δ)⁴ − 1⁴

26 Gradient: consider Δ → 0: f(x + Δ) − f(x) ≈ <grad(x), Δ>
Must have fres(Δ) = <grad(x), Δ> + hres(Δ). [Plot: (1 + Δ)⁴ − 1⁴ and its gradient at Δ = 0.]

27 Higher Order Terms Scale down step size: (x + Δ)² − x² = 2Δx + Δ²

28 Higher Order Terms Scale down step size: (x + Δ)² − x² = 2Δx + Δ²
Unconditionally, even when Δx < 0!

29 Higher Order Terms Scale down step size: (x + Δ)² − x² = 2Δx + Δ²
Unconditionally, even when Δx < 0! A factor-κ error in the higher order terms can be absorbed by taking 1/κ-sized steps (scaling Δ down by κ shrinks the higher order terms by at least another factor of κ relative to the linear term).

30 Equivalent goal of hres()
Symmetric around 0: hres(Δ) ≅ f(x + Δ) − f(x) − <grad(x), Δ>. Approximation factor → iteration count.

31 Equivalent goal of hres()
Symmetric around 0: hres(Δ) ≅ f(x + Δ) − f(x) − <grad(x), Δ>. Approximation factor → iteration count. Bregman divergence: the difference between f() and its first-order approximation.
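
In code, the quantity hres approximates is just the Bregman divergence below (a small illustrative helper, not from the talk; f and grad are supplied as callables):

    import numpy as np

    def bregman_divergence(f, grad, x, delta):
        # D_f(x + delta, x) = f(x + delta) - f(x) - <grad f(x), delta>
        return f(x + delta) - f(x) - grad(x) @ delta

    # Example for f(x) = Σ x_i⁴:
    f = lambda x: np.sum(x ** 4)
    grad = lambda x: 4 * x ** 3
    print(bregman_divergence(f, grad, np.array([1.0]), np.array([0.5])))   # 2.0625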

32 Bregman Divergence of x⁴
(x + Δ)⁴ = x⁴ + 4Δx³ + 6Δ²x² + 4Δ³x + Δ⁴. Div(x + Δ, x) = 6Δ²x² + 4Δ³x + Δ⁴. The 4Δ³x term is not symmetric in Δ.

33 Drop the odd term?!? [Plot: Div(1 + Δ, 1) lies between 0.1(6Δ² + Δ⁴) and 3(6Δ² + Δ⁴).]

34 Prove this? 6Δ²x² + 4Δ³x + Δ⁴ ≅ Δ²x² + Δ⁴ Proof:
By the arithmetic-geometric mean inequality (a² + b² ≥ 2|ab|, applied with a = √5·Δx and b = (2/√5)·Δ²): |4Δ³x| ≤ 5Δ²x² + (4/5)Δ⁴

35 Prove this? 6Δ²x² + 4Δ³x + Δ⁴ ≅ Δ²x² + Δ⁴ Proof:
By the arithmetic-geometric mean inequality (a² + b² ≥ 2|ab|, applied with a = √5·Δx and b = (2/√5)·Δ²): |4Δ³x| ≤ 5Δ²x² + (4/5)Δ⁴. Substituting back into Div(x + Δ, x): Δ²x² + (1/5)Δ⁴ ≤ 6Δ²x² + 4Δ³x + Δ⁴ ≤ 11Δ²x² + (9/5)Δ⁴
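
A quick numerical spot check of this sandwich bound, with the constants as reconstructed above:

    import numpy as np

    # Check:  Δ²x² + Δ⁴/5  ≤  6Δ²x² + 4Δ³x + Δ⁴  ≤  11Δ²x² + (9/5)Δ⁴
    rng = np.random.default_rng(1)
    x = rng.standard_normal(100_000)
    d = rng.standard_normal(100_000)
    div = 6 * d**2 * x**2 + 4 * d**3 * x + d**4
    lo = d**2 * x**2 + d**4 / 5
    hi = 11 * d**2 * x**2 + 9 * d**4 / 5
    print(bool(np.all(lo <= div) and np.all(div <= hi)))   # True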

36 Outline Convex (norm) optimization High accuracy vs. approximate
Constructing residual problems Applications / extensions

37 Overall Algorithm Repeatedly approximate
min <Δ, x^{p−1}> + <Δ², x^{p−2}> + ‖Δ‖_p^p s.t. AΔ = 0 (for p = 4, the gradient term is <Δ, x³>), and adjust: x ← x + ⍺Δ. Gradient term <Δ, x^{p−1}>: ternary/line search on its value; it becomes another linear constraint.
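
A small helper evaluating this residual objective (a hypothetical illustration that drops the constant factors in front of each term, as the slide does):

    import numpy as np

    def residual_objective(delta, x, p=4):
        # <Δ, x^{p-1}>  +  <Δ², x^{p-2}>  +  ‖Δ‖_p^p
        return (delta @ (np.abs(x) ** (p - 2) * x)
                + (delta ** 2) @ (np.abs(x) ** (p - 2))
                + np.sum(np.abs(delta) ** p))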

38 Inner Problem min <Δ², x^{p−2}> + ‖Δ‖_p^p s.t. AΔ = 0
Symmetric intermediate problems! Modify multiplicative weights: Op(m^{|p−2|/|3p−2|} log(1/ε)) iterations. Total time can be reduced to Op(m^ω log(1/ε)) via lazy updates to inverses.

39 Difference Constraints
On graphs… edge-vertex incidence matrix B: B_eu = −1/+1 for the endpoints u of edge e, 0 everywhere else. Bᵀf = b: f has residues b. f = Bɸ: f is a potential flow.
    p        Bᵀf = b            f = Bɸ
    1        shortest path      min-cut
    2        electrical flow    electrical voltages
    ∞        max-flow           difference constraints
1 < p < ∞: p-Laplacians, p-Lipschitz learning
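
To make the matrix B concrete, here is a tiny numpy example (a toy path graph, not from the talk):

    import numpy as np

    # Edge-vertex incidence matrix of the path 0 - 1 - 2:
    # row e has -1 at the tail of edge e, +1 at its head, 0 elsewhere.
    edges = [(0, 1), (1, 2)]
    B = np.zeros((len(edges), 3))
    for e, (u, v) in enumerate(edges):
        B[e, u], B[e, v] = -1.0, 1.0

    f = np.array([1.0, 1.0])   # one unit of flow along the path
    print(B.T @ f)             # residues b = [-1, 0, 1]: source at 0, sink at 2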

40 p-norm flows for large p
Intermediate problem: min 2-norm + ∞-norm. On unit capacity graphs, [Spielman-Teng `04] works! Reason: any tree has small total ℓ∞ stretch. Main technical hurdle: sparsification.

41 Large p (in progress): Overhead: O(p). Easier after p > m^{1/2}
Limitation: change in the 2nd order derivative; x^{p−2} is stable only up to a factor of 1/(p−2). Where things get messy: numerical precision.

42 Questions / Future directions
Sparsification for mixed 2/p norms. 4-norm flows? Currently the best is about m^{1.2}. Incomplete Cholesky for p-norm minimization?

