A Numerical Analysis Approach to Convex Optimization

1 A Numerical Analysis Approach to Convex Optimization
Richard Peng, Georgia Tech. Jan 11, 2019. Based on joint projects with Deeksha Adil (University of Toronto), Rasmus Kyng (Harvard), Sushant Sachdeva (University of Toronto), and Di Wang (Georgia Tech).

2 Outline Convex (norm) optimization High accuracy vs. approximate
Constructing residual problems Applications / extensions

3 Convex Optimization Minimize f(x), for some convex f

4 Norm Optimization Minimize ‖x‖_p Subject to: Ax = b
‖x‖_p = (∑_i |x_i|^p)^{1/p}. p = 1 / ∞: complete for linear programs. p = 2: systems of linear equations. p = 1: LASSO / compressive sensing.

5 p = 2 Minimize ‖x‖₂ Subject to: Ax = b
‖x‖₂: Euclidean distance. Equivalent to solving a system of linear equations (AAᵀy = b, then x = Aᵀy).
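
For concreteness, here is a minimal numpy sketch of that reduction (the particular A and b below are just a random toy instance, not from the talk): the minimum 2-norm solution of Ax = b comes out of one linear system solve.

    import numpy as np

    # Minimum 2-norm solution of Ax = b: solve A Aᵀ y = b, then set x = Aᵀ y.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 6))   # 3 constraints, 6 variables (toy instance)
    b = rng.standard_normal(3)

    y = np.linalg.solve(A @ A.T, b)   # one linear system solve
    x = A.T @ y

    print(np.allclose(A @ x, b))      # feasible: True
    print(np.linalg.norm(x))          # the minimum possible ‖x‖₂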

6 p = 1 Minimize ‖x‖₁ Subject to: Ax = b
‖x‖₁: Manhattan distance. Equivalent to linear programming.

7 p = ∞ Minimize ‖x‖∞ Subject to: Ax = b
‖x‖∞: maximum absolute value of an entry. Also equivalent to linear programming.

8 p = 4 Minimize ‖x‖₄ Subject to: Ax = b Previous results:
Interior point methods; homotopy methods [Bubeck-Cohen-Lee-Li `18]; accelerated gradient descent [Bullins `18].

9 Main idea The p = 4 case has more in common with p = 2 than with p = 1 and p = ∞

10 Main idea The p = 4 case has more in common with p = 2 than with p = 1 and p = ∞ Algorithms motivated by the p = 2 case: Faster p-norm regression: Op(m^{<1/3} log(1/ε)) linear system solves, and Op(m^ω log(1/ε)) time. p-norm flows in Op(m^{1+O(1/p)} log(1/ε)) time.

11 Main idea The p = 4 case has more in common with p = 2 than with p = 1 and p = ∞ Algorithms motivated by the p = 2 case: Faster p-norm regression: Op(m^{<1/3} log(1/ε)) linear system solves, and Op(m^ω log(1/ε)) time. p-norm flows in Op(m^{1+O(1/p)} log(1/ε)) time. In progress: replace Op() by O(p).

12 m^? log(1/ε) linear system solves
Interior point: exponent 1/2 for all p. [Bubeck-Cohen-Lee-Li `18]: |1/2 − 1/p|. [Bullins `18 (independent)]: 1/5 for p = 4. Our result: |1/2 − 1/p| / (1 + |1/2 − 1/p|). [Plot: exponent vs. p for BCLL and our result.]
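
As a quick illustration of these exponents (a throwaway calculation, not from the slides), note that for p = 4 the new exponent is 0.25 / 1.25 = 1/5, matching [Bullins `18]:

    # Exponent of m in the number of linear system solves, for a few p.
    for p in [3, 4, 8, 16, 100]:
        bcll = abs(0.5 - 1.0 / p)      # [Bubeck-Cohen-Lee-Li `18]
        ours = bcll / (1 + bcll)       # |1/2 - 1/p| / (1 + |1/2 - 1/p|)
        print(f"p = {p:3d}: interior point 0.500, BCLL {bcll:.3f}, ours {ours:.3f}")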

13 Outline Convex (norm) optimization High accuracy vs. approximate
Constructing residual problems Applications / extensions

14 What are we good at optimizing?
p = 2, aka, solving linear systems Reasons: Gaussian elimination Approximate Gaussian elimination Easy to correct errors

15 What are we good at optimizing?
p = 2, aka, solving linear systems Approximately minimizing symmetric functions Reasons: errors are allowed Gradient descent (e.g. [Durfee-Lai-Sawlani `18]) Multiplicative weights / regret minimization Iterative reweighted least squares

16 A constant factor approximation
Minimize ∑ x_i⁴ Subject to: Ax = b   vs.   Minimize ∑ w_i x_i² Subject to: Ax = b

17 A constant factor approximation
Adjust w_i based on x_i: Minimize ∑ x_i⁴ Subject to: Ax = b vs. Minimize ∑ w_i x_i² Subject to: Ax = b. Done if w_i = x_i². Variants of this give an O(1)-approximation: [Chin-Madry-Miller-P `13]: m^{1/3} iterations for p = 1 & ∞; [Adil-Kyng-P-Sachdeva `19]: m^{|1/2 − 1/p| / (1 + |1/2 − 1/p|)} iterations.
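
A minimal numpy sketch of this reweighting loop for p = 4 (the helper name irls_p4 is hypothetical; without the careful step-size and width control of [Chin-Madry-Miller-P `13] and [Adil-Kyng-P-Sachdeva `19], this bare loop is only a heuristic and may oscillate):

    import numpy as np

    def irls_p4(A, b, iters=100, eps=1e-12):
        # Repeatedly solve  min Σ w_i x_i²  s.t.  Ax = b,  then reset w_i = x_i².
        # Weighted least-norm step in closed form: x = W⁻¹Aᵀ (A W⁻¹ Aᵀ)⁻¹ b.
        w = np.ones(A.shape[1])
        for _ in range(iters):
            winv = 1.0 / w
            y = np.linalg.solve((A * winv) @ A.T, b)
            x = winv * (A.T @ y)
            w = np.maximum(x ** 2, eps)   # "done if w_i = x_i²"
        return x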

18 But… a 2-approximation to min ∑ x_i⁴ Subject to: Ax = b

19 But… Incomplete Cholesky: solve AᵀAx = Aᵀb by pretending AᵀA is sparse

20 Why are we good at p = 2? Residual problem at x:
Minimize ∑ (x_i + Δ_i)² − x_i² Subject to: A(x + Δ) = b

21 Why are we good at p = 2? Simplify: Residual problem at x:
Minimize ∑ 2x_iΔ_i + Δ_i² s.t. AΔ = 0. Residual problem at x: Minimize ∑ (x_i + Δ_i)² − x_i² Subject to: A(x + Δ) = b

22 Why are we good at p = 2? Simplify: Residual problem at x:
Minimize ∑ 2x_iΔ_i + Δ_i² s.t. AΔ = 0. Residual problem at x: Minimize ∑ (x_i + Δ_i)² − x_i² Subject to: A(x + Δ) = b. Binary/line search on the linear term: Minimize ∑ Δ_i² s.t. AΔ = 0 and ∑ 2x_iΔ_i = 2xᵀΔ = ⍺. Another instance of 2-norm minimization!

23 Iterative refinement Minimize f(x) Subject to: Ax = b Repeatedly:
Create the residual problem fres( ). Approximately minimize fres(Δ) s.t. AΔ = 0. Update x ← x + Δ.
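
Schematically, the outer loop looks like the following Python sketch (make_residual and approx_min stand in for the constructions on the following slides; all names here are placeholders, not code from the paper):

    def iterative_refinement(x, make_residual, approx_min, rounds):
        # Each round: build the residual problem at x, approximately minimize it
        # over { Δ : AΔ = 0 }, and step. A constant-factor approximate Δ still
        # reduces the remaining error by a constant factor, so O(log(1/ε)) rounds
        # give high accuracy.
        for _ in range(rounds):
            f_res = make_residual(x)    # residual objective at the current x
            delta = approx_min(f_res)   # crude (O(1)-approximate) minimizer
            x = x + delta
        return x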

24 Outline Convex (norm) optimization High accuracy vs. approximate
Constructing residual problems Applications / extensions

25 Simplifications Ignore A: fres(Δ) approximates f(x + Δ) − f(x)
Suffices to approximate each coordinate: (1 + Δ)⁴ − 1⁴

26 Gradient: consider Δ → 0: f(x + Δ) − f(x) ≈ <grad(x), Δ>
Must have fres(Δ) = <grad(x), Δ> + hres(Δ). [Plot: (1 + Δ)⁴ − 1⁴ and its gradient at Δ = 0.]

27 Higher Order Terms Scale down step size: (x + Δ)² − x² = 2Δx + Δ²

28 Higher Order Terms Scale down step size: (x + Δ)² − x² = 2Δx + Δ²
Unconditionally, even when Δx < 0!

29 Higher Order Terms Scale down step size: (x + Δ)² − x² = 2Δx + Δ²
Unconditionally, even when Δx < 0! A factor-κ error in the higher order terms can be absorbed by taking 1/κ-sized steps (scaling Δ down by κ shrinks the higher order terms by at least another factor of κ relative to the linear term).

30 Equivalent goal of hres()
Symmetric around 0: hres(Δ) ≅ f(x + Δ) − f(x) − <grad(x), Δ>. Approximation factor → iteration count.

31 Equivalent goal of hres()
Symmetric around 0: hres(Δ) ≅ f(x + Δ) − f(x) − <grad(x), Δ>. Approximation factor → iteration count. Bregman divergence: the difference between f() and its first-order approximation.
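
In code, the quantity hres approximates is just the Bregman divergence below (a small illustrative helper, not from the talk; f and grad are supplied as callables):

    import numpy as np

    def bregman_divergence(f, grad, x, delta):
        # D_f(x + delta, x) = f(x + delta) - f(x) - <grad f(x), delta>
        return f(x + delta) - f(x) - grad(x) @ delta

    # Example for f(x) = Σ x_i⁴:
    f = lambda x: np.sum(x ** 4)
    grad = lambda x: 4 * x ** 3
    print(bregman_divergence(f, grad, np.array([1.0]), np.array([0.5])))   # 2.0625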

32 Bregman Divergence of x⁴
(x + Δ)⁴ = x⁴ + 4Δx³ + 6Δ²x² + 4Δ³x + Δ⁴. Div(x + Δ, x) = 6Δ²x² + 4Δ³x + Δ⁴. The 4Δ³x term is not symmetric in Δ.

33 Drop the odd term?!? [Plot: Div(1 + Δ, 1) lies between 0.1(6Δ² + Δ⁴) and 3(6Δ² + Δ⁴).]

34 Prove this? 6Δ²x² + 4Δ³x + Δ⁴ ≅ Δ²x² + Δ⁴ Proof:
By the arithmetic-geometric mean inequality (a² + b² ≥ 2|ab|, applied with a = √5·Δx and b = (2/√5)·Δ²): |4Δ³x| ≤ 5Δ²x² + (4/5)Δ⁴

35 Prove this? 6Δ²x² + 4Δ³x + Δ⁴ ≅ Δ²x² + Δ⁴ Proof:
By the arithmetic-geometric mean inequality (a² + b² ≥ 2|ab|, applied with a = √5·Δx and b = (2/√5)·Δ²): |4Δ³x| ≤ 5Δ²x² + (4/5)Δ⁴. Substituting back into Div(x + Δ, x): Δ²x² + (1/5)Δ⁴ ≤ 6Δ²x² + 4Δ³x + Δ⁴ ≤ 11Δ²x² + (9/5)Δ⁴
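
A quick numerical spot check of this sandwich bound, with the constants as reconstructed above:

    import numpy as np

    # Check:  Δ²x² + Δ⁴/5  ≤  6Δ²x² + 4Δ³x + Δ⁴  ≤  11Δ²x² + (9/5)Δ⁴
    rng = np.random.default_rng(1)
    x = rng.standard_normal(100_000)
    d = rng.standard_normal(100_000)
    div = 6 * d**2 * x**2 + 4 * d**3 * x + d**4
    lo = d**2 * x**2 + d**4 / 5
    hi = 11 * d**2 * x**2 + 9 * d**4 / 5
    print(bool(np.all(lo <= div) and np.all(div <= hi)))   # True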

36 Outline Convex (norm) optimization High accuracy vs. approximate
Constructing residual problems Applications / extensions

37 Overall Algorithm Repeatedly approximate
min <Δ, x^{p−1}> + <Δ², x^{p−2}> + ‖Δ‖_p^p s.t. AΔ = 0 (for p = 4, the gradient term is <Δ, x³>), and adjust: x ← x + ⍺Δ. Gradient term <Δ, x^{p−1}>: ternary/line search on its value; it becomes another linear constraint.
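
A small helper evaluating this residual objective (a hypothetical illustration that drops the constant factors in front of each term, as the slide does):

    import numpy as np

    def residual_objective(delta, x, p=4):
        # <Δ, x^{p-1}>  +  <Δ², x^{p-2}>  +  ‖Δ‖_p^p
        return (delta @ (np.abs(x) ** (p - 2) * x)
                + (delta ** 2) @ (np.abs(x) ** (p - 2))
                + np.sum(np.abs(delta) ** p))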

38 Inner Problem min <Δ², x^{p−2}> + ‖Δ‖_p^p s.t. AΔ = 0
Symmetric intermediate problems! Modify multiplicative weights: Op(m^{|p−2|/|3p−2|} log(1/ε)) iterations. Total time can be reduced to Op(m^ω log(1/ε)) via lazy updates to inverses.

39 Difference Constraints
On graphs… edge-vertex incidence matrix B: B_eu = −1/+1 for the endpoints u of edge e, 0 everywhere else. Bᵀf = b: f has residues b. f = Bɸ: f is a potential flow.
    p        Bᵀf = b            f = Bɸ
    1        shortest path      min-cut
    2        electrical flow    electrical voltages
    ∞        max-flow           difference constraints
1 < p < ∞: p-Laplacians, p-Lipschitz learning
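
To make the matrix B concrete, here is a tiny numpy example (a toy path graph, not from the talk):

    import numpy as np

    # Edge-vertex incidence matrix of the path 0 - 1 - 2:
    # row e has -1 at the tail of edge e, +1 at its head, 0 elsewhere.
    edges = [(0, 1), (1, 2)]
    B = np.zeros((len(edges), 3))
    for e, (u, v) in enumerate(edges):
        B[e, u], B[e, v] = -1.0, 1.0

    f = np.array([1.0, 1.0])   # one unit of flow along the path
    print(B.T @ f)             # residues b = [-1, 0, 1]: source at 0, sink at 2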

40 p-norm flows for large p
Intermediate problem: min 2-norm + ∞-norm. On unit capacity graphs, [Spielman-Teng `04] works! Reason: any tree has small total ℓ∞ stretch. Main technical hurdle: sparsification.

41 Large p (in progress): Overhead: O(p). Easier after p > m^{1/2}
Limitation: change in the 2nd order derivative; x^{p−2} is stable only up to a factor of 1/(p−2). Where things get messy: numerical precision.

42 Questions / Future directions
Sparsification for mixed 2/p norms. 4-norm flows? Currently the best is about m^{1.2}. Incomplete Cholesky for p-norm minimization?

