1
A Numerical Analysis Approach to Convex Optimization
Richard Peng, Georgia Tech, Jan 11, 2019. Based on joint projects with Deeksha Adil (University of Toronto), Rasmus Kyng (Harvard), Sushant Sachdeva (University of Toronto), and Di Wang (Georgia Tech)
2
Outline Convex (norm) optimization High accuracy vs. approximate
Constructing residual problems Applications / extensions
3
Convex Optimization Minimize f(x) For some convex f
4
Norm Optimization Minimize ‖x‖_p Subject to: Ax = b
‖x‖_p = (∑ |x_i|^p)^(1/p) p = 1 / ∞: complete for linear programs p = 2: systems of linear equations LASSO / compressive sensing: p = 1
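To make the definition concrete, here is a tiny numpy check (my own illustration, not from the slides): the p-norm gives the sum of absolute values at p = 1 and approaches the largest entry as p grows.

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])
for p in [1, 2, 4, 16]:
    # ||x||_p = (sum_i |x_i|^p)^(1/p)
    print(p, (np.abs(x) ** p).sum() ** (1.0 / p))
print("inf", np.abs(x).max())  # limit as p -> infinity: the largest |x_i|
```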
5
p = 2 Minimize ‖x‖_2 Subject to: Ax = b
‖x‖_2: Euclidean distance. Equivalent to solving A^T A x = A^T b
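As a minimal sketch of why p = 2 boils down to one linear solve (code I added, assuming A has full row rank; note it solves the equality-constrained form via a system in AA^T rather than the normal equations quoted above):

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])

# Minimum 2-norm point with Ax = b: x = A^T (A A^T)^{-1} b (A assumed full row rank).
y = np.linalg.solve(A @ A.T, b)
x = A.T @ y

print(np.allclose(A @ x, b))  # True: x is feasible
print(x, np.linalg.norm(x))   # the feasible point of smallest Euclidean norm
```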
6
p = 1 Minimize ‖x‖_1 Subject to: Ax = b
‖x‖_1: Manhattan distance. Equivalent to linear programming
7
p = ∞ Minimize ‖x‖_∞ Subject to: Ax = b
‖x‖_∞: maximum absolute value of an entry. Also equivalent to linear programming
8
p = 4 Minimize ‖x‖_4 Subject to: Ax = b. Previous results:
Interior point methods Homotopy methods [Bubeck-Cohen-Lee-Li `18] Accelerated gradient descent [Bullins `18]
9
Main idea The p = 4 case has more in common with p = 2 than with p = 1 and p = ∞
10
Main idea The p = 4 case has more in common with p = 2 than with p = 1 and p = ∞
Algorithms motivated by the p = 2 case:
Faster p-norm regression: Op(m^(<1/3) log(𝜖^(-1))) linear system solves, and Op(m^ω log(𝜖^(-1))) time
p-norm flows in Op(m^(1+O(1/p)) log(𝜖^(-1))) time
11
Main idea The p = 4 case has more in common with p = 2 than with p = 1 and p = ∞
Algorithms motivated by the p = 2 case:
Faster p-norm regression: Op(m^(<1/3) log(𝜖^(-1))) linear system solves, and Op(m^ω log(𝜖^(-1))) time
p-norm flows in Op(m^(1+O(1/p)) log(𝜖^(-1))) time
In progress: replace Op() by O(p)
12
m^? log(𝜖^(-1)) linear system solves
Interior point: 1/2 for all p
[Bubeck-Cohen-Lee-Li `18]: |1/2 - 1/p|
[Bullins `18 (indep.)]: 1/5 for p = 4
Our result: |1/2 - 1/p| / (1 + |1/2 - 1/p|)
(Plot: these exponents as a function of p, comparing BCLL and our result.)
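To see how these exponents compare, a throwaway script I added (not part of the talk); the exponent from our result never exceeds 1/3, matching the m^(<1/3) bound quoted earlier.

```python
# Compare the exponent in the m^? bound for several values of p.
def bcll(p):                     # [Bubeck-Cohen-Lee-Li `18]
    return abs(0.5 - 1.0 / p)

def ours(p):                     # exponent claimed in this talk
    t = abs(0.5 - 1.0 / p)
    return t / (1.0 + t)

for p in [1.1, 2, 4, 8, 100]:
    print(f"p={p:>5}: interior point 0.5, BCLL {bcll(p):.3f}, ours {ours(p):.3f}")
```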
13
Outline Convex (norm) optimization High accuracy vs. approximate
Constructing residual problems Applications / extensions
14
What are we good at optimizing?
p = 2, a.k.a. solving linear systems
Reasons: Gaussian elimination; approximate Gaussian elimination; easy to correct errors
15
What are we good at optimizing?
p = 2, a.k.a. solving linear systems
Approximately minimizing symmetric functions. Reason: errors are allowed
Gradient descent (e.g. [Durfee-Lai-Sawlani `18])
Multiplicative weights / regret minimization
Iterative reweighted least squares
16
A constant factor approximation
Minimize ∑ x_i^4 Subject to: Ax = b
Minimize ∑ w_i x_i^2 Subject to: Ax = b
17
A constant factor approximation
Adjust w_i based on x_i
Minimize ∑ x_i^4 Subject to: Ax = b
Minimize ∑ w_i x_i^2 Subject to: Ax = b
Done if w_i = x_i^2
Variants of this give an O(1)-approx:
[Chin-Madry-Miller-P `13]: m^(1/3) iterations for p = 1 & ∞
[Adil-Kyng-P-Sachdeva `19]: m^(|1/2 - 1/p| / (1 + |1/2 - 1/p|)) iterations
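A toy sketch of this reweighting idea (hypothetical code I added; the cited papers control the weight updates far more carefully, this naive loop only shows the shape of the scheme):

```python
import numpy as np

def weighted_min_norm(A, b, w):
    """argmin of sum_i w_i x_i^2 subject to Ax = b (w_i > 0, A full row rank)."""
    Winv = np.diag(1.0 / w)
    y = np.linalg.solve(A @ Winv @ A.T, b)
    return Winv @ A.T @ y

def irls_p4(A, b, iters=50, eps=1e-8):
    """Naive reweighting for min sum_i x_i^4 s.t. Ax = b: solve a weighted
    2-norm problem, then reset w_i from x_i; fixed point when w_i = x_i^2."""
    w = np.ones(A.shape[1])
    for _ in range(iters):
        x = weighted_min_norm(A, b, w)
        w = x**2 + eps
    return x

A = np.array([[1.0, 1.0, 1.0, 1.0]])
b = np.array([1.0])
print(irls_p4(A, b))   # roughly the uniform vector [0.25, 0.25, 0.25, 0.25]
```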
18
But… 2-approx to min ∑ x_i^4 Subject to: Ax = b
19
But… Incomplete Cholesky: solve A^T A x = A^T b by pretending A^T A is sparse
20
Why are we good at p = 2? Residual problem at x:
Minimize ∑ (x_i + Δ_i)^2 - x_i^2 Subject to: A(x + Δ) = b
21
Why are we good at p = 2? Simplify: Residual problem at x:
Minimize ∑ 2x_iΔ_i + Δ_i^2 s.t. AΔ = 0
Residual problem at x: Minimize ∑ (x_i + Δ_i)^2 - x_i^2 Subject to: A(x + Δ) = b
22
Why are we good at p = 2? Simplify: Residual problem at x:
Minimize ∑ 2x_iΔ_i + Δ_i^2 s.t. AΔ = 0
Residual problem at x: Minimize ∑ (x_i + Δ_i)^2 - x_i^2 Subject to: A(x + Δ) = b
Binary/line search on the linear term: Minimize ∑ Δ_i^2 s.t. AΔ = 0, ∑ 2x_iΔ_i = 2x^TΔ = ⍺
Another instance of 2-norm minimization!
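To illustrate why the residual problem is "another instance of 2-norm minimization", a numpy check I added (it skips the line search and solves the whole residual problem at once, so a single exact step reaches the optimum):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))
b = rng.standard_normal(3)

x_opt = A.T @ np.linalg.solve(A @ A.T, b)            # minimum 2-norm solution
N = np.eye(6) - A.T @ np.linalg.solve(A @ A.T, A)    # projector onto null(A)
x0 = x_opt + N @ rng.standard_normal(6)              # some other feasible point

# Residual problem at x0: minimize sum_i (x0_i + d_i)^2 - x0_i^2 over A d = 0.
# Over the null space this is a projection: d = -N x0.
d = -N @ x0
print(np.allclose(A @ (x0 + d), b))   # True: still feasible
print(np.allclose(x0 + d, x_opt))     # True: the residual step lands on the optimum
```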
23
Iterative refinement Minimize f(x) Subject to: Ax = b Repeatedly:
Create residual problem, f_res(·) Approximately min f_res(Δ) s.t. AΔ = 0 Update x ← x + Δ
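The same loop in schematic form (pseudocode-style Python I added; the residual construction and the approximate inner solver are the problem-specific pieces discussed in the rest of the talk):

```python
def iterative_refinement(x0, make_residual, approx_min_over_nullspace, iters):
    """Outer loop: build a residual problem around the current point,
    approximately minimize it over {d : A d = 0}, and move there."""
    x = x0
    for _ in range(iters):
        f_res = make_residual(x)               # f_res(d) ~ f(x + d) - f(x)
        d = approx_min_over_nullspace(f_res)   # a constant-factor solve suffices
        x = x + d
    return x
```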
24
Outline Convex (norm) optimization High accuracy vs. approximate
Constructing residual problems Applications / extensions
25
Simplifications Ignore A: f_res(Δ) approximates f(x + Δ) - f(x)
Suffices to approximate each coordinate: (1 + Δ)^4 - 1^4
26
Gradient: consider Δ → 0: f(x + Δ) - f(x) → <grad(x), Δ>
Must have f_res(Δ) = <grad(x), Δ> + h_res(Δ). (Plot: (1 + Δ)^4 - 1^4 together with its gradient (linear) term.)
27
Higher Order Terms Scale down step size: (x + Δ)^2 - x^2 = 2Δx + Δ^2
28
Higher Order Terms Scale down step size: (x + Δ)^2 - x^2 = 2Δx + Δ^2
Unconditionally, even when Δx < 0!
29
Higher Order Terms Scale down step size: (x + Δ)^2 - x^2 = 2Δx + Δ^2
Unconditionally, even when Δx < 0! A factor-κ error in the higher-order terms can be absorbed by taking 1/κ-sized steps
30
Equivalent goal of h_res()
Symmetric around 0: h_res(Δ) ≅ f(x + Δ) - f(x) - <grad(x), Δ>
Approximation factor ⟷ iteration count
31
Equivalent goal of h_res()
Symmetric around 0: h_res(Δ) ≅ f(x + Δ) - f(x) - <grad(x), Δ>
Approximation factor ⟷ iteration count
Bregman Divergence: the difference between f() and its first-order approximation
32
Bregman Divergence of x^4
(x + Δ)^4 = x^4 + 4Δx^3 + 6Δ^2x^2 + 4Δ^3x + Δ^4
Div(x + Δ, x) = 6Δ^2x^2 + 4Δ^3x + Δ^4
The Δ^3x term is not symmetric in Δ
33
Drop the odd term?!? (Plot: Div(1 + Δ, 1) stays between 0.1(6Δ^2 + Δ^4) and 3(6Δ^2 + Δ^4).)
34
Prove this? 6Δ^2x^2 + 4Δ^3x + Δ^4 ≅ Δ^2x^2 + Δ^4 Proof:
By the arithmetic-geometric mean inequality (a^2 + b^2 ≥ 2|ab|): |4Δ^3x| ≤ 5Δ^2x^2 + Δ^4
35
Prove this? 6Δ^2x^2 + 4Δ^3x + Δ^4 ≅ Δ^2x^2 + Δ^4 Proof:
By the arithmetic-geometric mean inequality (a^2 + b^2 ≥ 2|ab|): |4Δ^3x| ≤ 5Δ^2x^2 + Δ^4
Substitute back into Div(x + Δ, x): Δ^2x^2 ≤ 6Δ^2x^2 + 4Δ^3x + Δ^4 ≤ 9(Δ^2x^2 + Δ^4)
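A quick numerical sanity check of this sandwich (a script I added; it samples rather than proves): the ratio Div(x + Δ, x) / (Δ^2x^2 + Δ^4) stays bounded between constants.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)
d = rng.standard_normal(10**6)

div = 6 * d**2 * x**2 + 4 * d**3 * x + d**4   # Bregman divergence of x^4
proxy = d**2 * x**2 + d**4                    # symmetric surrogate
ratio = div / proxy
print(ratio.min(), ratio.max())               # empirically around 0.3 and 6.7
```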
36
Outline Convex (norm) optimization High accuracy vs. approximate
Constructing residual problems Applications / extensions
37
Overall Algorithm Repeatedly approximate
min <Δ, x^(p-1)> + <Δ^2, x^(p-2)> + ‖Δ‖_p^p s.t. AΔ = 0
And adjust: x ← x + ⍺Δ
Gradient term <Δ, x^(p-1)>: ternary/line search on its value; becomes another linear constraint
38
Inner Problem min <Δ^2, x^(p-2)> + ‖Δ‖_p^p s.t. AΔ = 0
Symmetric intermediate problems! Modify multiplicative weights: Op(m^(|p-2|/|3p-2|) log(𝜖^(-1))) iterations. Total time can be reduced to Op(m^ω log(𝜖^(-1))) via lazy updates to inverses
39
Difference Constraints
On graphs… edge-vertex incidence matrix B: B_eu = -1/+1 for the endpoints u of edge e, 0 everywhere else
B^T f = b: f has residue b
f = Bɸ: f is a potential flow
p = 1: B^T f = b gives shortest path; f = Bɸ gives min-cut
p = 2: B^T f = b gives electrical flow; f = Bɸ gives electrical voltage
p = ∞: B^T f = b gives max-flow; f = Bɸ gives difference constraints
1 < p < ∞: p-Laplacians, p-Lipschitz learning
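A tiny example of these objects (numpy illustration I added) on the path graph 0–1–2:

```python
import numpy as np

# Edge-vertex incidence matrix of the path 0-1-2 (illustration only):
# row e has -1 at the tail and +1 at the head of edge e, 0 elsewhere.
B = np.array([[-1.0,  1.0,  0.0],    # edge (0, 1)
              [ 0.0, -1.0,  1.0]])   # edge (1, 2)

f = np.array([2.0, 2.0])             # flow of 2 along the path
print(B.T @ f)                       # residues [-2, 0, 2]: 2 units leave vertex 0, arrive at vertex 2

phi = np.array([0.0, 1.0, 2.0])      # vertex potentials
print(B @ phi)                       # potential (gradient) flow across edges: [1, 1]
```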
40
p-norm flows for large p
Intermediate problem: min 2-norm + ∞-norm. On unit-capacity graphs, [Spielman-Teng `04] works! Reason: any tree has small total L∞ stretch. Main technical hurdle: sparsification.
41
Large p (in progress): overhead O(p). Easier once p > m^(1/2)
Limitation: change in the 2nd-order derivative: x^(p-2) is stable up to a factor of 1/(p-2). Where things get messy: numerical precision
42
Questions / Future directions
Sparsification for mixed 2/p norms. 4-norm flows? Currently the best is about m^1.2. Incomplete Cholesky for p-norm minimization?