Sparsified Matrix Algorithms for Graph Laplacians Richard Peng Georgia Tech
OUTLINE (Structured) Linear Systems Iterative and Direct Methods (Graph) Sparsification Sparsified Squaring Speeding up Gaussian Elimination
GRAPH LAPLACIANS Matrices that correspond to undirected graphs: coordinates ↔ vertices, non-zeros ↔ edges. This talk: weighted, undirected graphs, and symmetric PSD matrices
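A minimal sketch of the vertex/edge correspondence, assuming a dense NumPy matrix and a hypothetical `(u, v, w)` edge-list format (neither is from the talk):

```python
import numpy as np

def laplacian(n, edges):
    """Return the n-by-n Laplacian of a weighted undirected graph.

    edges is a list of (u, v, w) triples with 0-based endpoints and w > 0.
    """
    L = np.zeros((n, n))
    for u, v, w in edges:
        L[u, u] += w      # weighted degree on the diagonal
        L[v, v] += w
        L[u, v] -= w      # one off-diagonal non-zero pair per edge
        L[v, u] -= w
    return L

# Hypothetical example: a weighted triangle on vertices 0, 1, 2.
L = laplacian(3, [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 3.0)])
```

Rows sum to zero and the matrix is symmetric PSD, which is the structure the rest of the talk relies on.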
THIS TALK Provably efficient algorithms for graph Laplacians, with focus on solving linear systems Why linear systems? Primitive in many graph algorithms Simplest convex optimization problem Algorithms on them often generalize
THE LAPLACIAN PARADIGM Directly related : Elliptic systems Few iterations : Eigenvectors, Heat kernels Many iterations / modify algorithm Graph problems Image processing
SCIENTIFIC COMPUTING Reducible to SDD systems, M-matrices [BHV `04, DS `07] PDEs, trusses [CFMNPW`14]: Helmholtz on meshes
DATA ANALYSIS [ZGL `03][ZHS `05][CCLPT `15]: inference / sampling on graphical models [KMST `09, CMMP `13]: image segmentation / denoising
GRAPHS [Tutte `62]: planar graph embeddings in 2 solves. [KMP `09][MST `15]: random spanning trees, Õ(m^{4/3}). [DS `08, LS `13]: mincost / lossy flows, Õ(mn^{1/2}). Õ hides factors of log^c n.
GRAPHS, FASTER [CKMST `11][Sherman `13][KLOS `13][P `16]: approx. undirected maxflow, Õ(m^{4/3}), then Õ(m^{1+ε}), then Õ(m). [OSV `12]: balanced cuts, heat kernel walks, Õ(m). [Madry `13]: bipartite matching in Õ(m^{10/7}). [CMSV `16]: mincost matching and negative length shortest paths in Õ(m^{10/7}).
WHY WORST CASE ANALYSIS? The Laplacian paradigm of designing graph algorithms: optimization problem, linear system solver, sequence of (adaptively) generated linear systems. Main difficulties: widely varying weights, multi-scale behavior
INSTANCE: ISOTONIC REGRESSION [Kyng-Rao-Sachdeva `15]: https://github.com/sachdevasushant/Isotonic/blob/master/README.md : …we suggest rerunning the program a few times and/or using a different solver. An alternate solver based on incomplete Cholesky factorization is provided with the code. Numbers thanks to Kevin Deweese (UCSB)
OUTLINE (Structured) Linear Systems Iterative and Direct Methods (Graph) Sparsification Sparsified Squaring Speeding up Gaussian Elimination
LINEAR SYSTEM SOLVERS [~0] Gaussian Elimination: O(n^3). [Strassen `69]: O(n^{2.8}). [Coppersmith-Winograd `90]: O(n^{2.3755}). [Stothers `10]: O(n^{2.3737}). [Vassilevska Williams `11]: O(n^{2.3727}). [Hestenes-Stiefel `52] Conjugate gradient: O(nm) (?)
APPROACHES
                 Direct            Iterative
Unit step        Modifying entry   Matrix-vector multiply
Main goal        Simplify system   Explore rank space
Cost per step    O(1)              O(m)
#Steps           O(n^ω)            O(n)
Total            O(n^ω)            O(nm)
Performances comparable on medium sized instances: m = 10^5 takes ~ 1 second
EXTREME INSTANCES Highly connected, need global steps. Long paths / tree, need many steps. Solvers must handle both simultaneously. Each easy on their own: highly connected by an iterative method, long paths / tree by a direct method
SIMPLIFICATION Adjust/rescale so diagonal = I Add to diagonal to make full rank L = I – A A: Random walk
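The rescaling can be sketched as follows, assuming symmetric scaling by D^{-1/2} and a hypothetical `shift` parameter for the diagonal term (the talk does not fix either choice):

```python
import numpy as np

def normalize(L, shift=0.0):
    """Rescale a Laplacian (plus an optional diagonal shift) to the form I - A.

    Returns (A, d) where d is the diagonal that was scaled out.
    """
    M = L + shift * np.eye(L.shape[0])   # add to diagonal to make full rank
    d = np.diag(M).copy()
    s = 1.0 / np.sqrt(d)
    N = s[:, None] * M * s[None, :]      # D^{-1/2} M D^{-1/2}, diagonal = I
    A = np.eye(len(d)) - N               # so that N = I - A
    return A, d

# Hypothetical example: unit-weight path on 3 vertices, shifted by 0.1.
L = np.array([[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]])
A, d = normalize(L, shift=0.1)
```

With a positive shift, A is symmetric, entrywise non-negative, and has spectral radius below 1, which is what the series expansions on the following slides need.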
ITERATIVE METHODS Division with multiplication: (1 - a)^{-1} = 1 + a + a^2 + a^3 + … If |a| ≤ ρ, κ = (1 - ρ)^{-1} terms give a good approximation to (1 - a)^{-1}. Spectral theorem: this works for symmetric PSD matrices. Matrix version: L^{-1} = I + A + A^2 + A^3 + … Matrices well-approximated by their diagonal blocks are easy to solve.
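A small sketch of the truncated series applied to a vector, assuming A is symmetric with spectral radius below 1; the function name and the example matrix are illustrative only:

```python
import numpy as np

def neumann_solve(A, b, terms):
    """Approximate (I - A)^{-1} b by b + A b + ... + A^(terms-1) b."""
    x = np.zeros_like(b)
    term = b.copy()
    for _ in range(terms):
        x += term          # add the current power A^k b
        term = A @ term    # one matrix-vector multiply per term
    return x

# Hypothetical example: spectral radius 0.9, so kappa = 1 / (1 - 0.9) = 10.
A = 0.45 * (np.ones((3, 3)) - np.eye(3))
b = np.array([1.0, 2.0, 3.0])
x = neumann_solve(A, b, terms=200)
```

Each term costs one matrix-vector multiply, so the number of terms is the iteration count of the method.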
LOWER BOUND FOR ITERATIVE METHODS Exists G (e.g. cycle) that requires Ω(n) steps. Graph theoretic interpretation: each term = 1 step of the walk (b, Ab, A^2 b, …), so about diameter many terms are needed. Closely related to the smoothness^{1/2} lower bound for # of gradient steps
DEGREE N → N OPERATIONS? (I - A)^{-1} = I + A + A^2 + A^3 + … = (I + A)(I + A^2)(I + A^4)… Repeated squaring: A^16 = (((A^2)^2)^2)^2, 4 operations. O(log n) terms ok. Similar to multi-level methods. Combinatorial view: A: step of random walk. I - A^2: Laplacian of the 2-step random walk. Still a graph! Dense matrix!
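The product form can be checked numerically with a short sketch. It uses exact dense squaring (no sparsification), and the function name and test matrix are assumptions for illustration:

```python
import numpy as np

def inverse_by_squaring(A, levels):
    """Approximate (I - A)^{-1} by (I + A)(I + A^2)...(I + A^(2^(levels-1)))."""
    n = A.shape[0]
    Z = np.eye(n)
    P = A.copy()
    for _ in range(levels):
        Z = Z @ (np.eye(n) + P)   # multiply in the factor I + A^(2^j)
        P = P @ P                 # repeated squaring: A, A^2, A^4, ...
    return Z

# Hypothetical example with spectral radius 0.9.
A = 0.45 * (np.ones((3, 3)) - np.eye(3))
Z = inverse_by_squaring(A, levels=10)
```

Ten factors reproduce the first 2^10 terms of the series, at the price of intermediate powers that are dense in general.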
OUTLINE (Structured) Linear Systems Iterative and Direct Methods (Graph) Sparsification Sparsified Squaring Speeding up Gaussian Elimination
GRAPH SPARSIFICATION Any undirected graph can be approximated by an undirected graph with [ST `04]: O(n log^{O(1)} n) edges, [BSS `09]: O(n) edges
NOTION OF APPROXIMATION A ≈_ε B if both exp(ε) A - B and exp(ε) B - A are P.S.D. Same as small relative condition number; reflexive, composes naturally. Necessary condition: all cuts similar
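The definition translates directly into an eigenvalue test. This is a dense sketch for small matrices only; the helper name and tolerance are assumptions:

```python
import numpy as np

def approx_eps(A, B, eps, tol=1e-9):
    """True if exp(eps)*A - B and exp(eps)*B - A are both PSD (up to tol)."""
    lo1 = np.linalg.eigvalsh(np.exp(eps) * A - B).min()
    lo2 = np.linalg.eigvalsh(np.exp(eps) * B - A).min()
    return lo1 >= -tol and lo2 >= -tol

# Hypothetical example: unit-weight triangle versus the same graph scaled by 1.1.
L = np.array([[2.0, -1.0, -1.0], [-1.0, 2.0, -1.0], [-1.0, -1.0, 2.0]])
close = approx_eps(L, 1.1 * L, eps=0.1)      # exp(0.1) > 1.1
too_tight = approx_eps(L, 1.1 * L, eps=0.05)  # exp(0.05) < 1.1
```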
HOW? Simplest explanation (so far): [SS `08] importance sampling on the edges. Keep edge e with probability p_e, rescale if kept to maintain expectation
HOW TO SAMPLE? Widely used: uniform sampling Works well when data is uniform e.g. complete graph Problem: long path, removing any edge changes connectivity (can also have both in one graph)
THE `RIGHT’ PROBABILITIES τ: L_2 statistical leverage scores, τ_e = trace(L^+ L_e). Interpretation: effective resistance. Path + clique: 1 on path edges, 1/n on clique edges. [Rudelson, Vershynin `07], [Tropp `12]: p_e ≥ τ_e · O(log n) gives a good sparsifier.
COMPUTING SAMPLING PROBABILITIES τ: leverage scores / effective resistance, τ_e = trace(M^+ M_e). [BSS `09][LS `15]: potential functions. [ST `04][OV `11]: spectral partitioning. [SS `08][CLMMPS `15]: Gaussian projections. [Koutis `14]: spanners / low diameter partitions
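A sketch of leverage-score sampling on a small path + clique graph. It assumes a dense pseudoinverse (feasible only for tiny graphs) and a hypothetical `oversample` factor standing in for the O(log n) term:

```python
import numpy as np

def leverage_scores(n, edges):
    """tau_e = w_e * (effective resistance of e) = trace(L^+ L_e)."""
    L = np.zeros((n, n))
    for u, v, w in edges:
        L[u, u] += w
        L[v, v] += w
        L[u, v] -= w
        L[v, u] -= w
    Lp = np.linalg.pinv(L)
    return np.array([w * (Lp[u, u] + Lp[v, v] - 2 * Lp[u, v])
                     for u, v, w in edges])

def sample_sparsifier(edges, tau, oversample, rng):
    """Keep edge e w.p. p_e = min(1, oversample * tau_e), reweighted by 1 / p_e."""
    kept = []
    for (u, v, w), t in zip(edges, tau):
        p = min(1.0, oversample * t)
        if rng.random() < p:
            kept.append((u, v, w / p))   # rescale to maintain expectation
    return kept

# Hypothetical example: path 0-1-2-3 attached to a clique on vertices 3..7.
n = 8
path = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]
clique = [(u, v, 1.0) for u in range(3, 8) for v in range(u + 1, 8)]
edges = path + clique
tau = leverage_scores(n, edges)
kept = sample_sparsifier(edges, tau, oversample=2.0, rng=np.random.default_rng(0))
```

Path edges get score 1 and are always kept, while each clique edge gets a small score, so uniform sampling would have to treat the two parts very differently.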
OUTLINE (Structured) Linear Systems Iterative and Direct Methods (Graph) Sparsification Sparsified Squaring Speeding up Gaussian Elimination
SQUARING Sparsifiers (plus a few tricks) give, for any A, an A' s.t. I - A' ≈ I - A^2. Plan: build algorithms around sparsifiers and identities involving I - A and I - A^2
SIMILAR TO
                    Connectivity      Parallel Solver
Iteration           A_{i+1} ≈ A_i^2, until |A_d| small
Size reduction      Low degree        Sparse graph
Method              Derandomized      Randomized
Solution transfer   Connectivity      (I - A_i) x_i = b_i
Multiscale methods. NC algorithm for shortest path. Logspace connectivity: [Reingold `02]. Deterministic squaring: [RV `05]
APPROXIMATE INVERSE CHAIN I - A_1 ≈_ε I - A_0^2, I - A_2 ≈_ε I - A_1^2, …, I - A_i ≈_ε I - A_{i-1}^2, until I - A_d ≈ I. Convergence: I - A_{i+1} ≈_ε I - A_i^2 implies |A_{i+1}| < |A_i|^{1.5}. |A_i|^κ < 0.8: can stop at d = O(log κ)
ISSUE: ERROR AT EACH STEP Need to invoke: (1 - a)^{-1} = (1 + a)(1 + a^2)(1 + a^4)… Only have 1 - a_{i+1} ≈ 1 - a_i^2. Solution: apply one at a time: (1 - a_i)^{-1} = (1 + a_i)(1 - a_i^2)^{-1} ≈ (1 + a_i)(1 - a_{i+1})^{-1}. Base case: z_d = (1 - a_d)^{-1} ≈ 1. Induction: z_{i+1} ≈ (1 - a_{i+1})^{-1} gives z_i = (1 + a_i) z_{i+1} ≈ (1 + a_i)(1 - a_{i+1})^{-1} ≈ (1 - a_i)^{-1}
ISSUE: MATRIX COMPOSITION In the matrix setting, replacements by approximations need to be symmetric: Z ≈ Z' implies U^T Z U ≈ U^T Z' U. Terms around Z' need to be symmetric; (I - A_i) Z is not symmetric. Solution 1 ([PS `14]): (1 - a)^{-1} = 1/2 (1 + (1 + a)(1 - a^2)^{-1}(1 + a))
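The scalar identity in Solution 1 can be verified in two lines, assuming a ≠ ±1:

```latex
\begin{align*}
\tfrac12\left(1 + (1+a)(1-a^2)^{-1}(1+a)\right)
  &= \tfrac12\left(1 + \frac{(1+a)^2}{(1-a)(1+a)}\right)
   = \tfrac12\left(1 + \frac{1+a}{1-a}\right) \\
  &= \tfrac12 \cdot \frac{(1-a) + (1+a)}{1-a}
   = (1-a)^{-1}.
\end{align*}
```

The same steps go through for a symmetric matrix A, since every factor is a function of A and so all factors commute.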
ALGORITHM Identity: (I - A)^{-1} = ½ [I + (I + A)(I - A^2)^{-1}(I + A)]. Chain: (I - A')^{-1} ≈ (I - A^2)^{-1}. Induction: Z' ≈ (I - A')^{-1}, so Z' ≈ (I - A^2)^{-1}. Set Z = ½ [I + (I + A) Z' (I + A)]. Composition: Z ≈ (I - A)^{-1}. Total error = dε = O(log κ · ε)
PSEUDOCODE x = Solve(I, A_0, …, A_d, b), with b_0 = b
1. For i from 1 to d, set b_i = (I + A_{i-1}) b_{i-1}.
2. Set x_d = b_d.
3. For i from d - 1 downto 0, set x_i = ½ [b_i + (I + A_i) x_{i+1}].
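A runnable sketch of the two passes, under two assumptions that are not from the slide: the chain is built by exact squaring A_{i+1} = A_i^2 (no sparsification), and the forward pass multiplies b_{i-1} by I + A_{i-1}, which is what the identity (I - A)^{-1} = ½ [I + (I + A)(I - A^2)^{-1}(I + A)] calls for. The name `solve_chain` is illustrative.

```python
import numpy as np

def solve_chain(chain, b):
    """Return x ≈ (I - A_0)^{-1} b for a chain A_0, ..., A_d with I - A_d ≈ I."""
    d = len(chain) - 1
    bs = [b]
    for i in range(1, d + 1):
        # Forward pass: b_i = (I + A_{i-1}) b_{i-1}.
        bs.append(bs[-1] + chain[i - 1] @ bs[-1])
    x = bs[d]  # x_d = b_d, because (I - A_d)^{-1} ≈ I
    for i in range(d - 1, -1, -1):
        # Backward pass: x_i = 1/2 [b_i + (I + A_i) x_{i+1}].
        x = 0.5 * (bs[i] + x + chain[i] @ x)
    return x

# Hypothetical chain by exact squaring; with sparsifiers each A_i would stay sparse.
A0 = 0.45 * (np.ones((3, 3)) - np.eye(3))
chain = [A0]
for _ in range(10):
    chain.append(chain[-1] @ chain[-1])
b = np.array([1.0, 0.0, -1.0])
x = solve_chain(chain, b)
```

Each level costs two matrix-vector multiplies, so the total work is proportional to the combined number of non-zeros in the chain.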
FACTORIZATION INTO PRODUCT [CCLPT `15]: alternate step for computing matrix roots, (I - A)^p for some |p| < 1. (I - A)^{-1} = (I + A/2)(I - 3/4 A^2 - 1/4 A^3)^{-1}(I + A/2). Hard part: sparsifying I - 3/4 A^2 - 1/4 A^3 = 3/4 (I - A^2) + 1/4 (I - A^3). 3/4 (I - A^2): same as before. 1/4 (I - A^3): cubic power
WHAT IS I - A^3? A: one step of random walk. A^3: 3 steps of random walk. (Part of) edge uv in I - A^3: length 3 path in A, u-y-z-v, with weight A_{uy} A_{yz} A_{zv}
PSEUDOCODE Repeat O(c m log n ε^{-2}) times:
1. Pick an integer 1 ≤ k ≤ c and an edge e = uv, both uniformly at random.
2. Perform a (k - 1)-step random walk from u.
3. Perform a (c - k)-step random walk from v.
4. Add a scaled copy of the corresponding edge to the sparsifier.
Resembles: local clustering, approximate triangle counting (c = 3)
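The factorization can be checked in the scalar case by expanding the middle polynomial:

```latex
\begin{align*}
\left(1 + \tfrac{a}{2}\right)^2 (1 - a)
  &= \left(1 + a + \tfrac14 a^2\right)(1 - a) \\
  &= 1 + a + \tfrac14 a^2 - a - a^2 - \tfrac14 a^3 \\
  &= 1 - \tfrac34 a^2 - \tfrac14 a^3 .
\end{align*}
```

Dividing through gives (1 - a)^{-1} = (1 + a/2)(1 - 3/4 a^2 - 1/4 a^3)^{-1}(1 + a/2), and the matrix version follows because all factors are polynomials in A.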
OUTLINE (Structured) Linear Systems Iterative and Direct Methods (Graph) Sparsification Sparsified Squaring Speeding up Gaussian Elimination
DIRECT METHODS Row reduction: eliminate variables by subtracting equations from each other. Sparse case? Effect of reduction: creates more non-zeros in the matrix, quickly get dense matrices. Runtime: n steps, each O(degree^2), O(n^3) total
SPARSE GAUSSIAN ELIMINATION Goal: keep intermediate matrices sparse? [George `73][LRT `79]: nested dissection: O(nlogn) size inverses for planar graphs Schur Complement
KEY QUESTION Ways of controlling fill: Eliminate in the right order: Minimum degree heuristic Elimination / separator trees Drop entries: incomplete Cholesky Schur complement is still a graph, can also be sparsified
SPARSE BLOCK CHOLESKY Linear system solve reduces to: 2 solves involving top left block 1 solve on the Schur complement [KLPRS`16]: Repeatedly pivot out constant fraction of variables similar to matrix inverse via matrix multiplication (solves on red blocks)
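A dense sketch of one level of the block reduction, assuming the matrix is full rank and using `np.linalg.solve` as a stand-in for both the top-left solves and the recursive solve. The explicit formation of the Schur complement here uses extra solves that a sparsified version would avoid.

```python
import numpy as np

def schur_solve(M, F, C, b):
    """Solve M x = b by eliminating the block F, then solving on C."""
    M_FF, M_FC = M[np.ix_(F, F)], M[np.ix_(F, C)]
    M_CF, M_CC = M[np.ix_(C, F)], M[np.ix_(C, C)]
    # Schur complement of M onto C (formed exactly in this sketch).
    S = M_CC - M_CF @ np.linalg.solve(M_FF, M_FC)
    # First solve with the top-left block.
    y_F = np.linalg.solve(M_FF, b[F])
    # One solve on the Schur complement.
    x_C = np.linalg.solve(S, b[C] - M_CF @ y_F)
    # Second solve with the top-left block.
    x_F = np.linalg.solve(M_FF, b[F] - M_FC @ x_C)
    x = np.zeros_like(b)
    x[F], x[C] = x_F, x_C
    return x, S

# Hypothetical example: unit-weight path on 4 vertices plus 0.1 on the diagonal.
M = np.array([[1.1, -1.0, 0.0, 0.0],
              [-1.0, 2.1, -1.0, 0.0],
              [0.0, -1.0, 2.1, -1.0],
              [0.0, 0.0, -1.0, 1.1]])
b = np.array([1.0, 2.0, 3.0, 4.0])
x, S = schur_solve(M, F=[0, 2], C=[1, 3], b=b)
```

In this example F is an independent set, so the top-left block is diagonal, and eliminating vertex 2 creates a new edge between vertices 1 and 3 in the Schur complement.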
TAIL RECURSION (solves on red blocks) Choose partition so top-left is easy to invert using iterative methods Recurrence: T(n) = T(0.99n) + O(nnz)
CHOOSING SET TO ELIMINATE α-block diagonally dominant (α-BDD) subset F: each vertex has ≥ 0.1 of total (weighted) degree going to V \ F = C. Intuition: approximate independent set. Best case scenario: independent set. Identical to AMG: C: coarse grid, F: fine grid - coarse
ITERATIVE METHOD ON M_FF Division with multiplication: (1 - a)^{-1} = 1 + a + a^2 + a^3 + … M_FF = I - A: row/column sums of A < 0.9, so A^{10t} < e^{-t}, quickly goes to 0. We had to be very careful with operators when addressing this. OPEN: random walk based view
FINDING α-BDD SUBSETS Pick F randomly: each u w.p. ½. Trim F: only keep good blocks. Removing blocks from F can only decrease inner degree of remaining blocks. Linearity of expectation: 1/4 of all blocks kept. W.p. 1/2, half of u's neighbors are not picked. Markov inequality: u picked, and good w.p. ≥ 1/4
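The decay rate follows from a one-line bound, assuming the row-sum condition is read as a norm bound \|A\| \le 0.9 (valid for the spectral norm of a symmetric matrix whose absolute row sums are below 0.9):

```latex
\[
\|A^{10t}\| \le \|A\|^{10t} \le 0.9^{10t} = \left(0.9^{10}\right)^{t}
\approx 0.349^{t} < e^{-t} \approx 0.368^{t}.
\]
```

So O(log(1/ε)) terms of the series are enough to reach error ε on this block.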
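The pick-then-trim procedure can be sketched for the scalar case, where every block is a single vertex. The adjacency format, function name, and default threshold are assumptions for illustration:

```python
import random

def find_dd_subset(adj, threshold=0.1, rng=None):
    """Pick each vertex w.p. 1/2, then keep only the good ones.

    adj maps each vertex u to a dict {v: weight}. A picked vertex is good if
    at least `threshold` of its weighted degree goes to unpicked vertices.
    """
    rng = rng or random.Random()
    picked = {u for u in adj if rng.random() < 0.5}
    F = set()
    for u in picked:
        total = sum(adj[u].values())
        outside = sum(w for v, w in adj[u].items() if v not in picked)
        if total > 0 and outside >= threshold * total:
            F.add(u)   # trimming others only moves more weight outside F
    return F

# Hypothetical example: unit-weight cycle on 20 vertices.
n = 20
adj = {u: {(u - 1) % n: 1.0, (u + 1) % n: 1.0} for u in range(n)}
F = find_dd_subset(adj, threshold=0.1, rng=random.Random(0))
```

Goodness is judged against the initially picked set, and the trimmed F is a subset of it, so every kept vertex still meets the threshold with respect to the final V \ F.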
OVERALL CALL ROUTINE Cost with O(n) sized sparse approximations: T(n) = T(0.99n) + O(n) = O(n) 2 solves involving top left block: O(nnz) 1 solve on the Schur complement: T(0.99n)
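Unrolling the recurrence gives a geometric series, assuming the per-level cost is at most c n for some constant c:

```latex
\begin{align*}
T(n) = T(0.99\,n) + c\,n
     \le c\,n \sum_{i \ge 0} 0.99^{i}
     = \frac{c\,n}{1 - 0.99}
     = 100\,c\,n = O(n).
\end{align*}
```

The bound relies on each level shrinking by a constant factor and on the sparse approximations keeping the non-zero count linear in the current size.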
KYNG-SACHDEVA `16 Per-entry pivoting, almost identical to incomplete LU
ONGOING WORK Connection to multigrid / multiscale? Other low factor width matrices: Multi-commodity flows? Linear elasticity problems? General PSD Linear Systems? Extension to convex optimization?