Sampling from Gaussian Graphical Models via Spectral Sparsification Richard Peng M.I.T. Joint work with Dehua Cheng, Yu Cheng, Yan Liu and Shanghua Teng (U.S.C.)
OUTLINE
Gaussian sampling, linear systems, matrix roots
Sparse factorizations of L^p
Sparsification of random walk polynomials
SAMPLING FROM GRAPHICAL MODELS
Joint probability distribution between the entries of an n-dimensional random variable x
Graphical models: encode the distribution as local dependencies via a graph
Sampling: draw a random point distributed according to the model
APPLICATIONS Often need many samples Rejection / importance sampling Estimation of quantities on the samples Ideal sampling routine: Efficient, parallel Use limited randomness
PREVIOUS WORKS
Instance of Markov Chain Monte Carlo
Gibbs sampling: locally resample each variable from its conditional distribution given its neighbors
Parallel sampling algorithms:
[Gonzalez-Low-Gretton-Guestrin '11]: coloring
[Niu-Recht-Ré-Wright '11] Hogwild: go lock-free
[Williamson-Dubey-Xing '13]: auxiliary variables
GAUSSIAN GRAPHICAL MODELS AND LINEAR SYSTEMS
Joint distribution specified by a precision matrix M (covariance M^{-1}, usually denoted Λ^{-1})
Goal: sample from the Gaussian distribution N(0, M^{-1})
Gibbs sampling: resample based on neighbors
Iterative methods: x' ← x + αMx, also recomputing on neighbors
CONNECTION TO SOLVING LINEAR SYSTEMS
[Johnson, Saunderson, Willsky '13]: if the precision matrix M is (generalized) diagonally dominant, then Hogwild Gibbs sampling converges
Further simplification: graph Laplacian matrix L of a graph with n vertices and m edges
Diagonal: degree
Off-diagonal: -edge weights
n rows / columns, O(m) non-zeros
Much more restrictive than the 'graph' in graphical models!
LOCAL METHODS
#steps required is lower bounded by information propagation: b, Mb, M^2 b, ..., M^{diameter} b
Need n matrix operations?
What if we have more powerful algorithmic primitives?
ALGEBRAIC PRIMITIVE
Goal: generate a random variable from the Gaussian distribution N(0, L^{-1})
Assume L is full rank for simplicity
Can generate standard Gaussians, N(0, I)
Need: efficiently evaluable linear operator C s.t. C C^T = L^{-1}
x ~ N(0, I), y = Cx gives y ~ N(0, C C^T)
DIRECT SOLUTION
Factorize L = B^T B
B: edge-vertex incidence matrix, B_{eu} = ±1 if u is an endpoint of e, 0 otherwise
Set C = L^{-1} B^T
C C^T = L^{-1} B^T (L^{-1} B^T)^T = L^{-1} B^T B L^{-1} = L^{-1}
Factorization + black-box access to solvers gives a sampling algorithm
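The factorization argument on this slide can be checked numerically. The sketch below is illustrative only: the 4-vertex graph, the `incidence_matrix` helper, the use of sqrt(weight) entries for a weighted graph, and the small diagonal term `delta` (added so that L is full rank, as the previous slide assumes) are all assumptions, and a dense solve stands in for the black-box solver.

```python
import numpy as np

def incidence_matrix(n, edges):
    """Edge-vertex incidence matrix with one row per edge (u, v, weight)."""
    B = np.zeros((len(edges), n))
    for row, (u, v, w) in enumerate(edges):
        B[row, u] = np.sqrt(w)    # +sqrt(w) at one endpoint
        B[row, v] = -np.sqrt(w)   # -sqrt(w) at the other
    return B

# Hypothetical weighted graph: (u, v, weight).
n = 4
edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.0), (3, 0, 0.5)]
B = incidence_matrix(n, edges)

# Assumption: add delta * I to make L full rank; appending the rows
# sqrt(delta) * I to B keeps the factorization L = B^T B exact.
delta = 0.1
B_full = np.vstack([B, np.sqrt(delta) * np.eye(n)])
L = B_full.T @ B_full

# C = L^{-1} B^T (dense solve used here in place of a fast solver).
C = np.linalg.solve(L, B_full.T)
cov = C @ C.T                      # should equal L^{-1}

# One sample: y ~ N(0, I) with one entry per row of B, then x = C y.
rng = np.random.default_rng(0)
y = rng.standard_normal(B_full.shape[0])
x = C @ y
```

Note that `y` has one entry per row of the factor, which is the randomness cost discussed on the later slides.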
PARALLEL SAMPLING ROUTINE
[P-Spielman '14]: Z ≈_ε L^{-1} in polylog depth and nearly-linear work
≈: spectral similarity, A ≈_k B iff for all x: e^{-k} x^T A x ≤ x^T B x ≤ e^{k} x^T A x
Can use B 'in place' of A
Can also boost accuracy
Parallel sampling routine:
y' ← B^T y
x ← solve(L, y')
The corresponding operator C = Z B^T gives C C^T ≈ L^{-1}
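The spectral similarity definition can be made concrete with a small helper. This is a minimal sketch, assuming both matrices are symmetric positive definite; the function name `spectral_similarity` and the test matrix are hypothetical. It returns the smallest k for which the two-sided bound holds, read off from the generalized eigenvalues.

```python
import numpy as np

def spectral_similarity(A, B):
    """Smallest k with e^{-k} x^T A x <= x^T B x <= e^{k} x^T A x for all x.

    Assumes A and B are symmetric positive definite.
    """
    # Eigenvalues of A^{-1} B are the extreme ratios x^T B x / x^T A x;
    # they are real and positive for SPD inputs.
    lam = np.linalg.eigvals(np.linalg.solve(A, B)).real
    return float(np.max(np.abs(np.log(lam))))

A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])
k_scaled = spectral_similarity(A, 1.5 * A)   # scaling by 1.5 gives k = log(1.5)
k_same = spectral_similarity(A, A)           # identical matrices give k = 0
```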
RANDOMNESS REQUIREMENT
Sample y from N(0, I)
y' ← B^T y
x ← solve(L, y')
return x
B: m-by-n matrix, m = # of edges
y needs to be an m-dimensional Gaussian (can get to O(n log n) with some work)
Fewer random variables?
Optimal randomness requirement: n, which needs a C that is a square matrix
GENERALIZATIONS
Lower randomness requirement: L ≈ C^T C where C is a square matrix
Akin to QR factorization
Can also view as matrix square root?
Z s.t. Z ≈ L^{-1/2}?
Z s.t. Z ≈ L^{-1/3}?
≈: spectral approximation
Application of matrix roots: 'half a step' of a random walk
Alternate definition of square-root:
OUR RESULT
Input: graph Laplacian L with condition number κ, parameter -1 ≤ p ≤ 1
Output: access to a square operator C s.t. C C^T ≈_ε L^p
Cost: O(log^{c1} m log^{c2} κ ε^{-4}) time, O(m log^{c1} m log^{c2} κ ε^{-4}) work
κ: condition number, closely related to the bit-complexity of solve(L, b)
Extends to symmetric diagonally dominant (SDD) matrices
SUMMARY
Gaussian sampling is closely related to linear system solves and matrix p-th roots
Can approximately factor L^p into a product of sparse matrices
Random walk polynomials can be sparsified by sampling random walks
OUTLINE
Gaussian sampling, linear systems, matrix roots
Sparse factorizations of L^p
Sparsification of random walk polynomials
SIMPLIFICATION
Adjust/rescale so diagonal = I
Add to diagonal to make full rank
L = I - A
A: random walk, ‖A‖ < 1
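A minimal sketch of this normalization, under stated assumptions: the weight matrix `W`, the diagonal shift `delta`, and the choice of symmetric rescaling D^{-1/2} L D^{-1/2} are illustrative and not taken from the slides.

```python
import numpy as np

# Hypothetical symmetric edge-weight matrix of a 3-vertex graph.
W = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
delta = 0.1                                   # assumed diagonal shift for full rank

L = np.diag(W.sum(axis=1) + delta) - W        # Laplacian plus delta * I
scale = 1.0 / np.sqrt(np.diag(L))
L_scaled = scale[:, None] * L * scale[None, :]  # D^{-1/2} L D^{-1/2}, unit diagonal

A = np.eye(3) - L_scaled                      # L_scaled = I - A
norm_A = np.linalg.norm(A, 2)                 # spectral norm, strictly below 1 here
```

Because the shift makes every row of the rescaled walk sum to less than 1, the spectral norm of `A` stays strictly below 1, which is the property the later slides rely on.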
PROBLEM
Given random walk matrix A and parameter p, produce an easily evaluable C s.t. C C^T ≈ (I - A)^p
Local approach for p = -1: I + A + A^2 + A^3 + ... = (I - A)^{-1}
Each step: pass information to neighbors (I, A, A^2, ..., A^{diameter})
Need A^{diameter}
Evaluate using O(diameter) matrix operations?
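The local approach for p = -1 can be sketched as a truncated power series. The 2-by-2 matrix below is a hypothetical example with ‖A‖ = 0.9; the helper name `neumann_partial_sum` is an assumption. The point is that each extra term costs one more matrix product and the error decays only like ‖A‖^terms.

```python
import numpy as np

def neumann_partial_sum(A, terms):
    """Return I + A + A^2 + ... + A^(terms - 1), one matrix product per term."""
    n = A.shape[0]
    total = np.eye(n)
    power = np.eye(n)
    for _ in range(terms - 1):
        power = power @ A
        total = total + power
    return total

# Hypothetical symmetric A with eigenvalues +0.9 and -0.9.
A = 0.9 * np.array([[0.0, 1.0],
                    [1.0, 0.0]])
exact = np.linalg.inv(np.eye(2) - A)

err_10 = np.linalg.norm(neumann_partial_sum(A, 10) - exact)
err_200 = np.linalg.norm(neumann_partial_sum(A, 200) - exact)
```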
FASTER INFORMATION PROPAGATION
Recall: ‖A‖ < 1, and I - A^{n^3} ≈ I if A corresponds to a random walk on an unweighted graph
Repeated squaring: A^16 = (((A^2)^2)^2)^2, 4 operations
Framework from [P-Spielman '14]: reduce (I - A)^p to computing (I - A^2)^p
O(log κ) reduction steps suffice
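Repeated squaring as described on this slide, written out as a short sketch. The matrix is a hypothetical symmetric example with ‖A‖ < 1, and `power_by_squaring` is an illustrative name.

```python
import numpy as np

def power_by_squaring(A, squarings):
    """Return A^(2^squarings) using exactly `squarings` matrix products."""
    result = A
    for _ in range(squarings):
        result = result @ result
    return result

# Hypothetical symmetric matrix with row sums at most 0.9.
A = np.array([[0.0, 0.5, 0.4],
              [0.5, 0.0, 0.3],
              [0.4, 0.3, 0.0]])
A16 = power_by_squaring(A, 4)   # four products instead of fifteen
```

On a dense matrix each product is cheap to state but the powers fill in quickly, which is why the following slides turn to sparsification.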
SQUARING DENSE GRAPHS?!?
Graph sparsification: sparse A' s.t. I - A' ≈_ε I - A^2
Also preserves p-th powers
[ST '04][SS '08][OV '11] + some modifications, or [Koutis '14]: O(n log^c n ε^{-2}) entries, efficient, parallel
[BSS '09, ALZ '14]: O(n ε^{-2}) entries, but quadratic cost
ABSORBING ERRORS
Simplification: work with p = -1
Direct factorization: (I - A)^{-1} = (I + A) (I - A^2)^{-1}
Have: I - A' ≈ I - A^2
Implies: (I - A')^{-1} ≈ (I - A^2)^{-1}
But NOT: (I + A) (I - A')^{-1} ≈ (I + A) (I - A^2)^{-1}
Incorporation of matrix approximations needs to be symmetric: X ≈ X' implies U^T X U ≈ U^T X' U
Instead use: (I - A)^{-1} = (I + A)^{1/2} (I - A^2)^{-1} (I + A)^{1/2} ≈ (I + A)^{1/2} (I - A')^{-1} (I + A)^{1/2}
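The symmetric form of the factorization can be verified on a small example. This is a sketch under assumptions: `A` is a hypothetical symmetric matrix with ‖A‖ < 1, and `sym_power` is an illustrative helper that computes matrix powers through an eigendecomposition rather than anything efficient.

```python
import numpy as np

def sym_power(M, p):
    """M^p for a symmetric positive definite M, via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * vals**p) @ vecs.T

A = np.array([[0.0, 0.5, 0.4],
              [0.5, 0.0, 0.3],
              [0.4, 0.3, 0.0]])
I = np.eye(3)

half = sym_power(I + A, 0.5)                    # (I + A)^{1/2}
lhs = np.linalg.inv(I - A)
rhs = half @ np.linalg.inv(I - A @ A) @ half    # symmetric sandwich
```

Since the middle factor is conjugated by the same symmetric matrix on both sides, replacing I - A^2 by a spectrally similar I - A' keeps the whole product spectrally similar.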
SIMILAR TO
Multiscale methods
NC algorithm for shortest path
Logspace connectivity: [Reingold '02]
Deterministic squaring: [Rozenman-Vadhan '05]

                    Connectivity         Our Algorithm
Iteration           A_{i+1} ≈ A_i^2      I - A_{i+1} ≈ I - A_i^2
Size reduction      Low degree           Sparse graph
Method              Derandomized         Randomized
Solution transfer   Connectivity         Solution vectors

Iterate until ‖A_d‖ is small
EVALUATING (I + A)^{1/2}?
(I - A)^{-1} ≈ (I + A)^{1/2} (I - A')^{-1} (I + A)^{1/2}
A_1 ≈ A_0^2: eigenvalues in [0, 1]
Eigenvalues of I + A_i in [1, 2] when i > 0
Well-conditioned matrix: Maclaurin series expansion, approximated well by a low-degree polynomial T_{1/2}(A_i)
Doesn't work for (I + A_0)^{1/2}: eigenvalues of A_0 can be close to -1
MODIFIED IDENTITY
(I - A)^{-1} = (I + A/2)^{1/2} (I - A/2 - A^2/2)^{-1} (I + A/2)^{1/2}
Modified reduction: I - A_{i+1} ≈ I - A_i/2 - A_i^2/2
I + A_i/2 has eigenvalues in [1/2, 3/2]
Can approximate (to very high accuracy) with a low-degree polynomial / Maclaurin series, T_{1/2}(A_i/2)
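To illustrate why eigenvalues in [1/2, 3/2] make the Maclaurin series practical, the sketch below truncates the scalar binomial series of (1 + x)^{1/2} for x in [-1/2, 1/2], which is the range of the eigenvalues of A_i/2. The degree 20 and the helper name `binomial_series` are assumptions for illustration; the slides do not state which degree is used.

```python
import numpy as np

def binomial_series(p, x, degree):
    """Truncated Maclaurin series of (1 + x)^p up to x^degree."""
    total = 0.0
    coeff = 1.0                       # generalized binomial coefficient C(p, j)
    for j in range(degree + 1):
        total += coeff * x**j
        coeff *= (p - j) / (j + 1)    # C(p, j + 1) from C(p, j)
    return total

xs = np.linspace(-0.5, 0.5, 11)       # eigenvalues of A_i / 2 lie in this range
max_err = max(abs(binomial_series(0.5, x, 20) - (1.0 + x) ** 0.5) for x in xs)
```

Applied to a matrix, the same polynomial is evaluated with matrix products in place of scalar powers; the error on each eigenvalue is the scalar error measured here.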
APPROX. FACTORIZATION CHAIN
I - A_1 ≈_ε I - A_0/2 - A_0^2/2
I - A_2 ≈_ε I - A_1/2 - A_1^2/2
...
I - A_i ≈_ε I - A_{i-1}/2 - A_{i-1}^2/2
I - A_d ≈ I, d = O(log κ)
(I - A_i)^{-1} ≈ T_{1/2}(A_i/2) (I - A_{i+1})^{-1} T_{1/2}(A_i/2)
C_i = T_{1/2}(A_i/2) T_{1/2}(A_{i+1}/2) ... T_{1/2}(A_d/2) gives (I - A_i)^{-1} ≈ C_i C_i^T
For p-th root (-1 ≤ p ≤ 1): T_{p/2}(A_0/2) T_{p/2}(A_1/2) ... T_{p/2}(A_d/2)
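The chain can be simulated on a tiny dense example. This sketch makes two simplifying assumptions that the actual algorithm does not: each step uses the exact matrix A_{i+1} = A_i/2 + A_i^2/2 with no sparsification, and the factors (I + A_i/2)^{1/2} are computed exactly by eigendecomposition instead of by the polynomial T_{1/2}. The names `sym_power` and `exact_chain_factor`, the test matrix, and the depth of 40 are illustrative.

```python
import numpy as np

def sym_power(M, p):
    """M^p for a symmetric positive definite M, via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * vals**p) @ vecs.T

def exact_chain_factor(A0, depth):
    """C = product over i < depth of (I + A_i/2)^{1/2}, A_{i+1} = A_i/2 + A_i^2/2."""
    I = np.eye(A0.shape[0])
    C, A = I.copy(), A0.copy()
    for _ in range(depth):
        C = C @ sym_power(I + A / 2, 0.5)
        A = A / 2 + A @ A / 2          # exact reduction step, no sparsifier
    return C, A

A0 = np.array([[0.0, 0.5, 0.4],
               [0.5, 0.0, 0.3],
               [0.4, 0.3, 0.0]])
C, A_d = exact_chain_factor(A0, 40)
target = np.linalg.inv(np.eye(3) - A0)   # C C^T should be close to this
```

In this exact setting the product telescopes to C C^T = (I - A_0)^{-1} (I - A_d), so the only error is the size of the final A_d.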
WORKING AROUND EXPANSIONS
Alternate reduction step: (I - A)^{-1} = (I + A/2) (I - 3/4 A^2 - 1/4 A^3)^{-1} (I + A/2)
Composition is now done with I + A/2: easy
Hard part: finding a sparse approximation to I - 3/4 A^2 - 1/4 A^3
3/4 (I - A^2): same as before
1/4 (I - A^3): cubic power
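A quick numerical check of the alternate reduction step and of the split of the inner matrix into a quadratic and a cubic part. The matrix `A` is a hypothetical symmetric example with ‖A‖ < 1.

```python
import numpy as np

A = np.array([[0.0, 0.5, 0.4],
              [0.5, 0.0, 0.3],
              [0.4, 0.3, 0.0]])
I = np.eye(3)
A2 = A @ A
A3 = A2 @ A

inner = I - 0.75 * A2 - 0.25 * A3
lhs = np.linalg.inv(I - A)
rhs = (I + A / 2) @ np.linalg.inv(inner) @ (I + A / 2)

# The inner matrix is a convex combination of two random walk polynomials.
split = 0.75 * (I - A2) + 0.25 * (I - A3)
```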
GENERALIZATION TO P-TH POWER
(I - A)^p = (I + kA) ((I + kA)^{-2/p} (I - A))^p (I + kA)
(p = -1, k = 1/2 recovers the alternate reduction step (I + A/2) (I - 3/4 A^2 - 1/4 A^3)^{-1} (I + A/2))
Intuition: scalar operations commute, so the extra outer terms cancel with the inner ones
Can show: if 2/p is an integer and k > 2/p, (I + kA)^{-2/p} (I - A) is a combination of (I - A^c) for integer c up to 2/p
Difficulty: sparsifying (I - A^c) for large values of c
SUMMARY
Gaussian sampling is closely related to linear system solves and matrix p-th roots
Can approximately factor L^p into a product of sparse matrices
OUTLINE
Gaussian sampling, linear systems, matrix roots
Sparse factorizations of L^p
Sparsification of random walk polynomials
SPECTRAL SPARSIFICATION VIA EFFECTIVE RESISTANCE
[Spielman-Srivastava '08]: suffices to sample with probabilities at least O(log n) times weight times effective resistance
Aka. sample edge uv with probability log n × A_{uv} × R(u, v)
Issues: I - A^3 is dense
Need to sample without explicitly generating all edges / resistances
Two step approach: get a sparsifier with edge count close to m, then run a full sparsifier
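The quantity weight times effective resistance can be computed directly on a small graph. This sketch uses a dense pseudoinverse, which is only viable for tiny examples; the graph and the helper name `effective_resistances` are hypothetical. It also checks the standard fact that these products sum to n - 1 on a connected graph, which is why sampling proportionally to them gives about n log n edges.

```python
import numpy as np

def effective_resistances(n, edges):
    """R(u, v) = (e_u - e_v)^T L^+ (e_u - e_v) for each edge (u, v, weight)."""
    L = np.zeros((n, n))
    for u, v, w in edges:
        L[u, u] += w
        L[v, v] += w
        L[u, v] -= w
        L[v, u] -= w
    L_pinv = np.linalg.pinv(L)        # dense pseudoinverse, small graphs only
    return [L_pinv[u, u] + L_pinv[v, v] - 2 * L_pinv[u, v] for u, v, _ in edges]

# Hypothetical connected weighted graph on 4 vertices.
edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.0), (3, 0, 0.5), (0, 2, 1.0)]
resistances = effective_resistances(4, edges)
scores = [w * r for (_, _, w), r in zip(edges, resistances)]   # weight x resistance
```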
TWO STEP APPROACH FOR I - A^2
A: 1 step of random walk
A^2: 2 steps of random walk
[P-Spielman '14]: for a fixed midpoint, the edges of A^2 form a (weighted) complete graph
Replace with expanders: O(m log n) edges
Run black-box sparsifier
I - A^3
A: one step of random walk
A^3: 3 steps of random walk
(Part of) edge uv in I - A^3 corresponds to a length 3 path in A: u-y-z-v
Weight: A_{uy} A_{yz} A_{zv}
BOUND RESISTANCE ON I - A
Can check: I - A ≈_3 I - A^3
Spectral theorem: can work as scalars
Resistance between u and v in I - A gives an upper bound for the sampling probability
Rayleigh's monotonicity law: resistances in subgraphs of I - A are good upper bounds
Bound R(u, v) using the length 3 path in A, u-y-z-v
Sampling probability = log n × w(u-y-z-v) × R(u-y-z-v)
SAMPLING DISTRIBUTION
Sampling probability = log n × w(u-y-z-v) × R(u-y-z-v)
Weight: A_{uy} A_{yz} A_{zv}
Resistance: 1/A_{uy} + 1/A_{yz} + 1/A_{zv}
Probability (weight × resistance): A_{yz} A_{zv} + A_{uy} A_{zv} + A_{uy} A_{yz}
ONE TERM AT A TIME
Probability of picking uyzv: A_{yz} A_{zv} + A_{uy} A_{zv} + A_{uy} A_{yz}
Interpretation of the first term: pick edge uy, take 2 steps of random walk, then sample the edge in A^3 corresponding to uyzv
Total for a fixed choice of uy: Σ_{z,v} A_{yz} A_{zv} = Σ_z A_{yz} (Σ_v A_{zv}) ≤ Σ_z A_{yz} ≤ 1
(A: random walk transition probability)
Total over all choices of uy: m
MIDDLE TERM
Probability of picking uyzv: A_{yz} A_{zv} + A_{uy} A_{zv} + A_{uy} A_{yz}
Interpretation of the middle term A_{uy} A_{zv}: pick edge yz, take one step from y to get u and one step from z to get v, giving edge uyzv from A^3
Total: m again
A_{uy} A_{yz} handled similarly
O(m log n) size approximation to I - A^3 in O(m log n) time
Can then further sparsify in nearly-linear time
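The three interpretations (edge in the first, middle, or last position, extended by random walk steps) suggest a simple path sampler. The sketch below is one possible reading, not the authors' implementation: it assumes an unweighted graph given by a hypothetical adjacency matrix `adj`, uses the row-stochastic walk matrix P = D^{-1} adj, picks the position uniformly, and omits the reweighting of sampled edges that a real sparsifier needs.

```python
import numpy as np

def sample_length3_path(adj, rng):
    """Draw one path (u, y, z, v) by extending a uniformly random edge."""
    n = adj.shape[0]
    P = adj / adj.sum(axis=1, keepdims=True)     # random walk transition matrix
    directed = np.argwhere(adj > 0)              # each edge in both directions
    a, b = (int(t) for t in directed[rng.integers(len(directed))])

    def step(s):
        """One random walk step from vertex s."""
        return int(rng.choice(n, p=P[s]))

    slot = int(rng.integers(3))
    if slot == 0:                                # sampled edge is u-y
        z = step(b)
        return (a, b, z, step(z))
    if slot == 1:                                # sampled edge is y-z
        return (step(a), a, b, step(b))
    y = step(a)                                  # sampled edge is z-v
    return (step(y), y, a, b)

# Hypothetical unweighted graph: a 4-cycle with one chord.
adj = np.array([[0, 1, 1, 1],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
path = sample_length3_path(adj, rng)

# Two walk steps from any y carry total probability 1, so summing the
# first term over all (directed) choices of the edge uy gives their count.
P = adj / adj.sum(axis=1, keepdims=True)
two_step_mass = (P @ P).sum(axis=1)
```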
EXTENSIONS
I - A^k in O(mk log^c n) time
Even power: I - A ≈ I - A^2 does not hold
But I - A^2 ≈_2 I - A^4: certify via the 2 step matrix, same algorithm
I - A^k in O(m log k log^c n) time when k is a multiple of 4
SUMMARY
Gaussian sampling is closely related to linear system solves and matrix p-th roots
Can approximately factor L^p into a product of sparse matrices
Random walk polynomials can be sparsified by sampling random walks
OPEN QUESTIONS
Generalizations:
Batch sampling?
Connections to multigrid/multiscale methods?
Other functionals of L?
Sparsification of random walk polynomials:
Degree n polynomials in nearly-linear time?
Positive and negative coefficients?
Connections with other algorithms based on sampling random walks?
THANK YOU! Questions? Manuscripts on arXiv: