Sampling from Gaussian Graphical Models via Spectral Sparsification
Richard Peng (M.I.T.)
Joint work with Dehua Cheng, Yu Cheng, Yan Liu, and Shanghua Teng (U.S.C.)
OUTLINE
- Gaussian sampling, linear systems, matrix roots
- Sparse factorizations of L^p
- Sparsification of random walk polynomials
SAMPLING FROM GRAPHICAL MODELS
- Joint probability distribution over the entries of an n-dimensional random variable x
- Graphical models: encode the distribution as local dependencies via a graph
- Sampling: draw a random point from the model's distribution
APPLICATIONS
- Often need many samples: rejection / importance sampling, estimating quantities on the samples
- Ideal sampling routine: efficient, parallel, uses limited randomness
PREVIOUS WORKS
- Gibbs sampling: locally resample each variable from its conditional distribution given its neighbors
- An instance of Markov chain Monte Carlo
- Parallel sampling algorithms: [Gonzalez-Low-Gretton-Guestrin `11]: coloring; [Niu-Recht-Re-Wright `11] Hogwild: go lock-free; [Williamson-Dubey-Xing `13]: auxiliary variables
GAUSSIAN GRAPHICAL MODELS AND LINEAR SYSTEMS
- Joint distribution specified by a precision matrix M (usually denoted Λ); the covariance is M^{-1}
- Goal: sample from the Gaussian distribution N(0, M^{-1})
- Gibbs sampling: resample each entry based on its neighbors
- Iterative methods, x' ← x + αMx, also recompute using neighbors
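As a concrete illustration, here is a minimal dense-matrix sketch of one Gibbs sweep for N(0, M^{-1}); the function name and the toy precision matrix are illustrative assumptions, and a real implementation would exploit sparsity so each update only reads a variable's neighbors.

```python
import numpy as np

def gibbs_sweep(M, x, rng):
    """One Gibbs sweep for N(0, M^{-1}), given the precision matrix M.
    The conditional of x_i given the rest is Gaussian with
    mean -(1/M_ii) * sum_{j != i} M_ij x_j and variance 1/M_ii."""
    for i in range(len(x)):
        cond_mean = -(M[i] @ x - M[i, i] * x[i]) / M[i, i]
        x[i] = cond_mean + rng.standard_normal() / np.sqrt(M[i, i])
    return x

rng = np.random.default_rng(0)
# Toy precision matrix: a path-graph Laplacian plus a unit diagonal
# (diagonally dominant, hence positive definite)
M = np.array([[2., -1., 0.], [-1., 3., -1.], [0., -1., 2.]])
x = rng.standard_normal(3)
for _ in range(1000):   # each sweep only involves neighbors of each variable
    x = gibbs_sweep(M, x, rng)
```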
CONNECTION TO SOLVING LINEAR SYSTEMS
- [Johnson-Saunderson-Willsky `13]: if the precision matrix M is (generalized) diagonally dominant, then Hogwild Gibbs sampling converges
- Further simplification: the graph Laplacian matrix L of a graph with n vertices and m edges; diagonal = degrees, off-diagonals = negated edge weights; n rows/columns, O(m) non-zeros
- Example: a 3-vertex star with unit edge weights has L = [[2, -1, -1], [-1, 1, 0], [-1, 0, 1]]
- Much more restrictive than the `graph' in graphical models!
LOCAL METHODS
- #steps required is lower bounded by information propagation: b, Mb, M^2b, ... reaches across M only after diameter-many multiplications
- Need ~n matrix operations?
- What if we have more powerful algorithmic primitives?
ALGEBRAIC PRIMITIVE
- Goal: generate a random variable from the Gaussian distribution N(0, L^{-1}) (assume L is full rank for simplicity)
- Can generate standard Gaussians, N(0, I)
- Need: an efficiently evaluable linear operator C s.t. C C^T = L^{-1}
- x ~ N(0, I), y = Cx ⇒ y ~ N(0, C C^T)
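A quick numerical sanity check of this primitive (the 2-by-2 L is an illustrative choice): build C as the symmetric inverse square root of L, so C C^T = L^{-1}, and compare the empirical covariance of y = Cx against L^{-1}.

```python
import numpy as np

rng = np.random.default_rng(0)
L = np.array([[2., -1.], [-1., 2.]])          # illustrative full-rank L
lam, V = np.linalg.eigh(L)
C = V @ np.diag(lam ** -0.5) @ V.T            # symmetric, so C C^T = L^{-1}
y = C @ rng.standard_normal((2, 200_000))     # y = Cx with x ~ N(0, I)
print(np.cov(y))             # empirical covariance, ~ L^{-1}
print(np.linalg.inv(L))      # [[2/3, 1/3], [1/3, 2/3]]
```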
DIRECT SOLUTION
- Factorize L = B^T B, where B is the edge-vertex incidence matrix: B_{eu} = ±1 if u is an endpoint of e, 0 otherwise (e.g., rows [1, -1, 0] and [-1, 0, 1] for the star above)
- Set C = L^{-1} B^T: C C^T = L^{-1} B^T (L^{-1} B^T)^T = L^{-1} B^T B L^{-1} = L^{-1}
- Factorization + black-box access to solvers gives a sampling algorithm
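A minimal sketch of this construction; the 3-vertex path and the extra "grounding" row (one way to make L = B^T B full rank, as the slides assume) are illustrative devices, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
B = np.array([[1., -1., 0.],     # edge (1,2)
              [0., 1., -1.],     # edge (2,3)
              [1., 0., 0.]])     # grounding row, so L = B^T B is full rank
L = B.T @ B
y = rng.standard_normal((B.shape[0], 200_000))   # one Gaussian per row of B
x = np.linalg.solve(L, B.T @ y)                  # x = L^{-1} B^T y
print(np.cov(x))             # ~ L^{-1}
print(np.linalg.inv(L))
```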
PARALLEL SAMPLING ROUTINE
- [P-Spielman `14]: Z ≈_ε L^{-1} in polylog depth and nearly-linear work
- ≈: spectral similarity; A ≈_k B iff for all x: e^{-k} x^T A x ≤ x^T B x ≤ e^k x^T A x
- Can use B `in place' of A; can also boost accuracy
- Parallel sampling routine: y' ← B^T y, x ← solve(L, y'); the corresponding operator C satisfies C C^T ≈ L^{-1}
RANDOMNESS REQUIREMENT
- Sample y from N(0, I); y' ← B^T y; x ← solve(L, y'); return x
- B: m-by-n matrix, m = # of edges, so y needs to be an m-dimensional Gaussian (can get to O(n log n) with some work)
- Optimal randomness requirement: n, i.e., a C that is a square matrix
- Fewer random variables?
GENERALIZATIONS
- Lower randomness requirement: C C^T ≈ L^{-1} where C is a square matrix (akin to a QR factorization)
- Application of matrix roots: `half a step' of a random walk
- Can also view as a matrix square root: Z s.t. Z ≈ L^{-1/2}? Z s.t. Z ≈ L^{-1/3}? (≈: spectral approximation)
OUR RESULT
- Input: graph Laplacian L with condition number κ, parameter -1 ≤ p ≤ 1
- Output: access to a square operator C s.t. C C^T ≈_ε L^p
- Cost: O(log^{c1} m log^{c2} κ ε^{-4}) time, O(m log^{c1} m log^{c2} κ ε^{-4}) work
- κ: condition number, closely related to the bit complexity of solve(L, b)
- Extends to symmetric diagonally dominant (SDD) matrices
SUMMARY
- Gaussian sampling is closely related to linear system solving and matrix p-th roots
- Can approximately factor L^p into a product of sparse matrices
- Random walk polynomials can be sparsified by sampling random walks
OUTLINE
- Gaussian sampling, linear systems, matrix roots
- Sparse factorizations of L^p
- Sparsification of random walk polynomials
SIMPLIFICATION
- Adjust/rescale so that the diagonal is I: L = I - A
- A: random walk matrix, ||A|| < 1
- Add to the diagonal to make L full rank
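A sketch of this rescaling (the triangle graph and the 1% diagonal bump are illustrative choices): for L = D - W, the normalized matrix is D^{-1/2} L D^{-1/2} = I - A with A = D^{-1/2} W D^{-1/2}, and slightly increasing the diagonal shrinks A so that ||A|| < 1 strictly.

```python
import numpy as np

W = np.array([[0., 1., 1.],      # weighted adjacency of a triangle
              [1., 0., 1.],
              [1., 1., 0.]])
d = W.sum(axis=1)
D_inv_sqrt = np.diag(d ** -0.5)
A = D_inv_sqrt @ W @ D_inv_sqrt  # normalized: D^{-1/2} L D^{-1/2} = I - A
A = A / 1.01                     # adding ~1% to the diagonal shrinks A
print(np.linalg.norm(A, 2))      # ~0.99 < 1
```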
PROBLEM
- Given a random walk matrix A and parameter p, produce an easily evaluable C s.t. C C^T ≈ (I - A)^p
- Local approach for p = -1: I + A + A^2 + A^3 + ... = (I - A)^{-1}
- Each step passes information to a neighbor: I, A, A^2, ..., A^{diameter}
- Evaluate using O(diameter) matrix operations?
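A minimal sketch of the local approach for p = -1 (function name and toy matrix are illustrative): truncate the Neumann series, where each added term corresponds to one more hop of the walk.

```python
import numpy as np

def neumann_inverse(A, num_terms):
    """Truncated series I + A + A^2 + ...; needs ||A|| < 1 to converge."""
    result, power = np.eye(len(A)), np.eye(len(A))
    for _ in range(num_terms):
        power = power @ A        # one more walk step: information moves one hop
        result += power
    return result

A = np.array([[0., 0.5], [0.5, 0.]])
print(neumann_inverse(A, 50))
print(np.linalg.inv(np.eye(2) - A))   # matches to ~0.5^50
```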
FASTER INFORMATION PROPAGATION
- Recall: ||A|| < 1; I - A^{n^3} ≈ I if A corresponds to a random walk on an unweighted graph
- Repeated squaring: A^16 = (((A^2)^2)^2)^2, 4 operations
- Framework from [P-Spielman `14]: reduce (I - A)^p to computing (I - A^2)^p
- O(log κ) reduction steps suffice
SQUARING DENSE GRAPHS?!?
- A^2 can be dense; graph sparsification: find sparse A' s.t. I - A' ≈_ε I - A^2
- [ST `04][SS `08][OV `11] + some modifications, or [Koutis `14]: O(n log^c n ε^{-2}) entries, efficient, parallel
- [BSS `09, ALZ `14]: O(n ε^{-2}) entries, but quadratic cost
- Also preserves p-th powers
ABSORBING ERRORS
- Simplification: work with p = -1
- Direct factorization: (I - A)^{-1} = (I + A)(I - A^2)^{-1}
- Have: I - A' ≈ I - A^2, which implies (I - A')^{-1} ≈ (I - A^2)^{-1}
- But NOT: (I + A)(I - A')^{-1} ≈ (I + A)(I - A^2)^{-1}
- Incorporation of matrix approximations needs to be symmetric: X ≈ X' ⇒ U^T X U ≈ U^T X' U
- Instead use: (I - A)^{-1} = (I + A)^{1/2} (I - A^2)^{-1} (I + A)^{1/2} ≈ (I + A)^{1/2} (I - A')^{-1} (I + A)^{1/2}
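A numerical sanity check of the symmetric identity (the 2-by-2 symmetric A is an illustrative choice; the identity needs ||A|| < 1):

```python
import numpy as np
from scipy.linalg import sqrtm

A = np.array([[0.1, 0.4], [0.4, 0.2]])   # symmetric, ||A|| < 1
I = np.eye(2)
half = sqrtm(I + A).real
lhs = np.linalg.inv(I - A)
rhs = half @ np.linalg.inv(I - A @ A) @ half
print(np.allclose(lhs, rhs))   # True
```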
SIMILAR TO

                  | Connectivity     | Our Algorithm
Iteration         | A_{i+1} ≈ A_i^2  | I - A_{i+1} ≈ I - A_i^2 (until ||A_d|| is small)
Size reduction    | low degree       | sparse graph
Method            | derandomized     | randomized
Solution transfer | connectivity     | solution vectors

- Multiscale methods; NC algorithm for shortest path
- Logspace connectivity: [Reingold `02]
- Deterministic squaring: [Rozenman-Vadhan `05]
EVALUATING (I + A)^{1/2}?
- Recall: (I - A)^{-1} ≈ (I + A)^{1/2} (I - A')^{-1} (I + A)^{1/2}
- A well-conditioned matrix has a Maclaurin series expansion that is approximated well by a low-degree polynomial T_{1/2}(A_i)
- A_1 ≈ A_0^2 has eigenvalues in [0, 1], so the eigenvalues of I + A_i lie in [1, 2] when i > 0
- Doesn't work for (I + A_0)^{1/2}: eigenvalues of A_0 can be close to -1
MODIFIED IDENTITY
- (I - A)^{-1} = (I + A/2)^{1/2} (I - A/2 - A^2/2)^{-1} (I + A/2)^{1/2}
- Modified reduction: I - A_{i+1} ≈ I - A_i/2 - A_i^2/2
- I + A_i/2 has eigenvalues in [1/2, 3/2]
- Can approximate (to very high accuracy) with a low-degree polynomial / Maclaurin series, T_{1/2}(A_i/2)
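A sketch of such a low-degree approximation via the binomial (Maclaurin) series for (1 + x)^{1/2}; the function name, degree, and toy matrix are illustrative choices. The A/2 rescaling is what keeps the eigenvalues well inside the region of fast convergence.

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.special import binom

def maclaurin_sqrt(X, degree):
    """Binomial series for (I + X)^{1/2}: sum_k binom(1/2, k) X^k."""
    result, power = np.zeros_like(X), np.eye(len(X))
    for k in range(degree + 1):
        result = result + binom(0.5, k) * power
        power = power @ X
    return result

A = np.array([[0.1, 0.4], [0.4, 0.2]])
approx = maclaurin_sqrt(A / 2, degree=8)   # eigenvalues of A/2 lie in (-1/2, 1/2)
exact = sqrtm(np.eye(2) + A / 2).real
print(np.max(np.abs(approx - exact)))      # tiny (~1e-7)
```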
APPROX. FACTORIZATION CHAIN
- I - A_1 ≈_ε I - A_0/2 - A_0^2/2; I - A_2 ≈_ε I - A_1/2 - A_1^2/2; ...; I - A_i ≈_ε I - A_{i-1}/2 - A_{i-1}^2/2; I - A_d ≈ I, with d = O(log κ)
- (I - A_i)^{-1} ≈ T_{1/2}(A_i/2) (I - A_{i+1})^{-1} T_{1/2}(A_i/2)
- C_i = T_{1/2}(A_i/2) T_{1/2}(A_{i+1}/2) ... T_{1/2}(A_d/2) gives (I - A_i)^{-1} ≈ C_i C_i^T
- For the p-th root (-1 ≤ p ≤ 1): T_{p/2}(A_0/2) T_{p/2}(A_1/2) ... T_{p/2}(A_d/2)
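A dense, exact-arithmetic sketch of the chain for p = -1, with no sparsification and exact matrix square roots standing in for T_{1/2} (the toy matrix and iteration count are illustrative):

```python
import numpy as np
from scipy.linalg import sqrtm

A = np.array([[0.1, 0.4], [0.4, 0.2]])
I = np.eye(2)
C, Ai = I, A
for _ in range(40):                  # d = O(log kappa) steps in the real algorithm
    C = C @ sqrtm(I + Ai / 2).real   # exact stand-in for T_{1/2}(A_i / 2)
    Ai = Ai / 2 + Ai @ Ai / 2        # the (here dense, exact) reduction step
print(np.allclose(C @ C.T, np.linalg.inv(I - A)))   # True: C C^T ~ (I - A)^{-1}
```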
WORKING AROUND EXPANSIONS
- Alternate reduction step: (I - A)^{-1} = (I + A/2)(I - (3/4)A^2 - (1/4)A^3)^{-1}(I + A/2)
- Composition is now done with I + A/2: easy
- Hard part: finding a sparse approximation to I - (3/4)A^2 - (1/4)A^3
- (3/4)(I - A^2): same as before; (1/4)(I - A^3): cubic power
GENERALIZATION TO P-TH POWER
- (I - A)^p = (I + kA)^{-1} ((I + kA)^{2/p} (I - A))^p (I + kA)^{-1}
- Intuition: scalar operations commute; cancel the extra outer terms against the inner ones
- Can show: if 2/p is an integer and k > 2/p, then (I + kA)^{2/p}(I - A) is a combination of (I - A^c) for integers c up to 2/p
- Difficulty: sparsifying (I - A^c) for large values of c
SUMMARY
- Gaussian sampling is closely related to linear system solving and matrix p-th roots
- Can approximately factor L^p into a product of sparse matrices
OUTLINE
- Gaussian sampling, linear systems, matrix roots
- Sparse factorizations of L^p
- Sparsification of random walk polynomials
SPECTRAL SPARSIFICATION VIA EFFECTIVE RESISTANCE
- [Spielman-Srivastava `08]: suffices to sample each edge with probability at least O(log n) × weight × effective resistance, i.e., log n × A_uv × R(u, v)
- Issues: I - A^3 is dense; need to sample without explicitly generating all edges / resistances
- Two-step approach: get a sparsifier with edge count close to m, then run a full sparsifier
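For contrast, a brute-force sketch of resistance-based sampling probabilities on an explicitly given graph (the whole point of what follows is to avoid forming I - A^3 explicitly; the names, the triangle example, and the omitted oversampling constant are illustrative assumptions):

```python
import numpy as np

def effective_resistances(L, edges):
    """R(u, v) = (e_u - e_v)^T L^+ (e_u - e_v), via the pseudoinverse."""
    Lp = np.linalg.pinv(L)
    return np.array([Lp[u, u] - 2 * Lp[u, v] + Lp[v, v] for u, v in edges])

edges, w = [(0, 1), (1, 2), (0, 2)], np.ones(3)   # unit-weight triangle
n = 3
L = np.zeros((n, n))
for (u, v), wt in zip(edges, w):
    L[u, u] += wt; L[v, v] += wt
    L[u, v] -= wt; L[v, u] -= wt
R = effective_resistances(L, edges)
probs = np.minimum(1.0, np.log(n) * w * R)   # oversampling constant omitted
print(R, probs)                              # R = 2/3 for each triangle edge
```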
TWO STEP APPROACH FOR I - A^2
- A: 1 step of random walk; A^2: 2 steps of random walk
- [P-Spielman `14]: for a fixed midpoint, the edges of A^2 form a (weighted) complete graph
- Replace these complete graphs with expanders: O(m log n) edges
- Then run a black-box sparsifier
I - A^3
- A: one step of random walk; A^3: 3 steps of random walk
- (Part of) an edge uv in I - A^3 corresponds to a length-3 path in A: u-y-z-v
- Weight: A_uy A_yz A_zv
BOUND RESISTANCE ON I - A
- Can check: I - A ≈_3 I - A^3 (spectral theorem: can work with scalars)
- So the resistance between u and v in I - A gives an upper bound for the sampling probability: sampling probability = log n × w(path) × R(path)
- Rayleigh's monotonicity law: resistances in subgraphs of I - A are good upper bounds
- Bound R(u, v) using the length-3 path u-y-z-v in A
SAMPLING DISTRIBUTION
- Weight of the path u-y-z-v: A_uy A_yz A_zv
- Resistance along the path: 1/A_uy + 1/A_yz + 1/A_zv
- Sampling probability = log n × w() × R()
- Product: A_yz A_zv + A_uy A_zv + A_uy A_yz
ONE TERM AT A TIME
- Probability of picking u-y-z-v: A_yz A_zv + A_uy A_zv + A_uy A_yz
- First term — interpretation: pick edge uy, take 2 steps of random walk, then sample the edge in A^3 corresponding to u-y-z-v
- Total for a fixed choice of uy: Σ_zv A_yz A_zv = Σ_z A_yz (Σ_v A_zv) ≤ Σ_z A_yz ≤ 1 (A: random walk transition probabilities)
- Total over all choices of uy: m
MIDDLE TERM
- Probability of picking u-y-z-v: A_yz A_zv + A_uy A_zv + A_uy A_yz
- Middle term — interpretation: pick edge yz, take one step from y to get u and one step from z to get v, giving the edge u-y-z-v of A^3; total: m again
- A_uy A_yz is handled similarly (a sampler sketch follows below)
- Result: an O(m log n)-size approximation to I - A^3 in O(m log n) time; can then sparsify further in nearly-linear time
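A minimal sketch of the first-term sampler referenced above (the graph representation, function names, and the unweighted uniform edge pick are illustrative assumptions); repeating this O(m log n) times yields the path samples whose reweighted union approximates I - A^3.

```python
import numpy as np

def sample_path(A, edges, rng):
    """Pick an edge uy, then take two walk steps: a length-3 path u-y-z-v,
    i.e., one sampled edge of A^3 (first-term case; unweighted edge pick)."""
    u, y = edges[rng.integers(len(edges))]
    z = rng.choice(len(A), p=A[y])   # one random-walk step from y
    v = rng.choice(len(A), p=A[z])   # one more step from z
    return u, y, z, v

A = np.array([[0., .5, .5],          # random walk on a triangle
              [.5, 0., .5],
              [.5, .5, 0.]])
edges = [(0, 1), (1, 2), (0, 2)]
rng = np.random.default_rng(0)
print([sample_path(A, edges, rng) for _ in range(3)])
```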
EXTENSIONS
- I - A^k in O(mk log^c n) time
- Even powers: I - A ≈ I - A^2 does not hold, but I - A^2 ≈_2 I - A^4; certify via the 2-step matrix, same algorithm
- I - A^k in O(m log k log^c n) time when k is a multiple of 4
SUMMARY
- Gaussian sampling is closely related to linear system solving and matrix p-th roots
- Can approximately factor L^p into a product of sparse matrices
- Random walk polynomials can be sparsified by sampling random walks
OPEN QUESTIONS
- Generalizations: batch sampling? Connections to multigrid/multiscale methods? Other functionals of L?
- Sparsification of random walk polynomials: degree-n polynomials in nearly-linear time? Positive and negative coefficients? Connections with other algorithms based on sampling random walks?
THANK YOU! Questions?
Manuscripts on arXiv: http://arxiv.org/abs/1311.3286 and http://arxiv.org/abs/1410.5392