Estimating PageRank on Graph Streams
Atish Das Sarma (Georgia Tech); Sreenivas Gollapudi, Rina Panigrahy (Microsoft Research)
PageRank: Determines a ranking of the nodes of a graph. Typically applied to large graphs (WWW, social networks). Run daily by commercial search engines.
PageRank Computation [Figure: example graph with nodes u, a, b, c]
PageRank Computation. Our approach: no matrix-vector multiplication! [Figure: example graph with nodes u, a, b, c]
Our Result: Generate many random walk samples efficiently, and use them to approximate PageRank.
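To illustrate how random walk samples can stand in for matrix-vector multiplication, here is a minimal in-memory sketch. It is not the streaming algorithm of this talk. It uses the standard fact that a walk started at a uniformly random node and stopped with a fixed reset probability at each step ends at a node distributed according to PageRank. The function name `estimate_pagerank`, the adjacency-list input `out_edges`, and the value `reset_prob=0.15` are assumptions made for the example, and every node is assumed to have at least one out-edge.

```python
import random
from collections import Counter

def estimate_pagerank(out_edges, num_walks=20000, reset_prob=0.15, seed=0):
    """Estimate PageRank from the end points of random walks with restarts.

    out_edges: dict mapping every node to a non-empty list of out-neighbors
    (hypothetical input format chosen for this sketch).
    """
    rng = random.Random(seed)
    nodes = list(out_edges)
    ends = Counter()
    for _ in range(num_walks):
        node = rng.choice(nodes)              # start at a uniformly random node
        # Keep walking with probability 1 - reset_prob at each step.
        while rng.random() > reset_prob:
            node = rng.choice(out_edges[node])
        ends[node] += 1
    # The fraction of walks ending at v estimates the PageRank of v.
    return {v: ends[v] / num_walks for v in nodes}
```

For example, on `{"u": ["a"], "a": ["u"], "b": ["a"], "c": ["a"]}` the estimate for `a` should come out well above the estimates for `b` and `c`, since three nodes point to `a` and none point to `b` or `c`.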
Other Results from Random Walks. Using streams, we can also estimate: mixing time; conductance.
Streaming. Input is a “stream”: e_1, e_2, e_3, e_4, e_5, e_6, e_7, ... Small working memory (RAM). Few passes. Examples: frequency moments, quantiles. Graphs: the stream consists of edges, in arbitrary order.
Related Work
Sparsifiers (Benczúr-Karger 96, Spielman-Teng 01, Spielman-Srivastava 08): given an undirected graph, produce a sparse graph that approximately preserves x’Lx; can be used to compute sparse cuts.
Streaming version of BK96 (Ahn, Guha 09): sparse cuts in 1 pass and Õ(n) space.
Accelerated PageRank (McSherry 08): heuristics.
Key Idea: Perform one walk of length l from u efficiently. Later, extend to many walks. [Figure: walk of length l from u to v]
Single Random Walk: Naive Algorithm. One step with every pass! Constant space, but one pass per step, so l passes for a walk of length l.
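The naive algorithm can be sketched as follows. Each step of the walk costs one full pass over the edge stream, and the next edge is chosen by reservoir sampling among the out-edges of the current node, so only the current node, a counter, and one candidate are held in memory. The names `walk_one_step_per_pass` and `edge_stream` are assumptions for this illustration; `edge_stream` is taken to be a zero-argument callable that returns a fresh iterator over `(u, v)` edges, which models one pass. The returned list of visited nodes is kept only for inspection.

```python
import random

def walk_one_step_per_pass(edge_stream, start, length, seed=0):
    """Random walk of `length` steps using one pass over the stream per step."""
    rng = random.Random(seed)
    walk = [start]
    current = start
    for _ in range(length):
        chosen, seen = None, 0
        for u, v in edge_stream():            # one full pass over the stream
            if u == current:
                seen += 1
                # Reservoir sampling: keep the k-th out-edge with probability 1/k,
                # which leaves a uniformly random out-edge at the end of the pass.
                if rng.randrange(seen) == 0:
                    chosen = v
        if chosen is None:                    # no out-edge: the walk cannot continue
            break
        current = chosen
        walk.append(current)
    return walk
```

With the edge list stored in memory, a pass can be modeled as `lambda: iter(edges)`.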
Second Naive Algorithm. Single pass: sample sufficiently many edges! If, then sample 2 out-edges from each node (and store their order).
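A sketch of the single-pass idea, under stated assumptions: the condition on the slide is missing in this copy, so the number of samples per node is simply a parameter `k` here. Taking `k` equal to the walk length is always enough, because no node can be left more than that many times. Each node keeps `k` independent reservoirs, and the i-th departure from a node uses that node's i-th stored sample, which is why the order is stored. The function names are hypothetical.

```python
import random
from collections import defaultdict

def sample_out_edges(edge_stream, k, seed=0):
    """One pass: keep k independent uniform out-edge samples per node."""
    rng = random.Random(seed)
    degree = defaultdict(int)
    samples = {}
    for u, v in edge_stream:
        degree[u] += 1
        slots = samples.setdefault(u, [None] * k)
        for i in range(k):
            # Independent reservoir per slot; the first edge always fills it.
            if rng.randrange(degree[u]) == 0:
                slots[i] = v
    return samples

def walk_from_samples(samples, start, length):
    """Replay a walk: the i-th departure from a node uses its i-th sample."""
    visits = defaultdict(int)
    walk, node = [start], start
    for _ in range(length):
        slots = samples.get(node)
        if slots is None or visits[node] >= len(slots):
            break                             # no out-edge, or samples exhausted
        nxt = slots[visits[node]]
        visits[node] += 1
        walk.append(nxt)
        node = nxt
    return walk
```

The price of the single pass is space: storing `k` samples for each of n nodes takes on the order of n times `k` words.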
Comparison. Naive (single walk): Our Result: In fact walks! Automatically:
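Insight: Merge Short Walks. Sample a fraction of the nodes (centers). In w passes, perform a length-w walk from every center. Merge and extend the short walks! Two problems: (a) the walk ends up at a node for the second time; (b) the walk ends up at a non-sampled node. [Figure: walk from s stitched together from short walks of length w]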
Stuck Nodes. Sample an edge from the stuck node. Again. And again... Slow? If new nodes, good in passes! [Figure: walk from s stitched together from short walks of length w]
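Stuck Nodes. Stuck on the same nodes? Sample s edges from each stuck node: this gives either s steps of progress OR a new node! The set of stuck nodes must include all previously seen centers. [Figure: walk from s with short walks of length w and s sampled edges at each stuck node]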
Summary. Perform short walks from sampled centers. Concatenate walks until stuck. Sample edges from the stuck nodes. Make local progress until a new node is reached. Local progress = s. New node: center with prob. Amortized progress, every pass.
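The steps summarized above can be sketched in memory as follows. This is a simplified illustration and not the streaming algorithm itself: the graph is an adjacency list, and when the walk is stuck (at a non-center, or at a center whose short walk is already used) the sketch takes a single fresh random step instead of sampling s edges from every stuck node. The names `stitched_walk`, `short_len`, and `center_prob` are assumptions, and every node is assumed to have at least one out-edge. Because each short walk is consumed at most once, the result is still distributed as a true random walk.

```python
import random

def stitched_walk(out_edges, start, length, short_len, center_prob, seed=0):
    """Build a walk of `length` steps from `start` by merging short walks."""
    rng = random.Random(seed)
    # Sample centers; the start node is always treated as a center.
    centers = {v for v in out_edges if rng.random() < center_prob} | {start}
    # One short walk per center. In the streaming setting these would all be
    # advanced together, one step per pass.
    short = {}
    for c in centers:
        path, node = [], c
        for _ in range(short_len):
            node = rng.choice(out_edges[node])
            path.append(node)
        short[c] = path
    walk, node, fallback_steps = [start], start, 0
    while len(walk) - 1 < length:
        if node in short:
            # Unused short walk available: consume it (each is used only once).
            walk.extend(short.pop(node))
        else:
            # Stuck: simplified fallback of one fresh random step.
            walk.append(rng.choice(out_edges[node]))
            fallback_steps += 1
        node = walk[-1]
    return walk[:length + 1], fallback_steps
```

The second return value counts how many steps needed the fallback, which gives a rough feel for how often the walk gets stuck for a given `center_prob` and `short_len`.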
Summary. Total number of passes : Total Space :
Summary. Set Number of passes = Space =
Many Walks. Naive space bound: Observation: many short walks are not used in a single random walk. We show:
Many Random Walks : probability node ’s short walk used in single RW. If known : save lot of space! Perform K random walks Total number of short walks required is about Don’t know. But can estimate.
Estimating Run K = (log n) walks of length Gives a crude estimate of Sufficient to double K Continue doubling K Gives K walks in space Passes u l
Distributions samples Distribution: u Space Passes
Mixing Time, Conductance. Undirected graphs: compare the walk distribution with the steady state. Estimating the difference: samples [Batu et al. ’01]; this gives an approximate mixing time. Directed graphs: walk until the distribution “stabilizes”: samples. Conductance: Recall space for walks:
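For undirected graphs the comparison with the steady state can be illustrated directly, because the stationary probability of a node v is deg(v) / 2m. The sketch below estimates the L1 distance between the empirical end-point distribution of walks of a given length and that stationary distribution. It works on an in-memory adjacency list and uses plain empirical frequencies, not the sample-efficient test of Batu et al.; the function name and parameters are assumptions for the example.

```python
import random
from collections import Counter

def l1_distance_to_stationary(adj, start, length, num_walks=5000, seed=0):
    """Estimated L1 distance between the length-step walk distribution
    from `start` and the stationary distribution deg(v) / 2m."""
    rng = random.Random(seed)
    ends = Counter()
    for _ in range(num_walks):
        node = start
        for _ in range(length):
            node = rng.choice(adj[node])
        ends[node] += 1
    total_degree = sum(len(nbrs) for nbrs in adj.values())   # equals 2m
    return sum(abs(ends[v] / num_walks - len(adj[v]) / total_degree)
               for v in adj)
```

Calling this for increasing lengths and reporting the first length at which the distance drops below a chosen threshold gives a rough empirical mixing time, up to sampling noise.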
Results Recap. Mixing time for undirected graphs : Quadratic approximation to conductance. PageRank to accuracy
Open Questions. Improve the number of passes for random walks; in particular, sub-linear space and constant passes. Graph cuts and graph sparsification for directed graphs. Better (streaming) algorithms for computing eigenvectors.
Thank You!
Summary Perform short walks from sampled centers Concatenate walks until stuck Sample edges from stuck Make local progress until new node Local progress = s New node = nodes gives center Amortized, every pass -
Analysis Total number of passes : Total Space : Set Number of passes = Space =