Complexity and Efficient Algorithms Group / Department of Computer Science Testing the Cluster Structure of Graphs Christian Sohler joint work with Artur.

Presentation transcript:

Complexity and Efficient Algorithms Group / Department of Computer Science Testing the Cluster Structure of Graphs Christian Sohler joint work with Artur Czumaj and Pan Peng

Slide 2: Very Large Networks

Examples
- Social networks
- The World Wide Web
- Cocitation graphs
- Coauthorship graphs

Data size
- Gigabytes up to terabytes (for the graph alone)
- Additional data can be in the petabyte range

Source: TonZ; image under Creative Commons License

Slide 3: Information in the Network Structure

Social network
- Edge: two persons are "friends"
- Well-connected subgraph: a social group

Cocitation graphs
- Edge: two papers deal with a similar subject
- Well-connected subgraph: papers in a scientific area

Coauthor graphs
- Edge: two persons have worked together
- Well-connected subgraph: a scientific community

Slide 4: How can we extract this information?

Objective
- Identify the well-connected subgraphs (clusters) of a huge graph

Problem
- Classical algorithms require at least linear time
- Even linear time may be too slow for huge networks

Our approach
- Decide whether the graph has a cluster structure or is far from having one
- If it does, get a representative vertex from each (sufficiently big) cluster
- Running time sublinear in the input size

Slide 5: Formalizing the Problem – The Input

Input model
- Undirected graph G = (V, E) with vertex set {1, …, n}
- Maximum degree bounded by a constant D
- The graph is stored in adjacency lists
- We can query for the i-th edge incident to vertex j in O(1) time

Property testing [Rubinfeld, Sudan, 1996; Goldreich, Goldwasser, Ron, 1998]
- Formal framework to study sampling algorithms for very large networks
- Bounded-degree graph model [Goldreich, Ron, 2002]
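The query model on this slide can be pictured with a small sketch. The class name `BoundedDegreeGraph` and its methods are hypothetical helpers for illustration only, assuming the graph fits in memory as plain Python lists; the slide itself only states that the i-th edge incident to vertex j can be queried in O(1) time.

```python
class BoundedDegreeGraph:
    """Adjacency-list graph on vertices 1..n with maximum degree D (hypothetical helper)."""

    def __init__(self, n, D):
        self.n = n
        self.D = D
        self.adj = {v: [] for v in range(1, n + 1)}

    def add_edge(self, u, v):
        # Reject edges that would violate the degree bound D.
        if len(self.adj[u]) >= self.D or len(self.adj[v]) >= self.D:
            raise ValueError("degree bound D exceeded")
        self.adj[u].append(v)
        self.adj[v].append(u)

    def neighbor(self, j, i):
        """Return the endpoint of the i-th edge (1-indexed) incident to vertex j,
        or None if vertex j has fewer than i incident edges."""
        if not 1 <= i <= self.D:
            raise IndexError("i must lie in 1..D")
        nbrs = self.adj[j]
        return nbrs[i - 1] if i <= len(nbrs) else None


# A path 1 - 2 - 3 with degree bound D = 2.
g = BoundedDegreeGraph(n=3, D=2)
g.add_edge(1, 2)
g.add_edge(2, 3)
print(g.neighbor(2, 1), g.neighbor(2, 2), g.neighbor(1, 2))
```

A sublinear algorithm in this model only ever calls something like `neighbor`, so its cost can be measured by the number of such queries.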

Slide 6: Formalizing the Problem – Cluster Structure

Definition
- The conductance φ(C, V−C) is defined as [formula missing in the transcript]
- The conductance φ(G) of G is the minimum of φ(C, V−C) over all C with |C| ≤ |V|/2

Definition
- A subset C ⊆ V is called a (φ_in, φ_out)-cluster if
  - φ(G[C]) ≥ φ_in
  - φ(C, V−C) ≤ φ_out

Definition
- A partition of V into at most k (φ_in, φ_out)-clusters is called a (k, φ_in, φ_out)-clustering
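To make the cluster definition concrete, here is a brute-force sketch on a toy graph. The formula for the conductance of a cut is not preserved in this transcript, so the code assumes the normalization e(C, V−C) / (D·|C|) that is common in the bounded-degree model; that choice, and all function names, are assumptions of this example. The enumeration over all subsets is exponential and only meant for tiny graphs.

```python
from itertools import combinations


def cut_conductance(adj, C, D):
    """Assumed definition: number of edges leaving C, divided by D * |C|."""
    C = set(C)
    crossing = sum(1 for u in C for w in adj[u] if w not in C)
    return crossing / (D * len(C))


def graph_conductance(adj, D):
    """Minimum cut conductance over all nonempty C with |C| <= |V|/2 (brute force)."""
    vertices = list(adj)
    best = float("inf")
    for size in range(1, len(vertices) // 2 + 1):
        for C in combinations(vertices, size):
            best = min(best, cut_conductance(adj, C, D))
    return best


def is_cluster(adj, C, D, phi_in, phi_out):
    """Check the two conditions of a (phi_in, phi_out)-cluster for the subset C."""
    C = set(C)
    induced = {u: [w for w in adj[u] if w in C] for u in C}  # the subgraph G[C]
    return (graph_conductance(induced, D) >= phi_in
            and cut_conductance(adj, C, D) <= phi_out)


# Two triangles {0,1,2} and {3,4,5} joined by the single edge 2-3, with D = 3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(cut_conductance(adj, {0, 1, 2}, 3))       # one crossing edge over 3 * 3
print(is_cluster(adj, {0, 1, 2}, 3, 0.5, 0.2))
```

Under this assumed normalization each triangle has inner conductance 2/3 and outer conductance 1/9, so the two triangles form a (2, 0.5, 0.2)-clustering of the toy graph.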

Slide 7: Formalizing the Problem

Our objective
- Develop a sampling algorithm that
  (a) accepts with probability at least 2/3 if the input graph is a (k, φ_in, φ_out)-clustering
  (b) rejects with probability at least 2/3 if the input graph differs from every (k, φ_in*, φ_out*)-clustering in more than εDn edges
- The number of samples taken by the algorithm (and its running time) should be as small as possible

Slide 8: Random Walks, Stationary Distributions & Convergence

Random walk
- In each step: move from the current vertex v to a neighbor chosen uniformly at random

Convergence
- If G is connected and not bipartite, a random walk converges to a unique stationary distribution
- In the stationary distribution, Pr[random walk is at vertex v] ∝ deg(v)
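The convergence claim can be checked numerically on a small example. The sketch below evolves the exact distribution of a simple random walk (no sampling) on a hypothetical four-vertex graph, a triangle with one pendant vertex, which is connected and not bipartite. The graph and the helper name `walk_step` are choices made for this illustration.

```python
def walk_step(adj, p):
    """Advance the distribution p by one step of the simple random walk."""
    q = {v: 0.0 for v in adj}
    for v, mass in p.items():
        for w in adj[v]:
            q[w] += mass / len(adj[v])  # each neighbor is equally likely
    return q


# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

# Start with all probability mass on vertex 3 and run 200 steps.
p = {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0}
for _ in range(200):
    p = walk_step(adj, p)

# The claimed limit: probability proportional to the degree.
total_degree = sum(len(nbrs) for nbrs in adj.values())
stationary = {v: len(adj[v]) / total_degree for v in adj}
print(p)
print(stationary)
```

On this graph the degrees are 2, 2, 3 and 1, so the limit assigns probabilities 0.25, 0.25, 0.375 and 0.125.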

Slide 9: Random Walks, Stationary Distributions & Convergence

Lazy random walk
- In each step:
  - the probability of moving from the current vertex v to a given neighbor u is 1/(2D)
  - the walk stays at v with the remaining probability
- The stationary distribution is uniform

Rate of convergence
- Can be expressed in terms of the conductance of G or the second largest eigenvalue of the transition matrix (Cheeger's inequality)
- O(log n) steps suffice if G is a (1, φ_in, φ_out)-clustering for constant φ_in
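A minimal sketch of the lazy random walk, assuming a simple graph stored as a dictionary of neighbor lists. `lazy_walk_endpoint` samples one walk, and `lazy_step` evolves the exact distribution so that the uniform limit can be observed; both names and the toy graph are hypothetical.

```python
import random


def lazy_walk_endpoint(adj, D, start, length, rng):
    """Sample the end point of one lazy random walk of the given length."""
    v = start
    for _ in range(length):
        i = rng.randrange(2 * D)   # uniform over 2D slots
        if i < len(adj[v]):        # slot i belongs to the i-th neighbor: move there
            v = adj[v][i]
        # every other slot means "stay at v"
    return v


def lazy_step(adj, D, p):
    """Advance the exact distribution p by one lazy step."""
    q = {v: 0.0 for v in adj}
    for v, mass in p.items():
        for w in adj[v]:
            q[w] += mass / (2 * D)
        q[v] += mass * (1 - len(adj[v]) / (2 * D))
    return q


# Triangle 0-1-2 with a pendant vertex 3; maximum degree D = 3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
D = 3

p = {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0}
for _ in range(300):
    p = lazy_step(adj, D, p)
print(p)  # close to the uniform distribution on the four vertices

end = lazy_walk_endpoint(adj, D, start=3, length=10, rng=random.Random(0))
```

Because every edge is traversed with the same probability 1/(2D) in both directions, the transition matrix is symmetric, which is why the limit is uniform even though the degrees differ.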

Slide 10: Previous Work

k = 1: Testing expansion ((1, φ_in, φ_out)-clustering)
- [Goldreich, Ron, 2000] introduced an algorithm based on collision statistics of random walks
- They conjectured that the algorithm, in O*(√n) running time, accepts every φ-expander and rejects every graph that differs in more than εDn edges from every φ*-expander
- First proof, with a polylogarithmic gap (in n) between φ and φ* [Czumaj, Sohler, 2010]
- Improvement of the parameters to a constant gap (with running time O*(n^(1/2+μ))) [Nachmias, Shapira, 2010; Kale, Seshadhri, 2011]
- [Batu et al., 2013]: tester for mixing properties of Markov chains
- O* assumes all input parameters except n to be constant and suppresses logarithmic factors

Slide 11: Previous Work

TestingExpansion(G, ε)
- Sample Θ(1/ε) vertices uniformly at random
- For each sampled vertex v do
  - Perform O*(√n) lazy random walks of length Θ*(log n) starting at v
  - If the number of collisions among the end points is too high, then reject
- Accept

Analysis
- If G is a (1, φ_in, φ_out)-clustering, then a lazy random walk converges quickly to the uniform distribution
- Let p(v) be the distribution of the end points of a lazy random walk starting at v
- ||p(v)||² is the probability that two such walks end at the same vertex, so the expected number of collisions is proportional to it
- The uniform distribution minimizes ||p(v)||²
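The collision statistic at the heart of TestingExpansion can be sketched as follows. This is an illustration under simplifying assumptions, not the tester itself: the number of walks, the walk length and the toy graphs are arbitrary choices, and no rejection threshold is derived. The helper names are hypothetical.

```python
import random
from collections import Counter


def lazy_walk_endpoint(adj, D, start, length, rng):
    """End point of one lazy walk: move to each neighbor with probability 1/(2D)."""
    v = start
    for _ in range(length):
        i = rng.randrange(2 * D)
        if i < len(adj[v]):
            v = adj[v][i]
    return v


def estimate_collision_probability(adj, D, start, num_walks, length, rng):
    """Fraction of pairs of walks from `start` that end at the same vertex.
    Its expectation is ||p(start)||^2."""
    ends = Counter(lazy_walk_endpoint(adj, D, start, length, rng)
                   for _ in range(num_walks))
    collisions = sum(c * (c - 1) // 2 for c in ends.values())
    pairs = num_walks * (num_walks - 1) // 2
    return collisions / pairs


def complete_graph(vertices):
    vs = list(vertices)
    return {v: [w for w in vs if w != v] for v in vs}


rng = random.Random(1)
expander = complete_graph(range(8))                                  # one well-connected part
split = {**complete_graph(range(4)), **complete_graph(range(4, 8))}  # two separate parts

good = estimate_collision_probability(expander, 7, 0, 1000, 30, rng)
bad = estimate_collision_probability(split, 3, 0, 1000, 30, rng)
print(good, bad)
```

On the complete graph the end points are nearly uniform over 8 vertices, so the estimate is near 1/8. On the split graph the walk is trapped among 4 vertices and the estimate is near 1/4, which is the kind of excess a collision-based tester looks for.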

Slide 12: Previous Work (continued)

Analysis
- If G is far away from a (1, φ_in, φ_out)-clustering, then ||p(v)||² is large

Slide 13: Testing k-Clusterings

Main idea
- As the length of the random walks increases, two random walks starting in the same cluster should eventually have almost the same distribution (which is almost uniform on the cluster)
- Two random walks starting in different clusters should have different distributions

Obstacles
- We cannot test closeness to the uniform distribution since we don't know the clusters
- We do not compare stationary distributions

Slide 14: The Algorithm

ClusteringTest
- Sample a set S of s vertices uniformly at random
- For any v ∈ S, let p(v) be the distribution of the end points of a random walk of length Θ*(log n) starting at v
- For each pair u, v ∈ S do
  - If p(u) and p(v) are close, then add an edge (u, v) to the "cluster graph" on vertex set S
- Accept if and only if the cluster graph is a collection of at most k cliques
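The structure of ClusteringTest can be sketched on a toy input. This is not a sublinear implementation: it computes each end-point distribution exactly instead of estimating it from samples, it assumes a lazy walk, and the sample, the walk length and the closeness threshold are passed in as parameters chosen for the example rather than derived from the analysis. All function names are hypothetical.

```python
def lazy_distribution(adj, D, start, length):
    """Exact end-point distribution of a lazy walk (assumption: lazy, not simple)."""
    p = {v: 0.0 for v in adj}
    p[start] = 1.0
    for _ in range(length):
        q = {v: 0.0 for v in adj}
        for v, mass in p.items():
            for w in adj[v]:
                q[w] += mass / (2 * D)
            q[v] += mass * (1 - len(adj[v]) / (2 * D))
        p = q
    return p


def clustering_test(adj, D, k, sample, length, threshold):
    """Accept iff the cluster graph on `sample` is a union of at most k cliques."""
    dist = {v: lazy_distribution(adj, D, v, length) for v in sample}
    nbrs = {v: set() for v in sample}
    for i, u in enumerate(sample):
        for v in sample[i + 1:]:
            gap = sum((dist[u][x] - dist[v][x]) ** 2 for x in adj)
            if gap <= threshold:          # p(u) and p(v) are "close"
                nbrs[u].add(v)
                nbrs[v].add(u)
    seen, components = set(), 0
    for s in sample:
        if s in seen:
            continue
        comp, stack = {s}, [s]            # collect the connected component of s
        while stack:
            x = stack.pop()
            for y in nbrs[x]:
                if y not in comp:
                    comp.add(y)
                    stack.append(y)
        if any(len(nbrs[x]) != len(comp) - 1 for x in comp):
            return False                  # component is not a clique
        seen |= comp
        components += 1
    return components <= k


# Two disjoint 4-cliques on vertices 0..3 and 4..7 (n = 8, D = 3).
adj = {v: [w for w in range(4 * (v // 4), 4 * (v // 4) + 4) if w != v]
       for v in range(8)}
sample = [0, 1, 4, 5]
print(clustering_test(adj, 3, 2, sample, length=5, threshold=1 / 32))
print(clustering_test(adj, 3, 1, sample, length=5, threshold=1 / 32))
```

With two well-separated cliques the cluster graph on the sample splits into two cliques, so the sketch accepts for k = 2 and rejects for k = 1.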

Slide 15: Completeness

Lemma (informal)
- Let p(v) denote the distribution of the end points of a random walk of given length starting at v. For our choice of parameters, if G is a (k, φ_in, φ_out)-clustering, then
  (a) for most pairs u, v from the same cluster C, ||p(v) − p(u)||² ≤ 1/(4n),
  (b) for most pairs u, v from different clusters, ||p(v) − p(u)||² > 1/n.

Slide 16: Completeness (continued)

- The proof of the lemma uses a higher-order Cheeger inequality [Lee, Oveis Gharan, Trevisan, 2012]

Slide 17: Completeness (continued)

Consequence
- If we can estimate the squared ℓ2-distance of two distributions in sublinear time up to an error of 1/(4n), then ClusteringTest accepts any (k, φ_in, φ_out)-clustering

Slide 18: Completeness (continued)

- The distance estimation can be done using previous work of [Batu et al., 2013] or [Chan, Diakonikolas, Valiant, Valiant, 2014]
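To give a feeling for how a squared ℓ2-distance can be estimated from samples alone, the sketch below uses a standard collision-based unbiased estimator. It is not claimed to be the estimator of the cited papers, and the sample sizes here are arbitrary rather than tuned for sublinear running time. The function name is hypothetical.

```python
import random
from collections import Counter


def estimate_sq_l2_distance(samples_p, samples_q):
    """Unbiased estimate of ||p - q||^2 = ||p||^2 + ||q||^2 - 2<p, q>
    from independent samples of p and of q."""
    m_p, m_q = len(samples_p), len(samples_q)
    cp, cq = Counter(samples_p), Counter(samples_q)
    # Collisions within one sample set estimate the squared norm.
    norm_p = sum(c * (c - 1) for c in cp.values()) / (m_p * (m_p - 1))
    norm_q = sum(c * (c - 1) for c in cq.values()) / (m_q * (m_q - 1))
    # Collisions across the two sample sets estimate the inner product.
    inner = sum(cp[x] * cq[x] for x in cp if x in cq) / (m_p * m_q)
    return norm_p + norm_q - 2 * inner


rng = random.Random(2)
m = 5000
a = [rng.randrange(10) for _ in range(m)]        # uniform on {0, ..., 9}
b = [rng.randrange(10) for _ in range(m)]        # same distribution as a
c = [10 + rng.randrange(10) for _ in range(m)]   # uniform on {10, ..., 19}

same = estimate_sq_l2_distance(a, b)   # true value 0
far = estimate_sq_l2_distance(a, c)    # true value 0.1 + 0.1 = 0.2
print(same, far)
```

In the tester, the two sample sets would be end points of random walks started at u and at v, and the estimate would be compared against the thresholds from the lemma.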

Slide 19: Soundness

Lemma (informal)
- If G differs in more than εDn edges from every (k, φ_in*, φ_out*)-clustering, then one can partition V into k+1 subsets C_1, …, C_{k+1} of size Ω(n) such that φ(C_i, V−C_i) is small for all i

Example (figure): a graph that is ε-far from a (2, φ_in*, φ_out*)-clustering

Slide 20: Soundness (continued)

- The sample will hit all k+1 subsets

Slide 21: Soundness (continued)

- The distance between the distributions of vertices from different clusters is big

Slide 22: Summary

Theorem
- Algorithm ClusteringTest accepts every (k, φ_in, φ_out)-clustering with probability at least 2/3 and rejects every graph that differs in more than εDn edges from every (k, φ_in*, φ_out*)-clustering with probability at least 2/3, where φ_out = O_{D,k}(ε⁴ φ_in²) and φ_in* = Θ_{D,k}(ε⁴ φ_in² / log n)
- The running time of the algorithm is O*(√n)

Slide 23: This may be good in theory…

Take-away message
- We can compare distributions of end points of random walks to detect cluster structures in a graph

Difficulties in practice
- Typically, we do not know the parameters of the clusters
- Our analysis is probably not strong enough for practical purposes

Idea
- We sample some vertices and then compare the distributions of end points of random walks for different lengths
- We put an edge between two vertices whose distributions are close and study how the number of connected components develops as the length of the random walk increases
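The experimental idea can be sketched as follows: for several walk lengths, connect sampled vertices whose end-point distributions are close and count the connected components. The sketch computes exact distributions of a lazy walk on a toy graph; the sample, the threshold 1/(4n) and the chosen lengths are assumptions made for this example, and the function names are hypothetical.

```python
def lazy_distribution(adj, D, start, length):
    """Exact end-point distribution of a lazy walk of the given length."""
    p = {v: 0.0 for v in adj}
    p[start] = 1.0
    for _ in range(length):
        q = {v: 0.0 for v in adj}
        for v, mass in p.items():
            for w in adj[v]:
                q[w] += mass / (2 * D)
            q[v] += mass * (1 - len(adj[v]) / (2 * D))
        p = q
    return p


def count_components(adj, D, sample, length, threshold):
    """Number of connected components of the closeness graph on `sample`."""
    dist = {v: lazy_distribution(adj, D, v, length) for v in sample}
    parent = {v: v for v in sample}

    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, u in enumerate(sample):
        for v in sample[i + 1:]:
            gap = sum((dist[u][x] - dist[v][x]) ** 2 for x in adj)
            if gap <= threshold:
                parent[find(u)] = find(v)
    return len({find(v) for v in sample})


# Two 4-cliques {0..3} and {4..7} joined by the single edge 3-4 (n = 8, D = 4).
adj = {v: [w for w in range(4 * (v // 4), 4 * (v // 4) + 4) if w != v]
       for v in range(8)}
adj[3].append(4)
adj[4].append(3)

sample = [0, 1, 5, 6]
counts = {length: count_components(adj, 4, sample, length, threshold=1 / 32)
          for length in (0, 5, 2000)}
print(counts)
```

At length 0 every sampled vertex is its own component, and for very long walks all distributions approach the uniform one, so a single component remains. The lengths in between are where the cluster structure would show up.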

Slide 24: Preliminary Experiments – Stochastic Block Model

Slide 25: Preliminary Experiments – Data Sets

Stanford Network Analysis Project [Leskovec, Krevl, 2014]
- Road networks (California, Pennsylvania, Texas)
- Networks with ground-truth communities
  - LiveJournal (blogging community with friendship links)
  - Orkut (social network)
  - DBLP
  - YouTube social network
  - Amazon "co-buying" network

Network sizes
- Between 300,000 and 4,000,000 nodes
- Between 900,000 and 117,000,000 edges

Slide 26: Preliminary Experiments – Road Networks

Slide 27: Preliminary Experiments – Networks with ground-truth communities

Slide 28: Preliminary Experiments

Some conclusions
- We can use our algorithm to distinguish between different classes of networks
- Can we also distinguish between different types of social networks?
- The curves suggest a rich nested cluster structure in social networks. Can this be verified?

Slide 29: Thank you!