Complexity and Efficient Algorithms Group / Department of Computer Science Testing the Cluster Structure of Graphs Christian Sohler joint work with Artur.

Slides:



Advertisements
Similar presentations
Finding Cycles and Trees in Sublinear Time Oded Goldreich Weizmann Institute of Science Joint work with Artur Czumaj, Dana Ron, C. Seshadhri, Asaf Shapira,
Advertisements

Deterministic vs. Non-Deterministic Graph Property Testing Asaf Shapira Tel-Aviv University Joint work with Lior Gishboliner.
Online Social Networks and Media. Graph partitioning The general problem – Input: a graph G=(V,E) edge (u,v) denotes similarity between u and v weighted.
 Graph Graph  Types of Graphs Types of Graphs  Data Structures to Store Graphs Data Structures to Store Graphs  Graph Definitions Graph Definitions.
1 The Monte Carlo method. 2 (0,0) (1,1) (-1,-1) (-1,1) (1,-1) 1 Z= 1 If  X 2 +Y 2  1 0 o/w (X,Y) is a point chosen uniformly at random in a 2  2 square.
Approximation Algorithms for Unique Games Luca Trevisan Slides by Avi Eyal.
Approximating Average Parameters of Graphs Oded Goldreich, Weizmann Institute Dana Ron, Tel Aviv University.
Christian Sohler | Every Property of Hyperfinite Graphs is Testable Ilan Newman and Christian Sohler.
Artur Czumaj Dept of Computer Science & DIMAP University of Warwick Testing Expansion in Bounded Degree Graphs Joint work with Christian Sohler.
On the Spread of Viruses on the Internet Noam Berger Joint work with C. Borgs, J.T. Chayes and A. Saberi.
Random Walks Ben Hescott CS591a1 November 18, 2002.
Mining and Searching Massive Graphs (Networks)
Testing the Diameter of Graphs Michal Parnas Dana Ron.
1 Mazes In The Theory of Computer Science Dana Moshkovitz.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 8 May 4, 2005
CPSC 689: Discrete Algorithms for Mobile and Wireless Systems Spring 2009 Prof. Jennifer Welch.
1 University of Freiburg Computer Networks and Telematics Prof. Christian Schindelhauer Distributed Coloring in Õ(  log n) Bit Rounds COST 293 GRAAL and.
Undirected ST-Connectivity 2 DL Omer Reingold, STOC 2005: Presented by: Fenghui Zhang CPSC 637 – paper presentation.
Testing of Clustering Noga Alon, Seannie Dar Michal Parnas, Dana Ron.
Sublinear Algorithms for Approximating Graph Parameters Dana Ron Tel-Aviv University.
Michael Bender - SUNY Stony Brook Dana Ron - Tel Aviv University Testing Acyclicity of Directed Graphs in Sublinear Time.
Testing Metric Properties Michal Parnas and Dana Ron.
On Proximity Oblivious Testing Oded Goldreich - Weizmann Institute of Science Dana Ron – Tel Aviv University.
EXPANDER GRAPHS Properties & Applications. Things to cover ! Definitions Properties Combinatorial, Spectral properties Constructions “Explicit” constructions.
CSE 421 Algorithms Richard Anderson Lecture 4. What does it mean for an algorithm to be efficient?
Advanced Topics in Data Mining Special focus: Social Networks.
1 On the Benefits of Adaptivity in Property Testing of Dense Graphs Joint work with Mira Gonen Dana Ron Tel-Aviv University.
1 Algorithmic Aspects in Property Testing of Dense Graphs Oded Goldreich – Weizmann Institute Dana Ron - Tel-Aviv University.
1 On the Benefits of Adaptivity in Property Testing of Dense Graphs Joint works with Mira Gonen and Oded Goldreich Dana Ron Tel-Aviv University.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 6 May 7, 2006
Expanders Eliyahu Kiperwasser. What is it? Expanders are graphs with no small cuts. The later gives several unique traits to such graph, such as: – High.
Christian Sohler 1 University of Dortmund Testing Expansion in Bounded Degree Graphs Christian Sohler University of Dortmund (joint work with Artur Czumaj,
Complexity 1 Mazes And Random Walks. Complexity 2 Can You Solve This Maze?
Finding Cycles and Trees in Sublinear Time Oded Goldreich Weizmann Institute of Science Joint work with Artur Czumaj, Dana Ron, C. Seshadhri, Asaf Shapira,
Approximating the MST Weight in Sublinear Time Bernard Chazelle (Princeton) Ronitt Rubinfeld (NEC) Luca Trevisan (U.C. Berkeley)
Minimal Spanning Trees What is a minimal spanning tree (MST) and how to find one.
Graph Sparsifiers Nick Harvey University of British Columbia Based on joint work with Isaac Fung, and independent work of Ramesh Hariharan & Debmalya Panigrahi.
Complexity and Efficient Algorithms Group / Department of Computer Science Approximating Structural Properties of Graphs by Random Walks Christian Sohler.
Edge-disjoint induced subgraphs with given minimum degree Raphael Yuster 2012.
Expanders via Random Spanning Trees R 許榮財 R 黃佳婷 R 黃怡嘉.
Distance sensitivity oracles in weighted directed graphs Raphael Yuster University of Haifa Joint work with Oren Weimann Weizmann inst.
Batch Scheduling of Conflicting Jobs Hadas Shachnai The Technion Based on joint papers with L. Epstein, M. M. Halldórsson and A. Levin.
Graph Sparsifiers Nick Harvey Joint work with Isaac Fung TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A A A.
DATA MINING LECTURE 13 Pagerank, Absorbing Random Walks Coverage Problems.
Lower Bounds for Property Testing Luca Trevisan U.C. Berkeley.
Testing the independence number of hypergraphs
Graph Colouring L09: Oct 10. This Lecture Graph coloring is another important problem in graph theory. It also has many applications, including the famous.
Artur Czumaj DIMAP DIMAP (Centre for Discrete Maths and it Applications) Computer Science & Department of Computer Science University of Warwick Testing.
Miniconference on the Mathematics of Computation
Presented by Alon Levin
NOTE: To change the image on this slide, select the picture and delete it. Then click the Pictures icon in the placeholder to insert your own image. Fast.
Theory of Computational Complexity Probability and Computing Ryosuke Sasanuma Iwama and Ito lab M1.
Complexity and Efficient Algorithms Group / Department of Computer Science Testing the Cluster Structure of Graphs Christian Sohler joint work with Artur.
Theory of Computational Complexity Probability and Computing Chapter Hikaru Inada Iwama and Ito lab M1.
The NP class. NP-completeness
Random Walk for Similarity Testing in Complex Networks
On Sample Based Testers
Stochastic Streams: Sample Complexity vs. Space Complexity
Finding Cycles and Trees in Sublinear Time
Minimum Spanning Tree 8/7/2018 4:26 AM
From dense to sparse and back again: On testing graph properties (and some properties of Oded)
CIS 700: “algorithms for Big Data”
CSCI B609: “Foundations of Data Science”
On the effect of randomness on planted 3-coloring models
Introduction Wireless Ad-Hoc Network
Pan Peng (University of Vienna, Austria)
The Subgraph Testing Model
Locality In Distributed Graph Algorithms
Presentation transcript:

Complexity and Efficient Algorithms Group / Department of Computer Science Testing the Cluster Structure of Graphs Christian Sohler joint work with Artur Czumaj and Pan Peng

Complexity and Efficient Algorithms Group / Department of Computer Science 2 Very Large Networks Examples  Social networks  The World Wide Web  Cocitation graphs  Coauthorship graphs Data size  GigaByte upto TeraByte (only the graph)  Additional data can be in the Peta-Byte range Source: TonZ; Image under Creative Commons License

Complexity and Efficient Algorithms Group / Department of Computer Science 3 Information in the Network Structure Social network  Edge: Two persons are „friends“  Well-connected subgraph: A social group Cocitation graphs  Edge: Two papers deal with a similar subject  Well-connected subgraph: Papers in a scientific area Coauthor graphs  Edge: Two persons have worked together  Well-connected subgraph: Scientific community

Complexity and Efficient Algorithms Group / Department of Computer Science 4 How can we extract this information? Objective  Identify the well-connected subgraphs (clusters) of a huge graph Problem  Classical algorithms require at least linear time  Might be too large for huge networks Our approach  Decide, if the graph has a cluster structure or is far away from it  If yes, get a representative vertex from each (sufficiently big) cluster  Running time sublinear in the input size

Complexity and Efficient Algorithms Group / Department of Computer Science 5 Formalizing the Problem – The Input Input Model  Undirected graph G=(V,E) with vertex set {1,…,n}  Max. degree bounded by constant D  Graph is stored in adjacency lists  We can query for the i-th edge incident to vertex j in O(1) time Property Testing [Rubinfeld, Sudan, 1996, Goldreich, Goldwasser, Ron, 1998]  Formal framework to study sampling algorithms for very large networks  Bounded degree graph model [Goldreich, Ron, 2002]

Complexity and Efficient Algorithms Group / Department of Computer Science 6 Formalizing the Problem – Cluster Structure Definition  The conductance  (C,V-C) is defined as  The conductance  G (G) of G is min C:|C|≤|V|/2  (C,V-C) Definition  A subset C  V is called (  in,  out )-cluster, if   G (G[C]) ≥  in   (C, V-C) ≤  out Definition  A partition of V into at most k (  in,  out )-clusters is called (k,  in,  out )-clustering

Complexity and Efficient Algorithms Group / Department of Computer Science 7 Formalizing the Problem Our Objective  Develop a sampling algorithm that (a) accepts with probability at least 2/3, when the input graph is a (k,  in,  out )-clustering (b) rejects with probability at least 2/3, if the input graph differs from every (k,  in *,  out *)-clustering in more than  Dn edges  The number of samples taken (and running time) of the algorithm should be as small as possible

Complexity and Efficient Algorithms Group / Department of Computer Science 8 Random Walks, Stationary Distributions & Convergence Random Walk  In each step: move from current vertex v to a neighbor chosen uniformly at random Convergence  If G is connected and not bipartite, a random walk converges to a unique stationary distribution  Pr[Random Walk is at vertex v]  deg(v)

Complexity and Efficient Algorithms Group / Department of Computer Science 9 Random Walks, Stationary Distributions & Convergence Lazy Random Walk  In each step: - Probability to move from current vertex v to neighbor u is 1/(2D) - stays at v with remaining probability  Stationary distribution is uniform Rate of Convergence  Can be expressed in terms of the conductance of G or the second largest eigenvalue of the transition matrix  O(log n) steps, if G is a (1,  in,  out )-clustering for constant  in

Complexity and Efficient Algorithms Group / Department of Computer Science 10 Previous Work k=1: Testing Expansion ((1,  in,  out )-clustering)  [Goldreich, Ron, 2000] introduced an algorithm based on collision-statistics of random walks  They conjectured the algorithm to accept in O*(  n) running time every  -expander and reject every expander, which differs in more than  Dn edges from a  *-expander  First proof with a polylogarithmic gap (in n) between  and  * [Czumaj, Sohler, 2010]  Improvement of parameters to constant gap (with running time O*(n 1/2+  )) [Nachmias, Shapira, 2010; Kale, Seshadri 2011]  O* assumes all input parameters except n to be constant and supresses logarithmic factors

Complexity and Efficient Algorithms Group / Department of Computer Science 11 Previous Work TestingExpansion(G,  )  Sample  (1/  ) vertices uniformly at random  For each sample vertex do - Perform O*(  n) lazy random walks of length  *(log n) from each vertex - if the number of collisions among end points is too high then reject  accept Analysis  If G is a (1,  in,  out )-clustering, then a lazy random walk converges quickly to the uniform distribution  Let p(v) be the distribution of the end points of a random walk starting at v  ||p(v)||² is the expected number of collisions  The uniform distribution minimizes ||p(v)||²

Complexity and Efficient Algorithms Group / Department of Computer Science 12 Previous Work TestingExpansion(G,  )  Sample  (1/  ) vertices uniformly at random  For each sample vertex do - Perform O*(  n) lazy random walks of length  *(log n) from each vertex - if the number of collisions among end points is too high then reject  accept Analysis  If G is far away from a (1,  in,  out )-clustering, then the ||p(v)||² is large

Complexity and Efficient Algorithms Group / Department of Computer Science 13 Testing k-Clusterings Main Idea  When increasing the length of the random walks, two random walks starting from the same cluster should eventually have almost the same distribution (and this is almost uniform on the cluster)  Two random walks starting in different cluster should have different distributions Obstacles  We cannot test closeness to the uniform distribution since we don‘t know the clusters  We do not compare stationary distributions

Complexity and Efficient Algorithms Group / Department of Computer Science 14 The Algorithm ClusteringTest  Sample set S of s vertices uniformly at random  For any v  S let p(v) be the distribution of end points of a random walk of length  *(log n) starting at v  for each pair u,v  S do  if p(u) and p(v) are close then add an edge (u,v) to the „cluster graph“ on vertex set S  accept, if and only if the cluster graph is a collection of at most k cliques

Complexity and Efficient Algorithms Group / Department of Computer Science 15 Completeness Lemma (informal)  Let p(v) denote the distribution of the end point of a random walk of given length. For our choice of parameters, if G is a (k,  in,  out )-clustering then (a) for most pairs u,v are from the same cluster C, ||p(v)-p(u)||²≤1/(4n), (b) for most pairs u,v are from different clusters, ||p(v)-p(u)||² > 1/n.

Complexity and Efficient Algorithms Group / Department of Computer Science 16 Completeness Lemma (informal)  Let p(v) denote the distribution of the end point of a random walk of given length. For our choice of parameters, if G is a (k,  in,  out )-clustering then (a) for most pairs u,v are from the same cluster C, ||p(v)-p(u)||²≤1/(4n), (b) for most pairs u,v are from different clusters, ||p(v)-p(u)||² > 1/n. Consequence  If we can estimate the distance of two distribution in sublinear time up to an l 2 -error of 1/(4n), then ClusteringTest accepts any (k,  in,  out )-clustering. 2

Complexity and Efficient Algorithms Group / Department of Computer Science 17 Completeness Lemma (informal)  Let p(v) denote the distribution of the end point of a random walk of given length. For our choice of parameters, if G is a (k,  in,  out )-clustering then (a) for most pairs u,v are from the same cluster C, ||p(v)-p(u)||²≤1/(4n), (b) for most pairs u,v are from different clusters, ||p(v)-p(u)||² > 1/n. Consequence  If we can estimate the distance of two distribution in sublinear time up to an l 2 -error of 1/(4n), then ClusteringTest accepts any (k,  in,  out )-clustering.  Can be done using previous work of [Batu et al.,2013] or [Chan, Diakonikolas, Valiant, Valiant, 2014] 2

Complexity and Efficient Algorithms Group / Department of Computer Science 18 Soundness Lemma (informal)  If G differs in more than  dn edges from a (k,  in,*,  out *)-clustering then one can partition V into k+1 subsets C 1,…,C k+1 of size   (n) such that  (C i, V-C i ) is small for all i. Example:  -far from (2,  in,*,  out *)-clustering

Complexity and Efficient Algorithms Group / Department of Computer Science 19 Soundness Lemma (informal)  If G differs in more than  dn edges from a (k,  in,*,  out *)-clustering then one can partition V into k+1 subsets C 1,…,C k+1 of size   (n) such that  (C i, V-C i ) is small for all i. Example:  -far from (2,  in,*,  out *)-clustering Sample will hit all k+1 subsets

Complexity and Efficient Algorithms Group / Department of Computer Science 20 Soundness Lemma (informal)  If G differs in more than  dn edges from a (k,  in,*,  out *)-clustering then one can partition V into k+1 subsets C 1,…,C k+1 of size   (n) such that  (C i, V-C i ) is small for all i. Example:  -far from (2,  in,*,  out *)-clustering Distance between vertices from different clusters is big

Complexity and Efficient Algorithms Group / Department of Computer Science 21 Summary Theorem  Algorithm ClusteringTester accepts every (k,  in,  out )-clustering with probability at least 2/3 and rejects every graph that differs in more than  dn edges from every (k,  in *,  out *)-clustering with probability at least 2/3, where  out =O d,k (  4  in ²/log n) and  in * =  d,k  (  4  in ²/log n).  The running time of the algorithm is O*(  n).

Complexity and Efficient Algorithms Group / Department of Computer Science 22 Thank you!