Complexity and Efficient Algorithms Group / Department of Computer Science Approximating Structural Properties of Graphs by Random Walks Christian Sohler.

Presentation transcript:

Complexity and Efficient Algorithms Group / Department of Computer Science Approximating Structural Properties of Graphs by Random Walks Christian Sohler

Slide 2: Very Large Networks
Examples:
- Social networks
- The human brain
- Crystals
- Chip design
Size:
- 10^9 – … vertices
- Petabytes of additional information possible

Slide 3: Very Large Networks
Classical graph problems:
- Connectivity
- MinCut, MaxCut
- Graph clustering
- Graph isomorphism
Difficulties:
- Graph does not fit into main memory

Slide 4: Classification of Very Large Networks – A Vision
Example questions:
- Is a country a democracy or a totalitarian country?
- Is a patient schizophrenic?
- Is software malicious?
Formalization:
- Given a set of graphs with class labels (training set)
- Find a classifier for new graphs

Slide 5: Classification of Very Large Networks – A Vision
A typical scenario:
- Hundreds or thousands of graphs
- Each graph is extremely large
- Graphs are sparse
A possible approach:
- Describe graphs by features (graph properties)
- Apply classical learning algorithms
The challenge:
- Computation of tens of thousands of features for graphs with billions of vertices
- Example feature vector: (12, 3, -5, 10, 0, 0, …, 20, 3)

Slide 6: Classification of Very Large Networks – A Sampling Approach
Random Sampling:
- Compute a graph property approximately by random sampling
Informal Question:
- What can we learn from the local structure of a sparse graph about its global properties?
Sampling from Graphs:
- How can we sample a graph?

Slide 8: Classification of Very Large Networks – A Sampling Approach
Examples of different sampling strategies:
1. Sample a set S of s vertices and look at all edges within S (the subgraph G[S] induced by S)
2. Sample a set S of s edges and look at the graph they form
3. Sample a set S of s vertices and perform a BFS from each of them
4. Sample a set S of s vertices and perform a random walk from each of them
- Many more possibilities…
Question:
- Which is the right sampling strategy for my learning problem?
- Depends on the problem…

Slide 9: Classification of Very Large Networks – A Sampling Approach
Question 1:
- Assume you have some classification task that involves city maps. Which of our four sampling methods is your method of choice?
Possible Answers:
1. Sample a set S of s vertices and look at all edges within S
2. Sample a set S of s edges and look at the graph they form
3. Sample a set S of s vertices and perform a BFS from each of them
4. Sample a set S of s vertices and perform a random walk from each of them

Slide 10: Classification of Very Large Networks – A Sampling Approach
Question 2:
- Assume you have some classification task that involves social networks. Which of our four sampling methods is your method of choice?
Possible Answers:
1. Sample a set S of s vertices and look at all edges within S
2. Sample a set S of s edges and look at the graph they form
3. Sample a set S of s vertices and perform a BFS from each of them
4. Sample a set S of s vertices and perform a random walk from each of them

Slide 11: First Wrap-Up
Motivation:
- Some classification problems involve sets of huge graphs
- For some fundamental graph problems no efficient algorithm is known
Sampling approach:
- We would like to pick small samples from the graph(s) and use them for graph classification
Challenge:
- There are many different sampling procedures
- We need to understand which is the right one for which problem

Slide 12: Sampling from Very Large Networks
Property Testing [Rubinfeld, Sudan, 1996; Goldreich, Goldwasser, Ron, 1998]:
- Formal framework to study sampling algorithms for very large networks
Relaxation of "Standard Decision Problems":
- Want to distinguish whether the input graph G has a property or is far away from it
- If G neither has the property nor is far away from it, the algorithm may give an arbitrary answer
- Randomized algorithms with bounded (worst case) error probability
- The algorithm only looks at a small part of the graph
Different graph models:
- Dense graphs, bounded degree graphs, directed graphs

Slide 13: Property Testing in Bounded Degree Graphs
Bounded degree graphs [Goldreich, Ron, 2002]:
- Undirected graph G=(V,E)
- Maximum degree bounded by D
- D constant
Oracle access:
- V={1,…,n}
- n is known to the algorithm
- Query(i,j) returns the j-th neighbor of vertex i or a symbol that indicates that this neighbor does not exist
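The oracle model above can be sketched in a few lines of Python. This is only an illustration under assumptions: the class name `BoundedDegreeOracle`, the adjacency-list storage, and the query counter are hypothetical choices, not part of the slides. `None` plays the role of the "neighbor does not exist" symbol.

```python
from typing import Dict, List, Optional


class BoundedDegreeOracle:
    """Hypothetical oracle for a graph with maximum degree D (illustration only)."""

    def __init__(self, adj: Dict[int, List[int]], D: int) -> None:
        for v, nbrs in adj.items():
            if len(nbrs) > D:
                raise ValueError(f"vertex {v} exceeds the degree bound {D}")
        self.adj = adj
        self.D = D
        self.n = len(adj)   # n is known to the algorithm
        self.queries = 0    # query complexity is measured by this counter

    def query(self, i: int, j: int) -> Optional[int]:
        """Return the j-th neighbor (1-indexed) of vertex i, or None if it does not exist."""
        self.queries += 1
        nbrs = self.adj[i]
        if 1 <= j <= len(nbrs):
            return nbrs[j - 1]
        return None
```

Counting calls to `query` is what makes the later query-complexity bounds measurable in an experiment.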

Slide 14: Property Testing in Bounded Degree Graphs
Graph properties:
- A graph property is a set of graphs that is closed under isomorphism
Definition [Goldreich, Ron, 2002]:
- G=(V,E) is ε-far from P if one has to modify more than εDn edges to obtain a bounded degree graph with property P
[Figure: a connected graph and an ε-far graph]

Slide 15: Property Testing in Bounded Degree Graphs
Property Tester for property P [Goldreich, Ron, 2002]:
- Oracle access to input graph G
- Accepts with probability at least 2/3 if G has property P
- Rejects with probability at least 2/3 if G is ε-far from P
Quality measures:
- Query complexity: maximum number of oracle queries
- Running time

Slide 16: A First Example: Connectivity
ConnectivityTester(G, ε, D) [Goldreich, Ron, 2002]
(1) Sample a set S of s = 8/(εD) vertices uniformly at random from V
(2) For every vertex from S:
(3)   Perform a BFS until (a) 4/(εD) vertices have been discovered or (b) all vertices of a small connected component have been discovered
(4)   if (b) then reject
(5) accept
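A minimal Python sketch of the tester above, under stated assumptions: the graph is given as a hypothetical adjacency dictionary `adj` instead of an oracle, sample sizes are rounded up, and a component counts as "small" only if it is smaller than the whole graph, so that tiny connected inputs are not rejected.

```python
import math
import random
from collections import deque


def connectivity_tester(adj, eps, D, rng=None):
    """Return True (accept) or False (reject). adj maps vertex -> list of neighbors."""
    rng = rng or random.Random()
    n = len(adj)
    vertices = list(adj)
    s = math.ceil(8 / (eps * D))       # number of sampled start vertices
    limit = math.ceil(4 / (eps * D))   # size bound for a "small" component

    for _ in range(s):
        start = rng.choice(vertices)
        seen = {start}
        queue = deque([start])
        # BFS that stops as soon as more than `limit` vertices are discovered
        while queue and len(seen) <= limit:
            v = queue.popleft()
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    queue.append(u)
        # The BFS ran dry within the limit: a whole small component was explored
        if len(seen) <= limit and len(seen) < n:
            return False
    return True
```

The tester has one-sided error in this sketch: on a connected graph the BFS can only run dry after seeing all n vertices, so a connected graph is never rejected.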

Slide 17: A First Example: Connectivity
Observation: ConnectivityTester accepts every connected graph.

Slide 18: A First Example: Connectivity
Claim: If G is ε-far from connected, then G has more than εDn/2 connected components.

Slide 19: A First Example: Connectivity
Claim: At least εDn/4 of the connected components have size at most 4/(εD).

Slide 20: A First Example: Connectivity
Theorem: ConnectivityTester is a property tester with query complexity O(1/(ε²D)).
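The bound in the theorem can be checked with a short calculation. This is a sketch that assumes each discovered vertex costs at most D neighbor queries and that every small component contains at least one vertex; the constants are the ones from the tester.

```latex
% Query complexity: s start vertices, at most 4/(\varepsilon D) explored vertices each,
% at most D oracle queries per explored vertex.
\[
  \frac{8}{\varepsilon D}\cdot\frac{4}{\varepsilon D}\cdot D
  \;=\; \frac{32}{\varepsilon^{2} D}
  \;=\; O\!\left(\frac{1}{\varepsilon^{2} D}\right)
\]
% Rejection probability: at least \varepsilon D n/4 small components, so a uniformly
% random vertex lies in one of them with probability at least \varepsilon D/4.
\[
  \Pr[\text{no sampled vertex hits a small component}]
  \;\le\; \left(1-\frac{\varepsilon D}{4}\right)^{8/(\varepsilon D)}
  \;\le\; e^{-2} \;<\; \frac{1}{3}
\]
```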

Slide 21: Second Wrap-Up – Introduction to Property Testing
Property Testing:
- Approximately decide based on random sampling whether a graph has a property or is far away from it
- Quality measure: query complexity
Connectivity:
- Sampling + BFS
- Check whether the sample violates the property

Slide 22: Second Wrap-Up – Introduction to Property Testing
Question 3:
- Is the following algorithm a property tester for planarity (for the right choice of f)?
PlanarityTester(G, ε, D)
(1) Sample a set S of s = f(ε,D) vertices uniformly at random from V
(2) For every vertex from S:
(3)   Perform a BFS until (a) f(ε,D) vertices have been discovered or (b) the discovered graph is not planar
(4)   if (b) then reject
(5) accept

Slide 23: Second Wrap-Up – Introduction to Property Testing
Bad news:
- There is a class of graphs in which every cycle has length Ω(log n) and that are ε-far from planar
Good news:
- The sampling is fine, we just need to modify our acceptance condition

Slide 24: Random Walks, Stationary Distributions & Convergence
Random Walk:
- In each step: move from the current vertex v to a neighbor chosen uniformly at random
Convergence:
- If G is connected and not bipartite, a random walk converges to a unique stationary distribution
- Pr[Random Walk is at vertex v] is proportional to deg(v)

Slide 25: Random Walks, Stationary Distributions & Convergence
Random Walks on Maps:
- A random walk on a planar graph has the tendency to stay local
- It takes a long time to reach the stationary distribution
- Reason: the network has sparse cuts
Random Walks on Social Networks:
- A random walk will quickly move to a "random place"
- Fast convergence
- The network does not have sparse cuts

Slide 26: Random Walks, Stationary Distributions & Convergence
Lazy Random Walk:
- In each step, the probability to move from the current vertex v to a given neighbor u is 1/(2D)
- The walk stays at v with the remaining probability
Convergence of Lazy Random Walks:
- Stationary distribution is uniform
Rate of Convergence:
- Can be expressed in terms of the conductance of G or the second largest eigenvalue of the transition matrix
- O(log n) steps, if G is an expander graph
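The lazy walk is easy to simulate. The sketch below assumes a hypothetical adjacency dictionary `adj` and draws one of 2D equally likely "slots" per step; the first deg(v) slots correspond to the neighbors of v, so each neighbor is chosen with probability 1/(2D) and all other slots mean "stay".

```python
import random


def lazy_random_walk(adj, start, steps, D, rng):
    """Return the endpoint of a lazy random walk of the given length."""
    v = start
    for _ in range(steps):
        slot = rng.randrange(2 * D)   # 2D equally likely slots
        if slot < len(adj[v]):        # slots 0..deg(v)-1 are the neighbors of v
            v = adj[v][slot]
        # every other slot: stay at v
    return v
```

Because the move probability is 1/(2D) for every edge regardless of the degree of v, the transition matrix is symmetric, which is why the stationary distribution is uniform.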

Slide 27: Conductance, Expanders & Small Worlds
Definition:
- The expansion Φ(U) of a set U is defined as [formula missing]
- The conductance Φ_G of G is the minimum of Φ(U) over all sets U with 1 ≤ |U| ≤ |V|/2
Definition:
- A graph G=(V,E) is called a Φ-expander if Φ_G ≥ Φ for some constant Φ
Interpretations:
- Expander graphs satisfy the "small-world phenomenon"
- Conductance can be viewed as a measure for the social connectivity of a network

Slide 28: Testing Expanders
Facts:
- A lazy random walk converges to the uniform distribution
- A lazy random walk converges quickly in expander graphs
Hope:
- A lazy random walk converges much more slowly if the graph is ε-far from an expander graph
- In particular, we hope that the distribution of the endpoints of a Θ(log n)-step lazy random walk differs significantly from the uniform distribution
Question:
- If so, how could we exploit this to design a property testing algorithm?

Slide 29: The Birthday Problem & Testing Uniform Distributions
Birthday Problem:
- n possible birthdays
- k persons with birthdays chosen uniformly at random
- How large must k be so that with constant probability two persons have the same birthday?
Analysis:
- p = (1/n, …, 1/n)^T
- ||p||² = 1/n is the collision probability of two birthdays
- If we have k persons then the expected number of collisions is (k choose 2)·||p||² = k(k−1)/(2n)
- So, for k = Θ(√n) we expect to see a collision
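A small numeric check of this analysis, as a sketch: the helper names are hypothetical, and the expected number of colliding pairs among k uniform birthdays is computed as (k choose 2)/n.

```python
from math import comb


def expected_collisions(k, n):
    """Expected number of colliding pairs among k uniform draws from n values."""
    return comb(k, 2) / n


def smallest_k_with_expected_collision(n):
    """Smallest k for which at least one collision is expected."""
    k = 2
    while expected_collisions(k, n) < 1:
        k += 1
    return k
```

For n = 365 the second function returns 28, which is on the order of √365 ≈ 19 up to a constant factor, in line with the Θ(√n) bound.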

Slide 30: Testing Uniform Distributions
Observation:
- The uniform distribution minimizes the expected number of pairwise collisions
- If a distribution q differs significantly from the uniform distribution p then ||q||² >> ||p||²
TestUniformDistribution(distribution q)
1. Sample Θ(√n) elements according to q
2. if the number of pairwise collisions is too large then reject
3. else accept
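A minimal sketch of the collision-based uniformity test. The slide leaves "too large" unspecified, so the threshold used here (a factor `slack` times the expectation under the uniform distribution) and the sample-size constant `c` are assumptions chosen for illustration, as is the `draw` callback that returns one sample from q.

```python
import math
import random
from collections import Counter


def count_pairwise_collisions(samples):
    """Number of unordered pairs of equal samples."""
    return sum(c * (c - 1) // 2 for c in Counter(samples).values())


def uniformity_tester(draw, n, c=16.0, slack=1.5):
    """Accept (True) if the collision count looks uniform over n values.

    draw  -- hypothetical callback returning one sample from q
    c     -- assumed constant in the Theta(sqrt(n)) sample size
    slack -- assumed tolerance factor over the uniform expectation
    """
    m = math.ceil(c * math.sqrt(n))
    samples = [draw() for _ in range(m)]
    threshold = slack * math.comb(m, 2) / n   # slack * E[collisions] under uniform
    return count_pairwise_collisions(samples) <= threshold
```

For example, with `rng = random.Random(0)`, calling `uniformity_tester(lambda: rng.randrange(10000), 10000)` tests samples that really are uniform, while drawing from `rng.randrange(2500)` roughly quadruples the collision rate.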

Slide 31: Testing Expanders
TestingExpanders(G)
1. Sample a set S of s vertices uniformly at random
2. for each v ∈ S do
3.   Let q be the distribution of endpoints of a Θ(log n)-step lazy random walk starting at v
4.   if TestUniformDistribution(q) rejects then reject
5. accept
History:
- The algorithm was invented by [Goldreich and Ron, 2000], who conjectured it to be a property tester
- First complete analysis by [Czumaj and Sohler, 2010] (but weaker than conjectured)
- Later improved by [Nachmias and Shapira, 2010] and [Kale and Seshadhri, 2011]
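The steps above can be sketched as follows. All constants (`s`, `c`, `walk_factor`, `slack`) and the adjacency-dictionary input are assumptions made for this illustration; the analyses cited on the slide use carefully chosen parameters that are not reproduced here.

```python
import math
import random
from collections import Counter


def lazy_walk_endpoint(adj, start, steps, D, rng):
    """Endpoint of a lazy random walk: each neighbor w.p. 1/(2D), else stay."""
    v = start
    for _ in range(steps):
        slot = rng.randrange(2 * D)
        if slot < len(adj[v]):
            v = adj[v][slot]
    return v


def expander_tester(adj, D, s=5, c=16.0, walk_factor=4, slack=1.5, rng=None):
    """Accept (True) unless some start vertex yields too many endpoint collisions."""
    rng = rng or random.Random()
    n = len(adj)
    vertices = list(adj)
    steps = math.ceil(walk_factor * math.log2(n))   # Theta(log n) walk length
    m = math.ceil(c * math.sqrt(n))                 # Theta(sqrt(n)) walks per vertex
    threshold = slack * math.comb(m, 2) / n         # assumed rejection threshold

    for _ in range(s):
        v = rng.choice(vertices)
        ends = Counter(lazy_walk_endpoint(adj, v, steps, D, rng) for _ in range(m))
        collisions = sum(k * (k - 1) // 2 for k in ends.values())
        if collisions > threshold:
            return False
    return True
```

On a graph made of two disconnected halves, every walk stays in its half, so the endpoint distribution has about twice the collision probability of the uniform distribution and the sketch rejects.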

Slide 32: Final Result
Theorem [Nachmias and Shapira, 2010; Kale and Seshadhri, 2011]:
- Algorithm TestingExpanders accepts every Φ-expander and rejects every graph that is ε-far from an Ω(Φ²)-expander. The algorithm has a running time of O(n^(1/2+δ)).
Key structural property of "ε-far" graphs:
- If G is ε-far from an Ω(Φ²)-expander then there exists a set U of Ω(εn) vertices with Φ(U) = O(Φ²)
- This implies that for many vertices, the distribution of endpoints of a random walk of length O(log n) is significantly different from the uniform distribution

Slide 33: Third Wrap-Up – Testing Expansion
(Lazy) Random Walks:
- Move from a vertex to a random neighbor
- Converge to the uniform distribution
- Speed of convergence depends on the graph structure
Testing Expansion:
- A random walk converges quickly in expander graphs
- A random walk converges more slowly if we are far from expander graphs
- The number of collisions among end points of random walks is minimized in expander graphs
- We can test expansion by counting collisions

Slide 34: Graph Clustering & Web Communities
Web Graph Communities:
- Set of vertices that induces an expander graph and has a sparse cut to the rest of the graph
- Question: Is the web graph composed of a set of at most k communities?
Definition:
- A subset C ⊆ V is called a (Φ_in, Φ_out)-cluster if the conductance of G[C] is at least Φ_in and Φ(C) ≤ Φ_out
Definition:
- A partition of V into at most k (Φ_in, Φ_out)-clusters is called a (k, Φ_in, Φ_out)-clustering

Slide 35: Testing k-Clusterings
A Simple Case?
- Distinguish between a union of at most k expander graphs with no edges in between and a set of more than k (large) expander graphs with no edges in between
- Can we use our previous algorithm to test for a k-clustering?

Slide 36: Testing k-Clusterings
A Simple Case?
- No! We do not know the size of the clusters (expander graphs), and estimating the support size of a distribution is hard [Raskhodnikova et al., 2009]

Slide 37: Testing k-Clusterings
New idea:
- If two vertices come from the same cluster, the random walks quickly converge to the same distribution
- So, we could try to sample a set of vertices and check for sets of vertices whose random walks induce the same distributions

Slide 38: Testing Closeness of Distributions
Main Idea [Batu et al. 2013; Chan et al. 2014]:
- If p ≈ q then the following experiments should give roughly the same number of collisions between elements from S and T:
  - Draw two sets S and T of m elements from p
  - Draw two sets S and T of m elements from q
  - Draw a set S of m elements from p and a set T of m elements from q
- If p and q differ significantly, at least one of the three values is different
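The three experiments can be combined into a simple estimator. The sketch below is not the exact tester from the cited papers; it only uses the identity ||p−q||² = ||p||² + ||q||² − 2⟨p,q⟩ and estimates each term by a collision rate between two independent sample sets. The callbacks `draw_p` and `draw_q` are hypothetical samplers.

```python
import random
from collections import Counter


def cross_collisions(S, T):
    """Number of pairs (x in S, y in T) with x == y."""
    counts_T = Counter(T)
    return sum(counts_T[x] for x in S)


def estimate_l2_distance_squared(draw_p, draw_q, m):
    """Estimate ||p - q||^2 from the three collision experiments."""
    def sample(draw):
        return [draw() for _ in range(m)]

    pp = cross_collisions(sample(draw_p), sample(draw_p)) / m**2  # estimates ||p||^2
    qq = cross_collisions(sample(draw_q), sample(draw_q)) / m**2  # estimates ||q||^2
    pq = cross_collisions(sample(draw_p), sample(draw_q)) / m**2  # estimates <p, q>
    return pp + qq - 2 * pq
```

If p and q are equal, the three collision rates agree in expectation and the estimate is close to zero; if the supports are disjoint, the third experiment produces no collisions at all.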

Slide 40: Testing Closeness of Distributions
Theorem [Batu et al. 2013; Chan et al. 2014]:
- There is a tester that with probability 2/3 accepts if ||p−q|| ≤ ε/2 and rejects if ||p−q|| ≥ ε. The query complexity of the algorithm is O(√b/ε²), where b is an upper bound on ||p||² and ||q||².
- We will need b to be O(1/n)

Slide 41: The Algorithm
ClusteringTest
1. Sample a set S of s vertices uniformly at random
2. For any v ∈ S let D(v) be the distribution of end points of a random walk of length Θ(log n) starting at v
3. for each pair u,v ∈ S do
4.   if D(u) and D(v) are close then add an edge (u,v) to the "cluster graph" on vertex set S
5. accept if and only if the cluster graph is a collection of at most k cliques

Slide 43: Testing k-Clusterings
Observation:
- Algorithm ClusteringTest distinguishes between at most k expanders and more than k (large) expanders
- Can we generalize it to testing of (k, Φ_in, Φ_out)-clusterings?

Slide 45: Testing k-Clusterings - Soundness
Challenge:
- Since the clusters may be connected in a (k, Φ_in, Φ_out)-clustering, the stationary distribution may be uniform over G (and not over the cluster)
- Need to show that for a proper length of the random walk there is an "intermediate" distribution that is "reasonably stable" w.r.t. ℓ2-error

Slide 47: The Algorithm
ClusteringTest
1. Sample a set S of s vertices uniformly at random
2. For any v ∈ S let D(v) be the distribution of end points of a random walk of length Θ(log n) starting at v
3.   if ||D(v)||² > O(1/n) then reject
4. for each pair u,v ∈ S do
5.   if D(u) and D(v) are close then add an edge (u,v) to the "cluster graph" on vertex set S
6. accept if and only if the cluster graph is a collection of at most k connected components
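A simplified sketch of ClusteringTest, under loudly stated assumptions: the graph is a hypothetical adjacency dictionary, D(v) is represented by two independent batches of m walk endpoints, "close" means an estimated squared ℓ2 distance of at most `close`/n, and the O(1/n) bound in step 3 is taken as `norm_bound`/n. None of these constants come from the slides.

```python
import math
import random
from collections import Counter
from itertools import combinations


def lazy_walk_endpoint(adj, start, steps, D, rng):
    """Endpoint of a lazy random walk: each neighbor w.p. 1/(2D), else stay."""
    v = start
    for _ in range(steps):
        slot = rng.randrange(2 * D)
        if slot < len(adj[v]):
            v = adj[v][slot]
    return v


def cross_collisions(S, counts_T):
    """Pairs (x in S, y in T) with x == y, where T is given as a Counter."""
    return sum(counts_T[x] for x in S)


def clustering_tester(adj, D, k, s=30, m=500, walk_factor=4,
                      close=0.5, norm_bound=4.0, rng=None):
    rng = rng or random.Random()
    n = len(adj)
    vertices = list(adj)
    steps = math.ceil(walk_factor * math.log2(n))
    ends = []
    for _ in range(s):
        v = rng.choice(vertices)
        S = [lazy_walk_endpoint(adj, v, steps, D, rng) for _ in range(m)]
        T = Counter(lazy_walk_endpoint(adj, v, steps, D, rng) for _ in range(m))
        # Step 3: reject if the estimated ||D(v)||^2 exceeds the assumed bound
        if cross_collisions(S, T) / m**2 > norm_bound / n:
            return False
        ends.append((S, T))

    parent = list(range(s))   # union-find over the sampled vertices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Steps 4-5: connect sampled vertices whose endpoint distributions look close
    for a, b in combinations(range(s), 2):
        (S_a, T_a), (S_b, T_b) = ends[a], ends[b]
        dist = (cross_collisions(S_a, T_a) + cross_collisions(S_b, T_b)
                - 2 * cross_collisions(S_a, T_b)) / m**2
        if dist <= close / n:
            parent[find(a)] = find(b)

    # Step 6: accept iff the cluster graph has at most k connected components
    return len({find(x) for x in range(s)}) <= k
```

Counting connected components with union-find, rather than checking for cliques, tolerates a few missed "close" pairs inside a cluster.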

Slide 48: Testing k-Clusterings - Completeness
Required Properties of a (k, Φ_in, Φ_out)-clustering:
- For most vertices v: the distribution D(v) of end points of a lazy random walk of proper length has ||D(v)||² = O(1/n)
- For most pairs u,v from the same cluster: ||D(v) − D(u)||² is very small
Useful Tool – Higher Order Cheeger's Inequality [Lee et al. 2014]:
- Relates (k, Φ_in, Φ_out)-clusterings to the k+1 largest eigenvalues

Slide 49: Testing k-Clusterings - Soundness
Structural property of "ε-far" graphs (similar to expanders):
- If G is ε-far from a (k, Φ_in*, Φ_out*)-clustering then there exists a partition into k+1 sets C_1, …, C_{k+1}, each of Ω(ε²n/k) vertices and with Φ(C_i) = O(Φ_in*/ε²)

Slide 50: Testing k-Clusterings
Theorem [Czumaj, Peng, Sohler, 2015]:
- Algorithm ClusteringTest accepts every (k, Φ_in, Φ_out)-clustering with probability at least 2/3 and rejects every graph that is ε-far from every (k, Φ_in*, Φ_out*)-clustering with probability at least 2/3, where Φ_out = O(ε⁴Φ_in²) and Φ_in* = Θ(ε⁴Φ_in²/log n) for constants k, D
- The running time of the algorithm is O*(√n)

Slide 51: Fourth Wrap-Up
Testing Clusterings:
- End points of a random walk of proper length should be uniform on its cluster, with not much probability "outside"
- If random walks start from two different points of the same cluster, their end point distributions are similar
- Collision statistics can be used to pairwise test similarity of distributions
- This can be used to approximate the cut structure
Take-away message:
- The distribution of end points of random walks (possibly comparing different starting vertices) contains a lot of information about the cut structure of a graph

Slide 52: Summary
Vision:
- Learning from very large sets of massive graphs
Approach:
- Feature computation by random sampling
- Analysis in the framework of property testing
Two Examples:
- Expanders (connectivity measure in social networks)
- Clustering (structure of social networks)

Slide 53: Thank you!
Image sources:
- Slide 2: Allan Ajifo and cobalt123; Creative Commons license
- Slide 3: GustavoG and Jasper Nance; Creative Commons license
- Slide 4: Wikipedia; Jason Brown; Creative Commons license
- Slide 5: GustavoG; Creative Commons license
- Slide 6: GoldenRibbon; Creative Commons license