Neighbourhood Sampling for Local Properties on a Graph Stream
A. Pavan, Iowa State University
Kanat Tangwongsan, IBM Research
Srikanta Tirthapura, Iowa State University
Kun-Lung Wu, IBM Research
MSR: Big Data and Analytics Workshop | Iowa State University

Graph Streams
- Example: network monitoring
  - IP addresses are vertices of a graph
  - Edges represent connections between vertices
- Edges of the graph arrive in sequence
- Continuously maintain a property of the evolving graph
- Local property: count subgraphs within the 1-neighbourhood of a vertex

Big Data, Small Machines
- Algorithm can be deployed on a single machine with reasonable resources
- Single pass through the data
  - Online arrivals
  - Also suitable for disk-resident data
- Effective use of a multicore machine
- Example: process a 167 GB graph in 1000 seconds on a 12-core machine

Problem: Triangle Counting
Count the number of triangles in a simple undirected graph.

Why Triangle Counting (1)
- Number of triangles is a basic structural property
- Social network analysis: transitivity coefficient = 3 * (# triangles) / (# connected triples)
- Related: clustering coefficient
- Measures how dense the graph is
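The transitivity coefficient can be checked by brute force on a small graph. The sketch below is only an illustration of the formula, not a streaming method; the function name `transitivity` is hypothetical, and a "connected triple" is assumed to mean a path of length two, counted once at its centre vertex.

```python
from itertools import combinations

def transitivity(edges):
    """Return (triangles, connected_triples, coefficient) for a simple undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Each triangle is found exactly once as a sorted vertex triple.
    triangles = sum(
        1
        for a, b, c in combinations(sorted(adj), 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    )
    # A connected triple is a path of length two, counted at its centre vertex.
    triples = sum(len(n) * (len(n) - 1) // 2 for n in adj.values())
    coeff = 3 * triangles / triples if triples else 0.0
    return triangles, triples, coeff
```

For a triangle with one pendant edge, `transitivity([(1, 2), (2, 3), (1, 3), (3, 4)])` finds 1 triangle and 5 connected triples, so the coefficient is 3/5.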

Why Triangle Counting (2)
- Web spam detection (Becchetti et al. 2008): a higher-than-usual number of triangles is an indicator of web spam
- Biological networks (Przulj et al. 2006, Kashtan et al. 2002): generalizations of triangle count are used in graphlets and network motifs
- "Structural summary" of a graph = vector containing the number of occurrences of various subgraphs

Contributions
- Neighbourhood sampling: a simple random sampling method for graph streams
- Applications:
  - Counting and sampling triangles in a graph
  - Counting higher-order cliques: K4, K5, etc.
  - Directed cycles in directed graphs
- Experiments showing this is a practical method

Prior Work
Streaming triangle counting:
- Bar-Yossef, Kumar, Sivakumar (2003): reductions to frequency moments of appropriately defined streams
- Jowhari and Ghodsi (2005): sampling-based and sketch-based estimators
- Buriol et al. (2006): another sampling-based estimator
- Ahn, Guha, McGregor (2012): sketch-based, insertions and deletions
- Kane et al. (2012), Manjunath et al. (2011): sketch-based, more general subgraphs
- Seshadri, Pinar, Kolda (2012)
Batch (non-streaming) triangle counting:
- Pagh and Tsourakakis (2012)
- Suri and Vassilvitskii (2011)
- …

Graph Model
- Simple undirected graph (extends to directed graphs easily)
- n vertices, m edges
- Problem: estimate τ(G) = number of triangles in G
- Adjacency stream model: edges arrive in an arbitrary order
- Incidence stream model: all edges incident to a vertex arrive together

Sampling and Counting
- Suppose a procedure A that, on graph G:
  - If it succeeds, returns a triangle from G chosen uniformly at random
  - Else, returns "failure"
- Procedure A can be used in triangle counting
  - Probability of A succeeding is proportional to # triangles
  - Repeat procedure A many times and use the fraction of successes
  - Accuracy of the estimate depends on the probability that A fails
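To make the link between sampling and counting concrete, here is a minimal sketch with a stand-in for procedure A. It is not the procedure from the talk: it assumes random access to the whole graph, picks a uniform edge and a uniform third vertex, and succeeds when they form a triangle. Every triangle arises from exactly 3 of the m(n - 2) equally likely choices, so the success probability is 3τ / (m(n - 2)) and a returned triangle is uniform. The names `procedure_a` and `estimate_triangles` are hypothetical.

```python
import random

def procedure_a(edges, vertices, edge_set, rng):
    """One trial: return a uniformly random triangle, or None on failure."""
    u, v = rng.choice(edges)
    w = rng.choice([x for x in vertices if x != u and x != v])
    if frozenset((u, w)) in edge_set and frozenset((v, w)) in edge_set:
        return frozenset((u, v, w))
    return None

def estimate_triangles(edges, trials, seed=0):
    """Estimate the triangle count from the fraction of successful trials."""
    rng = random.Random(seed)
    vertices = sorted({x for e in edges for x in e})
    edge_set = {frozenset(e) for e in edges}
    successes = sum(
        procedure_a(edges, vertices, edge_set, rng) is not None
        for _ in range(trials)
    )
    m, n = len(edges), len(vertices)
    # Success probability is 3 * tau / (m * (n - 2)), so invert that relation.
    return (successes / trials) * m * (n - 2) / 3
```

When triangles are rare, most trials fail and many repetitions are needed, which is the accuracy concern raised in the last bullet.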

Example Triangle Sampling Procedures

Neighbourhood Sampling Idea
1. Choose a random edge r1 in the graph
2. Choose a random edge r2 that appears after r1 and is adjacent to r1
3. See if the triangle defined by r1, r2 is completed by a third edge
Two edges are adjacent if they share a vertex.
The above procedure can be done in a constant number of words in a streaming manner.
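The three steps above can be sketched as a one-pass procedure using reservoir sampling at both levels. This is an illustrative reading of the slide, not the authors' code: the class name `NeighbourhoodSampler`, the helper `mean_estimate`, and the output value c * m (an assumed way of correcting for the sampling bias) are all hypothetical choices, and the input is assumed to be a simple graph with no repeated edges.

```python
import random

class NeighbourhoodSampler:
    """One estimator: keeps r1, r2, the counter c, and a closed triangle t."""

    def __init__(self, rng):
        self.rng = rng
        self.m = 0        # edges seen so far
        self.r1 = None    # first-level edge, uniform over the stream
        self.r2 = None    # second-level edge, uniform over later neighbours of r1
        self.c = 0        # number of edges adjacent to r1 that arrived after it
        self.t = None     # triangle closed by a third edge, if any

    def process(self, edge):
        u, v = edge
        self.m += 1
        if self.rng.random() < 1.0 / self.m:
            # Reservoir step: the new edge becomes r1 with probability 1/m.
            self.r1, self.r2, self.c, self.t = (u, v), None, 0, None
            return
        if u in self.r1 or v in self.r1:
            # The edge is adjacent to r1 and arrives after it.
            self.c += 1
            if self.rng.random() < 1.0 / self.c:
                self.r2, self.t = (u, v), None
                return
        if self.r2 is not None and self.t is None:
            wedge = set(self.r1) | set(self.r2)
            # r1 and r2 share one vertex; the closing edge joins the other two.
            if len(wedge) == 3 and {u, v} == set(self.r1) ^ set(self.r2):
                self.t = frozenset(wedge)

def mean_estimate(stream, num_estimators, seed=0):
    """Average c * m over estimators that hold a triangle (0 otherwise)."""
    rng = random.Random(seed)
    samplers = [NeighbourhoodSampler(rng) for _ in range(num_estimators)]
    for edge in stream:
        for s in samplers:
            s.process(edge)
    values = [s.c * s.m if s.t is not None else 0 for s in samplers]
    return sum(values) / len(values)
```

Each sampler stores only two edges, one counter, and one triangle, which matches the "constant number of words" claim.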

Sampling Bias
[Figure: example graph with edges e1 to e11]
For edge e, define c(e) = number of edges that are adjacent to e and follow e.
Example: c(e1) = 2, c(e4) = 7
Pr[Triangle T, where e is the first edge]
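The probability on this slide can be worked out from the sampling steps: r1 equals the first edge e of T with probability 1/m, and r2 equals the second edge of T with probability 1/c(e), giving 1 / (m * c(e)). The sketch below computes c(e) and this probability exactly for a small stream; the function names are hypothetical and the quadratic scan is for illustration only.

```python
from fractions import Fraction
from itertools import combinations

def follow_counts(stream):
    """c(e) for every edge: later edges in the stream that share a vertex with e."""
    return [
        sum(1 for later in stream[i + 1:] if set(e) & set(later))
        for i, e in enumerate(stream)
    ]

def triangle_probabilities(stream):
    """Map each triangle to Pr[it is sampled] = 1 / (m * c(first edge))."""
    m = len(stream)
    c = follow_counts(stream)
    position = {frozenset(e): i for i, e in enumerate(stream)}
    vertices = sorted({x for e in stream for x in e})
    probs = {}
    for tri in combinations(vertices, 3):
        sides = [frozenset(p) for p in combinations(tri, 2)]
        if all(s in position for s in sides):
            first = min(position[s] for s in sides)
            probs[tri] = Fraction(1, m * c[first])
    return probs
```

On the complete graph K4 streamed as (1,2), (1,3), (1,4), (2,3), (2,4), (3,4), the counts are 4, 3, 2, 2, 1, 0 and the four triangles are sampled with probabilities 1/24, 1/24, 1/18 and 1/12, so triangles whose first edge has many later neighbours are under-sampled.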

Handling Sampling Bias
For sampling a triangle uniformly at random:
- Use neighbourhood sampling
- Compute (online) the bias in sampling a triangle
- Reject the sample with a probability that depends on the bias
For counting triangles:
- Use neighbourhood sampling as described
- Compute (online) the bias in sampling a triangle
- Incorporate the bias directly into the estimator
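One way to realise the rejection step, offered as an assumption rather than the talk's exact rule: keep a sampled triangle with probability c / c_max, where c is the counter for its first edge and c_max is an assumed known upper bound on c (for example, twice the maximum degree). The sampling probability 1 / (m * c) times the acceptance probability is then 1 / (m * c_max) for every triangle. The function names are hypothetical.

```python
from fractions import Fraction

def acceptance_probability(c, c_max):
    """Keep a sampled triangle with probability c / c_max (c_max >= c is assumed)."""
    if not 0 < c <= c_max:
        raise ValueError("need 0 < c <= c_max")
    return Fraction(c, c_max)

def output_probability(m, c, c_max):
    """Sampling probability 1/(m*c) times the acceptance probability."""
    return Fraction(1, m * c) * acceptance_probability(c, c_max)
```

With m = 6 and c_max = 4, triangles with c = 2, 3 or 4 all end up with output probability 1/24, which is what makes the accepted sample uniform.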

Counting Triangles in a Graph

Estimator Properties
- Let X be the return value of the algorithm
- E[X] = # triangles in G
- Take the mean of O((# edges) * (max degree) / (# triangles)) estimators to get a good approximation
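A short derivation of why the estimator is unbiased, under the assumption (the algorithm itself is not preserved in this transcript) that X equals m · c(r1) when the estimator holds a triangle and 0 otherwise. Here e_T denotes the first edge of triangle T in the stream, m the number of edges, and τ(G) the number of triangles.

```latex
\begin{align*}
\Pr[T \text{ is sampled}] &= \frac{1}{m} \cdot \frac{1}{c(e_T)}, \\
\mathbb{E}[X] &= \sum_{T} \Pr[T \text{ is sampled}] \cdot m \, c(e_T)
              = \sum_{T} 1 = \tau(G).
\end{align*}
```

Since an edge has at most twice the maximum degree many neighbours, X never exceeds 2 · m · (max degree), which is consistent with the number of estimators quoted on the slide.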

Time Complexity
- Running r estimators in parallel means O(r) time per update?
- Bulk processing: process w edges at a time
  - For each estimator, the first-level random sample is updated in O(1) time
  - The second-level update is more complex: two passes through the batch
- Using a batch size w = Θ(r), the entire batch of w edges can be processed in O(w) time, yielding an amortized processing time of O(1) per edge
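The O(1) first-level update can be sketched as follows, assuming standard reservoir sampling extended to batches: a single uniform draw over all edges seen so far decides whether the sample moves into the new batch and, if so, to which edge. The function name `update_first_level` and its parameters are hypothetical, and the more involved second-level update is not shown.

```python
import random

def update_first_level(r1, seen, batch, rng):
    """Return the first-level sample after `batch` arrives, given `seen` earlier edges.

    One random draw decides whether the sample moves into the batch, so the
    cost is O(1) per estimator regardless of the batch size.
    """
    w = len(batch)
    if w == 0:
        return r1
    # A uniform index over all seen + w edges lands in the batch
    # with probability w / (seen + w).
    k = rng.randrange(seen + w)
    return batch[k - seen] if k >= seen else r1
```

If the returned edge differs from the old r1, the estimator's second-level state would have to be reset before the batch is scanned.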

Counting and Sampling 4-Cliques
1. Choose a random edge r1 in the graph
2. Choose a random edge r2 that appears after r1 and is adjacent to r1
3. Choose a random adjacent edge r3 that appears after {r1, r2} and has one endpoint in common with {r1, r2}
   - Any edge with both endpoints in {r1, r2} is surely retained
4. Wait for the 4-clique defined by {r1, r2, r3} to be completed
But this misses cliques whose first two edges are not adjacent to each other; a separate case is needed to handle such cliques.

Extensions
- Transitivity coefficient of a graph = 3 * (# triangles) / (# connected triples)
- Sliding windows
- Directed 3-cycles in a directed graph
- Counting patterns that have temporal constraints: "how many instances where A → B, followed by B → C, followed by C → A?"

(Preliminary) Experimental Results
Orkut graph: 3 million vertices, 117 million edges, max degree = 67,000, number of triangles = 633 million

# Estimators     1 K       128 K     1 M
Relative Error   4.6 %     2.13 %    1.48 %
Time Taken       52 sec    75 sec    103 sec (33 IO)

Runtime versus Number of Estimators
[Plots: one panel per graph]
- Livejournal graph: 4 M vertices, 35 M edges, 30 K max degree, 178 M triangles
- Youtube graph: 1 M vertices, 3 M edges, 57 K max degree, 3 M triangles

Relative Error versus Number of Estimators
[Plots: one panel per graph]
- Livejournal graph: 4 M vertices, 35 M edges, 30 K max degree, 178 M triangles
- Youtube graph: 1 M vertices, 3 M edges, 57 K max degree, 3 M triangles

Conclusions
- General sampling method for estimating the cardinality of graph patterns
  - Small-sized cliques
  - Extendible to special cases, e.g. temporal constraints, edge directions
  - "Sticky sampling" for graph streams
- Technique:
  - Sample within the neighbourhood of current edges
  - Compute the bias online
  - Incorporate the bias into the estimator
- Fast implementations
  - Multicore machine: synthetic graph of size 167 GB in 1000 sec on a 12-core machine

Thank you
Reference: Counting and Sampling Triangles from a Graph Stream, Research Report RC25339, IBM