

Sampling from Large Graphs

Motivation
Our purpose is to analyze and model social networks.
– An online social network graph is composed of millions of nodes and edges
– To analyze it, we have to store the whole graph in the computer's memory
– Sometimes this is impossible
– Even when it is possible, computing even basic graph properties is extremely time-consuming
– Thus we need to extract a small sample of the graph and analyze that instead

Problem
Given a huge real graph, how can we derive a representative sample?
– Which sampling method should we use?
– How small can the sample be?
– How do we measure success?

Problem
What do we compare against?
– Scale-down sampling: given a graph G with n nodes, derive a sample graph G’ with n’ nodes (n’ << n) that is as similar as possible to G
– Back-in-time sampling: let G_n’ denote graph G at the point in time when it had n’ nodes. Find a sample S on n’ nodes that is most similar to G_n’, that is, to G as it was when it had the same size as S

Evaluation Techniques
Criteria for scale-down sampling:
– In-degree distribution
– Out-degree distribution
– Distribution of sizes of weakly connected components (WCC)
– Distribution of sizes of strongly connected components
– Hop plot: the number of reachable pairs of nodes at distance h or less
– Hop plot on the largest WCC
– Distribution of the clustering coefficient
– Distribution of singular values of the graph adjacency matrix versus the rank
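As an illustration of one criterion from this slide, the hop plot can be computed with a breadth-first search from every node. This is a minimal sketch, not the presenters' code: the function name `hop_plot`, the adjacency-dict representation, and the convention of counting ordered pairs (u, v) with u != v at distance h or less are all assumptions made here.

```python
from collections import deque

def hop_plot(adj, max_h):
    """Return a list P where P[h] is the number of ordered pairs (u, v),
    u != v, such that v is reachable from u in at most h hops.

    adj maps every node to the set of its out-neighbors (assumed format).
    """
    exact = [0] * (max_h + 1)  # exact[d] = pairs at distance exactly d
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] == max_h:
                continue  # do not expand beyond the requested horizon
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            if d > 0:
                exact[d] += 1
    # Turn exact-distance counts into cumulative "within h hops" counts.
    cumulative = []
    running = 0
    for count in exact:
        running += count
        cumulative.append(running)
    return cumulative

# Undirected path 0 - 1 - 2 - 3, stored with both edge directions.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(hop_plot(path, 3))
```

Running one BFS per node costs O(n(n + m)) time, which is exactly the kind of cost that motivates sampling on graphs with millions of nodes.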

Evaluation Techniques
Criteria for back-in-time sampling:
– Densification Power Law: number of edges versus number of nodes over time
– The effective diameter of the graph over time, which has been observed to shrink and then stabilize
– Normalized size of the largest WCC over time
– Average clustering coefficient over time
– Largest singular value of the graph adjacency matrix over time
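The Densification Power Law states that the number of edges grows as a power of the number of nodes, e(t) ∝ n(t)^a. A rough way to estimate the exponent a from a series of graph snapshots is a least-squares line fit in log-log space. The sketch below assumes that approach; the function name `densification_exponent` and the snapshot numbers are made up for illustration.

```python
import math

def densification_exponent(node_counts, edge_counts):
    """Estimate a in e(t) ~ n(t)^a as the least-squares slope of
    log(edges) against log(nodes) over the snapshots."""
    if len(node_counts) != len(edge_counts) or len(node_counts) < 2:
        raise ValueError("need at least two (nodes, edges) snapshots")
    xs = [math.log(n) for n in node_counts]
    ys = [math.log(e) for e in edge_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("node counts must not all be equal")
    return sxy / sxx

# Hypothetical snapshots of a growing graph (illustrative numbers only).
nodes_over_time = [100, 200, 400, 800]
edges_over_time = [300, 700, 1650, 3900]
print(densification_exponent(nodes_over_time, edges_over_time))
```

An exponent above 1 means the graph densifies over time. A back-in-time sample can then be judged by how closely the exponent fitted on its own growth matches that of the true graph.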

Statistical Tests
Comparing graph patterns using the Kolmogorov-Smirnov D-statistic:
– Measure the agreement between two distributions using D = max_x |F’(x) - F(x)|
– F and F’ are the two cumulative distribution functions
– Does not address the issue of scaling
– Compares only the shape of the distributions
Comparing graph patterns using the visiting probability:
– For each node u ∈ G, calculate the probability of visiting node w ∈ G
– Use the Frobenius norm to measure the difference in visiting probabilities
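The D-statistic can be computed directly from two lists of observations, for example the degree sequence of the full graph and that of a sample. This is a minimal sketch using empirical CDFs; the function name `ks_d_statistic` and the toy degree lists are assumptions for illustration.

```python
from bisect import bisect_right

def ks_d_statistic(sample_a, sample_b):
    """Return D = max_x |F'(x) - F(x)| for the empirical CDFs of two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # Empirical CDFs are step functions that change only at observed values,
    # so the maximum gap is reached at one of those values.
    for x in set(a) | set(b):
        f_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        f_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(f_a - f_b))
    return d

full_degrees = [1, 1, 2, 2, 3, 5, 8]
sample_degrees = [1, 2, 2, 3]
print(ks_d_statistic(full_degrees, sample_degrees))
```

D lies between 0 (identical empirical distributions) and 1 (no overlap in the CDFs), so a smaller value means the sample reproduces the property more faithfully.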

Algorithms
Sampling by random node selection:
– Random node sampling: select a set of nodes uniformly at random
– Random PageRank node sampling: set the probability of a node being selected into the sample proportional to its PageRank weight
– Random degree node sampling: set the probability of a node being selected into the sample proportional to its degree
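The node-selection methods differ only in the weight given to each node, so they can share one helper. The sketch below makes several assumptions not stated on the slide: the graph is an undirected adjacency dict, duplicates are redrawn until k distinct nodes are chosen, and the sample graph is the subgraph induced by the chosen nodes. All function names are hypothetical.

```python
import random

def weighted_node_sample(adj, k, weight, rng):
    """Draw k distinct nodes; each draw picks a node with probability
    proportional to weight[node]."""
    nodes = sorted(adj)
    weights = [weight[v] for v in nodes]
    if k > sum(1 for w in weights if w > 0):
        raise ValueError("not enough nodes with positive weight")
    chosen = set()
    while len(chosen) < k:
        # Redraw on duplicates so the result has exactly k nodes.
        chosen.add(rng.choices(nodes, weights=weights, k=1)[0])
    return chosen

def random_node_sample(adj, k, rng):
    """Random node sampling: every node has the same weight."""
    return weighted_node_sample(adj, k, {v: 1 for v in adj}, rng)

def random_degree_node_sample(adj, k, rng):
    """Random degree node sampling: weight equals the node's degree."""
    return weighted_node_sample(adj, k, {v: len(adj[v]) for v in adj}, rng)

def induced_subgraph(adj, nodes):
    """Keep only the edges whose endpoints were both selected."""
    return {u: {v for v in adj[u] if v in nodes} for u in nodes}

# Star graph around node 0 plus an isolated node 4.
graph = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}, 4: set()}
rng = random.Random(0)
picked = random_degree_node_sample(graph, 3, rng)
print(induced_subgraph(graph, picked))
```

Random PageRank node sampling would reuse `weighted_node_sample` with a dict of precomputed PageRank scores as `weight`.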

Algorithms
Sampling by random edge selection:
– Random edge (RE) sampling: select edges uniformly at random
– Random node-edge (RNE) sampling: select a node uniformly at random, then select an edge incident to it uniformly at random
– Hybrid (HYB) sampling: with probability p perform an RNE sampling step, and with probability 1-p perform an RE sampling step
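Hybrid sampling contains the other two methods as special cases (p = 0 gives RE, p = 1 gives RNE), so one function can illustrate all three. This is a sketch under stated assumptions: the graph is an undirected adjacency dict, the RNE step draws only from nodes that have at least one edge, and sampling continues until a target number of distinct edges is reached. The name `hybrid_edge_sample` is hypothetical.

```python
import random

def hybrid_edge_sample(adj, n_edges, p, rng):
    """Collect n_edges distinct undirected edges.

    With probability p take an RNE step (random node, then a random
    incident edge); otherwise take an RE step (uniform random edge).
    """
    edges = sorted({(min(u, v), max(u, v)) for u in adj for v in adj[u]})
    if n_edges > len(edges):
        raise ValueError("graph has fewer edges than requested")
    # Assumption: isolated nodes are skipped, since they have no incident edge.
    candidates = sorted(u for u in adj if adj[u])
    picked = set()
    while len(picked) < n_edges:
        if rng.random() < p:
            u = rng.choice(candidates)
            v = rng.choice(sorted(adj[u]))
        else:
            u, v = rng.choice(edges)
        picked.add((min(u, v), max(u, v)))  # store each edge in one orientation
    return picked

# Triangle 0-1-2 with a pendant node 3 and an isolated node 4.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: set()}
rng = random.Random(42)
print(sorted(hybrid_edge_sample(graph, 2, 0.8, rng)))
```

The sample graph then consists of the picked edges and their endpoints, which is why small edge samples tend to be very sparsely connected.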

Algorithms
Sampling by exploration:
– Random node neighbor sampling: select a node uniformly at random, together with all its outgoing neighbors
– Random walk sampling: select a starting node uniformly at random and perform a random walk with restarts. If we get stuck, randomly select another node to start from
– Random jump sampling: same as random walk sampling, but with probability p we jump to a new node
– Forest fire sampling: choose a node u uniformly at random. Generate a random number z and select z out-links of u that have not yet been visited. Apply this step recursively to all z selected links
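Two of the exploration methods, random walk sampling and forest fire sampling, can be sketched as follows. Several details are assumptions rather than something the slide specifies: the restart probability of 0.15, the `patience` rule used to decide that a walk is stuck, the forward-burning probability of 0.7, and drawing z from a geometric distribution. Function names are hypothetical, and `adj` maps every node to the set of its out-neighbors.

```python
import random
from collections import deque

def random_walk_sample(adj, k, rng, restart_p=0.15, patience=1000):
    """Collect k nodes with a random walk that restarts at its start node."""
    nodes = sorted(adj)
    if k > len(nodes):
        raise ValueError("sample size exceeds number of nodes")
    visited = set()
    while len(visited) < k:
        start = current = rng.choice(nodes)  # new start when stuck
        visited.add(start)
        idle = 0  # steps taken since the last newly visited node
        while len(visited) < k and idle < patience:
            if not adj[current] or rng.random() < restart_p:
                current = start  # dead end or restart: fly back
            else:
                current = rng.choice(sorted(adj[current]))
            if current in visited:
                idle += 1
            else:
                visited.add(current)
                idle = 0
    return visited

def forest_fire_sample(adj, k, rng, p_forward=0.7):
    """Collect k nodes by recursively 'burning' z unvisited out-links."""
    if not 0 <= p_forward < 1:
        raise ValueError("p_forward must be in [0, 1)")
    nodes = sorted(adj)
    if k > len(nodes):
        raise ValueError("sample size exceeds number of nodes")
    visited = set()
    while len(visited) < k:
        seed = rng.choice(nodes)
        if seed in visited:
            continue  # fire died out: try another seed
        visited.add(seed)
        frontier = deque([seed])
        while frontier and len(visited) < k:
            u = frontier.popleft()
            fresh = [v for v in sorted(adj[u]) if v not in visited]
            z = 0  # geometric draw with mean p_forward / (1 - p_forward)
            while rng.random() < p_forward:
                z += 1
            for v in rng.sample(fresh, min(z, len(fresh))):
                if len(visited) >= k:
                    break
                visited.add(v)
                frontier.append(v)
    return visited

# Directed ring of 10 nodes with one chord from each even node.
graph = {i: {(i + 1) % 10} | ({(i + 5) % 10} if i % 2 == 0 else set())
         for i in range(10)}
rng = random.Random(1)
print(sorted(random_walk_sample(graph, 5, rng)))
print(sorted(forest_fire_sample(graph, 5, rng)))
```

Random jump sampling would differ from `random_walk_sample` only in the restart step: instead of returning to `start`, the walker would move to a newly chosen node.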

Evaluation
Three groups of algorithms:
– RDN (random degree node), RJ (random jump), RW (random walk): biased towards high-degree nodes and the densely connected part of the graph
– FF (forest fire), RPN (random PageRank node), RN (random node): not biased towards high-degree nodes; they match the temporal densification of the true graph
– RE (random edge), RNE (random node-edge), HYB (hybrid): for small sample sizes the resulting graph is very sparsely connected
Conclusion:
– For the scale-down goal, methods based on random walks perform best
– For the back-in-time goal, the forest fire algorithm performs best
– There is no single perfect answer to graph sampling
– Experiments showed that a 15% sample is usually enough

Further thoughts
Is trying to match all properties at once the wrong approach? Maybe we should try matching one property at a time.
Test the sampling methods on graphs with weighted and labeled edges.
Current algorithms are extremely slow when the graph is read from a file.
– Better implementations are needed to decrease the I/O cost
