Sampling Techniques for Large, Dynamic Graphs Daniel Stutzbach – University of Oregon Reza Rejaie – University of Oregon Nick Duffield – AT&T Labs—Research.

Presentation transcript:

Sampling Techniques for Large, Dynamic Graphs
Daniel Stutzbach – University of Oregon
Reza Rejaie – University of Oregon
Nick Duffield – AT&T Labs-Research
Subhabrata Sen – AT&T Labs-Research
Walter Willinger – AT&T Labs-Research
Global Internet Symposium, Barcelona, Spain, April 28th, 2006

Motivation
P2P systems are very popular in practice: collectively they have several million simultaneous users and account for 60% of all Internet traffic [CacheLogic Research 2005]. Measurement studies aid in understanding existing systems and user behavior. Capturing global state is often infeasible because P2P systems are large and rapidly changing. Sampling is therefore a natural approach and has been used in several earlier measurement studies. But how do we know the samples are representative?

The Problem
We focus on sampling peer properties, such as peer degree, link bandwidth, number of shared files, and remaining uptime. Sampling peer properties occurs in two steps: (1) discover and select peers, then (2) collect the measurements. Selecting peers uniformly at random is hard: peer dynamics can introduce bias, and the graph topology can introduce bias. We examine these two problems separately.

Temporal Causes of Bias
Define V_t as the set of peers present at time t. We gather samples over a measurement window of length Δ. The most common approach is to gather peers from the set of all peers present at some point during the window.
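One way to write the set gathered by the common approach, assuming the window starts at a time t_0 (a symbol introduced here for illustration, not taken from the slides), is:

```latex
V_{[t_0,\,t_0+\Delta]} \;=\; \bigcup_{t = t_0}^{t_0+\Delta} V_t
```

Every peer that appears at any instant in the window enters this set exactly once, regardless of how long it stayed.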

Example of Bias towards Short-Lived Peers
[Figure: timeline showing the short-lived peers and the long-lived peer]
Consider a simple two-peer system containing one long-lived peer and one rapidly changing short-lived peer. The common approach over-selects short-lived peers.

Handling Temporal Causes of Bias
The common approach is intuitive but incorrect. Sampling peers is the wrong goal; we want to sample peer properties. Therefore, v_{i,t} and v_{i,t'} are distinct samples, even though they come from the same peer. We allow sampling the same peer more than once, at different points in time.

Example of Avoiding Bias towards Short-Lived Peers
[Figure: timeline showing the short-lived peers and the long-lived peer]
Allowing a peer to be re-selected solves the problem. The long-lived peer will be selected half the time, reflecting the actual state of the system. The remaining problem is how to select a peer uniformly at random at a particular moment.
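The two-peer example can be checked with a small simulation. This is only a sketch under assumptions that are not in the slides: time is discrete, the short-lived slot is taken by a brand-new peer at every time step, and the names `simulate`, `window`, and `num_windows` are made up for illustration.

```python
import random

def simulate(num_windows=2000, window=10, seed=0):
    """Toy two-peer system: 'L' is the long-lived peer; a new short-lived
    peer 'S<t>' occupies the other slot at every time step t (assumption)."""
    rng = random.Random(seed)
    common_long = 0    # times the common approach picks the long-lived peer
    instant_long = 0   # times the per-instant approach picks it
    for w in range(num_windows):
        start = w * window
        # Common approach: pool every distinct peer seen during the window,
        # then pick one peer uniformly from that pool.
        pool = ["L"] + [f"S{t}" for t in range(start, start + window)]
        if rng.choice(pool) == "L":
            common_long += 1
        # Per-instant approach: pick a time t, then pick uniformly from V_t.
        # The same peer may be picked again in a later draw.
        t = rng.randrange(start, start + window)
        if rng.choice(["L", f"S{t}"]) == "L":
            instant_long += 1
    return common_long / num_windows, instant_long / num_windows

common, instant = simulate()
print(f"long-lived share, common approach: {common:.2f}")
print(f"long-lived share, per-instant approach: {instant:.2f}")
```

With these assumptions the pooled approach should pick the long-lived peer in roughly 1 of 11 draws, while the per-instant approach should pick it about half the time.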

Topological Causes of Bias
Goal: select a peer uniformly at random at time t. Begin with one peer and query peers to discover their neighbors. Prior work uses classic graph-discovery algorithms: Breadth-First Search (BFS) and Depth-First Search (DFS). These techniques have three problems: peers are correlated by their neighbor relationship, peers with higher degree are more likely to be discovered, and a peer can only be selected once. Random walks are a promising alternative.
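The discovery process can be sketched as a budget-limited breadth-first search. The toy overlay, the function name `bfs_discover`, and the `max_queries` budget are hypothetical; a real crawler would issue network queries instead of dictionary lookups.

```python
from collections import deque

def bfs_discover(neighbors, start, max_queries):
    """Discover peers breadth-first using at most max_queries neighbor queries."""
    discovered = [start]
    seen = {start}
    queue = deque([start])
    queries = 0
    while queue and queries < max_queries:
        peer = queue.popleft()
        queries += 1                       # one neighbors-query sent to `peer`
        for n in neighbors[peer]:
            if n not in seen:              # a peer can be discovered only once
                seen.add(n)
                discovered.append(n)
                queue.append(n)
    return discovered

# Hypothetical toy overlay: "hub" has high degree, "f" is far from the start.
overlay = {
    "a": ["hub"],
    "hub": ["a", "b", "c", "d"],
    "b": ["hub"],
    "c": ["hub"],
    "d": ["hub", "e"],
    "e": ["d", "f"],
    "f": ["e"],
}
found = bfs_discover(overlay, "a", max_queries=2)
print(found)
```

After two queries the sample consists of the start peer, the hub, and the hub's neighbors, which illustrates both the neighbor correlation and the pull towards high-degree peers; peers "e" and "f" are never reached.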

Random Walks
Basic idea of the random walk: select a neighbor at random to explore, then explore that neighbor and “forget” the previous peer. Only two pieces of state are maintained: the current peer and the length of the walk. A subset of the visited peers is selected for sampling. The basic random walk selects a peer every r steps. Graph theory suggests r ≥ log(|V|). Walking r steps between samples eliminates correlations. Peers are selected with probability proportional to their degree, and peers can be selected more than once.
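A minimal sketch of the basic random walk follows. The toy overlay, the function name, and the use of the natural logarithm for r are assumptions made for illustration; the slides do not state the base of the logarithm.

```python
import math
import random
from collections import Counter

def random_walk_samples(neighbors, start, r, num_samples, seed=0):
    """Basic random walk: keep only the current peer and the walk length,
    and select the current peer as a sample every r steps."""
    rng = random.Random(seed)
    current = start
    walk_length = 0
    samples = []
    while len(samples) < num_samples:
        current = rng.choice(neighbors[current])  # move, forgetting the previous peer
        walk_length += 1
        if walk_length % r == 0:
            samples.append(current)               # the same peer may appear again
    return samples

# Hypothetical overlay: two triangles sharing "hub" (degree 4, others degree 2).
overlay = {
    "hub": ["a", "b", "c", "d"],
    "a": ["hub", "b"],
    "b": ["hub", "a"],
    "c": ["hub", "d"],
    "d": ["hub", "c"],
}
r = max(1, math.ceil(math.log(len(overlay))))     # r >= log(|V|), natural log assumed
samples = random_walk_samples(overlay, "hub", r, num_samples=6000)
hub_fraction = Counter(samples)["hub"] / len(samples)
print(f"r = {r}, hub selected in {hub_fraction:.2f} of samples")
```

In this graph the hub holds 4 of the 12 edge endpoints, so a degree-proportional sampler should select it in about one third of the samples rather than the one fifth a uniform sampler would give.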

Variations on the Random Walk
Fixing the degree bias (“Degree Correction”): select a candidate peer with probability [formula not preserved in this transcript]. Pro: should result in uniform selection of peers. Con: decreases efficiency.
Improving efficiency (“Random Stroll”): after the first r steps, select every peer instead of every r-th peer. Pro: increases efficiency. Con: introduces slight correlations.
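Both variations can be sketched as flags on one walk. The selection probability is not preserved in this transcript, so the sketch assumes a candidate is accepted with probability 1/degree, a common way to offset degree-proportional visiting; that choice, the function name `walk_samples`, and the toy overlay are illustrative assumptions rather than the authors' stated method.

```python
import random
from collections import Counter

def walk_samples(neighbors, start, r, num_samples,
                 stroll=False, degree_correction=False, seed=0):
    """Random walk (candidate every r steps) or random stroll (every step
    once r steps have passed), optionally with degree correction."""
    rng = random.Random(seed)
    current, steps, samples = start, 0, []
    while len(samples) < num_samples:
        current = rng.choice(neighbors[current])
        steps += 1
        is_candidate = steps >= r if stroll else steps % r == 0
        if not is_candidate:
            continue
        # ASSUMPTION: accept a candidate with probability 1/degree.
        if degree_correction and rng.random() >= 1.0 / len(neighbors[current]):
            continue                      # rejected: costs a step, yields no sample
        samples.append(current)
    return samples

# Hypothetical overlay: two triangles sharing "hub" (degree 4, others degree 2).
overlay = {
    "hub": ["a", "b", "c", "d"],
    "a": ["hub", "b"],
    "b": ["hub", "a"],
    "c": ["hub", "d"],
    "d": ["hub", "c"],
}
rs = walk_samples(overlay, "hub", r=2, num_samples=5000, stroll=True)
rsdc = walk_samples(overlay, "hub", r=2, num_samples=5000,
                    stroll=True, degree_correction=True)
rs_hub = Counter(rs)["hub"] / len(rs)
rsdc_hub = Counter(rsdc)["hub"] / len(rsdc)
print(f"RS hub share: {rs_hub:.2f}, RSDC hub share: {rsdc_hub:.2f}")
```

Under the assumed acceptance rule the uncorrected stroll should favor the hub (about one third of samples), while the corrected stroll should bring it close to the uniform share of one fifth, at the cost of discarding rejected candidates.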

Evaluations
We simulated different techniques over two types of graphs: a snapshot of the Gnutella ultrapeer topology [Stutzbach 05 IMC] and random graphs with the same number of vertices and edges as the Gnutella topology.
Metrics:
Bias: Is peer A more likely to be selected than peer B?
Correlation: If we select peer A, are we more likely to select peer B?
Efficiency: How easily can we collect a sample?
Techniques: Oracle (uniformly random), Breadth-First Search (BFS), Random Walk (RW), Random Walk with Degree Correction (RWDC), Random Stroll (RS), and Random Stroll with Degree Correction (RSDC).

Bias
Collect k|V| samples and compare with the Oracle. Most peers should be selected around k times. RSDC appears unbiased in both cases. RWDC performs well but exhibits slight bias on Gnutella. BFS, RS, and RW are heavily biased.
[Figures 1(a) and 1(b) go here]
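The bias check can be sketched as a per-peer selection count. The peer names, degrees, and the stand-in "biased" sampler (which draws peers in proportion to degree) are hypothetical; they only illustrate how counts around k indicate an unbiased technique.

```python
import random
from collections import Counter

def selection_counts(draw, peers, k):
    """Draw k*|V| samples and report how many times each peer was selected."""
    counts = Counter(draw() for _ in range(k * len(peers)))
    return {p: counts[p] for p in peers}

rng = random.Random(0)
peers = ["hub", "a", "b", "c", "d"]     # hypothetical peers
degrees = [4, 2, 2, 2, 2]               # hypothetical degrees
k = 1000

# Oracle: uniformly random selection, so every count should be close to k.
oracle = selection_counts(lambda: rng.choice(peers), peers, k)
# Stand-in for a degree-biased technique: selection proportional to degree.
biased = selection_counts(
    lambda: rng.choices(peers, weights=degrees)[0], peers, k)

print("oracle:", oracle)
print("biased:", biased)
```

Counts that cluster around k match the Oracle; a peer selected far more often than k times, like the hub under the degree-proportional sampler, signals bias.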

Correlation
Even if unbiased, a technique may exhibit correlations. We define a sampling session as 1,000 consecutive samples. For each pair (A, B), if A is selected, how often is B also selected? A long tail indicates correlation. RWDC and RSDC appear uncorrelated. RW and RS exhibit slight correlations. BFS exhibits strong correlation.
[Figures 2(a) and 2(b) go here]

Efficiency
The basic operation is the neighbors-query. Efficiency is: [formula not preserved in this transcript]. BFS and RS are close to 100% efficient. Unfortunately, they are also heavily biased. RW, RWDC, and RSDC are 2% to 8% efficient. RSDC is twice as efficient as RWDC (4% vs. 2%). However, even the inefficient techniques are O(log |V|).
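Since the efficiency formula is not preserved here, the following is only an assumed definition that fits the surrounding text: samples obtained per neighbors-query issued.

```latex
\text{efficiency} \;=\; \frac{\text{number of samples collected}}{\text{number of neighbors-queries issued}},
\qquad
\text{efficiency}_{\mathrm{RW}} \;\approx\; \frac{1}{r}
```

Under that assumption, a basic random walk that issues one query per step and keeps one peer every r steps is about 1/r efficient. With r on the order of log |V|, each sample then costs O(log |V|) queries, which is consistent with the closing claim of the slide.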

Summary of Results and Lessons Learned
Addressing temporal causes of bias: avoid gathering a set of peers and collecting measurements in separate passes. Instead, select a peer, then collect the measurement. Repeat, and allow re-selecting the same peer.
Addressing topological causes of bias: be careful to avoid bias towards high-degree peers. Consider using a random walk or random stroll with degree correction.

Ongoing Work
This work is preliminary. We plan to study additional types of random walks (weighting the selection of the next hop) and additional types of graphs (power-law and small-world). We have examined temporal and topological causes of bias separately. To examine them concurrently, we are creating a dynamic overlay simulator.