Respondent-driven Sampling for Characterizing Unstructured Overlays A. H. Rasti University of Oregon M. Torkjazi R. Rejaie N. Duffield AT&T Labs - Research.

Respondent-driven Sampling for Characterizing Unstructured Overlays
A. H. Rasti, M. Torkjazi, R. Rejaie (University of Oregon)
N. Duffield, W. Willinger (AT&T Labs - Research)
D. Stutzbach (Stutzbach Enterprises)
Graciously Presented By: Shubho Sen (AT&T Labs - Research)

Motivation
P2P systems are very popular in practice.
–Millions of simultaneous users
–A significant fraction of Internet traffic
Measurement studies aid understanding of existing systems and user behavior.
Capturing an accurate global snapshot is often infeasible.
–P2P systems are distributed, large, and rapidly changing.
–P2P crawlers are likely to capture incomplete or distorted snapshots.
Sampling is a natural approach, and has been used implicitly in most earlier P2P measurement studies.
How can we collect representative samples?

The Graph Sampling Problem
We focus on sampling peer properties, such as number of neighbors (degree), access link bandwidth, session time, and number of files.
Sampling peer properties has two steps:
–Discovering and selecting peers (or samples)
–Measuring the desired properties of selected peers
Selecting peers uniformly at random is hard; there are two sources of bias [Stutzbach:IMC06]:
–Topological: high-degree peers are more likely to be selected
–Temporal: short-lived peers are more likely to be selected
Random walks are a promising approach to sampling:
–The resulting bias is precisely known
–Samples can be collected in parallel by multiple walkers

Sampling Using Random Walk
Random walks can be described with a transition matrix P.
P(x,y): probability of moving from x to y
P^r(x,y): probability of moving from x to y after r moves
Random walks converge to a stationary distribution.
Problem: we need a uniform distribution.
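The convergence claim can be checked numerically. The sketch below is illustrative only: the four-node graph and the function names are assumptions, not taken from the talk. It applies the simple random walk transition rule P(x,y) = 1/deg(x) repeatedly and compares the result with the degree-proportional distribution deg(x)/2|E|, which is the standard stationary distribution of a simple random walk on a connected, non-bipartite undirected graph. That distribution is not uniform, which is the problem the slide points at.

```python
# Hypothetical 4-node undirected graph as an adjacency list (not from the slides).
GRAPH = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a"],
}


def step(dist, graph):
    """Apply one move of the simple random walk, where P(x, y) = 1/deg(x)."""
    nxt = {node: 0.0 for node in graph}
    for x, mass in dist.items():
        for y in graph[x]:
            nxt[y] += mass / len(graph[x])
    return nxt


def distribution_after(start, r, graph):
    """Return P^r(start, .): the walker's location distribution after r moves."""
    dist = {node: 0.0 for node in graph}
    dist[start] = 1.0
    for _ in range(r):
        dist = step(dist, graph)
    return dist


def degree_proportional(graph):
    """Return deg(x) / 2|E| for every node x."""
    total = sum(len(nbrs) for nbrs in graph.values())  # equals 2|E|
    return {node: len(nbrs) / total for node, nbrs in graph.items()}


if __name__ == "__main__":
    print(distribution_after("d", 200, GRAPH))
    print(degree_proportional(GRAPH))
```

On this toy graph the two printed distributions should agree closely, with the highest-degree node receiving the largest share of visits.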

Metropolized Random Walk (MRW)
The Metropolis-Hastings method modifies the transition matrix to yield the desired uniform distribution [Stutzbach:IMC06].
MRW method:
–Select a neighbor y of x uniformly at random
–Transition to y with probability min(deg(x)/deg(y), 1)
–Otherwise, self-loop to x
–Results in uniform stationary dist. π(x) = 1/|V|
MRW compensates for bias as samples are collected.
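The MRW steps can be sketched as a short simulation. This is a minimal illustration under stated assumptions, not the authors' sampler: the toy graph, the function names, and the fixed seed are all hypothetical. It follows the three rules on the slide (pick a neighbor uniformly, accept with probability min(deg(x)/deg(y), 1), otherwise stay) and reports how often each node is visited.

```python
import random
from collections import Counter

# Hypothetical 4-node undirected graph (not from the slides).
GRAPH = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a"],
}


def mrw_step(x, graph, rng):
    """One MRW move from node x, following the slide's three rules."""
    y = rng.choice(graph[x])  # select a neighbor uniformly at random
    accept = min(len(graph[x]) / len(graph[y]), 1.0)
    if rng.random() < accept:
        return y  # transition to y
    return x  # self-loop to x


def mrw_visit_frequencies(graph, start, steps, seed=0):
    """Run one walker and return the fraction of steps spent at each node."""
    rng = random.Random(seed)
    counts = Counter()
    x = start
    for _ in range(steps):
        x = mrw_step(x, graph, rng)
        counts[x] += 1
    return {node: counts[node] / steps for node in graph}


if __name__ == "__main__":
    print(mrw_visit_frequencies(GRAPH, "a", 100_000))
```

With enough steps, each of the four nodes should be visited roughly a quarter of the time, even though their degrees range from 1 to 3.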

This paper
Presents a new graph sampling technique, Respondent-Driven Sampling (RDS)
Compares the performance of RDS and MRW sampling techniques using simulations & experiments

Respondent-driven Sampling
A development of Snowball Sampling [Salganik04]
Commonly used in social sciences to sample hidden populations, e.g. HIV+ individuals
Social relationships (references) are used by the sampler to diffuse into hidden populations
–Each person introduces n other persons
–Similar to random walk (n = 1)
We adopt the RDS technique from social sciences for sampling P2P networks

RDS Formulation
Goal: Estimate the distribution of node property X
Perform a regular random walk; collect the value of property X and the node degree deg(v) at each visited node
Deal with the bias during post-processing as follows:
–Divide the possible values of X into several ranges: {R_1,...,R_m}
–Partition the nodes into groups by the range their X value falls in: {V_1,...,V_m}
Using the Hansen-Hurwitz estimator to compensate for the bias, the proportion of all nodes in group i is estimated as follows:
\hat{p}_i = \frac{\sum_{v \in T_i} 1/\deg(v)}{\sum_{u \in T} 1/\deg(u)}
T_i: visited samples in group i
T: all visited samples
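The post-processing step can be sketched in a few lines. This assumes the usual Hansen-Hurwitz weighting for a degree-biased walk, where each visited sample counts with weight 1/deg(v) and group weights are normalized by the total weight. The function name, the sample layout, and the example values are hypothetical.

```python
from collections import defaultdict


def rds_proportions(samples, group_of):
    """Estimate the fraction of all nodes that fall in each group.

    samples:  (x_value, degree) pairs collected along a plain random walk;
              degree is at least 1 for any node the walk can reach.
    group_of: maps an x_value to its group index i (its range R_i).
    """
    group_weight = defaultdict(float)
    total_weight = 0.0
    for x_value, degree in samples:
        weight = 1.0 / degree  # undo the walk's bias toward high-degree nodes
        group_weight[group_of(x_value)] += weight
        total_weight += weight
    return {group: w / total_weight for group, w in group_weight.items()}


if __name__ == "__main__":
    # Made-up samples: three high-degree nodes with small X, one degree-1 node with large X.
    walk_samples = [(10, 4), (12, 4), (11, 2), (50, 1)]
    print(rds_proportions(walk_samples, lambda x: 0 if x < 30 else 1))
```

In the example, group 0 makes up three of the four raw samples, but its nodes were oversampled because of their higher degree, so the weighted estimate for that group is lower than the raw share.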

Evaluation Overview
Performance metric
–Consider only peer properties that may interact with the walk: 1) peer degree, 2) peer uptime, 3) peer RTT
–Compare the dist. of these peer properties from samples and ground truth using the Kolmogorov-Smirnov (KS) statistic
Evaluation methodology
–Evaluation over static graphs
  Effect of graph structure
–Evaluation over dynamic graphs (session-level simulation)
  Benefits of parallel sampling (see the paper)
  Effect of 1) churn, 2) peer discovery, 3) target peer degree
–Experiments over Gnutella network
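For reference, the two-sample KS statistic is the largest vertical gap between two empirical CDFs. The sketch below is a plain implementation of that definition. The function name and the example values are illustrative, and the talk does not say which implementation the authors used.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(v) - F_b(v)| over all v."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so checking those values is enough to find the maximum gap.
    for v in set(a) | set(b):
        cdf_a = bisect_right(a, v) / len(a)
        cdf_b = bisect_right(b, v) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


if __name__ == "__main__":
    sampled_degrees = [1, 2, 3, 4]
    reference_degrees = [3, 4, 5, 6]
    print(ks_statistic(sampled_degrees, reference_degrees))
```

A value of 0 means the two empirical distributions coincide, and a value of 1 means they do not overlap at all.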

Evaluation: Static Graphs
Using graphs with different degree distribution & clustering characteristics:
–Random graphs (ER): Erdos-Renyi
–Small-world graphs (SW): Watts and Strogatz
–Scale-free graphs (BA): Barabasi and Albert
–Hierarchical Scale-Free graphs (HSF): Barabasi 02
  Power-law degree distribution
  Node clustering is inversely proportional to node degree
–Gnutella graphs (GA): Snapshots of Gnutella Ultrapeer topology

Hierarchical Scale-Free (HSF)

Static Graphs
Accuracy of both techniques improves with the number of samples in most cases
The rate of improvement in accuracy is much lower over HSF, especially for MRW
Walkers are likely to get trapped within clusters in HSF graphs
Leaving a cluster requires visiting high-degree nodes, but MRW is less likely to visit these nodes
Rewiring a small fraction of randomly selected edges in HSF significantly improves accuracy for both techniques
RDS is less sensitive to graph clustering than MRW

Dynamic Graphs
Churn is a primary limiting factor for accuracy
Session len. > 5m: very good sampling accuracy
Churn model has little effect
Similar impact on other peer properties (see the paper)
Sampling error is small once nodes have sufficient connectivity (> 5)
Lower accuracy for smaller degree is due to graph partitioning
Partitioned nodes in History mech. reduce the accuracy of sampling

Experiment: Gnutella
Run crawler, 1000 RDS & 1000 MRW walkers in parallel
–500 steps per walker
Use snapshots captured by the crawler as a rough reference
–Show min, max, avg KS over 6 experiments
–Focus only on degree dist.
The degree dist. from samples & crawls are very similar (KS~0.03)
The accuracy is an order of magnitude lower than in the dynamic sim. due to the inaccurate reference.
Both sampling techniques achieve similar accuracy

Conclusions & Future Work
RDS always performs as well as or better than MRW
A high level of graph clustering can significantly degrade the accuracy of both RDS and MRW
–RDS is less sensitive than MRW to graph clustering
There is a sweet spot for the number of parallel samplers.
Poor connectivity & high dynamics adversely affect the accuracy of both techniques.
Future Work:
–RDS is a promising approach for sampling user properties in Online Social Networks
–Sampling over directed graphs raises new challenges.

Thank You !

Different Graph Structures

Dynamic Simulation Setting
Simulation environment
–Session-time distributions: Weibull, Exponential, Pareto
–Poisson arrival process
–Peer discovery: Oracle, FIFO, HeartBeat, History
–Target population: 100000
–Min. degree: 3-30
–Sampling parameters:
  Node degree (DEG)
  Node query latency (RTT)
  Session length/uptime (UT)

Evaluation: Static Graphs: Conclusion
Combination of highly skewed degree distribution and highly skewed clustering traps samplers
RDS samplers get out of the clusters quickly
MRW samplers get stuck in low-degree clusters
Shuffling provides short-cuts out of clusters
HSF can be a model for some natural and social networks

Dynamic Graphs: Effect of Parallelism
Too much parallelism does not improve performance
Random walks that are too long have a negative effect
Sweet spot exists
