Computer Science Sampling Biases in IP Topology Measurements John Byers with Anukool Lakhina, Mark Crovella and Peng Xie Department of Computer Science.

Presentation transcript:

Computer Science Sampling Biases in IP Topology Measurements John Byers with Anukool Lakhina, Mark Crovella and Peng Xie Department of Computer Science Boston University

Discovering the Internet topology. Goal: Discover the Internet router graph. Vertices represent routers; edges connect routers that are one IP hop apart. Measurement primitive: traceroute, which reports the IP path from A to B, i.e., how IP paths are overlaid on the router graph. [Figure: a traceroute path from source to destination.]

Traceroute studies today. (k,m)-traceroute study: k sources (few active sources, strategically located) and m destinations (many passive destinations, globally dispersed). The measured graph is the union of many traceroute paths. [Figure: sources and destinations.]

Heavy tails in Topology Measurements. A surprising finding [FFF99]: Let $d$ be a given node degree, and let $f_d$ be the frequency of degree-$d$ vertices in a graph. Power-law relationship: $f_d \propto d^{c}$ for a constant $c$. Subsequent measurements show that the degree distribution is heavy-tailed [GT00, BC01, …]. [Figures: frequency vs. degree for the dataset from [PG98]; log(Pr[X>x]) vs. log(degree).]
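
The plots in this talk show log(Pr[X>x]) against log(degree). As a minimal sketch of how such a curve could be computed from a list of node degrees, the helper below builds the empirical complementary CDF. The function name `ccdf` is hypothetical and not taken from the talk.

```python
from collections import Counter


def ccdf(degrees):
    """Return [(x, Pr[X > x])] for each distinct degree x, in increasing order."""
    n = len(degrees)
    counts = Counter(degrees)
    remaining = n  # number of samples strictly greater than the current x
    points = []
    for x in sorted(counts):
        remaining -= counts[x]
        points.append((x, remaining / n))
    return points


# Points with probability 0 (the maximum degree) must be dropped
# before taking logarithms for a log-log plot.
print(ccdf([1, 1, 2, 3]))
```

On a log-log plot, a power-law degree distribution shows up as a roughly straight line in this curve.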

We’re skeptical. We will argue that the evidence for power laws is at best insufficient. Insufficient does not mean noisy or incomplete (which these datasets certainly are!). For us, insufficient means that measurements are statistically biased. We will show that (k,m)-traceroute studies exhibit significant sampling bias.

A thought experiment
Idea: Simulate topology measurements on a random graph.
1. Generate a sparse Erdős-Rényi random graph, G=(V,E). Each edge is present independently with probability p. Assign weights: w(e) = 1 + ε, where ε is a small perturbation.
2. Pick k unique source nodes, uniformly at random.
3. Pick m unique destination nodes, uniformly at random.
4. Simulate traceroute from the k sources to the m destinations, i.e., learn the shortest paths between the k sources and the m destinations.
5. Let Ĝ be the union of these shortest paths.
Ask: How does Ĝ compare with G?
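
The thought experiment above can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not the authors' simulator: the function names are hypothetical, and the range used for the weight perturbation ε is an assumption, since it does not survive in this transcript.

```python
import heapq
import random


def er_graph(n, p, rng):
    """Erdos-Renyi graph G(n, p) as {node: {neighbor: weight}}."""
    adj = {v: {} for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                # w(e) = 1 + eps; the range [-1/n, 1/n] is an ASSUMPTION.
                w = 1.0 + rng.uniform(-1.0 / n, 1.0 / n)
                adj[u][v] = w
                adj[v][u] = w
    return adj


def shortest_path_tree(adj, src):
    """Dijkstra from src; returns a predecessor map over reachable nodes."""
    dist = {src: 0.0}
    pred = {src: None}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                pred[v] = u
                heapq.heappush(heap, (nd, v))
    return pred


def km_traceroute(adj, k, m, rng):
    """Union of shortest paths from k random sources to m random destinations."""
    nodes = list(adj)
    sources = rng.sample(nodes, k)
    dests = rng.sample(nodes, m)
    measured = set()  # edges of the measured graph, as frozenset({u, v})
    for s in sources:
        pred = shortest_path_tree(adj, s)
        for t in dests:
            if t not in pred:
                continue  # destination unreachable from this source
            v = t
            while pred[v] is not None:
                measured.add(frozenset((v, pred[v])))
                v = pred[v]
    return sources, dests, measured
```

Comparing the degree of each node in `measured` with its degree in `adj` gives the measured-versus-true comparison that the talk goes on to analyze.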

Are heavy tails a measurement artifact? Ĝ is a biased sample of G that looks heavy-tailed. [Figure: log(Pr[X>x]) vs. log(Degree) for the underlying random graph G (N=100000, p=) and the measured graph Ĝ (k=3, m=1000).]

Outline
- Motivation and Thought Experiments
- Understanding Bias on Simulated Topologies: Where and Why
- Detecting and Defining Bias: Statistical hypotheses to infer presence of bias
- Examining Internet Maps

Understanding Bias. (k,m)-traceroute sampling of graphs is biased. An intuitive explanation: When traces are run from a few sources to many destinations, some portions of the underlying graph are explored more than others. We now investigate the causes of this bias.

Are nodes sampled unevenly? Conjecture: Shortest-path routing favors higher-degree nodes, so nodes are sampled unevenly. Validation: Examine the true degrees of the nodes in the measured graph, Ĝ. If the conjecture holds, expect the true degrees of nodes in Ĝ to be higher, on average, than the degrees of nodes in G. [Figure: true degrees of nodes in Ĝ vs. degrees of all nodes in G; measured graph: k=5, m=1000.] Conclusion: The difference between the true degrees of nodes in Ĝ and the degrees of nodes in G is insignificant; dismiss the conjecture.

Are edges sampled unevenly? Conjecture: The number of edges selected incident to a node in Ĝ is not proportional to its true degree. Validation: For each node in Ĝ, plot true degree vs. measured degree. If unbiased, the ratio of true to measured degree should be constant, with points clustered around a line y=cx (c<1). [Figure: observed degree vs. true degree.] Conclusion: Edges incident to a node are sampled disproportionately; this supports the conjecture.

Why: Analyzing Bias. Question: Given some vertex in Ĝ that is h hops from the source, what fraction of its true edges are contained in Ĝ? Message: As h increases, the number of edges discovered falls off sharply. (We can prove exponential fall-off analytically, in a simplified model.) Result of adding more destinations: most new nodes and edges are closer to the source. [Figure: fraction of node edges discovered vs. distance from source, for 100, 600, and 1000 destinations.]
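
The question on this slide can be answered empirically on a simulated topology. The sketch below averages, for each hop distance h, the fraction of a node's true edges that appear in the measured graph. It assumes a single source and measures hop distance by breadth-first search in the measured graph; the function name and the input representation are hypothetical.

```python
from collections import defaultdict, deque


def fraction_discovered_by_hop(true_adj, measured_edges, source):
    """Map hop distance h -> mean (measured degree / true degree) at hop h.

    true_adj: {node: set of true neighbors} for the underlying graph G.
    measured_edges: iterable of 2-element edges (u, v) in the measured graph.
    """
    meas_adj = defaultdict(set)
    for u, v in measured_edges:
        meas_adj[u].add(v)
        meas_adj[v].add(u)

    # Breadth-first search gives hop distances from the source.
    hop = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in meas_adj[u]:
            if v not in hop:
                hop[v] = hop[u] + 1
                queue.append(v)

    fractions = defaultdict(list)
    for v, h in hop.items():
        if true_adj[v]:  # skip nodes with no true edges
            fractions[h].append(len(meas_adj[v]) / len(true_adj[v]))
    return {h: sum(fs) / len(fs) for h, fs in sorted(fractions.items())}
```

For example, on a path 0-1-2 measured inside a graph where nodes 1 and 2 each have three true neighbors, the fractions at hops 0, 1, and 2 are 1, 2/3, and 1/3.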

What does this suggest? Summary: Edges are sampled unevenly by (k,m)-traceroute methods. Edges close to the source are sampled more often than edges further away. Intuitive picture: The neighborhood near the sources is well explored, but visibility of edges declines sharply with hop distance from the sources. [Figure: log(Pr[X>x]) vs. log(Degree) for the underlying graph and the measured graph, with nodes grouped by Hop1 through Hop4.]

Outline
- Motivation and Thought Experiments
- Understanding Bias in Simulated Topologies: Where and Why
- Detecting and Defining Bias: Statistical hypotheses to infer presence of bias
- Examining Internet Maps

Inferring Bias. Goal: Given a measured Ĝ, does it appear to be biased? Why this is difficult: We don’t have the underlying graph, and we don’t have formal criteria for checking bias. General approach: Examine statistical properties as a function of distance from the nearest source. Unbiased sample ⇒ no change; change ⇒ bias.

Detecting Bias. Examine Pr[D=d|H=h], the conditional probability that a node has degree d, given that it is at distance h from the source. Two observations: 1. The highest-degree nodes are near the source. 2. The degree distribution of nodes near the source is different from that of nodes far away. [Figure: log(Pr[X>x]) vs. log(Degree) for the underlying graph and for Ĝ degrees given H=2 and H=3.]
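
As a small illustration of the quantity Pr[D=d|H=h], the sketch below estimates it from per-node degrees and hop distances. The function name and the dictionary inputs are hypothetical; nodes without a known hop distance are ignored.

```python
from collections import Counter, defaultdict


def conditional_degree_dist(degree, hop):
    """Estimate Pr[D=d | H=h] as {h: {d: probability}}.

    degree: {node: measured degree}; hop: {node: distance from the source}.
    """
    by_hop = defaultdict(Counter)
    for v, d in degree.items():
        if v in hop:
            by_hop[hop[v]][d] += 1
    dist = {}
    for h, counts in by_hop.items():
        total = sum(counts.values())
        dist[h] = {d: c / total for d, c in counts.items()}
    return dist
```

In an unbiased sample, the distributions returned for different values of h should look alike.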

A Statistical Test for C1. C1: Are the highest-degree nodes near the source? If so, this is consistent with bias. Null hypothesis $H_0^{C1}$: The 1% highest-degree nodes occur at random with respect to distance to the nearest source. Cut the vertex set in half, into N (near) and F (far), by distance from the nearest source. Let ν = (0.01)|V| be the number of highest-degree nodes, and let κ be the fraction of those ν nodes that lie in N. Under the null hypothesis κ should be close to 1/2, and the likelihood that κ deviates from 1/2 can be bounded using Chernoff bounds. Reject the hypothesis with confidence 1-α if that bound falls below α.
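
A minimal sketch of the C1 test follows. The slide's exact bound does not survive in this transcript, so the code assumes the standard multiplicative Chernoff bound Pr[X > (1+δ)μ] < (e^δ / (1+δ)^(1+δ))^μ, applied with μ = ν/2 and δ = 2κ - 1. The function name `c1_test`, the default α, and the arbitrary tie-breaking when splitting nodes into near and far halves are also assumptions.

```python
import math


def c1_test(degree, dist, alpha=0.05, top_frac=0.01):
    """Return (kappa, bound, reject) for the C1 null hypothesis.

    degree: {node: degree}; dist: {node: distance to nearest source}.
    """
    # Near half N: the nodes closest to a source (ties broken arbitrarily).
    by_dist = sorted(degree, key=lambda v: dist[v])
    near = set(by_dist[: len(by_dist) // 2])

    # nu = number of top-degree nodes; kappa = fraction of them in N.
    nu = max(1, round(top_frac * len(by_dist)))
    top = sorted(degree, key=lambda v: degree[v], reverse=True)[:nu]
    kappa = sum(1 for v in top if v in near) / nu

    delta = 2.0 * kappa - 1.0
    if delta <= 0:
        return kappa, 1.0, False  # no excess of top-degree nodes near sources

    # ASSUMED bound: multiplicative Chernoff with mean nu / 2.
    bound = (math.exp(delta) / (1.0 + delta) ** (1.0 + delta)) ** (nu / 2.0)
    return kappa, bound, bound < alpha
```

With this bound, a small graph may not allow rejection even when every top-degree node is near a source: for ν = 10 and κ = 1 the bound is (e/4)^5, which is about 0.145.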

A Statistical Test for C2. C2: Is the degree distribution of nodes near the source different from that of nodes further away? If so, this is consistent with bias. Null hypothesis $H_0^{C2}$: The Chi-Square test succeeds on the degree distributions of nodes near the source and far from the source. Partition the vertices across the median distance into N (near) and F (far). Compare the degree distributions of nodes in N and F using the Chi-Square test: $\chi^2 = \sum_{i=1}^{l} (O_i - E_i)^2 / E_i$, where $O_i$ and $E_i$ are the observed and expected degree frequencies in bin $i$ and $l$ is the number of histogram bins. Reject the hypothesis with confidence 1-α if $\chi^2 > \chi^2_{[1-\alpha,\, l-1]}$.
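
The Chi-Square statistic for C2 can be sketched as below. Several choices here are assumptions rather than details from the talk: the near-node histogram is treated as observed, the far-node histogram (rescaled to the same total) as expected, the bin edges are supplied by the caller, and bins with zero expected count are skipped. The critical value is left to the caller, for example from a chi-square table.

```python
from bisect import bisect_right


def histogram(values, edges):
    """Count values into bins [edges[i], edges[i+1]); out-of-range values are dropped."""
    counts = [0] * (len(edges) - 1)
    for x in values:
        i = bisect_right(edges, x) - 1
        if 0 <= i < len(counts):
            counts[i] += 1
    return counts


def chi_square_stat(near_degrees, far_degrees, edges):
    """Sum over bins of (O - E)^2 / E, with O from near nodes and E from far nodes."""
    observed = histogram(near_degrees, edges)
    far = histogram(far_degrees, edges)
    scale = sum(observed) / sum(far)  # rescale far counts to the near total
    stat = 0.0
    for o, f in zip(observed, far):
        e = f * scale
        if e > 0:
            stat += (o - e) ** 2 / e
    return stat


far = [1] * 50 + [2] * 30 + [5] * 20
near = [1] * 20 + [2] * 30 + [5] * 50
stat = chi_square_stat(near, far, edges=[1, 2, 4, 8])
# Three bins give 2 degrees of freedom; the 0.05 critical value is 5.991.
print(stat, stat > 5.991)
```

In this toy example the statistic is 18 + 0 + 45 = 63, far above the critical value, so the near and far degree distributions would be judged different.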

Our Definition of Bias. Bias (definition): Failure of a sampled graph to meet the statistical tests for randomness associated with C1 and C2. Disclaimers: The tests are not conclusive. The tests are binary and don’t tell us how biased the datasets are. But a dataset that fails both tests is a poor basis for generalizations about the underlying graph.

Introducing datasets
Dataset Name | Date | # Nodes | # Links | # Srcs | # Dsts
Pansiot-Grad | 1995 | 3,888 | 4, | |
Mercator | | ,263 | 320,149 | 1 | NA
Skitter | 2000 | 7,202 | 11, | |
[Figure: log(Pr[X>x]) vs. log(Degree) for Pansiot-Grad, Mercator, and Skitter.]

Testing C1. $H_0^{C1}$: The 1% highest-degree nodes occur at random with respect to distance to the source. Pansiot-Grad: 93% of the highest-degree nodes are in N. Mercator: 90% of the highest-degree nodes are in N. Skitter: 84% of the highest-degree nodes are in N.

Testing C2 ($H_0^{C2}$). [Figure: log(Pr[X>x]) vs. log(Degree) for Near, Far, and All nodes in each of Pansiot-Grad, Mercator, and Skitter.]

Summary of Statistical Tests. Both statistical tests find evidence of bias in all datasets. It is likely that the true degree distribution of the routers differs from that of these datasets.

Final Remarks. Using (k,m)-traceroute methods to discover Internet topology yields biased samples. Rocketfuel [SMW:02] is limited in scale but may avoid some pitfalls of (k,m)-traceroute studies. One open question: How to sample the degree of a router at random? Node degree distributions are especially sensitive to biased sampling, so they may not be a sufficiently robust metric for characterizing or comparing graphs.

Sampling Power-Law Graphs. Again, Ĝ is a biased sample of G. Even though the distributional shapes are similar, the different exponents matter for topology modeling. [Figure: log(Pr[X>x]) vs. log(Degree) for the underlying power-law graph (PLRG: N=) and the measured graph (k=3, m=1000).]