1 Network Tomography Using Passive End-to-End Measurements. Venkata N. Padmanabhan, Lili Qiu, Helen J. Wang. Microsoft Research. DIMACS’2002
2 Overview. Goal: determine internal network characteristics using end-to-end, passive measurements, in order to find trouble spots in the network (e.g., the AT&T-Sprint peering point). Metrics of interest: link loss rate (our focus), raw bandwidth, available bandwidth, traffic rate, … Why interesting: finding trouble spots; placement of Web replicas. [Figure labels: Sprint, AT&T, Web Server, UUNET, MCI, Qwest, AOL, Earthlink; "Why so slow?"]
3 Topological Metrics. Topological metrics are poor predictors of packet loss rate. Not all links are equal, so we need to identify the lossy links.
4 Previous Work. Active probing to infer link loss rate: multicast probes; striped unicast probes. Pros & cons: accurate, since individual loss events are identified; expensive, because of extra probe traffic. [Figure: two diagrams, each with nodes S, A, and B]
5 Our Approach. Passive observation of existing traffic: measure loss rate rather than loss events. Active probing to discover network topology: can be done infrequently and in the background. Goal: identify lossy links rather than determine exact loss rate. [Figure: network between the server and the clients, with links labeled $l_1$ to $l_8$ and clients labeled $p_1$ to $p_5$] $(1-l_1)(1-l_2)(1-l_4) = 1-p_1$; $(1-l_1)(1-l_2)(1-l_5) = 1-p_2$; …; $(1-l_1)(1-l_3)(1-l_8) = 1-p_5$. This is an under-constrained system of equations.
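To make "under-constrained" concrete, here is a minimal Python sketch of the forward model in the equations on slide 5. It uses only the three client paths that are written out there (clients 1, 2, and 5); the function name and the example loss values are hypothetical.

```python
import math

# Client -> links on its path, taken from the three equations written on the slide.
PATHS = {1: (1, 2, 4), 2: (1, 2, 5), 5: (1, 3, 8)}


def client_loss(link_loss):
    """End-to-end loss per client: 1 - prod(1 - l_i) over the links on its path.

    Links missing from link_loss are treated as loss-free (an assumption).
    """
    return {
        client: 1.0 - math.prod(1.0 - link_loss.get(i, 0.0) for i in path)
        for client, path in PATHS.items()
    }


# Two different (hypothetical) link assignments that the clients cannot tell apart.
shared_link_lossy = client_loss({1: 0.1})
two_branches_lossy = client_loss({2: 0.1, 3: 0.1})
print(shared_link_lossy)
print(two_branches_lossy)
```

Both assignments yield the same client loss rates, which is why the techniques on the following slides need extra structure (sampling, a parsimony objective, or a posterior) to single out lossy links.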
6 #1: Random Sampling. Randomly sample the solution space; repeat this several times; draw conclusions based on overall statistics. How to do random sampling? Determine a loss rate bound for each link using the best downstream client, then iterate over all links: pick a loss rate at random within the bounds and update the bounds for the other links. Problem: little tolerance for estimation error. [Figure: same topology as slide 5]
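The sampling procedure above can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes the three paths written out on slide 5, a top-down link order, uniform draws within the bounds, client loss rates below 1, and that each last-hop link absorbs whatever loss remains so the path equation holds exactly. All names and example values are hypothetical.

```python
import math
import random

# Client -> links on its path (server first), from the equations on slide 5.
PATHS = {1: (1, 2, 4), 2: (1, 2, 5), 5: (1, 3, 8)}


def sample_solution(client_loss, rng):
    """Draw one random link-loss assignment consistent with the client loss rates."""
    # residual[c]: success rate the still-unassigned links on c's path must multiply to.
    residual = {c: 1.0 - p for c, p in client_loss.items()}
    order = []  # links visited top-down, one depth level at a time
    for depth in range(max(len(p) for p in PATHS.values())):
        for path in PATHS.values():
            if depth < len(path) and path[depth] not in order:
                order.append(path[depth])
    loss = {}
    for link in order:
        downstream = [c for c, path in PATHS.items() if link in path]
        # The best downstream client (least remaining loss) bounds this link's loss.
        bound = max(0.0, 1.0 - max(residual[c] for c in downstream))
        is_last_hop = all(PATHS[c][-1] == link for c in downstream)
        # Assumption: a last-hop link takes all remaining loss; others draw uniformly.
        loss[link] = bound if is_last_hop else rng.uniform(0.0, bound)
        for c in downstream:  # update the bounds seen by links further down
            residual[c] /= 1.0 - loss[link]
    return loss


def mean_loss(client_loss, runs=200, seed=0):
    """Repeat the sampling and average each link's loss over all runs."""
    rng = random.Random(seed)
    samples = [sample_solution(client_loss, rng) for _ in range(runs)]
    return {link: sum(s[link] for s in samples) / runs for link in samples[0]}


# Hypothetical observation: client 1 sees 10% loss, the other two see 0.5%.
print(mean_loss({1: 0.1, 2: 0.005, 5: 0.005}))
```

Because every sample satisfies the path equations exactly, any error in a client's measured loss rate propagates directly into the sampled link rates, which matches the "little tolerance for estimation error" caveat.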
7 #2: Linear Optimization. Goals: parsimonious explanation; robust to error in client loss rate estimate. Define $L_i = \log(1/(1-l_i))$ and $P_j = \log(1/(1-p_j))$, so that the product along each path becomes a sum. Minimize $\sum_i L_i + \sum_j |S_j|$ subject to $L_1+L_2+L_4+S_1 = P_1$; $L_1+L_2+L_5+S_2 = P_2$; …; $L_1+L_3+L_8+S_5 = P_5$. Can be turned into a linear program. [Figure: same topology as slide 5]
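The slide does not say how the absolute values are removed. One standard reformulation, shown here as an assumption rather than as the authors' construction, splits each slack into nonnegative parts, $S_j = S_j^{+} - S_j^{-}$, so that $|S_j| = S_j^{+} + S_j^{-}$ at the optimum. The notation $\mathrm{path}(j)$ for the set of links between the server and client $j$ is introduced here for illustration.

```latex
\begin{aligned}
\text{minimize}\quad & \sum_i L_i + \sum_j \left( S_j^{+} + S_j^{-} \right) \\
\text{subject to}\quad & \sum_{i \in \mathrm{path}(j)} L_i + S_j^{+} - S_j^{-} = P_j
  \quad \text{for each client } j, \\
& L_i \ge 0, \quad S_j^{+} \ge 0, \quad S_j^{-} \ge 0 .
\end{aligned}
```

The constraint $L_i \ge 0$ follows from $0 \le l_i < 1$. Since the objective penalizes $S_j^{+} + S_j^{-}$, at most one of the two parts is nonzero in an optimal solution, and a link loss rate is recovered as $l_i = 1 - e^{-L_i}$.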
8 #3: Gibbs Sampling. $D$: observed packet transmissions (each either succeeds or fails) at the clients. $\Theta$: ensemble of loss rates of links in the network. Goal: determine the posterior distribution $P(f(\Theta) \mid D)$. Approach: use Markov Chain Monte Carlo with Gibbs sampling to obtain samples from $P(f(\Theta) \mid D)$; draw conclusions based on the samples.
9 #3: Gibbs Sampling (Cont.). Gibbs sampling: 1) Initialize link loss rates arbitrarily. 2) For j = 1 : warmupSamples, for each link i, compute $P(l_i \mid D, \{l_i'\})$, where $l_i$ is the loss rate of link i and $\{l_i'\} = \bigcup_{k \neq i} l_k$. 3) For j = 1 : realSamples, for each link i, compute $P(l_i \mid D, \{l_i'\})$. Use all the samples obtained at step 3 to approximate the posterior $P(f(\Theta) \mid D)$ over the ensemble of link loss rates $\Theta$.
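A minimal sketch of steps 1 to 3, under several assumptions that are not in the slides: loss rates are restricted to a discrete grid, the prior over the grid is flat, packet outcomes are independent Bernoulli trials, and each link's new value is drawn from its conditional distribution (the usual Gibbs step). The paths are the three written out on slide 5; the packet counts and all function names are hypothetical.

```python
import math
import random

# Client -> links on its path, from the equations on slide 5.
PATHS = {1: (1, 2, 4), 2: (1, 2, 5), 5: (1, 3, 8)}
LINKS = sorted({i for path in PATHS.values() for i in path})
# Hypothetical grid of candidate loss rates: 0.005, 0.015, ..., 0.395.
GRID = [(k + 0.5) / 100 for k in range(40)]


def log_likelihood(loss, counts):
    """log P(D | link loss rates) for per-client (delivered, lost) packet counts."""
    total = 0.0
    for client, path in PATHS.items():
        delivered, lost = counts[client]
        success = math.prod(1.0 - loss[i] for i in path)
        total += delivered * math.log(success) + lost * math.log(1.0 - success)
    return total


def gibbs(counts, warmup=100, real=400, seed=0):
    """Return the post-warmup samples of the link loss rates."""
    rng = random.Random(seed)
    loss = {i: rng.choice(GRID) for i in LINKS}  # step 1: arbitrary start
    samples = []
    for sweep in range(warmup + real):  # steps 2 and 3 share the same update
        for i in LINKS:
            # Conditional P(l_i | D, other links) evaluated on the grid.
            logs = []
            for value in GRID:
                loss[i] = value
                logs.append(log_likelihood(loss, counts))
            peak = max(logs)  # subtract the peak to avoid underflow in exp
            weights = [math.exp(x - peak) for x in logs]
            loss[i] = rng.choices(GRID, weights=weights)[0]
        if sweep >= warmup:  # keep only the "real" samples
            samples.append(dict(loss))
    return samples


def posterior_mean(samples):
    """Average each link's loss rate over the retained samples."""
    return {i: sum(s[i] for s in samples) / len(samples) for i in LINKS}


# Hypothetical data: client 1 loses 100 of 1000 packets, the others 5 of 1000.
counts = {1: (900, 100), 2: (995, 5), 5: (995, 5)}
print(posterior_mean(gibbs(counts)))
```

In this toy input only link 4 is exclusive to the lossy client, so its posterior mass should sit well above that of the shared links. The fraction of samples in which a link exceeds a chosen threshold could serve as a confidence score, though the threshold itself would be another assumption.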
10 Performance Evaluation. Simulation experiments; trace-driven validation.
11 Simulation Experiments. Advantage: no uncertainty about link loss rate! Methodology: topologies used: randomly generated: nodes, max degree = 5-50; real topology obtained by tracing paths to microsoft.com clients. Randomly generated packet loss events at each link. A fraction f of the links are good, and the rest are "bad". LM1: good links 0–1%, bad links 5–10%. LM2: good links 0–1%, bad links 1–100%. Goodness metrics: coverage (# correctly inferred lossy links); false positive (# incorrectly inferred lossy links).
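The loss models and goodness metrics above can be sketched in a few lines of Python. The slide gives only the ranges, so drawing uniformly within each range and choosing exactly round(f × n) good links are assumptions; the function names are hypothetical.

```python
import random

# (good range, bad range) of link loss rates, as fractions, from the slide.
LOSS_MODELS = {
    "LM1": ((0.0, 0.01), (0.05, 0.10)),
    "LM2": ((0.0, 0.01), (0.01, 1.0)),
}


def assign_loss_rates(links, f, model, rng):
    """Mark a fraction f of the links good and draw a loss rate for every link."""
    good_range, bad_range = LOSS_MODELS[model]
    good = set(rng.sample(list(links), round(f * len(links))))
    rates, bad = {}, set()
    for link in links:
        if link in good:
            rates[link] = rng.uniform(*good_range)  # assumption: uniform draw
        else:
            rates[link] = rng.uniform(*bad_range)
            bad.add(link)
    return rates, bad


def score(inferred, truly_lossy):
    """Coverage and false positive as counts, following the slide's definitions."""
    coverage = len(inferred & truly_lossy)
    false_positive = len(inferred - truly_lossy)
    return coverage, false_positive


rates, bad = assign_loss_rates(list(range(1, 101)), 0.9, "LM1", random.Random(1))
print(len(bad), score({1, 2, 3}, {2, 3, 4}))
```

Dividing the two counts by the number of truly lossy links and by the number of inferred links, respectively, would turn them into rates; the slide itself defines them as counts.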
12 Random Topologies. Random sampling: high coverage and high false positive. LP: moderate coverage and low false positive. Gibbs sampling: high coverage and low false positive.
13 Random Topologies (Cont.). The confidence estimate for Gibbs sampling works well and can be used to rank-order the inferred lossy links.
14 Trace-driven Validation. Experimental setup: packet tracing machine at microsoft.com; client loss rates based on TCP traffic; traces analyzed: Dec. 20, 2000 and Jan. 11, 2002, each about 2 hours long with 100–125 million packets. Validation: divide the client trace into two sets, tomography and validation; use the validation set to see if clients downstream of the inferred lossy links experience high loss. False positive rate is between 5% and 30%. Likely candidates for lossy links: links that cross an inter-AS boundary; links that have a large delay; links that terminate at clients.
15 Summary. Develop and evaluate three different techniques for passive network tomography.
Technique            Coverage  False Positive  Computation Cost
Random sampling      High      High            Low
Linear optimization  Modest    Low             Medium
Gibbs sampling       High      Low             High
16 Acknowledgement. Dimitris Achlioptas, Chris Borg, Jennifer Chayes, David Heckerman, Chris Meek, David Wilson, Rob Emmanuel, Scott Hogan