1 Passive Network Tomography Using Bayesian Inference. Lili Qiu, Microsoft Research; joint work with Venkata N. Padmanabhan and Helen J. Wang. Internet Measurement Workshop 2002, Marseille, France.
2 Motivation [Figure: a Web server reaches its clients through C&W, AT&T, Sprint, UUNet, Qwest, Earthlink, AOL and an Ethernet link; a client complains "It’s so slow!" and a diagnosis engine asks "Why is it so slow?"]
3 Network Diagnosis [Figure: Netmon/tcpdump traces and the network topology feed a diagnosis engine, which outputs the locations of trouble spots.] Diagnosis results: Qwest access link: >; peering between UUNET and AOL: >
4 Network Diagnosis (Cont.) Goal: determine internal network characteristics using passive end-to-end measurements. Primary focus: identifying lossy links. Applications: troubleshooting, server selection, server placement, and overlay network path construction.
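5 Previous Work: active probing to infer link loss rates, using multicast probes or striped unicast probes. Pros: accurate, since individual loss events are identified. Cons: expensive, because of the extra probe traffic. [Figure: a source S probing receivers A and B, once with multicast probes and once with striped unicast probes.]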
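To illustrate why identifying individual loss events makes active probing accurate, here is a sketch of the classic two-receiver estimator from the multicast-inference literature (prior work, not the method of this talk). It assumes a source S, one shared link to a branch point, then separate links to receivers A and B, with independent losses; the function name `minc_two_receivers` is hypothetical. Both receivers see the same copy on the shared link, so the fraction of probes reaching both of them separates the shared link from the branches: the shared link's success rate is $\gamma_A \gamma_B / \gamma_{AB}$. Striped unicast probes (back-to-back packets to A and B) approximate the same correlation when multicast is unavailable.

```python
def minc_two_receivers(received_a, received_b):
    """Estimate per-link loss on a two-leaf tree from per-probe arrival flags.

    received_a[t], received_b[t]: True if probe t reached receiver A / B.
    Assumes independent losses and at least one probe reaching both receivers.
    """
    n = len(received_a)
    gamma_a = sum(received_a) / n                      # P(probe reaches A)
    gamma_b = sum(received_b) / n                      # P(probe reaches B)
    gamma_ab = sum(a and b for a, b in zip(received_a, received_b)) / n
    alpha_shared = gamma_a * gamma_b / gamma_ab        # success rate of S -> branch point
    return {
        "shared": 1 - alpha_shared,
        "to_A": 1 - gamma_a / alpha_shared,            # gamma_A = alpha_shared * alpha_A
        "to_B": 1 - gamma_b / alpha_shared,
    }
```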
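6 Problem Formulation [Figure: a tree rooted at the server; links $l_1, \dots, l_8$ lead to clients whose end-to-end loss rates are $p_1, \dots, p_5$.] Each client's path success rate is the product of the success rates of the links on its path:
$(1-l_1)(1-l_2)(1-l_4) = 1-p_1$
$(1-l_1)(1-l_2)(1-l_5) = 1-p_2$
$\dots$
$(1-l_1)(1-l_3)(1-l_8) = 1-p_5$
In general, for each client $j$: $\prod_{i \in \mathrm{path}(j)} (1-l_i) = 1-p_j$. Challenges: the system of equations is under-constrained (5 equations in 8 unknowns), and the measured $p_j$ contain measurement errors.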
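The under-constrained system can be made concrete with a small sketch. It assumes a hypothetical completion of the tree (l6 under l2 serving client 3, l7 under l3 serving client 4, since the slide elides those equations) and shows two different link loss assignments that reproduce exactly the same end-to-end loss rate at every client, so end-to-end measurements alone cannot distinguish them.

```python
import math

# Hypothetical tree consistent with the slide's equations; the placement of
# l6 and l7 (clients 3 and 4) is an assumption, since the slide elides it.
PATHS = {
    "p1": [1, 2, 4], "p2": [1, 2, 5], "p3": [1, 2, 6],
    "p4": [1, 3, 7], "p5": [1, 3, 8],
}


def path_loss(link_loss, path):
    """End-to-end loss p_j = 1 - prod_{i in path(j)} (1 - l_i)."""
    return 1.0 - math.prod(1.0 - link_loss[i] for i in path)


# Assignment A blames the shared first link; B blames its two children instead.
LOSS_A = {i: 0.0 for i in range(1, 9)}
LOSS_A[1] = 0.1
LOSS_B = {i: 0.0 for i in range(1, 9)}
LOSS_B[2] = LOSS_B[3] = 0.1

for name, path in PATHS.items():
    print(name, path_loss(LOSS_A, path), path_loss(LOSS_B, path))
```

Every client sees 10% loss under both assignments; this ambiguity is what the Bayesian formulation has to resolve through its prior and the posterior distribution.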
7 Gibbs Sampling. $D$: the observed packet transmissions and losses at the clients. $\theta$: the ensemble of loss rates of the links in the network. Goal: determine the posterior distribution $P(\theta \mid D)$. Approach: use Markov chain Monte Carlo (MCMC) with Gibbs sampling to obtain samples from $P(\theta \mid D)$, then draw conclusions based on those samples.
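One way to write the posterior explicitly, under assumptions not stated on the slide: client $j$ is sent $n_j$ packets of which $s_j$ arrive, losses are independent across packets and links, and the prior factorizes over links as $P(\theta) = \prod_i P(l_i)$. Only the clients whose paths contain link $i$ enter the conditional for $l_i$, which keeps each Gibbs step cheap.

```latex
P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta),
\qquad
P(D \mid \theta) \propto \prod_{j} (1-p_j)^{s_j}\, p_j^{\,n_j - s_j},
\qquad
1 - p_j = \prod_{i \in \mathrm{path}(j)} (1 - l_i)

P(l_i \mid D, \{l_i'\}) \propto P(l_i) \prod_{j \,:\, i \in \mathrm{path}(j)} (1-p_j)^{s_j}\, p_j^{\,n_j - s_j}
```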
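8 Gibbs Sampling (Cont.) Applying Gibbs sampling to network tomography:
1) Initialize the link loss rates arbitrarily.
2) For j = 1 : warmup, for each link i, draw $l_i$ from $P(l_i \mid D, \{l_i'\})$, where $l_i$ is the loss rate of link i and $\{l_i'\} = \{l_k : k \neq i\}$ are the loss rates of all other links. These warm-up samples are discarded.
3) For j = 1 : realSamples, for each link i, draw $l_i$ from $P(l_i \mid D, \{l_i'\})$ and record the resulting vector of link loss rates.
Use all the samples obtained in step 3 to approximate $P(\theta \mid D)$.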
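The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each link loss rate is discretized to a grid of 50 values (so each conditional can be sampled exactly), the prior is uniform on that grid, each client's packet counts follow the binomial likelihood, and the tree (`PATHS`, with l6/l7 placement assumed) and the counts (`SENT`, `RECEIVED`) are hypothetical.

```python
import math
import random

# Hypothetical 8-link tree (0-based link ids); l6/l7 placement is assumed.
PATHS = [
    [0, 1, 3],  # client 1: l1, l2, l4
    [0, 1, 4],  # client 2: l1, l2, l5
    [0, 1, 5],  # client 3: l1, l2, l6 (assumed)
    [0, 2, 6],  # client 4: l1, l3, l7 (assumed)
    [0, 2, 7],  # client 5: l1, l3, l8
]
NUM_LINKS = 8
# Candidate loss rates: bin midpoints 0.01, 0.03, ..., 0.99 (avoids log(0)).
GRID = [(k + 0.5) / 50 for k in range(50)]


def path_log_likelihood(loss, path, sent, received):
    """Binomial log-likelihood (up to a constant) of one client's counts."""
    success = math.prod(1.0 - loss[i] for i in path)
    return received * math.log(success) + (sent - received) * math.log(1.0 - success)


def gibbs(paths, sent, received, num_links, warmup=100, real_samples=300, seed=0):
    """Return `real_samples` draws of the link loss-rate vector from P(theta | D)."""
    rng = random.Random(seed)
    loss = [rng.choice(GRID) for _ in range(num_links)]  # step 1: arbitrary start
    # Only clients whose path contains link i depend on l_i.
    clients_of = [[j for j, p in enumerate(paths) if i in p] for i in range(num_links)]
    draws = []
    for sweep in range(warmup + real_samples):           # steps 2 and 3
        for i in range(num_links):
            log_post = []
            for value in GRID:                            # uniform prior on GRID
                loss[i] = value
                log_post.append(sum(
                    path_log_likelihood(loss, paths[j], sent[j], received[j])
                    for j in clients_of[i]))
            top = max(log_post)
            weights = [math.exp(lp - top) for lp in log_post]
            loss[i] = rng.choices(GRID, weights=weights)[0]  # draw l_i | D, others
        if sweep >= warmup:
            draws.append(list(loss))                      # keep only real samples
    return draws


# Hypothetical observations: clients 1-4 lose about 3% of packets, client 5 about 25%.
SENT = [1000] * 5
RECEIVED = [970, 970, 970, 970, 750]

if __name__ == "__main__":
    samples = gibbs(PATHS, SENT, RECEIVED, NUM_LINKS)
    means = [sum(s[i] for s in samples) / len(samples) for i in range(NUM_LINKS)]
    print([round(m, 3) for m in means])
```

With these counts, the posterior mean of l8 should stand out: client 4 shares l1 and l3 with client 5 but sees little loss, so the sampler attributes client 5's loss to its last link. The grid trades precision for simplicity; a finer grid or a different prior would change the numbers but not the structure of the sampler.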
9 Performance Evaluation: simulation experiments and trace-driven validation.
10 Simulation Experiments. Advantage: there is no uncertainty about the actual link loss rates! Methodology: topologies used are randomly generated topologies (… nodes, max degree = 5-50) and a real topology obtained by tracing paths to microsoft.com clients. Packet loss events are generated randomly at each link: a fraction f of the links are good, and the rest are "bad". LM1: good links 0-1% loss, bad links 5-10%. LM2: good links 0-1%, bad links 1-100%. Link loss processes: Bernoulli and Gilbert. Goodness metrics: coverage (number of correctly inferred lossy links) and false positives (number of incorrectly inferred lossy links).
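The simulation setup can be sketched as below, assuming that rates are drawn uniformly within each range and that a Gilbert link is a two-state Markov chain that drops packets while in its bad state. The function names and the Gilbert transition probabilities are hypothetical; the slide names the models but not their parameters.

```python
import random

# Per-link loss-rate ranges (as fractions) for the two loss models on the slide.
LOSS_MODELS = {
    "LM1": {"good": (0.0, 0.01), "bad": (0.05, 0.10)},
    "LM2": {"good": (0.0, 0.01), "bad": (0.01, 1.00)},
}


def assign_loss_rates(num_links, f, model, rng):
    """Make a fraction f of links good and the rest bad; draw rates uniformly."""
    order = list(range(num_links))
    rng.shuffle(order)
    num_good = round(f * num_links)
    rates, is_bad = [0.0] * num_links, [False] * num_links
    for rank, link in enumerate(order):
        kind = "good" if rank < num_good else "bad"
        lo, hi = LOSS_MODELS[model][kind]
        rates[link] = rng.uniform(lo, hi)
        is_bad[link] = kind == "bad"
    return rates, is_bad


def simulate_client(path, rates, num_packets, rng):
    """Bernoulli links: count packets that survive every link on the path."""
    return sum(
        all(rng.random() >= rates[i] for i in path) for _ in range(num_packets)
    )


def gilbert_drops(p_gb, p_bg, num_packets, rng):
    """Gilbert link: a two-state Markov chain that drops packets in the bad state.

    p_gb = P(good -> bad), p_bg = P(bad -> good); long-run loss = p_gb / (p_gb + p_bg).
    """
    bad, drops = False, []
    for _ in range(num_packets):
        if bad:
            bad = rng.random() >= p_bg  # stay bad with probability 1 - p_bg
        else:
            bad = rng.random() < p_gb   # enter bad with probability p_gb
        drops.append(bad)
    return drops
```

Unlike Bernoulli losses, the Gilbert chain produces bursts: once a link enters the bad state it stays there for about 1 / p_bg packets on average, which stresses the inference differently at the same average loss rate.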
11 Random topologies: the confidence estimate from Gibbs sampling works well and can be used to rank-order the inferred lossy links.
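One way to turn posterior samples into a ranked list of lossy links, and to score that list against simulated ground truth, is sketched below. Assumptions: a link's confidence is the fraction of samples in which its loss rate exceeds a hypothetical `threshold`; links with confidence at least `min_confidence` are reported; and coverage and false positives are normalized by the number of truly lossy links and the number of reported links, respectively (the slides define the metrics as counts).

```python
def lossy_confidence(samples, threshold):
    """Fraction of posterior samples in which each link's loss rate exceeds threshold."""
    num_links = len(samples[0])
    return [sum(s[i] > threshold for s in samples) / len(samples) for i in range(num_links)]


def rank_lossy_links(samples, threshold, min_confidence):
    """Report links whose confidence reaches min_confidence, most confident first."""
    conf = lossy_confidence(samples, threshold)
    inferred = [i for i, c in enumerate(conf) if c >= min_confidence]
    return sorted(inferred, key=lambda i: conf[i], reverse=True), conf


def score(inferred, truly_lossy):
    """Counts and rates of correctly and incorrectly inferred lossy links."""
    inferred, truly_lossy = set(inferred), set(truly_lossy)
    correct = len(inferred & truly_lossy)
    false_pos = len(inferred - truly_lossy)
    coverage = correct / len(truly_lossy) if truly_lossy else 1.0
    fp_rate = false_pos / len(inferred) if inferred else 0.0
    return correct, false_pos, coverage, fp_rate


# Tiny hypothetical example: 4 posterior samples over 3 links.
SAMPLES = [[0.01, 0.20, 0.06], [0.02, 0.25, 0.03], [0.01, 0.22, 0.07], [0.03, 0.30, 0.02]]
ranked, conf = rank_lossy_links(SAMPLES, threshold=0.05, min_confidence=0.5)
print(ranked, conf, score(ranked, truly_lossy=[1]))
```

Ranking by confidence lets an operator look at the most certain trouble spots first, which is the use suggested on the slide.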
12 Trace-driven Validation. Validation approach: divide the client traces into two sets, a tomography set and a validation set. Run loss inference on the tomography set, then use the validation set to check whether clients downstream of the inferred lossy links experience high loss. Experimental setup: real topologies and loss traces collected with traceroute and tcpdump at microsoft.com on Dec. 20, 2000 and Jan. 11, 2002. Results: for the small subset of inferences that could be validated, all inferences were correct. Likely candidates for lossy links: links crossing an inter-AS boundary; links with a large delay (e.g., transcontinental links); links that terminate at clients.
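The validation check can be sketched as follows. The random half/half split, the `high_loss` cutoff, and the rule that every downstream validation client must exceed it are assumptions for illustration; a link with no validation client downstream is reported as `None`, which corresponds to an inference that cannot be validated.

```python
import random


def split_clients(clients, rng):
    """Randomly split clients into a tomography half and a validation half."""
    shuffled = list(clients)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def validate(inferred_lossy, paths, loss_rate, validation_clients, high_loss):
    """Map each inferred lossy link to True/False, or None if it cannot be validated.

    paths[c] lists the links on client c's path; loss_rate[c] is c's measured loss.
    """
    results = {}
    for link in inferred_lossy:
        downstream = [c for c in validation_clients if link in paths[c]]
        if not downstream:
            results[link] = None  # no validation client behind this link
        else:
            results[link] = all(loss_rate[c] >= high_loss for c in downstream)
    return results
```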
13 Summary: Passive network tomography is feasible. Gibbs sampling yields high coverage (over 80%) and a low false positive rate (below 5-10%). We have developed two other inference techniques that trade off accuracy for speed (more details in "Server-based Inference of Internet Performance", to appear in INFOCOM’03). Future work: perform loss inference in real time. Acknowledgements: Chris Meek, David Wilson, Christian Borgs, Jennifer Chayes, David Heckerman.