A Framework for Measuring and Predicting the Impact of Routing Changes
Ying Zhang, Z. Morley Mao, Jia Wang
Internet routing changes
  Various causes
    Link failures, configuration changes, topology changes, etc.
  Direct influence on the data plane
    Transient data-plane disruption
    Packet loss, increased delay, forwarding loops
  [Figure: a source and a destination connected across the Internet, with the old path and the new path between them]
Motivation
  Frequent routing dynamics can cause transient disruption in the data plane
    Inconsistent routes during convergence
    Real-time applications can be affected
  Predicting the performance impact can assist more intelligent route selection
Measuring and predicting the impact
  Comprehensively measure the impact of routing changes
  Characterize the properties of routing changes that cause traffic disruption
  Search for patterns to help prediction
Outline
  Motivation
  Methodology
  Characterization of data-plane failures
  Failure prediction model
Methodology
  Data collection
    Control plane: local real-time BGP updates
    Data plane: ping and traceroute probes for each update
  A lightweight active probing methodology
    A coarse-grained performance metric: reachability
    Destination reachable: any ping reply
    Scalable to many destinations with live IPs
  Measurement-based approach
    No simplifying assumptions
    Empirical evidence
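The reachability metric above (a destination counts as reachable if any ping reply arrives) can be sketched in a few lines. This is only an illustration: the function name `is_reachable` and the list-of-booleans input are assumptions, not part of the original framework.

```python
from typing import Iterable


def is_reachable(ping_replies: Iterable[bool]) -> bool:
    """Return True if at least one ping in the burst received a reply.

    `ping_replies` is a hypothetical input: one boolean per ping sent,
    True meaning a reply came back.
    """
    return any(ping_replies)


# Hypothetical probe bursts for two destinations.
burst_with_one_reply = [False, False, True]
burst_with_no_reply = [False, False, False]
reachable = is_reachable(burst_with_one_reply)
unreachable = not is_reachable(burst_with_no_reply)
```

Because a single reply is enough, the metric tolerates occasional packet loss, which fits the coarse-grained intent stated on the slide.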
Our approach
  Focus: measure data-plane failures caused by routing changes
    Coarse-grained performance metrics
  Methodology: lightweight active probing
    Triggered by locally observed routing updates
    Probing target: a live IP within the prefix
  [Figure: measurement framework on the Internet. An update arrives for prefix P with AS path A D B C, and the old path and the new path run through ASes A, B, C and D; ping and traceroute probes are then sent to live IP 1 within prefix P.]
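Choosing a probing target for an updated prefix could look roughly like the sketch below. It assumes a pre-collected list of known live IPs (the hypothetical `live_ips` argument) and uses documentation-range addresses; how the original framework discovers live IPs is not described here.

```python
import ipaddress
from typing import Iterable, Optional


def pick_probe_target(prefix: str, live_ips: Iterable[str]) -> Optional[str]:
    """Return the first known live IP that falls inside `prefix`, or None.

    `live_ips` is a hypothetical list of addresses previously seen to
    answer probes.
    """
    network = ipaddress.ip_network(prefix)
    for ip in live_ips:
        if ipaddress.ip_address(ip) in network:
            return ip
    return None


# Illustrative call with documentation-range addresses.
target = pick_probe_target("192.0.2.0/24", ["198.51.100.7", "192.0.2.10"])
```

If no live IP is known for the prefix, the function returns `None`, and a caller would presumably skip probing for that update.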
Probing control
  Background probing
    Identifying persistent failures
    Verifying the live IP's response
  Resource control
    Ignoring updates due to table transfers
    Imposing a maximum probing duration
  Accuracy control
    Imposing a maximum waiting duration
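One way to picture the resource controls above is a small gate in front of the prober. The sketch below assumes that a table transfer shows up as a burst of updates within a short window, which is a heuristic chosen for illustration; the class name and all thresholds are hypothetical rather than values from the original work.

```python
from collections import deque


class ProbeController:
    """Illustrative resource control for update-triggered probing."""

    def __init__(self, burst_limit: int = 1000, burst_window: float = 60.0,
                 max_probe_duration: float = 600.0) -> None:
        self.burst_limit = burst_limit              # updates tolerated per window (assumed)
        self.burst_window = burst_window            # window length in seconds (assumed)
        self.max_probe_duration = max_probe_duration  # probing cap in seconds (assumed)
        self._recent = deque()                      # timestamps of recent updates

    def should_probe(self, now: float) -> bool:
        """Record an update at `now`; return False while a burst is in progress."""
        self._recent.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while self._recent and now - self._recent[0] > self.burst_window:
            self._recent.popleft()
        return len(self._recent) <= self.burst_limit

    def probing_expired(self, started: float, now: float) -> bool:
        """Return True once probing for one update has hit the duration cap."""
        return now - started >= self.max_probe_duration
```

A sliding window keeps the check cheap per update, and the duration cap bounds how long a single persistent failure can hold probing resources.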
Characterization of data-plane failures
  Failure types
    Reachability failure
      Ping reply is not received due to network problems
    Forwarding loops
      A subset of reachability failures
      Transient loops observed in the path
  Failure properties
    Affected networks
    Failure duration
    Failure predictability
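A forwarding loop can be spotted in a traceroute when a router address reappears after the path has moved on to a different hop. The sketch below assumes the traceroute output has already been parsed into a list of hop addresses, with `None` for unanswered hops; the function name and that input shape are illustrative.

```python
from typing import Optional, Sequence


def has_forwarding_loop(hops: Sequence[Optional[str]]) -> bool:
    """Return True if an address reappears after a different answered hop.

    `hops` is a hypothetical parsed traceroute: one address per TTL,
    None where no reply was received.
    """
    answered = [h for h in hops if h is not None]
    # Collapse immediate repeats, which can occur without any loop.
    collapsed = [h for i, h in enumerate(answered)
                 if i == 0 or h != answered[i - 1]]
    # Any address left more than once was revisited after a different hop.
    return len(set(collapsed)) < len(collapsed)


# Illustrative paths using documentation-range addresses.
looping_path = ["192.0.2.1", "192.0.2.2", "192.0.2.1", "192.0.2.2"]
clean_path = ["192.0.2.1", None, "192.0.2.2", "192.0.2.3"]
```

Unanswered hops are ignored rather than treated as distinct routers, so the check errs on the side of not reporting a loop when the evidence is incomplete.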
Overall reachability failure statistics
  Internet experiments over 11 weeks

                        Incidence   Prefix   AS
  Unreachable: loop         6%       23%     33%
  Unreachable: other       36%       72%     38%
  Unreachable: all         42%       73%     63%
  Reachable                57%       83%     98%
Affected network locations
  Understanding the networks affected by routing changes
    Most affected ASes are near the edge and in foreign countries
    A small fraction of destinations experience many unreachable incidences
Failure durations
  Short duration
    Most last less than 300 seconds
    Transient routing failure, convergence delay
  10% of incidences have a longer duration
    Configuration errors or path failures
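Failure duration can be estimated from the probe timeline that follows an update. The sketch below assumes a time-ordered list of `(timestamp, reachable)` pairs and measures from the first failed probe to the next successful one; this definition and the function name are assumptions for illustration, and the result is only as precise as the probing interval.

```python
from typing import List, Optional, Tuple


def failure_duration(probes: List[Tuple[float, bool]]) -> Optional[float]:
    """Seconds from the first failed probe to the next successful probe.

    Returns None if no probe failed, or if the destination never
    recovered within the trace.
    """
    start = None
    for timestamp, reachable in probes:
        if not reachable and start is None:
            start = timestamp          # outage begins
        elif reachable and start is not None:
            return timestamp - start   # outage ends
    return None


# Hypothetical timeline: probes at 0 s, 10 s, 20 s and 40 s.
timeline = [(0.0, True), (10.0, False), (20.0, False), (40.0, True)]
duration = failure_duration(timeline)
is_short = duration is not None and duration < 300.0  # 300 s mark from the slide
```

Comparing the result against the 300-second mark separates the short, convergence-related failures from the longer ones discussed above.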
Failure predictability
  Destination prefix information
    Appearance probability
      Probability of an unreachable incidence for prefix D
  Destination prefix and AS path segments
    Conditional probability on AS path segments
      Probability of an unreachable event occurring given a particular AS path segment
  Responsible AS
    Where traceroute stops
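Both probabilities above can be estimated as simple frequencies over historical incidents. The sketch below assumes each history record is a `(prefix, AS path, unreachable)` tuple and treats an AS path segment as a run of consecutive ASes; the record layout, function names, and sample data are all illustrative.

```python
from typing import List, Sequence, Set, Tuple

Record = Tuple[str, Tuple[str, ...], bool]  # (prefix, AS path, unreachable?)


def segments(as_path: Sequence[str], length: int = 2) -> Set[Tuple[str, ...]]:
    """All runs of `length` consecutive ASes in the path."""
    return {tuple(as_path[i:i + length])
            for i in range(len(as_path) - length + 1)}


def appearance_probability(history: List[Record], prefix: str) -> float:
    """Fraction of updates for `prefix` that led to an unreachable incidence."""
    outcomes = [bad for p, _, bad in history if p == prefix]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def segment_failure_probability(history: List[Record],
                                segment: Tuple[str, ...]) -> float:
    """Fraction of updates containing `segment` that were unreachable."""
    seen = failed = 0
    for _, path, bad in history:
        if segment in segments(path, len(segment)):
            seen += 1
            failed += bad
    return failed / seen if seen else 0.0


# Synthetic history, for illustration only.
history = [
    ("192.0.2.0/24", ("A", "D", "B", "C"), True),
    ("192.0.2.0/24", ("A", "B", "C"), False),
    ("198.51.100.0/24", ("A", "D", "E"), True),
]
```

With this synthetic history, prefix `192.0.2.0/24` fails in one of its two updates, while every update whose path contains the segment `("A", "D")` fails.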
Prediction model
  Prefix and AS segment information
  The data-plane failure likelihood ratio
    P(Y=1|R;D): the conditional probability of a data-plane failure given a routing update R for prefix D
    Assuming the failure on each AS is independent
    x_i is the responsible AS in the history data
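The independence assumption above suggests one simple way to combine per-AS failure estimates into a score for a whole route. The sketch below uses a noisy-OR style combination (the route fails if at least one AS on the path fails); it is chosen for illustration and is not necessarily the formula used in the original model. The per-AS probabilities in `p_fail` are hypothetical inputs, for example estimated from how often each AS was the responsible AS in the history data.

```python
import math
from typing import Mapping, Sequence


def failure_probability(as_path: Sequence[str],
                        p_fail: Mapping[str, float],
                        default: float = 0.0) -> float:
    """Probability that at least one AS on the path fails.

    Assumes independent per-AS failures: 1 - prod(1 - p_i).
    ASes missing from `p_fail` fall back to `default`.
    """
    survive = math.prod(1.0 - p_fail.get(asn, default) for asn in as_path)
    return 1.0 - survive


# Hypothetical per-AS estimates.
estimates = {"A": 0.5, "B": 0.5}
score = failure_probability(["A", "B"], estimates)
```

A route through two ASes that each fail half the time scores 0.75 under this combination, so longer paths through unreliable ASes are penalized more heavily.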
Evaluation
  The decision threshold sets the trade-off between selectivity and sensitivity, which determines the false-positive and false-negative rates
  Receiver operating characteristic
  Evaluation results
    60% detection rate with 18% false positives
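The effect of the decision threshold can be seen by computing the detection rate and false-positive rate at a given cut-off; sweeping the threshold traces out the ROC curve. The sketch below uses synthetic scores and labels that are purely illustrative and do not reproduce the reported results; the function name is an assumption.

```python
from typing import Sequence, Tuple


def rates_at_threshold(scores: Sequence[float], labels: Sequence[bool],
                       threshold: float) -> Tuple[float, float]:
    """Return (detection rate, false-positive rate) at `threshold`.

    An update is predicted to fail when its score is >= threshold.
    """
    tp = fp = fn = tn = 0
    for score, failed in zip(scores, labels):
        predicted = score >= threshold
        if predicted and failed:
            tp += 1
        elif predicted and not failed:
            fp += 1
        elif failed:
            fn += 1
        else:
            tn += 1
    detection_rate = tp / (tp + fn) if tp + fn else 0.0
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    return detection_rate, false_positive_rate


# Synthetic predictions: score per update, and whether it actually failed.
scores = [0.9, 0.8, 0.4, 0.2]
labels = [True, False, True, False]
roc_points = [rates_at_threshold(scores, labels, t) for t in (1.0, 0.5, 0.3)]
```

Lowering the threshold catches more real failures at the cost of more false alarms, which is the trade-off the slide describes.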
Conclusion
  Developed an efficient framework for measuring and predicting data-plane failures caused by routing changes
  Identified patterns to accurately predict data-plane failures
  Provided suggestions for more intelligent route selection