On Understanding of Transient Interdomain Routing Failures Feng Wang, Lixin Gao, Jia Wang, and Jian Qiu Department of Electrical and Computer Engineering University of Massachusetts, Amherst MA AT&T Labs-research 180 Park Ave, Florham Park NJ 07869
Outline What are transient routing failures? When can transient routing failures occur? How long can transient routing failures last? Measurement results
Internet Routing Autonomous systems (ASes) –Internet Service Providers (ISPs) –Companies –Universities Intradomain Routing Protocols –Static Routing, OSPF, IS-IS Interdomain Routing Protocol –Border Gateway Protocol (BGP)
Long Convergence Delay Long convergence delay (Labovitz et al., TON 2001) –Bringing a route back ($T_{up}$): $T_{up} < \text{shortest path length} \times \text{MRAI}$ –Disconnecting a route ($T_{down}$): $T_{down} < \text{longest path length} \times \text{MRAI}$ Fail-over: rerouting from Path A to Path B –While Path B is being discovered, routers might experience transient routing failures, i.e., no route is available
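To give a feel for the magnitudes involved, the two bounds above can be evaluated for a small example. This is only an illustrative sketch: the function name `convergence_bounds` and the path lengths 3 and 8 are hypothetical, and the MRAI of 30 seconds is the eBGP default quoted later in the talk.

```python
def convergence_bounds(shortest_path_len, longest_path_len, mrai_seconds=30.0):
    """Return (t_up_bound, t_down_bound) in seconds.

    t_up_bound   = shortest path length x MRAI (bringing a route back)
    t_down_bound = longest path length x MRAI  (disconnecting a route)
    """
    if shortest_path_len > longest_path_len:
        raise ValueError("shortest path cannot be longer than longest path")
    t_up_bound = shortest_path_len * mrai_seconds
    t_down_bound = longest_path_len * mrai_seconds
    return t_up_bound, t_down_bound


# Hypothetical topology: shortest path of 3 AS hops, longest path of 8 AS hops.
t_up, t_down = convergence_bounds(3, 8)
print(f"T_up   < {t_up:.0f} s")
print(f"T_down < {t_down:.0f} s")
```

With these assumed inputs the bounds come to 90 seconds for bringing a route back and 240 seconds for disconnecting one, which shows how far apart the two bounds can be.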
An Example of Transient Routing Failure [Figure: AS1, AS2, and AS3 with destination d, showing traffic on the data plane, BGP updates labeled W:20 and A:10, and a BGP routing table losing reachability]
Our Contributions Identify transient routing failures –Sufficient conditions Bound transient routing failure duration
When Can Transient Routing Failures Occur? Two sufficient conditions under which a node must experience a transient routing failure (transient routing failure for sure). One sufficient condition under which a node may experience a transient routing failure (potential transient routing failure). [Figure: example topologies with node w]
When Can Transient Routing Failures Occur? (contd.) [Figure: example topology with labels w and A]
How Long Do Transient Routing Failures Last? [Figure: destination d, BGP updates labeled W:20 and A:10, and the MRAI timer]
MRAI Timers Minimum Route Advertisement Interval (MRAI) timer –Minimum amount of time that must elapse between routing updates –Applied to BGP announcements or withdrawals Default MRAI value –eBGP session: 30 seconds –iBGP session: 5 seconds
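The effect of the timer on update timing can be sketched in a few lines. This is a deliberately simplified model, not a router implementation: the class name `MraiTimer` is hypothetical, it tracks a single session, and it omits the jitter that real implementations apply.

```python
class MraiTimer:
    """Simplified MRAI rate limiter for one BGP session (assumed model)."""

    def __init__(self, interval_seconds):
        self.interval = interval_seconds
        self.last_sent = None  # time of the most recent update sent

    def next_send_time(self, now):
        """Earliest time at which an update that is ready at `now` may go out."""
        if self.last_sent is None or now - self.last_sent >= self.interval:
            return now
        return self.last_sent + self.interval

    def send(self, now):
        """Record an update as sent and return the time it actually leaves."""
        send_time = self.next_send_time(now)
        self.last_sent = send_time
        return send_time


# eBGP session with the 30-second default.
timer = MraiTimer(30.0)
print(timer.send(0.0))   # first update leaves immediately
print(timer.send(5.0))   # second update is held until the interval elapses
print(timer.send(70.0))  # timer has already expired, so no delay
```

In this model the update that becomes ready at 5 seconds is held until 30 seconds. That kind of delay on an announcement of an alternate route is what can prolong the period in which a downstream router has no route.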
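Upper Bound for Transient Routing Failure Duration Transient routing failure duration is bounded by $\min(d_u + d_u) \times \text{MRAI}$ [Figure: labels 0, u, and v with distances $d_u$; the indices distinguishing the two distance terms did not survive extraction]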
Transient Failures in a Typical BGP System A typical BGP system is one in which every router applies common routing policies. Routing policies are guided by commercial relationships between ASes: –Customer-to-provider –Peer-to-peer Common routing policies: –Import policies follow the prefer-customer routing policy. –Export policies follow the no-valley routing policy.
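The two policies can be written down compactly. The sketch below is one common reading of prefer-customer and no-valley, not code from the talk: the names `best_route` and `may_export` are hypothetical, the preference values 200 and 100 are illustrative only, and ranking peer and provider routes equally is an assumption.

```python
CUSTOMER, PEER, PROVIDER = "customer", "peer", "provider"

# Illustrative preference values (assumed): customer routes rank highest.
LOCAL_PREF = {CUSTOMER: 200, PEER: 100, PROVIDER: 100}


def best_route(routes):
    """Pick a route under prefer-customer import policy.

    `routes` is a list of (learned_from, as_path) pairs, where learned_from
    is the relationship of the neighbor that announced the route.
    Ties on preference are broken by shorter AS path.
    """
    if not routes:
        return None  # no route available: the router cannot reach the prefix
    return max(routes, key=lambda r: (LOCAL_PREF[r[0]], -len(r[1])))


def may_export(learned_from, export_to):
    """No-valley export policy.

    Routes learned from customers are exported to every neighbor; routes
    learned from peers or providers are exported to customers only.
    """
    if learned_from == CUSTOMER:
        return True
    return export_to == CUSTOMER


routes = [(PROVIDER, (7, 9)), (CUSTOMER, (3, 4, 5, 9))]
print(best_route(routes))          # the customer route wins despite a longer path
print(may_export(PEER, PROVIDER))  # False: this would create a valley
```

Because `may_export` hides peer and provider routes from other peers and providers, a router may not hold an alternate route at the moment its best route is withdrawn and must wait for a neighbor to announce one.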
Occurrence of Transient failures in a typical BGP system In a typical BGP system, transient failures are prevalent. –Tier-1 ASes can experience transient routing failures, where alternate routes come from their edge routers. –Non tier-1 ASes can experience transient routing failures, where alternate routes are obtained from other ASes.
Measuring Transient Failures within a Tier-1 AS BGP updates, BGP tables, and router configuration files were collected during July 2004. [Figure: percentage of transient failures among all routing failures that last less than 30 seconds] [Figure: cumulative distribution of transient failure duration]
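One simple way to extract failure durations from an update stream is sketched below. It is not the authors' measurement methodology: the function `failure_windows` and the event format are hypothetical, and the sketch assumes that a withdrawal leaves the router with no route to the prefix until the next announcement.

```python
def failure_windows(events, max_duration=None):
    """Return (start, end) intervals during which no route was available.

    `events` is a time-sorted list of (timestamp_seconds, kind) pairs for a
    single prefix at a single router, with kind "A" (announcement) or
    "W" (withdrawal). If `max_duration` is given, only windows strictly
    shorter than it are kept.
    """
    windows = []
    down_since = None
    for timestamp, kind in events:
        if kind == "W" and down_since is None:
            down_since = timestamp           # route lost
        elif kind == "A" and down_since is not None:
            windows.append((down_since, timestamp))  # route restored
            down_since = None
    if max_duration is not None:
        windows = [(s, e) for s, e in windows if e - s < max_duration]
    return windows


# Hypothetical update stream for one prefix.
events = [(0, "A"), (100, "W"), (112, "A"), (500, "W"), (900, "A")]
print(failure_windows(events))                   # all failure windows
print(failure_windows(events, max_duration=30))  # only failures under 30 s
```

On this made-up stream the first call yields two windows, of 12 and 400 seconds, and the second keeps only the 12-second one, matching the 30-second cutoff used in the first figure.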
Measuring Transient Failures (contd.) Transient failures in tier-2 ASes, measured using Oregon RouteViews BGP updates (July 2004)
Popularity of Prefixes Experiencing Transient Failures We aggregate the NetFlow data collected in the tier-1 AS during the week of 1/2/2005 to 1/8/2005. Transient routing failures can affect both popular and unpopular prefixes. [Figure: fraction of transient routing failures]
Conclusions Transient routing failures are prevalent in the Internet and can last for a significant period of time. The majority of transient failures occur under commonly applied routing policy settings. Both popular and unpopular prefixes can experience transient failures.
Thanks