Internet Routing Instability Craig Labovitz, G. Robert Malan, Farnam Jahanian University of Michigan Presented By Krishnanand M Kamath
Cause and Effect Define routing instability Rapid change of network reachability and topology information Causes Router Configuration Errors Transient Physical and data link problems Problems with leased line, router failures, high levels of congestion Software Configuration Errors Effects Very many – slew of effects
Effects Increased network latency and time to convergence Dropped and out-of-order delivery of packets Miserable end-to-end performance Loss of connectivity in national networks Route Flap Storm on routers with a route caching architecture and low-end CPUs: Pr(Cache Miss) increases, causing severe CPU load and memory problems; packet processing is delayed, including Keep-Alive packets; other routers flag the router as down and transmit updates; the down router reinitiates its peering sessions, triggering large state dump transmissions; yet more routers fail
Solutions Route Aggregation Reduces the overall number of networks visible in the core Requires cooperation between service providers Redundant connectivity to the internet – multi-homing Route Dampening Algorithms Not a panacea – legitimate announcements may be delayed Overall, Multi-homing exhibiting linear growth Internet topology growing increasingly less hierarchical Increasing topological complexity
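Why dampening can delay legitimate announcements is easier to see with a small sketch. The following is a minimal illustration of an exponentially decaying flap penalty, in the spirit of RFC 2439 route flap damping. The class name `DampenedRoute` and all numeric parameters are assumptions chosen for illustration, not values from the paper or from any vendor.

```python
class DampenedRoute:
    """Flap-penalty state for one route (illustrative parameters only)."""

    def __init__(self, penalty_per_flap=1000.0, suppress_limit=2000.0,
                 reuse_limit=750.0, half_life_s=900.0):
        self.penalty_per_flap = penalty_per_flap
        self.suppress_limit = suppress_limit
        self.reuse_limit = reuse_limit
        self.half_life_s = half_life_s
        self.penalty = 0.0
        self.last_update_s = 0.0
        self.suppressed = False

    def _decay(self, now_s):
        # Halve the accumulated penalty once per half-life of elapsed time.
        elapsed = now_s - self.last_update_s
        self.penalty *= 0.5 ** (elapsed / self.half_life_s)
        self.last_update_s = now_s

    def flap(self, now_s):
        # Each flap adds a fixed penalty; crossing the limit suppresses the route.
        self._decay(now_s)
        self.penalty += self.penalty_per_flap
        if self.penalty >= self.suppress_limit:
            self.suppressed = True

    def is_usable(self, now_s):
        # A suppressed route is reused only after the penalty decays far enough.
        self._decay(now_s)
        if self.suppressed and self.penalty < self.reuse_limit:
            self.suppressed = False
        return not self.suppressed


route = DampenedRoute()
for t in (0, 10, 20):        # three flaps in quick succession
    route.flap(t)
print(route.is_usable(30))   # suppressed shortly after the flaps
print(route.is_usable(2100)) # usable again once the penalty has decayed
```

With these assumed parameters, a route that flaps three times within 20 seconds stays suppressed for roughly half an hour, even if it became stable immediately afterwards. That is the sense in which a legitimate announcement may be delayed.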
Recall Updates Announcements New route New policy decision for an existing route Withdrawals Explicit – associated with a withdrawal message Implicit – existing route is replaced by announcement of new route
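The difference between explicit and implicit withdrawals can be sketched with a tiny per-peer route table. The names `Update` and `apply_update`, and the representation of a route as an (ASPATH, next hop) pair, are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Route = Tuple[Tuple[int, ...], str]  # (ASPATH, next hop), an assumed encoding


@dataclass(frozen=True)
class Update:
    peer: str
    prefix: str
    route: Optional[Route] = None  # None means an explicit withdrawal message


def apply_update(table: Dict[Tuple[str, str], Route], update: Update) -> str:
    """Apply one update to the table and describe what it did."""
    key = (update.peer, update.prefix)
    had_route = key in table
    if update.route is None:
        table.pop(key, None)
        return "explicit withdrawal" if had_route else "withdrawal of unknown route"
    table[key] = update.route
    # An announcement for a prefix that already has a route replaces it.
    return "implicit withdrawal" if had_route else "announcement"


table: Dict[Tuple[str, str], Route] = {}
first = Update("peer1", "192.0.2.0/24", ((64500, 64501), "198.51.100.1"))
second = Update("peer1", "192.0.2.0/24", ((64500, 64502), "198.51.100.1"))
print(apply_update(table, first))    # announcement
print(apply_update(table, second))   # implicit withdrawal
print(apply_update(table, Update("peer1", "192.0.2.0/24")))  # explicit withdrawal
```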
Types of Updates Inter-domain routing updates Forwarding Instability Legitimate topological changes that affect the paths on which data will be forwarded between AS’s Routing policy fluctuation Reflects changes in routing policy information that may not affect forwarding paths between AS’s Pathological Updates Redundant BGP info that reflects neither routing nor forwarding instability
Major Results Number of BGP updates is one or more orders of magnitude larger than expected. Routing information is dominated by pathological updates Instability and redundant updates exhibit a periodicity of 30 & 60 secs Instability and redundant updates show a correlation to network usage Instability is not dominated by a small set of AS or routes Discounting policy fluctuation and pathological behavior there remains a significant level of internet forwarding instability Specific architectural and protocol implementation changes in commercial internet routers through collaboration with vendors
Taxonomy Data Analyzed Sequences of BGP updates for each (prefix, peer) tuple Events Identified WADiff A route is explicitly withdrawn as it becomes unreachable and later replaced with an alternative route to the same destination. The alternative route differs in its ASPATH or nexthop attribute information. (Forwarding Instability) AADiff A route is implicitly withdrawn and replaced by an alternative route as the original route becomes unreachable, or a preferred alternative path becomes available. (Forwarding Instability)
Taxonomy(contd.) Events Identified(contd.) WADup A route is explicitly withdrawn and then re-announced as reachable. This may reflect transient topological failure, or it may represent a pathological oscillation. (Forwarding Instability or Pathological Behavior) AADup A route is implicitly withdrawn and replaced with a duplicate of the original route. Duplicate Route – is defined as a subsequent route announcement that does not differ in nexthop or ASPATH attribute information. (Pathological Behavior or Routing Policy Fluctuation) WWDup The repeated transmission of BGP withdrawals for a prefix that is currently unreachable. (Pathological Behavior)
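The five event classes can be sketched as a small state machine over the update sequence of each (prefix, peer) tuple. This is an illustrative reading of the definitions above, not the authors' analysis code. The function name `classify`, the tuple encoding of updates, and the extra labels "Announce" and "Withdraw" (for updates that do not complete one of the five events) are assumptions.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Route = Tuple[Tuple[int, ...], str]        # (ASPATH, next hop)
Key = Tuple[str, str]                      # (prefix, peer)
Event = Tuple[str, str, Optional[Route]]   # route is None for a withdrawal


def classify(updates: Iterable[Event]) -> List[str]:
    """Label each update with the taxonomy event it completes."""
    last_route: Dict[Key, Route] = {}  # last announced route, kept across withdrawals
    reachable: Dict[Key, bool] = {}    # absent means the key was never seen
    labels = []
    for prefix, peer, route in updates:
        key = (prefix, peer)
        prev = last_route.get(key)
        up = reachable.get(key)
        if route is None:
            # A withdrawal for an already-withdrawn prefix is a duplicate.
            label = "WWDup" if up is False else "Withdraw"
            reachable[key] = False
        else:
            if prev is None:
                label = "Announce"     # first route seen for this key
            elif up:
                label = "AADup" if route == prev else "AADiff"
            else:
                label = "WADup" if route == prev else "WADiff"
            last_route[key] = route
            reachable[key] = True
        labels.append(label)
    return labels


r1 = ((64500, 64501), "198.51.100.1")
r2 = ((64500, 64502), "198.51.100.1")
p, peer = "192.0.2.0/24", "peer1"
seq = [(p, peer, r1), (p, peer, r1), (p, peer, r2), (p, peer, None),
       (p, peer, None), (p, peer, r2), (p, peer, None), (p, peer, r1)]
print(classify(seq))
```

Comparing only ASPATH and next hop follows the duplicate-route definition on this slide; a stricter comparison would reclassify some AADup events as AADiff.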
Methodology Data Collected: BGP routing messages Time Period: Over the course of 9 months starting Jan 96 Where: Five of the major U.S. network exchange points Tool: Unix based route servers, Multi-threaded Routing Toolkit (MRT)
Gross Observations We Expect, Instability proportional to (globally visible addresses, total number of available paths) We Observe, For 45,000 prefixes and 1500 paths: 3 to 6 million updates per day
Pathological Behavior Disturbing behaviors, Most of the BGP updates are entirely pathological (WWDup) Disproportionate effect that a single service provider can have on global routing Causal relationship between manufacturer of a router and level of pathological behavior Routing updates have a regular, specific periodicity of either 30 or 60 seconds Persistence of pathological behavior is under five minutes
Origins of Pathologies Stateless BGP: Withdrawals are sent for every explicitly and implicitly withdrawn prefix; no state is kept on information advertised to peers Plausible Explanations, CSU Timer problems Unjittered 30 second interval timer, self-synchronization Misconfigured interaction of IGP/BGP protocols Router vendor software bugs Unconstrained routing policies
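How a stateless implementation produces duplicate withdrawals can be sketched in a few lines. The function `withdrawals_to_send` and the peer names are hypothetical. The sketch assumes a stateless router withdraws a lost prefix toward every peer, while a stateful one consults a record of the peers it actually advertised the prefix to.

```python
def withdrawals_to_send(peers, advertised_to, stateless):
    """Return the peers that receive a withdrawal for one lost prefix."""
    if stateless:
        # No per-peer record of past advertisements: withdraw everywhere.
        return sorted(peers)
    # Stateful: only peers that were actually told about the route.
    return sorted(p for p in peers if p in advertised_to)


peers = {"A", "B", "C", "D"}
advertised_to = {"A"}
print(withdrawals_to_send(peers, advertised_to, stateless=True))   # ['A', 'B', 'C', 'D']
print(withdrawals_to_send(peers, advertised_to, stateless=False))  # ['A']
```

In this example the three extra withdrawals reach peers that never held the route from this router, which is how they would register as redundant withdrawals at a collection point.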
Analysis of Instability Instability as the sum of AADiff, WADiff and WADup updates
Fine-grained Instability Statistics There is no correlation between the size of an AS and its proportion of the instability statistics.
Fine-grained Instability Statistics No single AS or prefix consistently dominates the instability statistics Instability is evenly distributed across routes
Temporal Properties of Instability Plausible causes for the periodicity, Routing software timers, self synchronization, and routing loops CSU handshaking timeouts Flaw in routing protocol
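The effect of timer jitter on synchronization can be illustrated with a toy simulation. It shows only one narrow point: routers whose fixed 30-second timers are ever aligned stay aligned, while jittered timers drift apart. It does not model the coupling that causes routers to become aligned in the first place. The function name, the seed, and the 0.75 to 1.0 jitter range are assumptions made for the sketch.

```python
import random


def firing_times(n_routers, rounds, interval_s=30.0, jitter=False, seed=0):
    """Time of each router's last timer expiry after a number of rounds."""
    rng = random.Random(seed)
    times = [0.0] * n_routers  # all routers start aligned at t = 0
    for _ in range(rounds):
        for i in range(n_routers):
            # Jitter scales each interval by a random factor (assumed range).
            factor = rng.uniform(0.75, 1.0) if jitter else 1.0
            times[i] += interval_s * factor
    return times


fixed = firing_times(5, 10)
jittered = firing_times(5, 10, jitter=True)
print(len(set(fixed)))     # number of distinct expiry instants without jitter
print(len(set(jittered)))  # number of distinct expiry instants with jitter
```

Without jitter every router fires at the same instants, so their updates arrive as a burst every 30 seconds; with jitter the expiry times spread out over the interval.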
Origins of Internet Routing Instability Craig Labovitz, G. Robert Malan, Farnam Jahanian University of Michigan
Introduction We observed, Several orders of magnitude more routing updates Large number of duplicate routing messages Unexpected frequency components between instability events Extend earlier analysis by, Identifying the origins of many of the pathological behaviors Impact of specific commercial router software changes suggested Additional router software changes that can decrease updates exchanged by an additional 30 percent or more
Major Results Volume of inter-domain routing updates has decreased by an order of magnitude since April The majority of BGP messages consists of redundant announcements A growing proportion of instability stems from specific changes in Internet architecture coupled with limitations in router software and algorithms. Instability is not disproportionately dominated by prefixes of specific lengths. Persistently oscillating routes dominate the BGP traffic generated by a few Internet providers. Experimentally confirmed a number of origins of pathological routing behavior postulated in the earlier work.
Analysis of Gross Trends Note, Dramatic decrease in the number of withdrawals Number of announcements has doubled over 28 month period Growth of BGP announcements disproportional to any corresponding increase in the number of routing table entries
Taxonomy Analyze sequences of BGP updates for each (prefix, peer) tuple Identify the events, AADup: A route is implicitly withdrawn and replaced with a duplicate of the original route. We define a duplicate route as a subsequent route announcement that does not differ in any BGP path attribute information. AADiff: A route is implicitly withdrawn and replaced by an alternative route as the original route becomes unreachable, or a preferred alternative path becomes available. Tup and Tdown: Fluctuation in the reachability for a given prefix Tup: a currently unreachable prefix is announced as reachable and transitions up Tdown: an announced route is withdrawn and transitions down
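The revised taxonomy can be sketched the same way, this time comparing every path attribute and tracking reachability transitions. The function `classify_transitions`, the dictionary encoding of path attributes, and the label "Other" (for a withdrawal of a prefix that is already unreachable) are assumptions of this sketch. It also treats the first announcement ever seen for a tuple as a Tup, which the slide does not specify.

```python
def classify_transitions(updates):
    """Label updates as Tup, Tdown, AADup or AADiff per (prefix, peer) tuple.

    Each update is (prefix, peer, attrs); attrs is a dict of path
    attributes, or None for a withdrawal.
    """
    current = {}  # (prefix, peer) -> attributes of the currently announced route
    labels = []
    for prefix, peer, attrs in updates:
        key = (prefix, peer)
        prev = current.get(key)
        if attrs is None:
            label = "Tdown" if prev is not None else "Other"
            current.pop(key, None)
        else:
            if prev is None:
                label = "Tup"
            elif attrs == prev:
                label = "AADup"   # no path attribute differs
            else:
                label = "AADiff"  # at least one path attribute differs
            current[key] = attrs
        labels.append(label)
    return labels


base = {"as_path": (64500, 64501), "next_hop": "198.51.100.1", "med": 10}
changed_med = dict(base, med=20)
p, peer = "192.0.2.0/24", "peer1"
seq = [(p, peer, base), (p, peer, dict(base)), (p, peer, changed_med),
       (p, peer, None), (p, peer, None), (p, peer, base)]
print(classify_transitions(seq))
```

Note that a change in MED alone counts as an AADiff here, whereas under the earlier definition (ASPATH and next hop only) the same pair of announcements would have been an AADup.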
Analysis of Update Categories AADup Behavior stems from: 1. Non-transitive attribute filtering 2. Combination of BGP minimum advertising timer with stateless BGP
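The second cause can be illustrated with a sketch of what happens when the minimum advertisement timer expires. Suppose a route changes from R1 to R2 and back to R1 within one timer interval. The function `flush_at_timer` and the route labels are hypothetical, and the sketch assumes a stateless router announces its current route whenever anything changed during the interval.

```python
def flush_at_timer(last_sent, changes, stateless):
    """Return the announcement emitted at timer expiry, or None if nothing is sent.

    last_sent: the route the peer already holds.
    changes:   routes selected during the interval, in order.
    """
    if not changes:
        return None
    current = changes[-1]
    if stateless:
        # No memory of last_sent, so the current route is always announced.
        return current
    # Stateful: suppress the announcement if the peer already has this route.
    return current if current != last_sent else None


print(flush_at_timer("R1", ["R2", "R1"], stateless=True))   # R1, a duplicate (AADup)
print(flush_at_timer("R1", ["R2", "R1"], stateless=False))  # None, nothing sent
```

The stateless router re-announces R1 even though the peer already holds it; the peer observes an AADup.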
Analysis of AADiffs Note Low percentage of ASPath AADiffs Growth in number of origin AADiffs related to architecture and policy issues Growth in number of community AADiffs reflects its recent adoption by many ISPs Oscillations in MED due to the IBGP mapped MED policy at two service providers
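Breaking AADiffs down by the attribute that changed amounts to diffing consecutive announcements. A minimal sketch follows; the attribute names (`as_path`, `origin`, `med`, `community`) and both function names are assumptions, and the sample values are made up for illustration.

```python
from collections import Counter


def differing_attributes(old, new):
    """Names of path attributes whose values differ between two announcements."""
    return sorted(k for k in set(old) | set(new) if old.get(k) != new.get(k))


def tally_aadiffs(pairs):
    """Count, per attribute, how many (old, new) announcement pairs differ in it."""
    counts = Counter()
    for old, new in pairs:
        counts.update(differing_attributes(old, new))
    return counts


old = {"as_path": (64500, 64501), "origin": "IGP", "med": 10}
med_changed = dict(old, med=20)
community_added = dict(old, community="64500:100")

print(differing_attributes(old, med_changed))      # ['med']
print(differing_attributes(old, community_added))  # ['community']
print(tally_aadiffs([(old, med_changed), (med_changed, old), (old, community_added)]))
```

A MED value that oscillates back and forth, as in the second pair, contributes one MED AADiff per oscillation while the ASPATH count stays at zero.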
IBGP Mapped MED
Frequency Recall, Frequency defined as inverse of inter-arrival time between routing updates Predominant frequencies have a 30 sec and 60 sec periodicity Cause, Frequency components stem from a fixed minimum BGP advertisement timer used by at least one router vendor
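The frequency definition can be turned into a short computation: take the gaps between consecutive update timestamps, bin them, and invert the dominant gap. The function name, the one-second bin width, and the sample timestamps are assumptions; the timestamps are made up and would in practice come from the updates of a single (prefix, peer) tuple.

```python
from collections import Counter


def inter_arrival_histogram(timestamps, bin_s=1):
    """Histogram of gaps between consecutive updates, rounded to bin_s seconds."""
    ts = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    return Counter(int(round(gap / bin_s)) * bin_s for gap in gaps)


# Made-up arrival times (seconds) for one (prefix, peer) tuple.
arrivals = [0, 30, 60, 120, 150]
hist = inter_arrival_histogram(arrivals)
dominant_gap, count = hist.most_common(1)[0]
print(hist)                # gaps of 30 s occur three times, 60 s once
print(1.0 / dominant_gap)  # dominant frequency in Hz
```

A fixed, unjittered advertisement timer would show up in such a histogram as sharp peaks at the timer value and its multiples.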
Prefix Length Statistics
Conclusions Volume of routing update messages decreased by an order of magnitude by specific software changes on the majority of core Internet backbone routers. Software changes successfully suppressed the generation of pathological withdrawals. Proposed new software changes that may reduce instability levels by an additional thirty percent. Instability is well distributed across both autonomous system and prefix space. No single service provider or set of network destinations appears to be at fault.