Download presentation
Presentation is loading. Please wait.
2
1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University of Michigan) Jennifer Rexford (Princeton University) Jia Wang (AT&T Labs Research)
3
2 Motivation C BR C C C AS1 AS2AS3 destination A B C D Failure Disruption Congestion Mitigation AS4 source A backbone network is vulnerable to routing changes that occur in other domains.
4
3 Goal Identify important routing anomalies Lost reachability Persistent flapping Large traffic shifts Contributions: Build a tool to identify a small number of important routing disruptions from a large volume of raw BGP updates in real time. Use the tool to characterize routing disruptions in an operational network
5
4 Interdomain Routing: Border Gateway Protocol Prefix-based: one route per prefix Path-vector: list of ASes in the path Incremental: every update indicates a change Policy-based: local ranking of routes C BR C C C C C C C “I can reach 12.34.158.0/24” “I can reach 12.34.158.0/24 via AS 1” AS 1 AS 2 12.34.158.5 data traffic AS 3 iBGPeBGP 12.34.158.0/24
6
5 Capturing Routing Changes C BR C CPE BGP Monitor C BR C C C C C C C C C iBGP eBGP Updates Best routes A large operational network (8/16/2004 – 10/10-2004)
7
6 Challenges Large volume of BGP updates Millions daily, very bursty Too much for an operator to manage Different from root-cause analysis Identify changes and their effects Focus on actionable events rather than diagnosis Diagnose causes in/near the AS
8
7 System Architecture Event Classification Event Classification “Typed” Events EE BR EE EE BGP Updates (10 6 ) BGP Update Grouping BGP Update Grouping Events Persistent Flapping Prefixes (10 1 ) (10 5 ) Event Correlation Event Correlation Clusters Frequent Flapping Prefixes (10 3 ) (10 1 ) Traffic Impact Prediction Traffic Impact Prediction EE BR EE EE Large Disruptions Netflow Data (10 1 ) From millions of updates to a few dozen reports
9
8 Grouping BGP Update into Events Challenge: A single routing change leads to multiple update messages affects routing decisions at multiple routers Approach: Group together all updates for a prefix with inter-arrival < 70 seconds Flag prefixes with changes lasting > 10 minutes. BGP Update Grouping BGP Update Grouping EE BR EE EE BGP Updates Events Persistent Flapping Prefixes
10
9 Grouping Thresholds Based on our understanding of BGP and data analysis Event timeout: 70 seconds 2 * MRAI timer + 10 seconds 98% inter-arrival time < 70 seconds Convergence timeout: 10 minutes BGP usually converges within a few minutes 99.9% events < 10 minutes
11
10 Persistent Flapping Prefixes Types of persistent flapping Conservative damping parameters (78.6%) Protocol oscillations due to MED (18.3%) Unstable interfaces or BGP sessions (3.0%) A surprising finding: 15.2% of updates were caused by persistent-flapping prefixes even though flap damping is enabled.
12
11 Example: Unstable eBGP Session ISP Peer Customer E C E B E A E D p Flap damping parameters is session-based Damping not implemented for iBGP sessions
13
12 Event Classification Challenge: Major concerns in network management Changes in reachability Heavy load of routing messages on the routers Change of flow of the traffic through the network Event Classification Event Classification Events “Typed” Events, e.g., Loss/Gain of Reachability Solution: classify events by severity of their impact
14
13 Event Category – “No Disruption” ISP E A p E B E C E E AS 2 E D AS 1 No Traffic Shift “No Disruption”: no border routers have any traffic shift. (50.3%)
15
14 Event Category – “Internal Disruption” ISP E A p E B E C E E AS 2 E D AS 1 Internal Traffic Shift “Internal Disruption”: all traffic shifts are internal. (15.6%)
16
15 Event Category – “Single External Disruption” ISP E A p E B E C E E AS 2 E D AS 1 external Traffic Shift “Single External Disruption”: only one of the traffic shifts is external (20.7%)
17
16 Statistics on Event Classification EventsUpdates No Disruption50.3%48.6% Internal Disruption15.6%3.4% Single External Disruption20.7%7.9% Multiple External Disruption7.4%18.2% Loss/Gain of Reachability6.0%21.9% First 3 categories have significant day-to-day variations Updates per event depends on the type of events and the number of affected routers
18
17 Event Correlation Challenge: A single routing change affects multiple destination prefixes Event Correlation Event Correlation “Typed” Events Clusters Solution: group the same-type, close-occurring events
19
18 EBGP Session Reset Caused most of “single external disruption” events Check if the number of prefixes using that session as the best route changes dramatically Validation with Syslog router report (95%) time Number of prefixes session failure session recovery
20
19 Hot-Potato Changes Hot-Potato Changes Caused “internal disruption” events Validation with OSPF measurement (95%) [Teixeira et al – SIGMETRICS’ 04] ISP P E A E B E C 10119 “Hot-potato routing” = route to closest egress point
21
20 Traffic Impact Prediction Challenge: Routing changes have different impacts on the network which depends on the popularity of the destinations Traffic Impact Prediction Traffic Impact Prediction EE BR Clusters Large Disruptions Netflow Data EE BR EE Solution: weigh each cluster by traffic volume
22
21 Traffic Impact Prediction Traffic weight Per-prefix measurement from netflow 10% prefixes accounts for 90% of traffic Traffic weight of a cluster the sum of “traffic weight” of the prefixes A small number of large clusters have large traffic weight Mostly session resets and hot-potato changes
23
22 Performance Evaluation Memory Static memory: “current routes”, 600 MB Dynamic memory: “clusters”, 300 MB Speed 99% of intervals of 1 second of updates can be process within 1 second Occasional execution lag Every interval of 70 seconds of updates can be processed within 70 seconds Measurements were based on 900MHz CPU
24
23 Conclusion BGP troubleshooting system Fast, online fashion Operators’ concerns (reachability, flapping, traffic) Significant information reduction millions of update a few dozens of large disruptions Uncovered important network behavior Hot-Potato changes Session resets Persistent-flapping prefixes
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.