Download presentation
Presentation is loading. Please wait.
Published byDeborah Miles Modified over 9 years ago
1
1 BGP Anomaly Detection in an ISP Jian Wu (U. Michigan) Z. Morley Mao (U. Michigan) Jennifer Rexford (Princeton) Jia Wang (AT&T Labs) http://www.cs.princeton.edu/~jrex/papers/nsdi05-jian.pdf
2
2 Goal Identify important anomalies Lost reachability Persistent flapping Large traffic shifts Contributions: Build a tool to identify a small number of important routing disruptions from a large volume of raw BGP updates in real time. Use the tool to characterize routing disruptions in an operational network
3
3 Capturing Routing Changes C BR C CPE BGP Monitor C BR C C C C C C C C C iBGP eBGP Updates Best routes Large operational network (8/16/2004 – 10/10-2004)
4
4 Challenges Large volume of BGP updates Millions daily, very bursty Too much for an operator to manage Different than root-cause analysis Identify changes and their effects Focus on actionable events Diagnose causes only in/near the AS
5
5 System Architecture Event Classification Event Classification “Typed” Events EE BR EE EE BGP Updates (10 6 ) BGP Update Grouping BGP Update Grouping Events Persistent Flapping Prefixes (10 1 ) (10 5 ) Event Correlation Event Correlation Clusters Frequent Flapping Prefixes (10 3 ) (10 1 ) Traffic Impact Prediction Traffic Impact Prediction EE BR EE EE Large Disruptions Netflow Data (10 1 )
6
6 Grouping BGP Update into Events Challenge: A single routing change leads to multiple update messages affects routing decisions at multiple routers Solution: Group all updates for a prefix with inter- arrival < 70 seconds Flag prefixes with changes lasting > 10 minutes. BGP Update Grouping BGP Update Grouping EE BR EE EE BGP Updates Events Persistent Flapping Prefixes
7
7 Grouping Thresholds Based on data analysis and our understanding of BGP Event timeout: 70 seconds 2 * MRAI timer + 10 seconds 98% inter-arrival time < 70 seconds Convergence timeout: 10 minutes BGP usually converges within minutes 99.9% events < 10 minutes
8
8 Persistent Flapping Prefixes Causes of persistent flapping Conservative damping parameters (78.6%) Protocol oscillations due to MED (18.3%) Unstable interface or BGP session (3.0%) Surprising finding: 15.2% of updates were caused by persistent flapping prefixes, even though flap damping was enabled!
9
9 Example: Unstable eBGP Session ISP Peer Customer E C E B E A E D p Flap damping parameters are session-based Damping not implemented for iBGP sessions
10
10 Event Classification Challenge: Major concerns in network management Changes in reachability Heavy load of routing messages on the routers Change of flow of traffic through the network Event Classification Event Classification Events “Typed” Events Solution: classify events by severity of their impacts
11
11 Event Category – “No Disruption” ISP E A p E B E C E E AS 2 E D AS 1 No Traffic Shift “No Disruption”: each of the border routers has no traffic shift. (50.3%)
12
12 Event Category – “Internal Disruption” ISP E A p E B E C E E AS 2 E D AS 1 Internal Traffic Shift “Internal Disruption”: all of the traffic shifts are internal traffic shift. (15.6%)
13
13 Event Category – “Single External Disruption” ISP E A p E B E C E E AS 2 E D AS 1 external Traffic Shift “Single External Disruption”: only one of the traffic shifts is external traffic shift. (20.7%)
14
14 Statistics on Event Classification EventsUpdates No Disruption50.3%48.6% Internal Disruption15.6%3.4% Single External Disruption20.7%7.9% Multiple External Disruption7.4%18.2% Loss/Gain of Reachability6.0%21.9% First 3 categories have significant variations from day to day Updates per event depends on the type of events and the number of affected routers
15
15 Event Correlation Challenge: A single routing change affects multiple destination prefixes Event Correlation Event Correlation “Typed” Events Clusters Solution: group events of same type that occur close in time
16
16 EBGP Session Reset Caused most “single external disruption” events Check if the number of prefixes using that session as the best route changes dramatically Validation with Syslog router report (95%) time Number of prefixes session failure session recovery
17
17 Hot-Potato Changes Hot-Potato Changes Caused “internal disruption” events Validation with OSPF measurement (95%) [ Teixeira et al – SIGMETRICS’ 04] ISP P E A E B E C 10119 “Hot-potato routing” = route to closest egress point
18
18 Traffic Impact Prediction Challenge: Routing changes have different impacts on the network which depends on the popularity of the destinations Traffic Impact Prediction Traffic Impact Prediction EE BR Clusters Large Disruptions Netflow Data EE BR EE Solution: weigh each cluster by traffic volume
19
19 Traffic Impact Prediction Traffic weight Per-prefix measurement from Netflow 10% prefixes accounts for 90% of traffic Traffic weight of a cluster Sum of “traffic weight” of the prefixes A few clusters have large traffic weight Mostly session resets & hot-potato changes
20
20 Performance Evaluation Memory Static memory: “current routes”, 600 MB Dynamic memory: “clusters”, 300 MB Speed 99% of intervals of 1 second of updates can be process within 1 second Occasional execution lag Every interval of 70 seconds of updates can be processed within 70 seconds Measurements were based on 900MHz CPU
21
21 Conclusion BGP anomaly detection Fast, online fashion Operator concerns (reachability, flapping, traffic) Significant information reduction Uncovered important network behaviors Persistent flapping prefixes Hot-potato changes Session resets and interface failures
22
22 Detecting Peering Violations Consistent export requirement Peer should advertise prefixes at all peering points, with the same AS path length Allows the AS to do hot-potato routing Detecting violations Using iBGP feeds from the border routers Some inference tricks to identify inconsistencies Results of the study http://www.nanog.org/mtg-0410/feamster.html http://www.cs.princeton.edu/~jrex/papers/imc04.pdf
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.