1 Limiting the Impact of Failures on Network Performance
Joint work with Supratik Bhattacharyya and Christophe Diot
High Performance Networking Group, 25 Feb. 2004
Yashar Ganjali, Computer Systems Lab, Stanford University
yganjali@stanford.edu
http://www.stanford.edu/~yganjali
2 Motivation
The core of the Internet consists of several large networks (IP backbones).
IP backbones are carefully provisioned to guarantee low latency and jitter for packet delivery.
Failures occur on a daily basis as a result of physical-layer malfunctions, router hardware/software failures, maintenance, human error, and more.
Failures affect the quality of service delivered to backbone customers.
3 Outline
Background: Sprint's IP backbone; data
Impact metrics: time-based metrics; link-based metrics; measurements
Reducing the impact: identifying critical failures; cause analysis; reducing critical failures
4 Background – Sprint's IP backbone
The IP layer operates above DWDM with SONET framing.
The IS-IS protocol is used to route traffic inside the network.
IP-level restoration: when an IP link fails, all routers in the network independently compute a new path around the failure.
There is no protection in the underlying optical infrastructure.
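The IP-level restoration step above can be sketched as a shortest-path recomputation on the topology with the failed link removed. This is only an illustration of the idea, not Sprint's IS-IS implementation; the four-router topology and link weights are made up.

```python
import heapq

def shortest_path(graph, src, dst):
    """Dijkstra over an adjacency dict {node: {neighbor: weight}}; returns (cost, path)."""
    pq = [(0, src, [src])]
    seen = set()
    while pq:
        cost, node, path = heapq.heappop(pq)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nbr, w in graph.get(node, {}).items():
            if nbr not in seen:
                heapq.heappush(pq, (cost + w, nbr, path + [nbr]))
    return float("inf"), []

def fail_link(graph, u, v):
    """Return a copy of the topology with link (u, v) removed in both directions."""
    g = {n: dict(nbrs) for n, nbrs in graph.items()}
    g[u].pop(v, None)
    g[v].pop(u, None)
    return g

# Toy topology with hypothetical IS-IS metrics.
topo = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "D": 1},
    "C": {"A": 5, "D": 1},
    "D": {"B": 1, "C": 1},
}
before = shortest_path(topo, "A", "D")                       # goes via B
after = shortest_path(fail_link(topo, "A", "B"), "A", "D")   # rerouted via C
```

Every router runs this recomputation independently after the failure is flooded, so the new path appears without any optical-layer protection.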
5 Data
IS-IS Link State PDU logs: collected by passive listeners on Sprint's North American backbone, Feb. 1, 2003 to Jun. 30, 2003.
SNMP logs: link loads recorded once every 5 minutes.
SONET-layer alarms: correspond to minor and major problems in the optical layer. We are only interested in two alarms: SLOS and SLOS Cleared.
6 Link Failures in Sprint’s IP Backbone – 9408 Failures
7 Inter-POP vs. Intra-POP
[figure: example POP topology with routers ANA-1 through ANA-4]
9 Inter-POP Link Failures in Sprint’s IP Backbone
10 Two Perspectives
For a given impact metric:
Time-based analysis: measure the impact of failures on the given metric as a function of time.
Link-based analysis: measure the impact of failures on the given metric as a function of the failing links.
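The two perspectives amount to two groupings of the same failure log. A minimal sketch, using a made-up log of (timestamp, link, rerouted traffic) records; field names and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical failure log: (timestamp_sec, link, rerouted_traffic_mbps).
failures = [
    (10, "L1", 100), (12, "L2", 40), (70, "L1", 80), (75, "L3", 10),
]

def time_based(records, bin_size=60):
    """Impact metric as a function of time: sum the metric per time bin."""
    out = defaultdict(float)
    for t, _link, v in records:
        out[t // bin_size] += v
    return dict(out)

def link_based(records):
    """Impact metric as a function of the failing link: sum the metric per link."""
    out = defaultdict(float)
    for _t, link, v in records:
        out[link] += v
    return dict(out)

by_time = time_based(failures)
by_link = link_based(failures)
```

The same metric definitions below (O-D pairs, BGP prefixes, rerouted traffic, ...) can be plugged into either grouping.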
11 Time-based Impact Metrics
1. Number of simultaneous link failures
2. Number of affected O-D pairs
3. Number of affected BGP prefixes
4. Path unavailability
5. Total rerouted traffic
6. Maximum load
12 Number of Simultaneous Failures
15 Number of Affected O-D Pairs
[figure: example topology with nodes A–F]
16 Number of Affected O-D Pairs
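The affected O-D pairs of a failure are the origin-destination pairs whose current route traverses the failed link. A minimal sketch with a made-up route table; in the study the routes come from IS-IS shortest paths.

```python
# Toy route table: each O-D pair mapped to the set of links its path uses
# (hypothetical routes for illustration).
routes = {
    ("A", "D"): {("A", "B"), ("B", "D")},
    ("A", "E"): {("A", "B"), ("B", "E")},
    ("C", "F"): {("C", "F")},
}

def affected_od_pairs(routes, failed_link):
    """O-D pairs whose route traverses the failed link (direction-insensitive)."""
    u, v = failed_link
    return {od for od, links in routes.items()
            if (u, v) in links or (v, u) in links}

impacted = affected_od_pairs(routes, ("B", "A"))  # failure of link A-B
```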
18 Number of Affected BGP Prefixes
20 Path Unavailability
[figure: example topology with nodes A–F]
21 Path Unavailability
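One way to compute path unavailability is as the fraction of the observation window during which a path was down, with overlapping downtime intervals merged first. A minimal sketch under that reading; the interval data is made up.

```python
def unavailability(down_intervals, window):
    """Fraction of the observation window during which the path was down.
    Overlapping or touching downtime intervals are merged before summing."""
    merged = []
    for start, end in sorted(down_intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return sum(e - s for s, e in merged) / window

# Two overlapping outages plus one isolated outage over a 100-unit window.
u = unavailability([(0, 10), (5, 20), (50, 60)], window=100)
```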
23 Total Rerouted Traffic
26 Maximum Load Throughout the Network
96% of link failures were not followed by an immediate change in maximum load.
28 Number of Failures per Link
29 Number of Affected O-D Pairs per Link
30 Number of Affected BGP Prefixes per Link
31 Path Coverage
[figure: example topology with nodes A–F]
32 Path Coverage of Links
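Path coverage of a link can be read as the fraction of O-D pairs whose route traverses that link. A minimal sketch with a made-up route table; in the study routes come from IS-IS shortest paths.

```python
# Toy route table: each O-D pair mapped to the set of links its path uses
# (hypothetical routes for illustration).
routes = {
    ("A", "D"): {("A", "B"), ("B", "D")},
    ("A", "E"): {("A", "B"), ("B", "E")},
    ("C", "F"): {("C", "F")},
}

def path_coverage(routes, link):
    """Fraction of O-D pairs whose route traverses the given link."""
    u, v = link
    hit = sum(1 for links in routes.values()
              if (u, v) in links or (v, u) in links)
    return hit / len(routes)

cov = path_coverage(routes, ("A", "B"))  # 2 of the 3 routes use link A-B
```

A link with high path coverage is one whose failure disconnects or reroutes a large share of O-D pairs, which is why it feeds into the link-based criticality analysis.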
33 Total Rerouted Traffic on a Link
34 Peak Factor of a Link
35 Link-based Impact Metrics
1. Number of link failures
2. Number of affected O-D pairs
3. Number of affected BGP prefixes
4. Path coverage
5. Total rerouted traffic
6. Peak factor
37 Critical Failures
For each time-based metric: removing failures occurring during 1–5% of the time improves the metric by a factor of at least 5.
For each link-based metric: removing failures on 1–7% of the links improves the metric by a factor of at least 3.
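One way to quantify "improves by a factor of k" is the ratio of the total metric before and after discarding the worst periods (or links). A minimal sketch under that reading; the sample values are made up.

```python
def improvement_factor(values, top_fraction):
    """Ratio of the total metric before vs. after discarding the worst
    top_fraction of periods (or links), i.e. how much removing the
    critical few improves the aggregate."""
    values = sorted(values, reverse=True)
    k = max(1, int(len(values) * top_fraction))
    total = sum(values)
    remaining = sum(values[k:])
    return total / remaining if remaining else float("inf")

# A heavy-tailed toy distribution: two periods dominate the metric.
f = improvement_factor([100, 90, 5, 4, 3, 2, 1, 1, 1, 1], top_fraction=0.2)
```

With a heavy-tailed distribution like the one measured on the backbone, removing a small top fraction yields a large improvement factor.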
38 Critical Time Periods
39 Critical Links
Any link that experiences a critical failure is called a critical link. We are interested in fixing such links.
40 Correlation of Critical Sets
41 Correlation of the Critical Sets

Metric                      Size    1     2     3     4     5     6     7     8     9     10
1) Simultaneous failures     11     -   0.38  0.33  0.27  0.23  0.11  0.13  0.08  0.15  0.05
2) # of O-D pairs             9     -     -   0.37  0.21  0.25  0.12  0.14  0.06  0.09  0.06
3) # of BGP prefixes          6     -     -     -   0.18  0.32  0.09  0.05  0.10  0.07  0.03
4) Path unavailability        5     -     -     -     -   0.41  0.14  0.11  0.08  0.12  0.04
5) Total rerouted traffic     6     -     -     -     -     -   0.09  0.11  0.09  0.08
6) # of failures              2     -     -     -     -     -     -   0.29  0.31  0.25  0.17
7) # of O-D pairs             3     -     -     -     -     -     -     -   0.29  0.30  0.18
8) # of BGP prefixes          2     -     -     -     -     -     -     -     -   0.13  0.19
9) Path coverage              6     -     -     -     -     -     -     -     -     -   0.08
10) Total rerouted traffic    1     -     -     -     -     -     -     -     -     -     -

Overall, 23% of all links are critical.
42 Cause Analysis
Markopoulou et al. have used IS-IS update messages to classify link failures into the following categories [MIB+04]:
Maintenance
Unplanned:
  Shared failures (router-related, optical-related, unspecified)
  Individual failures (about 70% of all unplanned failures)
43 Matching SLOS Alarms with IP Link Failures
[figure: timeline of an IP link failure with the corresponding SLOS alarm (~20 ms) and SLOS Cleared alarm (~12 s)]
58% of all link failures are due to optical-layer problems.
84% of critical failures are due to optical-layer problems.
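The matching itself can be sketched as pairing each IP link failure with a SLOS alarm on the same link that fired shortly before it. The window value and record layout below are illustrative, not the study's exact procedure.

```python
def match_failures_to_alarms(failures, alarms, window):
    """Pair each IP link failure (link, time) with a SLOS alarm on the
    same link that fired at most `window` seconds before the failure."""
    matched = {}
    for link, t_fail in failures:
        for a_link, t_alarm in alarms:
            if a_link == link and 0 <= t_fail - t_alarm <= window:
                matched[(link, t_fail)] = t_alarm
                break
    return matched

# Hypothetical records: the L1 failure follows its SLOS alarm by 20 ms,
# the L2 failure has no matching alarm (an IP- or router-level cause).
failures = [("L1", 100.02), ("L2", 300.0)]
alarms = [("L1", 100.0), ("L3", 299.9)]
m = match_failures_to_alarms(failures, alarms, window=0.05)
```

Failures with a matching SLOS alarm are attributed to the optical layer; unmatched ones are attributed to other causes.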
44 Reducing Critical Failures
Replace old optical fibers/parts.
Optical protection.
Push the traffic away (also works for maximum load and peak factor).
45 Performance Improvement

Time-based metrics         % improvement    Link-based metrics         % improvement
# of failures                   41          # of failures                   45
# of affected O-D pairs         36          # of affected O-D pairs         37
# of BGP prefixes               32          # of BGP prefixes               29
Path unavailability             39          Path coverage                   42
Total rerouted traffic          29          Total rerouted traffic          38
46 Reducing Link Down-time
Low-failure links: failures are very rare, so damping doesn't help.
High-failure links: the failure rate changes very slowly, so fixed damping is wasteful.
47 Adaptive Damping
Input:
  delta: time difference between the last two failures
  T: threshold
  c: constant
  f: function
function Adaptive_Damping
begin
  if (delta < T)
    ADT := c × f(delta);
  else
    ADT := 0;
end;
Output:
  ADT: adaptive damping timer
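The adaptive damping idea can be sketched in a few lines: damp only when failures arrive closer together than a threshold. The symbols on the slide did not survive extraction, so the names (delta, T, c) and the reciprocal form of f used here are assumptions for illustration, not the paper's exact function.

```python
def adaptive_damping(delta, threshold, c, f=lambda d: 1.0 / max(d, 1e-9)):
    """Adaptive damping timer (ADT): zero when the gap between the last two
    failures exceeds the threshold, otherwise c scaled by a decreasing
    function of the gap (reciprocal here, as an assumed illustration)."""
    return c * f(delta) if delta < threshold else 0.0

widely_spaced = adaptive_damping(10, threshold=5, c=2)  # rare failures: no damping
bursty = adaptive_damping(2, threshold=5, c=2)          # close failures: damped
```

This matches the observation on the previous slide: low-failure links are never damped, while bursty links are damped in proportion to how rapidly they flap.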
48 Number – Duration Pareto Curve
49 Thank you!