1 Limiting the Impact of Failures on Network Performance Joint work with Supratik Bhattacharyya, and Christophe Diot High Performance Networking Group,

Presentation transcript:

1 Limiting the Impact of Failures on Network Performance
Joint work with Supratik Bhattacharyya and Christophe Diot
High Performance Networking Group, 25 Feb
Yashar Ganjali, Computer Systems Lab., Stanford University

2 Motivation
- The core of the Internet consists of several large networks (IP backbones).
- IP backbones are carefully provisioned to guarantee low latency and jitter for packet delivery.
- Failures occur on a daily basis as a result of
  - physical layer malfunction,
  - router hardware/software failures,
  - maintenance,
  - human errors, …
- Failures affect the quality of service delivered to backbone customers.

3 Outline
- Background
  - Sprint’s IP backbone
  - Data
- Impact Metrics
  - Time-based metrics
  - Link-based metrics
  - Measurements
- Reducing the impact
  - Identifying critical failures
  - Causes analysis
  - Reducing critical failures

4 Background – Sprint’s IP backbone
- The IP layer operates above DWDM with SONET framing.
- The IS-IS protocol is used to route traffic inside the network.
- IP-level restoration
  - When an IP link fails, all routers in the network independently compute a new path around the failure.
- No protection in the underlying optical infrastructure.
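As an illustration of IP-level restoration, the sketch below recomputes a shortest path after a link is removed from the topology. It is a toy model, not Sprint's or IS-IS's actual implementation: the three-node `topology`, its link costs, and the helper names `shortest_path` and `without_link` are all hypothetical.

```python
import heapq

def shortest_path(graph, src, dst):
    """Dijkstra over {node: {neighbor: cost}}; returns (cost, path) or None."""
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == dst:
            # Walk predecessors back to the source to rebuild the path.
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return d, path[::-1]
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return None

def without_link(graph, a, b):
    """Copy of the topology with the bidirectional link a-b removed."""
    return {u: {v: w for v, w in nbrs.items() if {u, v} != {a, b}}
            for u, nbrs in graph.items()}

# Hypothetical three-node topology with symmetric link costs.
topology = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 1},
    "C": {"A": 4, "B": 1},
}
before = shortest_path(topology, "A", "C")
after = shortest_path(without_link(topology, "B", "C"), "A", "C")
```

In this toy case the path from A to C moves from A-B-C to the direct but costlier A-C link once B-C fails; every router would run the same computation on its own copy of the link-state database.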

5 Data
- IS-IS Link State PDU logs
  - Collected by passive listeners from Sprint’s North America backbone.
  - Feb. 1st, 2003 to Jun. 30th,
- SNMP logs
  - Link loads recorded once every 5 minutes.
- SONET layer alarms
  - Corresponding to minor and major problems in the optical layer.
  - We are only interested in two alarms: SLOS and SLOS cleared.

6 Link Failures in Sprint’s IP Backbone – 9408 Failures

7 Inter-POP vs. Intra-POP
[Figure: nodes labeled ANA-1, ANA-2, ANA-3, ANA-4]

8 Outline
- Background
  - Sprint’s IP backbone
  - Data
- Impact Metrics
  - Time-based metrics
  - Link-based metrics
  - Measurements
- Reducing the impact
  - Identifying critical failures
  - Causes analysis
  - Reducing critical failures

9 Inter-POP Link Failures in Sprint’s IP Backbone

10 Two Perspectives
For a given impact metric:
- Time-based analysis: measure the impact of failures on the given metric as a function of time.
- Link-based analysis: measure the impact of failures on the given metric as a function of failing links.
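The two perspectives amount to aggregating the same failure records along different keys. The sketch below is one way to picture it, under several assumptions: the record fields (`link`, `start`, `end`), the one-hour bin size, attributing a failure to the bin in which it starts, and using failure duration as the impact are all illustrative choices, not taken from the talk.

```python
from collections import defaultdict

def time_based(failures, impact, bin_seconds=3600):
    """Sum the impact of failures per time bin (keyed by bin index)."""
    by_time = defaultdict(float)
    for f in failures:
        # Assumption: a failure is attributed to the bin in which it starts.
        by_time[int(f["start"] // bin_seconds)] += impact(f)
    return dict(by_time)

def link_based(failures, impact):
    """Sum the impact of failures per failing link."""
    by_link = defaultdict(float)
    for f in failures:
        by_link[f["link"]] += impact(f)
    return dict(by_link)

# Hypothetical failure records; times are seconds since the start of the trace.
failures = [
    {"link": "L1", "start": 10, "end": 70},
    {"link": "L2", "start": 3700, "end": 3730},
    {"link": "L1", "start": 4000, "end": 4010},
]

def duration(f):
    """Illustrative impact function: how long the link was down."""
    return f["end"] - f["start"]

per_hour = time_based(failures, duration)
per_link = link_based(failures, duration)
```

Any of the impact metrics on the following slides could be substituted for `duration`, as long as it can be evaluated per failure.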

11 Time-based Impact Metrics 1. Number of Simultaneous Link Failures 2. Number of affected O-D pairs 3. Number of affected BGP prefixes 4. Path unavailability 5. Total rerouted traffic 6. Maximum load

12 Number of Simultaneous Failures

13 Number of Simultaneous Failures
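The number of simultaneous link failures over time can be derived from failure start and end times with a sweep over sorted events. This is a minimal sketch with made-up intervals; the function name and the `(start, end)` tuple format are assumptions for illustration.

```python
def simultaneous_failures(intervals):
    """Return [(time, number of links down from that time on)] from (start, end) pairs."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a link goes down
        events.append((end, -1))    # a link comes back up
    # Tuples sort by time first; at equal times -1 sorts before +1, so a
    # recovery is processed before a failure that begins at the same instant.
    events.sort()
    series = []
    active = 0
    for t, change in events:
        active += change
        if series and series[-1][0] == t:
            series[-1] = (t, active)  # merge events that share a timestamp
        else:
            series.append((t, active))
    return series

# Hypothetical failure intervals (seconds).
series = simultaneous_failures([(0, 10), (5, 15), (20, 25)])
peak = max(count for _, count in series)
```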

14 Time-based Impact Metrics 1. Number of Simultaneous Link Failures 2. Number of affected O-D pairs 3. Number of affected BGP prefixes 4. Path unavailability 5. Total rerouted traffic 6. Maximum load

15 Number of Affected O-D Pairs
[Figure: example topology with nodes A, B, C, D, E, F]

16 Number of Affected O-D Pairs

17 Time-based Impact Metrics 1. Number of Simultaneous Link Failures 2. Number of affected O-D pairs 3. Number of affected BGP prefixes 4. Path unavailability 5. Total rerouted traffic 6. Maximum load

18 Number of Affected BGP Prefixes

19 Time-based Impact Metrics 1. Number of Simultaneous Link Failures 2. Number of affected O-D pairs 3. Number of affected BGP prefixes 4. Path unavailability 5. Total rerouted traffic 6. Maximum load

20 Path Unavailability
[Figure: example topology with nodes A, B, C, D, E, F]

21 Path Unavailability

22 Time-based Impact Metrics 1. Number of Simultaneous Link Failures 2. Number of affected O-D pairs 3. Number of affected BGP prefixes 4. Path unavailability 5. Total rerouted traffic 6. Maximum load

23 Total Rerouted Traffic

24 Time-based Impact Metrics 1. Number of Simultaneous Link Failures 2. Number of affected O-D pairs 3. Number of affected BGP prefixes 4. Path unavailability 5. Total rerouted traffic 6. Maximum load

25 Maximum Load Throughout the Network

26 Maximum Load Throughout the Network 96% of link failures were not followed by an immediate change in maximum load.

27 Time-based Impact Metrics 1. Number of Simultaneous Link Failures 2. Number of affected O-D pairs 3. Number of affected BGP prefixes 4. Path unavailability 5. Total rerouted traffic 6. Maximum load

28 Number of Failures per Link

29 Number of Affected OD Pairs per Link

30 Number of Affected BGP Prefixes per Link

31 Path Coverage
[Figure: example topology with nodes A, B, C, D, E, F]

32 Path Coverage of Links

33 Total Rerouted Traffic on a Link

34 Peak Factor of a Link

35 Link-based Impact Metrics 1. Number of Link Failures 2. Number of affected O-D pairs 3. Number of affected BGP prefixes 4. Path coverage 5. Total rerouted traffic 6. Peak factor

36 Outline
- Background
  - Sprint’s IP backbone
  - Data
- Impact Metrics
  - Time-based metrics
  - Link-based metrics
  - Measurements
- Reducing the impact
  - Identifying critical failures
  - Causes analysis
  - Reducing critical failures

37 Critical Failures
- For each time-based metric
  - Removing failures occurring during 1-5% of the time improves the metric by a factor of at least 5.
- For each link-based metric
  - Removing failures on 1-7% of links improves the metric by a factor of at least 3.
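One way to picture how a critical set could be identified is a greedy selection of the largest contributors until the remaining total has shrunk by the target factor. The sketch below assumes the metric is additive across time bins or links, which may not hold for every metric in the talk; the function name and the sample values are hypothetical.

```python
def critical_set(impact_by_key, factor):
    """Pick the largest contributors until the remaining total is at most total / factor."""
    total = sum(impact_by_key.values())
    remaining = total
    critical = []
    ranked = sorted(impact_by_key.items(), key=lambda kv: kv[1], reverse=True)
    for key, value in ranked:
        if remaining * factor <= total:
            break  # the metric has already improved by the requested factor
        critical.append(key)
        remaining -= value
    return critical

# Hypothetical per-link impact values (arbitrary units).
impact = {"a": 80, "b": 10, "c": 5, "d": 5}
critical = critical_set(impact, factor=5)
```

Here a single link accounts for enough of the total that removing its failures alone improves the metric fivefold, which mirrors the skew the slide describes.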

38 Critical Time Periods

39 Critical Links
- Any link that has a critical failure is called a critical link.
- We are interested in fixing such links.

40 Correlation of Critical Sets

41 Correlation of the Critical Sets
[Table with columns Metric and Size; the Size values did not survive extraction.]
1) Simultaneous failures
2) # of O-D pairs
3) # of BGP prefixes
4) Path unavailability
5) Total rerouted traffic
6) # of failures
7) # of O-D pairs
8) # of BGP prefixes
9) Path coverage
10) Total rerouted traffic
Overall, 23% of all links are critical.

42 Cause Analysis
Markopoulou et al. used IS-IS update messages to classify link failures into the following categories [MIB+04]:
- Maintenance
- Unplanned
  - Shared failures
    - Router-related
    - Optical-related
    - Unspecified
  - Individual failures (about 70% of all unplanned failures)

43 Matching SLOS Alarms with IP Link Failures
[Figure: timeline showing the IP link failure, SLOS, and SLOS cleared events, annotated ~20 ms and ~12 sec]
- 58% of all link failures are due to optical layer problems.
- 84% of critical failures are due to optical layer problems.
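Matching optical alarms to IP link failures can be sketched as a per-link search for an alarm close in time to each failure. The tuple format, the function name, and the 1-second tolerance below are assumptions chosen for illustration; the slide only indicates that the events fall within tens of milliseconds of each other.

```python
def match_failures_to_alarms(failures, alarms, tolerance=1.0):
    """For each (link, time) failure, report whether an alarm on the same
    link occurred within `tolerance` seconds of it."""
    alarms_by_link = {}
    for link, t in alarms:
        alarms_by_link.setdefault(link, []).append(t)
    matched = []
    for link, t in failures:
        candidates = alarms_by_link.get(link, [])
        matched.append(any(abs(t - a) <= tolerance for a in candidates))
    return matched

# Hypothetical events: (link id, timestamp in seconds).
ip_failures = [("L1", 100.02), ("L2", 500.0), ("L1", 900.0)]
slos_alarms = [("L1", 100.0), ("L2", 430.0)]
flags = match_failures_to_alarms(ip_failures, slos_alarms)
optical_share = sum(flags) / len(flags)
```

The share of matched failures is the kind of figure the slide reports (58% overall, 84% for critical failures), computed here on toy data only.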

44 Reducing Critical Failures
- Replace old optical fibers/parts.
- Optical protection.
- Push the traffic away.
  - Also works for maximum load and peak factor.

45 Performance Improvement
[Table of % improvement per metric; the values did not survive extraction.]
Time-based metrics: # of failures, # of affected O-D pairs, # of BGP prefixes, path unavailability, total rerouted traffic
Link-based metrics: # of failures, # of affected O-D pairs, # of BGP prefixes, path coverage, total rerouted traffic

46 Reducing Link Down-time
- Low-failure links:
  - Failures are very rare.
  - Damping doesn’t help.
- High-failure links:
  - Failure rate changes very slowly.
  - Fixed damping is wasteful.

47 Adaptive Damping
Input:
  Δ: time difference between the last two failures
  T: threshold
  α: constant
function Adaptive_Damping
begin
  if (Δ < T)
    ADT := α × Δ;
  else
    ADT := 0;
end;
Output:
  ADT: adaptive damping timer
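A minimal Python sketch of the adaptive damping rule follows. The parameter names `delta`, `threshold`, and `alpha` are illustrative, and scaling the timer by the gap between the last two failures (rather than by the threshold) is an assumption about the slide's formula, chosen because a timer scaled by the threshold would behave like fixed damping.

```python
def adaptive_damping_timer(delta, threshold, alpha):
    """Return the damping timer for a link that has just failed.

    delta:     time between the link's last two failures
    threshold: gaps at or above this value trigger no damping
    alpha:     constant scaling factor (assumed to multiply delta)
    """
    if delta < threshold:
        # Failures are close together: hold the link down for a period
        # proportional to the observed gap between failures.
        return alpha * delta
    # Failures are far apart: no damping.
    return 0

# Hypothetical values, all in seconds.
flapping = adaptive_damping_timer(delta=10, threshold=60, alpha=2)
stable = adaptive_damping_timer(delta=120, threshold=60, alpha=2)
```

With these toy values the flapping link is held down for 20 seconds, while the link whose failures are two minutes apart is not damped at all.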

48 Number – Duration Pareto Curve

49 Thank you!