Introduction CSC/ECE 772: Survivable Networks Spring, 2009, Rudra Dutta.


1 Introduction CSC/ECE 772: Survivable Networks Spring, 2009, Rudra Dutta

2 Motivation
Copyright Rudra Dutta, NCSU, Spring, 2009
Failures can affect our lives fairly directly
  – Internet evolution: lab curiosity, mil/gov, educational, commercial, business, social, lifeline/ubiquitous
Modern society critically depends on communication networks
  – Similar to power grid, transportation system, water distribution
Mission-critical business functions must be available 24/7
  – Web-based transaction systems
  – 1-800 services
  – e-commerce
Government services, emergency (911) services
Scientific projects (e-Science, etc.)
Everyday communication services
  – BT will switch entirely to IP by 2012 (Spectrum, 1/2007)
Survivability must be the foremost consideration in network design, not an afterthought

3 Failure Events
Link failures - fiber cuts
Failure of active components inside network equipment
  – Transmitters, receivers, controllers
  – Individual channel failures (in a WDM system)
Node failures - due to catastrophic events
  – Rare events, but cause widespread disruption
Software failures - due to immense complexity
  – Usually dealt with by using proper software design techniques
  – Hard to protect against in the network

4 Failure Causes
Human error - most common cause
  – “Backhoe effect”, operator errors, ...
Natural events - floods, snow storms, earthquakes
  – 1997 solar storm caused Telstar failure
  – 1988 fire at Hinsdale CO (central office)
  – 1999 damage from Hurricane Floyd
  – 2002 fire melted Verizon fiber cable
Animals!
  – Rodents gnaw on cable jackets
  – Sharks bite undersea cables (TAT-8)
Sabotage - terrorism (9/11)
Wear and tear

5 Chances of Failure
Fiber optic cables are critical components: we know to
  – ... physically protect cables,
  – ... bury them suitably deep,
  – ... be careful when digging,
  – So why do fiber cables get cut at all?
Similar issues in operating many large-scale systems:
  – Nuclear reactors, water systems, air traffic control / airplanes
  – Lay person: baffled when things go wrong
  – Insider: knows how many things can go wrong
Statistical certainty of a fairly high rate of failures!
Average life of fiber span - 228 years
  – But consider laid fiber-miles
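The "228 years, but consider laid fiber-miles" point can be made concrete with a little arithmetic. The sketch below is illustrative only: it takes the 228-year figure from the slide and assumes, beyond what the slide states, that spans fail independently at a constant rate (a Poisson model). The function names are hypothetical.

```python
import math

# Mean life of a single fiber span, as quoted on the slide.
MEAN_SPAN_LIFE_YEARS = 228.0


def expected_cuts_per_year(num_spans, mean_life_years=MEAN_SPAN_LIFE_YEARS):
    """Expected number of span failures per year across the whole network."""
    return num_spans / mean_life_years


def prob_at_least_one_cut(num_spans, years=1.0,
                          mean_life_years=MEAN_SPAN_LIFE_YEARS):
    """Probability of one or more failures in the given period.

    Assumption (not from the slides): failures are independent and
    arrive at a constant rate, so the count is Poisson distributed.
    """
    rate = num_spans * years / mean_life_years
    return 1.0 - math.exp(-rate)


if __name__ == "__main__":
    for spans in (1, 100, 1000):
        print(spans,
              round(expected_cuts_per_year(spans), 2),
              round(prob_at_least_one_cut(spans), 3))
```

Under these assumptions a single span almost never fails in a given year, while a network of a few hundred spans should expect about one cut per year or more, which is the sense in which failures are a statistical certainty.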

6 Service Outage Statistics

7 Engineering Fault Tolerance
Failures may be rare or common, but are inevitable
Should not be baffling!
  – (At least not to the designer of the system!)
  – Should, in fact, be predictable (at least statistically)
Must engineer for failure - common to many disciplines
Most repair times are much larger than acceptable restoration times
  – Restoration of service must be engineered to operate with an active failure in the network

8 Outage Duration
Revenue loss
  – Loss of business (e.g., voice-calling revenue)
  – Default on SLAs
Business disruption
  – Regular business impacted
  – Societal impact/risks (travel, education, financial services, 911)
  – Lawsuits, bankruptcies
Network dynamics
  – Application/TCP session timeouts, router connectivity loss
  – Overloading

9 Outage Effects
Target Range          Duration        Main Effects
Protection Switching  < 50 ms         Service “hit”; TCP sees no impact
1                     50 - 200 ms     Few voiceband disconnects; ATM cell rerouting may start
2                     200 - 2000 ms   Some switched connections drop; TCP protocol backoff
3                     2 - 10 s        Switched circuit services drop (X.25); TCP session timeouts; “webpage not available” errors; affects router hello protocol

10 Outage Effects
Target Range          Duration        Main Effects
4                     10 s - 5 min    Calls and data sessions terminated; TCP/IP applications time out; users attempt mass redials; routers issue LSAs; topology update, network-wide resynch
“Undesirable”         5 - 30 min      Routers under heavy reattempts load; minor business/societal impact; noticeable “Internet brownout”
“Unacceptable”        > 30 min        Regulatory reporting required; major societal impacts/risks, headlines; SLA clauses triggered, lawsuits
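The duration bands in the two Outage Effects slides can be read as a simple lookup from outage duration to target range. The sketch below is one possible encoding, not something from the slides: it assumes the unlabeled durations for ranges 1 and 2 are in milliseconds, and the treatment of exact boundary values is a choice made here for illustration. `classify_outage` is a hypothetical name.

```python
# Upper bounds in seconds for each target range after protection switching.
# Assumption: ranges 1 and 2 are 50-200 ms and 200-2000 ms respectively.
OUTAGE_BANDS = [
    (0.200, "1"),
    (2.0, "2"),
    (10.0, "3"),
    (300.0, "4"),             # 5 min
    (1800.0, "Undesirable"),  # 30 min
]


def classify_outage(duration_s):
    """Map an outage duration in seconds to its target range label."""
    if duration_s < 0:
        raise ValueError("duration must be non-negative")
    # The table gives a strict bound for protection switching (< 50 ms).
    if duration_s < 0.050:
        return "Protection Switching"
    # Boundary handling is an assumption: upper bounds are inclusive,
    # so exactly 30 min is still "Undesirable" and only > 30 min is
    # "Unacceptable", matching the last row of the table.
    for upper_bound, label in OUTAGE_BANDS:
        if duration_s <= upper_bound:
            return label
    return "Unacceptable"
```

For example, `classify_outage(0.03)` falls under protection switching, while `classify_outage(3600)` is classed as unacceptable.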

11 Planning Graded Fault Tolerance
Instantaneous recovery from most significant/frequent failures
  – Eliminate human involvement - device level
Fast recovery from other significant or frequent failures
  – Also automatic - device or system
Reasonably fast recovery from next tier of failures
  – Automated, but may be system / software
Least likely tier - repair and recovery plans
  – Includes manual repair, must also plan for liability

12 Mechanisms for Fault Tolerance
Carefully design in specific amounts of spare capacity
  – spare links/channels/equipment
  – bumping low-priority traffic
Design network topology for physical diversity
  – bi-connected topology (or better)
  – diverse routing
  – shared risk link group (SRLG) concept
Embed real-time mechanisms to develop/implement “patch plan”
  – appropriate protocols and algorithms
  – cross-layer interactions
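Two of the diversity ideas on this slide, a bi-connected topology and SRLG-diverse routing, lend themselves to small checks. The sketch below is a brute-force illustration under assumptions of my own: the network is given as a list of undirected links, a path is a list of link identifiers, and `srlg_of` is a hypothetical mapping from link to the risk groups (ducts, conduits, and so on) it belongs to. None of these names come from the slides.

```python
from collections import defaultdict


def _is_connected(adj, removed=None):
    """Depth-first search connectivity test, optionally ignoring one node."""
    nodes = [n for n in adj if n != removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v != removed and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(nodes)


def is_biconnected(links):
    """True if the network stays connected after any single node failure.

    Brute force: remove each node in turn and re-test connectivity.
    Networks with fewer than three nodes are reported as not bi-connected
    (a convention chosen for this sketch).
    """
    adj = defaultdict(set)
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    adj = dict(adj)
    if len(adj) < 3:
        return False
    return _is_connected(adj) and all(
        _is_connected(adj, removed=n) for n in adj)


def _risk_groups(path, srlg_of):
    """All risks a path is exposed to: its own links plus their SRLGs."""
    groups = set()
    for link in path:
        groups.add(("link", link))
        groups.update(("srlg", g) for g in srlg_of.get(link, ()))
    return groups


def srlg_disjoint(primary, backup, srlg_of):
    """True if the two paths share no link and no shared risk link group."""
    return _risk_groups(primary, srlg_of).isdisjoint(
        _risk_groups(backup, srlg_of))
```

A four-node ring passes `is_biconnected`, while a simple chain does not. Two paths that use different links but run through the same duct are link-disjoint yet fail `srlg_disjoint`, which is the situation the SRLG concept is meant to expose.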

13 Stages of Failure Plan Operation
Failure detection
  – Almost certainly starting from device layer
Failure localization
  – Cooperation between device and software
Failure recovery
  – Software
Failure repair
  – Human

14 Summary
Faults are real, must plan to address
Faults are diverse, plan must be diverse
Fault tolerance is a system concept, not an add-on
  – Must plan at various levels
  – At each level, must be appropriate response
Must address together with network design problem, hard to achieve after the fact

