Network Survivability, Reliability and Availability: Protection & Restoration Zilong Ye, Ph.D. zye5@calstatela.edu.



Reliability
Reliability is the probability that a system or component will operate without any service-affecting failure for a period of time t. Reliability is a monotonically decreasing probability function of time, R(t), so a specific reliability number always implies an assumed duration of time. Reliability indicates how soon the next repair expense might be incurred, but it does not consider the repeated cycles of failure, repair time, and return to service that determine the availability of an ongoing service.

Availability
Availability is the probability that a system will be found in the operating state at a random time in the future. Availability inherently reflects a statistical equilibrium between failure processes, measured by the mean time to (or between) failure (MTTF/MTBF), and repair processes, measured by the mean time to repair (MTTR), in maintained repairable systems that are returned to the operating state following any failure.

Availability = MTTF / (MTTF + MTTR)
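The availability ratio can be checked with a few lines of Python. This is only a minimal sketch: the function name and the example MTTF/MTTR values are illustrative assumptions, not figures from the slides.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a repairable system: MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical example: a component that fails on average every 9,999 hours
# and takes 1 hour to repair.
print(f"{availability(9_999.0, 1.0):.6f}")
```

Halving the repair time improves availability about as much as doubling the time to failure, which is why fast recovery mechanisms matter as much as reliable components.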

Quantification of Availability

Percent Availability   N-Nines   Downtime (Minutes/Year, approx.)
99%                    2-Nines   5,000 Min/Yr
99.9%                  3-Nines   500 Min/Yr
99.99%                 4-Nines   50 Min/Yr
99.999%                5-Nines   5 Min/Yr
99.9999%               6-Nines   0.5 Min/Yr
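The downtime figures in the table are rounded. A short sketch of the underlying arithmetic, assuming a 365-day year (the function and constant names are illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year for a given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for nines in range(2, 7):
    a = 1.0 - 10.0 ** (-nines)
    print(f"{nines}-Nines: {downtime_minutes_per_year(a):10.3f} min/yr")
```

For five nines this gives about 5.26 minutes per year, which the table rounds to 5.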

Survivability
The survivability of a network as a whole is the average fraction of failed working capacity that can be restored by a specified mechanism within the spare capacity provided in the network. A link may be fully survivable, with 100% of its capacity restored, or partially survivable, with less than 100% restored.

Market Drivers for Survivability
- Customer relations
- Competitive advantage
- Revenue
  - Negative: tariff rebates
  - Positive: premium services (business customers, medical institutions, government agencies)
- Impact on operations
- Minimize liability

Failure Types & Other Motivations
Types of failure:
- Components: links, nodes, channels in WDM, active components, software…
- Human error: backhoe fiber cut (fiber laid inside oil/gas pipelines is less likely to be cut)
- Systems: entire central offices (COs) can fail due to catastrophic events or deliberate attacks

Network Survivability: Drivers
- Availability: 99.999% (5 nines) means about 5 minutes of downtime per year
- Since a network is made up of several components, the only way to reach 5 nines is to add survivability in the face of failures
  - Survivability = continued service in the presence of failures
- Protection switching or restoration: mechanisms used to ensure survivability
  - Add redundant capacity, detect faults, and automatically re-route traffic around the failure
- Protection: fast time scale, tens to hundreds of milliseconds
  - Implemented in a distributed manner to ensure fast restoration
- Restoration: related term, but slower time scale
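A rough numerical illustration of why redundancy is needed, assuming independent component failures. The component count (10) and the per-component availability are hypothetical values chosen for the example, and the function names are illustrative.

```python
from math import prod


def series_availability(parts):
    """All components must be up: the availabilities multiply."""
    return prod(parts)


def parallel_availability(parts):
    """At least one redundant branch must be up (independent failures assumed)."""
    return 1.0 - prod(1.0 - a for a in parts)


# Hypothetical path: 10 components in series, each already at five nines.
single_path = series_availability([0.99999] * 10)
# The same path protected by a second, independent path.
protected = parallel_availability([single_path, single_path])

print(f"unprotected path: {single_path:.7f}")
print(f"protected pair:   {protected:.9f}")
```

Even with five-nines components, the unprotected series path falls to roughly four nines, while a second independent path brings the pair well above five nines.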

Types of Fault-Recovery Mechanisms
Protection
- Backup resources (routes and wavelengths) are pre-computed and reserved in advance, before a failure occurs; simple, but 50% overhead
- Faster recovery time
- What if the pre-reserved resources also fail?
Restoration
- Routes and wavelengths are discovered dynamically after a failure is detected
- Resource allocation is based on current network state information
- More resource efficient
- Can recover as long as there are redundant resources
- Slower recovery time

Restoration
Path restoration
- Route can be computed after failure
Link restoration
- Path is discovered at the end nodes of the failed link
- More practical than path restoration
Advantages and disadvantages of restoration
- Can usually recover from multiple element faults
- More efficient usage of resources
- Complex
- Slow: requires extra processing time to set up the path and reserve resources

Comparison between Protection & Restoration
- Characteristic: with protection, resources are reserved before the failure and may go unused; with restoration, resources are reserved and used after the failure
- Route: protection: predetermined; restoration: can be dynamically computed
- Resource efficiency: protection: low; restoration: high

Comparison between Protection & Restoration (Cont’)
- Time used: protection: short; restoration: long
- Reliability: protection: mainly for a single fault; restoration: can survive multiple faults
- Implementation: protection: simple; restoration: complex

Network Survivability Architectures
[Diagram: taxonomy of network survivability architectures. Labels, in order of appearance: Restoration; Protection; Self-healing Network; Re-Configurable Network; Protection Switching; Linear Protection Architectures; Ring Protection Architectures; Mesh Restoration Architectures; Path-based; Link-based; Segment-based.]

Restoration in Mesh Networks
[Figure: two mesh restoration architectures drawn as meshes of DCS nodes. Panel titles: "Self Healing (distributed) Restoration Architecture" and "Reconfigurable (or Rerouting) Restoration Architecture (centralized)". Other labels: Central Controller; DC; Probing after restoration.]
DC = Distributed Controller

Protection Switching Terminology
1+1 architectures
- Permanent bridge at the source, select at the sink
m:n architectures
- m entities provide protection for n working entities, where m is less than or equal to n
- Allows unprotected extra traffic on the m entities
- Most common: SONET linear 1:1 and 1:n
Coordination protocol
- Provides coordination between the controllers at the source and the sink
- Required for all m:n architectures
- Not required for 1+1 architectures

Basic Ideas: Working and Backup Paths

Protection Switching: Terminology
- Dedicated vs. shared: a working connection is assigned dedicated or shared protection bandwidth; 1+1 is dedicated, 1:n is shared
- Revertive vs. non-revertive: after the failure is fixed, traffic is switched back automatically (revertive) or manually (non-revertive)
- Shared protection schemes are usually revertive

Different protection and restoration schemes
[Diagram: classification tree of protection/restoration schemes. Recoverable labels: Protection; Restoration; Ring Protection; Mesh Protection; Link Protection; Path Protection; Link Restoration; Path Restoration.]

Path vs Link Protection
[Figure: mesh of DCS nodes showing a working path, a line (or link) protection route, and a protection path.]
- Control: centralized or distributed
- Route calculation: preplanned or dynamic
- Type of alternate routing: line or path

Types of Fault-Recovery Mechanisms
Path protection
- Two link-disjoint (or node-disjoint) paths: a primary (working) path and a backup (protection) path
- Traffic is rerouted through the link-disjoint backup route once a link failure occurs on the working path
- Usually requires fewer resources (uses shorter routes)
- Lower end-to-end propagation delay for the recovered route
- Backup path is pre-reserved or pre-set up
- Backup paths of different connections may or may not share common wavelengths on common links

Types of Fault-Recovery Mechanisms
Dedicated path protection
- Does not allow sharing among backup paths (resources)
- Backup paths are pre-configured
- No switch configuration is necessary along the backup path when a failure occurs
- Fast recovery time
- Resources are not efficiently utilized (100% redundancy)

Types of Fault-Recovery Mechanisms
Shared path protection
- Allows sharing among backup paths, subject to certain constraints
- If the primary/active paths (APs) are link disjoint, their backup paths (BPs) may share a common link and wavelength
- Backup paths are configured when a failure occurs: because backup paths may be shared, resources cannot be committed to a particular primary in advance
- Slower recovery time
- Much better resource utilization
- More signaling is required to recover from the failure

Shared Path Protection
If and only if two APs are disjoint, their BPs can share backup bandwidth (BBW) on a common link, i.e., the total backup bandwidth on that link is max{w1, w2}.
[Figure: AP1 (bandwidth w1) from S1 to D1 and AP2 (bandwidth w2) from S2 to D2; their backup paths BP1 and BP2 share link L, which reserves max{w1, w2}.]
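The sharing rule can be sketched as a small bookkeeping routine. This is an illustrative model under a single-link-failure assumption, not code from the slides; the function name and the tuple layout are hypothetical. For every backup link it records how much traffic would land there for each possible failed link and reserves the worst case.

```python
from collections import defaultdict


def backup_bandwidth(connections):
    """connections: iterable of (ap_links, bp_links, w).

    Returns {backup_link: bandwidth to reserve}, assuming at most one
    link fails at a time.
    """
    # demand[b][f] = bandwidth rerouted onto backup link b when link f fails
    demand = defaultdict(lambda: defaultdict(float))
    for ap_links, bp_links, w in connections:
        for b in bp_links:
            for f in ap_links:
                demand[b][f] += w
    return {b: max(per_failure.values()) for b, per_failure in demand.items()}


# Disjoint APs: the two backups share link "L", which needs max{w1, w2}.
print(backup_bandwidth([({"a1"}, {"L"}, 3), ({"a2"}, {"L"}, 5)]))
# APs sharing link "x": both fail together, so "L" needs w1 + w2.
print(backup_bandwidth([({"x"}, {"L"}, 3), ({"x"}, {"L"}, 5)]))
```

With disjoint APs no single failure activates both backups, which is why the reservation on the shared link is the maximum rather than the sum.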

Types of Fault-Recovery Mechanisms
Link protection
- A lightpath is set up on a primary path
- For each link on the primary path, a backup detour is reserved around the link
- No sharing: dedicated-link protection, where the wavelength used on the backup loop is dedicated to the specific link to be protected
- Shared-link protection
- Note: different connections on the same link might have different backup detours for that link

Solution 1: Active-Path First
- Find an active path (AP) first
- Then find a disjoint backup path (BP)
- How? Remove the physical links and resources that the active path traverses, and then re-run the routing algorithm to find the backup path
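A minimal sketch of the active-path-first procedure, assuming an undirected graph with positive link costs and Dijkstra's algorithm as the routing algorithm. The helper names and the example topology are illustrative assumptions, and wavelength resources are not modeled.

```python
import heapq


def build_graph(edges):
    """Build an undirected adjacency map adj[u][v] = cost from (u, v, cost) triples."""
    adj = {}
    for u, v, c in edges:
        adj.setdefault(u, {})[v] = c
        adj.setdefault(v, {})[u] = c
    return adj


def shortest_path(adj, src, dst, banned=frozenset()):
    """Dijkstra's algorithm; links listed in `banned` (as frozensets) are skipped."""
    dist, prev, done = {src: 0}, {}, set()
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, c in adj[u].items():
            if frozenset((u, v)) in banned:
                continue
            if d + c < dist.get(v, float("inf")):
                dist[v], prev[v] = d + c, u
                heapq.heappush(heap, (d + c, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def active_path_first(adj, src, dst):
    """Find the AP, remove its links, then re-run routing for the BP."""
    ap = shortest_path(adj, src, dst)
    if ap is None:
        return None, None
    used = {frozenset(link) for link in zip(ap, ap[1:])}
    return ap, shortest_path(adj, src, dst, banned=used)


# Hypothetical four-node topology with two parallel two-hop routes.
adj = build_graph([("s", "a", 1), ("a", "d", 1), ("s", "b", 2), ("b", "d", 2)])
print(active_path_first(adj, "s", "d"))
```

The backup search returns None when removing the AP's links disconnects source and destination, even if some other pair of disjoint paths exists.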

Solution 2: Joint Path Selection
- Select the active path and the backup path in a joint manner
- Joint optimization performs better than active-path-first schemes in terms of the amount of network resources required
- How? Use Suurballe’s algorithm to compute two link-disjoint paths between (s, d) simultaneously

Suurballe’s Algorithm
Given a graph G = (V, E), find a pair of edge-disjoint paths from s to t such that the total edge cost of the two paths is minimal among all such path pairs.
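The problem statement can be illustrated with a short sketch. Note that the code below is not Suurballe's algorithm itself: it solves the same minimum-total-cost disjoint-pair problem as a two-unit min-cost flow (successive shortest paths with Bellman-Ford on the residual graph), which is easier to write compactly. It assumes directed arcs with positive costs; the function name and the example topology are illustrative.

```python
def min_cost_disjoint_pair(arcs, s, t):
    """arcs: list of directed (u, v, cost) with cost > 0.

    Returns two arc-disjoint s-t paths of minimum total cost, or None.
    """
    graph, head, cap, cost = {}, [], [], []

    def add(u, v, c):
        # Residual arcs are stored in pairs: index e is forward, e ^ 1 is reverse.
        graph.setdefault(u, []).append(len(head)); head.append(v); cap.append(1); cost.append(c)
        graph.setdefault(v, []).append(len(head)); head.append(u); cap.append(0); cost.append(-c)

    for u, v, c in arcs:
        add(u, v, c)
    graph.setdefault(s, [])
    graph.setdefault(t, [])

    for _unit in range(2):  # push two units of flow, one path each
        dist, pred = {s: 0}, {}
        for _round in range(len(graph)):  # Bellman-Ford handles negative residual costs
            changed = False
            for u in list(dist):
                for e in graph[u]:
                    if cap[e] > 0 and dist[u] + cost[e] < dist.get(head[e], float("inf")):
                        dist[head[e]] = dist[u] + cost[e]
                        pred[head[e]] = e
                        changed = True
            if not changed:
                break
        if t not in dist:
            return None
        v = t
        while v != s:  # augment along the path, possibly cancelling earlier flow
            e = pred[v]
            cap[e] -= 1
            cap[e ^ 1] += 1
            v = head[e ^ 1]

    # Forward arcs with no remaining capacity carry flow; split them into two paths.
    out = {}
    for e in range(0, len(head), 2):
        if cap[e] == 0:
            out.setdefault(head[e ^ 1], []).append(head[e])
    paths = []
    for _unit in range(2):
        path = [s]
        while path[-1] != t:
            path.append(out[path[-1]].pop())
        paths.append(path)
    return paths


# Hypothetical "trap" topology: the shortest path s-a-b-t blocks every backup,
# yet the disjoint pair s-a-t / s-b-t exists.
arcs = [("s", "a", 1), ("a", "b", 1), ("b", "t", 1), ("s", "b", 3), ("a", "t", 3)]
print(min_cost_disjoint_pair(arcs, "s", "t"))
```

On this topology a shortest-path-first search would pick s-a-b-t and then find no disjoint backup, while the joint formulation returns the pair with total cost 8.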

Dedicated-path protection – Heuristic Algorithms
- Remove links that do not have free wavelengths
- Apply Suurballe’s algorithm to find a pair of paths
- Choose the shorter path as the primary path and the longer path as the backup
- Assign a wavelength to each path using First-Fit
- Guarantees the minimum total bandwidth (TBW) = active BW (ABW) + backup BW (BBW) for this request
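The First-Fit step can be sketched as follows. The sketch assumes wavelength continuity (the same wavelength on every link of a path, i.e., no wavelength converters), which the slide does not state; the function name, link labels, and wavelength sets are hypothetical.

```python
def first_fit(path_links, free, num_wavelengths):
    """Pick the lowest-indexed wavelength free on every link of the path.

    free: {link: set of free wavelength indices}; updated in place on success.
    Returns the chosen wavelength index, or None if no wavelength fits.
    """
    for w in range(num_wavelengths):
        if all(w in free[link] for link in path_links):
            for link in path_links:
                free[link].remove(w)
            return w
    return None


# Hypothetical free-wavelength state for a primary and a backup path.
free = {"s-a": {0, 1, 2}, "a-d": {1, 2}, "s-b": {0, 2}, "b-d": {0, 2}}
print(first_fit(["s-a", "a-d"], free, 3))  # primary path
print(first_fit(["s-b", "b-d"], free, 3))  # backup path
```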

Shared-Path Protection Heuristic 1
- Use Suurballe’s algorithm to generate two routes
- Assign wavelengths while trying to share the wavelengths on the backup paths as much as possible
- Does not perform well in backup path sharing, since routing does not consider wavelength information or backup bandwidth sharing potential

A Fast and Efficient Heuristic 3
Challenges
- Jointly optimizing an AP/BP pair with shared path protection is NP-hard
- Using ILP is notoriously time consuming; it also only guarantees minimal TBW for each request, not minimal TBW over all requests
- Heuristics such as active path first (APF) can only achieve sub-optimal results: APF does not consider the yet-to-be-incurred backup cost along the BP when selecting a (shortest) AP

Potential Backup Cost (PBC)
- Uses a shortest path algorithm to find the AP first
- But, in selecting the AP, each capable link e (Re ≥ w) is assigned a cost of w+e(w), where the second term is the potential backup cost (PBC)
- Then finds a shortest BP
- Combines the best of the ILP-based and APF-based approaches
See Xu et al., Journal of Lightwave Technology, vol. 25, no. 8, pp. 2251-2259, 2007.

Protection in SRLG networks

Shared Risk Link Group (SRLG)
- Widely recognized as an important concept in survivable optical networks
- A group of network links that share a common physical resource (cable, conduit, etc.)
- Arises from the layered structure:
  - Physical layer: fiber spans (cable, conduit, et al.)
  - Optical layer: optical links and nodes (a subset of the nodes in the physical layer)

Layered Architecture of Optical Network
[Figure: (a) Optical layer: nodes 1-5 connected by optical links e1, e2, e3, e4, e5, e7. (b) Physical layer: nodes 1-5 connected by fiber spans g1-g9.]

Protection in SRLG networks
Finding an SRLG-disjoint path pair is more complicated than finding a link/node-disjoint path pair. In fact, the former is an NP-complete problem. If backup bandwidth (BBW) sharing is considered, the SRLG protection problem becomes even more complicated.

APF and Trap
- Active Path First (APF), followed by an SRLG-disjoint BP, is an attractive alternative (policy-based routing, optimal AP)
- But it may fail to find such a BP more frequently (up to 30% of the time statistically) when considering SRLGs
- Trap: can’t find an SRLG-disjoint BP
  - Real traps: unavoidable, topology-induced
  - Avoidable traps: algorithm-induced
- Only a few APF algorithms so far deal with avoidable traps

Other APF-based Heuristic
K Shortest Paths (KSP): finds the first K shortest paths between the source and destination as candidate APs, and then tests them in increasing order of cost until an SRLG-disjoint BP is found or all of them have been tested.
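The KSP loop can be sketched on a small graph. This is a toy illustration under stated assumptions: candidate paths are enumerated by brute force and sorted by cost (a stand-in for a real K-shortest-paths routine such as Yen's algorithm, and only practical for small topologies), and the SRLG map, function names, and example topology are hypothetical.

```python
def simple_paths(adj, src, dst, path=None):
    """Yield every loop-free path from src to dst (brute force)."""
    path = path or [src]
    if src == dst:
        yield list(path)
        return
    for nxt in adj.get(src, {}):
        if nxt not in path:
            path.append(nxt)
            yield from simple_paths(adj, nxt, dst, path)
            path.pop()


def links_of(path):
    return [frozenset(link) for link in zip(path, path[1:])]


def ksp_srlg(adj, risk, src, dst, k):
    """Test the k cheapest candidate APs in cost order; return the first
    (AP, BP) pair whose BP is link- and SRLG-disjoint from the AP, else None.

    risk: {frozenset({u, v}): set of SRLG ids}; missing links carry no SRLG.
    """
    def cost(p):
        return sum(adj[u][v] for u, v in zip(p, p[1:]))

    candidates = sorted(simple_paths(adj, src, dst), key=cost)
    for ap in candidates[:k]:
        ap_links = set(links_of(ap))
        ap_risks = set().union(*(risk.get(l, set()) for l in ap_links))
        for bp in candidates:  # cheapest feasible BP first
            if all(l not in ap_links and not (risk.get(l, set()) & ap_risks)
                   for l in links_of(bp)):
                return ap, bp
    return None


# Hypothetical trap topology: the cheapest AP s-a-b-t has no disjoint BP.
adj = {"s": {"a": 1, "b": 4}, "a": {"s": 1, "b": 1, "t": 3},
       "b": {"a": 1, "s": 4, "t": 1}, "t": {"b": 1, "a": 3}}
risk = {frozenset(("s", "a")): {1}, frozenset(("a", "b")): {1}}
print(ksp_srlg(adj, risk, "s", "t", k=1))
print(ksp_srlg(adj, risk, "s", "t", k=2))
```

With K = 1 the search behaves like plain APF and fails on this topology; with K = 2 the second candidate AP admits a disjoint backup.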

Proposed Trap Avoidance (TA) Algorithm
- Similar to KSP: iteratively tests candidate APs to find one that has an SRLG-disjoint BP
- But TA constructs one AP at a time, and modifies it into a new AP for testing only if necessary
- TA uses a more intelligent method to avoid the most “risky” link when modifying the AP
- KSP is oblivious/blind to “bad” links
See Xu et al., IEEE/OSA Journal of Lightwave Technology (JLT), Special Issue on Optical Networks, vol. 21, no. 11, pp. 2683-2693, 2003.

Find a Candidate BP
- All the directed links along the AP are assigned a cost of infinity to prevent them from being used by the BP
- All the remaining links that share at least one SRLG with any link on the AP (including the links along the reversed AP) are assigned a large value M as their cost
- This discourages any shortest-path algorithm from using such “M” links for the candidate BP, but does not forbid it
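The cost assignment can be sketched as below. This reflects one reading of the slide, in which links along the reversed AP always receive cost M; the function name, data layout, and example values are illustrative assumptions.

```python
INF = float("inf")


def candidate_bp_costs(cost, srlgs, ap_links, big_m):
    """Return link costs for the candidate-BP search.

    cost:     {(u, v): c} for every directed link
    srlgs:    {(u, v): set of SRLG ids}; missing links carry no SRLG
    ap_links: directed links along the AP, in order
    big_m:    large finite penalty M (assumption: much larger than any path cost)
    """
    ap = set(ap_links)
    reversed_ap = {(v, u) for (u, v) in ap}
    ap_risks = set().union(*(srlgs.get(link, set()) for link in ap))
    new_cost = {}
    for link, c in cost.items():
        if link in ap:
            new_cost[link] = INF        # never reusable by the BP
        elif link in reversed_ap or srlgs.get(link, set()) & ap_risks:
            new_cost[link] = big_m      # discouraged, but not forbidden
        else:
            new_cost[link] = c
    return new_cost


# Hypothetical example: links s->a and s->b share conduit 1.
cost = {("s", "a"): 1, ("a", "s"): 1, ("a", "t"): 1,
        ("s", "b"): 2, ("b", "t"): 2, ("s", "c"): 2}
srlgs = {("s", "a"): {1}, ("s", "b"): {1}, ("b", "t"): {2}}
print(candidate_bp_costs(cost, srlgs, [("s", "a"), ("a", "t")], big_m=1000))
```

Keeping M finite means a candidate BP that crosses a risky link can still be returned, which signals that the current AP should be modified rather than that no path exists.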

PROtection using MultIple SEgments (PROMISE)

Basic idea of PROMISE
Logically “divide” an AP into several, possibly overlapping, sub-paths called active segments (AS’s), and then protect each AS with a backup segment (BS).
[Figure: an AP divided into AS 1 and AS 2, protected by BS 1 and BS 2 respectively.]

Applications for PROMISE
- First proposed for non-SRLG networks
- Particularly effective in dealing with either real or avoidable traps
  - Recall that traps are more likely in SRLG networks

Most Bandwidth Efficient: I
- Reason: the flexibility it offers in choosing the appropriate AS’s and corresponding BS’s
- AP1 and AP2 are not link disjoint, so their BPs cannot share backup bandwidth
- But in PROMISE, BS1,1 and BS2,1 can share backup bandwidth
[Figure: AP1, divided into AS 1,1 and AS 1,2, and AP2, with backup segments BS 1,1 and BS 2,1.]

Most Bandwidth Efficient: II
- Inter-sharing: sharing between BS’s of different connections
- Intra-sharing: sharing between BS’s of the same connection, e.g., BS1 and BS2 share backup bandwidth on link c
[Figure: an AP through nodes 1-6 divided into AS1, AS2, and AS3, protected by BS1, BS2, and BS3 over links a-g.]

Faster Recovery and More Resilient
- Faster recovery: protects each AS using a shorter BS instead of protecting the entire AP using a longer BP (as in path protection)
- More resilient/robust: tolerates more multiple failures than path protection (with the same or lower bandwidth consumption)
Overall reliability, with each link up with probability x = y = p = q = 0.8:
- Path protection: 1 - (1 - 0.8^2)^2 ≈ 0.87
- PROMISE: (1 - (1 - 0.8)^2)^2 ≈ 0.92
[Figure: AP from s to d through node 2 over links x and y; BS1 (link p) protects x and BS2 (link q) protects y.]
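The two reliability figures can be reproduced with series/parallel arithmetic. The sketch assumes independent link failures and reads 0.8 as the probability that a link is up; the helper names are illustrative.

```python
from math import prod


def series(*r):
    """All elements must work."""
    return prod(r)


def parallel(*r):
    """At least one element must work (independent failures assumed)."""
    return 1.0 - prod(1.0 - x for x in r)


x = y = p = q = 0.8  # probability that each link is up

# Path protection: AP (x then y) backed up end to end by BP (p then q).
path_protection = parallel(series(x, y), series(p, q))
# PROMISE: segment x backed up by p, then segment y backed up by q.
promise = series(parallel(x, p), parallel(y, q))

print(f"path protection: {path_protection:.4f}")
print(f"PROMISE:         {promise:.4f}")
```

Segment protection comes out ahead because a failure of x and a failure of q together break path protection but not PROMISE.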

Other Benefits of PROMISE
- Can succeed when other approaches fail:
  - Routing policies, QoS constraints (e.g., hop limit on the AP and BP), or just APF
  - Real/avoidable traps in SRLG networks
- Can readily be applied to MPLS networks by extending the existing protocols for local repair/recovery in MPLS networks

Key Challenges in PROMISE
- Joint optimization of the AP selection and the set of protecting BS’s is extremely complex
- Even if the AP is found first, as in APF-based heuristics, how should the AP be optimally divided into AS’s (and then the corresponding BS’s)?
- Harder than modeling the general multi-commodity flow problem: the number of BS’s, and the source and destination of each BS, are not known beforehand