Availability of IP/MPLS networks Sanjay Kalra October 2002
Agenda
- Introduction
- How to measure Availability
- Network Design example
- One Router vs. Two Routers
- Software Dependability
- Summary
Definition of Availability
Availability is the probability that an item will be able to perform its designed functions at the stated performance level, within the stated conditions and in the stated environment, when called upon to do so.

$$\text{Availability} = \frac{\text{Reliability}}{\text{Reliability} + \text{Recovery}}$$

11/13/2018
Quantification

Percent Availability   N-Nines   Downtime (Min/Yr, rounded)
99%                    2-Nines   5,000
99.9%                  3-Nines   500
99.99%                 4-Nines   50
99.999%                5-Nines   5
99.9999%               6-Nines   0.5

Notes: To deploy dependable networks and devices, it is important to define a mechanism for quantifying dependability. The "9's" terminology is the most familiar to the industry and is widely used to measure the availability of network devices specifically. The number of 9's implies the amount of inherent downtime. Downtime is typically specified in Telcordia requirements such as GR-1110-CORE, Broadband Switching System (BSS) Generic Requirements. The 9's provide an operational target to which networks and devices can be managed.
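The downtime column can be derived directly from the availability percentage. The following is a minimal Python sketch, assuming a 365-day year (525,600 minutes); the function name is illustrative and does not come from the slides. The exact results are slightly higher than the rounded slide figures (for example, 5,256 rather than 5,000 minutes for 2-Nines).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def downtime_minutes_per_year(availability_percent):
    """Return the expected downtime in minutes per year for a given availability (%)."""
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Walk through the same rows as the table: 2-Nines to 6-Nines.
    for pct in (99.0, 99.9, 99.99, 99.999, 99.9999):
        print(f"{pct}% -> {downtime_minutes_per_year(pct):.2f} min/yr")
```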
PSTN: The Yardstick?
- Individual elements have an availability of 99.99%
- One cut-off call in 8,000 calls (3 min for the average call)
- Five ineffective calls in every 10,000 calls
- PSTN end-to-end availability: 99.94%

[Figure: end-to-end PSTN path between two facility entrances, with unavailability per element: NI 0.005% (each side), AN 0.01% (each side), LE 0.005% (each side), LD 0.02%]

NI: Network Interface; AN: Access Network; LE: Local Exchange; LD: Long Distance

Source: http://www.packetcable.com/downloads/specs/pkt-tr-voipar-v01-001128.pdf
Effect of Services on Network Availability
In an IP network, availability is a function of the service being offered.
Source: www.t1.org
IP Network Expectations
Columns: Service, Delay, Jitter, Loss, Availability (L: Low, M: Medium, H: High)
- Real Time Interactive (VoIP, Cell Relay ..): L H
- Layer 2 & Layer 3 VPNs (FR/Ethernet/AAL5): M
- Internet Service
- Video Services: L L H
The Port Method
- Based on the port count in the network
- Does not take into account the bandwidth of ports (e.g., an OC-192 port and a 64k port each count as one port)
- Good for dedicated access service, because ports are tied to customers

$$\text{Availability (\%)} = \frac{(\text{total ports} \times \text{sample period}) - (\text{impacted ports} \times \text{outage duration})}{\text{total ports} \times \text{sample period}} \times 100$$
The Port Method Example
- Network with 10,000 active access ports
- An access router with 100 access ports fails for 30 minutes
- Total available port-hours = 10,000 * 24 = 240,000
- Total down port-hours = 100 * 0.5 = 50
- Availability for a single day = ((240,000 - 50) / 240,000) * 100 ≈ 99.9792%
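The port-method arithmetic can be sketched in a few lines of Python. This is only an illustration under the slide's assumptions (a one-day sample period, every port weighted equally); the function and parameter names are hypothetical.

```python
def port_availability(total_ports, impacted_ports, outage_hours, sample_hours=24.0):
    """Port-method availability (%) over one sample period.

    Every port counts equally, regardless of its bandwidth.
    """
    total_port_hours = total_ports * sample_hours
    down_port_hours = impacted_ports * outage_hours
    return (total_port_hours - down_port_hours) / total_port_hours * 100.0


if __name__ == "__main__":
    # Slide example: 100 of 10,000 ports are down for 30 minutes in one day.
    print(f"{port_availability(10_000, 100, 0.5):.4f}%")
```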
The Bandwidth Method
- Based on the amount of bandwidth available in the network
- Takes into account the bandwidth of ports
- Good for core routers

$$\text{Availability (\%)} = \frac{(\text{total BW} \times \text{sample period}) - (\text{impacted BW} \times \text{outage duration})}{\text{total BW} \times \text{sample period}} \times 100$$
The Bandwidth Method Example
- Total capacity of the network: 100 Gigabits/sec
- An access router with 1 Gigabit/sec of BW fails for 30 minutes
- Total BW available in the network for a day = 100 * 24 = 2,400 Gigabits/sec-hours
- Total BW lost in the outage = 1 * 0.5 = 0.5 Gigabits/sec-hours
- Availability for a single day = ((2,400 - 0.5) / 2,400) * 100 ≈ 99.9792%
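The bandwidth method differs from the port method only in what is weighted: capacity rather than port count. A minimal Python sketch follows, assuming outages are given as hypothetical (impacted Gbit/s, outage hours) pairs so that several outages in one sample period can be combined; none of these names come from the slides.

```python
def bandwidth_availability(total_gbps, outages, sample_hours=24.0):
    """Bandwidth-method availability (%) over one sample period.

    outages: iterable of (impacted_gbps, outage_hours) pairs.
    """
    total_capacity = total_gbps * sample_hours
    lost_capacity = sum(gbps * hours for gbps, hours in outages)
    return (total_capacity - lost_capacity) / total_capacity * 100.0


if __name__ == "__main__":
    # Slide example: 1 Gbit/s of a 100 Gbit/s network is lost for 30 minutes.
    print(f"{bandwidth_availability(100, [(1, 0.5)]):.4f}%")
```

Because the weighting is by capacity, the loss of one high-speed trunk lowers the result far more than the loss of one low-speed access port, which is why the slides recommend this method for core routers.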
Defects Per Million
Used in PSTN networks; defined as the number of blocked calls per one million calls, averaged over one year.

$$\text{DPM} = \frac{\text{impacted customers} \times \text{outage duration}}{\text{total customers} \times \text{sample period}} \times 10^{6}$$
Defects Per Million Example
- Network with 10,000 active access ports
- An access router with 100 access ports fails for 30 minutes
- Total available port-hours = 10,000 * 24 = 240,000
- Total down port-hours = 100 * 0.5 = 50
- Daily DPM = (50 / 240,000) * 1,000,000 ≈ 208
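DPM is the same ratio as the port method, scaled to one million instead of being subtracted from 100%. A minimal Python sketch with hypothetical names, assuming a one-day sample period as in the example:

```python
def defects_per_million(impacted, outage_hours, total, sample_hours=24.0):
    """Defects per million: impacted unit-hours over total unit-hours, times 10^6."""
    return (impacted * outage_hours) / (total * sample_hours) * 1_000_000


def dpm_to_availability_percent(dpm):
    """Convert a DPM figure to the equivalent availability percentage."""
    return 100.0 - dpm / 10_000.0


if __name__ == "__main__":
    dpm = defects_per_million(100, 0.5, 10_000)
    print(f"DPM = {dpm:.1f}, availability = {dpm_to_availability_percent(dpm):.4f}%")
```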
Calculating Availability: Series
- Multiplicative method: As = E1 x E2 x E3
  .999999 x .999999 x .999991 = .9999890
- Additive method of UA (unavailability):
  .000001 + .000001 + .000009 = .0000110
- Total availability of a system (As) is always lower than that of its least available element.
- One weak link significantly weakens this chain!

Notes: This calculation shows that fewer elements in the system mean more reliability. This is why collapsing the layers out of a PoP makes it a more reliable system. Juniper uses the Markov model to calculate failures: “Markov Model A probability model that uses state transit diagrams to support reliability prediction calculations of complex relationships and dependencies. Is memoryless and uses exponential distribution.”
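The two series methods can be compared with a short Python sketch. It assumes the elements fail independently; the function names are illustrative. The additive method is an approximation that ignores the very small cross terms, so it agrees with the product only to several decimal places.

```python
from math import prod


def series_availability(elements):
    """Multiplicative method: every element must be up, so availabilities multiply."""
    return prod(elements)


def series_availability_additive(elements):
    """Additive method: sum the unavailabilities and subtract from 1 (approximation)."""
    return 1.0 - sum(1.0 - e for e in elements)


if __name__ == "__main__":
    elements = [0.999999, 0.999999, 0.999991]
    print(f"multiplicative: {series_availability(elements):.9f}")
    print(f"additive:       {series_availability_additive(elements):.9f}")
```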
Calculating Availability: Parallel
For 1-out-of-2 redundancy:
- Additive rule: As = E1 + E2 - E1 E2
  As = .999999 + .999999 - (.999999 * .999999)
  As = .999999999999
- Multiplicative rule: As = 1 - [(1 - E1)(1 - E2)]
- Not for parallel systems where both elements are required
- Assumption is that switchover time is zero

Notes: This is a probability calculation, showing one element working when the other fails. It assumes that failures of the two elements are independent, so both are down at the same time only with probability (1 - E1)(1 - E2).

References:
Standard methods:
- MIL-HDBK-217
- Telcordia SR-332
Other methods and databases:
- NSWC-94/L07: Navy mechanical reliability method
- CNET 93 / 98: France Telecom mechanical reliability method
- HRD5: British Telecom mechanical reliability method
- IEEE 1413: New standard for reliability prediction, 1998 rev. 1
- NPRD/EPRD: Reliability Analysis Center (RAC) failure rate data
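The multiplicative rule generalizes naturally to any number of redundant elements. The Python sketch below assumes independent failures and zero switchover time, as the slide does; the function name is hypothetical.

```python
def parallel_availability(elements):
    """1-out-of-N redundancy: the system is down only if every element is down."""
    all_down = 1.0
    for e in elements:
        all_down *= (1.0 - e)
    return 1.0 - all_down


if __name__ == "__main__":
    e1 = e2 = 0.999999
    multiplicative = parallel_availability([e1, e2])
    additive = e1 + e2 - e1 * e2  # additive rule for two elements
    print(f"multiplicative: {multiplicative:.12f}")
    print(f"additive:       {additive:.12f}")
```

For two elements the additive and multiplicative rules are algebraically identical; the multiplicative form is simply easier to extend beyond two.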
System Calculation: Series
Simple E-3 network, with one E-3 trunk

[Figure: series chain of five numbered elements, labeled E-3, ATM, ATM, and Server. Availabilities shown (%): 99.98, 99.99, 99.992, 99.992, 99.95, plus 99.9959 shown five times.]

End-to-end availability: 99.8835%
Yearly downtime = (1 - Availability) * 525,600 minutes/year
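The end-to-end figure can be reproduced with a short Python sketch. It assumes that the first five values in the diagram belong to the numbered elements and that the five 99.9959% values belong to the links between them; that grouping is a reading of the flattened diagram, not something the slide states. The exact product differs from the slide's figure only beyond the fourth decimal place.

```python
from math import prod

MINUTES_PER_YEAR = 525_600

# Assumed grouping of the values shown in the diagram (percent).
NODE_AVAILABILITY_PCT = [99.98, 99.99, 99.992, 99.992, 99.95]
LINK_AVAILABILITY_PCT = [99.9959] * 5


def end_to_end_availability(percentages):
    """Series availability (as a fraction) of elements given in percent."""
    return prod(p / 100.0 for p in percentages)


def yearly_downtime_minutes(availability):
    """Yearly downtime = (1 - Availability) * 525,600 minutes/year."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    a = end_to_end_availability(NODE_AVAILABILITY_PCT + LINK_AVAILABILITY_PCT)
    print(f"availability: {a * 100:.4f}%")
    print(f"downtime:     {yearly_downtime_minutes(a):.0f} min/yr")
```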
System Calculation: Parallel (1)

[Figure: two parallel paths (System 1 and System 2) between the data centre and the customer CPE; labels include Internet Gateway, Data Centre, Core, Edge, ATM Hub, E-3, STM-16, STM-1, Server, and CPE. Per-element availabilities shown range from 99.82% to 99.9932%.]

- System 1 availability: 99.6341%
- System 2 availability: 99.4311%
- S1 & S2 network (parallel): 99.9979%
- Availability, Data Centre to customer CPE: 99.9661%
System Calculation: Parallel (2)

[Figure: revised version of the two-path network; labels include Internet Gateway, Data Center, Core, Edge, NxE-1, E-3, STM-16, STM-1, Server, and CPE. Per-element availabilities shown range from 99.82% to 99.999%.]

- System 1 availability: 99.6958% (was 99.6341%)
- System 2 availability: 99.4828% (was 99.4311%)
- Availability, Data Centre to customer CPE: 99.9974% (was 99.9661%)
- Result: from 3 9's to 4 9's
Router Redundancy
- Typical network designs have 2 routers for:
  - Redundancy
  - Capacity planning
- Redundancy in routers:
  - Power supply
  - Fans
  - Routing engines
  - Switching planes
  - Forwarding plane
- Do we still need two routers, or is one enough?
One Router Versus Two Routers

One router with full internal redundancy (99.99979%):
- Redundant control plane, forwarding plane, power supply, fan, and line card
- Link availability = router availability = 99.99979%

Two routers with no redundancy at the router level (99.99015% each):
- Link availability = parallel system availability = 99.999999%
- HW cost of the two-router configuration is 110% of the one-router configuration

[Figure label: OC-48 LH]
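The two-router figure follows from the parallel rule applied to two identical routers. The Python sketch below reproduces it and converts both options to yearly downtime; it assumes independent failures, zero switchover time, and a 525,600-minute year, and its function names are illustrative only.

```python
MINUTES_PER_YEAR = 525_600


def parallel_pair_pct(availability_pct):
    """Availability (%) of two identical, independent routers in parallel."""
    unavailability = 1.0 - availability_pct / 100.0
    return (1.0 - unavailability ** 2) * 100.0


def downtime_minutes(availability_pct):
    """Expected downtime in minutes per year for an availability given in percent."""
    return (1.0 - availability_pct / 100.0) * MINUTES_PER_YEAR


if __name__ == "__main__":
    one_router = 99.99979                      # full internal redundancy
    two_routers = parallel_pair_pct(99.99015)  # no redundancy at router level
    print(f"one router:  {one_router:.6f}% -> {downtime_minutes(one_router):.3f} min/yr")
    print(f"two routers: {two_routers:.6f}% -> {downtime_minutes(two_routers):.3f} min/yr")
```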
One Router Advantages
- Cost savings
- Lower OPEX
- Faster convergence
- For some PE routers, a single router might be the only option:
  - Service state is maintained on a per-flow basis for some network-based services (e.g., firewall, NAT)
  - TDM links are usually connected to a single edge router
  - A lot of customers terminate on a single router
One Router Disadvantages
- Single point of failure
- Configuration and upgrades have to be exact
- Capacity management has to be exact
- The main cost of a router is the line cards, not the chassis
- What if there is a DoS attack against the router?
One Router Disadvantages
- Physical maintenance (e.g., a location change) is not possible without downtime
- Still need protection against link failure
- Physical separation to protect against natural disasters is not possible
- Networks have always been designed with two routers!
SW to HW Reliability Differences
- Software reliability is not a function of manufacturing
- Software does not degrade over time
- Physical environmental changes have no effect
- All software failures are the result of design or user errors
SW to HW Reliability Differences
- Software can only be repaired by redesign
- MTTR is not measurable, since code must be rewritten to fix a bug
- Software bugs can be highly contagious
- The science of software correctness is still immature and is difficult to apply to software as complex and quickly changing as IP routing
Summary
- No standard way to measure IP availability
- Availability in IP networks depends on the service being offered
- The choice of one vs. two routers depends on requirements
- A lot of development is happening in IP networks to improve availability: Graceful Restart, NSF, Fast Reroute …

Notes: IP dependability is a broad subject, and there are challenges to its implementation, including:
- Conflict between device availability and network availability
  - Devices still fail the single-point-of-failure analysis
  - So device-level availability alone is not enough
  - Network-level considerations (such as VRRP, MPLS fast reroute, etc.) are important
- Conflict between availability and reliability
  - Simple devices are inherently more reliable, but…
  - A system with redundancies, backup, alternate paths, and switching is not inherently simple
  - This means redundant systems require design expertise
- Conflict between cost of downtime and cost of availability
  - Downtime is expensive
  - Available systems and networks can also be expensive
  - Where is the intersection between these two?