Availability of IP/MPLS networks

Sanjay Kalra October 2002

Agenda
- Introduction
- How to measure Availability
- Network Design example
- One Router vs. Two Routers
- Software Dependability
- Summary

Definition of Availability
Availability is the probability that an item will be able to perform its designed functions at the stated performance level, within the stated conditions and in the stated environment, when called upon to do so.
$$\text{Availability} = \frac{\text{Reliability}}{\text{Reliability} + \text{Recovery}}$$
11/13/2018

Quantification
Percent Availability   N-Nines   Downtime (Minutes/Year)
99%                    2-Nines   5,000 Min/Yr
99.9%                  3-Nines   500 Min/Yr
99.99%                 4-Nines   50 Min/Yr
99.999%                5-Nines   5 Min/Yr
99.9999%               6-Nines   0.5 Min/Yr
To deploy dependable networks and devices, it is important to define a mechanism for quantifying dependability. The "9's" terminology is the most familiar to the industry and is widely used to measure the availability of network devices specifically. The number of 9's implies the amount of inherent downtime (the figures above are rounded). Downtime is typically specified in Telcordia requirements such as GR-1110-CORE, Broadband Switching System (BSS) Generic Requirements. The 9's provide an operational target to which networks and devices can be managed.
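
The downtime column can be reproduced from the availability percentage. The following is a minimal sketch, assuming a 365-day year (525,600 minutes); the function name is illustrative and not part of the presentation. It shows that the slide's figures are rounded versions of the exact values.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumption: leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime implied by an availability percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for nines in range(2, 7):
        # 2 nines -> 99%, 3 nines -> 99.9%, and so on
        availability = 100 * (1 - 10 ** -nines)
        minutes = downtime_minutes_per_year(availability)
        print(f"{nines}-Nines: {minutes:.2f} min/yr")
```

The exact value for five nines is about 5.26 minutes per year, which the table rounds to 5.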

PSTN: The Yardstick?
- Individual elements have an availability of 99.99%.
- One cut-off call in 8,000 calls (3 min for average call).
- Five ineffective calls in every 10,000 calls.
- PSTN end-to-end availability: 99.94%.
[Figure: end-to-end PSTN path between two facility entrances, crossing NI, AN, LE and LD segments, labelled with per-segment unavailability values of 0.005%, 0.005%, 0.01%, 0.01%, 0.005%, 0.005% and 0.02%.]
NI: Network Interface; LE: Local Exchange; LD: Long Distance; AN: Access Network
Source :
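
The 99.94% end-to-end figure is consistent with adding up the seven per-segment unavailability values shown on the slide. The sketch below is an illustration under the assumption that all segments are in series and fail independently; the function names are hypothetical.

```python
import math

# Per-segment unavailability values from the slide, in percent.
SEGMENT_UNAVAILABILITY_PERCENT = [0.005, 0.005, 0.01, 0.01, 0.005, 0.005, 0.02]


def additive_availability_percent(unavailability_percent):
    """Approximate series availability by summing unavailabilities."""
    return 100.0 - sum(unavailability_percent)


def exact_availability_percent(unavailability_percent):
    """Exact series availability: product of per-segment availabilities."""
    return 100.0 * math.prod(1 - u / 100.0 for u in unavailability_percent)


if __name__ == "__main__":
    print(f"additive: {additive_availability_percent(SEGMENT_UNAVAILABILITY_PERCENT):.4f}%")
    print(f"exact:    {exact_availability_percent(SEGMENT_UNAVAILABILITY_PERCENT):.4f}%")
```

Both methods round to 99.94%, because the unavailability values are small enough that their cross terms are negligible.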

Effect of Services on Network Availability
In an IP network, availability is a function of the service being offered.
Source :

IP Network Expectations
Columns: Service, Delay, Jitter, Loss, Availability
- Real Time Interactive (VoIP, Cell Relay ..) L H
- Layer 2 & Layer 3 VPNs (FR/Ethernet/AAL5) M
- Internet Service
- Video Services L L H
L: Low; M: Medium; H: High

Agenda
- Introduction
- How to measure Network Availability
- Network Design example
- One Router vs. Two Routers
- Software Dependability
- Summary

The Port Method
- Based on the port count in the network.
- Does not take into account the bandwidth of ports, e.g. OC-192 and 64k are both ports.
- Good for dedicated access service because ports are tied to customers.
$$\text{Availability (\%)} = \frac{(\text{total ports} \times \text{sample period}) - (\text{impacted ports} \times \text{outage duration})}{\text{total ports} \times \text{sample period}} \times 100$$

The Port Method Example
- Network with 10,000 active access ports.
- An Access Router with 100 access ports fails for 30 minutes.
- Total Available Port-Hours = 10,000*24 = 240,000
- Total Down Port-Hours = 100*.5 = 50
- Availability for a Single Day = ((240,000 - 50)/240,000)*100 = 99.979%
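
The port-method arithmetic can be checked with a few lines of code. This is a minimal sketch of the formula from the previous slide; the function and parameter names are illustrative, and the 24-hour sample period is the one used in the example.

```python
def port_availability_percent(total_ports: int,
                              impacted_ports: int,
                              outage_hours: float,
                              sample_hours: float = 24.0) -> float:
    """Port-method availability over one sample period, in percent."""
    total_port_hours = total_ports * sample_hours
    down_port_hours = impacted_ports * outage_hours
    return (total_port_hours - down_port_hours) / total_port_hours * 100.0


if __name__ == "__main__":
    # 10,000 ports; a 100-port access router is down for 30 minutes.
    availability = port_availability_percent(10_000, 100, 0.5)
    print(f"{availability:.3f}%")
```

Note that the result does not depend on the speed of the failed ports, which is the limitation the port method is said to have.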

The Bandwidth Method
- Based on the amount of bandwidth available in the network.
- Takes into account the bandwidth of ports.
- Good for core routers.
$$\text{Availability (\%)} = \frac{(\text{total BW} \times \text{sample period}) - (\text{BW impacted} \times \text{outage duration})}{\text{total BW} \times \text{sample period}} \times 100$$

The Bandwidth Method Example
- Total capacity of network: 100 Gigabits/sec.
- An Access Router with 1 Gigabit/sec of BW fails for 30 minutes.
- Total BW available in network for a day = 100*24 = 2,400 Gigabit/sec-hours
- Total BW lost in outage = 1*.5 = 0.5 Gigabit/sec-hours
- Availability for a Single Day = ((2,400 - 0.5)/2,400)*100 = 99.979%
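
The bandwidth method can be sketched the same way. The code below is an illustration only; the function name and the idea of passing a list of outages are assumptions made for the example, not something the slides specify.

```python
def bandwidth_availability_percent(total_gbps: float,
                                   outages,
                                   sample_hours: float = 24.0) -> float:
    """Bandwidth-method availability in percent.

    `outages` is an iterable of (impacted_gbps, outage_hours) pairs.
    """
    total_capacity = total_gbps * sample_hours            # Gbit/s-hours
    lost_capacity = sum(gbps * hours for gbps, hours in outages)
    return (total_capacity - lost_capacity) / total_capacity * 100.0


if __name__ == "__main__":
    # The slide's case: 1 Gbit/s lost for 30 minutes out of 100 Gbit/s.
    print(f"{bandwidth_availability_percent(100, [(1, 0.5)]):.3f}%")
    # Hypothetical contrast: a 10 Gbit/s element down for the same 30 minutes.
    print(f"{bandwidth_availability_percent(100, [(10, 0.5)]):.3f}%")
```

Unlike the port method, a failure of a faster element lowers the result more, which is why the slides recommend this method for core routers.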

Defects Per Million
Used in PSTN networks, defined as the number of blocked calls per one million calls, averaged over one year.
$$\text{DPM} = \frac{\text{number of impacted customers} \times \text{outage duration}}{\text{total number of customers} \times \text{sample period}} \times 10^{6}$$

Defects Per Million Example
- Network with 10,000 active access ports.
- An Access Router with 100 access ports fails for 30 minutes.
- Total Available Port-Hours = 10,000*24 = 240,000
- Total Down Port-Hours = 100*.5 = 50
- Daily DPM = (50/240,000)*1,000,000 = 208
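
DPM is the same ratio as port-method unavailability, scaled to one million. A minimal sketch, with illustrative function names that are not from the presentation:

```python
def defects_per_million(total_customers: int,
                        impacted_customers: int,
                        outage_hours: float,
                        sample_hours: float = 24.0) -> float:
    """DPM over one sample period: impacted share of customer-hours x 10^6."""
    total = total_customers * sample_hours
    impacted = impacted_customers * outage_hours
    return impacted / total * 1_000_000


def dpm_from_availability(availability_percent: float) -> float:
    """Convert an availability percentage to the equivalent DPM."""
    return (100.0 - availability_percent) * 10_000


if __name__ == "__main__":
    print(round(defects_per_million(10_000, 100, 0.5)))  # the slide's daily DPM
    print(round(dpm_from_availability(99.999)))          # five nines as DPM
```

Under this conversion, five nines corresponds to 10 DPM and the example's single-day availability corresponds to about 208 DPM.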


Calculating Availability: Series
Multiplicative method: $A_s = E_1 \times E_2 \times E_3$
Additive method of UA (unavailability): $UA_s \approx UA_1 + UA_2 + UA_3$, where $UA_i = 1 - E_i$
Total availability of a system ($A_s$) is always less than that of the least available element. One weak link significantly weakens this chain!
This calculation shows that fewer elements in the system means higher availability. This is why collapsing the layers out of a PoP makes it a more reliable system.
Juniper uses the Markov model to calculate failures: "Markov Model: A probability model that uses state transition diagrams to support reliability prediction calculations of complex relationships and dependencies. Is memoryless and uses exponential distribution."
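
The two series methods can be compared numerically. The following is a minimal sketch, assuming independent element failures; the element availabilities (three elements at 99.9% each) are hypothetical values chosen for illustration, since the slide's own numbers are not preserved.

```python
import math


def series_availability(elements):
    """Multiplicative method: product of the element availabilities (0..1)."""
    return math.prod(elements)


def series_availability_additive(elements):
    """Additive method: 1 minus the sum of the element unavailabilities."""
    return 1.0 - sum(1.0 - e for e in elements)


if __name__ == "__main__":
    elements = [0.999, 0.999, 0.999]  # hypothetical availabilities
    exact = series_availability(elements)
    approx = series_availability_additive(elements)
    print(f"multiplicative: {exact:.6f}")
    print(f"additive:       {approx:.6f}")
    print(f"least available element: {min(elements):.6f}")
```

The additive result is slightly lower than the multiplicative one, and both fall below the least available element, which is the "weak link" point made on the slide.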

Calculating Availability: Parallel
For 1-out-of-2 redundancy:
Additive Rule: $A_s = E_1 + E_2 - E_1 E_2$
Multiplicative Rule: $A_s = 1 - [(1 - E_1)(1 - E_2)]$
- Not for parallel systems where both elements are required.
- Assumption is that switchover time is zero.
This is a probability calculation: the system is up when at least one element is working, so it is down only when both elements are down at the same time. It assumes that the two elements fail independently.
References:
Standard methods:
- MIL-HDBK-217
- Telcordia SR-332
Other methods and databases:
- NSWC-94/L07: Navy mechanical reliability method
- CNET 93 / 98: France Telecom mechanical reliability method
- HRD5: British Telecom mechanical reliability method
- IEEE 1413: New standard for reliability prediction, 1998 rev. 1
- NPRD/EPRD: Reliability Analysis Center (RAC) failure rate data
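
The additive and multiplicative rules for 1-out-of-2 redundancy are algebraically the same expression. A minimal sketch that checks this numerically, assuming independent failures and zero switchover time as stated on the slide; the function names and the 99% element availabilities are hypothetical.

```python
def parallel_additive(e1: float, e2: float) -> float:
    """Additive rule: As = E1 + E2 - E1*E2."""
    return e1 + e2 - e1 * e2


def parallel_multiplicative(e1: float, e2: float) -> float:
    """Multiplicative rule: As = 1 - (1 - E1)(1 - E2)."""
    return 1.0 - (1.0 - e1) * (1.0 - e2)


if __name__ == "__main__":
    e1 = e2 = 0.99  # hypothetical element availabilities
    print(f"additive:       {parallel_additive(e1, e2):.6f}")
    print(f"multiplicative: {parallel_multiplicative(e1, e2):.6f}")
```

With these hypothetical inputs, two 2-nines elements in parallel give a 4-nines system, which is the kind of gain the network design example relies on.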

System Calculation: Series
Simple E-3 network, with one E-3 trunk.
[Figure: five elements in series (1 to 5) between the E-3 access and the server, including two ATM nodes, with availabilities of 99.98%, 99.99%, 99.992%, 99.992% and 99.95%.]
Yearly downtime = (1 - Availability) * minutes/year
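
Applying the series rule and the yearly downtime formula to the five availability values on this slide gives the end-to-end figure. This is a sketch under the assumption that all five elements are in series and fail independently, with a 365-day year; the names are illustrative.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: leap years ignored

# Element availabilities from the slide, in percent.
ELEMENT_AVAILABILITY_PERCENT = [99.98, 99.99, 99.992, 99.992, 99.95]


def series_availability(availability_percent):
    """Return series availability as a fraction between 0 and 1."""
    return math.prod(a / 100.0 for a in availability_percent)


def yearly_downtime_minutes(availability: float) -> float:
    """Yearly downtime = (1 - Availability) * minutes/year."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    a_s = series_availability(ELEMENT_AVAILABILITY_PERCENT)
    print(f"system availability: {a_s * 100:.3f}%")
    print(f"yearly downtime:     {yearly_downtime_minutes(a_s):.0f} minutes")
```

The chain comes out at roughly 99.904%, or a little over 500 minutes of downtime a year, which is lower than any single element in it.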

System Calculation: Parallel (1)
[Figure: two parallel systems (S1 and S2) between the Data Centre and the customer CPE. Elements shown: Internet Gateway, Server, Data Centre, Core, Edge, ATM Hub, CPE. Links shown: STM-16, STM-1, E-3. Availability values shown: 99.95, 99.975, 99.82.]
Quantities computed: System 1 availability, System 2 availability, and availability of the combined S1 & S2 network from Data Centre to customer CPE (%).

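System Calculation: Parallel (2)
[Figure: the same two parallel systems between the Data Center and the customer CPE, now with NxE-1 access. Elements shown: Internet Gateway, Server, Data Center, Core, Edge, CPE. Links shown: STM-16, STM-1, E-3, NxE-1. Availability values shown: 99.999, 99.975, 99.82.]
Quantities compared with their previous values: System 1 availability, System 2 availability, and availability from Data Centre to customer CPE (%).
Result: availability improves from 3 9's to 4 9's.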

Router Redundancy
- Typical network designs have 2 routers for:
  - Redundancy
  - Capacity planning
- Redundancy in routers:
  - Power supply
  - Fans
  - Routing engines
  - Switching planes
  - Forwarding plane
- Do we still need two routers, or is one enough?

One Router Versus Two Routers
[Figure: two designs compared over OC-48 LH links.]
- One router with full internal redundancy (redundant control plane, forwarding plane, power supply, fan; line card and link shown separately): Link Availability = Router Availability.
- Two routers with no redundancy at router level: Link Availability = Parallel System Availability.
- HW cost of the two-router configuration is 110% of the one-router configuration.

One Router Advantages
- Cost savings
- Lower OPEX
- Faster convergence
- For some PE routers, a single router might be the only option:
  - Service state is maintained on a per-flow basis for some network-based services (e.g. Firewall, NAT)
  - TDM links are usually connected to a single edge router
  - A lot of customers terminate on a single router

One Router Disadvantages
Single Point of failure Configuration and Upgrade has to be exact Capacity Management has to be exact Main cost of a router is line cards and not chassis What if there is a DOS attack against the router ? 11/13/2018

One Router Disadvantages (continued)
- Physical maintenance is not possible without downtime (location change)
- Still need protection against link failure
- Physical separation to protect against natural disasters is not possible
- Networks have always been designed with two routers!

SW to HW Reliability Differences
- Software reliability is not a function of manufacturing
- Software does not degrade over time
- Physical environmental changes have no effect
- All software failures are the result of design/user errors

SW to HW Reliability Differences (continued)
- Software can only be repaired by redesign
- MTTR is not measurable, since code must be rewritten to fix a bug
- Software bugs can be highly contagious
- The science of software correctness is still immature and is difficult to apply to software as complex and quickly changing as IP routing

Summary
- No standard way to measure IP availability
- Availability in IP networks depends on the service being offered
- One vs. two routers choice depends on requirements
- A lot of development is happening in IP networks to improve availability: Graceful Restart, NSF, Fast Reroute …
IP dependability is a broad subject and there are challenges to its implementation, including:
- Conflict between device availability and network availability.
  - Devices still fail the single point of failure analysis
  - So, device-level availability alone is not enough
  - Network-level considerations (such as VRRP, MPLS fast reroute, etc.) are important
- Conflict between availability and reliability.
  - Simple devices are inherently more reliable, but…
  - A system with redundancies, backup, alternate paths and switching is not inherently simple
  - This means redundant systems require design expertise
- Conflict between cost of downtime and cost of availability.
  - Downtime is expensive
  - Available systems and networks can also be expensive
  - Where is the intersection between these two?
