Failure Analysis of the PSTN: 2000. Patricia Enriquez, Mills College, Oakland, California. Mentors: Aaron Brown, David Patterson.

Presentation transcript:

Approach
Find the areas that are failing, then try to fix or address the problems.
Use the PSTN as a case study for ROC (Recovery-Oriented Computing):
- Large, widely used, networked system
- Highly reliable infrastructure
- Provides an upper limit for reliable computer service (the best case)
Ideal: computers should be as reliable as the telephone network.

Collecting Failure Data
- Target system: US Public Switched Telephone Network (PSTN)
- Detailed telephone service failure data is available from the Federal Communications Commission (FCC)
- Telephone disruption reports: company name, duration, time, cause, and event disruption
- Required by law for outages affecting 30,000 people or lasting at least 30 minutes

Outage Report (fields shown): Date, Place, Explanation, Number of Customers Affected, Company, Time, Duration, Blocked Calls, Cause
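For analysis, the report fields on this slide could be held in a small record type. The sketch below is only an illustration: the class name `OutageReport`, the field names and types, and the sample values are all assumptions, not the FCC's report format or real data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OutageReport:
    """One disruption report. Field names are illustrative, not the FCC's."""
    date: str                      # date of the outage
    place: str
    company: str
    time: str                      # start time as reported
    duration_minutes: float
    customers_affected: int
    blocked_calls: Optional[int]   # None when the report gives no figure
    cause: str
    explanation: str = ""


# Hypothetical sample record, invented for illustration only.
report = OutageReport(
    date="2000-01-01", place="Example City", company="Example Telco",
    time="09:30", duration_minutes=45.0, customers_affected=30000,
    blocked_calls=None, cause="Hardware Failure",
)
```

Keeping `blocked_calls` optional reflects the assumption that not every report carries a blocked-call figure.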

Causes of Failure: Human Error, Acts of Nature, Hardware Failure, Software Failure, Call Overloads, Vandalism

Categorizing the Failures
- Human Error: company workers (includes contractors and vendors); external
- Acts of Nature: fire, rain, lightning, winds, floods
- Hardware Failure: network component failure; cable or power outage
- Software Failure: corrupt or incorrect communication software
- Call Overloads: over network capacity
- Vandalism: intentional harm to telephone network equipment

Categorization Challenges
Outages may have multiple causes.
Terminology:
- Root cause: the cause behind the outage
- Direct cause: the immediate trigger
Example: the root cause is a latent error in software; the direct cause is a maintenance error (human).

Outage Breakdown by Number
[Chart. Categories: Human-company, Human-external, Hardware Failure, Software Failure, Overload, Vandalism, Acts of Nature]
Total: 202 outages
Human error accounts for 55% of the outages for 2000.
Vandalism accounts for less than 1%.

Eliminating Nature
Nature has a mind of its own and cannot be controlled. It does not relate directly to contained computer systems.

Outage Breakdown by Number (Nature Factored Out)
[Chart. Categories: Human-company, Human-external, Hardware Failure, Software Failure, Overload, Vandalism]
Total: 187 outages
Human error accounts for 59% of all outages.
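The two totals in the slides (202 outages with nature, 187 without) imply that 15 outages were attributed to acts of nature. Dropping a category and recomputing each cause's share can be sketched as below. The function name `breakdown` and the toy labels are assumptions for illustration; the toy list is not the 2000 data.

```python
from collections import Counter


def breakdown(causes, exclude=()):
    """Return {cause: percent of outages}, ignoring any excluded causes."""
    kept = [c for c in causes if c not in exclude]
    total = len(kept)
    if total == 0:
        return {}
    counts = Counter(kept)
    return {cause: 100.0 * n / total for cause, n in counts.items()}


# Toy cause labels, invented for illustration; not the real outage data.
toy = ["Human", "Human", "Hardware", "Nature", "Software"]
with_nature = breakdown(toy)                          # Human is 2 of 5
without_nature = breakdown(toy, exclude={"Nature"})   # Human is 2 of 4
```

Removing a category shrinks the denominator, so every remaining cause's share rises even though its count is unchanged, which is the effect seen when the human error share moves from 55% to 59%.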

What could humans possibly do wrong?
- Cut incorrect cables
- Upgrade software incorrectly
- Incorrectly repair hardware
- Follow instructions incorrectly
- Fail to read documentation
- Do things out of order

Measuring Availability
Number of outages: this only counts the outages and does not include their duration.
There is more important information than simply the number of outages:
- Outage duration
- Customers affected
- Blocked calls

A Second Metric: Customer Minutes
Customer minutes = outage duration in minutes * customers affected
- Captures the collective customer experience
- Assumes all affected customers or lines attempted to make a call
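The customer-minutes definition on this slide is a simple product, summed over outages. A minimal sketch, with a hypothetical function name and made-up outage values:

```python
def customer_minutes(duration_minutes, customers_affected):
    """Outage duration in minutes multiplied by the customers affected."""
    return duration_minutes * customers_affected


# Hypothetical outages as (duration in minutes, customers affected) pairs.
outages = [(30, 30000), (120, 5000)]
total = sum(customer_minutes(d, c) for d, c in outages)
```

In this toy case the short outage affecting many customers (900,000 customer minutes) outweighs the longer outage affecting few (600,000), which a plain count of outages would not show.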

Number of Outages vs. Customer Minutes
Total: 187 outages
Total: about 95 million customer minutes/year
Humans = 54%

A Better Metric: Blocked Calls
- The number of calls that are interrupted during a service disruption
- Exact or estimated values, reported by the company on the disruption report
- Measures how many "service items", or calls, were interrupted
- Does not assume customer use, as customer minutes do

Blocked Calls
[Chart. Categories: Human-company, Human-external, Hardware Failure, Software Failure, Overload, Vandalism]
Human error accounts for 59% of all blocked calls.
Blocked calls are counted only for pertinent reports.
Vandalism: 0.5%
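A share of blocked calls per cause can be computed by summing blocked calls within each category and dividing by the grand total. The sketch below assumes that "pertinent reports" means reports that include a blocked-call figure, so reports without one are skipped. The function name and the toy values are hypothetical.

```python
from collections import defaultdict


def blocked_call_shares(reports):
    """reports: iterable of (cause, blocked_calls or None) pairs."""
    totals = defaultdict(int)
    for cause, blocked in reports:
        if blocked is None:      # assumption: no figure means not pertinent
            continue
        totals[cause] += blocked
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {cause: 100.0 * n / grand_total for cause, n in totals.items()}


# Toy reports, invented for illustration; not the real data.
toy_reports = [("Human", 600), ("Hardware", 300),
               ("Human", None), ("Overload", 100)]
shares = blocked_call_shares(toy_reports)
```

Unlike customer minutes, this weighting relies on calls that were actually attempted, so it does not assume every affected customer tried to call.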

Summary
Humans were the greatest cause of failure; humans caused most of the outages.
What are the trends?
Richard Kuhn reported on failures in the phone system (Sources of Failure in the Public Switched Telephone Network) and also found humans to be the biggest problem.

Trends in Customer Minutes
[Table. Columns: Cause, Trend, Minutes, Minutes (millions of customer minutes/month)]
[Rows: Human Error: company; Human Error: external; Hardware 4960; Software; Overload 3142; Vandalism 52]
"Traditional Computing concentrates on tolerating hardware and operating system faults, ignoring faults by human operators…" (David Patterson, 2001)

A New Perspective
Outages may be caused by multiple components, in combinations of a few or several. Each component plays a key role in bringing the outage about.

Components Of Failure

Future Work
- Directly apply the data to the ROC project: could the ROC techniques have avoided these outages?
- Further categorize the data, with more specific categories within each general category: telephone company, geographic location
- Break down human error further: vendors, contractors, technicians, outsiders…
- Include more years of outages for further comparison

For Further Information: Patricia Enriquez