1 Fault Tolerant Computing Basics Dan Siewiorek Carnegie Mellon University June 2012.

Presentation transcript:

1 Fault Tolerant Computing Basics Dan Siewiorek Carnegie Mellon University June 2012

2 Preview
• Many terms have multiple usages that can lead to confusion when used out of context
• Sources of error
• Faults go through at least ten stages from inception to repair, so the designer had better plan for all ten stages
• Relationship between the sequence of events in handling a fault and mathematical measures

3 Outline
• Introduction
• Definitions
• Sources of Errors

4 Introduction

5 WHY RELIABILITY?
• Three of the driving factors:
  – Critical applications
    – Computer outage or error can cause loss of money, time, life
    – No longer just in aerospace, but in more mundane applications: customer expectations
  – Increasing system complexity
    – More components → more likelihood of failure (countered by the increased reliability of VLSI)
    – Lower signal/noise ratios at higher VLSI speeds → more likelihood of transient errors
    – Diagnosis is more difficult, downtime is longer, repair costs rise, and inventory costs increase too
  – Relative cost is less

6 AVAILABILITY EXAMPLE
• 90 MINUTES DOWNTIME PER WEEK
• AVAILABILITY
• RESERVATION SYSTEM -- $36,000/MINUTE DOWN
• $3.24 MILLION PER WEEK
• 0.1% AVAILABILITY = 10 MINUTES = $360,000
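The arithmetic on this slide can be checked with a short Python sketch. It assumes a 7-day, 24-hour week (10,080 minutes); the function and variable names are illustrative and do not come from the slides.

```python
# Sketch of the slide's availability arithmetic.
# Assumption: a week is 7 * 24 * 60 = 10,080 minutes of scheduled service.
MINUTES_PER_WEEK = 7 * 24 * 60


def availability(downtime_min, period_min=MINUTES_PER_WEEK):
    """Fraction of the period during which the system was up."""
    return 1 - downtime_min / period_min


def outage_cost(downtime_min, cost_per_min):
    """Money lost for a given number of minutes of downtime."""
    return downtime_min * cost_per_min


downtime = 90              # minutes of downtime per week (from the slide)
cost_per_minute = 36_000   # dollars per minute down (from the slide)

weekly_availability = availability(downtime)
weekly_cost = outage_cost(downtime, cost_per_minute)
tenth_percent_minutes = 0.001 * MINUTES_PER_WEEK  # 0.1% of a week, in minutes

print(f"availability: {weekly_availability:.4%}")
print(f"weekly outage cost: ${weekly_cost:,}")
print(f"0.1% of a week: {tenth_percent_minutes:.2f} minutes")
```

Under that assumption, 0.1% of a week is 10.08 minutes, which the slide rounds to 10 minutes, or $360,000 at the stated rate.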

7 Univac I Checkers
• Parity
  – Memory
  – Input to function table
  – Output from function table, odd number of selected gates. Dummy lines preserve parity
  – Unitypes
• 1-of-n
  – Intermediate line function table
  – Memory bank select

8 Univac I Checkers (cont’d)
• Duplication
  – Registers
  – Adder
  – Comparator
  – Multiplier-quotient coupler
  – Bus amplifier
  – Bus interface
• Automatic voltage monitoring system tests every DC voltage at a rate of one per minute
• “720 checker” counts 720 characters per I/O block

9 Modern Microprocessor checkers

10

11

12 DEFINITIONS & THE LIFE OF A FAULT

13 Definitions
• RELIABILITY: SURVIVAL PROBABILITY
  – When repair is costly or function is critical
• AVAILABILITY: THE FRACTION OF TIME A SYSTEM MEETS ITS SPECIFICATION
  – When service can be delayed or denied
• REDUNDANCY: EXTRA HARDWARE, SOFTWARE, TIME

14 Stages in the development of a system

STAGE                  | ERROR SOURCES         | ERROR DETECTION
Specification & design | Algorithm design      | Simulation
                       | Formal specification  | Consistency checks, model checking
Prototype              | Algorithm design      | Stimulus/response testing
                       | Wiring & assembly     |
                       | Timing                |
                       | Component failure     |
Manufacture            | Wiring & assembly     | System testing
                       | Component failure     | Diagnostics
Installation           | Assembly              | System testing
                       | Component failure     | Diagnostics
Field operation        | Component failure     | Diagnostics
                       | Operator errors       |
                       | Environmental factors |

15 Cause-effect sequence
• FAILURE: component does not provide service
• FAULT: deviation of logic function from design value
  – Hard, Transient
• ERROR: manifestation of a fault by incorrect value

16 Fault Classification
• DURATION:
  – Transient: design errors, environment
  – Intermittent: repair by replacement
  – Permanent: repair by replacement
• EXTENT:
  – Local (independent)
  – Distributed (related)
• VALUE:
  – Determinate (stuck at X)
  – Indeterminate (variable)

17 Basic Steps in Fault Handling
• Fault Confinement -- contain it before it can spread
• Fault Detection -- find out about it to prevent acting on bad data
• Fault Masking -- mask effects
• Retry -- since most problems are transient, just try again
• Diagnosis -- figure out what went wrong as prelude to correction
• Reconfiguration -- work around a defective component
• Recovery -- resume operation after reconfiguration in degraded mode
• Restart -- re-initialize (warm restart; cold restart)
• Repair -- repair defective component
• Reintegration -- after repair, go from degraded to full operation
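A minimal Python sketch of how three of these steps (detection, retry, reconfiguration) might fit together in software. Everything named here (`DetectedFault`, `run_with_fault_handling`, `make_flaky`) is hypothetical and only illustrates the idea; the slides do not prescribe an implementation.

```python
class DetectedFault(Exception):
    """Raised when a (hypothetical) checker detects a fault."""


def run_with_fault_handling(primary, spares=(), max_retries=3):
    """Try the primary unit, retrying on detected faults, then fall back to spares."""
    for unit in (primary, *spares):        # reconfiguration: move on to the next unit
        for _ in range(max_retries):       # retry: most faults are transient
            try:
                return unit()              # recovery: use the result of the working unit
            except DetectedFault:          # detection: do not act on bad data
                continue
    raise RuntimeError("no working unit left; repair is required")


def make_flaky(failures_before_success, result):
    """Build a test operation that faults a fixed number of times, then succeeds."""
    state = {"remaining": failures_before_success}

    def operation():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise DetectedFault("simulated transient fault")
        return result

    return operation


# A transient fault clears within the retry budget.
print(run_with_fault_handling(make_flaky(2, "primary")))
# A longer-lasting fault exhausts the retries, so the spare takes over.
print(run_with_fault_handling(make_flaky(5, "primary"), spares=[make_flaky(0, "spare")]))
```

The sketch treats an exception as the detection signal and a list of callables as spare units; confinement, masking, diagnosis, restart, repair, and reintegration are outside its scope.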

18 MTBF -- MTTD -- MTTR
$\text{Availability} = \dfrac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}$
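As a quick illustration of the formula, the following Python sketch computes availability from MTTF and MTTR. The function name and the numeric inputs are made up for the example and are not taken from the slides.

```python
def steady_state_availability(mttf_hours, mttr_hours):
    """Availability = MTTF / (MTTF + MTTR), with both times in the same unit."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system: fails on average every 1,000 hours, takes 10 hours to repair.
print(f"{steady_state_availability(1000, 10):.4f}")
# Halving the repair time improves availability without changing the failure rate.
print(f"{steady_state_availability(1000, 5):.4f}")
```

The second call shows why repair time matters as much as failure rate: shortening MTTR raises availability even when MTTF is unchanged.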

19 Error Containment Levels
• For distributed systems there are additional levels
  – Containment to a single node or FTU
  – Containment to a single bus or subsystem
  – Containment to a single vehicle/piece of equipment in a national infrastructure

20 Sources of Errors

21 “Mainframe” Outage Sources (* the sum of these sources was 0.75)

22 Summary of Tandem Reported System Outage Data

Customers        |          |          |
Outage Customers |          |          |
Systems          |          |          |
Processors       | 7,000    | 15,000   | 25,500
Discs            | 16,000   | 46,000   | 74,000
Reported Outages |          |          |
System MTBF      | 8 years  | 20 years | 21 years

23 Tandem Causes of System Failures (Up is good; down is bad)

24 Tandem Hardware Causes of Outage
• Disks 49%
• Communications 24%
• Processors 18%
• Timing 9%
• Spares 1%

25 Tandem Operations Causes of Outage
• Procedures 42%
• Configurations 39%
• Move 13%
• Overflow 4%
• Upgrade 1%

26 Tandem Maintenance Causes of Outage
• Disk 67%
• Communication 20%
• Processor 13%

27 Tandem Environmental Outages
• Extended Power Loss 80%
• Earthquake 5%
• Flood 4%
• Fire 3%
• Lightning 3%
• Halon Activation 2%
• Air Conditioning 2%
• Total MTBF about 20 years
• MTBAoG* about 100 years
  – Roadside highway equipment will be more exposed than this
* (AoG = “Act Of God”)

28 CMU Andrew File Server Study
• Configuration
  – 13 SUN II Workstations with processor
  – 4 Fujitsu Eagle Disk Drives
• Observations
  – 21 Workstation Years
• Frequency of events
  – Permanent Failures 29
  – Intermittent Faults 610
  – Transient Faults 446
  – System Crashes 298
• Mean Time To
  – Permanent Failures 6552 hours
  – Intermittent Faults 58 hours
  – Transient Faults 354 hours
  – System Crash 689 hours

29 Some Interesting Ratios
• Permanent Outages/Total Crashes = 0.1
• Intermittent Faults/Permanent Failures = 21
  – Thus first symptom appears over 1200 hours prior to repair
• (Crashes - Permanent)/Total Faults =
• 14/29 failures had three or fewer error log entries
  – 8/29 had no error log entries
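The first two ratios can be reproduced from the event counts on the previous slide. The Python sketch below does so, and also shows one plausible reading of the "over 1200 hours" figure: about 21 intermittent faults per permanent failure, spaced roughly 58 hours apart. That reading is an assumption, not something the slides state, and the variable names are illustrative.

```python
# Event counts and mean time taken from the CMU Andrew File Server slide.
permanent_failures = 29
intermittent_faults = 610
system_crashes = 298
mean_hours_between_intermittent = 58

permanent_per_crash = permanent_failures / system_crashes
intermittent_per_permanent = intermittent_faults / permanent_failures

# Assumed interpretation: the intermittent faults preceding a permanent failure
# arrive about 58 hours apart, so the first one appears roughly this long before repair.
lead_time_hours = round(intermittent_per_permanent) * mean_hours_between_intermittent

print(f"permanent outages / total crashes: {permanent_per_crash:.2f}")
print(f"intermittent faults / permanent failures: {intermittent_per_permanent:.1f}")
print(f"estimated warning time before repair: {lead_time_hours} hours")
```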