1 Fault Tolerant Computing Basics Dan Siewiorek Carnegie Mellon University June 2012.

2 Preview
- Many terms have multiple usages, which can lead to confusion when they are used out of context
  - Sources of error
- Faults go through at least ten stages from inception to repair, so the designer had better plan for all ten stages
  - Relationship between the sequence of events in handling a fault and mathematical measures

3 Outline
- Introduction
- Definitions
- Sources of Errors

4 Introduction

5 WHY RELIABILITY?
- Three of the driving factors:
  - Critical applications
    - Computer outage or error can cause loss of money, time, or life
    - No longer just in aerospace, but in more mundane applications: customer expectations
  - Increasing system complexity
    - More components, hence more likelihood of failure (counter: increased reliability of VLSI)
    - Lower signal/noise ratios at increased VLSI speed, hence more likelihood of transient errors
    - Diagnosis is more difficult, downtime is longer, repair costs increase; inventory costs increase too
  - Relative cost is less

6 AVAILABILITY EXAMPLE
- 90 MINUTES DOWNTIME PER WEEK
- AVAILABILITY: 0.991
- RESERVATION SYSTEM: $36,000/MINUTE DOWN
- $3.24 MILLION PER WEEK
- 0.1% AVAILABILITY = 10 MINUTES = $360,000
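
The figures on this slide can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming a 7-day, 24-hour week (10,080 minutes); the function and variable names are illustrative and do not come from the slides.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes in a week


def weekly_availability(downtime_minutes):
    """Fraction of the week during which the system is up."""
    return 1 - downtime_minutes / MINUTES_PER_WEEK


def downtime_cost(downtime_minutes, cost_per_minute):
    """Cost of an outage at a flat per-minute rate."""
    return downtime_minutes * cost_per_minute


availability = weekly_availability(90)            # about 0.991
weekly_cost = downtime_cost(90, 36_000)           # 3,240,000 dollars
tenth_percent_minutes = MINUTES_PER_WEEK / 1000   # 0.1% of a week, about 10 minutes
tenth_percent_cost = downtime_cost(10, 36_000)    # 360,000 dollars

print(f"availability     = {availability:.3f}")
print(f"weekly cost      = ${weekly_cost:,}")
print(f"0.1% of a week   = {tenth_percent_minutes:.2f} minutes")
print(f"cost of 10 min   = ${tenth_percent_cost:,}")
```

The slide rounds 0.1% of a week (10.08 minutes) down to 10 minutes before pricing it.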

7 Univac I Checkers
- Parity
  - Memory
  - Input to function table
  - Output from function table, odd number of selected gates. Dummy lines preserve parity
  - Unitypes
- 1-of-n
  - Intermediate line function table
  - Memory bank select

8 Univac I Checkers (cont'd)
- Duplication
  - Registers
  - Adder
  - Comparator
  - Multiplier-quotient coupler
  - Bus amplifier
  - Bus interface
- Automatic voltage monitoring system tests every DC voltage at a rate of one per minute
- "720 checker" counts 720 characters per I/O block
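
The three checker families named for the Univac I (parity, 1-of-n, and duplication with comparison) can be illustrated in software. This is only a sketch of the general techniques, not a model of the Univac I circuitry; the choice of odd parity over a whole word and all function names are assumptions made for illustration.

```python
def odd_parity_ok(bits):
    """Parity check: the word (data bits plus parity bit) must contain
    an odd number of 1s. Odd parity is assumed here for illustration."""
    return sum(bits) % 2 == 1


def one_of_n_ok(lines):
    """1-of-n check: exactly one select line may be active."""
    return sum(lines) == 1


def duplicate_and_compare(unit_a, unit_b, *args):
    """Duplication check: run two copies of a unit and compare results.
    A mismatch signals an error in one of the copies."""
    result_a = unit_a(*args)
    result_b = unit_b(*args)
    if result_a != result_b:
        raise RuntimeError(f"duplication mismatch: {result_a!r} != {result_b!r}")
    return result_a


def adder(x, y):
    return x + y


def faulty_adder(x, y):
    # Hypothetical faulty copy: lowest result bit stuck at 1.
    return (x + y) | 1


print(odd_parity_ok([1, 0, 1, 1]))                 # three 1s
print(one_of_n_ok([0, 0, 1, 0]))                   # one active line
print(duplicate_and_compare(adder, adder, 2, 2))   # copies agree
```

Note that duplication detects a mismatch but cannot tell which copy is wrong; locating the faulty unit is the job of the later diagnosis step.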

9 Modern Microprocessor Checkers

12 DEFINITIONS & THE LIFE OF A FAULT

13 Definitions
- RELIABILITY: SURVIVAL PROBABILITY
  - When repair is costly or function is critical
- AVAILABILITY: THE FRACTION OF TIME A SYSTEM MEETS ITS SPECIFICATION
  - When service can be delayed or denied
- REDUNDANCY: EXTRA HARDWARE, SOFTWARE, TIME

14 Stages in the development of a system

STAGE                   | ERROR SOURCES          | ERROR DETECTION
Specification & design  | Algorithm design       | Simulation
                        | Formal specification   | Consistency checks, model checking
Prototype               | Algorithm design       | Stimulus/response testing
                        | Wiring & assembly      |
                        | Timing                 |
                        | Component failure      |
Manufacture             | Wiring & assembly      | System testing
                        | Component failure      | Diagnostics
Installation            | Assembly               | System testing
                        | Component failure      | Diagnostics
Field operation         | Component failure      | Diagnostics
                        | Operator errors        |
                        | Environmental factors  |

15 Cause-effect sequence
- FAILURE: component does not provide service
- FAULT: deviation of logic function from design value
  - Hard, Transient
- ERROR: manifestation of a fault by incorrect value

16 Fault Classification
- DURATION:
  - Transient: design errors, environment
  - Intermittent: repair by replacement
  - Permanent: repair by replacement
- EXTENT:
  - Local (independent)
  - Distributed (related)
- VALUE:
  - Determinate (stuck at X)
  - Indeterminate (variable)

17 Basic Steps in Fault Handling
- Fault Confinement: contain it before it can spread
- Fault Detection: find out about it to prevent acting on bad data
- Fault Masking: mask effects
- Retry: since most problems are transient, just try again
- Diagnosis: figure out what went wrong as prelude to correction
- Reconfiguration: work around a defective component
- Recovery: resume operation after reconfiguration in degraded mode
- Restart: re-initialize (warm restart; cold restart)
- Repair: repair defective component
- Reintegration: after repair, go from degraded to full operation
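
A few of these steps can be sketched in code to show how they chain together: detection triggers a retry, and if retries are exhausted the fault is treated as non-transient and the system reconfigures around it and recovers in degraded mode. This is a minimal sketch under stated assumptions: `FaultStage`, `DetectedFault`, `call_with_retry_then_spare`, and `make_flaky` are hypothetical names, and the spare-based fallback is just one possible form of reconfiguration.

```python
from enum import Enum, auto


class FaultStage(Enum):
    """The ten fault-handling steps listed on the slide."""
    CONFINEMENT = auto()
    DETECTION = auto()
    MASKING = auto()
    RETRY = auto()
    DIAGNOSIS = auto()
    RECONFIGURATION = auto()
    RECOVERY = auto()
    RESTART = auto()
    REPAIR = auto()
    REINTEGRATION = auto()


class DetectedFault(Exception):
    """Raised by a (hypothetical) checker when it detects an error."""


def call_with_retry_then_spare(primary, spare, max_retries=2):
    """Run `primary`; retry on detected faults; fall back to `spare`.

    Returns (result, stages), where `stages` records the steps exercised.
    """
    stages = []
    for attempt in range(1 + max_retries):
        try:
            return primary(), stages
        except DetectedFault:
            stages.append(FaultStage.DETECTION)
            if attempt < max_retries:
                stages.append(FaultStage.RETRY)
    # Retries exhausted: assume the fault is not transient.
    stages.append(FaultStage.DIAGNOSIS)
    stages.append(FaultStage.RECONFIGURATION)  # switch to the spare
    result = spare()
    stages.append(FaultStage.RECOVERY)         # resume in degraded mode
    return result, stages


def make_flaky(failures):
    """Build an operation that raises DetectedFault `failures` times, then succeeds."""
    remaining = [failures]

    def operation():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise DetectedFault("checker mismatch")
        return "ok"

    return operation


transient_result, transient_stages = call_with_retry_then_spare(
    make_flaky(1), lambda: "ok (degraded)")
hard_result, hard_stages = call_with_retry_then_spare(
    make_flaky(5), lambda: "ok (degraded)")
print(transient_result, [s.name for s in transient_stages])
print(hard_result, [s.name for s in hard_stages])
```

Repair and reintegration are left out because they normally involve a human or an out-of-band process rather than an in-line code path.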

18 MTBF -- MTTD -- MTTR
Availability = MTTF / (MTTF + MTTR)
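
The availability formula translates directly into a small function. The sketch below uses hypothetical MTTF and MTTR values chosen only to make the arithmetic easy to follow; they are not taken from the slides.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability: MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical example: a failure every 999 hours on average, 1 hour to repair.
example = availability(mttf_hours=999, mttr_hours=1)
print(f"availability = {example:.3f}")

# Halving the repair time improves availability without changing MTTF.
faster_repair = availability(mttf_hours=999, mttr_hours=0.5)
print(f"with faster repair = {faster_repair:.4f}")
```

The second call illustrates why repair time matters as much as failure rate: availability can be raised either by failing less often or by repairing faster.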

19 Error Containment Levels
- For distributed systems there are additional levels
  - Containment to a single node or FTU
  - Containment to a single bus or subsystem
  - Containment to a single vehicle/piece of equipment in a national infrastructure

20 Sources of Errors

21 "Mainframe" Outage Sources (* the sum of these sources was 0.75)

22 Summary of Tandem Reported System Outage Data

                      1985       1987       1989
Customers             1000       1300       2000
Outage Customers       176        205        164
Systems               2400       6000       9000
Processors            7000     15,000     25,500
Discs               16,000     46,000     74,000
Reported Outages       285        294        438
System MTBF        8 years   20 years   21 years

23 Tandem Causes of System Failures (Up is good; down is bad)

24 Tandem Hardware Causes of Outage
- Disks: 49%
- Communications: 24%
- Processors: 18%
- Timing: 9%
- Spares: 1%

25 Tandem Operations Causes of Outage
- Procedures: 42%
- Configurations: 39%
- Move: 13%
- Overflow: 4%
- Upgrade: 1%

26 Tandem Maintenance Causes of Outage
- Disk: 67%
- Communication: 20%
- Processor: 13%

27 Tandem Environmental Outages
- Extended Power Loss: 80%
- Earthquake: 5%
- Flood: 4%
- Fire: 3%
- Lightning: 3%
- Halon Activation: 2%
- Air Conditioning: 2%
- Total MTBF about 20 years
- MTBAoG* about 100 years
  - Roadside highway equipment will be more exposed than this
* AoG = "Act of God"

28 CMU Andrew File Server Study
- Configuration
  - 13 SUN II Workstations with 68010 processor
  - 4 Fujitsu Eagle Disk Drives
- Observations
  - 21 Workstation Years
- Frequency of events
  - Permanent Failures: 29
  - Intermittent Faults: 610
  - Transient Faults: 446
  - System Crashes: 298
- Mean Time To
  - Permanent Failures: 6552 hours
  - Intermittent Faults: 58 hours
  - Transient Faults: 354 hours
  - System Crash: 689 hours

29 Some Interesting Ratios
- Permanent Outages/Total Crashes = 0.1
- Intermittent Faults/Permanent Failures = 21
  - Thus first symptom appears over 1200 hours prior to repair
- (Crashes - Permanent)/Total Faults = 0.255
- 14/29 failures had three or fewer error log entries
  - 8/29 had no error log entries
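
These ratios can be checked against the event counts reported for the Andrew file server study. The sketch below rests on two interpretations that the slides do not state explicitly: that "Total Faults" means intermittent plus transient faults, and that the 1200-hour figure is the number of intermittent faults per permanent failure multiplied by the 58-hour mean time between intermittent faults. Both happen to reproduce the slide's numbers.

```python
# Event counts and mean time from the CMU Andrew File Server Study slide.
permanent_failures = 29
intermittent_faults = 610
transient_faults = 446
system_crashes = 298
mean_hours_to_intermittent = 58

# Permanent Outages / Total Crashes
permanent_per_crash = permanent_failures / system_crashes

# Intermittent Faults / Permanent Failures
intermittent_per_permanent = intermittent_faults / permanent_failures

# Assumed reading of the lead-time claim: about 21 intermittent faults,
# 58 hours apart on average, precede each permanent failure.
warning_hours = round(intermittent_per_permanent) * mean_hours_to_intermittent

# (Crashes - Permanent) / Total Faults, with Total Faults assumed to be
# intermittent + transient.
total_faults = intermittent_faults + transient_faults
non_permanent_crash_ratio = (system_crashes - permanent_failures) / total_faults

print(f"permanent / crashes          = {permanent_per_crash:.2f}")
print(f"intermittent / permanent     = {intermittent_per_permanent:.1f}")
print(f"warning before repair        = {warning_hours} hours")
print(f"(crashes - permanent)/faults = {non_permanent_crash_ratio:.3f}")
```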
