Published by Emil Ramsey. Modified over 6 years ago.
1
Fault Tolerance & Reliability CDA 5140 Spring 2006
Chapter 1 Overview & Definitions
2
Topics
Basic concepts of Fault Tolerance (FT)
Reliability & availability of systems, both hardware & software
Tools to compare & contrast FT designs
3
What is FT?
Computing in presence of errors
Some techniques from analog systems of 1940’s ’s
Digital technology adds to these to be faster, better & cheaper
Investigate architecture keeping in mind tradeoff of cost, weight & volume
Becoming more important as digital systems become more & more prevalent
4
Why Have FT?
Needed more in 21st century since:
Harsher environments
Many novice users
Increasing repair costs
Larger systems
Digital systems more prevalent
More users dependent on digital systems from business to government to home to school
5
How is FT Obtained?
Add redundancy in form of:
Hardware, e.g. RAID
Software, e.g. 2 algorithms for same task
Information, e.g. coding theory
Time, e.g. on Internet, if a route has a fault, then try a new route
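As a rough illustration of the software form of redundancy ("2 algorithms for same task"), the sketch below computes one result in two independent ways and compares them. The task (summing 1..n) and the names `sum_by_loop`, `sum_by_formula` and `checked_sum` are hypothetical, chosen only for illustration.

```python
def sum_by_loop(n: int) -> int:
    """First algorithm: add the integers 1..n one at a time."""
    total = 0
    for k in range(1, n + 1):
        total += k
    return total


def sum_by_formula(n: int) -> int:
    """Second, independently written algorithm: closed form n(n+1)/2."""
    return n * (n + 1) // 2


def checked_sum(n: int) -> int:
    """Run both algorithms and accept the result only if they agree."""
    a = sum_by_loop(n)
    b = sum_by_formula(n)
    if a != b:
        # A disagreement reveals a fault but does not say which version is wrong.
        raise RuntimeError(f"redundant versions disagree: {a} != {b}")
    return a


if __name__ == "__main__":
    print(checked_sum(100))
```

Two versions can only detect a disagreement; masking a faulty version needs at least three versions and a vote.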
6
Definitions & Terminology
Failure - departure from correct operation
Fault - flaw in hardware or software resulting in failure, e.g. physical problems, design flaws or defects in hardware; design or implementation flaws in software
Error - incorrect response from module, leading to system failure if no FT
Type - hardware or software
Cause - improper design, hardware failure, external disturbance
7
Definitions continued
Permanent fault - always present, needs repair to remove
Intermittent fault - not always present but still needs repair to remove
Transient fault - will disappear without repair
Fault latency - fault can go undetected & does not cause error
Fault-avoidance - use of high quality components & careful design to avoid faults
Fault-tolerance - use of redundancy (hardware, software, information or time) to correct system operation after fault occurs
8
Definitions continued
Graceful degradation - system still operates correctly after faults, but at reduced performance
Fail-safe - system can fail but only to safe state to avoid catastrophes
Reliability - probability of not failing within time t given operating correctly at time 0
Availability - probability that system is operating correctly at time t
Maintainability - probability that system can be restored to operation by time t given not operational at time 0
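To make the reliability definition concrete, here is a minimal sketch that estimates reliability at time t from observed failure times, as the fraction of units that have not failed by t. The function name `empirical_reliability` and the sample failure times are hypothetical, not data from the course.

```python
from typing import Sequence


def empirical_reliability(failure_times: Sequence[float], t: float) -> float:
    """Estimate R(t): fraction of units, all working at time 0, that fail after t."""
    if not failure_times:
        raise ValueError("need at least one observed failure time")
    survivors = sum(1 for f in failure_times if f > t)
    return survivors / len(failure_times)


if __name__ == "__main__":
    # Hypothetical failure times in hours for five identical units.
    observed = [120.0, 340.0, 560.0, 800.0, 1500.0]
    print(empirical_reliability(observed, 500.0))  # 3 of the 5 units outlast 500 h
```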
9
Definitions continued
Mean-time-to-failure (MTTF) - expected value of system failure time
Mean-time-to-repair (MTTR) - expected value of system repair time
Mean-time-between-failure (MTBF) - expected value of time between successive system failures, MTBF = MTTF + MTTR
Fault detection - method used to detect presence of fault
Fault confinement - technique to confine damage of fault to as small an area as possible
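The relation MTBF = MTTF + MTTR can be sketched in a few lines. The steady-state availability formula MTTF/(MTTF + MTTR) used below is the standard long-run form of uptime/(uptime + downtime); it is an addition here and is not stated in the definitions above. The function names and example figures are hypothetical.

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures: time to fail plus time to repair."""
    return mttf_hours + mttr_hours


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the system is up (assumed standard formula)."""
    return mttf_hours / (mttf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical system: fails on average every 999 h, takes 1 h to repair.
    print(mtbf(999.0, 1.0))
    print(steady_state_availability(999.0, 1.0))
```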
10
Definitions continued
Fault diagnosis - automatic identification of faulty modules
Recovery - system put into operating state, possibly degraded
Hardware redundancy - extra hardware to detect, mask or diagnose faults
Passive hardware redundancy - fault masking to hide faults & prevent faults from resulting in errors; no action by system
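One common way to realize fault masking is majority voting over replicated modules, for example triple modular redundancy. The slides do not name a specific scheme, so the voter below is only an assumed illustration, and `majority_vote` is a hypothetical name.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of the replicated modules."""
    if not outputs:
        raise ValueError("no module outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        # Too many modules disagree: the fault can no longer be masked.
        raise RuntimeError("no majority among module outputs")
    return value


if __name__ == "__main__":
    # Three replicas compute the same result; one is faulty and returns 9.
    print(majority_vote([7, 7, 9]))
```

The caller sees only the voted value, which matches the "no action by system" character of passive redundancy: the faulty replica is outvoted rather than located or repaired.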
11
Definitions continued
Information redundancy - use of coding theory techniques (addition of bits)
Software redundancy - use of diagnostic software or extra modules, each with distinct algorithm
Temporal redundancy - repeating bus cycles or whole programs, new route on Internet
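Two of these forms are easy to sketch. For information redundancy, a single even-parity bit is about the simplest example of "addition of bits"; for temporal redundancy, an operation is simply repeated when it fails. Both are assumed illustrations rather than techniques taken from the slides, and the names `add_even_parity`, `parity_ok` and `retry` are hypothetical.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def add_even_parity(bits: List[int]) -> List[int]:
    """Append one bit so that the total number of 1s in the word is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word: List[int]) -> bool:
    """True if the word still has an even number of 1s (no single-bit error seen)."""
    return sum(word) % 2 == 0


def retry(operation: Callable[[], T], attempts: int = 3) -> T:
    """Temporal redundancy: repeat the operation until it succeeds or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Exception = RuntimeError("operation was never run")
    for _ in range(attempts):
        try:
            return operation()
        except Exception as exc:  # a transient fault may clear on the next try
            last_error = exc
    raise last_error


if __name__ == "__main__":
    word = add_even_parity([1, 0, 1, 1])
    print(word, parity_ok(word))
    word[2] ^= 1  # inject a single-bit error
    print(word, parity_ok(word))
```

A single parity bit detects any odd number of flipped bits but cannot correct them; repetition helps only against transient faults, since a permanent fault fails on every attempt.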
12
Microelectronic Growth
Density of chips dramatically increased & concomitantly, use of digital systems
Obvious need for FT in space shuttle, nuclear power plants, but with increased use in homes, more faults likely so will need FT there too
Interesting observations:
1999 typical home had microprocessors
2004 expected to be 280
13
Reliability & Availability
Goal: high reliability & availability based on sound analysis & not conjecture!
Use both reliability & availability as measures
14
Air Traffic Control Example
ATC fails once/year, so MTTF = 8766 hours
Airline Reservation System (ARS) down 5 times/year, so MTTF = 1753 hours
Availability (A) = uptime/(uptime + downtime)
ATC down 1 hour, so A = 8765/(8765 + 1) = 0.999886
ARS down for 1 minute, 5 times, or 0.0833 hours
A = 8765.9167/8766 = 0.9999905
15
Air Traffic Control Example cont’d
Unavailability U = 1 - A
So, comparing the two systems for U: (1 - 0.999886)/(1 - 0.9999905) = 0.000114/0.0000095 = 12
The unavailability of the ARS is 12 times smaller than that of the ATC, so the ARS is 12 times better by this measure.
Homework 1: 1.13, 1.14, 1.17 (3 examples)
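The arithmetic of the air traffic control example can be checked with a short script. It uses only the figures given in the example (8766 hours per year, 1 hour of ATC downtime, five 1-minute ARS outages); the function and variable names are hypothetical.

```python
HOURS_PER_YEAR = 8766.0  # 365.25 days * 24 hours, as used in the example


def availability(downtime_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """A = uptime / (uptime + downtime) over one period."""
    uptime = period_hours - downtime_hours
    return uptime / (uptime + downtime_hours)


def unavailability(downtime_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """U = 1 - A, computed directly as downtime / period to limit rounding error."""
    return downtime_hours / period_hours


atc_downtime = 1.0           # one 1-hour outage per year
ars_downtime = 5 * (1 / 60)  # five 1-minute outages per year, in hours

if __name__ == "__main__":
    print(f"ATC availability: {availability(atc_downtime):.6f}")
    print(f"ARS availability: {availability(ars_downtime):.7f}")
    ratio = unavailability(atc_downtime) / unavailability(ars_downtime)
    print(f"U_ATC / U_ARS: {ratio:.1f}")
```

The ARS fails five times as often as the ATC, yet its total downtime is one twelfth as long, which is why the two measures rank the systems differently.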