FAULT-TOLERANT COMPUTING


Presentation on theme: "FAULT-TOLERANT COMPUTING"— Presentation transcript:

1 FAULT-TOLERANT COMPUTING
2017/4/22
Jenn-Wei Lin
Department of Computer Science and Information Engineering, Fu Jen Catholic University
Lecture Set 1: Motivation and Introduction

2 ECE 753 Fault Tolerant Computing
General Information
Textbook
- Martin L. Shooman: Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design, John Wiley and Sons, 2002.
- D.P. Siewiorek and R.S. Swarz: Reliable Computer Systems: Design and Evaluation, 3rd ed., A. K. Peters, 1999.
- D. K. Pradhan, editor: Fault-Tolerant Computer System Design, Prentice-Hall. The book is out of print.
Paper
- Dependable Computing Conference
Grading Policy
- Exam: 20%
- Presentation: 40% (four)
- Term report & Project: 40%
Do not discuss much about topics here. "Computer system overall" implies what a computer system is: its architecture and components. Then focus on hardware and software components.

3 Overview
- Motivation
- Introduction
- Terminology
- Fundamental Principles
- Fault-Error-Failure concept

4 Motivation
- Informal Definition
- Key Attributes
- Who, What and Why
- Study Examples
Be sure everyone has a conduct sheet. Re GROUND RULES: 1. In terms of allowed collaboration vs. individual work, ask if you are not sure. 2. Deactivate all cell phones or pagers during class unless you are on-call during your job. 3. No tape-recording permitted. Take notes.

5 Motivation
What is Fault-Tolerance?
A “fault-tolerant system” is one that continues to perform at the desired level of service in spite of failures in some of the components that constitute the system.

6 Motivation (contd.)
Who is concerned about fault-tolerance?
- System Users
Who is concerned at design stages?
- Universities: R, d, and a (Research, development, applications)
- Industry: r, D, and A (research, Development, Applications)

7 Motivation (contd.)
Examples: General Purpose Systems
- PCs: RAMs with parity checks
- Workstations: error detection (HW), occasional corrective action (SW), ECC (HW), keeping log (SW)
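As a rough illustration of the RAM parity check mentioned for PCs, here is a minimal sketch. It assumes even parity over a single integer word; the function names `parity_bit` and `check` are hypothetical and not taken from the lecture.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word holds an odd number of 1 bits, else 0."""
    return bin(word).count("1") % 2


def check(word: int, stored_parity: int) -> bool:
    """Return True if the word read back still matches its stored parity bit."""
    return parity_bit(word) == stored_parity


# Write path: store the word together with its parity bit.
word = 0b10110010
stored_parity = parity_bit(word)

# Read path: a single flipped bit changes the parity and is detected.
corrupted = word ^ 0b00000100
print(check(word, stored_parity))       # word unchanged
print(check(corrupted, stored_parity))  # one bit flipped
```

A single parity bit only detects an odd number of flipped bits and cannot say which bit is wrong, which is why it counts as error detection rather than correction.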

8 Motivation (contd.)
Examples: Reliable Systems
- Telephone systems
- Banking systems, e.g. ATM
- Stock market
- Football games display/ticketing

9 Motivation (contd.)
Examples: Critical and Life Critical Systems
- Manned and unmanned space borne systems
- Aircraft control systems
- Nuclear reactor control systems
- Life support systems

10 Motivation (contd.)
Examples: Reliable -> Critical Systems
- 911 telephone switching system
- Traffic light control system
- Automobile control system (ABS, fuel injection system)

11 Introduction
- Historical perspective and major push
- Goals of fault-tolerance
- Applications of fault-tolerance

12 Introduction (contd.)
Historical Perspective
- Not a new concept: first use by J. von Neumann, 1956
Major push
- Space program
- HW fault tolerance first, then SW fault tolerance later
- Merge the two

13 Introduction (contd.)
Applications
- Space borne system: long life system
- Airplane control system: critical system
- Transaction processing system: high availability system
- Switching system: high availability over a certain level of performance

14 Terminology
Reliability and concept of probability
- R(t): conditional probability that a system provides continuous proper service in the interval [0,t], given that it provided the desired service at time 0.
Availability
- The probability that an item is up at any point in time
- Uptime/(Uptime+Downtime)
Dependability
- Property of a computer system that allows reliance to be placed justifiably on the service it delivers
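The availability ratio Uptime/(Uptime+Downtime) can be sketched as a small function. The name `availability` and the choice of hours as the unit are assumptions made for illustration only.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of the observed time during which the item was up."""
    if uptime_hours < 0 or downtime_hours < 0:
        raise ValueError("durations must be non-negative")
    total = uptime_hours + downtime_hours
    if total == 0:
        raise ValueError("no observation time recorded")
    return uptime_hours / total


# Example: 999 hours up and 1 hour down over a 1000-hour observation window.
print(availability(999.0, 1.0))
```

Note that this ratio says nothing about how the downtime is distributed: one long outage and many short ones give the same availability but very different reliability R(t).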

15 Fundamental Principles
Dependability
- Impairments: Faults, errors, failures
- Means: Fault Avoidance, Fault Tolerance, Fault Removal, Fault Forecasting
- Measures: Reliability, Availability, Maintainability

16 Fundamental Principles (contd.)
A set of methods, tools and solutions that enable development of dependable systems.
- Fault Prevention: how to prevent fault occurrence or introduction,
- Fault Tolerance: how to ensure a service capable of fulfilling the system’s function in the presence of faults,
- Fault Removal: how to reduce the presence (number, seriousness) of faults,
- Fault Forecasting: how to estimate the present number, the future incidence, and the consequences of faults

17 Fundamental Principles (contd.)
Fault Avoidance: To prevent fault occurrence by construction. E.g., nearly fault-free components, shielding against electromagnetic fields
Drawbacks:
- Cost of near-perfect components is high
- Cost of maintenance personnel
Fault Tolerance: To provide, by redundancy, service complying with the specification in spite of faults occurring
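One common way to provide service "by redundancy" is to run several copies of a module and take a majority vote over their outputs. The sketch below is only an illustration of that idea, not a method prescribed by the lecture; the function name `majority_vote` is hypothetical, and outputs are assumed to be hashable values.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the redundant copies."""
    if not outputs:
        raise ValueError("no outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        # No value is backed by more than half of the copies.
        raise RuntimeError("no majority among redundant outputs")
    return value


# Three copies, one of which returns a wrong value: the fault is masked.
print(majority_vote([7, 7, 9]))
```

With three copies the voter masks one faulty copy; with five copies it masks two. The voter itself remains a single point of failure in this sketch.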

18 Fundamental Principles (contd.)
Fault Removal: To minimize, by verification, the presence of faults. E.g., Am I building the right system? Concepts of coverage, etc.
Fault Forecasting: To estimate, by evaluation, the presence, occurrence and consequences of faults. E.g., For how long will the system be right?

19 Fundamental Principles (contd.)
Reliability: A measure of continuous delivery of proper service (or equivalently, of the time to failure) from a reference initial time
Availability: A measure of the delivery of proper service with respect to the alternation of delivery of proper and improper service
Maintainability: A measure of continuous delivery of improper service (time to restoration or repair)

20 Fundamental Principles (contd.)
Hardware Redundancy
- Low level
- High level
Software Redundancy
Time Redundancy
Information Redundancy

21 Fundamental Principles (contd.)
Hardware Redundancy - Low level (logic level)
- Example 1: Self-checking circuits
- Example 2: Arithmetic code. A modular adder using the mathematical principle (A + B) mod k = ((A mod k) + (B mod k)) mod k
Hardware Redundancy - High level
- Triplicate or 5 copies, as in the space shuttle
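The arithmetic-code example relies on the identity (A + B) mod k = ((A mod k) + (B mod k)) mod k: a small, independent mod-k computation checks the result of the main adder. The following is a minimal software sketch of that check. The function name `checked_add`, the default modulus k = 3, and the injectable `adder` parameter are assumptions made for illustration.

```python
def checked_add(a: int, b: int, k: int = 3, adder=lambda x, y: x + y) -> int:
    """Add a and b, then verify the sum against an independent mod-k residue."""
    total = adder(a, b)                  # main (possibly faulty) adder
    expected = ((a % k) + (b % k)) % k   # small checker working on residues
    if total % k != expected:
        raise ArithmeticError("residue check failed: adder output is wrong")
    return total


print(checked_add(25, 17))

# A hypothetical faulty adder whose result is off by one is caught by the check.
try:
    checked_add(25, 17, adder=lambda x, y: x + y + 1)
except ArithmeticError as exc:
    print("detected:", exc)
```

The check cannot see errors that are a multiple of k (for k = 3, a result that is off by 3 passes), so the choice of k determines which errors are covered.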

22 Fundamental Principles (contd.)
Software Redundancy
- Use two different programs/algorithms
Time Redundancy
- Re-compute or redo the task and compare the results
- May or may not use the same hardware/software
Information Redundancy
- Backup information
- Use of ECC
Question - What kind of FT is achieved?
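The time redundancy idea of redoing a task and comparing the results can be sketched as below. This is an illustrative sketch that assumes the same software runs twice on the same hardware; the function name `run_twice_and_compare` and the `max_attempts` parameter are hypothetical.

```python
def run_twice_and_compare(task, *args, max_attempts: int = 3):
    """Run the task twice per attempt and accept a result only if both runs agree."""
    for _ in range(max_attempts):
        first = task(*args)
        second = task(*args)
        if first == second:
            return first
    raise RuntimeError("results never agreed within the allowed attempts")


def square(x):
    return x * x


print(run_twice_and_compare(square, 6))
```

Under these assumptions the comparison can expose a transient disturbance that affects only one of the two runs. A permanent fault that corrupts both runs in the same way produces two matching wrong results and goes unnoticed.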

23 Fault-Error-Failure concept
- Intuitive definitions
- Origins of faults
- Methods to break FEF chain
- Attribute of faults

24 Fault-Error-Failure concept (contd.)
Intuitive definitions
- Fault: An anomalous physical condition caused by a manufacturing problem, fatigue, external disturbance (intentional or unintentional), design flaw, …
- Error: Effect of activation of a fault
- Failure: Overall system effect of an error
Fault -> Error -> Failure

25 Fault-Error-Failure concept (contd.)
- Failure occurs when the delivered service deviates from the specified service; failures are caused by errors
- Error is the manifestation of a fault within a program or data structure
- Fault is an incorrect state of hardware or software resulting from failures of components, physical interference from the environment, operator error or incorrect design
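To make the Fault -> Error -> Failure chain concrete, consider a hypothetical memory cell with one bit stuck at 0. The toy model below is an assumption for illustration (the names `STUCK_BIT`, `read_cell` and `doubling_service` are invented): the fault stays dormant until a stored value actually needs that bit, the wrong value read back is the error, and the wrong answer delivered to the user is the failure.

```python
STUCK_BIT = 2  # hypothetical fault: bit 2 of the cell always reads as 0


def read_cell(stored: int, faulty: bool) -> int:
    """Read a stored value; a faulty cell forces bit STUCK_BIT to 0."""
    if faulty:
        return stored & ~(1 << STUCK_BIT)
    return stored


def doubling_service(stored: int, faulty: bool) -> int:
    """The service specified for the user: return twice the stored value."""
    return 2 * read_cell(stored, faulty)


# Fault present but dormant: 0b0011 does not use bit 2, so no error appears.
print(read_cell(0b0011, faulty=True) == 0b0011)

# Fault activated: 0b0100 needs bit 2, so the value read back is wrong (error).
print(read_cell(0b0100, faulty=True))

# The error propagates to the delivered service (failure): 8 was specified.
print(doubling_service(0b0100, faulty=True))
```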

26 Fault-Error-Failure concept (contd.)
Causes of faults
- Specification mistakes: incorrect algorithms, architectures, or hardware and software design specifications
- Implementation mistakes: arise in the process of transforming hardware and software specifications into the physical hardware and the actual software; poor design, poor component selection, poor construction, software coding mistakes
- Component defects: manufacturing imperfections, random device defects, and component wear-out
- External disturbance: radiation, electromagnetic interference, operator mistakes, battle damage, and environmental extremes

27 Fault-Error-Failure concept (contd.)
Causes of faults
[Figure: diagram with the labels Specification Mistakes, Implementation Mistakes, External Disturbances, Component Defects, Software Faults, Hardware Faults, Errors, System Failures]

28 Fault-Error-Failure concept (contd.)
Characteristics of faults
- Fault nature: specifies the type of fault. Is the fault a hardware or a software fault?
- Fault duration: specifies the length of time that a fault is active
  - Permanent fault
  - Transient fault: appears and disappears within a very short period of time
  - Intermittent fault: appears, disappears, and reappears repeatedly
- Fault extent: the fault is localized to a given hardware or software module, or globally affects the hardware, the software, or both
- Fault value: determinate or indeterminate; fault sensitive to either the data or time
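The four characteristics above can be captured in a small record type, for example when logging observed faults. This is only a sketch: the class and member names (`FaultDescriptor`, `Nature`, `Duration`, `Extent`, `Value`) and the helper `retry_may_help` are assumptions, not terminology fixed by the lecture.

```python
from dataclasses import dataclass
from enum import Enum


class Nature(Enum):
    HARDWARE = "hardware"
    SOFTWARE = "software"


class Duration(Enum):
    PERMANENT = "permanent"
    TRANSIENT = "transient"
    INTERMITTENT = "intermittent"


class Extent(Enum):
    LOCAL = "local"
    GLOBAL = "global"


class Value(Enum):
    DETERMINATE = "determinate"
    INDETERMINATE = "indeterminate"


@dataclass(frozen=True)
class FaultDescriptor:
    """One observed fault, described by the four characteristics."""
    nature: Nature
    duration: Duration
    extent: Extent
    value: Value


def retry_may_help(fault: FaultDescriptor) -> bool:
    """Heuristic assumption: redoing the work can only help if the fault is not permanent."""
    return fault.duration is not Duration.PERMANENT


glitch = FaultDescriptor(Nature.HARDWARE, Duration.TRANSIENT,
                         Extent.LOCAL, Value.INDETERMINATE)
print(retry_may_help(glitch))
```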

