ECE 753: FAULT-TOLERANT COMPUTING
Kewal K. Saluja, Department of Electrical and Computer Engineering
Basic Concepts in Fault-Tolerance

Presentation transcript:

Slide 1: ECE 753: FAULT-TOLERANT COMPUTING
Kewal K. Saluja, Department of Electrical and Computer Engineering
Basic Concepts in Fault-Tolerance

Slide 2: Overview
- Introduction and sources
- Hardware redundancy
- Information redundancy
- Time redundancy
- Software redundancy

Slide 3: Introduction
Sources
- Main source: text, Chapters 2 and 3
- Other sources
  - [prad:96] Chapter 1
  - [siew:99] Chapter 3
  - [Shooman:02] Chapter 4
- These three books contain sufficient material covering this part of the course.

Slide 4: Introduction (contd.)
Scope: explained using the example of a filter
  inputs -> A/D -> digital subsystem (DSP/custom design) -> D/A -> outputs
Problems and solutions
- Inputs out of range
  - add extra code to check for out-of-range inputs and outputs
  - can also add code to check for large deviations between samples
  - this is normally software redundancy; it could be done in hardware, but that is costly
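
The two software checks named on this slide (a range check and a check for large deviations between samples) might look like the following minimal sketch. The limits `lo`, `hi`, and `max_step`, and the function names, are assumptions chosen for illustration; they do not come from the slides.

```python
def check_sample(sample, previous, lo=-1.0, hi=1.0, max_step=0.5):
    """Return True if the sample passes both plausibility checks."""
    # Range check: reject values outside the expected input range.
    if not (lo <= sample <= hi):
        return False
    # Deviation check: reject a jump that is too large relative to
    # the last accepted sample.
    if previous is not None and abs(sample - previous) > max_step:
        return False
    return True


def filter_inputs(samples, **limits):
    """Keep only the samples that pass the checks, in order."""
    accepted = []
    previous = None
    for s in samples:
        if check_sample(s, previous, **limits):
            accepted.append(s)
            previous = s
    return accepted


print(filter_inputs([0.0, 0.1, 2.0, 0.9, 0.2]))
```

Here 2.0 fails the range check and 0.9 fails the deviation check against the last accepted sample (0.1), so only three samples reach the filter.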

Slide 5: Introduction (contd.)
Problems and solutions (contd.)
- Power transients may corrupt the values or cause the algorithm to fault
  - read the values twice, execute the algorithm twice, and compare the results, in hardware or software
  - this is time redundancy
- Values transmitted by the A/D to the digital system may get corrupted
  - encode the values and decode them at the destination
  - this is information redundancy
- Components (DSP processor, A/D, or D/A) may fail
  - duplicate such parts
  - this is hardware redundancy
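
As a rough sketch of the "read twice, execute twice, compare" idea, the helper below runs a read-and-compute step two times and reports a disagreement. The names `run_twice`, `read_input`, and `algorithm` are hypothetical, and the sketch assumes any transient has subsided by the second run.

```python
def run_twice(read_input, algorithm):
    """Read and compute twice; raise if the two results disagree."""
    first = algorithm(read_input())
    second = algorithm(read_input())
    if first != second:
        # A mismatch suggests a transient fault hit one of the runs.
        raise RuntimeError("results disagree: possible transient fault")
    return first


# Usage: a stable reader gives the same answer on both runs.
stable = iter([5, 5])
print(run_twice(lambda: next(stable), lambda v: v * 2))
```

Comparison only detects the fault; choosing which result is correct would need a third run or some other form of redundancy.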

Slide 6: Hardware redundancy
Passive hardware redundancy
- TMR with a voter
  - main problem: the voter is a single point of failure
  - justification: the voter has much lower complexity and can be designed using more reliable technology
  - alternative: use of a restoring organ (TMR with triplicated voter)
- NMR: voter-based generalization
- Hardware voter (1-bit) and software voter: both simple
- Timing issue: sandwich the voter between pairs of FFs
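
A 1-bit majority voter computes ab + bc + ac. Applied bitwise to whole words, the same expression gives a simple software voter for TMR. The sketch below assumes the three module outputs arrive as non-negative integers; the function name is illustrative.

```python
def majority_vote(a, b, c):
    """Bitwise 2-out-of-3 majority of three module outputs."""
    return (a & b) | (b & c) | (a & c)


# Two modules agree on 0b1010; the third is corrupted in two bit positions.
print(bin(majority_vote(0b1010, 0b1010, 0b0110)))
```

Because the vote is taken per bit, the result is correct as long as no bit position is wrong in more than one module.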

Slide 7: Hardware redundancy (contd.)
Passive hardware redundancy (contd.)
Comparison between hw and sw voter schemes

                     hw                 sw
  cost               high               low
  flexibility        inflexible         flexible
  synchronization    tight              loose
  performance        high (fast)        low (slow)
  types of voting*   majority           different
                     (others costly)    (no extra cost)

Slide 8: Hardware redundancy (contd.)
Passive hardware redundancy (contd.)
Types of voting
- majority: in many practical situations it is meaningless
- average: can perform poorly if a sensor always provides a very low value
- mid value: a good choice, but can be very costly to implement in HW
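
The difference between the three voting types shows up clearly on analog sensor readings. The sketch below uses three made-up readings, one of them from a sensor stuck at a very low value; the data and the `majority` helper are assumptions for illustration.

```python
from collections import Counter
from statistics import mean, median


def majority(values):
    """Return the value held by more than half the inputs, else None."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None


readings = [20.1, 20.3, 0.0]   # third sensor is stuck low

print(majority(readings))   # no two readings match exactly
print(mean(readings))       # dragged down by the stuck sensor
print(median(readings))     # mid value ignores the outlier
```

Exact-match majority finds no winner because healthy sensors rarely agree to the last digit, the average is pulled far from the true value, and the mid value stays with the healthy readings.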

Slide 9: Hardware redundancy (contd.)
Active hardware redundancy
- Key: detect fault, locate, reconfigure
- See Figure 1.6 of [prad:96]
- duplicate with comparison
  - single point of failure
- standby sparing
  - one operational unit, which has its own fault detection mechanism
  - on occurrence of a fault, a second unit (spare) is used
  - cold standby: the standby is in an unknown state
  - hot standby: the standby is in the same state as the system, giving a quick start
  - can generalize to n units: one active and n-1 standby spares

Slide 10: Active approach to FT
[Figure: basic operations in active fault tolerance. Source: Pradhan 1996]

Slide 11: Hardware redundancy (contd.)
Active hardware redundancy (contd.)
- Pair-and-a-spare: combines "duplicate with comparison" with "standby sparing"
  - duplicate units (a pair of units) are used to compare and signal an error to the reconfiguration unit
  - a second duplicate (pair, and possibly more in the case of pair-and-k-spare) is used to take over in case the working duplicate (pair) detects an error
  - a pair is always operational
- Watchdog timer
  - a "timer" (substantially low-cost hardware) monitors the function of the working unit

Slide 12: Hardware redundancy (contd.)
Hybrid hardware redundancy
- Key: combine passive and active redundancy schemes
- NMR with spares
  - example: 5 units
    - 3 in TMR mode
    - 2 spares
    - all 5 connected to a switch that can be reconfigured
  - comparison with 5MR
    - 5MR can tolerate only two faults, whereas the hybrid scheme can tolerate three faults that occur sequentially
    - cost of the extra fault tolerance: the switch
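
The fault counts in the 5MR comparison can be checked with a small counting model. This is only a sketch under strong assumptions that are not stated on the slide: faults arrive one at a time, each fault hits an active unit, the voter and switch are perfect, and a faulty unit is identified by its disagreement with the vote. The function names are hypothetical.

```python
def tmr_with_spares_survives(num_faults, spares=2):
    """TMR core plus spares: does the system still produce a correct vote?"""
    active_good = 3
    for _ in range(num_faults):
        active_good -= 1            # one active unit fails
        if active_good < 2:         # fewer than 2 good units: majority lost
            return False
        if spares > 0:              # switch replaces the faulty unit
            spares -= 1
            active_good += 1
    return True


def nmr_survives(num_faults, n=5):
    """Plain NMR: good units must still form a strict majority."""
    return (n - num_faults) > n // 2


for faults in range(1, 5):
    print(faults, tmr_with_spares_survives(faults), nmr_survives(faults))
```

Under these assumptions the hybrid scheme survives three sequential faults while 5MR survives two, matching the comparison on the slide.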

Slide 13: Hardware redundancy (contd.)
Hybrid hardware redundancy (contd.)
- Self-purging redundancy
  - initially start with NMR
  - purge one unit at a time until arriving at 3MR
    - can tolerate more faults initially compared to NMR with spares
    - cost of the switch: higher?
    - How does it compare to sift-out redundancy?
- Triple-duplex redundancy
  - combines duplication-with-compare and TMR

Slide 14: Information redundancy
- Key concept: add redundancy to information/data
  - all schemes use error-detecting or error-correcting coding
- Use of parity
  - very effective single-error detection
  - encoding and decoding cost is low
  - commonly used in memories and in transmission over short, reliable channels
  - limitations
    - unable to detect common multiple errors
    - cannot be used in data transformation; for example, addition does not preserve parity
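
The strengths and limitations of parity listed on this slide can be seen in a few lines. The sketch assumes even parity with the parity bit appended as the least significant bit; that layout and the function names are illustrative choices.

```python
def parity(word):
    """Return 1 if the word has an odd number of 1 bits, else 0."""
    return bin(word).count("1") % 2


def encode(word):
    """Append an even-parity bit as the least significant bit."""
    return (word << 1) | parity(word)


def check(codeword):
    """A valid even-parity codeword has an even number of 1 bits."""
    return parity(codeword) == 0


cw = encode(0b1011)
print(check(cw))            # valid codeword
print(check(cw ^ 0b100))    # single-bit error: detected
print(check(cw ^ 0b110))    # double-bit error: not detected

# Addition does not preserve parity: 1 + 1 = 2.
print(parity(1 + 1), parity(1) ^ parity(1))
```

The last line shows why parity cannot follow data through an adder: the parity of the sum (1) is not the XOR of the operand parities (0).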

Slide 15: Information redundancy (contd.)
- Error-correcting codes
  - triplication
  - Hamming code: you have learnt it
  - byte error detection/correction: to be discussed later
  - cyclic code: see book
- m-out-of-n codes
  - encode each word (data/control) such that the coded word is of length n and each coded word has exactly m 1's in it
    - can detect all single errors
    - can detect all unidirectional multiple errors
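
A short sketch of an m-out-of-n code, using 2-out-of-5 as an assumed example: a word is valid exactly when it has m ones, so any error that only sets bits or only clears bits changes the count and is detected. The helper names are hypothetical.

```python
from itertools import combinations


def m_of_n_codewords(m, n):
    """All n-bit words with exactly m ones."""
    return [sum(1 << i for i in bits) for bits in combinations(range(n), m)]


def is_valid(word, m):
    """A received word is a codeword only if it has exactly m ones."""
    return bin(word).count("1") == m


print(len(m_of_n_codewords(2, 5)))   # number of 2-out-of-5 codewords
print(is_valid(0b00011, 2))          # valid codeword
print(is_valid(0b00111, 2))          # single 0->1 error: detected
print(is_valid(0b01111, 2))          # two 0->1 errors (unidirectional): detected
print(is_valid(0b00101, 2))          # one 1->0 plus one 0->1: not detected
```

The last case is a bidirectional double error, which keeps the count at m and therefore escapes detection.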

Slide 16: Information redundancy (contd.)
- Berger codes
  - n information bits are encoded into an n+k bit code word. The k check bits are a binary encoding of the number of 1's (or 0's) in the n information bits
    - can detect all single errors
    - can detect all unidirectional multiple errors if carefully designed
- Arithmetic codes
  - AN code
    - used for arithmetic function unit designs
    - each data word is multiplied by a constant A
    - makes use of the identity A(N+M) = AN + AM
    - choice of A is important
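
Both codes on this slide are easy to sketch. The Berger variant below stores the count of 0's as the check value, and the AN code uses A = 3; both choices, along with the function names and the tuple representation of a Berger codeword, are assumptions made for illustration.

```python
def berger_encode(info, n):
    """Return (info, check) where check counts the 0 bits among n info bits."""
    return info, n - bin(info).count("1")


def berger_check(info, check, n):
    """Valid only if the stored check equals the recomputed zero count."""
    return check == n - bin(info).count("1")


A = 3  # assumed AN-code constant


def an_encode(value):
    return A * value


def an_check(coded):
    """A valid AN codeword is a multiple of A."""
    return coded % A == 0


info, check = berger_encode(0b1011, 4)
print(berger_check(info, check, 4))       # valid
print(berger_check(0b0001, check, 4))     # two 1->0 errors in info: detected

total = an_encode(17) + an_encode(25)     # A*N + A*M = A*(N+M)
print(total == an_encode(42), an_check(total))
print(an_check(an_encode(25) ^ (1 << 4))) # single bit flip: detected
```

With the zero count as the check, a 1->0 error in the information bits raises the true count while a 1->0 error in the check bits lowers the stored count, so unidirectional errors cannot cancel. For A = 3, a single bit flip changes the value by a power of two, which is never a multiple of 3.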

Slide 17: Information redundancy (contd.)
Arithmetic codes (contd.)
- Residue code
  - discussed earlier in the course using modulo addition
  - makes use of the fact (M+N) mod k = (M mod k + N mod k) mod k
- Checksums
  - data is sent/stored with a checksum; when the data is used, the checksum is regenerated and compared to the a priori known checksum
  - functions used for checksum
    - add, exclusive-OR (bitwise), add with end-around carry, LFSR, …
  - limitation
    - can (normally) only perform error detection
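
The residue identity on this slide gives a cheap check on an adder: compute the sum in the main unit, compute the residues separately, and compare. The sketch below assumes modulus k = 3 and models a faulty adder as a function passed in by the caller; it also shows a bitwise XOR checksum. All names are hypothetical.

```python
from functools import reduce


def checked_add(m, n, adder=lambda a, b: a + b, k=3):
    """Add with a residue check: (M+N) mod k == (M mod k + N mod k) mod k."""
    result = adder(m, n)
    if result % k != (m % k + n % k) % k:
        raise ArithmeticError("residue mismatch: adder output is suspect")
    return result


def xor_checksum(data):
    """Bitwise exclusive-OR of all bytes."""
    return reduce(lambda acc, byte: acc ^ byte, data, 0)


print(checked_add(17, 25))

stored = xor_checksum(b"fault")
print(xor_checksum(b"fault") == stored)   # intact data
print(xor_checksum(b"faulT") == stored)   # corrupted byte: mismatch
```

Both checks only detect an error; neither says which bit or byte is wrong, which is the limitation the slide notes for checksums.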

Slide 18: Information redundancy (contd.)
Self-checking
- This is a form of hardware redundancy, but it is often closely related to ECC techniques, therefore I have chosen to include it here
- Assumptions: inputs are coded and outputs are coded
- Objective: in the presence of a fault, the circuit should either continue to provide correct output(s) or indicate, by providing an error indication, that there is a fault
  - Clearly the error indication cannot be a 1-bit output (why?)
  - With a 2-bit output, 00 and 11 may indicate no failure; the other output combinations (10, 01) may indicate a failure

Slide 19: Information redundancy (contd.)
Self-checking (contd.)
- Example application
  - two devices produce identical outputs and we compare these outputs to check their equality
  - the checker has two outputs, encoded as follows
    - 00: equal
    - 11: unequal
    - 01 or 10: possible fault in the circuit
    - (we will discuss input encoding when we discuss an example of a 2-rail 1-bit checker)

Slide 20: Information redundancy (contd.)
Self-checking (contd.)
- Definitions
  - a circuit is fault secure if, in the presence of a fault, the output for valid input code words is either always correct or not a code word
  - a circuit is self-testing if every fault can be tested for using only valid inputs
  - a circuit is totally self-checking if it is fault secure and self-testing
- Example: a totally self-checking 2-rail 1-bit comparator
  - assumptions
    - 2 inputs, and each input x is available as x and its complement
    - x and its complement are independently generated
    - note that with these assumptions the input space is encoded (4 valid inputs out of 16 possible inputs)
    - single stuck-at fault model

Slide 21: Time redundancy
- Key concept: do a job more than once over time
  - examples
    - re-execution
    - re-transmission of information
  - different faults and capabilities of different schemes
    - transient faults
      - re-execution and re-transmission can detect such faults, provided we wait for the transient to subside
    - permanent faults
      - simple re-execution or re-transmission will not work. Possible solutions:
        - send or process a shifted version of the data
        - send or process complemented data during the second transmission

Slide 22: Time redundancy (contd.)
- Different faults and capabilities of different schemes (contd.)
  - faults in ALU
    - re-execution with a complemented or shifted version of the operands can detect permanent and transient faults
    - (RESO concept: re-computation with shifted operands)
  - multiple re-computations
    - can detect and possibly correct transient and permanent faults if properly employed/designed
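
A small model shows why RESO catches a permanent fault that plain re-execution misses. The sketch assumes an adder whose output bit 2 is stuck at 0 and ignores word-width overflow (Python integers are unbounded); the fault model and function names are assumptions for illustration.

```python
def good_add(a, b):
    return a + b


def faulty_add(a, b, stuck_bit=2):
    """Adder whose output bit `stuck_bit` is permanently stuck at 0."""
    return (a + b) & ~(1 << stuck_bit)


def reso_add(a, b, adder):
    """Add once normally and once with operands shifted left by one bit."""
    first = adder(a, b)
    second = adder(a << 1, b << 1) >> 1   # shift the result back
    if first != second:
        raise ArithmeticError("RESO mismatch: fault detected")
    return first


print(reso_add(3, 2, good_add))
# Plain repetition on the faulty adder gives the same wrong answer twice.
print(faulty_add(3, 2), faulty_add(3, 2))
```

Shifting the operands moves each result bit to a different position in the adder, so the stuck bit corrupts a different bit of the answer in the second run and the two results disagree.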

Slide 23: Software redundancy
- Key concept: many copies of software, including replication, alternative programs, and redundant code
- Different schemes
  - consistency/assertion checks and tests
    - are the results too large?
    - are the values indeed sorted?
    - is the hardware working correctly? (periodic testing)
    - model checking: build a model of the system and check the outputs of the system against the model output; applied in process control systems

Slide 24: Software redundancy (contd.)
Different schemes
- Capability checks
  - check system limits and capabilities
  - examples
    - is a write in an address space beyond the memory boundary?
      - can write and read back to see if the information is there
    - in a multiprocessor environment, communicate and establish whether a processor is alive before shipping computation/code

Slide 25: Software redundancy (contd.)
Different schemes
- N-version programming (software equivalent of NMR)
  - N programs produce N values and a voter (normally software, but it can also be a hardware voter) votes on the N values
  - What does it achieve?
    - can tolerate software faults (whatever these may be, such as bit-flips) but will not tolerate design flaws
    - if the software runs on independent hardware components, it will tolerate hardware faults
    - if it runs on the same hardware, it will tolerate transient faults that may affect the hardware
    - if the software components are different versions or different algorithm implementations, then this method will tolerate both software and hardware faults
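
A minimal software voter for N-version programming might look like this. The three versions of "sum of 1..n" below, one of them deliberately wrong, are made-up examples, and the sketch assumes results can be compared for exact equality.

```python
from collections import Counter


def n_version(versions, x):
    """Run every version on x and return the strict-majority result."""
    results = [version(x) for version in versions]
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority among versions")
    return value


versions = [
    lambda n: n * (n + 1) // 2,       # closed form
    lambda n: sum(range(n + 1)),      # explicit loop
    lambda n: sum(range(n)),          # off-by-one bug
]

print(n_version(versions, 10))
```

The buggy version is outvoted because the other two were written differently and do not share its mistake; a flaw common to a majority of versions would pass the vote.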

Slide 26: Software redundancy (contd.)
Different schemes
- Recovery block (software equivalent of standby sparing; normally more like the cold standby version of active hardware redundancy)
  - different program versions are used, normally different algorithms implemented by the same or different programmers
  - the fastest, best, or primary version is normally in use
  - if it fails an "acceptance test", the next version is invoked
  - Notes
    - graceful degradation is possible
    - used where acceptance tests can be specified
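
The recovery block structure can be sketched as a loop over alternates guarded by an acceptance test. The example below uses sorting, with a deliberately broken primary; the function names are hypothetical, and because the alternates are pure functions there is no state to roll back, which a real recovery block would have to do before trying the next version.

```python
def recovery_block(alternates, acceptance_test, x):
    """Try each alternate in order until one passes the acceptance test."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue                  # a crash counts as a failed version
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def is_sorted_permutation(original, result):
    """Acceptance test: result is in order and has the same elements."""
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    return in_order and sorted(original) == sorted(result)


def broken_primary(xs):
    return list(xs)                   # "fast" version that forgets to sort


print(recovery_block([broken_primary, sorted], is_sorted_permutation, [3, 1, 2]))
```

The scheme is only as good as its acceptance test, which is why the slide restricts it to cases where such a test can be specified.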

Slide 27: Software redundancy (contd.)
Different schemes
- N-self checking (software equivalent of pair-and-a-spare with hot standby)
  - different program versions, each with its own acceptance test
  - more than one version in use
  - outputs are configured through a switch (conditional statement)
  - if one pair fails, the result from the second version is used as soon as it is available

Slide 28: Summary
- An example to define the scope and list methods
- Hardware redundancy
  - passive, active, and hybrid
- Information redundancy
  - coding methods and self-checking
- Time redundancy
  - re-execution, re-transmission, and the RESO concept
- Software redundancy
  - consistency checks, assertion checks, N-version programming, capability checks, recovery block, and N-self checking

Slide 29: Summary (contd.)
[Figure: a summary chart of all techniques]