Oct. 2007Fault MaskingSlide 1 Fault-Tolerant Computing Dealing with Low-Level Impairments.

Slides:



Advertisements
Similar presentations
Computer Architecture
Advertisements

Survey of Detection, Diagnosis, and Fault Tolerance Methods in FPGAs
Fault-Tolerant Systems Design Part 1.
COE 444 – Internetwork Design & Management Dr. Marwan Abu-Amara Computer Engineering Department King Fahd University of Petroleum and Minerals.
Nov. 2005Math in ComputersSlide 1 Math in Computers A Lesson in the “Math + Fun!” Series.
Maintaining Data Integrity in Programmable Logic in Atmospheric Environments through Error Detection Joel Seely Technical Marketing Manager Military &
(C) 2005 Daniel SorinDuke Computer Engineering Autonomic Computing via Dynamic Self-Repair Daniel J. Sorin Department of Electrical & Computer Engineering.
3. Hardware Redundancy Reliable System Design 2010 by: Amir M. Rahmani.
Self-Checking Circuits
Binocular Bilateral Controller: A Hardware Fault Tolerant Implementation Marylène Audet March 2001 VLSI Testing.
Oct. 2007State-Space ModelingSlide 1 Fault-Tolerant Computing Motivation, Background, and Tools.
Oct State-Space Modeling Slide 1 Fault-Tolerant Computing Motivation, Background, and Tools.
Nov Malfunction Diagnosis and Tolerance Slide 1 Fault-Tolerant Computing Dealing with Mid-Level Impairments.
Oct. 2007Defect Avoidance and CircumventionSlide 1 Fault-Tolerant Computing Dealing with Low-Level Impairments.
Oct Error Correction Slide 1 Fault-Tolerant Computing Dealing with Mid-Level Impairments.
Oct Combinational Modeling Slide 1 Fault-Tolerant Computing Motivation, Background, and Tools.
Nov. 2006Reconfiguration and VotingSlide 1 Fault-Tolerant Computing Hardware Design Methods.
Oct Terminology, Models, and Measures Slide 1 Fault-Tolerant Computing Motivation, Background, and Tools.
1 Chapter Fault Tolerant Design of Digital Systems.
2. Introduction to Redundancy Techniques Redundancy Implies the use of hardware, software, information, or time beyond what is needed for normal system.
Nov. 2006Hardware Implementation StrategiesSlide 1 Fault-Tolerant Computing Hardware Design Methods.
Embedded Systems Laboratory Informatics Institute Federal University of Rio Grande do Sul Porto Alegre – RS – Brazil SRC TechCon 2005 Portland, Oregon,
Jan. 2015Part III – Faults: Logical DeviationsSlide 1.
Oct Fault Masking Slide 1 Fault-Tolerant Computing Dealing with Low-Level Impairments.
Oct Fault Testing Slide 1 Fault-Tolerant Computing Dealing with Low-Level Impairments.
7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems Upon the detection of a failure, the system discards the current.
Developing Dependable Systems CIS 376 Bruce R. Maxim UM-Dearborn.
Oct Defect Avoidance and Circumvention Slide 1 Fault-Tolerant Computing Dealing with Low-Level Impairments.
Spring 07, Apr 17, 19 ELEC 7770: Advanced VLSI Design (Agrawal) 1 ELEC 7770 Advanced VLSI Design Spring 2007 Soft Errors and Fault-Tolerant Design Vishwani.
Nov. 2007Reconfiguration and VotingSlide 1 Fault-Tolerant Computing Hardware Design Methods.
1 Enhancing Random Access Scan for Soft Error Tolerance Fan Wang* Vishwani D. Agrawal Department of Electrical and Computer Engineering, Auburn University,
Oct. 2007Combinational ModelingSlide 1 Fault-Tolerant Computing Motivation, Background, and Tools.
1 Fault-Tolerant Computing Systems #2 Hardware Fault Tolerance Pattara Leelaprute Computer Engineering Department Kasetsart University
CS3502: Data and Computer Networks DATA LINK LAYER - 1.
Oct. 2007Error DetectionSlide 1 Fault-Tolerant Computing Dealing with Mid-Level Impairments.
Reconfiguration Based Fault-Tolerant Systems Design - Survey of Approaches Jan Balach, Jan Balach, Ondřej Novák FIT, CTU in Prague MEMICS 2010.
SiLab presentation on Reliable Computing Combinational Logic Soft Error Analysis and Protection Ali Ahmadi May 2008.
EKT 221/4 DIGITAL ELECTRONICS II  Registers, Micro-operations and Implementations - Part3.
Part.1.1 In The Name of GOD Welcome to Babol (Nooshirvani) University of Technology Electrical & Computer Engineering Department.
Fault-Tolerant Systems Design Part 1.
SENG521 (Fall SENG 521 Software Reliability & Testing Fault Tolerant Software Systems: Techniques (Part 4b) Department of Electrical.
Secure Systems Research Group - FAU 1 Active Replication Pattern Ingrid Buckley Dept. of Computer Science and Engineering Florida Atlantic University Boca.
©Ian Sommerville 2004Software Engineering, 7th edition. Chapter 20 Slide 1 Critical systems development 3.
Synthesis Of Fault Tolerant Circuits For FSMs & RAMs Rajiv Garg Pradish Mathews Darren Zacher.
CALTECH CS137 Winter DeHon CS137: Electronic Design Automation Day 8: February 4, 2004 Fault Detection.
Error Detection in Hardware VO Hardware-Software-Codesign Philipp Jahn.
CprE 458/558: Real-Time Systems
FAULT-TOLERANT COMPUTING Jenn-Wei Lin Department of Computer Science and Information Engineering Fu Jen Catholic University Simple Concepts in Fault-Tolerance.
Redundancy. Definitions Simplex –Single Unit TMR or NMR –Three or n units with a voter TMR/Simplex –After the first failure, a good unit is switched out.
FTC (DS) - V - TT - 0 HUMBOLDT-UNIVERSITÄT ZU BERLIN INSTITUT FÜR INFORMATIK DEPENDABLE SYSTEMS Vorlesung 5 FAULT RECOVERY AND TOLERANCE TECHNIQUES (SYSTEM.
Final Presentation DigiSat Reliable Computer – Multiprocessor Control System, Part B. Niv Best, Shai Israeli Instructor: Oren Kerem, (Isaschar Walter)
Fault-Tolerant Systems Design Part 1.
Computer Architecture Lecture 32 Fasih ur Rehman.
Oct. 2015Part III – Faults: Logical DeviationsSlide 1.
1 Fault-Tolerant Computing Systems #1 Introduction Pattara Leelaprute Computer Engineering Department Kasetsart University
A4 1 Barto "Sequential Circuit Design for Space-borne and Critical Electronics" Dr. Rod L. Barto Spacecraft Digital Electronics Richard B. Katz NASA Goddard.
A Survey of Fault Tolerance in Distributed Systems By Szeying Tan Fall 2002 CS 633.
Structuring Redundancy for Fault Tolerance Chapter 2 Designed by: Hadi Salimi Instructor: Dr. Mohsen Sharifi.
Chandrasekhar 1 MAPLD 2005/204 Reduced Triple Modular Redundancy for Tolerating SEUs in SRAM based FPGAs Vikram Chandrasekhar, Sk. Noor Mahammad, V. Muralidharan.
1 Introduction to Engineering Spring 2007 Lecture 16: Reliability & Probability.
Self-Checking Circuits
ECE 753: FAULT-TOLERANT COMPUTING
Communication Networks: Technology & Protocols
Maintaining Data Integrity in Programmable Logic in Atmospheric Environments through Error Detection Joel Seely Technical Marketing Manager Military &
CPE/EE 428/528 VLSI Design II – Intro to Testing (Part 3)
Fault-Tolerant Computing
Design of a ‘Single Event Effect’ Mitigation Technique for Reconfigurable Architectures SAJID BALOCH Prof. Dr. T. Arslan1,2 Dr.Adrian Stoica3.
Hardware Assisted Fault Tolerance Using Reconfigurable Logic
Part III – Faults: Logical Deviations
Seminar on Enterprise Software
Presentation transcript:

Oct. 2007Fault MaskingSlide 1 Fault-Tolerant Computing Dealing with Low-Level Impairments

Oct. 2007Fault MaskingSlide 2 About This Presentation EditionReleasedRevised FirstOct. 2006Oct This presentation has been prepared for the graduate course ECE 257A (Fault-Tolerant Computing) by Behrooz Parhami, Professor of Electrical and Computer Engineering at University of California, Santa Barbara. The material contained herein can be used freely in classroom teaching or any other educational setting. Unauthorized uses are prohibited. © Behrooz Parhami

Oct. 2007Fault MaskingSlide 3 Fault Masking

Oct. 2007Fault MaskingSlide 4

Oct. 2007Fault MaskingSlide 5 Multilevel Model Component Logic Service Result Information System Legend: Tolerance Entry Last lecture Today

Oct. 2007Fault MaskingSlide 6 Handling Faults RepairDiscardAbort Prevent Remove ExposeMask AvoidTolerate Fault Quality AssuranceTestingDynamic RedundancyStatic Redundancy Full? MonitorTest Yes No PerfectFixedRestoredUnaffectedInjuredScreenedFaulty-safeDegradedFaulty DetectMiss DetectMiss C o m p o n e n t o r S y s t e m S t a t e Reconfigure

Oct. 2007Fault MaskingSlide 7 Some Options for Fault Tolerance 1. Detect and replace Dynamic redundancy (cold/hot standby) Detection via -- coding, watchdog timer, self-checking -- duplication (pair-and-spares) 2. Mask Static redundancy May revert to simplex instead of duplex Design challenges include -- synchronization for voting -- voting on imprecise results 3. Mask, diagnose, and reconfigure Hybrid redundancy Fault masked at output, but diagnosed -- e.g., via comparison with voter output Faulty circuit is replaced by spare Becomes static upon spare exhaustion V Voter D 2 1 Detector Spare V S Switch-voter Spare

Oct. 2007Fault MaskingSlide 8 Comparing Fault Tolerance Schemes AdvantagesDrawbacks Less powerCoverage factor (cold standby) Long lifeTolerance latency (just add spares) Immediate maskingPower/area penalty High safetyVoter critical Immediate maskingPower/area penalty Long life andSwitch-voter critical high safety V Voter D 2 1 Detector Spare V S Switch-voter Spare

Oct. 2007Fault MaskingSlide 9 Inherent Fault Masking in Logic Circuits 0  1 fault in b is critical b c a d f g e h z  1 fault in c or d is not critical (it is masked) 1  0 fault in a or h is not critical (it is masked) Even nonredundant circuits have some masking capability Is there a way to exploit the inherent masking capabilities of logic gates to achieve fault tolerance?

Oct. 2007Fault MaskingSlide 10 Interwoven Redundant Logic Let x1, x2, x3, and x4 be 4 copies of the signal x 1  0 b c a d f g e h z a1a2b1b2a1a2b1b2 a1a2b1b2a1a2b1b2 a3a4b3b4a3a4b3b4 a3a4b3b4a3a4b3b4 e1e1 e2e2 e3e3 e4e4 f1f1 f2f2 f3f3 f4f4 e1e4f1f4e1e4f1f4 1  0 change is critical for AND, subcritical for OR 0  1 change is critical for OR, subcritical for AND To mask h critical faults: Number of gates multiplied by (h + 1) 2 Gate inputs multiplied by h + 1 For h = 1, the scheme is known as Quadded logic Alternating layers of ANDs and ORs can mask each other’s critical faults

Oct. 2007Fault MaskingSlide 11 Interwoven Logic for Nanoelectronics Half-adder implemented in quadded logic From: IEEE D&T July-Aug pp b c a s b a c s

Oct. 2007Fault MaskingSlide 12 Highly Reliable Logic with “Crummy” Relays Moore & Shannon, 1956 a: prob [contact made | energized] c: prob [contact made | not energized] “Make” contact (normally open) a > c “Break” contact (normally closed) a < c xy xy1 No matter how crummy the relays (i.e., how close the values of a and c), one can interconnect many of them in a redundant series-parallel structure to achieve arbitrarily high reliability prob [connection made | energized] = 2a 2 – a 4 (> a if a > 0.62) prob [connection made | not energized] = 2c 2 – c 4 (always < c) x x x x a > 0.5, c < 0.5

Oct. 2007Fault MaskingSlide 13 TMR with Perfect Voter Condition on the module reliability: R = R m [1 + (1 – R m )(2R m – 1)] (1 – R m )(2R m – 1) > 0  R m > 1/2 V Voter R RmRm TMR better Simplex better R t ln TMR Simplex MTTF: TMR 5/(6 ) Simplex 1/ R = 3R m 2 – 2R m 3 > R m ? RIF TMR/Simplex = (1 – R m )/(1 – R) = 1/[1 – R m (2R m – 1)]

Oct. 2007Fault MaskingSlide 14 TMR with Imperfect Voter Condition on the voter reliability R v > 1 / [3R m – 2R m 2 ] V Voter TMR better RvRv RmRm Simplex better Condition on the module reliability 3 – 9 – 8/R v – 8/R v 4 < R m < dR v min / dR m = (–3 + 4R m ) / (3R m – 2R m 2 ) 2 Example: R v = 0.95 requires that 0.56 < R m < 0.94 R = R v (3R m 2 – 2R m 3 ) > R m ?

Oct. 2007Fault MaskingSlide 15 TMR with Compensating Faults V Voter Example: R m = 0.998, p 0 = p 1 = R = 0.999, ,006 = 0.999,990 Basic TMR Compensation RIF TMR/Simplex = / 0.000,016 = 125 RIF Compen/TMR = 0.000,016 / 0.000,010 = 1.6 R m = 1 – p 0 – p 1 (0- and 1-fault probabilities) R = (3R m 2 – 2R m 3 ) + 6p 0 p 1 R m

Oct. 2007Fault MaskingSlide 16 Implementing a Bit-Voter TMR bit-voting: y = x 1 x 2  x 2 x 3  x 3 x 1 (carry output of a single-bit full-adder) What about 5MR, 7MR? V Bit-voter x1x1 x2x2 x3x3 y Other designs are also possible Arithmetic: add the bits, compare to threshold Mux-based Selection-based (majority of bit values is their median) 3-out-of-5 voter built of 2-input gatesTwo mux-based designs for a 3-out-of-5 bit-voter Gate-level design quickly explodes in size

Oct. 2007Fault MaskingSlide 17 Complexity of Different Bit-Voter Designs Cost of majority bit-voters as a function of the number n of inputs

Oct. 2007Fault MaskingSlide 18 Voting at the Word Level Using bit-by-bit voting may be dangerous One might think that in this example, any of the module outputs could be correct, so that producing 1 0 at the output isn’t all that wrong x 1 = 0 0 x 2 = 1 0 x 3 = 1 1 y = 1 0 However, with bit-by-bit voting, the output may be different from all inputs x 1 = x 2 = x 3 = y = Design of bit- and word-voting networks discussed in: Parhami, B., “Voting Networks,” IEEE TR, Aug. 1991

Oct. 2007Fault MaskingSlide 19 Some Simple Voter Designs If in the case of 3-way disagreement any of the inputs can be chosen, then a simple design is possible One can perform pseudo voting that yields the median of 3 analog signals (Dennis, N.G., Microelectronics and Reliability, Aug. 1974) Median and mean voting are also possible with digital signals This design can be readily generalized to a larger number of inputs x1x1 x2x2 x3x3 y Compare 0101 Disagree

Oct. 2007Fault MaskingSlide 20 A TMR Application and Its Bit-Voter Single-event upset (SEU) = Soft error Change of state caused by a high-energy particle strike SEU effect on DRAMs (from SANYO website) n + diffusion layer pn junction field due to impact TMR flip-flop for SEU tolerance DQ C DQ C DQ C 0 1 Data Clock Output Mux

Oct. 2007Fault MaskingSlide 21 Example: SEU Hardened Flip-Flop For list of flip-flop hardening methods and their comparison, see:

Oct. 2007Fault MaskingSlide 22 Switch for Standby Redundancy Standby redundancy requires an n-to-1 switch to select the output of the currently active module The detectors use various info to deduce fault conditions -- Error coding -- Reasonableness checks -- Watchdog timer D 2 1 Detector Spare D 2 1 Spares D 3 D n-to-1 switch Once a fault has been detected, the switch reconfigures the system by flagging the faulty unit and activating next spare in sequence If we use an n-to-2 switch and compare the two selected outputs, the configuration is known as “pair-and-spares”

Oct. 2007Fault MaskingSlide 23 Switch for Hybrid Redundancy Hybrid redundancy with n active and s spare modules requires an (n + s)-to-n switch to select the outputs of the active modules Self-purging redundancy is a variant of hybrid redundancy in which all modules are active at the outset, but they are purged as they disagree with the majority output V S Switch-voter Spare Voter in self-purging redundancy is a threshold voter that considers the inputs with weights of 1 (active) or 0 (purged)

Oct. 2007Fault MaskingSlide 24 Applications of nMR and Hybrid Redundancy The Space Shuttle: Uses 5-way redundancy in hardware Originally, 3 operational units + 2 spares (one warm, one cold) More recently, 4 operational + 1 spare Also, uses 2 independently developed software systems (Design diversity) Japanese Shinkansen “Bullet” Train Triple-duplex system (6-fold redundancy)