ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Fault Modeling Lectures Set 2
ECE 753 Fault Tolerant Computing2 Overview Fault Modeling References Introduction Fault models at different levels (HW) Error models High-level failure models (process or system failure)High-level failure models (process or system failure) Summary
ECE 753 Fault Tolerant Computing3 Recap Think about PROJECT Terminology and definitions Fundamental principles - Redundancy –Hardware - low and high levelHardware - low and high level –SoftwareSoftware –TimeTime –InformationInformation FEF Chain and methods to break it (barriers) –Attributes of faults and fault types - such as permanent, transient, intermittent (please read)Attributes of faults and fault types - such as permanent, transient, intermittent (please read)
ECE 753 Fault Tolerant Computing4 Fault Modeling References [abra:86] Abraham and Fuchs, Fault and error modeling for VLSI, Proc. IEEE, May 1986[abra:86] Abraham and Fuchs, Fault and error modeling for VLSI, Proc. IEEE, May 1986 [kala:13] Kalayappan and Sarangi, A survey of checker architectures, ACM Computing survey, Aug 2013[kala:13] Kalayappan and Sarangi, A survey of checker architectures, ACM Computing survey, Aug 2013 [mull:93] Hadzilacos and Toueg, Fault tolerant broadcast and related problems, In Distributed systems (book)[mull:93] Hadzilacos and Toueg, Fault tolerant broadcast and related problems, In Distributed systems (book)
ECE 753 Fault Tolerant Computing5 Fault Modeling (contd.) Introduction What is a model? –An abstraction that captures the behavior of the original system.An abstraction that captures the behavior of the original system. must be simple must lead to accurate conclusions
ECE 753 Fault Tolerant Computing6 Fault Modeling (contd.) Introduction Why use a model? –tractability of analysistractability of analysis –a non-destructive method to study (low cost, alternative to fault injection)a non-destructive method to study (low cost, alternative to fault injection) –manageable study space (can check equivalence and reduce the study space)manageable study space (can check equivalence and reduce the study space)
ECE 753 Fault Tolerant Computing7 Fault Modeling (contd.) Introduction Different models at different levels of abstractions:Different models at different levels of abstractions: –Chip level - manufacturing defects, random faults, transistor faults, gate failures, aging,…Chip level - manufacturing defects, random faults, transistor faults, gate failures, aging,… –System levelSystem level HW - aging, interconnect failures, chip failures, … SW - bugs, design flaws, incorrect algorithms,...
ECE 753 Fault Tolerant Computing8 Fault Modeling (contd.) Fault models at different levels (HW)Fault models at different levels (HW) Process level Transistor level Gate level Function level (often error models) Behaviour level (often timing failure models)Behaviour level (often timing failure models)... System level (usually failure models)
ECE 753 Fault Tolerant Computing9 Fault Modeling (contd.) Fault models at different levels (contd.) Process level - Defect models cluster defects point and random defects used to predict the process yield tested using optical and parametric tests effect of defect chip fails to perform its function unacceptable parameters - large capacitance, large delay, slow speed, high currentunacceptable parameters - large capacitance, large delay, slow speed, high current
ECE 753 Fault Tolerant Computing10 Fault Modeling (contd.) Fault models at different levels (contd.) Transistor level - failure of a transistor fabrication level causes - point defects, mask misallignment, design rule violationfabrication level causes - point defects, mask misallignment, design rule violation physical facts - shorts, opens, line-bridges, others size variations -> altered delays coupling/crosstalk degradation of elements - electromigration alpha particle hits power transients missing/extra transistors – PLAs Function modification/alteration - FPGA
ECE 753 Fault Tolerant Computing11 Fault Modeling (contd.) Fault models at different levels (contd.) Transistor level - erroneous behaviors High current incorrect logic output intermediate voltage different performance (operating speed) state change - alpha particle hit
ECE 753 Fault Tolerant Computing12 Fault Modeling (contd.) Fault models at different levels (contd.) Transistor level - prevalent fault models stuck-on and stuck-off faults bridging fault strength of signals delay fault coupling and cross talk Limitations very large number of possible faults makes it difficult to handle these faults (intractability due to large model space)very large number of possible faults makes it difficult to handle these faults (intractability due to large model space)
ECE 753 Fault Tolerant Computing13 Fault Modeling (contd.) Fault models at different levels (contd.) Transistor level - comments (these are fairly general and are not restricted to transistor level model)Transistor level - comments (these are fairly general and are not restricted to transistor level model) increasing computing power implies that we can handle large number of faults and complex modelsincreasing computing power implies that we can handle large number of faults and complex models these models used for test generation and not for fault tolerance per saythese models used for test generation and not for fault tolerance per say methods have been proposed to reduce the number of faults that need to be studied - e.g. fault equivalencemethods have been proposed to reduce the number of faults that need to be studied - e.g. fault equivalence classical method and newer methods (such as current testing) are employed in real testingclassical method and newer methods (such as current testing) are employed in real testing design for testability and built-in self-test are becoming prevelentdesign for testability and built-in self-test are becoming prevelent
ECE 753 Fault Tolerant Computing14 Fault Modeling (contd.) Fault models a different levels (contd.)Fault models a different levels (contd.) Gate level - causes same as for transistors additional causes in SSI and board level - failed resistor, failed solder joint, failed wire wrap, …additional causes in SSI and board level - failed resistor, failed solder joint, failed wire wrap, … Gate level - erroneous behaviors similar to those as for transistors (one of the most commonly used model - why? See next slides)(one of the most commonly used model - why? See next slides)
ECE 753 Fault Tolerant Computing15 Fault Modeling (contd.) Fault models a different levels (contd.)Fault models a different levels (contd.) Gate level - different models Stuck-at: a line value stays the same irrespective of the signal applied to the lineStuck-at: a line value stays the same irrespective of the signal applied to the line Advantages simplicity accuracy can model most real faults tractable model space - count the possible number of faultstractable model space - count the possible number of faults easy to use and easy to quantify (for quality metric) substantial empirical evidence of its practical use
ECE 753 Fault Tolerant Computing16 Fault Modeling (contd.) Fault models a different levels (contd.)Fault models a different levels (contd.) Gate level - different models Stuck-at - (contd.) Disadvantages with increasing device density the model is being questioned often and loosing many of its advantageswith increasing device density the model is being questioned often and loosing many of its advantages Some real defects can not be modeled by this model more powerful computers are making it possible to handle other models - even at fabrication levelmore powerful computers are making it possible to handle other models - even at fabrication level
ECE 753 Fault Tolerant Computing17 Fault Modeling (contd.) Fault models a different levels (contd.) Gate level - different models Bridging faults - pair of lines in a circuit (at gate level) are shorted. Many variations such as intergate, intragate, neighboring lines, …Bridging faults - pair of lines in a circuit (at gate level) are shorted. Many variations such as intergate, intragate, neighboring lines, … Advantages simple realistic Disadvantages large number of faults difficult to relate to the quality metric
ECE 753 Fault Tolerant Computing18 Fault Modeling (contd.) Fault models a different levels (contd.)Fault models a different levels (contd.) Gate level - different models Stuck-open/Stuck-On - Transistor based open fault can be modeled by logic level. Some time extra logic gates are used to model opens in this manner similar to modeling bridging faultsStuck-open/Stuck-On - Transistor based open fault can be modeled by logic level. Some time extra logic gates are used to model opens in this manner similar to modeling bridging faults
ECE 753 Fault Tolerant Computing19 Fault Modeling (contd.) Fault models a different levels (contd.)Fault models a different levels (contd.) Gate level - different models Delay faults - delay of a gate or a line is different than the nominal or know delay in a perfect processDelay faults - delay of a gate or a line is different than the nominal or know delay in a perfect process Deals with critical paths - gate delay, path delay,... Advantages Performance oriented modeling Quite general Disadvantages Difficult to use and intractable (path delay)
ECE 753 Fault Tolerant Computing20 Fault Modeling (contd.) Fault models a different levels (contd.)Fault models a different levels (contd.) Gate level - different models Other models coupling between pair of lines pin or I/O faults in gates (or chips) speedup/slow down of signals (sub-micron technologies)speedup/slow down of signals (sub-micron technologies) aging (such as NBTI in sub-micron technologies)
ECE 753 Fault Tolerant Computing21 Fault Modeling (contd.) Fault models a different levels (contd.)Fault models a different levels (contd.) Function Level - when used lower level description is not available function level processing (e.g. simulation) is often fasterfunction level processing (e.g. simulation) is often faster design available only in mixed form (gate and function)design available only in mixed form (gate and function)
ECE 753 Fault Tolerant Computing22 Fault Modeling (contd.) Fault models a different levels (contd.)Fault models a different levels (contd.) Function Level - where used combinational circuits logic blocks decoders finite state machines large complex circuits microprocessors (often only mix format is available, such as ALU in gate level, memory in functional level, etc.)microprocessors (often only mix format is available, such as ALU in gate level, memory in functional level, etc.) for other building blocks PLAs, RAMs, FPGAs
ECE 753 Fault Tolerant Computing23 Fault Modeling (contd.) Fault models a different levels (contd.)Fault models a different levels (contd.) System Level - when used interconnected systems ad hoc connected systems regular connected systems failure of a system or systems, or interconnects many failure models exist and will be dicussed later in the coursemany failure models exist and will be dicussed later in the course
ECE 753 Fault Tolerant Computing24 Fault Modeling (contd.) Error models Means of classifying the effect of physical fault(s) in a system - note from modeling point of view it is not necessary that we deduce it using a fault modelMeans of classifying the effect of physical fault(s) in a system - note from modeling point of view it is not necessary that we deduce it using a fault model Goals extent of information corrupted extent of error(s) propagated latency issue
ECE 753 Fault Tolerant Computing25 Fault Modeling (contd.) Error models (contd.) Error effects data control state Error Types (HW) bit errors (data, control, state) - single bit error assumption commonly used in practicebit errors (data, control, state) - single bit error assumption commonly used in practice unidirectional errors (mostly in data) byte errors (data) other - intermediate logic level
ECE 753 Fault Tolerant Computing26 Fault Modeling (contd.) Error models (contd.) Error Types (SW) branch error missing instruction error missing/dangling pointer errors
ECE 753 Fault Tolerant Computing27 Fault Modeling (contd.) High-level failure models (process or system failure)High-level failure models (process or system failure) System model single or multiple processor system single - multiple processes executing key - interacting processes - such as message passing systems, distributed systems,...key - interacting processes - such as message passing systems, distributed systems,...
ECE 753 Fault Tolerant Computing28 Fault Modeling (contd.) High-level failure models (process or system failure)High-level failure models (process or system failure) General classification crash failure - a faulty processor or system stops permanentlycrash failure - a faulty processor or system stops permanently omission failure - a faulty process omits inputs/outputs some times but when it works, it works correctlyomission failure - a faulty process omits inputs/outputs some times but when it works, it works correctly timing failure - inputs/outputs are delayed or arrive too earlytiming failure - inputs/outputs are delayed or arrive too early Byzantine failure or arbitrary failure - a faulty processor can exhibit arbitrary behavior including malicious natureByzantine failure or arbitrary failure - a faulty processor can exhibit arbitrary behavior including malicious nature
ECE 753 Fault Tolerant Computing29 Summary Fault modeling –ReferencesReferences –Fault models at different levelsFault models at different levels –Error modelsError models –Process or system failure modelsProcess or system failure models