ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Fault Modeling Lectures Set 2.

Slides:



Advertisements
Similar presentations
IC TESTING.
Advertisements

CMP238: Projeto e Teste de Sistemas VLSI Marcelo Lubaszewski Aula 2 - Teste PPGC - UFRGS 2005/I.
Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.1 FAULT TOLERANT SYSTEMS Part 1 - Introduction.
Fault Detection in a HW/SW CoDesign Environment Prepared by A. Gaye Soykök.
Copyright 2001, Agrawal & BushnellDay-1 AM-3 Lecture 31 Testing Analog & Digital Products Lecture 3: Fault Modeling n Why model faults? n Some real defects.
CS 582 / CMPE 481 Distributed Systems Fault Tolerance.
TCSS 372A Computer Architecture. Getting Started Get acquainted (take pictures) Discuss purpose, scope, and expectations of the course Discuss personal.
02/02/20091 Logic devices can be classified into two broad categories Fixed Programmable Programmable Logic Device Introduction Lecture Notes – Lab 2.
Spring 08, Jan 15 ELEC 7770: Advanced VLSI Design (Agrawal) 1 ELEC 7770 Advanced VLSI Design Spring 2007 Introduction Vishwani D. Agrawal James J. Danaher.
Spring 07, Jan 16 ELEC 7770: Advanced VLSI Design (Agrawal) 1 ELEC 7770 Advanced VLSI Design Spring 2007 Introduction Vishwani D. Agrawal James J. Danaher.
Lecture 5 Fault Simulation
Lecture 5 Fault Modeling
Copyright 2001, Agrawal & BushnellVLSI Test: Lecture 51 Lecture 5 Fault Modeling n Why model faults? n Some real defects in VLSI and PCB n Common fault.
1 Introduction VLSI Testing. 2 Overview First digital products (mid 1940's) Complexity:low MTTF:hours Cost:high Present day products (mid 1980's) Complexity:high.
1/31/20081 Logic devices can be classified into two broad categories Fixed Programmable Programmable Logic Device Introduction Lecture Notes – Lab 2.
Chapter 4 Processor Technology and Architecture. Chapter goals Describe CPU instruction and execution cycles Explain how primitive CPU instructions are.
1 Lecture 24: Interconnection Networks Topics: communication latency, centralized and decentralized switches (Sections 8.1 – 8.5)
Copyright 2001, Agrawal & BushnellDay-1 AM-1 Lecture 11 Testing Analog & Digital Products Dr. Vishwani D. Agrawal James J. Danaher Professor of Electrical.
Page 1 Copyright © Alexander Allister Shvartsman CSE 6510 (461) Fall 2010 Selected Notes on Fault-Tolerance (12) Alexander A. Shvartsman Computer.
ELEN 468 Lecture 231 ELEN 468 Advanced Logic Design Lecture 23 Testing.
Principle of Functional Verification Chapter 1~3 Presenter : Fu-Ching Yang.
The Processor Data Path & Control Chapter 5 Part 1 - Introduction and Single Clock Cycle Design N. Guydosh 2/29/04.
EE 587 SoC Design & Test Partha Pande School of EECS Washington State University
1 Fault-Tolerant Computing Systems #2 Hardware Fault Tolerance Pattara Leelaprute Computer Engineering Department Kasetsart University
ECE 553: TESTING AND TESTABLE DESIGN OF DIGITAL SYSTES Fault Modeling.
An Introduction Chapter Chapter 1 Introduction2 Computer Systems  Programmable machines  Hardware + Software (program) HardwareProgram.
Lecture 2: Field Programmable Gate Arrays September 13, 2004 ECE 697F Reconfigurable Computing Lecture 2 Field Programmable Gate Arrays.
Unit V Fault Diagnosis.
1 Computer Architecture Research Overview Rajeev Balasubramonian School of Computing, University of Utah
Slide No. 1 Course: Logic Design Dr. Ali Elkateeb Topic: Introduction Course Number: COMP 1213 Course Title: Logic Design Instructor: Dr. Ali Elkateeb.
Unit I Testing and Fault Modelling
Computer Engineering Group Brandenburg University of Technology at Cottbus 1 Ressource Reduced Triple Modular Redundancy for Built-In Self-Repair in VLIW-Processors.
J. Christiansen, CERN - EP/MIC
Testing of Digital Systems: An Introduction Dr. Aiman H. El-Maleh Computer Engineering Department King Fahd University of Petroleum & Minerals Dr. Aiman.
Part.1.1 In The Name of GOD Welcome to Babol (Nooshirvani) University of Technology Electrical & Computer Engineering Department.
©Ian Sommerville 2004Software Engineering, 7th edition. Chapter 20 Slide 1 Critical systems development 3.
Page 1EL/CCUT T.-C. Huang Mar TCH CCUT Introduction to IC Test Tsung-Chu Huang ( 黃宗柱 ) Department of Electronic Eng. Chong Chou Institute of Tech.
ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Availability Copyright 2004 Daniel J. Sorin Duke University.
1 Note on Testing for Hardware Components. 2 Steps in successful hardware design (basic “process”): 1.Understand the requirements (“product’) 2.Write.
Fault Models, Fault Simulation and Test Generation Vishwani D. Agrawal Department of ECE, Auburn University Auburn, AL 36849, USA
CDA 3101 Fall 2013 Introduction to Computer Organization The Arithmetic Logic Unit (ALU) and MIPS ALU Support 20 September 2013.
Analytical Approach for Soft Error Rate Estimation of SRAM-Based FPGAs Ghazanfar (Hossein) Asadi and Mehdi B. Tahoori Why Soft Error Rate (SER) Estimation?
TOPIC : Different levels of Fault model UNIT 2 : Fault Modeling Module 2.1 Modeling Physical fault to logical fault.
Section 1  Quickly identify faulty components  Design new, efficient testing methodologies to offset the complexity of FPGA testing as compared to.
Hwajung Lee. One of the selling points of a distributed system is that the system will continue to perform even if some components / processes fail.
EE434 ASIC & Digital Systems Partha Pande School of EECS Washington State University
CS/EE 3700 : Fundamentals of Digital System Design
FPGA-Based System Design: Chapter 1 Copyright  2004 Prentice Hall PTR Moore’s Law n Gordon Moore: co-founder of Intel. n Predicted that number of transistors.
THE MICROPROCESSOR A microprocessor is a single chip of silicon that performs all of the essential functions of a computer central processor unit (CPU)
Jan. 26, 2001VLSI Test: Bushnell-Agrawal/Lecture 51 Lecture 5 Fault Modeling n Why model faults? n Some real defects in VLSI and PCB n Common fault models.
Silicon Programming--Testing1 Completing a successful project (introduction) Design for testability.
TOPIC : Introduction to Faults UNIT 2: Modeling and Simulation Module 1 : Logical faults due to physical faults.
Faults and fault-tolerance One of the selling points of a distributed system is that the system will continue to perform even if some components / processes.
ECE 553: TESTING AND TESTABLE DESIGN OF DIGITAL SYSTES Functional testing.
TOPIC : Introduction to Faults UNIT 2: Modeling and Simulation Module 1 : Logical faults due to physical faults.
Copyright 2001, Agrawal & BushnellVLSI Test: Lecture 51 Lecture 5 Fault Modeling n Why model faults? n Some real defects in VLSI and PCB n Common fault.
FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR Topics n FPGA fabric architecture concepts.
Lecture 5: Design for Testability. CMOS VLSI DesignCMOS VLSI Design 4th Ed. 12: Design for Testability2 Outline  Testing –Logic Verification –Silicon.
EECE 320 L8: Combinational Logic design Principles 1Chehab, AUB, 2003 EECE 320 Digital Systems Design Lecture 8: Combinational Logic Design Principles.
Faults and fault-tolerance
ELEC 7770 Advanced VLSI Design Spring 2016 Introduction
ELEC 7770 Advanced VLSI Design Spring 2014 Introduction
ELEC Digital Logic Circuits Fall 2014 Logic Testing (Chapter 12)
Lecture 5 Fault Modeling
Faults and fault-tolerance
Automatic Test Generation for Combinational Circuits
Mattan Erez The University of Texas at Austin July 2015
Introduction to Fault Tolerance
COMS 361 Computer Organization
VLSI Testing Lecture 3: Fault Modeling
Presentation transcript:

ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Fault Modeling Lectures Set 2

ECE 753 Fault Tolerant Computing2 Overview Fault Modeling References Introduction Fault models at different levels (HW) Error models High-level failure models (process or system failure)High-level failure models (process or system failure) Summary

ECE 753 Fault Tolerant Computing3 Recap Think about PROJECT Terminology and definitions Fundamental principles - Redundancy –Hardware - low and high levelHardware - low and high level –SoftwareSoftware –TimeTime –InformationInformation FEF Chain and methods to break it (barriers) –Attributes of faults and fault types - such as permanent, transient, intermittent (please read)Attributes of faults and fault types - such as permanent, transient, intermittent (please read)

ECE 753 Fault Tolerant Computing4 Fault Modeling References [abra:86] Abraham and Fuchs, Fault and error modeling for VLSI, Proc. IEEE, May 1986[abra:86] Abraham and Fuchs, Fault and error modeling for VLSI, Proc. IEEE, May 1986 [kala:13] Kalayappan and Sarangi, A survey of checker architectures, ACM Computing survey, Aug 2013[kala:13] Kalayappan and Sarangi, A survey of checker architectures, ACM Computing survey, Aug 2013 [mull:93] Hadzilacos and Toueg, Fault tolerant broadcast and related problems, In Distributed systems (book)[mull:93] Hadzilacos and Toueg, Fault tolerant broadcast and related problems, In Distributed systems (book)

ECE 753 Fault Tolerant Computing5 Fault Modeling (contd.) Introduction What is a model? –An abstraction that captures the behavior of the original system.An abstraction that captures the behavior of the original system. must be simple must lead to accurate conclusions

ECE 753 Fault Tolerant Computing6 Fault Modeling (contd.) Introduction Why use a model? –tractability of analysistractability of analysis –a non-destructive method to study (low cost, alternative to fault injection)a non-destructive method to study (low cost, alternative to fault injection) –manageable study space (can check equivalence and reduce the study space)manageable study space (can check equivalence and reduce the study space)

ECE 753 Fault Tolerant Computing7 Fault Modeling (contd.) Introduction Different models at different levels of abstractions:Different models at different levels of abstractions: –Chip level - manufacturing defects, random faults, transistor faults, gate failures, aging,…Chip level - manufacturing defects, random faults, transistor faults, gate failures, aging,… –System levelSystem level HW - aging, interconnect failures, chip failures, … SW - bugs, design flaws, incorrect algorithms,...

ECE 753 Fault Tolerant Computing8 Fault Modeling (contd.) Fault models at different levels (HW)Fault models at different levels (HW) Process level Transistor level Gate level Function level (often error models) Behaviour level (often timing failure models)Behaviour level (often timing failure models)... System level (usually failure models)

ECE 753 Fault Tolerant Computing9 Fault Modeling (contd.) Fault models at different levels (contd.) Process level - Defect models cluster defects point and random defects used to predict the process yield tested using optical and parametric tests effect of defect chip fails to perform its function unacceptable parameters - large capacitance, large delay, slow speed, high currentunacceptable parameters - large capacitance, large delay, slow speed, high current

ECE 753 Fault Tolerant Computing10 Fault Modeling (contd.) Fault models at different levels (contd.) Transistor level - failure of a transistor fabrication level causes - point defects, mask misallignment, design rule violationfabrication level causes - point defects, mask misallignment, design rule violation physical facts - shorts, opens, line-bridges, others size variations -> altered delays coupling/crosstalk degradation of elements - electromigration alpha particle hits power transients missing/extra transistors – PLAs Function modification/alteration - FPGA

ECE 753 Fault Tolerant Computing11 Fault Modeling (contd.) Fault models at different levels (contd.) Transistor level - erroneous behaviors High current incorrect logic output intermediate voltage different performance (operating speed) state change - alpha particle hit

ECE 753 Fault Tolerant Computing12 Fault Modeling (contd.) Fault models at different levels (contd.) Transistor level - prevalent fault models stuck-on and stuck-off faults bridging fault strength of signals delay fault coupling and cross talk Limitations very large number of possible faults makes it difficult to handle these faults (intractability due to large model space)very large number of possible faults makes it difficult to handle these faults (intractability due to large model space)

ECE 753 Fault Tolerant Computing13 Fault Modeling (contd.) Fault models at different levels (contd.) Transistor level - comments (these are fairly general and are not restricted to transistor level model)Transistor level - comments (these are fairly general and are not restricted to transistor level model) increasing computing power implies that we can handle large number of faults and complex modelsincreasing computing power implies that we can handle large number of faults and complex models these models used for test generation and not for fault tolerance per saythese models used for test generation and not for fault tolerance per say methods have been proposed to reduce the number of faults that need to be studied - e.g. fault equivalencemethods have been proposed to reduce the number of faults that need to be studied - e.g. fault equivalence classical method and newer methods (such as current testing) are employed in real testingclassical method and newer methods (such as current testing) are employed in real testing design for testability and built-in self-test are becoming prevelentdesign for testability and built-in self-test are becoming prevelent

ECE 753 Fault Tolerant Computing14 Fault Modeling (contd.) Fault models a different levels (contd.)Fault models a different levels (contd.) Gate level - causes same as for transistors additional causes in SSI and board level - failed resistor, failed solder joint, failed wire wrap, …additional causes in SSI and board level - failed resistor, failed solder joint, failed wire wrap, … Gate level - erroneous behaviors similar to those as for transistors (one of the most commonly used model - why? See next slides)(one of the most commonly used model - why? See next slides)

ECE 753 Fault Tolerant Computing15 Fault Modeling (contd.) Fault models a different levels (contd.)Fault models a different levels (contd.) Gate level - different models Stuck-at: a line value stays the same irrespective of the signal applied to the lineStuck-at: a line value stays the same irrespective of the signal applied to the line Advantages simplicity accuracy can model most real faults tractable model space - count the possible number of faultstractable model space - count the possible number of faults easy to use and easy to quantify (for quality metric) substantial empirical evidence of its practical use

ECE 753 Fault Tolerant Computing16 Fault Modeling (contd.) Fault models a different levels (contd.)Fault models a different levels (contd.) Gate level - different models Stuck-at - (contd.) Disadvantages with increasing device density the model is being questioned often and loosing many of its advantageswith increasing device density the model is being questioned often and loosing many of its advantages Some real defects can not be modeled by this model more powerful computers are making it possible to handle other models - even at fabrication levelmore powerful computers are making it possible to handle other models - even at fabrication level

ECE 753 Fault Tolerant Computing17 Fault Modeling (contd.) Fault models a different levels (contd.) Gate level - different models Bridging faults - pair of lines in a circuit (at gate level) are shorted. Many variations such as intergate, intragate, neighboring lines, …Bridging faults - pair of lines in a circuit (at gate level) are shorted. Many variations such as intergate, intragate, neighboring lines, … Advantages simple realistic Disadvantages large number of faults difficult to relate to the quality metric

ECE 753 Fault Tolerant Computing18 Fault Modeling (contd.) Fault models a different levels (contd.)Fault models a different levels (contd.) Gate level - different models Stuck-open/Stuck-On - Transistor based open fault can be modeled by logic level. Some time extra logic gates are used to model opens in this manner similar to modeling bridging faultsStuck-open/Stuck-On - Transistor based open fault can be modeled by logic level. Some time extra logic gates are used to model opens in this manner similar to modeling bridging faults

ECE 753 Fault Tolerant Computing19 Fault Modeling (contd.) Fault models a different levels (contd.)Fault models a different levels (contd.) Gate level - different models Delay faults - delay of a gate or a line is different than the nominal or know delay in a perfect processDelay faults - delay of a gate or a line is different than the nominal or know delay in a perfect process Deals with critical paths - gate delay, path delay,... Advantages Performance oriented modeling Quite general Disadvantages Difficult to use and intractable (path delay)

ECE 753 Fault Tolerant Computing20 Fault Modeling (contd.) Fault models a different levels (contd.)Fault models a different levels (contd.) Gate level - different models Other models coupling between pair of lines pin or I/O faults in gates (or chips) speedup/slow down of signals (sub-micron technologies)speedup/slow down of signals (sub-micron technologies) aging (such as NBTI in sub-micron technologies)

ECE 753 Fault Tolerant Computing21 Fault Modeling (contd.) Fault models a different levels (contd.)Fault models a different levels (contd.) Function Level - when used lower level description is not available function level processing (e.g. simulation) is often fasterfunction level processing (e.g. simulation) is often faster design available only in mixed form (gate and function)design available only in mixed form (gate and function)

ECE 753 Fault Tolerant Computing22 Fault Modeling (contd.) Fault models a different levels (contd.)Fault models a different levels (contd.) Function Level - where used combinational circuits logic blocks decoders finite state machines large complex circuits microprocessors (often only mix format is available, such as ALU in gate level, memory in functional level, etc.)microprocessors (often only mix format is available, such as ALU in gate level, memory in functional level, etc.) for other building blocks PLAs, RAMs, FPGAs

ECE 753 Fault Tolerant Computing23 Fault Modeling (contd.) Fault models a different levels (contd.)Fault models a different levels (contd.) System Level - when used interconnected systems ad hoc connected systems regular connected systems failure of a system or systems, or interconnects many failure models exist and will be dicussed later in the coursemany failure models exist and will be dicussed later in the course

ECE 753 Fault Tolerant Computing24 Fault Modeling (contd.) Error models Means of classifying the effect of physical fault(s) in a system - note from modeling point of view it is not necessary that we deduce it using a fault modelMeans of classifying the effect of physical fault(s) in a system - note from modeling point of view it is not necessary that we deduce it using a fault model Goals extent of information corrupted extent of error(s) propagated latency issue

ECE 753 Fault Tolerant Computing25 Fault Modeling (contd.) Error models (contd.) Error effects data control state Error Types (HW) bit errors (data, control, state) - single bit error assumption commonly used in practicebit errors (data, control, state) - single bit error assumption commonly used in practice unidirectional errors (mostly in data) byte errors (data) other - intermediate logic level

ECE 753 Fault Tolerant Computing26 Fault Modeling (contd.) Error models (contd.) Error Types (SW) branch error missing instruction error missing/dangling pointer errors

ECE 753 Fault Tolerant Computing27 Fault Modeling (contd.) High-level failure models (process or system failure)High-level failure models (process or system failure) System model single or multiple processor system single - multiple processes executing key - interacting processes - such as message passing systems, distributed systems,...key - interacting processes - such as message passing systems, distributed systems,...

ECE 753 Fault Tolerant Computing28 Fault Modeling (contd.) High-level failure models (process or system failure)High-level failure models (process or system failure) General classification crash failure - a faulty processor or system stops permanentlycrash failure - a faulty processor or system stops permanently omission failure - a faulty process omits inputs/outputs some times but when it works, it works correctlyomission failure - a faulty process omits inputs/outputs some times but when it works, it works correctly timing failure - inputs/outputs are delayed or arrive too earlytiming failure - inputs/outputs are delayed or arrive too early Byzantine failure or arbitrary failure - a faulty processor can exhibit arbitrary behavior including malicious natureByzantine failure or arbitrary failure - a faulty processor can exhibit arbitrary behavior including malicious nature

ECE 753 Fault Tolerant Computing29 Summary Fault modeling –ReferencesReferences –Fault models at different levelsFault models at different levels –Error modelsError models –Process or system failure modelsProcess or system failure models