ECE 753: FAULT-TOLERANT COMPUTING
Kewal K. Saluja, Department of Electrical and Computer Engineering
Low Level Fault-Tolerance: Watchdog and Re-execution

Presentation transcript:

Slide 1: ECE 753: Fault-Tolerant Computing
• Kewal K. Saluja, Department of Electrical and Computer Engineering
• Low Level Fault-Tolerance: Watchdog and Re-execution

Slide 2: Overview
• Introduction
• Watchdog techniques
  – Timers, watchdog processors, error model, control flow checking, memory access and assertion checking
• Re-execution for fault-tolerance
  – Basic techniques: RESO concept, program re-execution, instruction re-execution
  – Case studies: fine grain parallel architecture (CRAY), SMT architecture, multiscalar architecture, chip multiprocessor
• Summary

Slide 3: Introduction
• References
  – Watchdog: [mahm:88]
  – Re-execution: [rotenberg:99], [rashid:00], [subra:10], [kala:13]
  – Sohi, Franklin, and Saluja, “A study of time-redundant fault-tolerant techniques for high-performance pipelined computers,” Proceedings FTCS-19, June 1989, pp.

Slide 4: Introduction (contd.)
• Somewhat higher level than ECC and masking at circuit level
• Bordering between hardware and software (hardware often assisted by software)
• These are some of the very first fault-tolerance methods

Slide 5: Watchdog techniques
• Key concept
  – The actions of a process or processor are checked by another unit, normally a hardware unit. The checks include whether the process is still active (alive), whether it is executing an incorrect path, etc.
[Figure: processor monitored by a watchdog]

Slide 6: Watchdog: Timers
• Check for aliveness
  – Processor resets the timer at certain intervals or on certain conditions
  – Timer raises error flag if not reset before it overruns
[Figure: processor resetting a timer; timer output is the error signal]
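The aliveness check on this slide can be illustrated with a small simulation. This is only a sketch under stated assumptions: the class name `WatchdogTimer`, the tick-based time model, and the helper `run` are hypothetical and do not come from the lecture.

```python
class WatchdogTimer:
    """Counts down once per tick and latches an error flag on overrun."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.error = False

    def reset(self):
        # Called by the monitored processor to show that it is still alive.
        self.remaining = self.timeout_ticks

    def tick(self):
        # Called once per clock tick by the (simulated) timer hardware.
        if self.error:
            return
        self.remaining -= 1
        if self.remaining <= 0:
            self.error = True


def run(reset_ticks, total_ticks, timeout_ticks):
    """Simulate total_ticks; the processor resets the timer at reset_ticks."""
    watchdog = WatchdogTimer(timeout_ticks)
    for t in range(total_ticks):
        if t in reset_ticks:
            watchdog.reset()
        watchdog.tick()
    return watchdog.error


# A healthy processor keeps resetting the timer; a hung one stops after tick 3.
healthy_error = run({0, 3, 6, 9}, total_ticks=12, timeout_ticks=5)
hung_error = run({0, 3}, total_ticks=12, timeout_ticks=5)
```

In this sketch the error flag stays clear for the healthy run and is raised for the hung run, because the timer is not reset before it overruns.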

Slide 7: Watchdog: Timers (contd.)
• Check for timeout
  – Processor sends a message and starts a timer; the second processor must reply within this time (hardware/software implementation)
[Figure: Processor A, Processor B, and a timer]
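A software version of the timeout check might look like the following sketch. The two processors are modeled as threads that exchange messages through queues; the function names and the use of `None` to model a failed processor are assumptions made for illustration.

```python
import queue
import threading
import time


def processor_b(inbox, outbox, reply_delay):
    """Second processor: replies after reply_delay seconds, or never if None."""
    message = inbox.get()
    if reply_delay is None:
        return  # models a failed processor that never answers
    time.sleep(reply_delay)
    outbox.put(("ack", message))


def send_and_wait(message, reply_delay, timeout):
    """Processor A: send a message, start a timer, and wait for the reply."""
    to_b, from_b = queue.Queue(), queue.Queue()
    worker = threading.Thread(
        target=processor_b, args=(to_b, from_b, reply_delay), daemon=True
    )
    worker.start()
    to_b.put(message)
    try:
        # Queue.get with a timeout plays the role of the timer.
        return from_b.get(timeout=timeout)
    except queue.Empty:
        return None  # timeout: report an error for processor B


reply = send_and_wait("ping", reply_delay=0.01, timeout=1.0)
no_reply = send_and_wait("ping", reply_delay=None, timeout=0.05)
```

Here `None` stands for the error indication; a real system would typically retry or switch to a spare at that point.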

Slide 8: Watchdog: Timers (contd.)
• Applications
  – Processor control systems (chemical, mechanical and other control systems)
  – Switching systems: messages sent or received often wait a certain length of time before they are repeated
  – Networks: messages often have timeouts associated with them

Slide 9: Watchdog: Processors
• Architecture: can be complex, but let us consider the following simple architecture
[Figure: memory and processor connected by a bus (data, address, control), with the watchdog observing the bus]

Slide 10: Watchdog: Processors (contd.)
• What can it achieve?
  – Observe the address bus
    • Can observe the data
    • Can observe instructions
    • Can check the flow of program control
  – Need to know what kind of errors can occur to determine the capability of this method

Slide 11: Watchdog: Error models
• Experimental setup to develop error models applicable at this level
  – Processor-memory architecture
  – Inject faults (random errors) in the I/O processor, within the processor (register file, states), and within memory
  – Simulate
  – Hardware was also designed to inject such faults and study the impact/behavior

Slide 12: Watchdog: Error models (contd.)
• Conclusions of the studies
  – Program flow could change (branch to no branch, or vice versa)
  – Instruction fetched from data space
  – Access to nonexistent memory space
  – Data fetched from instruction space
  – Illegal instruction
  – Writing in protected area (ROM)
• 60% of all faults could be detected by monitoring control flow. Thus we need to develop methods that are good at monitoring control flow

Slide 13: Watchdog: Control flow checking
• Basic principle
  – Analyze the program and extract control information
    • Branch free intervals
    • Subroutine calls
  – Assign signatures to branch free intervals and provide these signatures to the watchdog processor to check these values

Slide 14: Watchdog: Control flow checking (contd.)
• A simple example

  Program               Watchdog
  start            ->   receive start
  branch free code      observe bus cont. to form signature
  check sig X      ->   check X against collected sig

Slide 15: Watchdog: Control flow checking (contd.)
• Details and variations
  – Structural integrity checking
    • Analyze the program control flow and create a program control flow graph
    • Assign a unique identifier to each node of the graph
    • Provide the control flow graph to the watchdog along with the identifiers
    • In case of branches, the watchdog expects one of the several possible identifiers
• Limitations
  – Performance impact: insertion of special instructions
  – Inability to detect data processing variations, e.g., an add changed to a sub
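Structural integrity checking can be sketched as follows. The control flow graph, the node identifiers, and the function `check_trace` are hypothetical; the watchdog is assumed to receive one identifier each time the program enters a node.

```python
# Hypothetical control flow graph: node identifier -> set of legal successors.
CFG = {
    "A": {"B", "C"},  # A ends in a branch, so two identifiers are acceptable
    "B": {"D"},
    "C": {"D"},
    "D": set(),
}


def check_trace(cfg, trace):
    """Return the index of the first illegal transition, or None if all are legal."""
    for i in range(1, len(trace)):
        legal_successors = cfg.get(trace[i - 1], set())
        if trace[i] not in legal_successors:
            return i
    return None


legal = check_trace(CFG, ["A", "B", "D"])
illegal = check_trace(CFG, ["A", "D"])          # A may not jump straight to D
wrong_but_legal = check_trace(CFG, ["A", "C", "D"])
```

The last call shows a property of this sketch: the watchdog accepts any legal successor, so a fault that takes the wrong but structurally valid branch (C instead of B) passes the check, as do errors that change only the data operation inside a node.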

Slide 16: Watchdog: Control flow checking (contd.)
• Details and variations (contd.)
  – Derived signature checking
    • Compiler identifies branch free intervals and generates signatures (such as a checksum) for these intervals
    • At run time these signatures are provided to the watchdog, using tag bits to differentiate between regular instructions and watchdog messages
    • Watchdog monitors the bus, generates its own signatures, and compares them with the signatures captured from the bus (compiled signatures)
    • Example: associate two tag bits with every memory word to differentiate between instructions and compiled signatures. When a tag for a signature appears on the bus, the watchdog captures the tag and forces a NOP on the bus for the regular processor
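A minimal sketch of derived signature checking, assuming a 16-bit additive checksum as the signature. The slide names a checksum only as one possible signature, and the instruction words, function names, and list-of-intervals representation below are invented for illustration.

```python
def checksum(words, width=16):
    """Additive checksum over instruction words, truncated to `width` bits."""
    mask = (1 << width) - 1
    total = 0
    for word in words:
        total = (total + word) & mask
    return total


def compile_signatures(intervals):
    """Compile time: one signature per branch free interval."""
    return [checksum(block) for block in intervals]


def watchdog_check(observed_intervals, compiled_signatures):
    """Run time: recompute each signature from the bus and compare.

    Returns the indices of the intervals whose signatures do not match.
    """
    mismatches = []
    pairs = zip(observed_intervals, compiled_signatures)
    for index, (block, expected) in enumerate(pairs):
        if checksum(block) != expected:
            mismatches.append(index)
    return mismatches


# Two branch free intervals of made-up instruction words.
program = [[0x1234, 0x00A1, 0x0F0F], [0x2222, 0x0001]]
signatures = compile_signatures(program)

# A single bit error in the second interval as seen on the bus.
corrupted = [program[0], [0x2223, 0x0001]]

clean_report = watchdog_check(program, signatures)
fault_report = watchdog_check(corrupted, signatures)
```

The sketch leaves out the tag bits and the forced NOP; it only shows the compare step between the compiled and the run-time signature.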

Slide 17: Watchdog: Control flow checking (contd.)
• Details and variations (contd.)
  – Derived signature checking (contd.)
    • Coverage
      – Can detect random errors in instructions in branch free intervals (but aliasing can occur)
    • Overheads
      – Memory width increase due to tag bits
      – Memory increase due to signature insertions
      – Performance impact due to NOPs
    • Solutions
      – Path signature method: reduces the number of signatures needed
      – Branch address hashing: merge signature and branch address
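The aliasing mentioned under coverage can be demonstrated with the same kind of additive checksum. This is an illustrative assumption; the lecture does not say which signature function is used, and the instruction words are made up.

```python
def checksum(words, width=16):
    """Additive checksum over instruction words, truncated to `width` bits."""
    mask = (1 << width) - 1
    total = 0
    for word in words:
        total = (total + word) & mask
    return total


original = [0x2222, 0x0001]
single_error = [0x2223, 0x0001]   # one word changed
double_error = [0x2223, 0x0000]   # two changes that cancel in the sum

detected_single = checksum(single_error) != checksum(original)
detected_double = checksum(double_error) != checksum(original)
```

The single error changes the signature and is detected. The double error produces the same signature as the original interval, so the watchdog would not notice it.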

Slide 18: Watchdog: Mem access and assertion checks
• What to do about memory/data errors
  – Use ECC
  – A few other methods using a watchdog
    • Check for nonexistent memory addresses
    • Check for out of range addresses
    • Capability based checking for objects is also possible
    • Assertion based checking and sanity checks using a watchdog (independent hardware) are also possible
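The address checks can be sketched as a lookup against a memory map. The map below (one ROM region and one RAM region), the `Region` type, and the returned strings are hypothetical.

```python
from collections import namedtuple

# Hypothetical memory map: inclusive address ranges and write permission.
Region = namedtuple("Region", ["start", "end", "writable"])

MEMORY_MAP = [
    Region(0x0000, 0x0FFF, writable=False),  # ROM
    Region(0x8000, 0x8FFF, writable=True),   # RAM
]


def check_access(address, is_write, memory_map=MEMORY_MAP):
    """Return None for a legal access, otherwise a short error description."""
    for region in memory_map:
        if region.start <= address <= region.end:
            if is_write and not region.writable:
                return "write to protected area"
            return None
    return "nonexistent address"


ok_read = check_access(0x0010, is_write=False)
rom_write = check_access(0x0010, is_write=True)
bad_address = check_access(0x4000, is_write=False)
```

A watchdog observing the address and control lines could apply such a check to every bus cycle without involving the processor.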

Slide 19: Re-execution for fault-tolerance
• Key concept
  – Execute a program/instruction twice (or more times) and then compare the results.
  – A time redundancy technique, but if multiple hardware platforms are available, it is a hardware redundancy technique
  – Can detect transient faults. But it can also be employed to detect some permanent faults (see RESO next) even if the same hardware is used.

Slide 20: Re-execution: Basic Techniques
• RESO concept
  – Re-execution of an instruction with shifted operands
    • Already discussed early in the course
    • Can detect transient faults
    • Can also detect many permanent faults
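The RESO idea can be sketched for addition. The fault model (one result bit of the adder stuck at 0) and the function names are assumptions for illustration; the ALU is assumed to be at least one bit wider than the operands so that the shifted sum fits.

```python
def alu_add(a, b, stuck_at_zero_bit=None):
    """Adder model; optionally one result bit is permanently stuck at 0."""
    result = a + b
    if stuck_at_zero_bit is not None:
        result &= ~(1 << stuck_at_zero_bit)
    return result


def reso_add(a, b, stuck_at_zero_bit=None):
    """Add twice on the same ALU: once normally, once with shifted operands.

    Returns the first result and whether the two runs agree.
    """
    first = alu_add(a, b, stuck_at_zero_bit)
    # Second pass: shift operands left by one, add, shift the result back.
    second = alu_add(a << 1, b << 1, stuck_at_zero_bit) >> 1
    return first, first == second


fault_free = reso_add(3, 1)
faulty = reso_add(3, 1, stuck_at_zero_bit=2)
```

Because the shifted pass uses different bit positions of the adder, the stuck bit corrupts the two passes differently and the comparison fails, even though both passes ran on the same faulty hardware.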

Slide 21: Re-execution: Basic Techniques (contd.)
• Program Re-execution
  – Make two copies of the program
    • Execute them serially
      – Can use RESO if the hardware platform is the same for both executions
    • Execute them in parallel if sufficient hardware redundancy is available
  – May take twice as long or twice the hardware
  – When/how to compare: impacts the system complexity
  – Performance impact
    • Serial computation: high latency
    • Parallel computation: complex implementation, and hence possible loss of performance
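Serial program re-execution with a final comparison can be sketched as below. The helper `make_faulty`, which corrupts the result of one chosen run to imitate a transient fault, is purely hypothetical, and the program is assumed to return an integer.

```python
def run_with_reexecution(program, inputs, copies=2):
    """Run the program `copies` times serially and compare the results."""
    results = [program(inputs) for _ in range(copies)]
    agree = all(result == results[0] for result in results)
    return results[0], agree


def make_faulty(program, faulty_call):
    """Wrap a program so that its result is corrupted on one chosen call."""
    calls = {"count": 0}

    def wrapped(inputs):
        calls["count"] += 1
        result = program(inputs)
        if calls["count"] == faulty_call:
            result ^= 1  # flip the lowest bit: a simulated transient fault
        return result

    return wrapped


clean = run_with_reexecution(sum, [1, 2, 3])
transient = run_with_reexecution(make_faulty(sum, faulty_call=2), [1, 2, 3])
```

Comparing only at the end keeps the mechanism simple but doubles the latency, which is the serial trade-off listed on the slide. With two copies a mismatch indicates an error but does not say which run was wrong.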

Slide 22: Re-execution: Basic Techniques (contd.)
• Instruction Re-execution: fine grain parallelism
  – Re-execute every instruction on the same or different hardware, depending upon the redundancy available
    • May use RESO if the same hardware is used for instruction re-execution
  – If sufficient resources are available, this method may have little impact on the performance

Slide 23: Re-execution: Case studies
• Introduction to case studies
  – CRAY
    • Instruction re-execution
  – SMT architecture
    • Two copies of the program are interleaved as two threads for simultaneous execution
  – Multiscalar architecture
    • Two copies of the program are executed on many processing elements simultaneously
  – Chip multiprocessor
    • With critical value forwarding (DSN-2010)

Slide 24: Re-execution: Case studies (contd.)
• CRAY
  – Instruction re-execution
    • Duplication of instruction in hardware
    • Sufficient resources and pipelining available for re-execution without doubling the execution time
    • Consider a generic fine grain parallel architecture (OH)
    • Consider executing a code segment (OH)
    • Now look at ways of duplicating instructions and executing original and duplicated instructions (OH)
    • Some experimental results

Slide 25: Re-execution: Case studies (contd.)
• AR-SMT
  – High level view of the technique (OH)
    • Concept of execution (Active) streams
    • Re-execution of the instruction stream: Redundant stream
  – Issue of delay buffer length and latency
  – Implementation issues and coverage
  – Performance impact
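The role of the delay buffer can be illustrated with a much simplified model, which is not the AR-SMT design itself. It assumes that the active stream deposits each result in a bounded buffer and that the redundant stream later re-executes the same instruction and compares; all names and the one-instruction-per-step timing are hypothetical.

```python
from collections import deque


def run_streams(instructions, execute_a, execute_r, buffer_len):
    """Simplified active/redundant stream model with a bounded delay buffer.

    buffer_len must be at least 1. Returns the indices of mismatching
    instructions and the largest number of unchecked results observed.
    """
    delay_buffer = deque()
    errors = []
    max_lag = 0
    next_a = 0      # next instruction for the active stream
    checked = 0     # number of instructions verified by the redundant stream
    while checked < len(instructions):
        # The active stream runs ahead while the delay buffer has room.
        while next_a < len(instructions) and len(delay_buffer) < buffer_len:
            delay_buffer.append((next_a, execute_a(instructions[next_a])))
            next_a += 1
        max_lag = max(max_lag, len(delay_buffer))
        # The redundant stream re-executes the oldest buffered instruction.
        index, a_result = delay_buffer.popleft()
        if execute_r(instructions[index]) != a_result:
            errors.append(index)
        checked += 1
    return errors, max_lag


def square(x):
    return x * x


def faulty_square(x):
    # Simulated fault in the active stream on one instruction only.
    return x * x + (1 if x == 4 else 0)


errors, max_lag = run_streams(list(range(6)), faulty_square, square, buffer_len=3)
```

In this model the buffer length bounds how far the active stream can run ahead, and therefore also how many instructions can pass before an error is detected. A longer buffer lets the active stream run further ahead but increases the detection latency.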

Slide 26: Re-execution: Case studies (contd.)
• Multiscalar
  – Concept of control flow graph (OH)
  – Basic architecture (OH)
  – Static division of PUs and performance impact (OH)
  – Dynamic division of PUs and performance impact (OH)

Slide 27: Re-execution: Case studies (contd.)
• Chip Multiprocessor (See slide set)
  – Intro
  – Design overview and concept
  – Evaluation
  – Conclusion

Slide 28: Watchdog and Re-execution: Comments
• Concepts discussed here can be used to design high performance processors
  – Performance improvement via speculation
    • Have a very high performance speculative processor
    • Verify the control flow using a watchdog, or use a second processor to fully verify the stream executed by the speculative processor
    • This will lead to a processor with high performance (throughput), albeit with high latency

Slide 29: Summary
• Watchdog
  – Timer
  – Processor
  – Control flow checking
• Re-execution
  – Basic techniques
  – Case studies: CRAY, AR-SMT, Multiscalar