Simulation Fault-Injection & Software Fault-Tolerance

Simulation Fault-Injection & Software Fault-Tolerance Ed Carlisle

Outline
- Background
  - Radiation Effects
  - Fault Injection
  - Fault Tolerance
- Simulation Fault-Injection
  - Methodology
  - Results
  - Related Research
- Process-Level Redundancy
  - Architecture
  - Maintaining Transparency
  - Results & Overhead
- Conclusions

Radiation Effects
- Transient faults (or soft errors)
  - Occur when a particle strikes a device and deposits or removes energy, inverting the state of a transistor
  - Usually observed as a bit-flip
- To study these effects in the lab, some form of fault injection can be used
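The bit-flip model above can be illustrated with a few lines of Python. This is only a sketch: the helper name `flip_bit` and the 32-bit word width are assumptions made for the example, not something taken from the slides.

```python
def flip_bit(word: int, bit: int, width: int = 32) -> int:
    """Return `word` with exactly one bit inverted (a single-bit upset)."""
    if not 0 <= bit < width:
        raise ValueError("bit index outside the word")
    mask = (1 << width) - 1          # keep the result inside `width` bits
    return (word ^ (1 << bit)) & mask


original = 0x000000FF
corrupted = flip_bit(original, 31)   # upset in the most significant bit
print(hex(original), "->", hex(corrupted))
```

Flipping the same bit twice restores the original value, which is why a single XOR mask is enough to model the upset.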

Hardware Fault-Injection
- Using a radiation beam or electromagnetic interference
  - Similar to what a device would experience in a harsh environment
- Using probes to introduce voltage or current changes
- Advantage
  - Closely resembles real-world effects on the device
- Disadvantages
  - Possible to damage the device under test
  - Device under test must be modified to perform injection

Software Fault-Injection
- Compile-time injection
  - Corrupts an application's instructions during compilation
- Runtime injection
  - Uses a trigger mechanism to inject faults during execution
- Faults can be targeted at any software-visible component
- Advantage
  - Device under test does not need to be modified
- Disadvantage
  - Possible to disturb the processing workload in unintended ways

Simulation Fault-Injection
- Fault injection can be performed in a simulation of the system
- Advantages
  - Injections are transparent to the target system
  - Simulation offers the greatest amount of controllability and observability
- Disadvantages
  - Building a simulation of the target device is not a trivial task
  - Faults in the physical system may not manifest in simulation

Fault Tolerance
- Usually involves some form of redundancy
- Hardware Fault-Tolerance
  - Memory and caches can be protected with ECC or parity
  - TMR (Triple Modular Redundancy) is one of the most common forms of hardware fault tolerance
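As a rough illustration of TMR, the sketch below votes bit by bit over three replica outputs. The function name `tmr_vote` and the sample values are assumptions made for the example; real TMR is implemented in hardware, not in Python.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three replica outputs.

    Each output bit takes the value held by at least two of the replicas,
    so a fault confined to one replica is outvoted.
    """
    return (a & b) | (a & c) | (b & c)


good = 0b10110010
faulty = good ^ (1 << 4)              # one replica suffers a single bit-flip
print(bin(tmr_vote(good, faulty, good)))
```

The voter masks any fault confined to a single replica; two replicas failing in the same bit position would defeat it.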

Fault Tolerance
- Hardware Fault-Tolerance (cont'd)
  - Hardware devices can also be fabricated using processes that are less susceptible to radiation effects
  - Radiation hardening a device can be prohibitively expensive and time consuming
  - RadHard devices are generations behind their COTS counterparts in performance and power consumption
- Software Fault-Tolerance
  - Very cost-effective compared to hardware fault tolerance
  - Does not require any modification to the device architecture
  - Leverages high-performance, low-power commercial off-the-shelf (COTS) components

Questions?

Characterizing the Effects of Transient Faults on a High Performance Processor Pipeline
Nicholas J. Wang, Justin Quek, Todd M. Rafacz, Sanjay J. Patel
University of Illinois at Urbana-Champaign
International Conference on Dependable Systems and Networks, 2004

Overview
- Created a detailed Verilog model of a microprocessor architecture similar in complexity to the Alpha 21264 or AMD Athlon
- Created a methodology for performing fault injection on a detailed latch-level simulation of a complex processor
- Studied the propagation and/or masking of faults from the micro-architectural level to the architectural level

Verilog Processor Model Features
- Alpha ISA subset
- Speculative instruction scheduling
- Memory dependence prediction
- Sophisticated branch prediction
- Up to 132 instructions can occupy the 12-stage pipeline

Fault-Injection Methodology
- A time at which to inject the fault is first selected
  - Randomly selected from 250-300 start points
- The bit to corrupt is then randomly selected
  - Injected faults are a single bit-flip of a state element
- The trial is monitored for up to 10,000 cycles
  - At each cycle, architectural state is verified against a non-injected golden execution
- Trials are placed into four categories depending on the outcome
- Each experiment consists of 25,000-30,000 trials
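The sampling and golden-comparison loop described above can be sketched on a toy machine. Everything below is an assumption for illustration: the 16-bit state, the `step` function, and the two outcome labels stand in for the latch-level Verilog model and the four categories used in the paper.

```python
import random

MAX_CYCLES = 10_000   # monitoring window quoted on the slide
STATE_BITS = 16       # toy machine: bits 0-7 counter, bits 8-15 scratch latch


def step(state):
    """Advance the toy machine by one cycle.

    The scratch latch is overwritten every cycle, so a fault there is masked;
    a fault in the counter persists.
    """
    counter, _scratch = state
    return ((counter + 1) & 0xFF, counter)


def run_trial(start_state, bit, cycles=MAX_CYCLES):
    """Flip one state bit, then compare against the golden run cycle by cycle."""
    counter, scratch = start_state
    if bit < 8:
        counter ^= 1 << bit
    else:
        scratch ^= 1 << (bit - 8)
    golden, faulty = start_state, (counter, scratch)
    for cycle in range(1, cycles + 1):
        golden, faulty = step(golden), step(faulty)
        if golden == faulty:
            return "masked", cycle
    return "diverged", cycles


def campaign(start_states, trials, seed=0, cycles=MAX_CYCLES):
    """Pick a random start point and a random bit for every trial."""
    rng = random.Random(seed)
    tally = {"masked": 0, "diverged": 0}
    for _ in range(trials):
        outcome, _ = run_trial(rng.choice(start_states),
                               rng.randrange(STATE_BITS), cycles)
        tally[outcome] += 1
    return tally


print(campaign([(0, 0), (17, 16), (200, 199)], trials=200, cycles=1000))
```

Seeding the random generator makes a campaign repeatable, which helps when a single interesting trial has to be replayed.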

Trial Outcome Categories
- Micro-architectural state match
  - Occurs when every bit of state in the machine is equivalent to a non-fault-injected simulation
- Termination
  - Premature termination of the workload (execution error)
- Silent data corruption
  - Trials that result in software-visible register or memory corruption (data error)
- Gray area
  - Trials that result in neither a failure (termination or silent data corruption) nor a micro-architectural state match
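The four categories can be expressed as a small classifier. This is a sketch only: the three boolean inputs and the order in which they are checked are assumptions, since the slides do not describe how the original tooling made the decision.

```python
from enum import Enum


class Outcome(Enum):
    UARCH_MATCH = "micro-architectural state match"
    TERMINATION = "termination"
    SDC = "silent data corruption"
    GRAY_AREA = "gray area"


def classify(terminated: bool, arch_state_corrupted: bool,
             uarch_state_matches: bool) -> Outcome:
    """Map the observations from one trial onto the slide's categories."""
    if terminated:                    # workload ended prematurely
        return Outcome.TERMINATION
    if arch_state_corrupted:          # software-visible registers or memory differ
        return Outcome.SDC
    if uarch_state_matches:           # every state bit equals the golden run
        return Outcome.UARCH_MATCH
    return Outcome.GRAY_AREA          # no failure yet, but state still differs


print(classify(False, False, False).value)
```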

Results

Results
- This chart shows which types of state (relative to their contribution to overall state) contribute to silent data corruption and terminated results
- Register file corruption is the leading cause of silent data corruption (data errors) and terminated (execution errors) outcomes

Results
- Although noise is present in the graph, a correlation between processor utilization and benign fault rate can be seen
- As the number of valid instructions (those that will commit results) in the pipeline decreases, the benign fault rate increases
  - Benign faults do not affect program correctness

Shortfalls
- Some instructions of the Alpha ISA were not implemented in the processor model
- The 10,000-cycle limit for monitoring is quite low
  - Certainly not enough time for most benchmarks to complete
- Certain components were ignored for fault injection
  - These include caches and prediction structures
- Corrupted registers were considered application failures
  - However, I have observed in my research that the majority of faults targeted at registers do not affect program execution or output
  - In my research I use the Simics cycle-accurate system simulation environment to perform fault injections into the register file of the Freescale P2020 dual-core PowerPC-based processor

Simics Fault-Injection Workflow
- Checkpoint creation
  - Create Simics script to load the initial checkpoint
  - Calculate cycles required for execution
  - Create checkpoints and exit Simics
- Fault injection
  - Select a checkpoint for injection and inject the fault
  - Create Simics script to load and execute the injected checkpoint
  - Run the Simics script
  - Monitor console output to determine the outcome
  - Log results and exit Simics
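The planning side of this workflow (where to place checkpoints and which register bit to flip) can be sketched without touching Simics itself. The code below is hypothetical throughout: it does not call any Simics API, and the even checkpoint spacing, the function names, and the defaults of 32 registers by 32 bits are assumptions chosen for the example.

```python
import random


def checkpoint_cycles(total_cycles: int, count: int) -> list:
    """Spread `count` checkpoints evenly over the measured execution length."""
    stride = total_cycles // count
    return [i * stride for i in range(count)]


def plan_injections(total_cycles, checkpoints, trials,
                    num_regs=32, reg_bits=32, seed=0):
    """Choose a checkpoint, a register and a bit for every trial."""
    rng = random.Random(seed)
    cycles = checkpoint_cycles(total_cycles, checkpoints)
    return [
        {
            "cycle": rng.choice(cycles),           # checkpoint to load
            "register": rng.randrange(num_regs),   # general purpose register index
            "bit": rng.randrange(reg_bits),        # bit to invert in that register
        }
        for _ in range(trials)
    ]


for trial in plan_injections(total_cycles=1_000_000, checkpoints=10, trials=3):
    print(trial)
```

Each planned entry would then be turned into a simulator script by whatever tooling drives the actual injection, which is outside the scope of this sketch.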

Simics Simulation Fault-Injection Results
- The Simics simulation does not have the level of detail needed to perform fault injection at the micro-architectural level, but does allow for register file fault-injection
- The chart below shows results obtained when injecting single-bit faults into each of the general purpose registers during a matrix multiplication application

Questions?

PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures
Alex Shye, Joseph Blomstedt, Tipp Moseley, Vijay Janapa Reddi, Daniel A. Connors
IEEE Transactions on Dependable and Secure Computing, April-June 2009

Process-Level Redundancy
- Similar to the TMR hardware fault-tolerance scheme
  - Creates a set of redundant processes for an application and compares their outputs to ensure correct execution
- Leverages multiple processing cores by allowing the operating system to schedule redundant processes to available cores
- Biggest challenge is maintaining determinism
- Transparency can be achieved by maintaining user-expected process semantics
- Does not require any modifications to the target application, operating system, or device architecture
  - Important for legacy binaries whose source is no longer available

Sphere of Replication
- Specifies the boundary for fault detection and containment
  - Data entering the SoR is replicated
  - All execution within the SoR is redundant
  - Any data leaving the SoR is compared to check for faults
  - Any execution outside the SoR is not protected
- A typical hardware-centric SoR is shown on the left
- PLR's software-centric SoR is shown on the right
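The replicate-on-entry, compare-on-exit rule can be shown with ordinary functions standing in for redundant execution. This is a sketch under stated assumptions: `run_in_sphere`, `checksum`, and `faulty_checksum` are names invented for the example, and a deliberately wrong function plays the role of a replica hit by a fault.

```python
import copy


def run_in_sphere(replicas, data):
    """Run every replica on its own copy of the input and compare the outputs."""
    # Data entering the sphere is replicated, so no replica shares input state.
    outputs = [replica(copy.deepcopy(data)) for replica in replicas]
    # Data leaving the sphere is compared; any disagreement signals a fault.
    fault_detected = any(out != outputs[0] for out in outputs[1:])
    return outputs, fault_detected


def checksum(values):
    return sum(values) & 0xFFFF


def faulty_checksum(values):
    return checksum(values) ^ 0x0004   # models a bit-flip inside one replica


print(run_in_sphere([checksum, checksum, checksum], [1, 2, 3]))
print(run_in_sphere([checksum, faulty_checksum, checksum], [1, 2, 3]))
```

Comparison alone only detects the fault; deciding which replica is wrong needs a vote.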

PLR Components
- Maintains semantics
  - Figurehead process
  - Monitor process
- Redundant processes
  - Master process
  - Slave processes
- System call emulation
  - Maintains determinism
  - Responsible for fault detection and recovery

Maintaining Process Semantics
- Example semantics:
  - Each application is assigned a process identifier (PID), which exists throughout execution and is returned to the operating system after completion
  - When an application exits, it returns the correct exit code
  - A signal that is sent to a valid PID will have the intended effect (e.g. SIGKILL will kill the process)
- Figurehead process
  - The original process becomes the figurehead process after the redundant processes are created
  - Does not perform any real work

Maintaining Process Semantics
- Figurehead process (cont'd)
  - Sleeps and waits for the redundant processes to complete
  - Receives the application exit value and exits correctly
  - Responsible for forwarding incoming signals to all redundant processes
- Monitor process
  - Certain signals are not easily forwarded
  - A SIGKILL signal would kill the figurehead process, but leave behind all redundant processes
  - The monitor process polls the state of the figurehead process
  - If the figurehead is killed or stopped, the monitor process will kill or stop the redundant processes

Maintaining Determinism & Transparency
- System call emulation unit
  - Responsible for input replication, output comparison, and system call emulation
  - Responsible for ensuring that redundant processes interacting with the system appear as if only the original process is executing
  - System calls that return nondeterministic data (such as the system time) must be emulated to ensure all processes use the same data
- Master vs. slave processes
  - System calls that modify any system state are only executed by the master process
  - Other system calls are performed once for the master process and replicated for the slave processes
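The two rules above (perform a nondeterministic call once and share the result; let only the master change system state) can be sketched with plain callables in place of real system calls. The function names are assumptions for the example, and the sketch does not attempt to show how PLR intercepts system calls.

```python
import time


def emulate_nondeterministic(call, num_processes):
    """Perform the call once and hand the same value to every redundant process."""
    value = call()
    return [value] * num_processes


def emulate_state_change(requests, perform):
    """Compare the replicas' requests, then let only the master perform the call."""
    if any(req != requests[0] for req in requests[1:]):
        raise RuntimeError("output mismatch detected before the write")
    result = perform(requests[0])           # executed exactly once
    return [result] * len(requests)         # every replica sees the same return value


times = emulate_nondeterministic(time.time, 3)

log = []


def append_to_log(data):
    """Stand-in for a write that modifies system state."""
    log.append(data)
    return len(data)


results = emulate_state_change([b"ok\n"] * 3, append_to_log)
print(times[0] == times[1] == times[2], log, results)
```

Because every replica receives the same timestamp and the same return value, their subsequent execution stays identical unless a fault intervenes.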

Fault Detection
- The system call emulation unit is responsible for providing fault detection and recovery
- A fault causing the application to hang can be detected by a watchdog timer attached to the emulation unit
  - The timer begins when a process enters the unit
  - If the remaining processes do not enter the unit within a specified amount of time, an execution error is signaled
- Faults causing control-flow errors can also be detected if all processes do not request the same system call when entering the emulation unit
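Both detection mechanisms can be sketched as a check over the replicas' arrivals at the emulation unit. The data layout (a list of `(arrival_time, syscall_name)` tuples, with `None` for a process that never arrived) and the returned labels are assumptions made for the example.

```python
def detect_fault(arrivals, timeout):
    """Classify one synchronization point.

    `arrivals` holds one entry per redundant process: a tuple of
    (arrival_time_in_seconds, syscall_name), or None if the process
    never entered the emulation unit.
    """
    entered = [a for a in arrivals if a is not None]
    if not entered:
        return "ok"                      # nobody has arrived, timer not started
    first = min(t for t, _ in entered)   # watchdog starts at the first arrival
    late = any(t - first > timeout for t, _ in entered)
    if len(entered) < len(arrivals) or late:
        return "watchdog timeout"
    if len({name for _, name in entered}) > 1:
        return "system call mismatch"    # replicas took different control flow
    return "ok"


print(detect_fault([(0.00, "write"), (0.01, "write"), (0.02, "write")], 1.0))
print(detect_fault([(0.00, "write"), (0.01, "write"), None], 1.0))
print(detect_fault([(0.00, "write"), (0.01, "read"), (0.02, "write")], 1.0))
```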

Fault Recovery
- If an output mismatch occurs, a majority vote can be used to kill the process producing incorrect data
  - The bad process is then replaced by forking a correct process
- A watchdog timeout can occur in two cases
  - If a faulty process calls the emulation unit while the other processes are executing, it is killed and replaced by forking a correct process at the next system call
  - If a faulty process hangs while the other processes are waiting in the emulation unit, it is killed and replaced by a correct process
- If a process fails, it is simply replaced by duplicating one of the remaining processes
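The majority vote and replacement step can be sketched over the replicas' output buffers. The function name and return shape are assumptions for the example; copying the winning output stands in for forking a correct process, which the sketch does not perform.

```python
from collections import Counter


def vote_and_recover(outputs):
    """Return (agreed_output, indices_of_faulty_replicas, repaired_outputs)."""
    winner, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(outputs) // 2:
        # Without a strict majority the fault is detected but cannot be repaired.
        raise RuntimeError("mismatch detected, no majority to recover from")
    faulty = [i for i, out in enumerate(outputs) if out != winner]
    # Stand-in for killing each faulty process and forking a correct one.
    repaired = [winner] * len(outputs)
    return winner, faulty, repaired


print(vote_and_recover([b"42\n", b"42\n", b"43\n"]))
```

With only two replicas a mismatch leaves no majority, so this sketch can detect the fault but not recover from it; three replicas are the minimum for voting.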

Results
- PLR eliminates all failed, abort, and incorrect cases
  - Output comparison converts abort and incorrect cases to mismatches
  - PLR detects failed cases, converting them into sighandler cases
- A small number of failed cases are detected as mismatches with PLR
  - The mismatch is caught before the application can fail
- Some floating-point benchmarks actually caused correct outcomes to become mismatches with PLR enabled
  - The specdiff tool included with the benchmarks uses a tolerance when checking output data, whereas PLR's output comparison checks raw data
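The difference between a tolerance-based check and a raw comparison can be reproduced with one double-precision value. This is a sketch: the helper name and the `1e-6` relative tolerance are assumptions chosen for illustration, not the settings specdiff uses.

```python
import math
import struct


def flip_lowest_mantissa_bit(x: float) -> float:
    """Invert the least significant bit of a double's IEEE 754 encoding."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ 1))
    return flipped


golden = 1.0
perturbed = flip_lowest_mantissa_bit(golden)   # tiny fault-induced deviation

# Raw comparison, as in an output comparison on bytes: any differing bit counts.
raw_match = struct.pack("<d", golden) == struct.pack("<d", perturbed)

# Tolerance-based comparison: small relative differences are accepted.
tolerant_match = math.isclose(golden, perturbed, rel_tol=1e-6)

print(raw_match, tolerant_match)
```

A fault that only disturbs low-order mantissa bits therefore passes a tolerance-based check yet is flagged as a mismatch by a byte-for-byte comparison.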

Overhead Incurred
- Figure panels: A) 2 processes, B) 3 processes, C) 2 processes optimized, D) 3 processes optimized
- Contention overhead is mainly caused by sharing memory bandwidth between redundant processes
- Emulation overhead is caused by synchronization and by transferring and comparing data in shared memory

Shortfalls
- The functionality of the system call emulation unit is described in detail, but few implementation details are provided
  - Replicating the results would be hard without more specific implementation details
- Faults occurring during PLR code or operating system execution are not protected against
- Only supports single-threaded applications
- May not function as intended if using more redundant processes than there are physical cores available
  - Timeouts assume all processes are running concurrently

Conclusions
- Simulation Fault-Injection
  - Allowed injections to target areas not accessible to software or hardware fault-injection tools
  - Showed that many faults are masked before they are even visible to software
- Process-Level Redundancy
  - Software fault-tolerance scheme
  - Similar to the triple modular redundancy hardware scheme
  - Transparent to the system and target application
  - Does not require any user intervention to apply protection
  - Able to detect all application failures and incorrect output

Questions?