DRACO Architecture Research Group. DSN, Edinburgh UK, 06.25.2007 Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance.

Presentation transcript:

Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance
Alex Shye, Tipp Moseley+, Vijay Janapa Reddi*, Joseph Blomstedt, Daniel A. Connors
University of Colorado at Boulder, ECE
+ University of Colorado at Boulder, CS
* Harvard University, EECS

Outline
- Introduction and Motivation
- Software-centric Fault Detection
- Process-Level Redundancy
- Experimental Results
- Conclusion

Transient Faults (Soft Errors)
Sources: alpha particles, neutrons, device coupling, power supply noise, etc.
Transient faults are already an issue:
- Sun Microsystems [Baumann Rel. Notes 2002]
- LANL ASC Q Supercomputer [Michalak IEEE TDMR 2005]
- …

Predicted Soft Error Rates
"The neutron SER for a latch is likely to stay constant in the future process generations…" [Karnik VLSI 2001]
[Hareland VLSI 2001]
SER = Soft Error Rate
Small SER decrease per generation

Moore's Law Continues [Source:
Transient faults will be a significant issue in the design and execution of future microprocessors.

Background
One categorization [Mukherjee HPCA 2005]:
I. Benign Fault
II. Detected Unrecoverable Error (DUE)
   - False DUE: detected fault would not have altered correctness
   - True DUE: detected fault would have altered correctness
III. Silent Data Corruption (SDC)
Hardware Approaches
- Specialized redundant hardware, redundant multi-threading
Software Approaches
- Compiler solutions: instruction duplication, control flow checking
- Low-cost, flexible alternative but higher overhead
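The categorization above can be read as a small decision table. The sketch below is only an illustration: the two boolean inputs (`detected`, `alters_output`) and the names `Outcome` and `classify` are assumptions introduced here, not part of the slides or of [Mukherjee HPCA 2005].

```python
from enum import Enum


class Outcome(Enum):
    BENIGN = "benign fault"
    FALSE_DUE = "false DUE"
    TRUE_DUE = "true DUE"
    SDC = "silent data corruption"


def classify(detected: bool, alters_output: bool) -> Outcome:
    """Map a fault to one category of the slide's taxonomy.

    detected      -- the fault detection mechanism flagged the fault
    alters_output -- left alone, the fault would have changed program output
    (Both flags are hypothetical simplifications for illustration.)
    """
    if detected:
        # A detected fault is a DUE; it is "false" if it was harmless anyway.
        return Outcome.TRUE_DUE if alters_output else Outcome.FALSE_DUE
    # An undetected fault is either harmless or silently corrupts data.
    return Outcome.SDC if alters_output else Outcome.BENIGN
```

Under this reading, a detector that only looks at program output never reports a false DUE, because a fault that does not alter output is never flagged.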

Goal
Use software to leverage available hardware parallelism for low-overhead transient fault tolerance.

Sphere of Replication (SoR)
1. Input Replication
2. Redundant Execution
3. Output Comparison
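The three SoR steps can be sketched in a few lines. This is a toy model under stated assumptions: the replicas run one after another inside a single Python process instead of as separate processes, and `run_in_sphere` and its parameters are hypothetical names, not PLR interfaces.

```python
import copy


def run_in_sphere(compute, raw_input, replicas=3):
    """Run `compute` redundantly and report whether all outputs agree."""
    # 1. Input replication: every replica receives its own private copy.
    inputs = [copy.deepcopy(raw_input) for _ in range(replicas)]
    # 2. Redundant execution (sequential here; PLR uses separate processes).
    outputs = [compute(value) for value in inputs]
    # 3. Output comparison: any disagreement signals a possible fault.
    agree = all(out == outputs[0] for out in outputs[1:])
    return outputs, agree
```

For example, `run_in_sphere(sorted, [3, 1, 2])` returns three identical sorted lists and `True`. Everything inside the sphere may be corrupted and still be caught at step 3; anything outside it needs other protection.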

Software-centric Fault Detection
Most previous approaches are hardware-centric
- Even compiler approaches (e.g. EDDI, SWIFT)
A software-centric model is able to leverage the strengths of a software approach
- Correctness is defined by software output
- Ability to see the larger-scope effect of a fault
- Ignore benign faults
[Figure: hardware-centric vs. software-centric SoR. Labels: Processor, Cache, Memory, Devices, Processor SoR; Application, Libraries, Operating System, PLR SoR]

Process-Level Redundancy (PLR)
[Figure: three process replicas (App + Libs), each with a Watchdog Alarm, on top of the Sys. Call Emulation Unit and the Operating System]
Master Process
- Only process allowed to perform system I/O
Slave Processes
- Identical address space, file descriptors, etc.
- Not allowed to perform system I/O
Watchdog Alarm
- Occasionally a process will hang
System Call Emulation Unit (SCEU)
- Enforces SoR with input replication and output comparison
- System call emulation for determinism
- Detects and recovers from transient faults

Enforcing SoR
Input Replication
- All read events: read(), gettimeofday(), getrusage(), etc.
- Return value from all system calls
Output Comparison
- All write events: write(), msync(), etc.
- System call parameters
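A minimal sketch of how these two rules could combine at one system call boundary. It assumes the replicas' call parameters have already been gathered into a list, and `emulate_syscall` and `execute` are hypothetical names used only for illustration, not the SCEU's real interface.

```python
def emulate_syscall(name, args_per_replica, execute):
    """Apply output comparison, then input replication, to one call.

    name             -- system call name, used only in the error message
    args_per_replica -- one tuple of call parameters per replica
    execute          -- callable that performs the real call exactly once
    """
    # Output comparison: every replica must issue identical parameters.
    reference = args_per_replica[0]
    if any(args != reference for args in args_per_replica[1:]):
        raise RuntimeError(f"{name}: parameter mismatch across replicas")
    # The real call happens once, standing in for the master process.
    result = execute(*reference)
    # Input replication: every replica observes the same return value.
    return [result for _ in args_per_replica]
```

Replicating the return value matters even for calls such as gettimeofday(): if each replica asked the kernel separately, the replicas would diverge without any fault having occurred.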

Maintaining Determinism
Master process executes system call
Slave processes emulate it
- Ignore some: rename(), unlink()
- Execute similar/altered system call
  - Identical address space: mmap()
  - Process-specific data: open(), lseek()
Challenges we do not handle yet
- Shared memory
- Asynchronous signals
- Multi-threading

Fault Detection/Recovery
PLR supports detection of and recovery from multiple faults by increasing the number of redundant processes and scaling the majority vote logic.
Type of Error: Output Mismatch
- Detection Mechanism: detected as a mismatch of compare buffers on an output comparison
- Recovery Mechanism: use majority vote to ensure correct data exists, kill the incorrect process, and fork() to create a new one
Type of Error: Program Failure
- Detection Mechanism: system call emulation unit registers signal handlers for SIGSEGV, SIGIOT, etc.
- Recovery Mechanism: re-create the dead process by forking one of the existing processes
Type of Error: Timeout
- Detection Mechanism: watchdog alarm times out
- Recovery Mechanism: determine the missing process and fork() to create a new one
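The majority vote used for the output mismatch case can be sketched as follows. The compare buffers are modeled as plain `bytes` objects and `majority_vote` is a hypothetical helper; killing the faulty process and fork()ing a replacement are not modeled.

```python
from collections import Counter


def majority_vote(compare_buffers):
    """Return (agreed_buffer, indices_of_faulty_replicas).

    compare_buffers holds one bytes object per replica. A strict
    majority is required; otherwise no buffer can be trusted.
    """
    counts = Counter(compare_buffers)
    winner, votes = counts.most_common(1)[0]
    if votes <= len(compare_buffers) // 2:
        raise RuntimeError("no majority among replicas")
    # Replicas that disagree with the majority are the ones to replace.
    faulty = [i for i, buf in enumerate(compare_buffers) if buf != winner]
    return winner, faulty
```

With three replicas one faulty process can be outvoted; tolerating two simultaneous faults takes five replicas, which is the scaling the slide refers to.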

Windows of Vulnerability
- Fault during PLR execution
- Fault during execution of operating system

Experimental Methodology
Set of SPEC2000 benchmarks
Prototype developed with Intel Pin dynamic binary instrumentation tool
- Use Pin Probes API to intercept system calls
Register Fault Injection (SPEC2000 test inputs)
- 1000 random test cases per benchmark generated from an instruction profile
- Test case: a specific bit in a source/dest register in a particular instruction invocation
- Insert fault with Pin IARG_RETURN_REGS instruction instrumentation
- specdiff in SPEC2000 harness determines output correctness
PLR Performance (SPEC2000 ref inputs)
- 4-way SMP, 3.00 GHz Intel Xeon MP, 4096 KB L3 cache, 6 GB memory
- Red Hat Enterprise Linux AS release 4
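The register fault model described above (one bit, one register, one dynamic instruction invocation) can be illustrated without Pin. This is a sketch under assumptions: the 64-bit default width, the profile format (instruction id mapped to invocation count), the weighting by invocation count, and the names `flip_bit` and `make_test_cases` are all hypothetical; the slides do not specify how test cases were drawn from the profile.

```python
import random


def flip_bit(value, bit, width=64):
    """Return `value` with one bit inverted, truncated to `width` bits."""
    if not 0 <= bit < width:
        raise ValueError("bit index outside register width")
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def make_test_cases(profile, count, width=64, seed=0):
    """Draw `count` (instruction, invocation, bit) fault-injection cases.

    profile maps an instruction id to its number of dynamic invocations
    (assumed format). Instructions are weighted by invocation count so
    every dynamic invocation is equally likely to be chosen.
    """
    rng = random.Random(seed)  # fixed seed keeps the campaign repeatable
    instructions = sorted(profile)
    weights = [profile[insn] for insn in instructions]
    cases = []
    for _ in range(count):
        insn = rng.choices(instructions, weights=weights, k=1)[0]
        invocation = rng.randrange(profile[insn])
        bit = rng.randrange(width)
        cases.append((insn, invocation, bit))
    return cases
```

Flipping the same bit twice restores the original value, which gives a quick sanity check for an injector before running a full campaign.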

Fault Injection Results

Fault Injection Results w/ PLR

PLR Performance
As a comparison: SWIFT is .4x slowdown for detection and 2x slowdown for detection+recovery
- Contention Overhead: overhead of running multiple processes using shared resources (caches, bus, etc.)
- Emulation Overhead: overhead of PLR synchronization, shared memory transfers, etc.

Conclusion
- Presented a software-implemented transient fault tolerance technique that utilizes general-purpose hardware with multiple cores
- Differentiated between hardware-centric and software-centric fault detection models
  - Showed how a software-centric model can be effective in ignoring benign faults
- Prototype PLR system runs on a 4-way SMP machine with 16.9% overhead for detection and 41.1% overhead with recovery
Questions?

Extra Slides

Predicted Soft Error Rates
[Shivakumar DSN 2002]
[Figure labels: Logic Gates, Latches, SRAM]
"The neutron SER for a latch is likely to stay constant in the future process generations…" [Karnik VLSI 2001]
[Hareland VLSI 2001]
SER = Soft Error Rate
Small SER decrease per generation

Overhead Breakdown

Maintaining Determinism
Master process executes system call
Redundant processes emulate it
- Ignore some: rename(), unlink()
- Execute similar/altered system call
  - Identical address space: mmap()
  - Process-specific data: open(), lseek()
Challenges
- Shared memory
- Asynchronous signals
- Multi-threading
[Figure: example of handling a read() system call, with Master Process and Redundant Processes columns and a Barrier. Steps shown: write cmd line parameters and syscall type to shmem; compare syscall type and cmd line parameters; read(); write resulting file offset and read buffer to shmem; copy the read buffer from shmem; lseek() to correct file offset]
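The read() example can be approximated in a single Python process. This is a sketch, not the PLR implementation: a dict stands in for the shared memory segment, two file descriptors on one temporary file stand in for the master's and a redundant process's descriptors, and the parameter comparison and barrier steps are omitted. The function names and dict keys are hypothetical.

```python
import os
import tempfile


def master_read(fd, nbytes, shmem):
    """Master performs the real read() and publishes its effects."""
    data = os.read(fd, nbytes)
    shmem["read_buffer"] = data
    # Record where the read left the file position.
    shmem["file_offset"] = os.lseek(fd, 0, os.SEEK_CUR)
    return data


def redundant_read(fd, shmem):
    """Redundant process emulates read(): copy the data, fix the offset."""
    data = bytes(shmem["read_buffer"])
    os.lseek(fd, shmem["file_offset"], os.SEEK_SET)
    return data


def demo():
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"hello world")
        path = f.name
    master_fd = os.open(path, os.O_RDONLY)
    redundant_fd = os.open(path, os.O_RDONLY)
    try:
        shmem = {}  # stand-in for the shared memory segment
        m = master_read(master_fd, 5, shmem)
        r = redundant_read(redundant_fd, shmem)
        offsets = (os.lseek(master_fd, 0, os.SEEK_CUR),
                   os.lseek(redundant_fd, 0, os.SEEK_CUR))
        return m, r, offsets
    finally:
        os.close(master_fd)
        os.close(redundant_fd)
        os.unlink(path)
```

Calling `demo()` leaves both descriptors at the same offset with identical data, although only one real read() was issued. The lseek() step is what keeps the process-specific file position of the redundant process consistent for later calls.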