NEMESIS: A Software Approach for Computing in Presence of Soft Errors

Presentation transcript:

NEMESIS: A Software Approach for Computing in Presence of Soft Errors Moslem Didehban, Aviral Shrivastava, Sai Ram Dheeraj Lokam

Reliability is important!
Today's computer-based systems are virtually everywhere, from inside our bodies … to wearables.
Many of these applications are either safety-critical or mission-critical.
For example, cyber-physical systems (CPS) such as autonomous cars and drones are clearly safety-critical, and their failures come with severe consequences.

Soft error protection is required
Transient faults (soft errors) are a major threat to reliability.
Die size grows by ~14% to satisfy Moore's law.
Threshold voltage decreases.
Soft errors are a hundred times more frequent than hard errors.
Sources of soft errors: cosmic rays and alpha particles; noise in the power supply; electromagnetic interference; temperature, pressure, voltage, and vibrations.
Historically, soft errors were a problem for high-altitude applications. ITRS 2015 predicts that soon even ground-level applications will be at risk.
The failure rate is expected to increase: more components → more failures.
Solution: redundancy.
Hardware-level solutions: ARM Cortex-R dual lockstep processor.
Software-level solutions: time redundancy (flexible).
"The nation depends on fragile software" (Information Technology Research: Investing in Our Future, 1999)

Software-level error resilience schemes
Instruction-level soft-error-tolerant schemes fall into two groups:
Error detection. Examples: EDDI [2002], SWIFT [2005], Shoestring [2010], DRIFT [2013], SIMD-Based Soft Error Detection [16], IPAS [2016], nZDC [2016].
Majority voting. Examples: SWIFT-R [2007], selective SWIFT-R [2013], ELZAR [2016].

A Closer Look into SWIFT-R
Redundant computations produce three copies of every store operand: (val, adr), (val*, adr*), and (val**, adr**).
Before each store, SWIFT-R runs Majority-voter(val, val*, val**) and Majority-voter(adr, adr*, adr**), and only then executes store val → [adr].
Majority voter for the address, in pseudocode:
    if ((adr != adr*) || (adr != adr**) || (adr* != adr**)) {
        if (adr == adr*)         // adr** is faulty
            adr** = adr;
        else if (adr* == adr**)  // adr is faulty
            adr = adr*;
        else if (adr == adr**)   // adr* is faulty
            adr* = adr;
    }
Voting code in x86 assembly, as listed on the slide:
    movl -4(%rbp), %eax   cmpl -8(%rbp), %eax   jne .L2   cmpl -12(%rbp), %eax   movl -8(%rbp), %eax   je .L6
    .L2: jne .L4   movl %eax, -12(%rbp)   jmp .L6
    .L4: jne .L5   movl %eax, -8(%rbp)
    .L5: jne .L6   movl -12(%rbp), %eax   movl %eax, -4(%rbp)
    .L6:
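
The voter pseudocode on this slide can be tried out with a small Python sketch. This is only an illustration of triplicate-and-vote before a store, not the code SWIFT-R generates; the names `majority_vote` and `voted_store` and the dictionary used as memory are assumptions made for the example.

```python
def majority_vote(a, b, c):
    """Vote over three redundant copies.

    Returns (value, has_majority). With a single faulty copy the two
    matching copies win; if all three differ there is no majority.
    """
    if a == b or a == c:
        return a, True
    if b == c:
        return b, True
    return a, False


def voted_store(memory, vals, adrs):
    """Vote on the value and on the address, then perform one store."""
    val, val_ok = majority_vote(*vals)
    adr, adr_ok = majority_vote(*adrs)
    if not (val_ok and adr_ok):
        raise RuntimeError("no majority among the redundant copies")
    memory[adr] = val  # store val -> [adr]
    return adr, val


memory = {}  # hypothetical memory model: address -> value
# The third copy of the value (val**) was hit by a soft error.
stored = voted_store(memory, vals=(7, 7, 99), adrs=(0x10, 0x10, 0x10))
print(stored, memory)
```

Note that both voters run before the store, on the critical path, which is the cost the following slides set out to avoid.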

Limitations of SWIFT-R
Almost half of the instructions (~45%), namely memory and control-flow instructions, are unprotected.
Frequent and long voting operations introduce register-file vulnerability.
Majority-voting operations must take place before all memory and compare operations.

Experimental results
10 million random fault-injection experiments at the micro-architectural level on original and SWIFT-R-protected programs (ARM Cortex-A53).
Outcome categories:
Masked: no change in program behaviour
Recovered
Program abort / segmentation fault: recognizable change in program behaviour
Silent data corruption: wrong output (~5%) (~2.4%)
Schirmeier, Horst, Christoph Borchert, and Olaf Spinczyk. "Avoiding pitfalls in fault-injection based comparison of program susceptibility to soft errors." Dependable Systems and Networks (DSN), 2015 45th Annual IEEE/IFIP International Conference on. IEEE, 2015.

Reliability is hard to achieve! “Organization of redundancy and fault-tolerance for ultra-high reliability is a challenging problem: redundancy management can account for half the software in a flight control system and, if less than perfect can itself become the primary source of system failure.” -- John Rushby https://shemesh.larc.nasa.gov/fm/fm-ft.html

NEMESIS
Goal: protect the execution of all instructions without any vulnerable window.
Three instruction streams: M-stream, D-stream, and R-stream. The M- and D-streams are used for error detection; the R-stream is used only for error recovery.
NEMESIS checks the results of critical instructions instead of their operands.
It uses error detectors instead of voting operations.
Error handling is off the performance-critical path.
Flow: critical operation → error detector. If no error is detected, execution continues to the next critical operation. If an error is detected, the error diagnosis routine runs: a recoverable error leads to memory restoration and majority voting, a non-recoverable error leads to a restart.

Error detection on the result of store operations
Main idea: load back the result of the store and check it against the redundant version.
Rather than checking the store's register operands, NEMESIS detects errors on the result of the store.
M-stream operands: Val, Adr. D-stream operands: Val*, Adr*.
    store Val → [Adr]
    load SCR ← [Adr*]
    if (SCR != Val*) Diagnosis();
Checking the result of the store verifies the execution of the store as well as the correct computation of its operands.
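
The load-back check described on this slide can be modelled in a few lines of Python. This is a minimal sketch under stated assumptions: memory is a plain dictionary, and `DiagnosisNeeded`, `checked_store`, and `is_detected` are hypothetical names standing in for the slide's Diagnosis() call and instruction sequence.

```python
class DiagnosisNeeded(Exception):
    """Stands in for the call to Diagnosis() on the slide."""


def checked_store(memory, val, adr, val_dup, adr_dup):
    memory[adr] = val            # store Val -> [Adr]   (M-stream operands)
    scr = memory.get(adr_dup)    # load SCR <- [Adr*]   (D-stream address)
    if scr != val_dup:           # if (SCR != Val*) Diagnosis();
        raise DiagnosisNeeded(f"loaded {scr!r}, expected {val_dup!r}")


def is_detected(memory, val, adr, val_dup, adr_dup):
    """Return True if the load-back check flags the store."""
    try:
        checked_store(memory, val, adr, val_dup, adr_dup)
    except DiagnosisNeeded:
        return True
    return False


# Fault-free store: both streams agree, nothing is flagged.
fault_free = is_detected({0x20: 0}, val=5, adr=0x20, val_dup=5, adr_dup=0x20)
# Soft error in the M-stream value (6 instead of 5).
bad_value = is_detected({0x20: 0}, val=6, adr=0x20, val_dup=5, adr_dup=0x20)
# Soft error in the M-stream address (0x24 instead of 0x20).
bad_address = is_detected({0x20: 0, 0x24: 0}, val=5, adr=0x24, val_dup=5, adr_dup=0x20)
print(fault_free, bad_value, bad_address)
```

A single load and compare covers both store operands, because a wrong value or a wrong address both leave the location at Adr* holding something other than Val*.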

Challenges in checking the result of a store
First, let us define silent stores: a store is silent when the store value is already present at the store target address before the store executes.
On average, ~20% of stores are silent.
M-stream operands: Val, Adr (faulty address). D-stream operands: Val*, Adr*.
    store Val → [Adr]
    load SCR ← [Adr*]
    if (SCR != Val*) Diagnosis();
Errors in the address part of silent stores remain undetected.
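
A short Python sketch makes the silent-store problem concrete. It assumes a dictionary as memory and a hypothetical helper `store_then_check` that performs the store followed by the load-back comparison; the addresses and values are made up for the example.

```python
def store_then_check(memory, val, adr, val_dup, adr_dup):
    """Store via the M-stream operands, then load back via Adr*.

    Returns True when the check passes (no Diagnosis() call).
    """
    memory[adr] = val                      # store Val -> [Adr]
    return memory.get(adr_dup) == val_dup  # SCR == Val* ?


# Address 0x20 already holds 5, so storing 5 there would be a silent store.
memory = {0x20: 5, 0x24: 1}

# A soft error changes the M-stream address from 0x20 to 0x24.
check_passed = store_then_check(memory, val=5, adr=0x24, val_dup=5, adr_dup=0x20)

print(check_passed)   # the check passes although the store went astray
print(memory[0x24])   # the unrelated location was overwritten
```

The check passes because the intended location held the right value all along, so the load-back cannot tell whether the store reached it.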

Solving the problem of silent stores
Main idea: skip over silent stores.
Each sequence below works on the M-stream operands (Val, Adr) and the D-stream operands (Val*, Adr*) and has two parts: a silent-store check and a store-result check.
First sequence:
    load SCR ← [Adr]
    if (SCR == Val) jump L;
    store Val → [Adr]
    load SCR ← [Adr*]
    L: if (SCR != Val*) Diagnosis();
Second sequence:
    load VCR ← [Adr]
    load SCR ← [Adr*]
    if (SCR == Val)
    mov SCR, VCR
    jump L;
    store Val → [Adr]
    load VCR ← [Adr*]
    L: if (VCR != Val*) Diagnosis();
Suffers from missing-memory-update errors.

Error diagnosis and recovery on store operations
Why do we need a diagnosis routine?
Inter-stream error propagation
Unavailable memory backup
Errors altering store effective-address computations
Example:
    mov Rm, 10
    mov Rd, 10
    mov Rr, 10
    add Rm, Rm, 10
    add Rd, Rd, 10
    add Rr, Rr, 10
An error alters the destination register pointer of the first add from Rm to Rd. Result: Rm = 10, Rd = 30, Rr = 20. Three different values! Voting cannot solve the problem.
If the error is diagnosed as recoverable:
(1) Restore the state of memory
(2) Mask the effect of the error from the registers
(3) Resume the program by re-executing the store
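
The register example on this slide can be replayed in Python to see why voting fails. This is a sketch only: the `run` and `has_majority` helpers are hypothetical, and the fault is modelled by redirecting the destination of the first add from Rm to Rd, as the slide describes.

```python
from collections import Counter


def run(fault=False):
    """Execute the three mov and three add instructions from the slide."""
    regs = {}
    for reg in ("Rm", "Rd", "Rr"):
        regs[reg] = 10                  # mov Rx, 10
    for index, reg in enumerate(("Rm", "Rd", "Rr")):
        dst = reg
        if fault and index == 0:
            dst = "Rd"                  # fault: destination pointer Rm -> Rd
        regs[dst] = regs[reg] + 10      # add dst, Rx, 10
    return regs


def has_majority(regs):
    """True if at least two of the three copies agree."""
    return Counter(regs.values()).most_common(1)[0][1] >= 2


clean = run()
faulty = run(fault=True)
print(clean, has_majority(clean))
print(faulty, has_majority(faulty))
```

One fault corrupts two streams at once: Rm misses its update and Rd receives an extra one, so no two copies agree and a voter has nothing to pick.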

Error detection on the result of branch operations: simple control flow
Original code: BB0 ends with cmp r1, r2; if (cond) .BB1. Taken → BB1, not taken → BB2.
NEMESIS control-flow transformation: BB0 is unchanged, and the redundant compare is repeated on both paths.
    BB1 (taken):     cmp r1*, r2*; if (!cond) Diagnosis()
    BB2 (not taken): cmp r1*, r2*; if (cond) Diagnosis()
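
The branch transformation on this slide can be sketched in Python as follows. The function names, the `flip_outcome` switch used to inject a wrong branch direction, and the use of `operator.lt` as the branch condition are all assumptions made for the illustration.

```python
import operator


class DiagnosisNeeded(Exception):
    """Stands in for the call to Diagnosis() on the slide."""


def protected_branch(cond, r1, r2, r1_dup, r2_dup, flip_outcome=False):
    """Branch on the M-stream compare, verify with the D-stream compare."""
    taken = cond(r1, r2)               # BB0: cmp r1, r2; if (cond) .BB1
    if flip_outcome:
        taken = not taken              # injected fault: wrong branch direction
    if taken:
        # Entry of BB1: cmp r1*, r2*; if (!cond) Diagnosis()
        if not cond(r1_dup, r2_dup):
            raise DiagnosisNeeded("in BB1, but the redundant compare disagrees")
        return "BB1"
    # Entry of BB2: cmp r1*, r2*; if (cond) Diagnosis()
    if cond(r1_dup, r2_dup):
        raise DiagnosisNeeded("in BB2, but the redundant compare disagrees")
    return "BB2"


def outcome(*args, **kwargs):
    try:
        return protected_branch(*args, **kwargs)
    except DiagnosisNeeded:
        return "Diagnosis"


print(outcome(operator.lt, 1, 2, 1, 2))                     # fault-free, taken
print(outcome(operator.lt, 3, 2, 3, 2))                     # fault-free, not taken
print(outcome(operator.lt, 1, 2, 1, 2, flip_outcome=True))  # wrong direction
print(outcome(operator.lt, 5, 2, 1, 2))                     # corrupted r1
```

Because the check sits in the destination block, it validates where control actually arrived, which covers both a corrupted compare operand and a misdirected branch.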

Error detection on the result of branch operations: fan-in basic blocks
Original code: two branches target the same basic block BB1.
    cmp r1, r2; if (cond) .BB1
    cmp r1, r3; if (cond) .BB1
NEMESIS control-flow transformation: each branch is redirected to its own checking block, which then jumps to BB1.
    cmp r1, r2; if (cond) .BB11
    cmp r1, r3; if (cond) .BB12
    .BB11: cmp r1*, r2*; if (!cond) Diagnosis(); jump BB1
    .BB12: cmp r1*, r3*; if (!cond) Diagnosis(); jump BB1

Experimental setup
LLVM 3.7: NEMESIS was implemented as a late backend pass.
gem5 simulator: 5 million faults injected in various components (register file, pipeline registers, functional units (FUs), and load-store queue (LSQ)).

NEMESIS-protected programs never produce a wrong result!
Chart panels: load-store unit, register file, pipeline registers, functional units.

Performance overhead
NEMESIS-protected programs are on average around 25% faster than SWIFT-R-protected ones.
NEMESIS is faster because of:
Off-performance-critical-path error recovery
Relaxed triplication of memory read instructions
Register-hungry benchmarks, i.e., rijndael and adpcm, show a significant slowdown.

Detected but not recoverable errors

Summary
This work:
A compiler technique, named NEMESIS, for fault detection and recovery is proposed.
Safe off-critical-path error recovery
Checking the results of critical operations rather than their operands
A control-flow checking (CFC) mechanism
Future work:
Enhancing the coverage of NEMESIS to permanent errors