Software Techniques for Soft Error Resilience


Moslem Didehban
Committee Members: Aviral Shrivastava, Carole-Jean Wu, Lawrence Clark, Scott Mahlke

Resilience Against Soft Errors

MTBF, one-car perspective (1 microprocessor): 1 year to 120 years.
MTBF, Toyota perspective (10 million cars sold last year): 3 seconds to ~6 minutes.
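The fleet-level figures follow from dividing the per-device MTBF by the number of devices, assuming independent failures at a constant rate. A minimal Python sketch of that arithmetic (the function name and the 365-day year are assumptions made for illustration):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year


def fleet_mtbf_seconds(device_mtbf_years, num_devices):
    """MTBF seen across a fleet, assuming independent, constant-rate failures."""
    return device_mtbf_years * SECONDS_PER_YEAR / num_devices


fleet = 10_000_000                      # cars sold last year, from the slide
low = fleet_mtbf_seconds(1, fleet)      # about 3.15 seconds
high = fleet_mtbf_seconds(120, fleet)   # about 378 seconds, roughly 6.3 minutes
print(f"{low:.2f} s, {high / 60:.1f} min")
```

Under these assumptions the results land close to the 3 seconds and ~6 minutes quoted on the slide.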

Scope of my dissertation

Scope of research: redundancy as the main protection strategy.
Figure: redundancy schemes arranged from coarse-grained to fine-grained (process-level, thread-level, function-level, program-statement level) along axes of flexibility and detection latency. Coarse-grained example: main program, redundant program, check. Fine-grained example: add R1, R2, R3; add R2, R2, R3; check.

Publications

DAC 2016: “nZDC: A compiler technique for near Zero Silent Data Corruption.”
IEEE Transactions on Reliability 67.1 (2018): 249-263: “A Compiler Technique for Processor-Wide Protection From Soft Errors in Multithreaded Environments.”
DAC 2017: “InCheck: An in-application recovery scheme for soft errors.”
ICCAD 2017: “NEMESIS: A Software Approach for Computing in Presence of Soft Errors.”
IEEE Transactions on Dependable and Secure Computing (under review): “Generic Soft Error Data and Control Flow Error Detection by Instruction Duplication.”
DATE 2018: “Expert: Effective and flexible error protection by redundant multithreading.”
DATE 2019: “A software-level Redundant MultiThreading for Soft/Hard Error Detection and Recovery.”
ACM Transactions on Architecture and Code Optimization (to be submitted): “A Software-level Redundant MultiThreaded Scheme for Protection against Hardware Random Faults.”
DAC 2019 (to be submitted): “WholeSafe: Whole Microprocessor Soft Error Detection and Recovery.”
Dissertation

Presentation Organization

Need for new fine-grained error protection
Overview of our proposed techniques
Verilog-level fault injection results
Memory and data path protection
Core redundancy for soft and hard error protection

On the Shoulders of Giants

Original code:
    ADD R3, R1, R2
    MUL R4, R3, R5
    store 0(SP) R4

EDDI (Stanford, 2002): instruction duplication and memory duplication. Fault model: transient single bit-flip.
    ADD R3, R1, R2
    ADD R3*, R1*, R2*
    MUL R4, R3, R5
    MUL R4*, R3*, R5*
    BNE R4, R4*, Error
    store 0(SP) R4
    store offset(SP) R4*

EDDI paper: Strategies for Fault-Tolerant, Space-Based Computing: Lessons Learned from the ARGOS Testbed

On the Shoulders of Giants

Original code:
    ADD R3, R1, R2
    MUL R4, R3, R5
    store 0(SP) R4

EDDI (Stanford, 2002): instruction duplication and memory duplication. Fault model: transient single bit-flip.
    ADD R3, R1, R2
    ADD R3*, R1*, R2*
    MUL R4, R3, R5
    MUL R4*, R3*, R5*
    BNE R4, R4*, Error
    store 0(SP) R4
    store offset(SP) R4*

SWIFT (Princeton, 2005): instruction duplication with ECC-protected memory.
    ADD R3, R1, R2
    ADD R3*, R1*, R2*
    MUL R4, R3, R5
    MUL R4*, R3*, R5*
    BNE R4, R4*, Error
    BNE SP, SP*, Error
    store 0(SP) R4

Shoestring (UMich, 2010): selective duplication with ECC-protected memory.
    ADD R3, R1, R2
    ADD R3*, R1*, R2*
    MUL R4, R3, R5
    MUL R4*, R3*, R5*
    BNE R4, R4*, Error
    store 0(SP) R4

Figure axis: performance.
Pre-store error detection leaves store operations unprotected.

EDDI paper: Strategies for Fault-Tolerant, Space-Based Computing: Lessons Learned from the ARGOS Testbed

Our Error Detection Solution

Philosophy: error protection first.
Failure mode: SDC. “SDC occurs when incorrect data is delivered by a computing system to the user without any error being logged.”

Original code:
    ADD R3, R1, R2
    MUL R4, R3, R5
    store 0(SP) R4

nZDC (ASU, 2016): post-store data flow error detection.
    ADD R3, R1, R2
    ADD R3*, R1*, R2*
    MUL R4, R3, R5
    MUL R4*, R3*, R5*
    store 0(SP) R4
    load 0(SP*) R4
    BNE R4, R4*, Error
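The post-store check can be sketched in Python as a toy model: store through the master copies, reload through the redundant address, and compare against the redundant value. The dictionary-based memory, the `corrupt` fault hook, and all names below are assumptions made for illustration, not the nZDC implementation.

```python
class SoftErrorDetected(Exception):
    """Raised when the reloaded value disagrees with the redundant copy."""


def checked_store(memory, addr, val, addr_dup, val_dup, corrupt=None):
    """Store via the master registers, reload via the redundant address,
    and compare the reloaded value with the redundant value."""
    # `corrupt` is a hypothetical hook that models a fault on the store path.
    written = corrupt(val) if corrupt else val
    memory[addr] = written            # store 0(SP) R4
    reloaded = memory.get(addr_dup)   # load 0(SP*) R4
    if reloaded != val_dup:           # BNE R4, R4*, Error
        raise SoftErrorDetected(f"expected {val_dup}, reloaded {reloaded}")


memory = {}
checked_store(memory, 0x100, 42, 0x100, 42)  # fault-free store passes

try:
    # A bit flip in the stored data is caught by the reload-and-compare.
    checked_store(memory, 0x104, 42, 0x104, 42, corrupt=lambda v: v ^ 0b100)
    detected = False
except SoftErrorDetected:
    detected = True
```

Because the comparison happens after the value has reached memory, a fault on the store itself is visible to the check, which a pre-store comparison cannot offer.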

Evaluation Setup

OpenRISC architecture: synthesizable Verilog code of an OR1k implementation.
Fault site: randomly generate a number < 1792, then find the corresponding component and fault site.
Fault injection time: randomly selected from the trace.
Injection: randomly pick a fault site and a cycle, and flip the value for one cycle by XORing that value with ‘1’.
No micro-benchmarking.
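The selection and flip steps can be sketched as follows. This is a simplified model that treats every fault site as a single bit and represents one cycle's state as a Python list; the helper names and the trace length are hypothetical, and the actual campaign ran on the Verilog model.

```python
import random

NUM_FAULT_SITES = 1792  # bound on the random number quoted on the slide


def pick_fault(trace_length, rng):
    """Pick a random fault site and a random injection cycle from the trace."""
    site = rng.randrange(NUM_FAULT_SITES)
    cycle = rng.randrange(trace_length)
    return site, cycle


def flip_bit(signals, site):
    """Return a copy of one cycle's signal bits with `site` XORed with 1."""
    faulty = list(signals)
    faulty[site] ^= 1
    return faulty


rng = random.Random(0)            # seeded only to make the sketch repeatable
site, cycle = pick_fault(trace_length=10_000, rng=rng)
golden = [0] * NUM_FAULT_SITES    # hypothetical fault-free signal values
faulty = flip_bit(golden, site)
changed = [i for i, (g, f) in enumerate(zip(golden, faulty)) if g != f]
```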

Fault Injection Results

10,600 fault injection (FI) experiments for each version of a program.
Reliability based on number of nines:

    Version   Coverage   Overhead
    ORG       90%        1x
    SWIFT                2.7x
    nZDC      99.9%      2.9x

No micro-benchmarking or switching from ISS to RTL.
Scaled-SDC = (# of SDCs) * (runtime overhead)
SWIFT: 3.8x SDC reduction
nZDC: 104x SDC reduction
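The scaled-SDC metric weights the SDC count by the runtime overhead, so a slower protected program is charged for its longer exposure. A small sketch of the formula follows; the SDC counts are hypothetical placeholders, not the measured values behind the reductions reported on the slide.

```python
def scaled_sdc(num_sdcs, runtime_overhead):
    """Scaled-SDC = number of SDCs * runtime overhead."""
    return num_sdcs * runtime_overhead


def sdc_reduction(baseline_scaled, protected_scaled):
    """How many times smaller the protected scaled-SDC is than the baseline."""
    return baseline_scaled / protected_scaled


# Hypothetical counts, not taken from the slides.
baseline = scaled_sdc(num_sdcs=1000, runtime_overhead=1.0)
protected = scaled_sdc(num_sdcs=10, runtime_overhead=2.9)
reduction = sdc_reduction(baseline, protected)
```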

nZDC: Branch Direction Check (1)

Post-branch direction checking.

Original code:
    .BB0:  BNE R1, R2, .BB3
    .BB2:  (fall-through path)
    .BB3:  (taken path)

Protected code:
    .BB0:  BNE R1, R2, .BB3
    .BB2:  BNE R1*, R2*, .Err    (fall-through path)
    .BB3:  BEQ R1*, R2*, .Err    (taken path)

nZDC: Branch Direction Check (2)

Original code (.BB3 is also the target of a direct jump):
            Jump .BB3
    .BB0:   BNE R1, R2, .BB3
    .BB2:   (fall-through path)
    .BB3:   (taken path)

Protected code with post-branch direction checking:
            Jump .BB3
    .BB0:   BNE R1, R2, .BB-CH
    .BB2:   BNE R1*, R2*, .Err    (fall-through path)
    .BB-CH: BEQ R1*, R2*, .Err    (taken path)
            Jump .BB3
    .BB3:
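The direction check can be modeled in a few lines of Python: the master operands decide the direction, and the successor block re-evaluates the opposite condition on the redundant operands. The `flip_direction` flag is a hypothetical fault hook and the function is a sketch of the idea rather than the compiler transformation itself.

```python
class BranchError(Exception):
    """Raised when the redundant operands contradict the branch direction."""


def protected_branch(r1, r2, r1_dup, r2_dup, flip_direction=False):
    """Return the successor block, checking the direction with the redundant copies."""
    taken = r1 != r2                  # BNE R1, R2, .BB3
    if flip_direction:                # hypothetical fault: wrong branch direction
        taken = not taken
    if taken:
        if r1_dup == r2_dup:          # BEQ R1*, R2*, .Err  (taken path)
            raise BranchError("taken, but redundant operands are equal")
        return ".BB3"
    if r1_dup != r2_dup:              # BNE R1*, R2*, .Err  (fall-through path)
        raise BranchError("fell through, but redundant operands differ")
    return ".BB2"
```

For example, `protected_branch(1, 2, 1, 2)` reaches `.BB3`, while the same call with `flip_direction=True` raises `BranchError` in the fall-through block.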

nZDC: Unexpected Jump Error Detection (1)

Equal-points-of-execution (EPoEs) are points at which the master (M) and redundant (R) registers hold the same state, e.g. R7 == R7*, in a program protected by instruction duplication.
An unwanted jump from one EPoE to another EPoE cannot be detected by instruction-duplication-based schemes.
Examples: errors hitting nPC; an opcode changing to a branch; an error affecting the address of a jump operation.
Number of undetectable unwanted jumps = N * (N - 1) / 2, where N is the number of EPoEs.
Solution: reducing the number of EPoEs decreases the chance of undetected unwanted jumps.
1) Changing the instruction scheduling.
2) Keeping two registers Ri and Rj such that Ri != Rj always holds.

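nZDC: Unexpected Jump Error Detection (2)

Figure: master (M) and redundant (R) instruction streams before a printf() call, shown for two techniques: scheduling and asymmetric signatures.

Asymmetric signatures:
    MICR += 5;
    RICR += 6;
    MICR += 1;
    RICR += 1;
    MICR += 2;
    printf()
    if (RICR != MICR) Error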
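One way to read the asymmetric-signature idea is as a way of shrinking the set of program points at which the two signature counters agree. The toy model below tracks only MICR and RICR, counts the points where they are equal, and applies the pair count given on the previous slide. The increments are hypothetical values chosen for illustration, not the ones from the slides, and the model ignores register state entirely.

```python
from itertools import accumulate


def signature_gaps(steps):
    """Running MICR - RICR at every program point; steps are (stream, increment)."""
    deltas = [inc if stream == "M" else -inc for stream, inc in steps]
    return [0] + list(accumulate(deltas))


def equal_points(steps):
    """Number of program points at which MICR == RICR."""
    return sum(1 for gap in signature_gaps(steps) if gap == 0)


def undetected_jump_pairs(steps):
    """Pairs of equal points, following the N * (N - 1) / 2 count from the slide."""
    n = equal_points(steps)
    return n * (n - 1) // 2


# Hypothetical increments chosen for illustration, not taken from the slides.
symmetric = [("M", 1), ("R", 1)] * 3
asymmetric = [("M", 5), ("R", 6), ("M", 2), ("R", 2), ("M", 2), ("R", 1)]
```

In this sketch the symmetric schedule has four equal points (six undetectable pairs), while the asymmetric one leaves only the start and the final check point equal (one pair).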

Importance of unwanted jump error detection

Reliability based on number of nines, nZDC-- versus nZDC:

    Version   Coverage   Overhead
    ORG       90%        1x
    nZDC--    99%        2.7x
    nZDC      99.9%      2.9x

nZDC Vulnerability

Sources of vulnerability:
Random memory write errors: opcode change-to-store, random write (control signals)
Silent stores
Unwanted jumps

Example code:
    .L1:
    load r1, [r3]
    load r1*, [r3*]
    add r5, r1, #10
    add r5*, r1*, #10
    bnq r5, r6, .L1
    // Do something

Silent store (memory already holds r1, so [mem] = [mem*] = r1):
    store r1 -> [mem]
    load r1 <- [mem*]
    bnq r1, r1*, Error

Error Recovery

SWIFT-R (Princeton, 2007):
    voting (addr, addr*, addr**)
    voting (val, val*, val**)
    store val [addr]

InCheck (ASU, 2017): memory checkpointing.

NEMESIS (ASU, 2017), flowchart labels: store operation, error detector, no error, diagnosis routine, memory restoration, majority voting, DUR, recoverable, store reply.
    store val [addr]
    voting (val, val*, val**)

Drawbacks noted on the slide: too much complexity because of a single memory; vulnerable against random write errors.
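The voting step that precedes the store in the SWIFT-R style sequence can be sketched as below. This is a minimal illustration of majority voting over three copies with a dictionary standing in for memory; the function and exception names are assumptions, not code from any of the cited schemes.

```python
class Unrecoverable(Exception):
    """Raised when all three copies disagree."""


def vote(a, b, c):
    """Return the value held by at least two of the three copies."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise Unrecoverable(f"no majority among {a}, {b}, {c}")


def voted_store(memory, addrs, vals):
    """Vote on the address and value copies, then perform a single store."""
    addr = vote(*addrs)               # voting (addr, addr*, addr**)
    val = vote(*vals)                 # voting (val, val*, val**)
    memory[addr] = val                # store val [addr]
    return addr, val


memory = {}
# One corrupted value copy (7 instead of 5) is outvoted by the other two.
voted_store(memory, addrs=(0x20, 0x20, 0x20), vals=(5, 7, 5))
```

Note that the vote happens before the store, so a fault that strikes the store itself after voting is not covered by this step.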

Revisiting soft error recovery solutions

Vulnerable against random write errors.
Too much complexity because of a single memory.
Code size increases drastically.
ECC cannot take care of all memory errors, e.g. MBEs and errors in cache controllers.
What if the hardware does not provide protection, or provides only parity?

WholeSafe: Instruction and Memory Triplication

    store r1 -> [r2]
    store r1* -> [r2* + offset 1]
    store r1** -> [r2** + offset 2]

Recovery challenge: the goal is delivering the correct answer, not merely getting rid of the wrong answer.
Hardware error detection mechanisms (exceptions), which used to be friends, are now the enemy: error masking in exception routines, ignoring exceptions.
Safe recovery from unwanted jumps is challenging!
    if (RICR != MICR) Error;

WholeSafe RTL FI Results (ongoing)

For ORG and SWIFT-R we assume ECC in memory and inject 2100 errors only into the microprocessor data path and register file.
For WholeSafe we inject errors (single and MBU, up to 5 bit flips) into the instruction cache and memory. We inject 3000 faults for each WholeSafe-protected program.
Chart: # scaled-SDCs.

    Version     Correct results
    ORG         81.3%
    SWIFT-R     91.3%
    WholeSafe   96.0%

What about MBEs and permanent faults?

Applying the nZDC error detection strategy to multicore systems [DATE 2018].
Figure: Core i and Core j connected through shared memory.
    (1) store data [mem]
    (2) load tmp [mem]
        if (tmp != data) Error;
81K transient faults and ~16K permanent faults: 3000 transient faults + 600 permanent faults for each version of a program.
More than 65x better error coverage than SRMT [1]!
Performance overhead of permanent nZDC is around 5x; SRMT is around 4x.

[1] Wang, Cheng, et al. "Compiler-managed software-based redundant multi-threading for transient fault detection." CGO, 2007.

FiSHER: Flexible Soft and Hard Error Resiliency

What I learned

Efficient error resilience is great only if protection is accomplished. Simple triplication and voting!
The protection package encompasses data-flow errors, wrong-direction branches and unexpected-jump errors.
User-level resilience.
Seemingly small vulnerability windows add up quickly. It is hard to achieve five-nines reliability.
Recovery is challenging. Maybe that’s why restarting is the preferable recovery strategy.

Thank you.

                 Non-Failure                    System-Visible Error           SDCs
Benchmark        ORG      SWIFT-R  WholeSafe    ORG      SWIFT-R  WholeSafe    ORG      SWIFT-R  WholeSafe
ADPCMC           80.1%    90.2%    95.1%        8.7%     6.8%     4.5%         11.24%   3.00%    0.37%
BITCOUNT         78.5%    91.7%    96.3%        9.8%     6.5%     3.2%         11.71%   1.81%    0.57%
CRC              77.7%    91.9%    95.8%        10.1%    6.3%     4.1%         12.19%   1.76%    0.03%
SHA              -        90.1%    -            -        6.2%     4.0%         -        3.62%    0.17%
STRINGSEARCH     86.24%   91.48%   95.97%       9.81%    6.00%    3.30%        3.95%    2.52%    0.73%
QSORT            85.05%   92.24%   96.93%       10.29%   6.09%    2.87%        4.67%    1.67%    0.20%
Average          81.3%    91.3%    96.0%        9.6%     -        3.7%         9.17%    2.40%    0.34%

nZDC in Multithreaded Environment: Load transformation

nZDC in Multithreaded Environment: Store transformation

InCheck: Performance overhead

Detected but not recoverable errors

Example of Nemesis memory write error detection/recovery

Check memory write instructions

Check after store (code follows the duplicable computations):
    store x2, [x1]
    cmp x1, x1*
    b.ne error
    cmp x2, x2*
  ++ Eliminates RF vulnerable intervals
  -- “store” is unprotected

SWIFT (code follows the duplicable computations):
    cmp x1, x1*
    b.ne error
    cmp x2, x2*
    store x2, [x1]
  -- RF vulnerable intervals
  -- “store” is unprotected

Checking load (code follows the duplicable computations):
    store x2, [x1]
    load x2*, [x1*]
    cmp x1, x1*
    b.ne error
    cmp x2, x2*
  ++ Address part is protected
  -- Data part is vulnerable

nZDC (code follows the duplicable computations):
    store x2, [x1]
    load x2, [x1*]
    cmp x2, x2*
    b.ne error
  ++ “store” is protected
  ++ Optimal number of checks