InCheck: An In-application Recovery Scheme for Soft Errors



InCheck: An In-application Recovery Scheme for Soft Errors
Moslem Didehban, Sai Ram Dheeraj Lokam, Aviral Shrivastava

Silent Data Corruption in Processor-wide Fault Injection
Setup: 72,000 random faults injected into an ARM Cortex-A53.
Processor components subjected to fault injection (FI): register file, load-store queue, pipeline registers, functional units.
Outcome categories:
Segmentation fault, program abort: recognizable change in program behaviour.
Masked: no change in program behaviour.
Recovered: not applicable to unprotected programs.
Silent Data Corruption (SDC): unrecognizable change in program behaviour.
The ITRS roadmap for 2015 lists muon-induced soft errors as a major reliability challenge in both near-term and long-term microprocessor designs.

Goal of InCheck: Implement a software-level protection technique that guarantees “Zero” output corruption and provides “Safe” recovery from detected soft errors.

Software techniques to detect SDCs
Flexibility: They can be optimized based on workload characteristics without having to change the hardware.
Coverage: Some software techniques provide very good SDC coverage, e.g. nZDC (DAC 2016).
Safety-critical applications have mixed criticality.

SWIFT-R Transformation

Original code:
    mov x1, #0x04
    load x2, [x1]
    add x2, x2, #0x10
    and x1, x2, #0x10
    store x2, [x1]

SWIFT-R transformed code:
    mov x1, #0x04
    mov x1*, #0x04
    mov x1**, #0x04
    majority_voting (x1, x1*, x1**)
    load x2, [x1]
    mov x2*, x2
    mov x2**, x2
    add x2, x2, #0x10
    add x2*, x2*, #0x10
    add x2**, x2**, #0x10
    and x1, x2, #0x10
    and x1*, x2*, #0x10
    and x1**, x2**, #0x10
    majority_voting (x2, x2*, x2**)
    majority_voting (x1, x1*, x1**)
    store x2, [x1]

majority_voting (R, R*, R**) expands to:
    cmp R, R*
    b.ne recover from R**
    cmp R, R**
    b.ne recover from R*
    cmp R*, R**
    b.ne recover from R
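The triplication and voting idea on this slide can be sketched in Python. This is a minimal model under stated assumptions, not the SWIFT-R implementation: the names `majority_vote` and `run_triplicated`, the dict standing in for memory, and the fault-injection argument are all hypothetical, and the vote returns the agreed value instead of branching to a recovery routine as the slide's `cmp`/`b.ne` sequence does.

```python
def majority_vote(r, r1, r2):
    """Return the value held by at least two of the three copies.

    Raises ValueError when all three copies disagree (no majority exists).
    """
    if r == r1 or r == r2:
        return r
    if r1 == r2:
        return r1
    raise ValueError("no two copies agree")


def run_triplicated(fault=None):
    """Run the slide's five-instruction example with three copies per register.

    `fault` is an optional (copy_index, xor_mask) pair applied to x2 after
    the add. `memory` is a toy dict standing in for RAM (an assumption).
    """
    memory = {0x04: 0x20}
    x1 = [0x04, 0x04, 0x04]            # mov x1 / x1* / x1**
    addr = majority_vote(*x1)          # vote before the load address is used
    x1 = [addr] * 3
    loaded = memory[addr]              # load x2, [x1]
    x2 = [loaded, loaded, loaded]      # mov x2*, x2 ; mov x2**, x2
    x2 = [v + 0x10 for v in x2]        # add on all three copies
    if fault is not None:
        idx, mask = fault
        x2[idx] ^= mask                # bit flip in one copy only
    x1 = [v & 0x10 for v in x2]        # and on all three copies
    value = majority_vote(*x2)         # vote before the store value is used
    addr = majority_vote(*x1)          # vote before the store address is used
    memory[addr] = value               # store x2, [x1]
    return memory
```

Calling `run_triplicated()` and `run_triplicated(fault=(0, 0x10))` should leave memory in the same state, because the corrupted copy is outvoted before the store.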

Original vs SWIFT-R SDC Distribution

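Pitfalls of SWIFT-R: Un-safe Recovery

Majority voting on x1, followed by the load:
    cmp x1, x1*
    b.ne recover from x1**
    cmp x1, x1**
    b.ne recover from x1*
    cmp x1*, x1**
    b.ne recover from x1
    load x2, [x1]
    mov x2*, x2
    mov x2**, x2

LET happens on x1 after the compare, so load x2, [x1] reads from a corrupted address and the loaded value is copied into x2* and x2**. x1 itself is recovered at the next majority voting, but by then all three copies of x2 agree on the wrong value.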
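The window between the vote and the load can be illustrated with a small Python model. This is a sketch under assumptions: `voted_load`, the `flip_after_vote` mask, and the two-entry memory are hypothetical, and the vote assumes at most one corrupted copy.

```python
def majority_vote(a, b, c):
    """Return the majority value, assuming at most one copy is corrupted."""
    if a == b or a == c:
        return a
    return b


def voted_load(memory, x1, flip_after_vote=0):
    """Vote on the address copies, then do the triplicated load.

    `flip_after_vote` is an XOR mask applied to the primary copy of x1
    after the vote has passed and before the load executes.
    """
    x1 = [majority_vote(*x1)] * 3      # cmp / b.ne sequence: copies agree
    x1[0] ^= flip_after_vote           # fault hits x1 after the compare
    loaded = memory[x1[0]]             # load x2, [x1] uses the unchecked copy
    x2 = [loaded, loaded, loaded]      # mov x2*, x2 ; mov x2**, x2
    x1 = [majority_vote(*x1)] * 3      # the next vote repairs x1
    return x1, x2                      # x2 copies may agree on a wrong value


memory = {0x04: 0xAA, 0x05: 0xBB}
clean_x1, clean_x2 = voted_load(memory, [0x04, 0x04, 0x04])
bad_x1, bad_x2 = voted_load(memory, [0x04, 0x04, 0x04], flip_after_vote=0x01)
```

In the faulty run, `bad_x1` is repaired to the correct address while every entry of `bad_x2` holds the value from the wrong address, so a later vote on x2 cannot detect the corruption.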

Pitfalls of SWIFT-R: Majority Voting in the Performance-critical Path

Voting before every store:
    majority_voting (x2, x2*, x2**)
    majority_voting (x1, x1*, x1**)
    store x2, [x1]

Each majority_voting (R, R*, R**) expands to:
    cmp R, R*
    b.ne recover from R**
    cmp R, R**
    b.ne recover from R*
    cmp R*, R**
    b.ne recover from R

Costs: ++ vulnerability, ++ execution time. SWIFT-R reserves 60% of the registers, which raises the issue of register spilling.

SWIFT-R vs InCheck SDC Distribution

InCheck Framework (flowchart)
Basic block .BB: computations, redundant computations, diagnosis.
Verified register preservation (*ensures no latent error).
Store: memory checkpointing.
Diagnosis: Error detected? If YES: Is recovery safe? If NO: unrecoverable error.
.BB recovery: memory restoration, registers restoration, rollback to .BB.
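The checkpoint, diagnose, and roll back loop in the flowchart can be sketched as follows. This is a toy model under assumptions, not the InCheck implementation: the class `BasicBlockRecovery`, the function `run_block`, the undo-log representation, and the `recovery_is_safe` callback are hypothetical, and the transcript does not say how InCheck actually decides whether recovery is safe.

```python
class BasicBlockRecovery:
    """Toy model of basic-block-level checkpointing and rollback."""

    def __init__(self, memory, registers):
        self.memory = memory
        self.registers = registers
        self.saved_registers = {}
        self.undo_log = []             # (address, old value) pairs

    def enter_block(self):
        """Preserve the current registers and start a fresh undo log."""
        self.saved_registers = dict(self.registers)
        self.undo_log.clear()

    def store(self, address, value):
        """Checkpoint the old memory value, then perform the store."""
        self.undo_log.append((address, self.memory.get(address)))
        self.memory[address] = value

    def rollback(self):
        """Undo stores in reverse order and restore preserved registers."""
        for address, old in reversed(self.undo_log):
            if old is None:
                self.memory.pop(address, None)
            else:
                self.memory[address] = old
        self.undo_log.clear()
        self.registers.clear()
        self.registers.update(self.saved_registers)


def run_block(state, body, recovery_is_safe, max_attempts=2):
    """Execute body(state); on a detected error, roll back and re-execute.

    `body` returns True when its diagnosis found no error.
    `recovery_is_safe` is a hypothetical callback standing in for the
    flowchart's "Is recovery safe?" decision.
    """
    for _ in range(max_attempts):
        state.enter_block()
        if body(state):                # diagnosis: no error detected
            return "ok"
        if not recovery_is_safe(state):
            return "unrecoverable"
        state.rollback()               # memory and registers restoration
    return "unrecoverable"
```

A body that reports an error on its first attempt and succeeds on the second ends with the same memory and register state as an error-free run, since the first attempt's store and register update are undone before re-execution.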

Performance evaluation: nZDC+InCheck programs run 36% faster than their SWIFT-R counterparts

Breakdown of Recoverable & UnRecoverable Errors in InCheck

Thank you!

Back up

Statistical Fault Injection
N: initial population size.
p: estimated probability of faults resulting in a failure (standard error).
e: margin of error.
t: cut-off point corresponding to the confidence level, i.e. the probability that the exact value is actually within the error interval (computed w.r.t. the Normal distribution).
*Leveugle, Régis, et al. "Statistical fault injection: quantified error and confidence." 2009 Design, Automation & Test in Europe Conference & Exhibition. IEEE, 2009.
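The transcript lists the parameters, but the formula itself did not survive extraction. The cited paper by Leveugle et al. gives the number of faults to inject as n = N / (1 + e^2 (N - 1) / (t^2 p (1 - p))); assuming that is the formula the slide showed, it can be evaluated as below. The function name `sample_size` and the default values are illustrative assumptions, not figures from the presentation.

```python
import math


def sample_size(N, p=0.5, e=0.01, t=2.576):
    """Number of faults to inject for a population of N possible faults.

    p: estimated failure probability (0.5 is the most conservative choice)
    e: margin of error
    t: Normal cut-off point for the confidence level (2.576 is about 99%)
    """
    n = N / (1 + e ** 2 * (N - 1) / (t ** 2 * p * (1 - p)))
    return math.ceil(n)            # round up to a whole number of injections
```

For a very large N the result approaches t^2 p (1 - p) / e^2, so the required number of injections stops growing with the size of the fault space.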

SWIFT vs SWIFT-R