Low-cost Program-level Detectors for Reducing Silent Data Corruptions Siva Hari †, Sarita Adve †, and Helia Naeimi ‡ † University of Illinois at Urbana-Champaign,

Presentation transcript:

Low-cost Program-level Detectors for Reducing Silent Data Corruptions Siva Hari †, Sarita Adve †, and Helia Naeimi ‡ † University of Illinois at Urbana-Champaign, ‡ Intel Corporation

Motivation
– Hardware reliability is a challenge
  – Transient (soft) errors are a major problem
– [Figure: overhead (perf., power, area) vs. reliability, comparing redundancy with symptom-based detection]
– Symptom-based detection: detect errors using symptom monitors (fatal traps, kernel panic, app abort); very high reliability at low cost
– How? Tunable reliability
– Goals:
  – Full reliability at low cost
  – Tunable reliability vs. overhead

Fault Outcomes
A transient fault (e.g., bit 4 in R2) in an application has one of three outcomes:
– Masked: the application output is unaffected
– Detection: the fault produces a symptom that is caught by symptom detectors (SWAT): fatal traps, kernel panic, etc.
– Silent Data Corruption (SDC): the output is corrupted without any symptom
How to convert SDCs to detections?

SDCs to Detections
– Add new detectors in the error propagation path?
  – SDC coverage: fraction of all SDCs converted to detections
– Will it be low-cost?
[Figure: error detectors added to the application turn a silent data corruption (SDC) into an error detection]

Key Challenges
– What to protect? SDC-causing fault sites, identified using Relyzer [ASPLOS’12]
– How to protect? Low-cost detectors
  – Where to place? Many errors propagate to few program values
  – What detectors? Tests of program-level properties
– Uncovered fault sites? Selective instruction-level duplication

Contributions
– Discovered common program properties around most SDC-causing sites
– Devised low-cost program-level detectors
  – Average SDC reduction of 84%
  – Average execution overhead of 10%
– New detectors + selective duplication = tunable resiliency at low cost
  – Found near-optimal detectors for any SDC target
  – Lower cost than pure redundancy for all SDC targets
    – E.g., 12% vs. 90% SDC reduction

Outline
– Motivation and introduction
– Categorizing and protecting SDC-causing sites
  – Loop incrementalization
  – Registers with long life
  – Application-specific behavior
– Tunable resilience vs. overhead
– Methodology
– Results
– Conclusions

Insights
Identify where to place the detectors and what detectors to use
– Placement of detectors (where)
  – Many errors propagate to few program values
    – End of loops and function calls
– Detectors (what)
  – Test program-level properties
    – E.g., comparing similar computations and checking value equality
– Fault model
  – Single bit flips in integer architectural registers
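The fault model on this slide can be illustrated with a small sketch. It assumes a 64-bit register width, and the helper names `flip_bit` and `random_fault` are hypothetical; the talk does not describe the injection tooling at this level of detail.

```python
import random

REG_BITS = 64  # assumed width of an integer architectural register


def flip_bit(value: int, bit: int, width: int = REG_BITS) -> int:
    """Return `value` with exactly one bit inverted, truncated to `width` bits."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    mask = (1 << width) - 1
    return (value ^ (1 << bit)) & mask


def random_fault(value: int, rng: random.Random, width: int = REG_BITS):
    """Pick a random bit position and return (bit, faulty_value)."""
    bit = rng.randrange(width)
    return bit, flip_bit(value, bit, width)


if __name__ == "__main__":
    rng = random.Random(0)
    bit, faulty = random_fault(0x1000, rng)
    print(f"flipped bit {bit}: {faulty:#x}")
```

Flipping the same bit twice restores the original value, which is a quick sanity check for any injector of this kind.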

Loop Incrementalization

C code:
    Array a, b;
    For (i=0 to n) {
      ...
      a[i] = b[i] + a[i]
      ...
    }

ASM code:
    rA = base addr. of a
    rB = base addr. of b
    L: load r1 ← [rA]
       ...
       load r2 ← [rB]
       ...
       store r3 → [rA]
       ...
       add rA = rA + 0x8
       add rB = rB + 0x8
       add i = i + 1
       branch (i<n) L

– SDC-hot app sites
– Where: errors from all iterations propagate here in few quantities
– Collect initial values of rA, rB, and i
– No loss in coverage: lossless
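A minimal sketch of how such a loop-exit check could work, simulated in Python rather than at the assembly level. It assumes the check compares the total change in rA, rB, and i against each other (both pointers advance by 0x8 per increment of i); the slide only states that the initial values are collected, so the exact comparison, the `run_loop` helper, and the memory model are assumptions made for illustration.

```python
STRIDE = 8  # bytes per element, matching the 0x8 increments in the ASM code


def run_loop(a, b, base_a=0x1000, base_b=0x2000, fault=None):
    """Simulate the loop; `fault` is an optional (step, register, bit) tuple.

    Returns (resulting array a, detected), where `detected` is the outcome
    of the loop-exit check.
    """
    # Assumed memory model: a flat address-to-value map; unmapped loads read 0.
    mem = {}
    for k, v in enumerate(a):
        mem[base_a + k * STRIDE] = v
    for k, v in enumerate(b):
        mem[base_b + k * STRIDE] = v

    n = len(a)
    regs = {"rA": base_a, "rB": base_b, "i": 0}
    init = dict(regs)  # collect initial values of rA, rB, and i
    step = 0
    while regs["i"] < n:
        if fault is not None and fault[0] == step:
            regs[fault[1]] ^= 1 << fault[2]  # single bit flip in one register
        r1 = mem.get(regs["rA"], 0)
        r2 = mem.get(regs["rB"], 0)
        mem[regs["rA"]] = r1 + r2
        regs["rA"] += STRIDE
        regs["rB"] += STRIDE
        regs["i"] += 1
        step += 1

    # Loop-exit check: all three registers must have advanced consistently.
    d_a = regs["rA"] - init["rA"]
    d_b = regs["rB"] - init["rB"]
    d_i = regs["i"] - init["i"]
    detected = not (d_a == d_b == STRIDE * d_i)
    return [mem[base_a + k * STRIDE] for k in range(n)], detected


if __name__ == "__main__":
    print(run_loop([1, 2, 3], [10, 20, 30]))                      # fault-free run
    print(run_loop([1, 2, 3], [10, 20, 30], fault=(1, "rA", 3)))  # corrupt rA once
```

The check runs once per loop rather than once per iteration, which is where the low overhead would come from: a corrupted pointer or induction variable in any iteration leaves the three totals inconsistent at the exit.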

Registers with Long Life
– Some long-lived registers are prone to SDCs
– For detection
  – Duplicate the register value at its definition
  – Compare its value at the end of its life
– No loss in coverage: lossless
[Figure: lifetime of R1: definition (copy), use 1, use 2, ..., use n (compare)]
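The copy-and-compare idea can be sketched as follows. The class name `ShadowedRegister` is hypothetical, and the sketch assumes the register is not legitimately rewritten during its life and that the copy itself stays fault-free (consistent with a single-fault model).

```python
class ShadowedRegister:
    """Models a long-lived register protected by a copy taken at definition."""

    def __init__(self, value: int):
        self.value = value   # the live register, read by every use
        self._copy = value   # duplicate taken at the definition

    def mismatch(self) -> bool:
        """Compare at the end of the register's life; True means an error was detected."""
        return self.value != self._copy


if __name__ == "__main__":
    r1 = ShadowedRegister(42)
    r1.value ^= 1 << 5       # a bit flip somewhere during the long lifetime
    print(r1.mismatch())
```

A single comparison at the end of the lifetime covers a flip at any point between the definition and the last use, although uses that happen before the comparison still consume the corrupted value.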

Application-Specific Behavior
[Slide figures not recoverable; surviving labels: "exp few s", Bit Reverse, Compare, Parity]

Tunable Resiliency vs. Overhead
– What if our detectors do not cover all SDC-causing sites?
  – Use selective instruction-level redundancy
– What if even our low overhead is not tolerable, but lower resiliency is?
  – Tunable resiliency vs. overhead

Identifying Near-Optimal Detectors: Naïve Approach
Example: target SDC coverage = 60%
– Sample 1 from the bag of detectors: overhead = 10%; SFI gives SDC coverage of 50%
– Sample 2: overhead = 20%; SFI gives SDC coverage of 65%
– Tedious and time-consuming

Identifying Near-Optimal Detectors: Our Approach
1. Set attributes for each detector in the bag (SDC covg. = X%, overhead = Y%), enabled by Relyzer [ASPLOS’12]
2. Dynamic programming over the bag to pick the selected detectors
   – Constraint: total SDC covg. ≥ 60%
   – Objective: minimize overhead
– Example result: overhead = 9%
– Obtained SDC coverage vs. performance trade-off curves
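The selection step can be sketched as a knapsack-style dynamic program. This is a sketch under stated assumptions, not the authors' implementation: it assumes each detector's coverage is an integer number of percentage points and that coverages add up (detectors protect disjoint fault sites). The function name `select_detectors`, the detector names, and all numbers are hypothetical.

```python
def select_detectors(detectors, target):
    """Pick detectors with minimum total overhead whose coverage sums to >= target.

    detectors: list of (name, coverage_points, overhead) tuples.
    Returns (total_overhead, [names]) or None if the target is unreachable.
    """
    INF = float("inf")
    # best[c] = (min overhead, chosen names) reaching coverage >= c, capped at target
    best = [(INF, ())] * (target + 1)
    best[0] = (0.0, ())
    for name, cov, cost in detectors:
        updated = list(best)  # work on a copy so each detector is used at most once
        for c, (cur_cost, chosen) in enumerate(best):
            if cur_cost == INF:
                continue
            reached = min(target, c + cov)
            candidate = cur_cost + cost
            if candidate < updated[reached][0]:
                updated[reached] = (candidate, chosen + (name,))
        best = updated
    total, chosen = best[target]
    return None if total == INF else (total, list(chosen))


if __name__ == "__main__":
    bag = [  # hypothetical attributes: (name, SDC coverage points, overhead %)
        ("loop_check", 30, 2.0),
        ("reg_copy", 25, 3.0),
        ("parity", 20, 6.0),
        ("dup_block", 40, 12.0),
    ]
    print(select_detectors(bag, 60))
```

Running the same program for a range of targets would trace out a coverage vs. overhead curve of the kind the slide mentions, with no fault injection needed per candidate set once the attributes are known.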

Methodology
– Six applications from SPEC 2006, Parsec, and SPLASH2
– Fault model: single bit flips in integer architectural registers at every dynamic instruction
– Ran Relyzer, obtained SDC-causing sites, and examined them manually
– Our detectors
  – Implemented in an architecture simulator
  – Overhead estimation: number of assembly instructions needed
– Selective redundancy
  – Overhead estimation: 1 extra instruction for every uncovered instruction
– Lossy detectors’ coverage
  – Statistical fault injections (10,000)
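As a rough illustration of what a 10,000-injection campaign can resolve, the sketch below estimates a detector's coverage and a 95% margin of error using the normal approximation to the binomial. The talk does not say which statistical method was used, so the approximation, the function name `coverage_estimate`, and the example counts are assumptions.

```python
import math


def coverage_estimate(detected: int, injected: int, z: float = 1.96):
    """Return (coverage, margin) for `detected` detections out of `injected` faults.

    `z` = 1.96 corresponds to a 95% confidence level under the normal approximation.
    """
    p = detected / injected
    margin = z * math.sqrt(p * (1 - p) / injected)
    return p, margin


if __name__ == "__main__":
    p, margin = coverage_estimate(8400, 10_000)  # hypothetical counts
    print(f"coverage = {p:.1%} +/- {margin:.2%}")
```

With 10,000 injections the margin stays below one percentage point for any coverage value, since p(1 - p) peaks at p = 0.5.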

Categorization of SDC-causing Sites
– Categorized >88% of SDC-causing sites
[Chart legend: Added Lossy Detectors, Added Lossless Detectors]

SDC Coverage
– 84% average SDC coverage (67% - 92%)

Execution Overhead
– 10% average overhead (0.1% - 18%)

SDC Coverage vs. Overhead Curve
– Consistently better than pure (selective) instruction-level duplication
[Chart: our detectors + redundancy vs. pure redundancy; surviving labels: 18%, 90%, 24%, 99%]

Conclusions
– Reduction in SDCs is crucial for low-cost reliability
– Discovered common program properties around most SDC-causing sites
– Devised low-cost program-level detectors
  – 84% avg. SDC coverage at 10% avg. cost
– New detectors + selective duplication = tunable resiliency at low cost
  – Found near-optimal detectors for any SDC target
  – Lower cost than pure redundancy for all SDC targets
– Future directions
  – More applications and fault models
  – Automating detectors’ placement and derivation