Eliminating Silent Data Corruptions caused by Soft-Errors
Siva Hari, Sarita Adve, Helia Naeimi, Pradeep Ramachandran, University of Illinois at Urbana-Champaign,

Presentation transcript:


Technology Scaling and Reliability Challenges (slide 2)
[Figure: increase (X) versus technology node in nanometers, with a region marked "Our Focus".]
*Source: Inter-Agency Workshop on HPC Resilience at Extreme Scale hosted by NSA Advanced Computing Systems, DOE/SC, and DOE/NNSA, Feb 2012

Motivation (slide 3)
[Figure: reliability versus overhead (perf., power, area), showing Redundancy and the goal of high reliability at low cost.]

SWAT: SoftWare Anomaly Treatment (slide 4)
- Need to handle only hardware faults that propagate to software
- Fault-free case remains common, must be optimized
=> Watch for software anomalies (symptoms): Fatal Traps, Kernel Panic, Hangs, App Abort, Out of Bounds
  - Zero to low overhead "always-on" monitors
- Effective on SPEC, Server, and Media workloads
- <1% of µarch faults escape detectors and corrupt app output (SDC)
- BUT, the Silent Data Corruption rate is not zero

Motivation (slide 5)
[Figure: reliability versus overhead (perf., power, area), showing Redundancy, SWAT, and the target of very high reliability at low cost; annotations: "How?", "Tunable reliability".]
Goals:
- Full reliability at low cost
- Systematic resiliency evaluation
- Tunable reliability vs. overhead

Fault Outcomes (slide 6)
- Fault-free execution: APPLICATION produces its Output
- Faulty executions:
  - Masked: transient fault (e.g., bit 4 in R1), Output unaffected
  - Detection: transient fault again in bit 4 in R1, this time producing a symptom of the fault
- Symptom detectors (SWAT): fatal traps, assertion violations, etc.

Fault Outcomes (slide 7)
- Fault-free execution
- Faulty executions:
  - Masked
  - Detection (symptom of fault)
  - Silent Data Corruption (SDC): corrupted Output (X)
- SDCs are the worst of all outcomes
- How to eliminate SDCs?
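The three outcomes of a faulty execution can be illustrated with a toy single-bit fault injection. This is a minimal sketch, not the SWAT infrastructure: the `checksum` workload, the bounds check standing in for a symptom detector, and the `(step, bit)` fault model are all hypothetical choices made for illustration.

```python
def checksum(values, flip=None):
    """Toy workload: sum the inputs and drop the low 4 bits of the result.

    flip = (step, bit) flips one bit of the value loaded at that step,
    modelling a transient fault in a register (hypothetical fault model).
    """
    total = 0
    for step, v in enumerate(values):
        reg = v
        if flip is not None and flip[0] == step:
            reg ^= 1 << flip[1]        # single-bit transient fault
        if reg > 1000:                 # stand-in symptom detector: out of bounds
            raise ValueError("out of bounds")
        total += reg
    return total >> 4                  # low bits discarded, so some faults are masked


def classify(values, flip):
    """Compare a faulty run against the fault-free (golden) run."""
    golden = checksum(values)
    try:
        faulty = checksum(values, flip)
    except ValueError:
        return "detected"              # symptom observed
    return "masked" if faulty == golden else "SDC"


inputs = [16, 32, 48]
outcomes = {bit: classify(inputs, (0, bit)) for bit in (0, 4, 10)}
```

With these inputs, flipping bit 0 of the first value is masked, flipping bit 4 silently changes the output (SDC), and flipping bit 10 trips the bounds check.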

Approach (slide 8)
New detectors + selective duplication = Tunable resiliency at low cost
1. Find SDC-causing application sites [ASPLOS 2012]
   - Relyzer: comprehensive resiliency analysis, 96% accuracy
   - ~5 years for one app reduced to <2 days for one app
2. Detect at low cost [DSN 2012]
   - Program-level error detectors: 84% SDCs detected at 10% cost
   - Selective duplication for the rest

Relyzer: Application Resiliency Analyzer (slide 9)
- Pruning fault sites through application-level error equivalence
- Insight: similar error propagation => similar outcome
- Example (CFG): errors in X that take the same paths behave similarly
- Predict fault outcomes for each equivalence class from its representative
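The pruning idea can be sketched as grouping fault sites by a path key and injecting only into one representative per group. This is a simplified illustration under stated assumptions, not Relyzer's actual heuristics: the path tuple used as the equivalence key, and the names `prune_by_path`, `predict_outcomes`, and `fake_inject`, are hypothetical.

```python
from collections import defaultdict


def prune_by_path(sites):
    """Group fault sites into equivalence classes.

    sites: list of (site_id, path), where path is a tuple of basic-block ids
    executed after the fault site (an assumed, simplified equivalence key).
    Returns {representative_site: [all sites in its class]}.
    """
    classes = defaultdict(list)
    for site_id, path in sites:
        classes[path].append(site_id)
    return {members[0]: members for members in classes.values()}


def predict_outcomes(pruned, inject):
    """Inject only into representatives and copy the outcome to the class."""
    outcomes = {}
    for rep, members in pruned.items():
        result = inject(rep)           # one injection per equivalence class
        for site in members:
            outcomes[site] = result
    return outcomes


# Four dynamic instances of instruction X; three take path B->D, one takes C->D.
sites = [(0, ("B", "D")), (1, ("C", "D")), (2, ("B", "D")), (3, ("B", "D"))]
pruned = prune_by_path(sites)


def fake_inject(site):
    # Stand-in for a real fault injection run (made-up outcomes).
    return "SDC" if site == 0 else "masked"


predicted = predict_outcomes(pruned, fake_inject)
```

Here four fault sites need only two injections; the outcome observed at site 0 is assumed to hold for sites 2 and 3 as well.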

Relyzer Contributions [ASPLOS 2012] (slide 10)
- Relyzer: a complete application resiliency analysis technique
- Developed novel fault pruning techniques
  - 3 to 6 orders of magnitude fewer injections for most apps
  - 99.78% app fault sites pruned
  - Only 0.004% represent 99% of all fault sites
- Can identify all potential SDC-causing fault sites

SDC-targeted Program-level Detectors (slide 11)
- Detectors only for SDC-vulnerable (SDC-hot) app locations
- Challenge: where to place detectors and what detectors to use?
  - Where: many SDC-causing errors propagate to few program values
  - What (detectors): test program-level properties
Example:
  C code:
    Array a, b;
    For (i=0 to n) { a[i] = b[i] + a[i] }
  ASM code:
    A, B = base addr. of a, b
    L: load r1, r2 ← [A], [B]
       store r3 → [A]
       ..
       add A = A + 0x8
       add B = B + 0x8
       add i = i + 1
       branch (i<n) L
  - All errors propagate here in few quantities (A, B, and i)
  - Collect initial values of A, B, and i
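One property that could be tested with the collected initial values is that A and B each advance by exactly 0x8 per increment of i. The slide does not spell out the check, so the sketch below is an assumption: it models only the address and counter updates of the loop, and `run_loop`, `detector_fires`, and the `(iteration, name, bit)` fault format are hypothetical names.

```python
STRIDE = 0x8  # bytes added to A and B per iteration in the slide's ASM code


def run_loop(n, a_base, b_base, fault=None):
    """Model the A, B, and i updates of the loop.

    fault = (iteration, name, bit) flips one bit of 'A', 'B', or 'i'
    once, at the start of the given dynamic iteration.
    Returns (initial_state, final_state).
    """
    state = {"A": a_base, "B": b_base, "i": 0}
    init = dict(state)                 # collect initial values of A, B, and i
    iteration = 0                      # dynamic iteration count, immune to faults in i
    while state["i"] < n:
        if fault is not None and fault[0] == iteration:
            state[fault[1]] ^= 1 << fault[2]
        state["A"] += STRIDE
        state["B"] += STRIDE
        state["i"] += 1
        iteration += 1
    return init, state


def detector_fires(init, final):
    """Check at loop exit that A and B moved by STRIDE per step of i."""
    steps = final["i"] - init["i"]
    return (final["A"] - init["A"] != STRIDE * steps
            or final["B"] - init["B"] != STRIDE * steps)


clean = detector_fires(*run_loop(4, 0x1000, 0x2000))
fault_in_a = detector_fires(*run_loop(4, 0x1000, 0x2000, fault=(2, "A", 5)))
fault_in_i = detector_fires(*run_loop(4, 0x1000, 0x2000, fault=(1, "i", 1)))
```

The check runs once at loop exit rather than every iteration, which is what keeps the cost low in this sketch; it stays silent on the fault-free run and fires for the injected faults in A and in i.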

Contributions [DSN 2012] (slide 12)
- Discovered common program properties around most SDC-causing sites
- Devised low-cost program-level detectors
  - Avg. SDC reduction of 84% at 10% avg. cost
- New detectors + selective duplication = Tunable resiliency at low cost
[Figure: curves for "Relyzer + new detectors + selective duplication" and "Relyzer + selective duplication"; labeled values: 18%, 90%, 24%, 99%.]

Other Contributions (slide 13)
Complete Resiliency Solution (detection, diagnosis, and recovery over time):
- mSWAT [Hari et al., MICRO'09]
  - Symptom detectors on multicore systems
  - Novel diagnosis to isolate the faulty core
- Recovery
  - Checkpointing and rollback
  - I/O intensive apps
  - Latency-recoverability
- Accurate fault modeling
  - FPGA validation of SWAT detectors [Pellegrini et al., DATE'12]
  - Gate-µarch-level simulator [Li et al., HPCA'09]
Siva Hari, University of Illinois at Urbana-Champaign

Backup (slide 14)

Identifying Near Optimal Detectors: Naïve Approach (slide 15)
Example: target SDC coverage = 60%
- Sample 1 from the bag of detectors: overhead = 10%, SDC coverage from SFI = 50%
- Sample 2 from the bag of detectors: overhead = 20%, SDC coverage from SFI = 65%
Tedious and time consuming

Identifying Near Optimal Detectors: Our Approach (slide 16)
1. Set attributes for each detector in the bag (SDC covg. = X%, overhead = Y%), enabled by Relyzer
2. Dynamic programming
   - Constraint: total SDC covg. ≥ 60%
   - Objective: minimize overhead
Result: selected detectors with overhead = 9%
Obtained SDC coverage vs. performance trade-off curves [DSN'12]
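The selection step can be sketched as a covering knapsack solved by dynamic programming. This is a minimal sketch under assumptions the slide does not state: each detector's SDC coverage is an integer percentage, coverages of different detectors simply add up, and the detector names and numbers below are made up.

```python
def select_detectors(detectors, target):
    """Pick detectors with total coverage >= target at minimum overhead.

    detectors: list of (name, coverage_pct, overhead_pct); coverage is an int
    and is assumed additive across detectors (illustrative simplification).
    Returns (min_overhead, chosen_names), or None if the target is unreachable.
    """
    inf = float("inf")
    # best[c] = cheapest (overhead, names) reaching coverage c, capped at target
    best = [(inf, [])] * (target + 1)
    best[0] = (0.0, [])
    for name, cov, cost in detectors:
        new = list(best)               # read old table, write new: each detector used once
        for c in range(target + 1):
            if best[c][0] == inf:
                continue
            reached = min(target, c + cov)
            candidate = best[c][0] + cost
            if candidate < new[reached][0]:
                new[reached] = (candidate, best[c][1] + [name])
        best = new
    return None if best[target][0] == inf else best[target]


bag = [("d1", 30, 4.0), ("d2", 25, 3.0), ("d3", 40, 6.5), ("d4", 20, 2.0)]
choice = select_detectors(bag, 60)
```

For this made-up bag and a 60% target, the cheapest selection is d3 plus d4 at 8.5% overhead. Rerunning with different targets would trace out a coverage versus overhead curve without a fault injection campaign per candidate set.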