SDC is in the eye of the beholder: A Survey and preliminary study

Presentation transcript:

SDC is in the eye of the beholder: A Survey and preliminary study. Bo Fang*, Panruo Wu✝, Qiang Guan☨, Nathan DeBardeleben☨, Laura Monroe☨, Sean Blanchard☨, Zizhong Chen✝, Karthik Pattabiraman* and Matei Ripeanu*. *The University of British Columbia, Canada; ☨Ultrascale Systems Research Center, Los Alamos National Lab, USA; ✝The University of California, Riverside, USA. This is a position paper about how we should think about characterizing and detecting silent data corruptions (SDCs).

Do not compare apples and oranges (apple vs. orange). But many of us still do! Apples and oranges are obviously different things. Less obviously, in computer science we sometimes still compare them. I will give you an example in a couple of slides.

Error Resilience: fault, error, failure. SoC soft error trends: the overall FIT rate per SoC is increasing [DATE 2014, Chandra AMD]. A very important concept is the fault-error-failure chain. Fault: in our context, a hardware fault caused by particle strikes (e.g., neutrons), hardware defects, etc. It can cause, for example, bit-flips. Error: a deviation of system behavior from the fault-free run. Failure: a violation of the system's specification, e.g., a crash. As fault rates keep increasing, the study of error resilience becomes more and more important.

Error Protection Space. How large is the space? Errors can appear in any layer. Only a fraction of the errors at the circuit level impact the application. Protection cost differs across layers. Protection with software-based techniques is essential for modern systems. The apples-and-oranges comparison happens when we try to compare the efficiency of two fault-tolerance (FT) techniques that are designed for different layers.

Focus: Silent Data Corruption. A fault can lead to a crash, a hang, an SDC (no sign of incorrect execution), or normal execution. We report the preliminary study here.
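The four outcomes on this slide can be sketched as a small classifier. This is a minimal illustration, not the authors' tooling: the names `Outcome` and `classify_run` are hypothetical, and it assumes a run is summarized by an exit code, a timeout flag, and a list of output values compared exactly against a fault-free (golden) run.

```python
from enum import Enum
from typing import Optional, Sequence


class Outcome(Enum):
    CRASH = "crash"
    HANG = "hang"
    SDC = "sdc"          # finished normally, but the output is wrong
    NORMAL = "normal"    # the fault was masked


def classify_run(exit_code: Optional[int], timed_out: bool,
                 output: Sequence[float], golden: Sequence[float]) -> Outcome:
    """Classify one fault-injection run (hypothetical helper)."""
    if timed_out:
        return Outcome.HANG
    if exit_code is None or exit_code != 0:
        return Outcome.CRASH
    # No sign of incorrect execution: only the output can reveal the fault.
    if list(output) != list(golden):
        return Outcome.SDC
    return Outcome.NORMAL
```

Exact comparison against the golden run is only one possible definition of an SDC; the rest of the talk argues that this choice matters.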

Preliminary Study: Cross-layer Data Corruptions. Error propagation: how much fault masking occurs across different perspectives of the system? Fault injection: PINFI [Wei DSN14] on DOE mini apps. Fault model: a single bit-flip in the computation units of the processor.
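To make the single bit-flip fault model concrete, here is a minimal sketch that flips one bit in the IEEE-754 binary64 encoding of a value. It is not PINFI, which injects into running binaries at the instruction level; `flip_bit` is a hypothetical helper that only shows what a single bit-flip does to a floating-point result.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its 64-bit encoding flipped.

    Bit 0 is the least significant mantissa bit, bits 52-62 are the
    exponent, and bit 63 is the sign.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in [0, 63]")
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped
```

The effect depends strongly on the bit position: flipping bit 0 of 1.0 changes it by about 2.2e-16, flipping bit 52 halves it, and flipping bit 63 negates it.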

Experimental Configuration: application output and application-specific correctness check.
LULESH. Output: number of iterations; final origin energy; measures of symmetry. Application-specific correctness check: number of iterations exactly the same; final origin energy correct to at least 6 digits; measures of symmetry smaller than 10^-8.
HPL. Output: solution vector x. Application-specific correctness check: residual check on x.
CLAMR. Output: number of cell units; mass change per iteration. Application-specific correctness check: threshold for the mass change per iteration.
We measure corruption at three levels: memory, output, and the application-specific correctness check.
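The LULESH criteria above can be sketched as a single predicate. This is an illustration under stated assumptions, not LULESH's own verification code: `LuleshOutput` and `passes_check` are hypothetical names, and "correct to at least 6 digits" is interpreted here as a relative tolerance of 1e-6.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LuleshOutput:
    iterations: int
    final_origin_energy: float
    symmetry_measures: Sequence[float]


def passes_check(run: LuleshOutput, golden: LuleshOutput) -> bool:
    """Application-specific correctness check (hypothetical sketch)."""
    # Number of iterations: exactly the same as the fault-free run.
    same_iterations = run.iterations == golden.iterations
    # Final origin energy: assumed to mean relative error within 1e-6.
    energy_ok = math.isclose(run.final_origin_energy,
                             golden.final_origin_energy, rel_tol=1e-6)
    # Measures of symmetry: each must be smaller than 1e-8 in magnitude.
    symmetry_ok = all(abs(m) < 1e-8 for m in run.symmetry_measures)
    return same_iterations and energy_ok and symmetry_ok
```

A run whose energy differs from the golden value only beyond the sixth digit passes this check even though its output is not bit-identical, which is exactly the gap the study measures.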

Cross-layer Error Resilience (data corruption rate). 46% of faults cause memory corruptions but have no impact on final correctness. 50% of output corruptions do not lead to a deviation in final correctness. The point is that an error resilience estimate can be misleading depending on the layer at which it is measured.
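The two kinds of rate quoted here (a share of all injected faults, and a share of the runs corrupted at a given layer) can be computed from per-run records as in the sketch below. The record type `Run`, both helper functions, and the four sample records in the usage note are hypothetical and are not the study's data.

```python
from typing import Callable, NamedTuple, Sequence


class Run(NamedTuple):
    memory_corrupted: bool   # memory state differs from the golden run
    output_corrupted: bool   # application output differs from the golden run
    check_failed: bool       # application-specific correctness check fails


def share_of_all(runs: Sequence[Run], pred: Callable[[Run], bool]) -> float:
    """Fraction of all runs that satisfy `pred`."""
    return sum(pred(r) for r in runs) / len(runs) if runs else 0.0


def share_within(runs: Sequence[Run], cond: Callable[[Run], bool],
                 pred: Callable[[Run], bool]) -> float:
    """Fraction of the runs satisfying `cond` that also satisfy `pred`."""
    subset = [r for r in runs if cond(r)]
    return share_of_all(subset, pred)
```

For example, with the made-up records `Run(False, False, False)`, `Run(True, False, False)`, `Run(True, True, False)` and `Run(True, True, True)`, half of all runs corrupt memory without failing the check, and half of the output corruptions still pass the check.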

SDC Characterization. An SDC shows no sign of incorrect execution, so characterizing it raises three questions. What: which parts, layers, or data of the system do we want to check? How: how do we check? When: when do we check?

What: System-level Classification. Diagram labels: Memory, OS, System Call, App, Data Path. Our position is that we need a system-level classification of SDCs, in the context of the whole system stack, for characterization and detection. Benefits: 1. Enable different points of view: hardware designers, OS developers, resilience scientists, application developers, and users. 2. Understand and improve the effectiveness of FT techniques. Error detection mechanisms can be improved by checking whether an error detected in a lower layer can really lead to an unacceptable outcome (selective detection); this needs cross-layer analysis. Checkpoint/recovery schemes based on anomalous-data monitoring can determine whether a roll-back is needed by predicting the final outcome of an intermediate data corruption.

How: Precise vs Approximate. Precise: the application output is different from the golden run, e.g. [Feng ASPLOS2010] [Hari, ASPLOS2012] [Reis, CGO2005]. Approximate: the application output does not pass a check, e.g. [Lu CASE2014] [Huang, IEEE TC2006] [Reis, CGO2005]. Moving on from the system-level classification to the "how": checking can be precise or vague, and it can be done at various layers.

Example of the Impact: bit-by-bit equality vs application-specific check. [Figure: the bit patterns 01100110, 01100111, 01100110 and 01100011 compared under the two checks.] Here is an example of why this is important. The choice of, or requirement for, how to determine an SDC affects the sensitivity of your determination. A gap between the two can be expected.
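The gap between the two criteria can be reproduced on a single floating-point value. In this sketch, `flip`, `bitwise_equal` and `within_tolerance` are hypothetical helpers, and the relative tolerance of 1e-6 stands in for whatever an application-specific check would accept.

```python
import math
import struct


def to_bits(x: float) -> int:
    """64-bit integer holding the IEEE-754 binary64 encoding of x."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def flip(x: float, bit: int) -> float:
    """Flip one bit (0 = least significant) in the encoding of x."""
    return struct.unpack("<d", struct.pack("<Q", to_bits(x) ^ (1 << bit)))[0]


def bitwise_equal(a: float, b: float) -> bool:
    """Precise criterion: every bit must match the golden value."""
    return to_bits(a) == to_bits(b)


def within_tolerance(a: float, b: float, rel_tol: float = 1e-6) -> bool:
    """Approximate criterion: an assumed application-level tolerance."""
    return math.isclose(a, b, rel_tol=rel_tol)


golden = 3.141592653589793
low_flip = flip(golden, 0)    # least significant mantissa bit
high_flip = flip(golden, 62)  # most significant exponent bit
```

Here `low_flip` counts as an SDC under bit-by-bit equality but not under the tolerance check, while `high_flip` fails both, so the measured SDC rate depends on which criterion the beholder picks.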

When: Intermediate vs Final. Final: most studies. Intermediate: a violation of application state (intermediate or final), as in ABFT algorithms, which check internal states (for example, of a linear solver) to make sure they do not violate mathematical invariants or algorithm-specific requirements, e.g. [Berrocal, HPDC2015] [Chen, SIGPLAN2013] [Sloan, DSN2012].
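One classic mathematical invariant of this kind is the checksum relation used in ABFT for matrix multiplication: if C = A·B, then the column sums of C must equal the column sums of A multiplied by B. The sketch below is a generic pure-Python illustration of that invariant, not the method of any of the cited papers; the function names and the tolerance are assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def column_sums(m: Matrix) -> List[float]:
    return [sum(col) for col in zip(*m)]


def checksum_ok(a: Matrix, b: Matrix, c: Matrix, tol: float = 1e-9) -> bool:
    """Check the invariant (1^T A) B == 1^T C for a claimed product c."""
    expected = matmul([column_sums(a)], b)[0]
    actual = column_sums(c)
    return all(math.isclose(e, x, rel_tol=tol, abs_tol=tol)
               for e, x in zip(expected, actual))
```

Because the invariant can be checked on an intermediate product, a corrupted entry can be caught before it propagates to the final output, at the cost of one extra vector-matrix product.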

Conclusion. SDCs are the most important failure type for modern systems. SDC characterization depends on multi-dimensional knowledge. SDC protection needs cross-layer analysis. Advertising: please attend the talk on our paper in the regular session: ePVF: An Enhanced Program Vulnerability Factor Methodology for Cross-layer Resilience Analysis (Wednesday, June 29th, 2016, 16:00 – 17:30).

Outline: What are SDCs; Classification of SDCs; Impact on Fault Tolerance Design. (Skip this one.)