2/23/2019 A Practical Approach for Handling Soft Errors in Iterative Applications Jiaqi Liu and Gagan Agrawal Department of Computer Science and Engineering.

Presentation transcript:

A Practical Approach for Handling Soft Errors in Iterative Applications. Jiaqi Liu and Gagan Agrawal, Department of Computer Science and Engineering, The Ohio State University.

Motivation: Soft errors have become a threat in large-scale systems. They are unpredictable, resulting from various factors: packaging material, radiation, voltage fluctuation, temperature, and defective hardware. They are also transient. Silent Data Corruption (SDC) might not make the application crash, but it can result in erroneous output. Impact on scientific applications: scientific applications expect accurate results and therefore have low tolerance for soft errors.

Motivating Study: Inspect the impact of soft errors on iterative applications. Inject bit flips into different bits of a variable at different execution stages. Observe how a flip in each bit position impacts the output, and how the execution stage at which the flip occurs (expressed as a percentage of total iterations) impacts the output. The study mimics a Single Event Upset (SEU), with only one bit flip per execution.
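The injection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' injection tool: it assumes the corrupted variable is an IEEE 754 double, and the helper names `flip_bit` and `relative_change` are hypothetical.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its 64-bit IEEE 754 encoding flipped.

    Bit 0 is the least significant mantissa bit, bits 52-62 hold the
    exponent, and bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in [0, 63]")
    # Reinterpret the double as an unsigned 64-bit integer.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    # Flip the chosen bit and reinterpret the result as a double again.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def relative_change(value: float, bit: int) -> float:
    """Relative size of the error introduced by flipping `bit` in `value`."""
    return abs(flip_bit(value, bit) - value) / abs(value)


if __name__ == "__main__":
    # Assumed sample value; compare a low mantissa bit with exponent bits.
    for bit in (0, 20, 51, 52, 60):
        print(bit, relative_change(1.0, bit))
```

Flipping a low mantissa bit perturbs the value only slightly, while flipping an exponent bit changes it by orders of magnitude, which is the kind of contrast the study measures.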

Impact of SEU on Iterative Applications: [Figure: impact of SEU on the Sobel application, measured as Normalized Relative Difference against the output of the normal execution.]

Observation: Significant errors result from higher-order bit flips. Errors from lower-order bit flips are trivial and usually acceptable. The iteration in which an error occurs affects its impact on the final output: early errors tend to be averaged out by the iterative algorithm.

Signature-Based Detection: Monitor the convergence criterion (residual/signature) of the algorithm. Normal execution leads to continuous convergence. An unexpected increase or decrease in the convergence criterion is a signature of SDC. Apply periodic checkpointing to recover in the presence of SDC.

Signature-Based Detection, main idea: Check for the signature in each (or some) iterations. Periodically take a checkpoint. If the signature of a soft error is detected, recover from the latest checkpoint.
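The check-checkpoint-recover loop above can be sketched for a Jacobi solver. This is only an illustration under stated assumptions, not the authors' implementation: the signature used here is simply "the residual increased", the checkpoint is an in-memory copy, the fault is injected once by overwriting one element, and all names (`solve`, `ckpt_every`, `fault`) are hypothetical.

```python
def jacobi_step(A, b, x):
    """One Jacobi sweep for A x = b."""
    n = len(b)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]


def residual(A, b, x):
    """Infinity norm of b - A x, used here as the convergence criterion."""
    n = len(b)
    return max(abs(b[i] - sum(A[i][j] * x[j] for j in range(n)))
               for i in range(n))


def solve(A, b, tol=1e-10, max_iters=200, ckpt_every=5,
          fault=None, max_rollbacks=10):
    """Jacobi with residual monitoring and rollback to the latest checkpoint.

    `fault` is an optional (iteration, index, bad_value) tuple, applied once
    to mimic a transient error.
    """
    x = [0.0] * len(b)
    prev_res = residual(A, b, x)
    ckpt = (0, list(x), prev_res)          # (iteration, state, residual)
    it = rollbacks = 0
    while it < max_iters:
        x_new = jacobi_step(A, b, x)
        if fault is not None and it == fault[0]:
            x_new[fault[1]] = fault[2]     # inject the corruption
            fault = None                   # transient: it does not recur
        res = residual(A, b, x_new)
        if res > prev_res:                 # signature: residual went up
            rollbacks += 1
            if rollbacks > max_rollbacks:
                raise RuntimeError("too many rollbacks")
            it, x, prev_res = ckpt[0], list(ckpt[1]), ckpt[2]
            continue
        x, prev_res, it = x_new, res, it + 1
        if res < tol:
            break
        if it % ckpt_every == 0:           # periodic checkpoint
            ckpt = (it, list(x), res)
    return x, it, rollbacks


if __name__ == "__main__":
    A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
    b = [15.0, 10.0, 10.0]
    print(solve(A, b))
    print(solve(A, b, fault=(7, 0, 1e6)))
```

The monotone-residual test is safe for this particular matrix because its residual shrinks at every Jacobi step; a real detector would need a signature suited to the convergence behavior of each application.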

Partial Replication: Identify critical sections/iterations (CS) during the execution. Replicate the computation in critical sections. Vote for the correct result at the end of each CS. This avoids the major impact of SDC.
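A rough sketch of replicate-and-vote over a critical section follows. The slides do not state the number of replicas or the voting rule, so this assumes three replicas and a strict majority vote; the toy `smooth` step and the names `vote`, `run`, and `inject` are hypothetical.

```python
from collections import Counter


def vote(results):
    """Return the value produced by a strict majority of the replicas."""
    winner, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among replicas")
    return winner


def smooth(state):
    """Toy iterative step (assumed): each value contracts toward 2.0."""
    return tuple(0.5 * v + 1.0 for v in state)


def run(step, state, n_iters, is_critical, replicas=3, inject=None):
    """Run `step` n_iters times, replicating only the critical iterations.

    `inject(it, replica, out)` optionally corrupts one replica's output.
    """
    for it in range(n_iters):
        copies = replicas if is_critical(it) else 1
        outs = []
        for r in range(copies):
            out = step(state)
            if inject is not None:
                out = inject(it, r, out)
            outs.append(out)
        # Vote at the end of a replicated iteration; otherwise take the
        # single result as-is.
        state = vote(outs) if copies > 1 else outs[0]
    return state


if __name__ == "__main__":
    def last_40_percent(it):
        return it >= 6            # iterations 6-9 of 10 are critical

    def fault(it, r, out):
        # Corrupt replica 1 in iteration 8 only.
        return (1e9,) + out[1:] if (it, r) == (8, 1) else out

    print(run(smooth, (0.0, 8.0), 10, last_40_percent))
    print(run(smooth, (0.0, 8.0), 10, last_40_percent, inject=fault))
```

Replicating only part of the run keeps the cost well below full replication, at the price of leaving errors in the unreplicated iterations to other mechanisms.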

Experiment Result: Applications: Jacobi, Sobel, Conjugate Gradient (CG), Gauss-Seidel (GS), and Successive Over-Relaxation (SOR). Datasets. Evaluation: effectiveness, improvement from partial replication, and overhead.

Experiment Result - Effectiveness: TP: True Positive. FP: False Positive. FN: False Negative. An algorithm with an F-Score over 90% is considered effective.
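The transcript does not reproduce the F-Score formula, so the sketch below assumes the standard F1 definition (the harmonic mean of precision and recall) computed from the TP, FP, and FN counts. The function name and the sample counts are hypothetical, not results from the experiments.

```python
def f_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1 score from true positive, false positive, and false
    negative counts. Returns 0.0 when the score is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up counts: 90 injected errors detected, 5 false alarms, 5 missed.
    score = f_score(tp=90, fp=5, fn=5)
    print(score, score > 0.90)
```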

Experiment Result - Effectiveness

Experiment Result - Overhead: Distribution of execution times: signature analysis + checkpoint.

Experiment Result - Applying Partial Replication: Results for partial replication and for partial replication + signature analysis (including checkpointing and restart) on 32 nodes, for Sobel and CG. Sobel replicates the last 40% of the execution, while CG replicates the first 40%.

Thanks

Backup Slides I

Motivation: Traditional checkpoint and restart no longer satisfies the fault-tolerance needs of large-scale systems. A huge amount of work is wasted under low MTBF (mean time between failures) and a large number of nodes. [Figure: workload distribution on 100K nodes.] An alternative solution is needed for larger system scales, given decreasing MTBF and increasing checkpoint size due to increasing system size.