2/23/2019 A Practical Approach for Handling Soft Errors in Iterative Applications Jiaqi Liu and Gagan Agrawal Department of Computer Science and Engineering.


1 A Practical Approach for Handling Soft Errors in Iterative Applications
Jiaqi Liu and Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University

2 Motivation
Soft errors have become a threat in large-scale systems.
- Unpredictability
- They result from various factors:
  - Packaging material
  - Radiation
  - Voltage fluctuation
  - Temperature
  - Defective hardware
- Transiency
- Silent Data Corruption (SDC) might not crash the application, but it can result in erroneous output.
Impact on scientific applications
- Scientific applications expect accurate results, so they have low tolerance for soft errors.

3 Motivating Study
Inspect the impact of soft errors on iterative applications.
- Inject bit flips into different bits of a variable at different execution stages.
- Observe how a flip in each bit position affects the output.
- Observe how the execution stage of the flip (denoted as a percentage of total iterations) affects the output.
- Mimic a Single Event Upset (SEU): only one bit flip per execution.
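The injection step can be illustrated with a minimal sketch, assuming the variable is an IEEE 754 double. The helper name `flip_bit` is hypothetical and is not taken from the slides; the authors' actual injection tool is not described in the transcript.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 double encoding flipped.

    Bit 0 is the least significant mantissa bit, bits 52-62 hold the
    exponent, and bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the double as a 64-bit unsigned integer.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    # Toggle the chosen bit and reinterpret the result as a double.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


# Flip a few bit positions of the same value to compare their effect.
for bit in (0, 30, 52, 62):
    print(bit, flip_bit(1.0, bit))
```

For the value 1.0, flipping bit 0 changes the result by about 2.2e-16, while flipping bit 62 (the top exponent bit) turns it into infinity.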

4 Impact from SEU to Iterative Applications
Impact of SEU on the Sobel application, measured as Normalized Relative Difference against the output of the normal execution.
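The transcript does not define Normalized Relative Difference. As a rough sketch only, one plausible reading (an assumption, not the authors' stated formula) is the Euclidean norm of the difference divided by the norm of the fault-free output. The function name is hypothetical.

```python
import math
from typing import Sequence


def normalized_relative_difference(
    output: Sequence[float], reference: Sequence[float]
) -> float:
    """Assumed metric: ||output - reference||_2 / ||reference||_2."""
    if len(output) != len(reference):
        raise ValueError("output and reference must have the same length")
    diff_norm = math.sqrt(sum((o - r) ** 2 for o, r in zip(output, reference)))
    ref_norm = math.sqrt(sum(r * r for r in reference))
    if ref_norm == 0.0:
        raise ValueError("reference output has zero norm")
    return diff_norm / ref_norm
```

Under this reading, a value of 0 means the faulty run matched the normal execution exactly.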

5 Observation
- Significant errors result from flips in higher-order bits.
- Errors from lower-order bit flips are trivial and usually acceptable.
- The iteration in which an error occurs affects its impact on the final output.
- Early errors tend to be averaged out by the iterative algorithm.

6 Signature Based Detection
- Monitor the convergence criterion (residual/signature) of the algorithm.
- Normal execution leads to continuous convergence.
- An unexpected increase or decrease in the convergence criterion is a signature of SDC.
- Apply periodic checkpointing to recover in the presence of SDC.
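The check on the convergence criterion might look like the following sketch. The function name and both thresholds (`max_growth`, `max_drop`) are hypothetical placeholders; the slides do not state how "unexpected" is quantified.

```python
import math


def is_suspicious(
    prev_residual: float,
    residual: float,
    max_growth: float = 1.0,
    max_drop: float = 1e-3,
) -> bool:
    """Flag a residual that breaks the expected steady convergence.

    Assumed rule: the ratio residual / prev_residual should stay inside
    [max_drop, max_growth]. Anything outside that band, or a non-finite
    residual, is treated as a possible signature of SDC.
    """
    if not math.isfinite(residual):
        return True
    if prev_residual == 0.0:
        return residual != 0.0
    ratio = residual / prev_residual
    return ratio > max_growth or ratio < max_drop
```

A suitable band depends on the solver: a method whose residual is not monotonic in normal runs would need a looser `max_growth`.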

7 Signature Based Detection
Main idea:
- Check for the signature in each (or some) iteration.
- Periodically take a checkpoint.
- If the signature of a soft error is detected, recover from the latest checkpoint.
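The three steps can be sketched on a small Jacobi solver. This is an illustration under stated assumptions, not the authors' implementation: the signature is simplified to "the residual grew or is not finite", checkpoints are kept in memory, and the `fault` hook, the 3x3 system, and all parameter names are hypothetical.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]
Matrix = List[List[float]]


def jacobi_step(A: Matrix, b: Vector, x: Vector) -> Vector:
    """One Jacobi sweep for the linear system A x = b."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def residual(A: Matrix, b: Vector, x: Vector) -> float:
    """Euclidean norm of A x - b, used here as the convergence signature."""
    n = len(b)
    return math.sqrt(
        sum((sum(A[i][j] * x[j] for j in range(n)) - b[i]) ** 2 for i in range(n))
    )


def solve_with_rollback(
    A: Matrix,
    b: Vector,
    max_iterations: int = 200,
    checkpoint_every: int = 5,
    tol: float = 1e-10,
    fault: Optional[Callable[[int, Vector], Vector]] = None,
    max_rollbacks: int = 10,
) -> Tuple[Vector, int]:
    """Iterate, checkpoint periodically, and roll back on a bad signature."""
    x = [0.0] * len(b)
    prev_res = residual(A, b, x)
    checkpoint = (0, list(x), prev_res)  # (iteration, state, residual)
    k = 0
    rollbacks = 0
    while k < max_iterations and prev_res >= tol:
        candidate = jacobi_step(A, b, x)
        if fault is not None:
            candidate = fault(k, candidate)  # hypothetical fault-injection hook
        res = residual(A, b, candidate)
        if not math.isfinite(res) or res > prev_res:
            # Signature detected: discard the state and restore the checkpoint.
            rollbacks += 1
            if rollbacks > max_rollbacks:
                raise RuntimeError("too many rollbacks")
            k, x, prev_res = checkpoint[0], list(checkpoint[1]), checkpoint[2]
            continue
        x, prev_res = candidate, res
        k += 1
        if k % checkpoint_every == 0:
            checkpoint = (k, list(x), res)
    return x, rollbacks


# Small symmetric, diagonally dominant system chosen for the demonstration.
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0]
solution, rollback_count = solve_with_rollback(A, b)
```

The residual of Jacobi on this particular matrix decreases at every step in a fault-free run, which is what makes the simple "residual grew" test usable here. A transient fault passed through `fault` should fire only once; a fault that recurs on replay would be caught by the `max_rollbacks` guard instead.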

8 Partial Replication
- Identify critical sections/iterations (CS) during the execution.
- Replicate computation in the critical sections.
- Vote for the correct result at the end of each CS.
- Avoid the major impact of SDC.
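A minimal sketch of this scheme follows, assuming three replicas, a single contiguous critical section, and exact-equality majority voting on the state. The `step(state, k, replica)` signature and all names are hypothetical; how the authors pick the critical section or compare replica results is not stated in the transcript.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

State = Tuple[float, ...]
Step = Callable[[State, int, int], State]  # (state, iteration, replica) -> state


def majority_vote(results: Sequence[State]) -> State:
    """Return the state produced by a strict majority of replicas."""
    winner, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority among replicas")
    return winner


def run_with_partial_replication(
    step: Step,
    state: State,
    iterations: int,
    cs_start: int,
    cs_end: int,
    replicas: int = 3,
) -> State:
    """Replicate only iterations in [cs_start, cs_end) and vote at the end."""
    for k in range(cs_start):
        state = step(state, k, 0)  # non-critical: single execution
    outcomes = []
    for r in range(replicas):
        replica_state = state
        for k in range(cs_start, cs_end):
            replica_state = step(replica_state, k, r)
        outcomes.append(replica_state)
    state = majority_vote(outcomes)  # vote once, at the end of the CS
    for k in range(cs_end, iterations):
        state = step(state, k, 0)  # non-critical: single execution
    return state


def clean_step(state: State, k: int, replica: int) -> State:
    """Toy iterative update that converges toward 1.0."""
    return tuple(0.5 * (v + 1.0) for v in state)


def faulty_step(state: State, k: int, replica: int) -> State:
    """Same update, with one corrupted replica inside the critical section."""
    out = clean_step(state, k, replica)
    if k == 3 and replica == 1:
        out = (out[0] + 1e6,) + out[1:]
    return out


expected = run_with_partial_replication(clean_step, (0.0, 4.0), 10, 2, 6)
recovered = run_with_partial_replication(faulty_step, (0.0, 4.0), 10, 2, 6)
```

Exact equality works here because deterministic replicas produce bitwise-identical results; replicas with nondeterministic floating-point ordering would need a tolerance-based comparison instead.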

9 Experiment Result
Applications: Jacobi, Sobel, Conjugate Gradient (CG), Gauss-Seidel (GS), Successive Over-Relaxation (SOR)
Datasets
Evaluation: effectiveness, improvement from partial replication, and overhead

10 Experiment Result - Effectiveness
TP: True Positive; FP: False Positive; FN: False Negative
An F-score over 90% is considered to indicate an effective algorithm.
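For reference, the F-score can be computed from these counts as sketched below. The slide does not say which variant is used; this assumes the common F1 form, the harmonic mean of precision and recall.

```python
def f_score(tp: int, fp: int, fn: int) -> float:
    """F1 score: harmonic mean of precision and recall (assumed variant)."""
    if tp == 0:
        return 0.0  # no true positives, so precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, 90 detected faults with 10 false alarms and 10 missed faults give precision and recall of 0.9 each, and therefore an F-score of 0.9, which sits right at the 90% mark.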

11 Experiment Result - Effectiveness

12 Experiment Result - Overhead
Distribution of execution times: Signature Analysis + Checkpoint

13 Experiment Result - Applying Partial Replication
Results for Partial Replication and for Partial Replication + Signature Analysis (including checkpointing and restart) on 32 nodes, for Sobel and CG. Sobel replicates the last 40% of the execution, while CG replicates the first 40%.

14 Thanks

15 Backup Slides I

16 Motivation
Traditional checkpoint and restart no longer satisfies the fault-tolerance needs of large-scale systems.
- Huge amount of waste under low MTBF and a large number of nodes
- Workload distribution on 100K nodes
- An alternative solution is needed for larger system scales:
  - Decreasing MTBF
  - Increasing checkpoint size due to increasing system size

