Luanzheng Guo, Dong Li University of California Merced


FlipTracker: Understanding Natural Error Resilience in HPC Applications
Luanzheng Guo, Dong Li (University of California, Merced); Ignacio Laguna (LLNL); Martin Schulz (TUM)
In collaboration with UCM, LLNL, and TUM. SC'18, November 13, Dallas, TX

Application natural resilience can tolerate soft errors
The danger of soft errors grows dramatically as HPC systems scale. Soft errors are transient and difficult to handle: we cannot tell when or where the next soft error will occur. They are caused by electrical noise or strikes by external high-energy particles, which lead to bit flips (from 0 to 1 or from 1 to 0) in storage cells. Such an error can bypass hardware detection mechanisms and affect the application state. We think application natural resilience can help handle soft errors.

What is application natural resilience?
It is the capability of an application to tolerate soft errors. This capability is natural, or inherent: no modifications are required to the code. The application handles soft errors entirely by itself, without add-on mechanisms such as ABFT, checkpoint/restart, or redundancy.

Motivation of our work
Previous studies have shown that many applications have natural resilience that helps them tolerate soft errors. Examples of such applications are algebraic multigrid (AMG) solvers, Conjugate Gradient (CG) solvers, Monte Carlo simulations, and machine learning algorithms such as clustering and deep neural networks. Prior work attributes this natural resilience to the statistical and iterative structures of applications, but why does a given application have natural resilience? We do not have a framework to identify the reason.

The goal of FlipTracker
To design a code structure model that separates an application into code regions. To build a framework that allows fine-grained analysis of error propagation and resilience properties. To propose a methodology that helps reason about the natural resilience of code segments. Remember this!

Fault model
We use random fault injection to mimic the effect of real soft errors in the application. When an error is injected, we define three possible error manifestations:
Success: the execution output passes the verification of the application.
SDC (Silent Data Corruption): the execution terminates, but the output does not pass the verification of the application.
Interruption: the execution crashes in the middle of the run.
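To make the fault model concrete, here is a minimal sketch of how a single bit flip can be injected into a double. This is our own illustration, not FlipTracker's actual injector; the helper name flip_bit and the use of memcpy for bit reinterpretation are assumptions.

```c
#include <stdint.h>
#include <string.h>

/* Flip a single bit of a double, mimicking a soft error in a storage cell.
   `bit` is the bit index (0..63) within the IEEE-754 representation. */
double flip_bit(double value, int bit) {
    uint64_t raw;
    memcpy(&raw, &value, sizeof raw);   /* reinterpret bits without UB */
    raw ^= (uint64_t)1 << bit;          /* the bit flip itself */
    memcpy(&value, &raw, sizeof value);
    return value;
}
```

A random fault injector would pick the target instruction, the target location, and the bit index uniformly at random; flipping the same bit twice restores the original value.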

Application Code Region Model
We characterize HPC applications as sets of iterative structures, or loops, where the program spends most of its time. A sketch of an HPC application annotated with code regions:

    y = ...                    // non-resilience code region
    while (...) {
        for (...) {
            x = y * ...
            double tmp = x + ...
            result = ...       // resilience code region
        }
    }
    z = result * ...           // non-resilience code region

Our loop-based model also enables a divide-and-conquer approach, in which we can identify application subcomponents that may or may not have resilience patterns.

An overview of FlipTracker
We run the application twice, once cleanly and once with fault injection, producing a clean dynamic instruction trace and a faulty dynamic instruction trace, each divided into code regions.

An overview of FlipTracker (continued)
For each pair of code regions, we build a Dynamic Data Dependence Graph (DDDG) from each trace and compare them from top down. If the outputs of the two DDDGs are the same, we have found a resilience code region; we then build an Alive Corrupted Locations (ACL) table to locate error-masking events and extract resilience computation patterns. Output variables are those that are written in the code region and read after the code region. We will explain each phase in detail in the next slides.

Dynamic Data Dependence Graph
For each code region, we generate a DDDG from its instruction trace. In the pictured example from conj_grad, LOAD operations bring in input variables (e.g., the operand p156, which is a register, and the variable beta), MUL operations produce internal variables, and a STORE produces an output variable. The DDDG has multi-fold usages, such as identifying the inputs, internal variables, and outputs of a code region.
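As an illustration only, the producer-to-consumer edges of a DDDG can be recovered from a trace by tracking the last writer of each location. The struct layout, NLOC, and build_edges below are hypothetical names for this sketch, not FlipTracker's implementation.

```c
#define NLOC 16   /* number of tracked locations (registers/addresses) */

/* A trace entry: instruction `id` writes `dest` after reading src1 and
   src2 (-1 means the operand slot is unused). */
struct instr { int id, dest, src1, src2; };

/* last_writer[loc] = id of the instruction that last wrote `loc`, or -1
   if the location is still an input of the region (never written here). */
static int last_writer[NLOC];

/* Replay the trace and count dependence edges; an edge exists when a read
   operand was produced by an earlier instruction inside the region. */
int build_edges(const struct instr *trace, int n) {
    int edges = 0;
    for (int i = 0; i < NLOC; i++) last_writer[i] = -1;
    for (int i = 0; i < n; i++) {
        int srcs[2] = { trace[i].src1, trace[i].src2 };
        for (int s = 0; s < 2; s++)
            if (srcs[s] >= 0 && last_writer[srcs[s]] >= 0)
                edges++;                    /* producer -> consumer edge */
        last_writer[trace[i].dest] = trace[i].id;
    }
    return edges;
}
```

Reads whose last writer is -1 identify the region's input variables; locations written but never read afterwards would be its internal or output variables.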

How do we identify resilience code regions?
We find a resilience code region in either of two cases. Case (1): the value of an input location of the code region is corrupted, yet the values of all of its output locations are correct. Locations can be memory locations or registers.

How do we identify resilience code regions? (continued)
Case (2): the value of an input location of the code region is corrupted, and the value of some output locations is also corrupted.

How do we identify resilience code regions? (continued)
In case (2), however, the region counts as resilient only if the error magnitude in at least one corrupted location becomes smaller.

Alive Corrupted Locations table
For any resilience code region, we use the DDDG to build a table of Alive Corrupted Locations (ACL). The ACL table stores the number of alive, corrupted locations at each dynamic instruction. We call a location "alive" if the value in that location will be referenced again in the remainder of the computation. Consider location A: during the window in which it is alive corrupted, its corrupted value can still affect the execution; before the window A is clean, and after the window A is either never used again or has been overwritten by a clean value, so the corruption no longer takes effect. Plotting the count along the execution timeline gives, for example, 0 ACLs at instruction 1, 3 ACLs at instruction k, and 2 ACLs at instruction n.

Alive Corrupted Locations: the lifetime of location A
In the figure, white means clean and red means corrupted. At some instruction, A becomes affected by an error and is alive corrupted. A stays alive corrupted as long as it is used by this or a later instruction. A becomes clean again once it is not used anymore, or once it is overwritten with a clean value.
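A minimal sketch of how the number of alive corrupted locations could be counted, assuming each location's corruption window is already known from the trace; struct location and acl_count are hypothetical names for this illustration.

```c
/* One tracked location: the dynamic-instruction index at which its value
   became corrupted, and the index of its last use before it is dead or
   overwritten by a clean value. corrupt_start == -1: never corrupted. */
struct location { int corrupt_start, last_use; };

/* Number of Alive Corrupted Locations at dynamic instruction t: locations
   whose corruption window [corrupt_start, last_use] covers t. */
int acl_count(const struct location *locs, int n, int t) {
    int count = 0;
    for (int i = 0; i < n; i++)
        if (locs[i].corrupt_start >= 0 &&
            t >= locs[i].corrupt_start && t <= locs[i].last_use)
            count++;
    return count;
}
```

Evaluating acl_count at every dynamic instruction yields exactly the curve the ACL table describes: it rises when errors propagate to new locations and falls at each error-masking event.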

A real example of the ACL table
We map and visualize the ACL table in a figure showing the number of alive corrupted locations in LULESH after an error is injected, plotted against the dynamic instruction number (×10^7) and annotated with the LagrangeNodal() and LagrangeElements() phases over three iterations. After the error is triggered, it is tolerated within three iterations. We want to understand the cause of the four decreases in the number of alive corrupted locations, where error masking happens; these can guide us to resilience computation patterns.

Resilience computation patterns
We define a resilience computation pattern as a combination of a series of computations (or instructions) that contribute to error-masking events. We find six resilience computation patterns after examining the error-masking events of 10 representative applications.

Resilience pattern: Dead Corrupted Locations
Pattern 1: Dead Corrupted Locations (DCL). In this pattern, corrupted locations are never used again. Example from LULESH: the array hourgam[][] is a temporary array that is corrupted but dead after the current code snippet.
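A toy illustration of the DCL pattern (ours, not the LULESH code): the corrupted location is dead by the time the corruption lands, so it never reaches the region's output.

```c
/* Dead Corrupted Locations, in miniature: `tmp` is corrupted only after
   its last use, so the corruption never reaches the returned result. */
double dcl_example(double corruption) {
    double tmp[4] = {1.0, 2.0, 3.0, 4.0};
    double sum = 0.0;
    for (int i = 0; i < 4; i++)
        sum += tmp[i];        /* last (and only) use of tmp */
    tmp[2] += corruption;     /* a late bit flip lands here ...          */
    return sum;               /* ... but tmp is dead: the output is clean */
}
```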

Resilience pattern: Repeated Additions
Pattern 2: Repeated Additions (RA). In this pattern, correct values are repeatedly added to the value of a corrupted location.

Resilience pattern: Repeated Additions (example)
Example of the Repeated Additions pattern in MG: when u[] is corrupted, we examine the value of the array element u[10][10][10]. This code region is iteratively called four times.
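A toy illustration of why repeated additions help mask an error: the absolute error stays fixed under pure addition, but the relative error shrinks as correct contributions accumulate. The function name relative_error_after and the constant 100.0 are assumptions of this sketch, not taken from MG.

```c
#include <math.h>

/* Repeated Additions, in miniature: a corrupted accumulator repeatedly
   absorbs correct values, so its *relative* error shrinks. */
double relative_error_after(double error, int n_additions) {
    double clean = 0.0, faulty = error;   /* faulty starts corrupted */
    for (int i = 0; i < n_additions; i++) {
        clean  += 100.0;                  /* correct contributions */
        faulty += 100.0;
    }
    return fabs(faulty - clean) / fabs(clean);
}
```

In MG the masking additionally involves the stencil's averaging structure, which attenuates the error itself across iterations; this sketch shows only the dilution effect.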

Resilience patterns: Conditional Statement, Shifting, Truncation, and Overwriting
Pattern 3: Conditional Statement. A conditional statement helps tolerate errors.
Pattern 4: Shifting. Corrupted bits are lost due to shifting operations.
Pattern 5: Data Truncation. Corrupted data is truncated or never presented to users.
Pattern 6: Data Overwriting. Corrupted data is overwritten by a correct value.
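Toy illustrations (ours, not taken from the benchmarks) of how shifting, truncation, and overwriting can each mask a corruption; all three function names are hypothetical.

```c
#include <stdint.h>

/* Pattern 4, Shifting: right-shifting discards corrupted low-order bits. */
uint32_t shift_out_low(uint32_t x) {
    return x >> 4;            /* any corruption in bits 0..3 is lost */
}

/* Pattern 5, Truncation: converting to a narrower representation drops
   corrupted low-precision bits (here: the fractional part of a double). */
int truncate_to_int(double x) {
    return (int)x;            /* a flip in the fraction never survives */
}

/* Pattern 6, Overwriting: the corrupted value is replaced wholesale. */
double overwrite(double corrupted, double clean) {
    (void)corrupted;          /* the corrupted value is never read again */
    return clean;
}
```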

Case studies
Resilience computation patterns have many potential uses; we present two preliminary case studies. Use case 1: Resilience-Aware Application Design, in which we apply resilience patterns to the CG benchmark, aiming to improve its resilience. Use case 2: Predicting Application Resilience, in which we build a regression model that uses resilience patterns to predict the success error manifestation.

Use case 1: Resilience-Aware Application Design
The goal of this use case is to show that resilience computation patterns can guide application design toward natural resilience. We successfully apply three patterns to the CG benchmark: dead corrupted locations (DCL), data overwriting, and truncation. To apply DCL and data overwriting, we replace two global arrays referenced in sprnvc() with two temporary arrays, then copy the updated values of the temporary arrays back to the global arrays after the computation. To apply truncation, we replace 64-bit floating-point multiplications with 32-bit integer multiplications for some computations.

Use case 1: Applying resilience patterns to CG

Applying truncation to CG. Original code:

    static void conj_grad(int colidx[], ..., double p[], double q[]) {
        ...
        d = 0.0;
        for (j = 0; j < lastcol - firstcol + 1; j++) {
            d = d + p[j]*q[j];
        }
        ...
    }

With truncation applied:

    static void conj_grad(int colidx[], ..., double p[], double q[]) {
        ...
        d = 0.0;
        for (j = 0; j < lastcol - firstcol + 1; j++) {
            if (j <= 350 && j >= 340) {
                int tmp  = p[j];       // truncation
                int tmp1 = q[j];       // truncation
                d = d + tmp*tmp1;
            } else {
                d = d + p[j]*q[j];
            }
        }
        ...
    }

Applying DCL and data overwriting to CG. Original code:

    void sprnvc() {
        while (...) {
            ...
            v[nzv] = vecelt;
        }
    }

With DCL and overwriting applied:

    void sprnvc() {
        double v_tmp[NONZER+1];        // define a temporary array
        for (i = 0; i <= NONZER; i++) {
            v_tmp[i] = v[i];           // initialization
        }
        while (...) {
            ...
            v_tmp[nzv] = vecelt;       // replace v with v_tmp
        }
        for (i = 0; i <= NONZER; i++) {
            v[i] = v_tmp[i];           // copy back
        }
    }

Use case 1: Results
The number of fault injections is determined by a 99% confidence level and a 1% margin of error.

Comparison of application resilience after applying resilience patterns to CG:

    Resilience patterns applied | Application resilience (success rate) | Average execution time (s)
    None (baseline)             | 0.590                                 | 159.010
    DCL and overwriting         | 0.780                                 | 159.167
    Truncation                  | 0.614                                 | 158.835
    All together                | 0.782                                 | 158.859

DCL and overwriting yield a 32.2% improvement in application resilience with less than 0.1% performance loss. Truncation yields a 4.1% improvement with no performance loss. All together (DCL, overwriting, and truncation), we obtain a 32.5% improvement with less than 0.1% performance loss.

Conclusions
We design a framework that enables fine-grained, comprehensive analysis of error propagation to capture application resilience. We give an analysis and formal definition of the six resilience computation patterns that we discover in 10 representative programs. We think resilience computation patterns can not only enable a deeper understanding of application resilience, but also guide future application designs toward patterns with resilience. Remember this.

Backup slide 1
Q: How many more patterns did you not find?
A: Using FlipTracker, we find six resilience computation patterns in 10 benchmarks. But this does not mean there are no other patterns to be found in more applications.

Backup slide 2
Q: How do you verify the resilience patterns?
A: We verify the effectiveness of the six resilience patterns in the two use cases. In particular, in use case 2 we use the counts of resilience patterns as features to predict the success rate of new applications, and we achieve good prediction accuracy. We also perform an importance study of the features (resilience patterns) in use case 2.