LetGo: A Lightweight Continuous Framework for HPC Applications under Failures Bo Fang, Qiang Guan, Nathan DeBardeleben, Karthik Pattabiraman and Matei.

LetGo: A Lightweight Continuous Framework for HPC Applications under Failures
Bo Fang, Qiang Guan, Nathan DeBardeleben, Karthik Pattabiraman and Matei Ripeanu (LA-UR )

Resilience is a critical factor for a system to scale.
$$\mathit{MTBF}_{system} = \frac{\mathit{MTBF}_{node}}{\text{Number of nodes}}$$
Fault tolerance techniques are needed!
a. Large-scale systems are the main computing platform for HPC workloads such as scientific applications.
b. One of the main concerns in maintaining such a system is resilience. One metric used here is the mean time between failures (MTBF). Theoretically, the system-level MTBF is the per-node MTBF divided by the number of nodes. For systems with 10,000 to 100,000 nodes, the system MTBF ranges from about 87 hours down to about 9 hours, and it keeps shrinking as the system scales up.
c. Error resilience is therefore a key factor for scaling, and fault tolerance techniques are necessary.
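
As a quick check of the numbers in note b, the MTBF relation can be evaluated directly. The sketch below is in Python; the function name `system_mtbf` and the per-node MTBF of 100 years (876,000 hours) are assumptions for illustration. The talk does not state a per-node value, but this one is consistent with the quoted 87-hour and 9-hour figures.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours


def system_mtbf(node_mtbf_hours: float, num_nodes: int) -> float:
    """System-level MTBF, assuming independent, identical node failures."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return node_mtbf_hours / num_nodes


if __name__ == "__main__":
    node_mtbf = 100 * HOURS_PER_YEAR  # assumed value: 100 years per node
    for nodes in (10_000, 100_000):
        print(f"{nodes:>7} nodes: {system_mtbf(node_mtbf, nodes):.1f} h")
```

Under that assumption, a tenfold increase in node count cuts the system MTBF by the same factor, which is why resilience becomes the limiting concern at scale.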

Classical FT Technique: Checkpoint/Restart
[Figure: timeline of an application under C/R. The run is divided into checkpoint intervals, each ending in a checkpoint (CHK). A fault (bit-flip, SW bug) becomes an error and then a crash or an SDC; after a crash, the last checkpoint is loaded (Load).]
$$\mathit{Efficiency} = \frac{\sum(\text{Ckpt interval})}{\sum(\text{Ckpt interval} + \text{Overhead})}$$
$$\text{Optimal checkpoint interval} = f(\text{Ckpt overhead}, \mathit{MTBF}) \quad \text{[Young, Daly]}$$
a. A classical FT technique used in HPC systems is checkpoint/restart (C/R).
b. The high-level idea of C/R is that the application runs for a certain period of time, defined as the checkpoint interval, and then takes a checkpoint, that is, it stores the program state in non-volatile storage. It then runs for another interval, takes another checkpoint, and so on.
c. During normal execution, a fault (a bit flip or a software bug) may occur and affect the application. The fault becomes an error in the application and later leads to different outcomes. The main failure outcomes are crashes and Silent Data Corruptions (SDCs, meaning incorrect output). SDCs are mainly dealt with by application-specific checks. For crashes, a typical C/R system stops the execution, rolls back to the previous checkpoint, and restarts from the stored program state.
d. People configure C/R for the optimal checkpoint interval, which is a function of the checkpoint overhead and the MTBF. The longer the MTBF, the longer the checkpoint interval.
e. The efficiency of a C/R system is measured as the sum of the checkpoint intervals divided by the total time, which includes the intervals, the total checkpoint overhead, and the recovery overhead.
f. For system-level checkpointing, the checkpoint overhead is usually hardware dependent. The question therefore becomes: can we do something about the MTBF that leads to a longer optimal checkpoint interval and a more efficient C/R scheme?
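
The dependence of the optimal interval on checkpoint overhead and MTBF can be made concrete with Young's first-order approximation, $\tau = \sqrt{2\,\delta\,M}$, where $\delta$ is the time to take one checkpoint and $M$ is the MTBF. The Python sketch below uses that approximation together with the efficiency ratio from the slide. The function names and sample numbers are illustrative assumptions, and Daly's higher-order refinement is not shown.

```python
import math


def young_interval(ckpt_overhead_h: float, mtbf_h: float) -> float:
    """Young's first-order optimal checkpoint interval: sqrt(2 * overhead * MTBF)."""
    if ckpt_overhead_h <= 0 or mtbf_h <= 0:
        raise ValueError("overhead and MTBF must be positive")
    return math.sqrt(2.0 * ckpt_overhead_h * mtbf_h)


def efficiency(intervals_h, overheads_h) -> float:
    """Sum of checkpoint intervals over total time (intervals plus all overheads)."""
    useful = sum(intervals_h)
    total = useful + sum(overheads_h)
    return useful / total


if __name__ == "__main__":
    # Assumed numbers: a 0.5 h checkpoint cost and two different MTBFs.
    for mtbf in (9.0, 36.0):
        print(f"MTBF {mtbf:>4} h -> interval {young_interval(0.5, mtbf):.1f} h")
```

With a 0.5-hour overhead, going from a 9-hour to a 36-hour MTBF doubles the interval from 3 to 6 hours, so fewer checkpoints are needed for the same amount of work.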

Design Space for Recovery Strategies
             Application-aware         Application-agnostic
Rollback     [Gamel2013] [Aupy2013]    Classical system-level C/R
Forward      [Wu2016] [Chien2015]      LetGo [Fang2017]
Before answering that, we take a look at the design space for recovery strategies in C/R.
In terms of the direction of recovery, a strategy is either forward or rollback. Forward recovery means that the application does not need to rebuild the state lost in the crash. Rollback means that the application needs to load a previous state from a checkpoint and rebuild the lost state.
In terms of the level at which recovery works, a strategy is either application-aware or application-agnostic. Application-aware recovery takes the application's behaviour and data into account, for a specific design. Application-agnostic recovery provides a more general, yet still efficient, way to recover.
In this work we aim for forward, application-agnostic recovery: we want applications to crash less often (by moving forward) using a general approach.

Why this could work?
Can we detect the error? OS exceptions
After the state is repaired, will the crash happen again? Corrupted state is contained [Li2015]
Can we expect correct results? Applications with convergent behavior
Can we detect incorrect results? App-specific correctness checks
Why do we believe this could work? There are four questions associated with our hypothesis.
Can we detect the errors? Very likely, because the OS has mechanisms that catch exceptions such as segmentation faults.
After the state is repaired by LetGo, will the crash happen again? For now the answer is probably not: our prior research found that faults that lead to crashes are unlikely to propagate far.
Can we expect the application to still produce correct results after moving forward? Maybe. HPC applications run many iterations and require the computation to converge, which may mask errors. Here we present a video that shows how an application converges over iterations and masks a fault. The video models how waves move according to physical rules, and a fault generates an extra splash. Over the iterations the splash disappears, and all the results are correct.
If the result is incorrect, can those errors be detected? Maybe, because many applications contain application-specific checks that filter out obvious errors.

Our Approach: LetGo
Roadmap: Design, Evaluation Methodology, Evaluation Results
I will present our approach: LetGo. The roadmap is as follows: first I will talk about the design of LetGo, and then I will discuss the evaluation methodology for LetGo, and last but not least I will present the evaluation results.

Forward Recovery with LetGo
[Figure: two timelines. Without LetGo, a crash triggers a checkpoint load (Load) and re-execution from the last checkpoint (CHK). With LetGo, execution continues past the crash to the next checkpoint.]
Expected benefits:
Lower checkpoint frequency
Avoid the overhead of restart
Potential risk: higher rate of incorrect results (SDCs)
Let me first remind you how the current C/R system works without LetGo. Here is an application running under a standard C/R system: when a fault occurs that leads to a crash, the application rolls back. With LetGo, instead of loading the checkpoint, the application continues its execution.
The expected benefit is that, because LetGo keeps the application from failing (crashing), the MTBF increases. The checkpoint interval can then, in theory, be lengthened, so fewer checkpoints are needed. In addition, because the application continues, the recovery overhead is avoided.

LetGo – Design Requirements
Transparency: no need to change the application code
Convenience: no need to recompile the application or make changes to the runtime environment
Low overhead
Low rate of newly introduced failures
For LetGo to be used in HPC, we consider the following design requirements:
a. Transparency. LetGo should be able to perform all of its tasks, such as moving the application forward and handling system signals, without changing the application code.
b. Convenience. There should be no need to recompile the application or change the runtime environment.
c. Low overhead. LetGo should incur minimal overhead.
d. Low rate of newly introduced failures. LetGo should not introduce many new failures.

LetGo - Design Overview
[Figure: LetGo design overview. Components and labels: Monitor (OS signal handling), Modifier (process state, process execution).]

Monitor
Intercepts OS signals for applications
Uses GDB to attach to the application at launch
Configures the signal handling (SIGSEGV, SIGBUS, SIGABRT)
Benefits:
No assumption about the compilation level
No change to the application
Low overhead (0.1%)
Let me first briefly introduce the Monitor. Its main goal is to intercept OS signals. In the current implementation, we use GDB to attach to the application as it launches and configure the signal handling through the GDB interface. We support SIGSEGV, SIGBUS and SIGABRT for now. The benefits of this implementation are:
a. We do not need to make assumptions about the compilation level, because GDB works with any of them.
b. There is no need to change the application.
c. We do not need to run the application in debug mode, and the overhead is trivial.
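
The Monitor setup described above can be approximated from the command line. The Python sketch below only builds a GDB invocation; it relies on GDB's `handle` command with the `stop` and `nopass` keywords, so that the listed signals stop the process without being delivered to it. This is an assumption about how such a monitor could be configured, not LetGo's actual implementation, and `build_gdb_command` and `./app` are hypothetical names.

```python
# Signals the talk says the Monitor intercepts (standard POSIX names).
SIGNALS = ("SIGSEGV", "SIGBUS", "SIGABRT")


def build_gdb_command(app, app_args=()):
    """Return an argv list that launches `app` under GDB with signal interception."""
    cmd = ["gdb", "-q"]
    for sig in SIGNALS:
        # stop: halt the process on the signal; nopass: do not deliver it.
        cmd += ["-ex", f"handle {sig} stop nopass"]
    # --args must come last: everything after it is the program and its arguments.
    cmd += ["-ex", "run", "--args", app, *app_args]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_gdb_command("./app", ["-n", "4"])))
```

Because the application runs as an ordinary child of the debugger, no source change or recompilation is involved, which matches the transparency and convenience requirements.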

Modifier
Goal: increase the probability of successful continued execution (low rate of newly introduced failures)
Basic action: move the program counter (PC) to bypass the crash-causing instruction
Optional: heuristics to repair state
The next component of the design is the Modifier. Its goal is to increase the probability of successful continued execution. Its basic action is to move the program counter ahead after a crash. To do this, we analyze the instruction where the crash occurred and find the next PC. We also provide two heuristics that repair the state, so that the application moves forward with a higher chance of success.

Modifier: example
A bit flip affects the computation logic, producing an invalid address.
[Figure: 1-bit full adder with inputs A, B, C_in and outputs S, C_out.]
    push  %rbp
    mov   %rsp, %rbp
    sub   $0x30, %rsp
    movsd -0x28(%rbp), %xmm2    # crashes with SIGSEGV; LetGo moves the PC past it
    subsd %xmm0, %xmm2
    movq  %xmm2, %rax
    mov   %rax, -0x20(%rbp)
Let me show you an example of how the Modifier works. While the application is running, a bit flip may affect the computational units of a processor. Here is a 1-bit full adder: a bit flip could occur on any of its lines, or inside the logic, and make the output wrong.
Now we look at a real program. A bit flip occurs while the address for the movsd is being calculated, and causes a SIGSEGV. LetGo takes over and moves the PC forward.
We use two heuristics to increase the probability of successful continued execution. First, because %xmm2 may contain a random value in this case, we set %xmm2 to 0 before it is used again. Second, we use the possible size of the stack to detect corruption in either %rsp or %rbp.
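
The Modifier's actions in this example (skip the faulting instruction, zero the register it should have written, and sanity-check the stack pointers) can be sketched on a plain register snapshot. This Python model is illustrative only: the register dictionary, the `repair_after_crash` name, the instruction length, and the 0x30-byte frame bound (taken from `sub $0x30, %rsp` in the listing) are assumptions, and the real tool operates on a live process rather than a dictionary.

```python
def repair_after_crash(regs, insn_len, dest_reg=None, frame_size=0x30):
    """Return (repaired register copy, stack_ok flag) after a crash.

    regs       -- dict with at least "rip", "rsp", "rbp" (hypothetical snapshot)
    insn_len   -- byte length of the faulting instruction
    dest_reg   -- register the faulting instruction was supposed to write
    frame_size -- assumed upper bound on rbp - rsp for the current frame
    """
    fixed = dict(regs)  # leave the caller's snapshot untouched
    # Basic action: advance the PC past the crash-causing instruction.
    fixed["rip"] = regs["rip"] + insn_len
    # Heuristic 1: the destination may hold garbage, so feed it a 0.
    if dest_reg is not None:
        fixed[dest_reg] = 0
    # Heuristic 2: rbp should lie within [rsp, rsp + frame_size].
    stack_ok = fixed["rsp"] <= fixed["rbp"] <= fixed["rsp"] + frame_size
    return fixed, stack_ok


if __name__ == "__main__":
    snapshot = {"rip": 0x400100, "rsp": 0x7FFD0000, "rbp": 0x7FFD0030,
                "xmm2": 0xDEADBEEF}
    print(repair_after_crash(snapshot, insn_len=5, dest_reg="xmm2"))
```

The function only reports whether the stack pointers look plausible; what to do when they do not is left open here, since the talk describes the check as a detection step.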

Evaluation questions
Can we detect the error?
After the state is repaired, will the crash happen again?
Can we expect correct results?
Can we detect incorrect results?
Next I will present the evaluation methodology. On the previous slides I showed you our hypothesis about LetGo and why we believe it could work. Let me remind you of those questions.

Evaluation questions
Q1: What is the chance of continued execution?
Q2: What is the chance of successful state repair leading to correct results?
Q3: What is the [increase in the] rate of (un)detected incorrect results?
Q4: What is the performance improvement LetGo brings? What is the impact of system scale?
Evaluation methodology
Fault injection: Q1, Q2, Q3
Simulations: Q4
Here, allow me to formalize them into four evaluation questions. For the first three questions we use fault injection. For the last one we use simulation.

Benchmarks
Application  Domain                        Dyn. inst. (billions)  Data to check for SDC  Criteria used in application check
LULESH       Hydrodynamics                 1.0                    Mesh                   Number of iterations: exactly the same; final origin energy: correct to at least 6 digits; measures of symmetry: smaller than 10^-8
CLAMR        Adaptive mesh refinement      2.8                    (not given)            Threshold for the mass change per iteration
COMD         Classical molecular dynamics  5.1                    Each atom's property   Energy conservation
SNAP         Discrete ordinates transport  1.6                    Flux solution          The flux solution output should be symmetric
PENNANT      Unstructured mesh physics     1.7                    (not given)            Energy conservation (per the speaker notes)
HPL          Dense linear solver           1.2                    Solution vector        Residual check on the solution vector
Five of the applications are marked as having convergent behavior.
The benchmarks we studied in this project cover many scientific domains, and they are all DOE mini-apps. What we want to highlight is the correctness checks we use to detect errors. For example, energy conservation is a key criterion for checking the results of applications like COMD and PENNANT. Also, five of these applications have convergent behaviour, and our results show the average over these five applications.
For Q1 to Q3 we use fault injection. For Q4 we use simulation. I will explain why later.

Fault Injection Methodology
Steps: profile the normal execution, randomly pick an instruction, flip a bit in the target register, observe the outcome.
Outcome    Description
Crash      OS exception
SDC        Undetected incorrect results
Benign     Correct results
Detected   Detected incorrect results
Again, to answer questions 1 to 3 we use fault injection (FI). I will briefly explain how our fault injection works. For each application, we first profile it and generate the set of fault injection sites, which are the dynamic instructions we may inject into. Then, for each FI run, we randomly pick one instruction from that set and flip a bit in its register, which models a bit flip in the computation. We then let the run continue and observe the outcome.
We distinguish SDC from Detected by using application-specific checks. LetGo acts on crashes.
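
The injection step and the outcome table can be modeled in a few lines. The Python sketch below flips one randomly chosen bit of a 64-bit register value and maps a run's observations to the four outcome classes. All names are hypothetical, and the real injector works on a running binary rather than on Python integers.

```python
import random

REG_WIDTH = 64  # assumed register width in bits


def flip_bit(value, bit, width=REG_WIDTH):
    """Return `value` with a single bit inverted, truncated to `width` bits."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    mask = (1 << width) - 1
    return (value ^ (1 << bit)) & mask


def inject(value, rng):
    """Flip one uniformly chosen bit; return (corrupted value, bit index)."""
    bit = rng.randrange(REG_WIDTH)
    return flip_bit(value, bit), bit


def classify(crashed, output_correct, check_flagged):
    """Map one run's observations to an outcome class from the table."""
    if crashed:
        return "Crash"      # OS exception
    if output_correct:
        return "Benign"     # correct results
    # Incorrect output: the application-specific check decides the class.
    return "Detected" if check_flagged else "SDC"


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a campaign can be replayed
    corrupted, bit = inject(0x7FFD0030, rng)
    print(f"flipped bit {bit}: {corrupted:#x}")
```

Seeding the generator is a design choice for reproducibility: the same seed replays the same sequence of injected bits.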

Evaluation Results (average over all benchmarks with convergent behavior)
Without LetGo, a fraction x of the cases lead to crashes, 41% lead to correct outputs (benign), and detected errors and SDCs are around 1%. With LetGo, the crashes go down to 22% and the benign cases go up to 74%. We also observe a marginal increase in SDCs and detected errors.
If we take a closer look at the crash cases, LetGo converts 62% of the crashes into continued executions. The remaining 38% crash again. 57% of the crashes end up benign, and we see a slight increase in detected errors and SDCs.
LetGo allows 62% of crashes to continue executing, with a marginal increase in SDCs. This indicates that with LetGo the system-level MTBF can be increased.

Evaluation questions, revisited
Q1 to Q3: answered by the fault injection results.
Q4: What is the performance improvement LetGo brings? What is the impact of system scale?
Let's revisit the questions. We have answered Q1 to Q3. For Q4, because faults are rare events, showing the actual performance improvement requires simulating much longer runs.

Simulation Results
$$\mathit{Efficiency} = \frac{\text{Useful time}}{\text{Total time}}$$
[Figure: bar charts of C/R efficiency without LetGo (left bars) and with LetGo (right bars), for different checkpoint overheads and for system sizes from 100,000 to 400,000 nodes.]
As checkpoint overheads increase, the LetGo performance advantage increases.
As the system scales, the LetGo performance advantage increases.
To answer the last question, we use simulation and focus on measuring the efficiency of the C/R system. The systems, both with and without LetGo, are modelled as state machines. We take the characteristics of one application (estimated from the previous FI experiments) and assume that it scales to a large number of nodes.
To begin with, we assume 100,000 nodes in the system and an MTBF of 12 hours. We then measure the efficiency both without and with LetGo, for different checkpoint overheads. In each case the left bars are without LetGo and the right bars are with LetGo. In all cases, the system with LetGo has higher efficiency than the one without. For future systems, where checkpoint overheads may increase, using LetGo brings much higher efficiency.
Then we scale the system up from 100,000 to 400,000 nodes. The efficiency decreases as the size grows, but with LetGo it decreases more slowly than without. All benchmarks show similar trends. This shows that LetGo offers higher efficiency than C/R without LetGo as the system scales up.
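
A back-of-the-envelope model points in the same direction as the simulation. The Python sketch below is not the authors' state-machine simulator: it assumes every failure is a crash, that LetGo turns 62% of crashes into continued executions (the fault injection figure), that verification and SDC costs are negligible, and that checkpoints are taken at Young's interval, for which the first-order wasted fraction is $\sqrt{2\delta/M}$ when the checkpoint cost $\delta$ is small relative to the MTBF $M$. The 12-hour MTBF comes from the talk; the checkpoint overheads and function names are illustrative.

```python
import math

CONTINUE_RATE = 0.62  # fraction of crashes LetGo converts to continued runs


def effective_mtbf(mtbf_h, continue_rate=CONTINUE_RATE):
    """MTBF seen by C/R when only (1 - continue_rate) of crashes force a restart."""
    if not 0.0 <= continue_rate < 1.0:
        raise ValueError("continue_rate must be in [0, 1)")
    return mtbf_h / (1.0 - continue_rate)


def first_order_efficiency(ckpt_overhead_h, mtbf_h):
    """Efficiency at Young's interval: 1 - sqrt(2 * overhead / MTBF), clamped at 0."""
    waste = math.sqrt(2.0 * ckpt_overhead_h / mtbf_h)
    return max(0.0, 1.0 - waste)


if __name__ == "__main__":
    mtbf = 12.0  # hours, as in the talk's 100,000-node setup
    for overhead in (0.05, 0.1, 0.5):  # assumed checkpoint costs in hours
        base = first_order_efficiency(overhead, mtbf)
        letgo = first_order_efficiency(overhead, effective_mtbf(mtbf))
        print(f"overhead {overhead:.2f} h: {base:.3f} -> {letgo:.3f}")
```

In this simplified model the gap between the two efficiencies widens when the checkpoint overhead grows or when the MTBF shrinks (as it does at larger node counts), which is the same qualitative trend the simulation reports.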

Conclusion
Can we detect the error?
After the state is repaired, will the crash happen again? 62% do not crash.
Can we expect correct results? 54% correct.
Can we detect incorrect results? 3% detected, which leaves only 2-3% of incorrect results undetected.

Conclusion
LetGo improves the overall C/R efficiency and leads to a longer checkpoint interval.
LetGo offers a “clean” solution for HPC environments.
Netsyslab UBC:
Dependability lab UBC:

Modifier: example
Challenge I: data corruption (bit-flip, producing an invalid address)
    push  %rbp
    mov   %rsp, %rbp
    sub   $0x30, %rsp
    movsd -0x28(%rbp), %xmm2
    subsd %xmm0, %xmm2
    movq  %xmm2, %rax
    mov   %rax, -0x20(%rbp)
After the faulting movsd is skipped, %xmm2 may contain a random value, so it is set to 0 before subsd uses it (shown on the slide as movsd $0x0, %xmm2).
Challenge II: pointer corruption
    push  %rbp
    mov   %rsp, %rbp
    sub   $0x30, %rsp
    …
%rbp should be within [%rsp, %rsp+0x30].

State Machine of a C/R with LetGo
[Figure: state machine over the states COMP, VERIF, CHK, LETGO and CONT.]
State   Description
COMP    Application runs in a checkpoint interval
VERIF   Application checks the correctness
CHK     Application takes a checkpoint
LETGO   LetGo fixes the crash
CONT    Application continues after the fix

Computation Convergence Example
[Video: a wave simulation in which a fault generates an extra splash that disappears over the iterations, leaving correct results.]
