Statistical Fault Detection and Analysis with AutomaDeD
Greg Bronevetsky, in collaboration with Ignacio Laguna, Saurabh Bagchi, Bronis R. de Supinski, Dong H. Ahn, and Martin Schulz
Lawrence Livermore National Laboratory

LLNL-PRES

Reliability is a Critical Challenge in Large Systems
- Need tools to detect faults and identify their causes
  - Fault tolerance: requires fault detection
  - System management: need to know what failed
- Faults come from various causes
  - Hardware: soft errors, marginal circuits, physical degradation, design bugs
  - Software: coding bugs, misconfigurations

In General, Fault Detection and Fault Tolerance are Undecidable
- Option 1: Make all applications fault resilient
  - Application-specific solutions are hard to design
  - Many applications
  - How does fault resilience compose?
- Option 2: Develop approximate fault detection, tolerate faults via checkpointing and similar techniques
  - Statistically model application behavior
  - Look for deviations from model behavior
  - Identify model components that likely caused the deviation

Focus on Modeling Individual MPI Applications
- Primary goal is fault detection for HPC applications
  - Model behavior of a single MPI application
  - Detect deviations from the norm
  - Identify origin of deviation in time/space
- Other branches of the field
  - Model system component interactions
  - Model application as dataflow graph of modules
  - Model micro-architecture state as vulnerable/non-vulnerable (ACE analysis)

Goal: Detect Unusual Application Behavior, Identify Cause
- Single run, spatial: differences between behavior of processes
- Single run, temporal: differences between one time point and others
- Multiple runs: differences between behavior of runs
[Figure: MPI application]

Semi-Markov Models
- SMM: transition system
  - Nodes: application states
  - Edges: transitions from one state to another
    - Probability of transition
    - Time spent in prior state before transition
[Figure: example SMM with states A, B, C, D; edges labeled probability / time: 0.2 / 5μs, 0.7 / 15μs, 0.1 / 500μs]
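
The node and edge description above can be made concrete with a small sketch. This is not AutomaDeD's implementation: the trace format, the function name build_smm, and the choice of A as the source of all three labeled edges are assumptions made for illustration. It derives, for each observed transition, its probability and the list of times spent in the prior state.

```python
from collections import defaultdict

def build_smm(trace):
    """Build a semi-Markov model from a list of (state, entry_time_us) events.

    Returns {(src, dst): {"prob": p, "times": [t, ...]}}, where p is the
    fraction of departures from src that went to dst, and times holds the
    time spent in src before each such transition.
    """
    times = defaultdict(list)
    for (src, t_in), (dst, t_next) in zip(trace, trace[1:]):
        times[(src, dst)].append(t_next - t_in)
    departures = defaultdict(int)
    for (src, _dst), samples in times.items():
        departures[src] += len(samples)
    return {
        edge: {"prob": len(samples) / departures[edge[0]], "times": samples}
        for edge, samples in times.items()
    }

# Hypothetical trace: state A is left ten times, toward B, C, or D.
trace = []
clock = 0
for dst, dwell in [("B", 5)] * 2 + [("C", 15)] * 7 + [("D", 500)]:
    trace.append(("A", clock))
    clock += dwell              # time spent in A before the transition
    trace.append((dst, clock))
    clock += 1                  # time spent in dst before returning to A

smm = build_smm(trace)
```

With this trace, the edges leaving A carry probabilities 0.2, 0.7, and 0.1 and dwell times of 5, 15, and 500 μs, matching the labels in the example figure.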

SMMs Represent Application Control Flow
- SMM states correspond to
  - Calls to MPI
  - Code between MPI calls
- Different state for different calling context

Application code:

    main() {
      MPI_Init()
      … Computation …
      MPI_Send(…, 1, MPI_INTEGER, …);
      for(…) foo();
      MPI_Recv(…, 1, MPI_INTEGER, …);
      MPI_Finalize();
    }

    foo() {
      MPI_Send(…, 1024, MPI_DOUBLE, …);
      … Computation …
      MPI_Recv(…, 1024, MPI_DOUBLE, …);
      … Computation …
    }

[Figure: semi-Markov model of this code; state labels include Init, Computation, Send-INT, Send-DBL, Recv-DBL, Recv-INT, and Finalize, each tagged with its calling context (main() or main() foo())]

Transitions Represent Time Spent at States
- During execution, each transition is observed multiple times
  - Time series of transition times: [t_1, t_2, …, t_n]
- Represented as a probability distribution
  - Gaussian
  - Histogram

Transitions Represent Time Spent at States
- Gaussian: cheaper, lower accuracy
- Histogram: more expensive, greater accuracy
[Figure: data samples plotted as probability vs. time value, fitted with a Gaussian and with a histogram (bucket counts, line connectors, Gaussian tail)]
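
The trade-off between the two representations can be illustrated with a rough sketch. All function names and sample values here are hypothetical, and the histogram is a plain equal-width one; it does not reproduce the line connectors or Gaussian tail shown in the figure.

```python
import math

def fit_gaussian(samples):
    """Summarize samples by mean and (population) standard deviation."""
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    return mean, math.sqrt(var)

def gaussian_density(x, mean, std):
    """Gaussian probability density at x; assumes std > 0."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def fit_histogram(samples, num_buckets):
    """Equal-width buckets over [min, max]; assumes max > min."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for x in samples:
        # The maximum sample is clamped into the last bucket.
        counts[min(int((x - lo) / width), num_buckets - 1)] += 1
    return lo, width, [c / len(samples) for c in counts]

def histogram_probability(x, lo, width, probs):
    """Probability mass of the bucket containing x, or 0 outside the range."""
    index = int((x - lo) / width) if x >= lo else -1
    if index == len(probs) and x <= lo + width * len(probs):
        index -= 1  # the maximum sample belongs to the last bucket
    return probs[index] if 0 <= index < len(probs) else 0.0

# Hypothetical transition times in microseconds: a tight group plus one slow sample.
times_us = [10.0, 11.0, 12.0, 13.0, 14.0, 30.0]
mean, std = fit_gaussian(times_us)
lo, width, probs = fit_histogram(times_us, num_buckets=4)
```

On this data the Gaussian stores only two numbers but assigns a higher density to 17 μs, where no sample fell, than to 12 μs, where most samples cluster. The histogram costs one value per bucket and reports zero probability for the empty bucket around 17 μs.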

Using SMMs to Help Detect Faults
- Hardware faults → behavior abnormalities
- Given sample runs, learn the time distribution on each transition
  - (Top and bottom 0% or 10% of each transition's times removed)
- If some transition takes an unusual amount of time, declare it an error
[Figure: probability vs. time value]

Detection Threshold Computed from Maximum Normal Variation
- Need a threshold to separate normal from abnormal timing
- Threshold = lowest probability observed in the set of sample runs
  - (Top and bottom 1% removed)
[Figure: probability vs. time value]

Filtering                  False positive rate
Nothing removed            0%
Top/bottom 10% removed     19%
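
A minimal sketch of this thresholding rule for a single transition, assuming a Gaussian time model and that the threshold is taken over the samples that survive trimming. The function names, the trimming helper, and the sample values are all hypothetical, not taken from AutomaDeD.

```python
import math

def trim(samples, fraction):
    """Drop the top and bottom `fraction` of the sorted samples."""
    ordered = sorted(samples)
    k = int(len(ordered) * fraction)
    return ordered[k:len(ordered) - k] if k else ordered

def density(x, mean, std):
    """Gaussian probability density at x; assumes std > 0."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def train(samples, trim_fraction=0.0):
    """Fit the time model and take the lowest density seen in training as threshold."""
    kept = trim(samples, trim_fraction)
    mean = sum(kept) / len(kept)
    std = math.sqrt(sum((x - mean) ** 2 for x in kept) / len(kept))
    threshold = min(density(x, mean, std) for x in kept)
    return mean, std, threshold

def is_abnormal(x, model):
    """Flag a transition time whose density falls below the training threshold."""
    mean, std, threshold = model
    return density(x, mean, std) < threshold

# Hypothetical transition times (microseconds) from fault-free sample runs.
sample_times = [14.0, 15.0, 15.0, 15.0, 16.0, 15.0, 14.0, 16.0, 15.0, 15.0]
model = train(sample_times)
```

Here is_abnormal(15.0, model) is false and is_abnormal(500.0, model) is true. Trimming tightens the fitted distribution, which can make detection more sensitive but also lets trimmed-out normal times fall below the threshold.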

Evaluated Fault Detector Using Fault Injection
- NAS Parallel Benchmarks
  - 16-process runs
  - Input class A
  - Used BT, CG, FT, MG, LU and SP (EP and IS use MPI in very simple ways)
- Local delays (FIN_LOOP): 1, 5, 10 sec
- MPI message drop (DROP_MESG) or repetition (REP_MESG)
- Extra CPU-intensive (CPU_THR) or memory-intensive (MEM_THR) thread

Rates of Fault Detection Within 1ms of Injection
- Filtering usually improves detection rates
- Single-point events are easier to detect than persistent changes
[Figure: detection rates; legend: no detection, false detection before injection, detection of fault within 1ms, detection after 1ms]

SMMs Used to Help Identify Software Faults in MPI Applications
- User knows the application has a fault but needs help to focus on the cause
- Help identify the point where the fault first manifests as a change in application behavior
- Key tasks on a faulty run:
  - Identify time period of manifestation
  - Identify task where fault first manifested
  - Identify code region where fault first manifested

Focus on the Time Period of Unusual Behavior
- User marks phase boundaries in code
- Compute an SMM for each task/phase
[Figure: one SMM per task (Task 1, Task 2, ..., Task n) for each phase]

Focus on the Time Period of Abnormal Behavior
- Find the phase with the most unusual SMMs
- If sample runs are available, compare the faulty run's SMMs to the sample runs' SMMs
- If none are available, compare each phase to the others
[Figure: phases of the faulty run compared with phases of a sample run]
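
For the case with no sample runs, the comparison of each phase to the others might look like the sketch below. The distance used here (sum of absolute differences in transition probabilities) is only a stand-in for the real SMM difference function, and the per-phase models and names are hypothetical.

```python
def smm_distance(a, b):
    """Stand-in distance: sum of absolute differences in transition probability."""
    edges = set(a) | set(b)
    return sum(abs(a.get(e, 0.0) - b.get(e, 0.0)) for e in edges)

def most_unusual_phase(phase_models):
    """Score each phase by its mean distance to all other phases.

    Returns (index of the highest-scoring phase, list of scores).
    """
    scores = []
    for i, model in enumerate(phase_models):
        others = [m for j, m in enumerate(phase_models) if j != i]
        scores.append(sum(smm_distance(model, o) for o in others) / len(others))
    worst = max(range(len(scores)), key=scores.__getitem__)
    return worst, scores

# Hypothetical per-phase models, each mapping (src, dst) -> transition probability.
normal = {("A", "B"): 0.5, ("A", "C"): 0.5}
odd = {("A", "B"): 0.1, ("A", "C"): 0.9}
phases = [normal, normal, odd, normal]

worst_phase, phase_scores = most_unusual_phase(phases)
```

In this toy input the third phase (index 2) receives the highest score because it is far from every other phase, while the normal phases are far from only one.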

Cluster Tasks According to Behavior to Identify Abnormal Task
- User provides the application's natural cluster count k
- Use a sample execution to compute the clustering threshold τ that produces k clusters
  - Use sample runs if available
  - Otherwise, compute τ from the start of execution
- During real runs, cluster tasks using threshold τ
[Figure: master-worker example, Tasks 1 to 9 grouped into clusters, with a bug in Task 9]
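
One way to picture the role of τ is the sketch below, which assumes single-linkage clustering and reduces each task's behavior to one hypothetical number. The slides do not specify the clustering algorithm or how tasks are compared, so every name and value here is illustrative.

```python
from itertools import combinations

def cluster(points, dist, tau):
    """Single-linkage clustering: tasks within distance tau end up in one cluster."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= tau:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

def calibrate_tau(points, dist, k):
    """Smallest pairwise distance that, used as tau, yields exactly k clusters."""
    for tau in sorted({dist(a, b) for a, b in combinations(points, 2)}):
        if len(cluster(points, dist, tau)) == k:
            return tau
    return None

def gap(a, b):
    """Hypothetical task distance: difference of one-number behavior summaries."""
    return abs(a - b)

# Hypothetical master-worker run: index 0 is the master, indices 1..8 are workers.
sample_run = [90, 10, 11, 9, 10, 12, 10, 9, 11]
tau = calibrate_tau(sample_run, gap, k=2)

# Faulty run: the last task (Task 9 in the figure, index 8) behaves differently.
faulty_run = [90, 10, 11, 9, 10, 12, 10, 9, 50]
faulty_clusters = cluster(faulty_run, gap, tau)
```

With τ calibrated on the sample run to give the two natural clusters, the faulty run splits into three clusters and the misbehaving task sits alone in its own cluster.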

Cluster Tasks According to Behavior to Identify Abnormal Task
- Compare tasks in each cluster to their behavior in
  - Sample runs
  - Start of execution
- Most abnormal task is identified
- Transition most responsible for the difference is identified as the origin
[Figure: clustered Tasks 1 to 9, with a bug in Task 9]

From Clustering, Identify Transition Where Fault First Manifested
- SMM difference function combines
  - Difference between transition probabilities
  - Difference between transition time distributions
- Transition most responsible for inter-cluster differences is identified as the manifestation origin
  - Uses ranking algorithm
[Figure: SMMs of Task i and Task j compared; clustered Tasks 1 to 9]
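
The slides name the two ingredients of the difference function but not its formula. The sketch below is one plausible reading, not the actual function: it adds the absolute difference in transition probability to the total variation distance between time histograms, then ranks transitions by their contribution. All names and data are hypothetical.

```python
def total_variation(h1, h2):
    """Half the L1 distance between two {bucket: probability} histograms."""
    buckets = set(h1) | set(h2)
    return 0.5 * sum(abs(h1.get(b, 0.0) - h2.get(b, 0.0)) for b in buckets)

def edge_contributions(smm_a, smm_b):
    """Per-transition difference: probability gap plus time-distribution gap."""
    empty = {"prob": 0.0, "hist": {}}
    result = {}
    for edge in set(smm_a) | set(smm_b):
        a, b = smm_a.get(edge, empty), smm_b.get(edge, empty)
        result[edge] = (abs(a["prob"] - b["prob"])
                        + total_variation(a["hist"], b["hist"]))
    return result

def smm_difference(smm_a, smm_b):
    """Overall difference between two SMMs: sum of per-transition contributions."""
    return sum(edge_contributions(smm_a, smm_b).values())

def rank_transitions(smm_a, smm_b, top=5):
    """Transitions ordered by contribution, largest first (ties broken by name)."""
    contrib = edge_contributions(smm_a, smm_b)
    return sorted(contrib, key=lambda e: (-contrib[e], e))[:top]

# Hypothetical SMMs for two tasks; histogram buckets are times in microseconds.
task_i = {
    ("Send", "Comp"): {"prob": 1.0, "hist": {10: 1.0}},
    ("Comp", "Recv"): {"prob": 1.0, "hist": {100: 1.0}},
}
task_j = {
    ("Send", "Comp"): {"prob": 1.0, "hist": {10: 0.2, 500: 0.8}},
    ("Comp", "Recv"): {"prob": 1.0, "hist": {100: 1.0}},
}

ranking = rank_transitions(task_i, task_j)
```

Here the Send to Comp transition ranks first because its time distribution shifted, while the unchanged Comp to Recv transition contributes nothing.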

Evaluated Fault Detector Using Fault Injection
- NAS Parallel Benchmarks
  - 16-task, Class A: BT, CG, FT, MG, LU and SP
  - 2000 injection experiments per application
- Local livelock/deadlock (FIN_LOOP, INF_LOOP)
- Message drop (DROP_MESG), repetition (REP_MESG)
- CPU-intensive (CPU_THR) or memory-intensive (MEM_THR) thread
- Examined variants of training runs
  - 20 training runs with no faults
  - 20 training runs, 10% have fault
  - No training runs

Phase Detection Accuracy
- Accuracy ~90% for loops and message drops, ~60% for extra threads
- Training significantly better than no training (10% bug training is close)
- Histograms better than Gaussians
[Figure panels: Training vs. No Training; No-Fault Sample vs. Some Faults; Gaussian vs. Histogram]

Cluster Isolation Accuracy
- Results assume phase detected accurately
- Accuracy of cluster isolation highly variable
  - Depends on propagation of fault's effects
  - Accuracy up to 90% for extra threads
  - Poor detection elsewhere since no information on event timing

Cluster Isolation Accuracy
- Extended cluster isolation with information on event order
  - Focuses on first abnormal transition
  - Significantly better accuracy for loop faults

Transition Isolation Accuracy: Injected Transition in Top 5 Candidates
- Accuracy ~90% for loop faults
- Highly variable for others
- Less variable if event order information is used

Abnormality Detection Helps Illuminate MVAPICH Bug
- Job execution script failed to clean up at job end, left runaway processes on nodes
- Simulated by executing BT (16- and 64-task runs) concurrently with LU, MG or SP (16-task runs)
  - Overlap execution during initial portion of BT run
- Experiments show
  - Average SMM difference in regular BT runs
  - Difference between BT runs with interference and no-interference runs

Behavior Modeling is a Critical Component of Fault Detection and Analysis
- Complex behavior of applications and systems
  - Statistical models provide accurate summary
- Promising results
  - Quick detection of faults
  - Focused localization of root causes
- Ongoing work
  - Scaling implementations to real HPC systems
  - Improving accuracy through
    - More data
    - Models custom-tailored to applications