Statistical Fault Detection and Analysis with AutomaDeD
Greg Bronevetsky (Lawrence Livermore National Laboratory), in collaboration with Ignacio Laguna, Saurabh Bagchi, Bronis R. de Supinski, Dong H. Ahn, and Martin Schulz

Reliability is a Critical Challenge in Large Systems
- Need tools to detect faults and identify their causes
  - Fault tolerance requires fault detection
  - System management needs to know what failed
- Faults come from various causes
  - Hardware: soft errors, marginal circuits, physical degradation, design bugs
  - Software: coding bugs, misconfigurations

In General, Fault Detection and Fault Tolerance Are Undecidable
- Option 1: Make all applications fault resilient
  - Application-specific solutions are hard to design
  - There are many applications
  - How does fault resilience compose?
- Option 2: Develop approximate fault detection, and tolerate faults via checkpointing and similar techniques
  - Statistically model application behavior
  - Look for deviations from the model's behavior
  - Identify the model components that likely caused the deviation

Focus on Modeling Individual MPI Applications
- Primary goal: fault detection for HPC applications
  - Model the behavior of a single MPI application
  - Detect deviations from the norm
  - Identify the origin of a deviation in time and space
- Other branches of the field
  - Model system component interactions
  - Model the application as a dataflow graph of modules
  - Model micro-architectural state as vulnerable or non-vulnerable (ACE analysis)

Goal: Detect Unusual Application Behavior and Identify Its Cause
- Single run, spatial: differences between the behavior of processes
- Single run, temporal: differences between one time point and others
- Multiple runs: differences between the behavior of runs

Semi-Markov Models
- An SMM is a transition system
  - Nodes: application states
  - Edges: transitions from one state to another, labeled with
    - the probability of the transition
    - the time spent in the prior state before the transition
- (Example from the slide's diagram: edges between states A, B, C, and D labeled 0.2 / 5 μs, 0.7 / 15 μs, and 0.1 / 500 μs)
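A minimal sketch of what such an edge might carry, assuming a plain C representation (the type and field names are illustrative, not AutomaDeD's actual data structures):

#include <stdio.h>

typedef struct {
    double mean;              /* Gaussian summary of observed times (seconds) */
    double stddev;
} time_distribution;

typedef struct {
    int    dst_state;         /* index of the destination state              */
    double probability;       /* fraction of exits that take this edge       */
    time_distribution times;  /* time spent in the source state before it    */
} smm_edge;

typedef struct {
    const char *label;        /* e.g. "main() -> foo() -> Send-DBL"          */
    smm_edge   *edges;        /* outgoing transitions                        */
    int         num_edges;
} smm_state;

int main(void)
{
    /* The slide's example: an edge taken 20% of the time after about 5 us. */
    smm_edge e = { 1, 0.2, { 5e-6, 1e-6 } };
    smm_state a = { "A", &e, 1 };
    printf("%s -> state %d: p=%.1f, mean=%g s\n", a.label,
           a.edges[0].dst_state, a.edges[0].probability, a.edges[0].times.mean);
    return 0;
}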

SMMs Represent Application Control Flow
- SMM states correspond to
  - calls to MPI
  - computation between MPI calls
- Different calling contexts produce different states

Application code:

  main() {
    MPI_Init();
    ... Computation ...
    MPI_Send(..., 1, MPI_INTEGER, ...);
    for(...) foo();
    MPI_Recv(..., 1, MPI_INTEGER, ...);
    MPI_Finalize();
  }
  foo() {
    MPI_Send(..., 1024, MPI_DOUBLE, ...);
    ... Computation ...
    MPI_Recv(..., 1024, MPI_DOUBLE, ...);
    ... Computation ...
  }

Corresponding Semi-Markov Model states:

  main() -> Init
  main() -> Send-INT
  Computation
  main() -> foo() -> Send-DBL
  Computation
  main() -> foo() -> Recv-DBL
  Computation
  main() -> Recv-INT
  main() -> Finalize
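The slides do not say how these states are observed; one common way to collect them for an MPI application is the standard PMPI profiling interface, sketched below under that assumption. The smm_record_transition stub is illustrative only (a real tool would also hash the call stack so that different calling contexts map to different states, and would update per-edge statistics instead of printing); this is not AutomaDeD's actual implementation.

/* pmpi_smm.c: compiled into a wrapper library linked with (or preloaded into)
 * the application so these definitions intercept MPI calls, record the SMM
 * transition, and forward to the real routine via PMPI. */
#include <mpi.h>
#include <stdio.h>

static double last_event = 0.0;  /* time of the previous recorded state change */

/* Illustrative stub: note a transition into `state` and how long the process
 * spent in the preceding (computation) state. */
static void smm_record_transition(const char *state, double elapsed)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("[rank %d] -> %s after %.6f s in previous state\n", rank, state, elapsed);
}

int MPI_Send(const void *buf, int count, MPI_Datatype dt, int dest,
             int tag, MPI_Comm comm)
{
    smm_record_transition("Send", MPI_Wtime() - last_event);
    int rc = PMPI_Send(buf, count, dt, dest, tag, comm);  /* real send */
    last_event = MPI_Wtime();
    return rc;
}

int MPI_Recv(void *buf, int count, MPI_Datatype dt, int src, int tag,
             MPI_Comm comm, MPI_Status *status)
{
    smm_record_transition("Recv", MPI_Wtime() - last_event);
    int rc = PMPI_Recv(buf, count, dt, src, tag, comm, status);  /* real recv */
    last_event = MPI_Wtime();
    return rc;
}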

Transitions Represent Time Spent at States
- During execution, each transition is observed multiple times
  - Time series of transition times: [t1, t2, ..., tn]
- The series is represented as a probability distribution
  - Gaussian
  - Histogram

Transitions Represent Time Spent at States (continued)
- Gaussian: cheaper, lower accuracy
- Histogram: more expensive, greater accuracy
- (Diagram: data samples along a time axis, with a fitted Gaussian tail versus histogram bucket counts joined by line connectors)
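A small sketch of the two summaries the slide contrasts, fitting the same list of observed transition times with a Gaussian (two numbers) and with a fixed-width histogram (the bucket count and layout here are assumptions for illustration):

#include <math.h>
#include <stdio.h>

#define NBUCKETS 8

/* Gaussian summary: cheap to store, lower accuracy for multi-modal timings. */
static void fit_gaussian(const double *t, int n, double *mean, double *stddev)
{
    double s = 0.0, s2 = 0.0;
    for (int i = 0; i < n; i++) { s += t[i]; s2 += t[i] * t[i]; }
    *mean = s / n;
    double var = s2 / n - (*mean) * (*mean);
    *stddev = sqrt(var > 0 ? var : 0);
}

/* Histogram summary: more storage, follows the shape of the data better. */
static void fit_histogram(const double *t, int n, double lo, double hi,
                          int counts[NBUCKETS])
{
    for (int b = 0; b < NBUCKETS; b++) counts[b] = 0;
    for (int i = 0; i < n; i++) {
        int b = (int)((t[i] - lo) / (hi - lo) * NBUCKETS);
        if (b < 0) b = 0;
        if (b >= NBUCKETS) b = NBUCKETS - 1;
        counts[b]++;
    }
}

int main(void)
{
    /* Times clustered near 5 us and 15 us, with one 500 us outlier. */
    double times[] = { 5e-6, 6e-6, 5e-6, 15e-6, 14e-6, 500e-6 };
    int n = 6, counts[NBUCKETS];
    double mean, stddev;
    fit_gaussian(times, n, &mean, &stddev);
    fit_histogram(times, n, 0.0, 600e-6, counts);
    printf("Gaussian: mean=%g s stddev=%g s\n", mean, stddev);
    for (int b = 0; b < NBUCKETS; b++) printf("bucket %d: %d\n", b, counts[b]);
    return 0;
}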

Using SMMs to Help Detect Faults
- Hardware faults show up as behavioral abnormalities
- Given sample runs, learn the time distribution of each transition
  (with the top and bottom 0% or 10% of each transition's times removed)
- If some transition takes an unusual amount of time, declare it an error

Detection Threshold Computed from Maximum Normal Variation
- Need a threshold to separate normal from abnormal timing
- Threshold = the lowest probability observed in the set of sample runs
  (top and bottom 1% of observations removed)
- Effect of trimming on the false positive rate:
    Nothing removed:         0%
    Top/bottom 10% removed:  19%
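A minimal sketch of this threshold rule, under the assumption of a Gaussian time model (the trimming fraction, sample data, and helper names are illustrative, not AutomaDeD's exact procedure): learn the distribution from fault-free sample runs, set the threshold to the lowest probability any trimmed training observation reached, and flag run-time observations that score below it.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

static double gaussian_pdf(double x, double mean, double stddev)
{
    double z = (x - mean) / stddev;
    return exp(-0.5 * z * z) / (stddev * sqrt(2.0 * M_PI));
}

static int cmp_double(const void *a, const void *b)
{
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

/* Lowest probability over the training times, after dropping the most
 * extreme trim_frac of observations at each end. */
static double detection_threshold(double *t, int n, double trim_frac,
                                  double mean, double stddev)
{
    qsort(t, n, sizeof(double), cmp_double);
    int skip = (int)(n * trim_frac);
    double min_p = INFINITY;
    for (int i = skip; i < n - skip; i++) {
        double p = gaussian_pdf(t[i], mean, stddev);
        if (p < min_p) min_p = p;
    }
    return min_p;
}

int main(void)
{
    /* Transition times (seconds) observed in fault-free sample runs. */
    double t[] = { 4.8e-6, 5.0e-6, 5.1e-6, 5.2e-6, 4.9e-6, 5.0e-6, 5.6e-6 };
    int n = 7;

    /* Learn the Gaussian from the sample-run times. */
    double s = 0, s2 = 0;
    for (int i = 0; i < n; i++) { s += t[i]; s2 += t[i] * t[i]; }
    double mean = s / n, var = s2 / n - mean * mean;
    double stddev = sqrt(var > 0 ? var : 1e-12);

    double thr = detection_threshold(t, n, 0.01, mean, stddev);
    double observed = 500e-6;              /* a suspiciously long transition */
    double p = gaussian_pdf(observed, mean, stddev);
    printf("threshold=%g  p(observed)=%g  -> %s\n",
           thr, p, p < thr ? "ABNORMAL" : "normal");
    return 0;
}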

Evaluated the Fault Detector Using Fault Injection
- NAS Parallel Benchmarks
  - 16-process runs, input class A
  - Used BT, CG, FT, MG, LU, and SP (EP and IS use MPI in very simple ways)
- Injected faults
  - Local delays (FIN_LOOP): 1, 5, and 10 seconds
  - MPI message drop (DROP_MESG) or repetition (REP_MESG)
  - Extra CPU-intensive (CPU_THR) or memory-intensive (MEM_THR) thread
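A delay fault of the FIN_LOOP kind could be emulated, for example, by stalling one rank inside an intercepted MPI call. The sketch below is only an assumption about what such an injector might look like (INJECT_RANK and INJECT_DELAY_SEC are made-up environment variables), not the injector actually used in these experiments.

#include <mpi.h>
#include <stdlib.h>
#include <unistd.h>

int MPI_Barrier(MPI_Comm comm)
{
    static int calls = 0;
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Inject a one-time local delay on a chosen victim rank at a chosen
     * call site (here, the victim's 10th barrier). */
    const char *victim = getenv("INJECT_RANK");       /* e.g. "3" */
    const char *delay  = getenv("INJECT_DELAY_SEC");  /* e.g. "1", "5", "10" */
    if (victim && delay && rank == atoi(victim) && ++calls == 10)
        sleep((unsigned)atoi(delay));

    return PMPI_Barrier(comm);  /* forward to the real barrier */
}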

Rates of Fault Detection Within 1 ms of Injection
- (Chart categories: no detection; false detection before injection; detection of the fault within 1 ms; detection after 1 ms)
- Filtering usually improves detection rates
- Single-point events are easier to detect than persistent changes

SMMs Used to Help Identify Software Faults in MPI Applications
- The user knows the application has a fault but needs help focusing on its cause
- Help identify the point where the fault first manifests as a change in application behavior
- Key tasks on a faulty run:
  - Identify the time period of the manifestation
  - Identify the task where the fault first manifested
  - Identify the code region where the fault first manifested

Focus on the Time Period of Unusual Behavior
- The user marks phase boundaries in the code
- Compute an SMM for each task and each phase
- (Diagram: a grid of per-task SMMs, with a row of tasks 1..n for every phase)

Focus on the Time Period of Abnormal Behavior
- Find the phase with the most unusual SMMs
- If sample runs are available, compare the faulty run's SMMs to the sample runs' SMMs
- If none are available, compare each phase to the others
- (Diagram: the faulty run's phases compared against a sample run)

Cluster Tasks According to Behavior to Identify the Abnormal Task
- The user provides the application's natural cluster count k
- Use a sample execution to compute the clustering threshold τ that produces k clusters
  - Use sample runs if available
  - Otherwise, compute τ from the start of the execution
- During real runs, cluster tasks using threshold τ (see the clustering sketch below)
- (Diagram: a master-worker application whose tasks normally form k clusters; a bug in task 9 places it in a cluster of its own)
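The slides do not specify the clustering algorithm. The sketch below assumes a simple single-linkage grouping in which tasks whose SMM distance is below τ share a cluster, and τ is scanned on a fault-free run until the expected count k emerges; the distance matrix and all names are illustrative.

#include <stdio.h>

#define NTASKS 4

static int parent[NTASKS];
static int find(int x) { return parent[x] == x ? x : (parent[x] = find(parent[x])); }
static void unite(int a, int b) { parent[find(a)] = find(b); }

/* Single-linkage clustering at threshold tau; returns the cluster count. */
static int cluster(const double d[NTASKS][NTASKS], double tau)
{
    for (int i = 0; i < NTASKS; i++) parent[i] = i;
    for (int i = 0; i < NTASKS; i++)
        for (int j = i + 1; j < NTASKS; j++)
            if (d[i][j] < tau) unite(i, j);
    int clusters = 0;
    for (int i = 0; i < NTASKS; i++) if (find(i) == i) clusters++;
    return clusters;
}

/* Scan candidate thresholds until the fault-free run splits into k clusters. */
static double calibrate_tau(const double d[NTASKS][NTASKS], int k)
{
    for (double tau = 0.01; tau < 10.0; tau += 0.01)
        if (cluster(d, tau) == k) return tau;
    return -1.0;
}

int main(void)
{
    /* Example pairwise SMM distances: tasks 0-2 behave alike (workers),
     * task 3 (master) behaves differently. */
    double d[NTASKS][NTASKS] = {
        { 0.0, 0.1, 0.2, 3.0 },
        { 0.1, 0.0, 0.1, 3.1 },
        { 0.2, 0.1, 0.0, 2.9 },
        { 3.0, 3.1, 2.9, 0.0 },
    };
    double tau = calibrate_tau(d, 2);   /* master-worker: natural k = 2 */
    printf("tau=%.2f gives %d clusters\n", tau, cluster(d, tau));
    return 0;
}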

Cluster Tasks According to Behavior to Identify the Abnormal Task (continued)
- Compare the tasks in each cluster to their behavior in
  - sample runs, or
  - the start of the execution
- The most abnormal task is identified
- The transition most responsible for the difference is identified as the origin

From the Clustering, Identify the Transition Where the Fault First Manifested
- The SMM difference function combines
  - the difference between transition probabilities
  - the difference between transition time distributions
- The transition most responsible for the inter-cluster differences is identified as the manifestation origin
  - Uses a ranking algorithm
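The exact difference function is not given in these slides; the sketch below assumes one plausible per-edge score that adds the gap in transition probabilities to a normalized gap between the Gaussian time summaries. The formula, weights, and names are illustrative only; ranking edges by such a score is what points at the transition where the fault first manifested.

#include <math.h>
#include <stdio.h>

typedef struct {
    double prob;          /* probability of taking this transition           */
    double mean, stddev;  /* Gaussian summary of time spent before taking it */
} edge_model;

/* Larger value = the two tasks' behavior on this edge differs more. */
static double edge_dissimilarity(edge_model a, edge_model b)
{
    double dprob  = fabs(a.prob - b.prob);
    double pooled = sqrt(0.5 * (a.stddev * a.stddev + b.stddev * b.stddev));
    double dtime  = fabs(a.mean - b.mean) / (pooled > 0 ? pooled : 1.0);
    return dprob + dtime;
}

int main(void)
{
    /* Task i and task j on the same SMM edge; task j stalls far longer. */
    edge_model task_i = { 0.70, 15e-6, 2e-6 };
    edge_model task_j = { 0.68, 900e-6, 5e-6 };
    printf("dissimilarity = %.2f\n", edge_dissimilarity(task_i, task_j));
    return 0;
}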

Evaluated Fault Detector Using Fault Injection
- NAS Parallel Benchmarks: 16-task, class A runs of BT, CG, FT, MG, LU, and SP
- 2000 injection experiments per application
  - Local livelock/deadlock (FIN_LOOP, INF_LOOP)
  - Message drop (DROP_MESG) or repetition (REP_MESG)
  - Extra CPU-intensive (CPU_THR) or memory-intensive (MEM_THR) thread
- Examined variants of the training runs
  - 20 training runs with no faults
  - 20 training runs, 10% of which have a fault
  - No training runs

Phase Detection Accuracy
- Accuracy is about 90% for loop and message-drop faults, and about 60% for extra-thread faults
- Training is significantly better than no training (training with 10% buggy runs comes close to fault-free training)
- Histograms perform better than Gaussians
- (Chart comparisons: training vs. no training; fault-free samples vs. samples with some faults; Gaussian vs. histogram)

Cluster Isolation Accuracy
- Results assume the phase was detected accurately
- The accuracy of cluster isolation is highly variable
  - It depends on how the fault's effects propagate
  - Accuracy reaches up to 90% for extra-thread faults
  - Isolation is poor elsewhere, since no information on event timing is used

Cluster Isolation Accuracy (continued)
- Extended cluster isolation with information on event order
- Focuses on the first abnormal transition
- Significantly better accuracy for loop faults

Transition Isolation
- Accuracy metric: the injected transition appears among the top 5 candidates
- Accuracy is about 90% for loop faults
- Highly variable for the other fault types
- Less variable when event-order information is used

Abnormality Detection Helps Illuminate an MVAPICH Bug
- The job execution script failed to clean up at job end, leaving runaway processes on the nodes
- Simulated by executing BT (16- and 64-task runs) concurrently with LU, MG, or SP (16-task runs)
- The experiments show
  - the average SMM difference among regular BT runs
  - the difference between BT runs with interference and no-interference runs
- The interfering application overlaps the initial portion of the BT run

Behavior Modeling Is a Critical Component of Fault Detection and Analysis
- Applications and systems exhibit complex behavior
- Statistical models provide an accurate summary
- Promising results
  - Quick detection of faults
  - Focused localization of root causes
- Ongoing work
  - Scaling the implementation to real HPC systems
  - Improving accuracy through
    - more data
    - models custom-tailored to applications