Slide 1/24 Lawrence Livermore National Laboratory
AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks
Greg Bronevetsky, Bronis R. de Supinski, Dong H. Ahn, Martin Schulz, Ignacio Laguna, Saurabh Bagchi
June 30th, 2010. 40th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

Slide 2/24 Lawrence Livermore National Laboratory
Debugging Large-Scale Parallel Applications is Challenging
Large systems will have millions of cores in the near future
– Increased difficulty of developing correct HPC applications
– Traditional debuggers don't perform well at this scale
Faults come from various sources
– Hardware: soft errors, physical degradation, design bugs
– Software: coding bugs, misconfigurations
[Photo: Sequoia (2011), 1.6 million cores]

Slide 3/24 Lawrence Livermore National Laboratory
Developer Steps When Debugging a Parallel Application
Questions a developer has to answer when an application fails:
– When did it fail?
– Which parallel task failed?
– Which code region? Which line of code? What is the root cause?
There is a need for tools, such as AutomaDeD, that help developers find the root cause quickly.

Slide 4/24 Lawrence Livermore National Laboratory
AutomaDeD's Error Detection Approach
[Workflow diagram] Offline: the user annotates phases in the application. Online: an MPI profiler builds one model per task (Task 1 … Task n → Model 1 … Model n). Offline: clustering of the models reports (1) abnormal phases, (2) abnormal tasks, and (3) characteristic transitions.

Slide 5/24 Lawrence Livermore National Laboratory
Types of Behavioral Differences
[Diagram: an MPI application with tasks 1, 2, 3, …, n running over time, shown for Runs 1, 2 and 3]
– Spatial: between tasks
– Temporal: between time points
– Between runs

Slide 6/24 Lawrence Livermore National Laboratory
Semi-Markov Models (SMM)
Like a Markov model, but with the time between transitions
– Nodes: application states
– Edges: transitions from one state to another
– Each edge carries a transition probability and the time spent in the current state before the transition
[Diagram: state A with outgoing edges A→B (0.2, 5μs), A→C (0.7, 15μs), A→D (0.1, 500μs)]
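The edge labels above map naturally onto a small data structure. Below is a minimal, illustrative C sketch (not AutomaDeD's actual implementation) of an SMM edge carrying the transition probability and the time spent in the source state, populated with the A→B/C/D example from the diagram:

    #include <stdio.h>

    /* Hypothetical representation of one Semi-Markov Model edge: a
     * transition between two application states, annotated with its
     * probability and a summary of the time spent in the source
     * state before the transition fires. */
    typedef struct {
        int    src_state;     /* e.g., state A */
        int    dst_state;
        double probability;   /* outgoing probabilities of a state sum to 1 */
        double mean_time_us;  /* time in src_state before the transition */
    } smm_edge;

    int main(void) {
        /* The three edges leaving state A in the slide's example */
        smm_edge edges[] = {
            {0, 1, 0.2, 5.0},    /* A -> B: p = 0.2, 5 us   */
            {0, 2, 0.7, 15.0},   /* A -> C: p = 0.7, 15 us  */
            {0, 3, 0.1, 500.0},  /* A -> D: p = 0.1, 500 us */
        };
        for (int i = 0; i < 3; i++)
            printf("edge %d->%d: p=%.1f, t=%.0f us\n",
                   edges[i].src_state, edges[i].dst_state,
                   edges[i].probability, edges[i].mean_time_us);
        return 0;
    }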

Slide 7/24 Lawrence Livermore National Laboratory
SMM Represents Task Control Flow
States correspond to:
– Calls to MPI routines
– Code between MPI routines
Application code:

    main() {
      MPI_Init();
      … Computation …
      MPI_Send(…, 1, MPI_INTEGER, …);
      for (…)
        foo();
      MPI_Recv(…, 1, MPI_INTEGER, …);
      MPI_Finalize();
    }

    foo() {
      MPI_Send(…, 1024, MPI_DOUBLE, …);
      … Computation …
      MPI_Recv(…, 1024, MPI_DOUBLE, …);
      … Computation …
    }

Semi-Markov Model states: main()→Init; Computation; main()→Send-INT; main()→foo()→Send-DBL; Computation; main()→foo()→Recv-DBL; Computation; main()→Recv-INT; main()→Finalize
A different state is created for each distinct calling context (the Send in main() and the Send in foo() map to different states)

Slide 8/24 Lawrence Livermore National Laboratory
Two Approaches for Time Density Estimation: Parametric and Non-parametric
– Gaussian distribution (parametric model): cheaper, lower accuracy
– Histograms (non-parametric model): more expensive, greater accuracy
[Figure: density functions fitted to time-value data samples; the histogram variant shows bucket counts, line connectors, and a Gaussian tail]
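As a rough illustration of this trade-off, here is a hedged C sketch of both estimators; the bucket count, the outlier-clamping policy, and all function names are assumptions, not the tool's exact choices. The Gaussian needs only two parameters per edge, while the histogram stores one count per bucket but can capture multimodal timing behavior that a single Gaussian cannot:

    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    /* Parametric model: fit a mean/stddev per edge, evaluate in O(1). */
    double gaussian_density(double t, double mean, double stddev) {
        double z = (t - mean) / stddev;
        return exp(-0.5 * z * z) / (stddev * sqrt(2.0 * M_PI));
    }

    /* Non-parametric model: fixed-width histogram over [lo, hi). */
    #define NBUCKETS 32            /* illustrative choice */
    typedef struct {
        double lo, hi;             /* range covered by the buckets  */
        long   count[NBUCKETS];    /* per-bucket observation counts */
        long   total;              /* total number of observations  */
    } histogram;

    static int bucket_of(const histogram *h, double t) {
        int b = (int)((t - h->lo) / (h->hi - h->lo) * NBUCKETS);
        if (b < 0) b = 0;                    /* clamp outliers into */
        if (b >= NBUCKETS) b = NBUCKETS - 1; /* the end buckets     */
        return b;
    }

    void hist_add(histogram *h, double t) {
        h->count[bucket_of(h, t)]++;
        h->total++;
    }

    double hist_density(const histogram *h, double t) {
        double width = (h->hi - h->lo) / NBUCKETS;
        return (double)h->count[bucket_of(h, t)] / ((double)h->total * width);
    }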

Slide 9/24 Lawrence Livermore National Laboratory
AutomaDeD's Error Detection Approach (overview revisited)
[Workflow diagram, as on Slide 4: offline phase annotation; online MPI profiler building one model per task; offline clustering reporting (1) abnormal phases, (2) abnormal tasks, (3) characteristic transitions]

Slide 10/24 Lawrence Livermore National Laboratory
User's Phase Annotations
Phases (1, 2, 3, …) denote regions of execution that are repeated dynamically
Developers annotate phases in the code
– MPI_Pcontrol is intercepted by a wrapper library (a minimal sketch follows the sample code)
Sample code:

    main() {
      MPI_Init();
      … Computation …
      MPI_Send(…, MPI_INTEGER, …);
      for (…) {
        MPI_Send(…, MPI_DOUBLE, …);
        … Computation …
        MPI_Recv(…, MPI_DOUBLE, …);
        MPI_Pcontrol(1);  /* phase boundary; MPI requires a level argument, value illustrative */
      }
      … Computation …
      MPI_Recv(…, MPI_INTEGER, …);
      MPI_Finalize();
    }
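A minimal sketch of how such PMPI interposition typically works is shown below; the phase counter is an assumed helper, not part of AutomaDeD's published API, and the real tool records far more state at each boundary:

    #include <mpi.h>

    /* Sketch of PMPI interposition: the profiling library provides
     * its own MPI_Pcontrol, records the phase boundary, then
     * forwards to the MPI implementation's PMPI_ entry point. */
    static int current_phase = 0;

    int MPI_Pcontrol(const int level, ...) {
        current_phase++;  /* close the current phase's SMM, start the next */
        return PMPI_Pcontrol(level);
    }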

Slide 11/24 Lawrence Livermore National Laboratory
A Semi-Markov Model per Task, per Phase
[Diagram: execution timeline divided into Phases 1-4; each of Tasks 1, 2, …, n builds one SMM per phase]

Slide 12/24 Lawrence Livermore National Laboratory
AutomaDeD's Error Detection Approach (overview revisited)
[Workflow diagram, as on Slide 4: offline phase annotation; online MPI profiler building one model per task; offline clustering reporting (1) abnormal phases, (2) abnormal tasks, (3) characteristic transitions]

Slide 13/24 Lawrence Livermore National Laboratory
Faulty Phase Detection: Find the Time Period of Abnormal Behavior
Goal: find the phase that differs the most from the other phases (each of Phases 1 … M yields one SMM per task, SMM 1 … SMM n)
– Sample runs available: compare each phase to its counterpart in the sample runs
– Without sample runs: compare each phase to all other phases of the same run
– Either way, each phase receives a deviation score
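For the no-sample-runs case, one plausible reading of the deviation score is sketched below: a phase's score is its mean dissimilarity to every other phase, and the phase with the highest score is flagged. The matrix layout and function name are illustrative, not the paper's exact formulation:

    #define MAX_PHASES 64  /* illustrative bound */

    /* diss[i][j]: precomputed dissimilarity between the models of
     * phase i and phase j (see the Diss() definition on Slide 14). */
    int most_abnormal_phase(int nphases, double diss[][MAX_PHASES]) {
        int worst = 0;
        double worst_score = -1.0;
        for (int p = 0; p < nphases; p++) {
            double score = 0.0;
            for (int q = 0; q < nphases; q++)
                if (q != p)
                    score += diss[p][q];
            score /= (double)(nphases - 1);  /* mean deviation score */
            if (score > worst_score) {
                worst_score = score;
                worst = p;
            }
        }
        return worst;
    }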

Slide 14/24 Lawrence Livermore National Laboratory
Clustering Tasks' Models: Hierarchical Agglomerative Clustering (HAC)
– Each task starts in its own cluster (Step 1); the most similar clusters are merged step by step (Steps 2-4)
– Diss(SMM_1, SMM_2) = L2 norm (transition prob.) + L2 norm (time prob.)
– We need a dissimilarity threshold to decide when to stop merging; otherwise, do we stop, or do we end up with a single cluster?
[Diagram: the SMMs of Tasks 1-4 being merged over Steps 1-4]
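The dissimilarity on this slide can be written directly as code. Below is a hedged C sketch assuming a flattened SMM that stores, per edge, a transition probability and a mean time; the state count and struct layout are assumptions for illustration:

    #include <math.h>

    #define NSTATES 16  /* illustrative model size */

    /* Hypothetical flattened SMM: per-edge transition probability
     * and mean state-residence time, both indexed [src][dst]. */
    typedef struct {
        double p[NSTATES][NSTATES];  /* transition probabilities      */
        double t[NSTATES][NSTATES];  /* mean times before transitions */
    } smm;

    /* Diss(SMM_1, SMM_2) as on the slide: L2 norm over the
     * transition-probability parameters plus L2 norm over the
     * timing parameters. */
    double smm_dissim(const smm *a, const smm *b) {
        double dp = 0.0, dt = 0.0;
        for (int i = 0; i < NSTATES; i++)
            for (int j = 0; j < NSTATES; j++) {
                double ep = a->p[i][j] - b->p[i][j];
                double et = a->t[i][j] - b->t[i][j];
                dp += ep * ep;
                dt += et * et;
            }
        return sqrt(dp) + sqrt(dt);
    }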

Slide 15/24 Lawrence Livermore National Laboratory
How To Select The Number Of Clusters
User provides the application's natural cluster count k
Compute a clustering threshold τ that produces k clusters
– Use sample runs if available
– Otherwise, compute τ from the start of the execution
– The threshold is based on the highest increase in dissimilarity (see the sketch below)
During real runs, cluster tasks using threshold τ
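A minimal sketch of the "highest increase in dissimilarity" rule, assuming we recorded the dissimilarity value at which each HAC merge occurred (a non-decreasing sequence): τ is placed inside the largest jump, so in real runs merging stops before the merges past that jump:

    /* merge_diss[m]: dissimilarity at which the m-th HAC merge
     * happened, recorded from a sample run (non-decreasing).
     * Returns a threshold tau placed in the middle of the largest
     * jump; requires nmerges >= 2. Names are illustrative. */
    double pick_threshold(const double *merge_diss, int nmerges) {
        int cut = 1;
        double biggest_jump = -1.0;
        for (int m = 1; m < nmerges; m++) {
            double jump = merge_diss[m] - merge_diss[m - 1];
            if (jump > biggest_jump) {
                biggest_jump = jump;
                cut = m;
            }
        }
        return 0.5 * (merge_diss[cut - 1] + merge_diss[cut]);
    }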

Slide 16/24 Lawrence Livermore National Laboratory
Cluster Isolation Example
Master-worker application example (Tasks 1-9):
– Normal execution: tasks group into two clusters (Cluster 1 and Cluster 2)
– Buggy execution, with a bug in Task 9: Task 9 separates into its own Cluster 3
Cluster isolation: separate the buggy task into an unusual cluster

Slide 17/24 Lawrence Livermore National Laboratory
Transition Isolation: Erroneous Code Region Detection
Method 1:
– Find the edge that distinguishes the faulty cluster from the others
– Recall: SMM dissimilarity is based in part on the L2 norm of the SMMs' parameters
Method 2:
– Find an unusual individual edge: one that takes an unusual amount of time compared to previously observed times (a sketch follows)
[Figure: visualization of results, highlighting the isolated transition in Cluster 2]
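A sketch of Method 2's "unusual amount of time" test, under the assumption that per-edge statistics (mean and standard deviation) from normal behavior are available; edges are ranked by this score and the top candidates reported. The struct and function names are illustrative:

    #include <math.h>

    typedef struct {
        double mean_us;    /* mean time observed on this edge normally */
        double stddev_us;  /* and its standard deviation               */
    } edge_stats;

    /* Higher scores mean the observed time on this edge is farther
     * from what was previously seen; ranking edges by this score
     * yields the candidate code regions. */
    double edge_anomaly(double observed_us, const edge_stats *normal) {
        if (normal->stddev_us == 0.0)
            return 0.0;  /* no variance recorded; cannot score */
        return fabs(observed_us - normal->mean_us) / normal->stddev_us;
    }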

Slide 18/24 Lawrence Livermore National Laboratory
Fault Injections
NAS Parallel Benchmarks (MPI programs):
– BT, CG, FT, MG, LU and SP
– 16 tasks, Class A (input)
2,000 injection experiments per application:

    Name        Description
    FIN_LOOP    Transient stall (finite loop; delay of 1, 5, or 10 sec)
    INF_LOOP    Local livelock/deadlock (infinite loop)
    DROP_MESG   MPI message loss
    REP_MESG    MPI message duplication
    CPU_THR     CPU-intensive thread
    MEM_THR     Memory-intensive thread

Slide 19/24 Lawrence Livermore National Laboratory
Phase Detection Accuracy
~90% for loops and message drops; ~60% for extra threads
– Training = sample runs available
– Training is significantly better than no training
– Histograms are better than Gaussians
[Charts, per fault type: training vs. no training; some faults vs. no-fault samples; Gaussian vs. histogram]

Slide 20/24 Lawrence Livermore National Laboratory
Cluster Isolation Accuracy: Isolating the Abnormal Task(s)
– Results assume the phase was detected accurately
– Accuracy of cluster isolation is highly variable across applications and faults
– Accuracy up to 90% for extra threads
– Poor detection elsewhere because of fault propagation: buggy task → normal task(s)

Slide 21/24 Lawrence Livermore National Laboratory
Transition Isolation Accuracy
Erroneous transition lies in the top 5 candidates identified by AutomaDeD:
– Accuracy ~90% for loop faults
– Highly variable for other fault types
– Less variable if event-order information is used

Slide 22/24 Lawrence Livermore National Laboratory
MVAPICH Bug
Job execution script failed to clean up at job end
– MPI task launcher (mpirun, version 0.9.9)
– Left runaway processes on the nodes
Simulation:
– Execute BT (the affected application)
– Concurrently run runaway applications (LU, MG or SP)
– Runaway tasks interfere with BT's normal execution

Slide 23/24 Lawrence Livermore National Laboratory
MVAPICH Bug Results: SMM Deviation Scores in the Affected Application
– Affected application: BT benchmark
– Interfering applications: SP, LU and MG benchmarks
– Abnormal phase detected: phase 1 under SP and LU interference, phase 2 under MG
– Regular BT runs show a constant (average) SMM difference

Slide 24/24 Lawrence Livermore National Laboratory
Concluding Remarks
Contributions:
– A novel way to model and compare parallel tasks' behavior
– Focuses debugging effort on the time period, tasks, and code region where a bug first manifests
– Accuracy up to ~90% for phase detection and for cluster and transition isolation (delays and hangs)
Ongoing work:
– Scaling the implementation to millions of tasks
– Improving accuracy with different statistical models (e.g., kernel density estimation, Gaussian mixture models)