WuKong: Automatically Detecting and Localizing Bugs that Manifest at Large System Scales
Bowen Zhou, Jonathan Too, Milind Kulkarni, Saurabh Bagchi
Purdue University
Ever Changing Behavior of Software
Software has to adapt to different platforms, inputs and configurations.
As a side effect, whether a bug manifests may depend on a particular platform, input or configuration.
Software Development Process
1. Develop a new feature and its unit tests
2. Test the new feature on a local machine
3. Push the feature into production systems
4. Break production systems
5. Roll back the feature
Not tested in production systems!!!
Bugs in Production Runs
Properties
– Remain unnoticed when the application is tested on a developer's workstation
– Break the production system when the application is running on a cluster and/or serving real user requests
Examples
– Configuration error
– Integer overflow
Scale-Dependent Bugs
Modeling Program Behavior for Finding Bugs
Dubbed Statistical Debugging [Bronevetsky DSN '10] [Mirgorodskiy SC '06] [Chilimbi ICSE '09] [Liblit PLDI '03]
– Represents program behavior as a set of features that can be measured at runtime
– Builds a model to describe and predict the features based on data collected from many runs
– Detects abnormal features that deviate from the model's prediction beyond a certain threshold
Does not account for scale-induced variation in program behavior
Modeling Scale-dependent Behavior
[Figure: number of times a loop executes, plotted per run number, for training runs and production runs]
Is there a bug in one of the production runs?
Modeling Scale-dependent Behavior
[Figure: number of times a loop executes, plotted against scale, for training runs and production runs]
Accounting for scale makes trends clear and errors at large scales obvious
Modeling Scale-dependent Behavior
Our Previous Efforts
– Vrisha [HPDC '11]: builds a collective model for all features of a program to detect bugs at any feature
– Abhranta [HotDep '12]: tweaks Vrisha's model to allow per-feature bug detection and localization
They have limitations...
Modeling Scale-dependent Behavior
Big gap in scale
– e.g. training runs on up to 128 nodes, production runs on 1024 nodes
Noisy features
– Too many false positives render the model useless
Reconstructing Scale-dependent Behavior: the WuKong way
– Covers a wide range of program features
– Predicts the expected value in a large-scale run for each feature separately
– Prunes unpredictable features to improve localization quality
– Provides a shortlist of suspicious features in its localization roadmap
The Workflow
[Figure: in training, the application runs under PIN for runs 1 to N, each yielding scale and feature data that feed a scale-to-feature model; in production, the observed feature values are compared (= ?) with the model's prediction for that scale]
Feature Collection
Features considered by WuKong
    void foo(int a) {
        if (a > 0) { } else { }        /* feature 1 */
        if (a > 100) {                 /* feature 2 */
            int i = 0;
            while (i < a) {            /* feature 3 */
                if (i % 2 == 0) { }    /* feature 4 */
                ++i;
            }
        }
    }
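To make the numbered conditionals in foo concrete, here is a minimal Python sketch that mirrors the function and counts how often each conditional is evaluated. It assumes that a feature's value is the number of times its condition is tested during a run; the helper name foo_counts is hypothetical and this is not WuKong's PIN-based instrumentation.

```python
from collections import Counter

def foo_counts(a: int) -> Counter:
    """Mirror foo(a) and count evaluations of each numbered conditional.

    Assumption: a feature's value is how many times its condition is tested.
    """
    counts = Counter()
    counts[1] += 1              # 1: if (a > 0)
    counts[2] += 1              # 2: if (a > 100)
    if a > 100:
        i = 0
        while True:
            counts[3] += 1      # 3: while (i < a), counted on every test
            if not (i < a):
                break
            counts[4] += 1      # 4: if (i % 2 == 0)
            i += 1
    return counts

for a in (50, 200, 400):
    c = foo_counts(a)
    print(a, [c[k] for k in (1, 2, 3, 4)])
# 50 [1, 1, 0, 0]
# 200 [1, 1, 201, 200]
# 400 [1, 1, 401, 400]
```

Features 1 and 2 stay constant while features 3 and 4 grow with the input, which is the kind of scale-dependent behavior a per-feature model has to capture.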
Modeling
Predict Feature from Scale
X ~ vector of scale parameters X_1 ... X_N
Y ~ number of times a particular feature occurs
The model to predict Y from X: [equation missing]
Compute the prediction error: [equation missing]
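The idea of predicting a feature count Y from scale and then measuring the prediction error can be sketched as follows. This is only an illustration under stated assumptions: it uses a single scale parameter (the number of processes), a log-log linear regression as the model form, and a relative error normalized by the predicted value. None of these choices is given in the text above, and the function names and training data are made up.

```python
import math

def fit_log_log(scales, counts):
    """Least-squares fit of log(count) = b0 + b1 * log(scale) for one feature.

    Assumed model form; counts and scales must be positive.
    """
    xs = [math.log(s) for s in scales]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx
    return b0, b1

def predict(model, scale):
    """Predicted feature count at the given scale."""
    b0, b1 = model
    return math.exp(b0 + b1 * math.log(scale))

def relative_error(observed, predicted):
    """Assumed error definition: |observed - predicted| / predicted."""
    return abs(observed - predicted) / predicted

# Made-up training data: the feature count grows linearly with scale.
train_scales = [16, 32, 64, 128]
train_counts = [3 * s for s in train_scales]

model = fit_log_log(train_scales, train_counts)
expected = predict(model, 1024)          # extrapolate to the production scale
print(round(expected))                   # 3072
print(relative_error(3072, expected))    # close to 0 for a healthy run
print(relative_error(9000, expected))    # well above 1 for an anomalous count
```

Fitting in log space keeps the model linear while still capturing power-law growth, which is one reason it is a convenient choice when extrapolating from small training scales to a much larger production scale.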
Bug Localization
Locate Buggy Features
First, we need to know whether the production run is buggy. Detection flags the run if, for some feature i,
$E_i > \eta \cdot M_i$
where $E_i$ is the error of feature i in the production run, $\eta$ is a constant parameter, and $M_i$ is the max error of feature i in all training runs.
If there is a bug in this run, we can start looking at the prediction error of each feature:
– Rank all features by their prediction error to provide a localization roadmap that contains the top N features
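Detection followed by ranking can be sketched in a few lines. The sketch assumes the detection rule compares each feature's production error against a constant multiple of its maximum training error, as the slide's annotations suggest; the function name, the default values of eta and top_n, and the error values are all hypothetical.

```python
def detect_and_rank(prod_err, max_train_err, eta=1.5, top_n=3):
    """Flag a run as buggy and build a localization roadmap.

    prod_err:      feature -> prediction error in the production run
    max_train_err: feature -> max prediction error over all training runs
    eta:           constant threshold parameter (hypothetical default)
    """
    # Assumed rule: the run is buggy if any feature exceeds its scaled threshold.
    is_buggy = any(err > eta * max_train_err[f] for f, err in prod_err.items())
    if not is_buggy:
        return False, []
    # Rank all features by prediction error, largest first, and keep the top N.
    roadmap = sorted(prod_err, key=prod_err.get, reverse=True)[:top_n]
    return True, roadmap

# Made-up errors for four features.
prod_err = {"f1": 0.02, "f2": 0.04, "f3": 2.10, "f4": 0.90}
max_train_err = {"f1": 0.05, "f2": 0.05, "f3": 0.10, "f4": 0.80}

print(detect_and_rank(prod_err, max_train_err))
# (True, ['f3', 'f4', 'f2'])
```

Only f3 exceeds its threshold here, yet the roadmap still lists the next most suspicious features, so a developer has a short ordered list to inspect.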
Improve Localization Quality by Feature Pruning
Noisy Feature Pruning
Some features cannot be effectively predicted by the above model
– Random
– Not scale-determined
– Discontinuous
The trade-off
– Keeping those features would pollute the diagnosis by pushing real faults down the list
– Removing those features could miss some faults if the faults happen to be in such features
Noisy Feature Pruning
How to remove them? For each feature:
1. Do a cross validation with training runs
2. Remove the feature if it triggers greater-than-100% prediction error in more than (100-x)% of training runs
Parameter x > 0 is for tolerating outliers in training runs
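The two pruning steps above can be sketched as follows. The sketch makes several assumptions that the slide does not state: leave-one-out is used as the cross-validation scheme, the per-feature model is a log-log linear regression on a single scale parameter, and the error is normalized by the predicted value. All function names and data are hypothetical.

```python
import math

def fit_log_log(scales, counts):
    """Least-squares fit of log(count) = b0 + b1 * log(scale) (assumed model)."""
    xs = [math.log(s) for s in scales]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b1 * mx, b1

def cross_validation_errors(scales, counts):
    """Step 1: leave-one-out cross validation over the training runs."""
    errors = []
    for k in range(len(scales)):
        # Fit on every run except run k, then predict run k.
        b0, b1 = fit_log_log(scales[:k] + scales[k + 1:], counts[:k] + counts[k + 1:])
        predicted = math.exp(b0 + b1 * math.log(scales[k]))
        errors.append(abs(counts[k] - predicted) / predicted)
    return errors

def should_prune(errors, x):
    """Step 2: prune if error > 100% in more than (100 - x)% of training runs."""
    bad = sum(1 for e in errors if e > 1.0)
    return 100.0 * bad / len(errors) > 100.0 - x

scales = [16, 32, 64, 128, 256]
regular = [2 * s for s in scales]            # made-up, perfectly scale-determined
regular_errors = cross_validation_errors(scales, regular)
print(should_prune(regular_errors, x=10))    # False: every run is predicted well

noisy_errors = [1.5, 2.0, 0.1, 3.0]          # made-up errors, 75% exceed 100%
print(should_prune(noisy_errors, x=30))      # True: 75% > 70%
print(should_prune(noisy_errors, x=10))      # False: 75% is not above 90%
```

Separating the cross-validation step from the pruning decision makes it easy to see how the parameter x shifts the cutoff without refitting any model.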
Evaluation
Fault injection in Sequoia AMG2006
– Up to 1024 processes
– Randomly selected conditionals to be flipped
Two case studies
– Integer overflow in an MPI library
– Deadlock in a P2P file sharing application
Fault Injection Study
Fault
– Injected at process 0
– Randomly pick a feature to flip
Data
– Training (w/o fault): 110 runs, … processes
– Production (w/ fault): 100 runs, 1024 processes
Fault Injection Study
Result
– Total: 100
– Non-crashing: 57
– Detected: 53
– Located: 49
Successfully localized: 92.5% (49 of the 53 detected faults)
Case Study: A Deadlock in Transmission’s DHT Implementation
Features 53, 66
Conclusion
– Debugging scale-dependent program behavior is a difficult and important problem
– WuKong incorporates the scale of a run into a predictive model for each individual program feature for accurate bug diagnosis
– We demonstrated the effectiveness of WuKong through a large-scale fault injection study and two case studies of real bugs
Q&A
Backup
Runtime Overhead
Geometric mean: 11.4%