WuKong: Automatically Detecting and Localizing Bugs that Manifest at Large System Scales. Bowen Zhou, Jonathan Too, Milind Kulkarni, Saurabh Bagchi. Purdue University.

Presentation transcript:

WuKong: Automatically Detecting and Localizing Bugs that Manifest at Large System Scales Bowen Zhou, Jonathan Too, Milind Kulkarni, Saurabh Bagchi Purdue University

Ever-Changing Behavior of Software Software has to adapt to accommodate different platforms, inputs, and configurations. As a side effect, whether a bug manifests may depend on the particular platform, input, or configuration. 2

Ever-Changing Behavior of Software 3

Software Development Process 4 Develop a new feature and its unit tests. Test the new feature on a local machine. Push the feature into production systems. Break the production systems. Roll back the feature. Not tested in production systems!

Bugs in Production Runs Properties – Remain unnoticed when the application is tested on the developer's workstation – Break the production system when the application runs on a cluster and/or serves real user requests Examples – Configuration error – Integer overflow 5

Bugs in Production Runs Properties – Remain unnoticed when the application is tested on the developer's workstation – Break the production system when the application runs on a cluster and/or serves real user requests Examples – Configuration error – Integer overflow Scale-Dependent Bugs 6

Modeling Program Behavior for Finding Bugs Dubbed statistical debugging [Bronevetsky DSN ‘10] [Mirgorodskiy SC ’06] [Chilimbi ICSE ‘09] [Liblit PLDI ‘03] – Represents program behavior as a set of features that can be measured at runtime – Builds a model to describe and predict the features based on data collected from many runs – Detects abnormal features that deviate from the model's prediction beyond a certain threshold 7

Modeling Program Behavior for Finding Bugs Dubbed statistical debugging [Bronevetsky DSN ‘10] [Mirgorodskiy SC ’06] [Chilimbi ICSE ‘09] [Liblit PLDI ‘03] – Represents program behavior as a set of features that can be measured at runtime – Builds a model to describe and predict the features based on data collected from many runs – Detects abnormal features that deviate from the model's prediction beyond a certain threshold 8 Does not account for scale-induced variation in program behavior

Modeling Scale-dependent Behavior 9 [Plot: number of times a loop executes (y-axis) vs. run number (x-axis), showing training runs and production runs] Is there a bug in one of the production runs?

Modeling Scale-dependent Behavior 10 [Plot: number of times a loop executes (y-axis) vs. scale (x-axis), showing training runs and production runs] Accounting for scale makes the trend clear and errors at large scales obvious.

Modeling Scale-dependent Behavior Our Previous Research – Vrisha [HPDC '11]: builds a collective model over all features of a program to detect a bug in any feature – Abhranta [HotDep '12]: tweaks Vrisha's model to allow per-feature bug detection and localization 11

Modeling Scale-dependent Behavior Our Previous Efforts – Vrisha [HPDC '11]: builds a collective model over all features of a program to detect a bug in any feature – Abhranta [HotDep '12]: tweaks Vrisha's model to allow per-feature bug detection and localization 12 They have limitations...

Modeling Scale-dependent Behavior Big gap in scale – e.g. training runs on up to 128 nodes, production runs on 1024 nodes Noisy features – Too many false positives render the model useless 13

Reconstructing Scale-dependent Behavior: the WuKong way Covers a wide range of program features Predicts the expected value in a large-scale run for each feature separately Prunes unpredictable features to improve localization quality Provides a shortlist of suspicious features in its localization roadmap 14

The Workflow 15 [Diagram: N training runs of the application (APP) instrumented with Pin each produce a (scale, feature) record; a model is built from the training records to predict features from scale; a production run's (scale, feature) record is then compared against the model's prediction ("= ?").]

Feature Collection 16

Features considered by WuKong 17

    void foo(int a) {
      if (a > 0) {
      } else {
      }
      if (a > 100) {
        int i = 0;
        while (i < a) {
          if (i % 2 == 0) {
          }
          ++i;
        }
      }
    }

Features considered by WuKong 18

    void foo(int a) {
    1:  if (a > 0) {
        } else {
        }
    2:  if (a > 100) {
          int i = 0;
    3:    while (i < a) {
    4:      if (i % 2 == 0) {
            }
            ++i;
          }
        }
    }
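
WuKong's features are, in essence, counts of how often each labeled conditional executes in a run. Below is a minimal sketch of such counting for the function above; the feature_count array, its indexing by the labels 1-4, and the hand-inserted increments are illustrative assumptions made here. The actual tool gathers equivalent counts automatically through Pin-based binary instrumentation rather than by editing the source.

    #include <cstdio>

    // Hypothetical per-feature counters, indexed by the branch labels 1-4
    // from the slide above. WuKong collects equivalent counts automatically
    // with Pin; the explicit increments here are only for illustration.
    static long feature_count[5] = {0};

    void foo(int a) {
      ++feature_count[1];                        // conditional 1 executed
      if (a > 0) { /* ... */ } else { /* ... */ }
      ++feature_count[2];                        // conditional 2 executed
      if (a > 100) {
        int i = 0;
        while (++feature_count[3], i < a) {      // counted once per evaluation
          ++feature_count[4];                    // conditional 4 executed
          if (i % 2 == 0) { /* ... */ }
          ++i;
        }
      }
    }

    int main() {
      foo(150);
      for (int f = 1; f <= 4; ++f)
        std::printf("feature %d: %ld\n", f, feature_count[f]);
      return 0;
    }

One such vector of counts per run, together with the run's scale, is what feeds the model on the following slides.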

Modeling 19

Predict Feature from Scale X ~ vector of scale parameters X_1 ... X_N Y ~ number of times a particular feature occurs The model to predict Y from X: [equation shown as an image on the slide] Compute the prediction error: [equation shown as an image on the slide] 20

Predict Feature from Scale X ~ vector of scale parameters X_1 ... X_N Y ~ number of times a particular feature occurs The model to predict Y from X: [equation shown as an image on the slide] Compute the prediction error: [equation shown as an image on the slide] 21
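
The formula images did not survive in the transcript, but the idea on this slide is to fit, for each feature separately, a regression from the scale parameters X to the feature value Y, and then measure how far an observed value deviates from the prediction. The sketch below assumes a single scale parameter and a log-log linear fit with a simple relative-error measure; the paper's exact model form and error normalization may differ.

    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Minimal per-feature model: fit log(y) = b0 + b1*log(x) by ordinary
    // least squares, with ONE scale parameter x (assumption; WuKong may use
    // several scale parameters and additional terms).
    struct Model { double b0, b1; };

    Model fit(const std::vector<double>& x, const std::vector<double>& y) {
      double sx = 0, sy = 0, sxx = 0, sxy = 0;
      const double n = static_cast<double>(x.size());
      for (std::size_t i = 0; i < x.size(); ++i) {
        double lx = std::log(x[i]);
        double ly = std::log(y[i] + 1);          // +1 guards against log(0)
        sx += lx; sy += ly; sxx += lx * lx; sxy += lx * ly;
      }
      double b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx);
      double b0 = (sy - b1 * sx) / n;
      return {b0, b1};
    }

    double predict(const Model& m, double x) {
      return std::exp(m.b0 + m.b1 * std::log(x)) - 1;
    }

    // Relative prediction error of an observed count at scale x; the exact
    // normalization used by WuKong is not shown on the slide (assumption).
    double prediction_error(const Model& m, double x, double observed) {
      double expected = predict(m, x);
      return std::fabs(observed - expected) / (std::fabs(expected) + 1);
    }

    int main() {
      // Synthetic training data: a feature that grows with scale.
      std::vector<double> scale = {8, 16, 32, 64, 128};
      std::vector<double> count = {33, 65, 129, 257, 513};
      Model m = fit(scale, count);
      std::printf("predicted count at scale 1024: %.1f\n", predict(m, 1024));
      std::printf("error for an observed 8000:    %.2f\n",
                  prediction_error(m, 1024, 8000));
      return 0;
    }

In training, the x values would be the scales of the training runs and the y values the feature counts collected there; the error at the production scale is what the detection rule on the later slides consumes.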

Bug Localization 22

Locate Buggy Features 23 First, we need to know whether the production run is buggy at all, by doing detection as follows: flag the run as buggy if, for some feature i, the error of feature i in the production run exceeds a constant parameter times the maximum error of feature i across all training runs. If there is a bug in this run, we can then look at the prediction error of each feature: – Rank all features by their prediction error to provide a localization roadmap that contains the top N features
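
Read concretely, the rule above could be sketched as follows. Assume err[i] holds feature i's prediction error in the production run and max_train_err[i] its largest error across the training runs; the names delta (the slide's "constant parameter") and top_n are placeholders invented here.

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Sketch of detect-then-rank: flag the run if any feature's production
    // error exceeds delta times its worst training error, then rank the
    // flagged features by error to form the localization roadmap.
    std::vector<std::size_t> localize(const std::vector<double>& err,
                                      const std::vector<double>& max_train_err,
                                      double delta, std::size_t top_n) {
      std::vector<std::size_t> suspicious;
      for (std::size_t i = 0; i < err.size(); ++i)
        if (err[i] > delta * max_train_err[i])   // per-feature detection test
          suspicious.push_back(i);
      std::sort(suspicious.begin(), suspicious.end(),
                [&](std::size_t a, std::size_t b) { return err[a] > err[b]; });
      if (suspicious.size() > top_n) suspicious.resize(top_n);
      return suspicious;                         // empty => run looks bug-free
    }

    int main() {
      std::vector<double> err           = {0.05, 3.40, 0.10, 0.90};
      std::vector<double> max_train_err = {0.10, 0.20, 0.15, 0.30};
      for (std::size_t f : localize(err, max_train_err, 2.0, 3))
        std::printf("suspicious feature %zu (error %.2f)\n", f, err[f]);
      return 0;
    }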

Improve Localization Quality by Feature Pruning 24

Noisy Feature Pruning Some features cannot be effectively predicted by the above model – Random – Not determined by scale – Discontinuous The trade-off – Keeping these features would pollute the diagnosis by pushing real faults down the ranked list – Removing them could miss faults that happen to fall in such features 25

Noisy Feature Pruning How to remove them? For each feature: 1. Do a cross validation with training runs 2. Remove the feature if it triggers greater-than-100% prediction error in more than (100-x)% of training runs Parameter x > 0 is for tolerating outliers in training runs 26
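
Taken literally, the pruning rule could look like the sketch below. cv_error[i][r] is assumed to hold feature i's relative prediction error when training run r is held out during cross validation (computing those errors follows the model sketched earlier); the threshold direction follows the slide's wording.

    #include <cstddef>
    #include <vector>

    // Prune feature i if it shows greater-than-100% prediction error in
    // more than (100 - x)% of the cross-validation rounds, per the slide.
    // cv_error[i][r]: assumed input, feature i's error with run r held out.
    std::vector<bool> keep_features(
        const std::vector<std::vector<double>>& cv_error, double x_percent) {
      std::vector<bool> keep(cv_error.size(), true);
      for (std::size_t i = 0; i < cv_error.size(); ++i) {
        std::size_t bad = 0;
        for (double e : cv_error[i])
          if (e > 1.0) ++bad;                          // >100% error
        double bad_pct = 100.0 * bad / cv_error[i].size();
        if (bad_pct > 100.0 - x_percent)
          keep[i] = false;                             // too noisy: prune
      }
      return keep;
    }

Features pruned this way never enter the detection and ranking step, which is what keeps the localization roadmap short.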

Evaluation Fault injection in Sequoia AMG2006 – Up to 1024 processes – Randomly selected conditionals to be flipped Two case studies – Integer overflow in an MPI library – Deadlock in a P2P file-sharing application 27

Evaluation Fault injection in Sequoia AMG2006 – Up to 1024 processes – Randomly selected conditionals to be flipped Two case studies – Integer overflow in an MPI library – Deadlock in a P2P file-sharing application 28

Fault Injection Study Fault – Injected at process 0 – Randomly pick a feature to flip Data – Training (w/o fault): 110 runs, processes – Production (w/ fault): 100 runs, 1024 processes 29

Fault Injection Study Result 30 – Total: 100 – Non-crashing: 57 – Detected: 53 – Localized: 49 Successfully localized: 92.5% (49 of the 53 detected faults)

Evaluation Fault injection in Sequoia AMG2006 – Up to 1024 processes – Randomly selected conditionals to be flipped Two case studies – Integer overflow in an MPI library – Deadlock in a P2P file-sharing application 31

Evaluation Fault injection in Sequoia AMG2006 – Up to 1024 processes – Randomly selected conditionals to be flipped Two case studies – Integer overflow in an MPI library – Deadlock in a P2P file-sharing application 32

Case Study: A Deadlock in Transmission’s DHT Implementation 33

Case Study: A Deadlock in Transmission’s DHT Implementation 34

Case Study: A Deadlock in Transmission’s DHT Implementation 35 Features 53, 66

Conclusion Debugging scale-dependent program behavior is a difficult and important problem. WuKong incorporates the scale of a run into a predictive model for each individual program feature, enabling accurate bug diagnosis. We demonstrated WuKong's effectiveness through a large-scale fault injection study and two case studies of real bugs. 36

Q&A 37

Backup 38

Runtime Overhead 39 Geometric Mean: 11.4%