Scalable Statistical Bug Isolation Ben Liblit, Mayur Naik, Alice Zheng, Alex Aiken, and Michael Jordan University of Wisconsin, Stanford University, and UC Berkeley

Post-Deployment Monitoring

Goal: Measure Reality
Where is the black box for software?
–Crash reporting systems are a start
Actual runs are a vast resource
–Number of real runs >> number of testing runs
–Real-world executions are most important
This talk: post-deployment bug hunting
–Mining feedback data for causes of failure

What Should We Measure?
Function return values
Control flow decisions
Minima & maxima
Value relationships
Pointer regions
Reference counts
Temporal relationships
In other words, lots of things

    err = fetch(file, &obj);
    if (!err && count < size)
        list[count++] = obj;
    else
        unref(obj);
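As a minimal sketch (not the tool's actual generated code), a snippet like the one above might be instrumented so that each candidate predicate carries an "observed" counter and an "observed true" counter; fetch(), the driver loop, and the predicate numbering below are stand-ins invented for illustration.

    #include <stdio.h>

    /* Two counters per candidate predicate: how often it was observed,
       and how often it was observed to be true. */
    static unsigned long observed[4], observed_true[4];

    static void observe(int id, int pred_value)
    {
        observed[id]++;
        if (pred_value)
            observed_true[id]++;
    }

    static int fetch(int file, int *obj)     /* stand-in for the real fetch() */
    {
        *obj = file;
        return file < 0 ? -1 : 0;
    }

    int main(void)
    {
        int list[4], count = 0, size = 4, obj;

        for (int file = -2; file < 8; file++) {
            int err = fetch(file, &obj);
            observe(0, err != 0);                /* function return value   */
            observe(1, count < size);            /* value relationship      */
            observe(2, !err && count < size);    /* control-flow decision   */
            if (!err && count < size)
                list[count++] = obj;
            else
                observe(3, obj == 0);            /* pointer-check stand-in  */
        }

        for (int id = 0; id < 4; id++)
            printf("P%d: observed %lu, true %lu\n",
                   id, observed[id], observed_true[id]);
        return 0;
    }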

Our Model of Behavior
Any interesting behavior is expressible as a predicate P on program state at a particular program point.
Count how often “P observed true” and “P observed” using sparse but fair random samples of complete behavior.
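The sketch below shows one way to make those counts sparse but fair, in the spirit of the countdown sampling used by Cooperative Bug Isolation: draw the gap until the next recorded observation from a geometric distribution, so the common case at each site is just a decrement. The names, sampling density, and driver loop are assumptions for illustration, not the instrumentor's real output.

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define NUM_PREDS 2
    #define DENSITY   0.01              /* record roughly 1 in 100 observations */

    static unsigned long observed[NUM_PREDS], observed_true[NUM_PREDS];
    static long countdown;

    /* Geometric(DENSITY): how many observations to skip before the next sample. */
    static long next_gap(void)
    {
        double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);   /* u in (0,1) */
        return (long)ceil(log(u) / log(1.0 - DENSITY));
    }

    static void observe(int id, int pred_value)
    {
        if (--countdown > 0)
            return;                     /* fast path: skip, record nothing */
        countdown = next_gap();
        observed[id]++;                 /* "P observed"       */
        if (pred_value)
            observed_true[id]++;        /* "P observed true"  */
    }

    int main(void)
    {
        countdown = next_gap();
        for (int i = 0; i < 1000000; i++) {
            observe(0, i % 2 == 0);     /* true about half the time */
            observe(1, i % 100 == 0);   /* rarely true              */
        }
        for (int id = 0; id < NUM_PREDS; id++)
            printf("P%d: observed %lu, true %lu\n",
                   id, observed[id], observed_true[id]);
        return 0;
    }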

Bug Isolation Architecture
[Pipeline diagram] Program Source → Compiler (Sampler, Predicates) → Shipping Application → feedback reports: predicate counts & ✓/✗ outcome labels → Statistical Debugging → Top bugs with likely causes

Find Causes of Bugs
Gather information about many predicates
–298,482 predicates in bc
–857,384 predicates in Rhythmbox
Most are not predictive of anything
How do we find the useful bug predictors?
–Data is incomplete, noisy, irreproducible, …

Look For Statistical Trends
How likely is failure when P happens?

    Failure(P) = F(P) / (F(P) + S(P))

F(P) = # of failures where P observed true
S(P) = # of successes where P observed true

Good Start, But Not Enough

    if (f == NULL) {
        x = 0;
        *f;
    }

Failure( f == NULL ) = 1.0
Failure( x == 0 ) = 1.0
Predicate x == 0 is an innocent bystander
–Program is already doomed

Context
What is the background chance of failure, regardless of P’s truth or falsehood?

    Context(P) = F(P observed) / (F(P observed) + S(P observed))

F(P observed) = # of failures observing P
S(P observed) = # of successes observing P

Isolate the Predictive Value of P
Does P being true increase the chance of failure over the background rate?

    Increase(P) = Failure(P) – Context(P)

(a form of likelihood ratio testing)
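Putting the three scores together, here is a minimal sketch of computing them from per-predicate feedback counts. The struct layout, field names, and the example numbers are made up for illustration; they roughly mirror the f == NULL example on the next slide.

    #include <stdio.h>

    /* Per-predicate feedback counts (hypothetical field names):
       f_true / s_true : failing / successful runs where P was observed true
       f_obs  / s_obs  : failing / successful runs where P was observed at all */
    struct counts {
        double f_true, s_true, f_obs, s_obs;
    };

    /* Failure(P) = F(P) / (F(P) + S(P)) */
    static double failure(struct counts c)  { return c.f_true / (c.f_true + c.s_true); }

    /* Context(P) = F(P observed) / (F(P observed) + S(P observed)) */
    static double context(struct counts c)  { return c.f_obs / (c.f_obs + c.s_obs); }

    /* Increase(P) = Failure(P) - Context(P) */
    static double increase(struct counts c) { return failure(c) - context(c); }

    int main(void)
    {
        /* Made-up numbers: f == NULL is true only in the 10 failing runs;
           x == 0 is only ever observed after f == NULL, so it is doomed too. */
        struct counts f_is_null = { 10, 0, 10, 90 };
        struct counts x_is_zero = { 10, 0, 10, 0 };
        printf("f==NULL: Increase = %.2f\n", increase(f_is_null));  /* 0.90 */
        printf("x==0   : Increase = %.2f\n", increase(x_is_zero));  /* 0.00 */
        return 0;
    }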

Increase() Isolates the Predictor

    if (f == NULL) {
        x = 0;
        *f;
    }

Increase( f == NULL ) = 1.0
Increase( x == 0 ) = 0.0

Isolating a Single Bug in bc

    void more_arrays ()
    {
      …
      /* Copy the old arrays. */
      for (indx = 1; indx < old_count; indx++)
        arrays[indx] = old_ary[indx];

      /* Initialize the new elements. */
      for (; indx < v_count; indx++)
        arrays[indx] = NULL;
      …
    }

#1: indx > scale
#2: indx > use_math
#3: indx > opterr
#4: indx > next_func
#5: indx > i_base

It Works!
…for programs with just one bug.
Need to deal with multiple, unknown bugs
Redundant predictors are a major problem
Goal: isolate the best predictor for each bug, with no prior knowledge of the number of bugs.

Multiple Bugs: Some Issues
A bug may have many redundant predictors
–Only need one, provided it is a good one
Bugs occur on vastly different scales
–Predictors for common bugs may dominate, hiding predictors of less common problems

Guide to Visualization
Multiple interesting & useful predicate metrics
Graphical representation helps reveal trends
[Legend for the predicate plots: Increase(P), S(P), error bound, log(F(P) + S(P)), Context(P)]

Bad Idea #1: Rank by Increase(P)
High Increase() but very few failing runs!
These are all sub-bug predictors
–Each covers one special case of a larger bug
Redundancy is clearly a problem

Bad Idea #2: Rank by F(P)
Many failing runs but low Increase()!
Tend to be super-bug predictors
–Each covers several bugs, plus lots of junk

A Helpful Analogy
In the language of information retrieval
–Increase(P) has high precision, low recall
–F(P) has high recall, low precision
Standard solution:
–Take the harmonic mean of both
–Rewards high scores in both dimensions
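A small sketch of such a harmonic-mean score follows. The particular recall normalization used here (a log-scaled failure count) and the example numbers are assumptions for illustration, not necessarily the paper's exact formula; the point is that the harmonic mean rewards predicates that score well on both precision and recall.

    #include <stdio.h>
    #include <math.h>

    /* Harmonic mean of two scores in [0, 1]; zero if either score is zero. */
    static double harmonic_mean(double a, double b)
    {
        if (a <= 0.0 || b <= 0.0)
            return 0.0;
        return 2.0 / (1.0 / a + 1.0 / b);
    }

    /* Recall-style term: log-scaled share of all failing runs where P was true. */
    static double recall_term(double f_true, double total_failures)
    {
        if (f_true <= 1.0 || total_failures <= 1.0)
            return 0.0;
        return log(f_true) / log(total_failures);
    }

    static double importance(double increase_p, double f_true, double total_failures)
    {
        return harmonic_mean(increase_p, recall_term(f_true, total_failures));
    }

    int main(void)
    {
        /* Made-up numbers: a sub-bug predictor (high Increase, 2 failures) vs.
           a broader predictor (moderate Increase, 400 failures), out of 1000. */
        printf("sub-bug predictor: %.3f\n", importance(0.95, 2.0, 1000.0));
        printf("broad predictor:   %.3f\n", importance(0.60, 400.0, 1000.0));
        return 0;
    }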

Rank by Harmonic Mean
It works!
–Large increase, many failures, few or no successes
But redundancy is still a problem

Redundancy Elimination
One predictor for a bug is interesting
–Additional predictors are a distraction
–Want to explain each failure once
Similar to minimum set-cover problem
–Cover all failed runs with subset of predicates
–Greedy selection using harmonic ranking

Simulated Iterative Bug Fixing
1. Rank all predicates under consideration
2. Select the top-ranked predicate P
3. Add P to bug predictor list
4. Discard P and all runs where P was true
–Simulates fixing the bug predicted by P
–Reduces rank of similar predicates
5. Repeat until out of failures or predicates
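A compact sketch of this loop is below. The run-by-predicate truth matrix, the toy data, and the use of remaining-failure coverage as a stand-in for the harmonic-mean ranking score are all assumptions made to keep the example self-contained.

    #include <stdio.h>
    #include <string.h>

    #define NUM_PREDS     4
    #define NUM_FAIL_RUNS 6

    /* true_in_run[p][r] != 0 iff predicate p was observed true in failing run r. */
    static const unsigned char true_in_run[NUM_PREDS][NUM_FAIL_RUNS] = {
        {1,1,1,0,0,0},   /* P0 explains failing runs 0-2      */
        {1,1,0,0,0,0},   /* P1 is redundant with P0           */
        {0,0,0,1,1,0},   /* P2 explains failing runs 3-4      */
        {0,0,0,0,0,1},   /* P3 explains failing run 5         */
    };

    int main(void)
    {
        unsigned char live[NUM_FAIL_RUNS];      /* failing runs not yet explained */
        unsigned char used[NUM_PREDS] = {0};
        memset(live, 1, sizeof live);

        for (;;) {
            int best = -1, best_cover = 0;
            for (int p = 0; p < NUM_PREDS; p++) {
                if (used[p]) continue;
                int cover = 0;                  /* stand-in for the real rank score */
                for (int r = 0; r < NUM_FAIL_RUNS; r++)
                    cover += live[r] && true_in_run[p][r];
                if (cover > best_cover) { best_cover = cover; best = p; }
            }
            if (best < 0) break;                /* out of failures or predicates */

            printf("bug predictor: P%d (explains %d remaining failures)\n",
                   best, best_cover);
            used[best] = 1;
            for (int r = 0; r < NUM_FAIL_RUNS; r++)  /* "fix" the bug: drop its runs */
                if (true_in_run[best][r]) live[r] = 0;
        }
        return 0;
    }

On the toy data this reports P0, then P2, then P3; the redundant P1 is never selected because all of the failures it could explain are discarded along with P0.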

Experimental Results: exif
3 bug predictors from 156,476 initial predicates
Each predicate identifies a distinct crashing bug
All bugs found quickly using analysis results

Experimental Results: Rhythmbox
15 bug predictors from 857,384 initial predicates
Found and fixed several crashing bugs

Lessons Learned
Can learn a lot from actual executions
–Users are running buggy code anyway
–We should capture some of that information
Crash reporting is a good start, but…
–Pre-crash behavior can be important
–Successful runs reveal correct behavior
–Stack alone is not enough for 50% of bugs

Public Deployment in Progress

The Cooperative Bug Isolation Project
Join the Cause!

How Many Runs Are Needed?
Failing Runs For Bug #n (#1 #2 #3 #4 #5 #6 #9)
Moss: …
ccrypt: 26
bc: 40
Rhythmbox: 22, 35
exif: 28, 12, 13

How Many Runs Are Needed?
Total Runs For Bug #n (#1 #2 #3 #4 #5 #6 #9)
Moss: 500, 3K, 2K, …, 600
ccrypt: 200
bc: 200
Rhythmbox: …
exif: 2K, 300, 21K