Scalable Statistical Bug Isolation
Ben Liblit, Mayur Naik, Alice Zheng, Alex Aiken, and Michael Jordan, 2005
University of Wisconsin, Stanford University, and UC Berkeley
Presented by Mustafa Dajani, 27 Nov 2006, CMSC 838P
Overview of the Paper
The paper presents a statistical debugging algorithm that can isolate bugs in programs containing multiple undiagnosed bugs, and shows that the algorithm is practical and scalable across many software systems.
Outline: Introduction, Background, Cause Isolation Algorithm, Experiments
Objective of the Study
To develop a statistical algorithm that hunts for the causes of failures. Crash reporting systems are useful for collecting data; actual executions are a vast resource, and this feedback data can be mined for the causes of failures.
Introduction
Statistical debugging: a dynamic analysis for detecting the causes of run failures. An instrumented program monitors its own behavior by sampling information; this involves testing predicates at particular events during the run.
Predicates, P: candidate bug predictors; a large program may yield thousands of predicates.
Feedback Report, R: records whether a run succeeded or failed.
Introduction
The study’s model of behavior: “If P is observed to be true at least once during run R, then R(P) = 1; otherwise R(P) = 0.” In other words, random sampling counts how often “P observed true” and “P observed”.
Previous work used regularized logistic regression, which tries to select predicates that determine the outcome of every run. That approach produces redundant predictors and has difficulty isolating multiple bugs.
Introduction
Study design:
- determine all possible predicates
- eliminate predicates that have no predictive power
- loop {
-   rank the surviving predicates by importance
-   remove the top-ranked predicate P
-   discard all runs R where R(P) = 1, i.e. every run (passing or failing) in which P was observed true
- } until the set of runs or the set of predicates is empty
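The elimination loop above can be sketched in Python. This is a minimal sketch with hypothetical data structures: the real system works on sampled feedback reports, and the `importance` scoring function stands in for the paper's statistical ranking.

```python
def isolate_bugs(runs, predicates, importance):
    """Iterative cause-isolation loop.

    runs: list of (outcome, observed_true) pairs, where outcome is
          'fail' or 'pass' and observed_true is the set of predicates
          observed true in that run (so R(P) = 1 iff P is in the set).
    predicates: list of surviving candidate predicates.
    importance: scoring function (predicate, runs) -> number.
    """
    causes = []
    while runs and predicates:
        # Rank surviving predicates and take the top one.
        top = max(predicates, key=lambda p: importance(p, runs))
        causes.append(top)
        predicates.remove(top)
        # Discard every run R with R(top) = 1: runs explained by this
        # predictor no longer influence later rankings, which is what
        # lets one pass isolate several distinct bugs.
        runs = [r for r in runs if top not in r[1]]
    return causes
```

A simple importance function for illustration could count the failing runs in which the predicate was observed true; the paper's actual Importance score is described in the ranking slides below.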
Bug Isolation Architecture
Program Source → Compiler + Sampler (injects predicates) → Shipping Application → Counts & success/failure labels → Statistical Debugging → Top bugs with likely causes
Depicting failures through P
Failure(P) = F(P) / (F(P) + S(P))
F(P) = # of failures where P observed true
S(P) = # of successes where P observed true
Consider this code fragment:

    if (f == NULL) {
        x = 0;
        *f;
    }

When does a program fail?
Valid pointer assignment

    if (…)
        f = …some valid pointer…;
    *f;
Predicting P’s truth or falsehood
Context(P) = F(P observed) / (F(P observed) + S(P observed))
F(P observed) = # of failures observing P
S(P observed) = # of successes observing P
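Failure(P) and Context(P) combine into the score used for pruning: Increase(P) = Failure(P) − Context(P), i.e. how much more likely failure is when P is actually true than when the site of P is merely reached. A small sketch with made-up counts:

```python
def failure(F, S):
    # Failure(P) = F(P) / (F(P) + S(P))
    return F / (F + S)

def context(F_obs, S_obs):
    # Context(P) = F(P observed) / (F(P observed) + S(P observed))
    return F_obs / (F_obs + S_obs)

def increase(F, S, F_obs, S_obs):
    # Increase(P) = Failure(P) - Context(P).
    # Predicates with Increase(P) <= 0 have no predictive power
    # and are eliminated before the ranking loop starts.
    return failure(F, S) - context(F_obs, S_obs)

# Hypothetical predicate: observed in 40 failing and 60 successful
# runs, but observed *true* only in 30 failing runs and never in a
# successful one.
# Failure(P) = 1.0, Context(P) = 0.4, Increase(P) = 0.6
print(increase(30, 0, 40, 60))
```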
Notes
Two predicates are redundant if they predict the same, or nearly the same, set of failing runs. Because elimination is iterative, it is only necessary that Importance selects a good predictor at each step, not necessarily the best one.
Guide to Visualization
[Plot legend: each predicate is drawn showing Increase(P) with its error bound, Context(P), S(P), and log(F(P) + S(P)).]
Rank by Increase(P)
High Increase(P) but very few failing runs! These are all sub-bug predictors
– Each covers one special case of a larger bug
Redundancy is clearly a problem
Rank by F(P)
Many failing runs but low Increase(P)! These tend to be super-bug predictors
– Each covers several bugs, plus lots of junk
Notes
In the language of information retrieval:
– Increase(P) has high precision, low recall
– F(P) has high recall, low precision
Standard solution: take the harmonic mean of both, which rewards high scores in both dimensions
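The harmonic-mean combination can be sketched as follows. The normalization of the failure count as log(F(P)) / log(NumF) follows the paper's Importance score; it maps the recall-like term into [0, 1] so the two terms are comparable.

```python
import math

def importance(increase_p, F_p, num_failures):
    """Harmonic mean of Increase(P) (precision-like) and a
    log-normalized failure count (recall-like)."""
    recall = math.log(F_p) / math.log(num_failures)
    # The harmonic mean is undefined for non-positive terms; such
    # predicates score zero and never reach the top of the ranking.
    if increase_p <= 0 or recall <= 0:
        return 0.0
    return 2.0 / (1.0 / increase_p + 1.0 / recall)
```

Because the harmonic mean is dominated by its smaller term, a predicate must score well on both Increase(P) and failure coverage to rank highly, which is exactly the behavior the two preceding slides motivate.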
Rank by Harmonic Mean
It works!
– Large increase, many failures, few or no successes
But redundancy is still a problem
Lessons Learned
Can learn a lot from actual executions:
– Users are running buggy code anyway
– We should capture some of that information
Crash reporting is a good start, but…
– Pre-crash behavior can be important
– Successful runs reveal correct behavior
– The stack alone is not enough for 50% of bugs