Statistical Debugging: A Tutorial
Steven C.H. Hoi
Acknowledgement: Some slides in this tutorial were borrowed from Chao Liu at UIUC.
Motivations
  - Software is full of bugs
    - Windows 2000 had about 63,000 known bugs at its time of release, roughly 2 bugs per 1,000 lines
    - A study by the National Institute of Standards and Technology showed that software faults cost the U.S. economy about $59.5 billion annually
  - Testing and debugging are laborious and expensive
    - “50% of my company employees are testers, and the rest spends 50% of their time testing!” --Bill Gates, in 1995
Expedite Debugging
  - Manual debugging
    - Trace the execution step by step
    - Verify observations against expectations
  - Automated debugging
    - Collect runtime behaviors as the program executes
    - Identify bug-relevant points by contrasting correct and incorrect executions
  - Best efforts so far: bug localization
An Example
  - Symptoms
    - 563 lines of C code
    - 130 out of 5542 test cases fail to give correct outputs
    - No crashes
  - Conventional debugging
    - Few hints
    - Step-by-step tracing
  - Better method
    - Pinpoint the buggy line

  Buggy version (the first if tests only m >= 0):

    void subline(char *lin, char *pat, char *sub)
    {
        int i, lastm, m;

        lastm = -1;
        i = 0;
        while ((lin[i] != ENDSTR)) {
            m = amatch(lin, i, pat, 0);
            if (m >= 0) {               /* missing: && (lastm != m) */
                putsub(lin, i, m, sub);
                lastm = m;
            }
            if ((m == -1) || (m == i)) {
                fputc(lin[i], stdout);
                i = i + 1;
            } else
                i = m;
        }
    }

  Correct version (the first if also requires lastm != m):

    void subline(char *lin, char *pat, char *sub)
    {
        int i, lastm, m;

        lastm = -1;
        i = 0;
        while ((lin[i] != ENDSTR)) {
            m = amatch(lin, i, pat, 0);
            if ((m >= 0) && (lastm != m)) {
                putsub(lin, i, m, sub);
                lastm = m;
            }
            if ((m == -1) || (m == i)) {
                fputc(lin[i], stdout);
                i = i + 1;
            } else
                i = m;
        }
    }
Review of Recent Work
  - SOBER Algorithm
  - Cause Transition Algorithm
  - Statistical Debugging: Liblit05
  - Statistical Debugging: Simultaneous Identification of Multiple Bugs
SOBER: Statistical Model-based Bug Localization
  - Program Predicates
  - Predicate Rankings
  - Experimental Results
Program Predicates
  - A predicate is a proposition about any program property
    - e.g., idx 0 …
  - Each predicate can be evaluated multiple times during one execution
  - Every evaluation gives either true or false
  - Therefore, a predicate is simply a boolean random variable, which encodes program executions from a particular aspect
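To make "evaluated multiple times, each time true or false" concrete, here is a minimal sketch of how a predicate's evaluations might be recorded during one run. The type PredCounter and the function observe are hypothetical names introduced only for illustration; they are not taken from any of the tools discussed in this tutorial.

```c
#include <stdbool.h>

/* Hypothetical per-predicate counters for a single execution. */
typedef struct {
    unsigned long n_true;   /* evaluations that came out true */
    unsigned long n_total;  /* all evaluations in this run    */
} PredCounter;

/* Record one evaluation of a predicate and pass its value through,
   so the call can wrap a condition without changing control flow. */
static bool observe(PredCounter *c, bool value)
{
    c->n_total += 1;
    if (value)
        c->n_true += 1;
    return value;
}
```

Under these assumptions, a condition such as `lastm != m` in subline could be wrapped as `observe(&counter, lastm != m)`; the branch behaves as before while the counters accumulate.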
Evaluation Bias of Predicate P
  - Evaluation bias
    - Definition: the probability that P is evaluated as true within one execution
    - Maximum likelihood estimate: the number of true evaluations divided by the total number of evaluations in one run
    - Each run gives one observation of the evaluation bias for predicate P
  - Suppose we have n correct and m incorrect executions. For any predicate P, we end up with
    - An observation sequence for correct runs: S_p = (X’_1, X’_2, …, X’_n)
    - An observation sequence for incorrect runs: S_f = (X_1, X_2, …, X_m)
  - Can we infer whether P is suspicious based on S_p and S_f?
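The maximum likelihood estimate above can be sketched in a few lines of C. RunCounts, evaluation_bias and bias_sequence are hypothetical names. How to treat a run in which P is never evaluated is not stated on the slide; this sketch assumes such runs are simply skipped.

```c
#include <stddef.h>

/* Counts for one predicate P in one run (hypothetical layout). */
typedef struct {
    unsigned long n_true;   /* evaluations of P that were true */
    unsigned long n_total;  /* all evaluations of P            */
} RunCounts;

/* Evaluation bias for one run: n_true / n_total.
   Returns -1.0 when P was never evaluated (assumption: caller skips the run). */
static double evaluation_bias(RunCounts r)
{
    if (r.n_total == 0)
        return -1.0;
    return (double)r.n_true / (double)r.n_total;
}

/* Build an observation sequence (S_p or S_f) from a set of runs.
   Writes one bias value per usable run into out[] and returns how many
   values were written; out[] must have room for n_runs entries. */
static size_t bias_sequence(const RunCounts *runs, size_t n_runs, double *out)
{
    size_t k = 0;
    for (size_t i = 0; i < n_runs; i++) {
        double b = evaluation_bias(runs[i]);
        if (b >= 0.0)
            out[k++] = b;
    }
    return k;
}
```

Calling bias_sequence once on the correct runs and once on the incorrect runs yields the two sequences S_p and S_f for a predicate.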
Underlying Populations
  - Imagine the underlying distributions of evaluation bias for correct and for incorrect executions
  - S_p and S_f can be viewed as random samples from these two underlying populations, respectively
  - One major heuristic: the larger the divergence between the two distributions, the more relevant the predicate P is to the bug
  [Figure: two plots of probability against evaluation bias over the range 0 to 1]
Major Challenges
  - No knowledge of the closed forms of both distributions
  - Usually, we do not have enough incorrect executions to estimate the distribution for incorrect runs reliably
  [Figure: two plots of probability against evaluation bias over the range 0 to 1]
SOBER’s Approach
Algorithm Outputs
  - A ranked list of program predicates w.r.t. the bug relevance score s(P)
    - Higher-ranked predicates are regarded as more relevant to the bug
  - What’s the use?
    - Top-ranked predicates suggest the possible buggy regions
    - Several predicates may point to the same region
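As a rough picture of how a ranked list could be produced from S_p and S_f, the sketch below scores each predicate and sorts by score. The score is a simplified stand-in chosen for illustration (a squared z-statistic comparing the mean evaluation bias of incorrect runs with that of correct runs); it is an assumption, not SOBER's actual definition of s(P), which should be taken from the original paper. All names are hypothetical.

```c
#include <stdlib.h>
#include <stddef.h>

typedef struct {
    const char *name;  /* label of the predicate, for reporting    */
    double score;      /* relevance score; larger = more suspicious */
} RankedPred;

static double mean(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum / (double)n;
}

/* Sample variance of x around mu; requires n >= 2. */
static double variance(const double *x, size_t n, double mu)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += (x[i] - mu) * (x[i] - mu);
    return sum / (double)(n - 1);
}

/* Stand-in divergence: m * (mean(S_f) - mean(S_p))^2 / var(S_p).
   Requires n >= 2 and m >= 1. */
static double divergence_score(const double *sp, size_t n,
                               const double *sf, size_t m)
{
    double mu_p = mean(sp, n);
    double var_p = variance(sp, n, mu_p);
    double diff = mean(sf, m) - mu_p;
    if (var_p < 1e-12)      /* assumption: floor to avoid division by zero */
        var_p = 1e-12;
    return (double)m * diff * diff / var_p;
}

/* qsort comparator: higher scores first. */
static int by_score_desc(const void *a, const void *b)
{
    double sa = ((const RankedPred *)a)->score;
    double sb = ((const RankedPred *)b)->score;
    return (sa < sb) - (sa > sb);
}

static void rank_predicates(RankedPred *preds, size_t count)
{
    qsort(preds, count, sizeof preds[0], by_score_desc);
}
```

After rank_predicates returns, preds[0] holds the predicate this stand-in score considers most suspicious; a developer would start examining code near the top-ranked entries.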
Cause Transition (CT)
  - “Locating Causes of Program Failures”, Cleve and Zeller, published in ICSE’05, May 15, 2005
  - A variant of delta debugging [Z02]
  - Previous state-of-the-art performance holder on the Siemens suite
  - Con: it relies on memory abnormalities, which restricts its performance
Statistical Debugging: Liblit05
  - “Scalable Statistical Bug Isolation”, Liblit et al., published in PLDI’05, June 12, 2005
  - Main idea: rank predicates according to their correlation with program crashes
Statistical Debugging: Liblit05
  - Context(P) = Pr(Crash | P observed)
  - Failure(P) = Pr(Crash | P observed as true)
  - The probability difference: Increase(P) = Failure(P) – Context(P)
  - Limitation: ignores the evaluation patterns of predicates within each execution
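The three quantities above can be estimated from simple run counts, as in the following sketch. The struct layout and function names are hypothetical, and returning 0 when a denominator is zero is an assumption made to keep the example total.

```c
/* Run counts for one predicate P (hypothetical layout). */
typedef struct {
    unsigned long f_true;  /* failing runs in which P was observed true    */
    unsigned long s_true;  /* successful runs in which P was observed true */
    unsigned long f_obs;   /* failing runs in which P was observed at all  */
    unsigned long s_obs;   /* successful runs in which P was observed at all */
} PredStats;

/* num / den, or 0.0 when den is zero (assumption for this sketch). */
static double ratio(unsigned long num, unsigned long den)
{
    return den ? (double)num / (double)den : 0.0;
}

/* Failure(P): estimate of Pr(Crash | P observed as true). */
static double failure_score(PredStats p)
{
    return ratio(p.f_true, p.f_true + p.s_true);
}

/* Context(P): estimate of Pr(Crash | P observed). */
static double context_score(PredStats p)
{
    return ratio(p.f_obs, p.f_obs + p.s_obs);
}

/* Increase(P) = Failure(P) - Context(P). */
static double increase_score(PredStats p)
{
    return failure_score(p) - context_score(p);
}
```

Because each run contributes only a yes/no fact per predicate, a predicate that is true once in a run and one that is true on every evaluation in that run are counted identically, which is the limitation noted above.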
Experimental Results
  - Localization quality metric
    - Software bug benchmark
    - Quantitative metric
  - Related works
    - Cause Transition (CT) [CZ05]
    - Statistical Debugging [LN+05]
  - Performance comparisons
Bug Benchmark
  - Dream benchmark
    - A large number of known bugs in large-scale programs with adequate test suites
  - Siemens Program Suite
    - 130 variants of 7 subject programs, each of … LOC
    - 130 known bugs in total
    - Mainly logic (or semantic) bugs
  - Advantages
    - Known bugs, thus judgments are objective
    - Large number of bugs, thus a comparative study is statistically significant
  - Disadvantages
    - Small-scale subject programs
  - State-of-the-art performance claimed so far in the literature: the Cause Transition approach [CZ05]
Localization Quality Metric [RR03]
1st Example T-score = 70%
2nd Example T-score = 20%
Localized bugs w.r.t. Examined Code
Cumulative Effects w.r.t. Code Examination
Top-k Selection
  - Regardless of the specific choice of k, both Liblit05 and SOBER are better than CT, the current state-of-the-art holder
  - From k=2 to 10, SOBER is consistently better than Liblit05
Conclusion and Discussion
  - A tutorial on statistical debugging
  - Discussion on future work
    - Better statistical models …
    - Identification of multiple bugs
    - Robustness to sampling
    - …