50.530: Software Engineering Sun Jun SUTD. Week 3: Delta Debugging.

50.530: Software Engineering Sun Jun SUTD

Week 3: Delta Debugging

Debugging Question 1: Bug Localization/Identification – How do we know where the bugs are? Question 2: Bug Fixing – How do we fix the bugs automatically? the behaviors we wanted A B C the behaviors we have

“YESTERDAY, MY PROGRAM WORKED. TODAY, IT DOES NOT. WHY?” Andreas Zeller, ESEC/FSE 1999

Motivation “The GDB people have done it again. The new release 4.17 of the GNU debugger [6] brings several new features, languages, and platforms, but for some reason, it no longer integrates properly with my graphical front-end DDD [10]: the arguments specified within DDD are not passed to the debugged program. Something has changed within GDB such that it no longer works for me. Something? Between the 4.16 and 4.17 releases, no less than 178,000 lines have changed. How can I isolate the change that caused the failure and make GDB work again?” Andrew Zeller, 1999

Research Question Assume that we know the bug(s) is due to a finite set of changes, how do we know which change(s) is responsible?

Regression Containment let O be the original working program; let N be the new buggy program; let changes be the sequence of changes from O to N; while (true) { apply first half of the changes on O and get ON; if (ON is working) { let O := ON; } else { let N := ON } if (the number of changes from O to N is 1) return; } let O be the original working program; let N be the new buggy program; let changes be the sequence of changes from O to N; while (true) { apply first half of the changes on O and get ON; if (ON is working) { let O := ON; } else { let N := ON } if (the number of changes from O to N is 1) return; } What assumptions are needed for this to work?

Complications Interference – There may not be one single change responsible for a failure, but a combination of several changes. Inconsistency – (in parallel development) combinations of changes may not result in a testable program Granularity – A single logical change may affect several hundred or even thousand lines of code, but only a few lines may be responsible for the failure.

Delta Debugging Find a minimal set of changes that cause a program to fail a test case, through automatically generating test cases Changes could be anything

Delta Debugging O: applied an empty set of changes N: applied all changes Let {a, b, c, …, n} be the set of changes. ON: applied any subset X of the changes

Testing We assume a function test which produces three outputs – PASS: the test succeeds – FAIL: the test produced the failure it was indented to capture – ?: the test produced indeterminate results Let X be a set of changes and test(X) to denote the testing result with the changes in X.

Objective Find a set of changes X such that – test(X) != PASS, i.e., failure-inducing – test(Y) != FAIL for all subset Y of X, i.e., minimum set ON: minimum failure-inducing changes The complexity is exponential. Why?

Assumptions for Simplification Monotonicity If test(X) = FAIL, test(Y) != PASS for all Y which is a superset of X. Unambiguity If test(X) = FAIL and test(Y) = FAIL, test(X intersect Y) != PASS, i.e., a failure is caused by one change set (and not independently by two disjoint sets) Consistency test(X) != ? for all X Justified?

DD Algorithm: Example

DD Algorithm algorithm DD(U, R) { if (U has one element only) { return U; } partition U into X and Y equally; if (test(X union R) = FAIL) { return DD(X, R) } if (test(Y union R) = FAIL) { return DD(Y, R) } return DD(X, Y union R) union DD(Y, X union R); } C: a set of changes; R: changes remain to be applied

Exercise: Complete the Call Graph DD({1..8}, {}) DD({1..4}, {5..8}) DD({5..8}, {1..4})

Theory Theorem: Assume monotonicity, unambiguity and consistency, algorithm DD always returns the minimum failure-inducing set of changes.

Informal Proof algorithm DD(U, R) { if (U has one element only) { return U; } partition U into X and Y equally; if (test(X union R) = FAIL) { return DD(X, R) } if (test(Y union R) = FAIL) { return DD(Y, R) } return DD(X, Y union R) union DD(Y, X union R); } Justified by monotonicity and unambiguity Justified by consistency Justified by monotonicity and unambiguity

Ambiguity Unambiguity: If test(X) = FAIL and test(Y) = FAIL, test(X intersect Y) != PASS, i.e., a failure is caused by one change set (and not independently by two disjoint sets) What if it is ambiguous? DD will find one failure-inducing changes. Remove this set and apply DD again to find another.

Not Monotonic Monotonicity: If test(X) = FAIL, test(Y) != PASS for all Y which is a superset of X. What if it is not monotonic? For instance, what if test({1,2}) = FAIL but test({1..4}) = PASS? DD will find some failure-inducing changes.

Inconsistency Consistency: test(X) != ? for all X Integration failure. A change may require earlier changes that are not included in the configuration Construction failure. Although all changes can be applied, the resulting program has syntactical or semantic errors, such that construction fails. Execution failure. The program does not execute correctly; the test outcome is unresolved. Shall we enforce consistency for each commit of changes?

Research Question How do we modify the DD algorithm so as to handle inconsistency? Found: If test(X) = FAIL, X contains a failure- inducing subset. Interference: If test(X) = PASS = test(Y), X and Y form an interference. What if test(X) =? or test(Y) =? or both?

Example At step 3, can we omit {5..8}?

Example Preference: If test(X) =? and test(U-X) = PASS, X contains a failure-inducing subset and is preferred. Preference: If test(X) =? and test(U-X) = PASS, X contains a failure-inducing subset and is preferred.

Example Scenario: there are 1..8 changes. Change 8 is failure-inducing, and changes 2, 3 and 7 imply each other – that is, they only can be applied as a whole.

Example

Is it necessary to have {5,6}?

DD+ Algorithm algorithm DD+(U, R, N) { if (U has one element only) {//found return U; } partition U into N sets X1, X2, …, Xn equally; if (test(Xi union R ) = FAIL for some Xi) {//found in Xi return DD+(Xi, R, 2); } if (test(Xi union R) = PASS and test((U – Xi) union R) = PASS for some Xi) {//interference return DD+(Xi, (U-Xi) union R, 2) union DD+(U-Xi-R, Xi union R, 2); } if (test(Xi union R) = ? and test((U – Xi) union R) = PASS for some Xi) {//preference return DD+(Xi, (U – Xi) union R, 2); } let U’ = U intersect {U – Xi – R| test((U – Xi) union R) = FAIL}; if (N < |U|) {//try again return DD+(U’, R union {Xi | test(Xi union R) = PASS}, min(|U’|, 2N)); } return U’; //nothing left }

Exercise 1 Convince yourself by applying the algorithm (i.e., DD+({1..8}, {}, 2)) to the following scenario. There are 1..8 changes. Change 8 is failure- inducing, and changes 2, 3 and 7 imply each other – that is, they only can be applied as a whole.

Reducing Inconsistency What if we know that 2 and 3 and 7 imply each other? Partition the set into two sets {1,2,3,7} and {4,5,6,8}. How do we know what changes imply each other?

Grouping Related Changes To determine whether changes are related, one can use process criteria: common change dates or sources, location criteria: the affected file or directory, lexical criteria: common referencing of identifiers, syntactic criteria: common syntactic entities (functions, modules) affected by the change, semantic criteria: common program statements affected by the changed control or data flow.

Example Assume that each change requires all earlier changes to be consistent.

Case Study: DDD 3.1.2 Dumps Core DDD 3.1.2, released in December, 1998, exhibited a nasty behavioral change: When invoked with a the name of a non-existing file, DDD 3.1.2 dumped core, while its predecessor DDD 3.1.1 simply gave an error message. The DDD configuration management archive lists 116 logical changes between the 3.1.1 and 3.1.2 releases. These changes were split into 344 textual changes to the DDD source.

Random Clustering start with 344 changes reduced to 172 changes 8 subsets of changes reduced to 16 changes reduced to 1 change 31 tests in total

Grouping Changes 1.Changes were grouped according to the date they were applied. 2.Each change implied all earlier changes. 12 test runs and 58 minutes

Case Study: GDB 178,000 changed GDB lines, grouped into 8721 textual changes in the GDB source, with any two textual changes separated by at least two unchanged lines. No configuration management archive to obtain change dates, etc.

Random Clustering Most of the first 457 tests result in ? At test 458, one set containing 36 changes resulted in FAIL 470 tests in total; 48 hours.

Grouping Changes At top-level, changes were grouped according to directories. This was motivated by the observation that several GDB directories contain a separate library whose interface remains more or less consistent across changes. Within one directory, changes were grouped according to common files. The idea was to identify compilation units whose interface was consistent with both “yesterday’s” and “today’s” version. Within a file, changes were grouped according to common usage of identifiers. This way, we could keep changes together that operated on common variables or functions. Finally, failure resolution loop: After a failing construction, scans the error messages for identifiers, adds all changes that reference these identifiers and tries again. This is repeated until construction is possible, or until there are no more changes to add.

Grouping Changes After 9 tests, 2547 changes left After 280 tests, 18 changes left (of two files only) 289 tests in total, 20 hours

Question Is Delta Debugging an over-kill?

“SIMPLIFYING AND ISOLATING FAILURE-INDUCING INPUT” Andreas Zeller, IEEE Transactions on Software Engineering, 2002

Motivation In July 1999, Bugzilla, the Mozilla bug database, listed more than 370 open bug reports—bug reports that were not even simplified. With this queue growing further, the Mozilla engineers “faced imminent doom”. Overwhelmed with work, the Netscape product manager sent out the Mozilla BugAThon call for volunteers: people who would help simplify bug reports. “Simplifying” meant: turning these bug reports into minimal test cases, where every part of the input would be significant in reproducing the failure.

Motivation: Example All Windows 3.1 Windows 95 Windows 98 Windows ME Windows 2000 Windows NT Mac System 7 Mac System 7.5 Mac System 7.6.1 Mac System 8.0 Mac System 8.5 Mac System 8.6 Mac System 9.x MacOS X Linux BSDI FreeBSD NetBSD OpenBSD AIX BeOS HP-UX<OPTION VALUE="IRIX">IRIX Neutrino OpenVMS OS/2 OSF/1 Solaris SunOS other -- P1 P2 P3 P4 P5 blocker critical major<OPTION VALUE="normal">normal minor trivial enhancement Loading this HTML page into Mozilla and printing it causes a segmentation fault.

Research Question How do we automatically minimize a test case?

Research Question How do we automatically minimize a test case? Solution: using Delta Debugging Yesterday: test(“”) = PASS Today: test(that html page) = FAIL The set of changes = {insert one character} How do we automatically minimize a test case? Solution: using Delta Debugging Yesterday: test(“”) = PASS Today: test(that html page) = FAIL The set of changes = {insert one character} Is this the best definition?

Example

Generalize DD A change can be anything which changes the circumstances of a program run, e.g., program code changes (as we have seen) program input changes (as we’re about to see) system configurations?

Generalize DD DD Requires a set of primitive changes (which can be composed) a test function (with different outputs like: PASS, FAIL, ?) What should be the primitive changes?

Minimal Test Case Let U be the set of changes containing the primitive change of inserting one character at a position in the input. The problem is to find the minimum set X (which corresponds to the minimum input) such that test(X) = FAIL.

Minimality Local Minimum: a test case (represented as a set of changes X) is local minimum if test(X) = FAIL and test(Y) != FAIL for all subset Y of X. 1-Minimal test case: a test case X is 1-minimal if test(X) = FAIL and test(Y) != FAIL for all set Y which is one-element less of X. Why? Because global minimum is hard to achieve.

DDmin Algorithm algorithm DDmin(U, N) { if (|U| = 1) {//minimum return U; } partition U into X1, X2, …, Xn equally; if (test(Xi) = FAIL for some Xi) {//reduce to subset return DDmin(Xi, 2); } if (test(U - Xi) = FAIL for some Xi) {//reduce to complement return DDmin(U - Xi, max(N-1, 2)); } if (N < |U|) {//increase granularity return DDmin(U, min(|U|, 2N)); } return U; }

Example DDmin({1..8}, 2) DDmin({5..8}, 2) DDmin({7..8}, 2) DDmin({7}, 2)

Exercise Show how DDmin works for this example.

Theory DDmin(U, 2) is guaranteed to return a 1-minimal test case. algorithm DDmin(U, N) { if (|U| = 1) {//minimum return U; } partition U into X1, X2, …, Xn equally; if (test(Xi) = FAIL) {//reduce to subset return DDmin(Xi, 2); } if (test(U - Xi) = FAIL) {//reduce to complement return DDmin(U - Xi, max(N-1, 2)); } if (N < |U|) {//increase granularity return DDmin(U, min(|U|, 2N)); } return U; } Reduce U in these cases If U can’t be reduced further, eventually we would test every sunset of U with one less element.

Research Question Why not simply apply the DD+ algorithm?

Case Study: GCC This program causes GNU C compiler (GCC version 2.95.2 on Intel-Linux with optimization enabled) to crash. Each change inserts the i-th character into the program.

Case Study: GCC After 731 tests generated by DDmin (and 34 seconds) Does it make sense to reduce the method name?

Case Study: GCC GCC has 31 options Can we find out which option is relevant? Yes, simply run DDmin again. Can we find out which option is relevant? Yes, simply run DDmin again.

Isolation The DDmin algorithm is designed to find a failing test case with “minimal” difference from the known passing test case test({}). The problem of isolation: find a failing test case with “minimal” difference from any passing test case.

Example Test 4 and 7 show one character difference which changes the testing result from PASS to FAIL. Useful? Test 4 and 7 show one character difference which changes the testing result from PASS to FAIL. Useful?

Isolation: Definition Let P and F be two sets of changes such that test(P) = PASS and test(F) = FAIL and P is a subset of F. F-P is minimal if test(P union X) != PASS and test(F-X) != FAIL for all proper subset X of F-P. F-P is 1-minimal if test(P union {x}) != PASS and test(F-{x}) != FAIL for all element x of F-P.

DD# Algorithm U {} Initially, F = U and P = {}

DD# Algorithm U {} Gradually grow P and reduce F P F

DD# Algorithm algorithm DD#(P, F, N) { partition F-P into X1, X2, …, Xn equally; if (test(P union Xi) = FAIL for some Xi) {//reduce F return DD#(P, P union Xi, 2); } if (test(F - Xi) = PASS for some Xi) {//grow P return DD#(F-Xi, F, 2); } if (test(P union Xi = PASS)) {//grow P return DD#(P union Xi, F, max(N-1, 2))); } if (test(F - Xi = FAIL)) {//reduce F return DD#(P, F-Xi, max(N-1, 2))); } if (N < |F-P|) {//increase granularity return DD#(P, F, min(2N, |F-P|)); } return (P, F); }

Example DD#(T1, T2, 2) DD#(T3, T2, 2) DD#(T3, T4, 2) DD#(T5, T4, 2) DD#(T6, T4, 2) DD#(T7, T4, 2)

Theory DD#({},U,2) is guaranteed to return a 1-minimal difference test case. algorithm DD#(P, F, N) { partition F-P into X1, X2, …, Xn equally; if (test(P union Xi) = FAIL for some Xi) {//reduce F return DD#(P, P union Xi, 2); } if (test(F - Xi) = PASS for some Xi) {//grow P return DD#(F-Xi, F, 2); } if (test(P union Xi = PASS)) {//grow P return DD#(P union Xi, F, max(N-1, 2))); } if (test(F - Xi = FAIL)) {//reduce F return DD#(P, F-Xi, max(N-1, 2))); } if (N < |F-P|) {//increase granularity return DD#(P, F, min(2N, |F-P|)); } return (P, F); } Why?

Case Study: GCC DDmin: after 731 test cases, the result is DD#: (PASS means compilation succeeded; FAIL means compiler crashed; ? means all other cases) after 59 tests, the result is (see next page)

Case Study: GCC This suggests a problem with inlining the expression i+j+1 in the array Accesses z[i] on the following line This suggests a problem with inlining the expression i+j+1 in the array Accesses z[i] on the following line

Exercise 2 Take this program and this input as example. Apply delta debugging to locate the bug.this program this input

Research Question How would you improve Delta Debugging?

Research Question How would you improve Delta Debugging? Are all generated test cases meaningful? How do we partition the input? Are all generated test cases meaningful? How do we partition the input?

Research Question How would you improve Delta Debugging? Which test case to run first?

Research Question How would you improve Delta Debugging? What if there are multiple failure-inducing inputs?

Discussion Assume we given the following program and Delta Debugging run. How do we make of the result?

Research Question How would you improve Delta Debugging? How can Delta Debugging be used to identify which statement in a program is buggy? (for simplicity, assume there is only one statement which is buggy).

50.530: Software Engineering Sun Jun SUTD. Week 3: Delta Debugging.

Similar presentations

Presentation on theme: "50.530: Software Engineering Sun Jun SUTD. Week 3: Delta Debugging."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

50.530: Software Engineering Sun Jun SUTD. Week 3: Delta Debugging.

Similar presentations

Presentation on theme: "50.530: Software Engineering Sun Jun SUTD. Week 3: Delta Debugging."— Presentation transcript:

Similar presentations

About project

Feedback