50.530: Software Engineering


1 50.530: Software Engineering
Sun Jun SUTD

2 Week 3: Delta Debugging

3 Debugging
Question 1: Bug Localization/Identification. How do we know where the bugs are?
Question 2: Bug Fixing. How do we fix the bugs automatically?
[Figure: diagram with regions A, B, C relating "the behaviors we wanted" to "the behaviors we have"]

4 “Yesterday, my program worked. Today, it does not. Why?”
Andreas Zeller, ESEC/FSE 1999

5 Motivation
“The GDB people have done it again. The new release 4.17 of the GNU debugger [6] brings several new features, languages, and platforms, but for some reason, it no longer integrates properly with my graphical front-end DDD [10]: the arguments specified within DDD are not passed to the debugged program. Something has changed within GDB such that it no longer works for me. Something? Between the 4.16 and 4.17 releases, no less than 178,000 lines have changed. How can I isolate the change that caused the failure and make GDB work again?”
Andreas Zeller, 1999

6 Research Question
Assuming we know that the bug is due to a finite set of changes, how do we find out which change(s) are responsible?

7 Regression Containment
let O be the original working program;
let N be the new buggy program;
let changes be the sequence of changes from O to N;
while (true) {
  apply the first half of the changes between O and N to O and get ON;
  if (ON is working) {
    let O := ON;
  } else {
    let N := ON;
  }
  if (the number of changes from O to N is 1) return;
}
What assumptions are needed for this to work?
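The loop above is a binary search over prefixes of the change sequence. A minimal Python sketch follows, under assumptions the slide leaves open: changes are applied strictly in order, every prefix yields a testable program, and a single change is responsible. The names `regression_containment` and `works` are hypothetical and introduced only for illustration.

```python
from typing import Callable, Sequence


def regression_containment(
    changes: Sequence[str],
    works: Callable[[Sequence[str]], bool],
) -> str:
    """Return the change whose application turns a working program into a failing one.

    Assumes works([]) is True (the original program O works) and
    works(changes) is False (the new program N fails).
    """
    lo, hi = 0, len(changes)  # prefix of length lo works, prefix of length hi fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if works(changes[:mid]):
            lo = mid  # O := ON
        else:
            hi = mid  # N := ON
    return changes[hi - 1]
```

With `n` changes this needs about log2(n) test runs, but only because it assumes a single culprit and that every prefix can be built and tested.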

8 Complications
Interference: there may not be one single change responsible for a failure, but a combination of several changes.
Inconsistency (in parallel development): combinations of changes may not result in a testable program.
Granularity: a single logical change may affect several hundred or even thousands of lines of code, but only a few lines may be responsible for the failure.

9 Delta Debugging
Find a minimal set of changes that causes a program to fail a test case, by automatically generating test cases.
Changes could be anything.

10 Delta Debugging
Let {a, b, c, …, n} be the set of changes.
ON: the program with any subset X of the changes applied
O: the program with the empty set of changes applied
N: the program with all changes applied

11 Testing
We assume a function test which produces one of three outputs:
PASS: the test succeeds
FAIL: the test produced the failure it was intended to capture
?: the test produced indeterminate results
Let X be a set of changes; test(X) denotes the testing result with the changes in X applied.

12 Objective
Find a set of changes X such that
test(X) != PASS, i.e., X is failure-inducing;
test(Y) != FAIL for every proper subset Y of X, i.e., X is a minimum set.
ON: minimum failure-inducing changes
The complexity is exponential. Why?
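One way to see why the complexity is exponential: without further assumptions, the only general strategy is to try subsets of the changes, and n changes have 2^n subsets. The sketch below is a brute-force search, not part of delta debugging itself. The name `smallest_failing_subset` is hypothetical, and for simplicity it looks for subsets where `test` returns exactly "FAIL".

```python
from itertools import combinations


def smallest_failing_subset(changes, test):
    """Enumerate subsets by increasing size and return the first failing one.

    Returns (subset, number_of_tests_run); subset is None if nothing fails.
    In the worst case all 2**len(changes) subsets are tested.
    """
    tests_run = 0
    for size in range(len(changes) + 1):
        for combo in combinations(changes, size):
            tests_run += 1
            if test(frozenset(combo)) == "FAIL":
                return set(combo), tests_run
    return None, tests_run
```

Because subsets are tried smallest first, the returned set has no failing proper subset, but the number of test runs grows exponentially with the number of changes.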

13 Assumptions for Simplification
Monotonicity: if test(X) = FAIL, then test(Y) != PASS for all Y which are supersets of X.
Unambiguity: if test(X) = FAIL and test(Y) = FAIL, then test(X intersect Y) != PASS, i.e., a failure is caused by one change set (and not independently by two disjoint sets).
Consistency: test(X) != ? for all X.
Justified?
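For a small set of changes, the monotonicity assumption can be checked by exhaustive enumeration. The following sketch does exactly that; `all_subsets` and `is_monotone` are hypothetical helper names, and the approach is only feasible for a handful of changes since it visits every pair of subsets.

```python
from itertools import combinations


def all_subsets(changes):
    """Yield every subset of `changes` as a frozenset."""
    items = sorted(changes)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def is_monotone(changes, test):
    """True if no superset of a failing change set passes."""
    subsets = list(all_subsets(changes))
    for x in subsets:
        if test(x) != "FAIL":
            continue
        for y in subsets:
            if x <= y and test(y) == "PASS":
                return False
    return True
```

A test that fails only for exactly {1, 2} is not monotone, since adding further changes makes the failure disappear.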

14 DD Algorithm: Example

15 DD Algorithm: Example

16 DD Algorithm
algorithm DD(U, R) {
  if (U has one element only) {
    return U;
  }
  partition U into X and Y equally;
  if (test(X union R) = FAIL) {
    return DD(X, R);
  }
  if (test(Y union R) = FAIL) {
    return DD(Y, R);
  }
  return DD(X, Y union R) union DD(Y, X union R);
}
U: a set of changes; R: changes that remain to be applied
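The DD pseudocode translates almost directly into Python. This is a sketch under the three simplifying assumptions (monotonicity, unambiguity, consistency). It also assumes that changes are sortable values, that `u` is non-empty, and that `test` takes a set and returns "PASS" or "FAIL". The function name `dd` is chosen here for illustration.

```python
def dd(u, r, test):
    """Return the failure-inducing changes within u, with the changes in r always applied."""
    items = sorted(u)
    if len(items) == 1:
        return set(items)
    mid = len(items) // 2
    x, y = set(items[:mid]), set(items[mid:])
    r = set(r)
    if test(x | r) == "FAIL":  # failure is found in the first half
        return dd(x, r, test)
    if test(y | r) == "FAIL":  # failure is found in the second half
        return dd(y, r, test)
    # Interference: each half needs changes from the other half to fail.
    return dd(x, y | r, test) | dd(y, x | r, test)
```

For example, if the failure needs both change 3 and change 7, `dd(set(range(1, 9)), set(), test)` takes the interference branch at the top level and searches each half with the other half applied.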

17 Exercise 0: Complete the Call Graph
DD({1..8}, {}) DD({1..4}, {5..8}) DD({5..8}, {1..4})

18 Theory Theorem: Assume monotonicity, unambiguity and consistency, algorithm DD always returns the minimum failure-inducing set of changes.

19 Informal Proof
algorithm DD(U, R) {
  if (U has one element only) {
    return U;
  }
  partition U into X and Y equally;
  if (test(X union R) = FAIL) {
    return DD(X, R);
  }
  if (test(Y union R) = FAIL) {
    return DD(Y, R);
  }
  return DD(X, Y union R) union DD(Y, X union R);
}
Annotations on the steps: justified by monotonicity and unambiguity (three times); justified by consistency (once).

20 Ambiguity
Unambiguity: if test(X) = FAIL and test(Y) = FAIL, then test(X intersect Y) != PASS, i.e., a failure is caused by one change set (and not independently by two disjoint sets).
What if it is ambiguous? DD will find one failure-inducing change set. Remove this set and apply DD again to find another.

21 Not Monotonic Monotonicity: If test(X) = FAIL, test(Y) != PASS for all Y which is a superset of X. What if it is not monotonic? For instance, what if test({1,2}) = FAIL but test({1..4}) = PASS? DD will find some failure-inducing changes.

22 Inconsistency
Consistency: test(X) != ? for all X
Integration failure: a change may require earlier changes that are not included in the configuration.
Construction failure: although all changes can be applied, the resulting program has syntactic or semantic errors, such that construction fails.
Execution failure: the program does not execute correctly; the test outcome is unresolved.
Shall we enforce consistency for each commit of changes?

23 Research Question
How do we modify the DD algorithm so as to handle inconsistency?
Found: if test(X) = FAIL, X contains a failure-inducing subset.
Interference: if test(X) = PASS = test(Y), X and Y form an interference.
What if test(X) = ? or test(Y) = ? or both?

24 Example At step 3, can we omit {5..8}?

25 Example
Preference: if test(X) = ? and test(U-X) = PASS, X contains a failure-inducing subset and is preferred.

26 Example
Scenario: there are 8 changes, numbered 1..8. Change 8 is failure-inducing, and changes 2, 3 and 7 imply each other, that is, they can only be applied as a whole.

27 Example

28 Example Is it necessary to have {5,6}?

29 DD+ Algorithm
algorithm DD+(U, R, N) {
  if (U has one element only) { //found
    return U;
  }
  partition U into N sets X1, X2, …, XN equally;
  if (test(Xi union R) = FAIL for some Xi) { //found in Xi
    return DD+(Xi, R, 2);
  }
  if (test(Xi union R) = PASS and test((U – Xi) union R) = PASS for some Xi) { //interference
    return DD+(Xi, (U – Xi) union R, 2) union DD+(U – Xi – R, Xi union R, 2);
  }
  if (test(Xi union R) = ? and test((U – Xi) union R) = PASS for some Xi) { //preference
    return DD+(Xi, (U – Xi) union R, 2);
  }
  let U’ = U intersect {U – Xi – R | test((U – Xi) union R) = FAIL};
  if (N < |U|) { //try again
    return DD+(U’, R union {Xi | test(Xi union R) = PASS}, min(|U’|, 2N));
  }
  return U’; //nothing left
}

30 Exercise 1
Convince yourself by applying the algorithm (i.e., DD+({1..8}, {}, 2)) to the following scenario. There are 8 changes, numbered 1..8. Change 8 is failure-inducing, and changes 2, 3 and 7 imply each other, that is, they can only be applied as a whole.

31 Reducing Inconsistency
What if we know that 2 and 3 and 7 imply each other? Partition the set into two sets {1,2,3,7} and {4,5,6,8}. How do we know what changes imply each other?

32 Grouping Related Changes
To determine whether changes are related, one can use
process criteria: common change dates or sources,
location criteria: the affected file or directory,
lexical criteria: common referencing of identifiers,
syntactic criteria: common syntactic entities (functions, modules) affected by the change,
semantic criteria: common program statements affected by the changed control or data flow.

33 Example Assume that each change requires all earlier changes to be consistent.

34 Case Study: DDD 3.1.2 Dumps Core
DDD 3.1.2, released in December 1998, exhibited a nasty behavioral change: when invoked with the name of a non-existing file, DDD dumped core, while its predecessor simply gave an error message. The DDD configuration management archive lists 116 logical changes between the two releases. These changes were split into 344 textual changes to the DDD source.

35 Random Clustering
Start with 344 changes; reduced to 172 changes; 8 subsets of changes; reduced to 16 changes; reduced to 1 change.
31 tests in total.

36 Grouping Changes Changes were grouped according to the date they were applied. Each change implied all earlier changes. 12 test runs and 58 minutes

37 Case Study: GDB 178,000 changed GDB lines, grouped into 8721 textual changes in the GDB source, with any two textual changes separated by at least two unchanged lines. No configuration management archive to obtain change dates, etc.

38 Random Clustering Most of the first 457 tests result in ?
At test 458, one set containing 36 changes resulted in FAIL 470 tests in total; 48 hours.

39 Grouping Changes At top-level, changes were grouped according to directories. This was motivated by the observation that several GDB directories contain a separate library whose interface remains more or less consistent across changes. Within one directory, changes were grouped according to common files. The idea was to identify compilation units whose interface was consistent with both “yesterday’s” and “today’s” version. Within a file, changes were grouped according to common usage of identifiers. This way, we could keep changes together that operated on common variables or functions. Finally, failure resolution loop: After a failing construction, scans the error messages for identifiers, adds all changes that reference these identifiers and tries again. This is repeated until construction is possible, or until there are no more changes to add.

40 Grouping Changes After 9 tests, 2547 changes left
After 280 tests, 18 changes left (of two files only) 289 tests in total, 20 hours

41 Question
Is Delta Debugging overkill?

42 “Simplifying and isolating failure-inducing input”
Andreas Zeller, IEEE Transactions on Software Engineering, 2002

43 Motivation In July 1999, Bugzilla, the Mozilla bug database, listed more than 370 open bug reports—bug reports that were not even simplified. With this queue growing further, the Mozilla engineers “faced imminent doom”. Overwhelmed with work, the Netscape product manager sent out the Mozilla BugAThon call for volunteers: people who would help simplify bug reports. “Simplifying” meant: turning these bug reports into minimal test cases, where every part of the input would be significant in reproducing the failure.

44 Motivation: Example <td align=left valign=top> <SELECT NAME="op sys" MULTIPLE SIZE=7> <OPTION VALUE="All">All<OPTION VALUE="Windows 3.1">Windows 3.1<OPTION VALUE="Windows 95">Windows 95<OPTION VALUE="Windows 98">Windows 98<OPTION VALUE="Windows ME">Windows ME<OPTION VALUE="Windows 2000">Windows 2000<OPTION VALUE="Windows NT">Windows NT<OPTION VALUE="Mac System 7">Mac System 7<OPTION VALUE="Mac System 7.5">Mac System 7.5<OPTION VALUE="Mac System 7.6.1">Mac System 7.6.1<OPTION VALUE="Mac System 8.0">Mac System 8.0<OPTION VALUE="Mac System 8.5">Mac System 8.5<OPTION VALUE="Mac System 8.6">Mac System 8.6<OPTION VALUE="Mac System 9.x">Mac System 9.x<OPTION VALUE="MacOS X">MacOS X<OPTION VALUE="Linux">Linux<OPTION VALUE="BSDI">BSDI<OPTION VALUE="FreeBSD">FreeBSD<OPTION VALUE="NetBSD">NetBSD<OPTION VALUE="OpenBSD">OpenBSD<OPTION VALUE="AIX">AIX<OPTION VALUE="BeOS">BeOS<OPTION VALUE="HP-UX">HP-UX<OPTION VALUE="IRIX">IRIX<OPTION VALUE="Neutrino">Neutrino<OPTION VALUE="OpenVMS">OpenVMS<OPTION VALUE="OS/2">OS/2<OPTION VALUE="OSF/1">OSF/1<OPTION VALUE="Solaris">Solaris<OPTION VALUE="SunOS">SunOS<OPTION VALUE="other">other</SELECT> </td> <td align=left valign=top> <SELECT NAME="priority" MULTIPLE SIZE=7> <OPTION VALUE="--">--<OPTION VALUE="P1">P1<OPTION VALUE="P2">P2<OPTION VALUE="P3">P3<OPTION VALUE="P4">P4<OPTION VALUE="P5">P5</SELECT></td><td align=left valign=top> <SELECT NAME="bug severity" MULTIPLE SIZE=7> <OPTION VALUE="blocker">blocker<OPTION VALUE="critical">critical<OPTION VALUE="major">major<OPTION VALUE="normal">normal<OPTION VALUE="minor">minor<OPTION VALUE="trivial">trivial<OPTION VALUE="enhancement">enhancement</SELECT> </tr> </table> Loading this HTML page into Mozilla and printing it causes a segmentation fault.

45 Research Question
How do we automatically minimize a test case?

46 Research Question
How do we automatically minimize a test case?
Solution: using Delta Debugging
Yesterday: test(“”) = PASS
Today: test(that HTML page) = FAIL
The set of changes = {insert one character}
Is this the best definition?

47 Example

48 Generalize DD
A change can be anything which changes the circumstances of a program run, e.g.,
program code changes (as we have seen),
program input changes (as we’re about to see),
system configurations?

49 Generalize DD
DD requires
a set of primitive changes (which can be composed),
a test function (with different outputs like: PASS, FAIL, ?).
What should be the primitive changes?

50 Minimal Test Case Let U be the set of changes containing the primitive change of inserting one character at a position in the input. The problem is to find the minimum set X (which corresponds to the minimum input) such that test(X) = FAIL.

51 Minimality
Local minimum: a test case (represented as a set of changes X) is a local minimum if test(X) = FAIL and test(Y) != FAIL for every proper subset Y of X.
1-minimal test case: a test case X is 1-minimal if test(X) = FAIL and test(Y) != FAIL for every set Y obtained by removing exactly one element from X.
Why? Because a global minimum is hard to achieve.
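Checking 1-minimality needs only |X| + 1 test runs, which is what makes it a practical target. A minimal sketch, assuming the test case is given as a list and `test` returns "PASS", "FAIL" or "?"; the function name `is_one_minimal` is hypothetical.

```python
def is_one_minimal(x, test):
    """True if x fails and removing any single element no longer fails."""
    items = list(x)
    if test(items) != "FAIL":
        return False
    return all(
        test(items[:i] + items[i + 1:]) != "FAIL"
        for i in range(len(items))
    )
```

Checking the local-minimum condition instead would require testing every proper subset of X.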

52 DDmin Algorithm
algorithm DDmin(U, N) {
  if (|U| = 1) { //minimum
    return U;
  }
  partition U into N sets X1, X2, …, XN equally;
  if (test(Xi) = FAIL for some Xi) { //reduce to subset
    return DDmin(Xi, 2);
  }
  if (test(U - Xi) = FAIL for some Xi) { //reduce to complement
    return DDmin(U - Xi, max(N-1, 2));
  }
  if (N < |U|) { //increase granularity
    return DDmin(U, min(|U|, 2N));
  }
  return U; //cannot be reduced further
}
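A Python rendering of DDmin, as a sketch. It assumes the input is a non-empty list (so order and duplicate characters are preserved), that `test` returns "PASS", "FAIL" or "?", and that the whole list fails. The helper `split` and the final `return u` for the case where nothing can be reduced are choices made here for illustration.

```python
def split(items, n):
    """Split items into n contiguous chunks of near-equal size (requires n <= len(items))."""
    size, extra = divmod(len(items), n)
    return [
        items[i * size + min(i, extra):(i + 1) * size + min(i + 1, extra)]
        for i in range(n)
    ]


def ddmin(u, n, test):
    """Reduce the failing list u to a 1-minimal failing list; n is the granularity."""
    if len(u) == 1:
        return u
    chunks = split(u, n)
    for chunk in chunks:  # reduce to subset
        if test(chunk) == "FAIL":
            return ddmin(chunk, 2, test)
    for i in range(len(chunks)):  # reduce to complement
        complement = [c for j, chunk in enumerate(chunks) if j != i for c in chunk]
        if test(complement) == "FAIL":
            return ddmin(complement, max(n - 1, 2), test)
    if n < len(u):  # increase granularity
        return ddmin(u, min(len(u), 2 * n), test)
    return u  # every single-element removal was tried: u is 1-minimal
```

For instance, with a test that fails whenever both `<` and `>` are present, `ddmin(list("a<b>c"), 2, test)` removes the other characters and keeps the two brackets. With N = 2 the complement pass repeats the subset tests; a production implementation would cache test results.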

53 Example DDmin({1..8}, 2) DDmin({5..8}, 2) DDmin({7..8}, 2)

54

55 Exercise Show how DDmin works for this example.

56 Theory
DDmin(U, 2) is guaranteed to return a 1-minimal test case.
algorithm DDmin(U, N) {
  if (|U| = 1) { //minimum
    return U;
  }
  partition U into N sets X1, X2, …, XN equally;
  if (test(Xi) = FAIL for some Xi) { //reduce to subset
    return DDmin(Xi, 2);
  }
  if (test(U - Xi) = FAIL for some Xi) { //reduce to complement
    return DDmin(U - Xi, max(N-1, 2));
  }
  if (N < |U|) { //increase granularity
    return DDmin(U, min(|U|, 2N));
  }
  return U; //cannot be reduced further
}
U is reduced in the subset and complement cases. If U can’t be reduced further, eventually we would test every subset of U with one less element.

57 Research Question Why not simply apply the DD+ algorithm?

58 Case Study: GCC This program causes GNU C compiler (GCC version on Intel-Linux with optimization enabled) to crash. Each change inserts the i-th character into the program.

59 Case Study: GCC After 731 tests generated by DDmin (and 34 seconds)
Does it make sense to reduce the method name?

60 Case Study: GCC GCC has 31 options
Can we find out which option is relevant? Yes, simply run DDmin again.

61 Case Study: GCC DDmin: after 731 test cases, the result is

62 Case Study: GCC
This suggests a problem with inlining the expression i+j+1 in the array accesses z[i] on the following line.

63 Exercise 2 Take this program and this input as example. Apply delta debugging to locate the bug.

64 Research Question
How would you improve Delta Debugging?

65 Research Question
How would you improve Delta Debugging?
Are all generated test cases meaningful? How do we partition the input?

66 Research Question
How would you improve Delta Debugging?
Which test case to run first?

67 Research Question
How would you improve Delta Debugging?
What if there are multiple failure-inducing inputs?

68 Discussion
Assume we are given the following program and Delta Debugging run. What do we make of the result?

69 Research Question
How would you improve Delta Debugging? How can Delta Debugging be used to identify which statement in a program is buggy? (For simplicity, assume there is only one buggy statement.)

