Data Mining: Concepts and Techniques
Chapter 11: Software Bug Mining
Jiawei Han and Micheline Kamber
Department of Computer Science, University of Illinois at Urbana-Champaign
©2006 Jiawei Han and Micheline Kamber. All rights reserved.
Acknowledgement: Chao Liu
Outline
  Automated Debugging and Failure Triage
  SOBER: Statistical Model-Based Fault Localization
  Fault Localization-Based Failure Triage
  Copy and Paste Bug Mining
  Conclusions & Future Research
Software Bugs Are Costly
  Software is "full of bugs"
    Windows 2000: 35 million lines of code
    63,000 known bugs at the time of release, about 2 per 1,000 lines
  Software failures are costly
    Ariane 5 explosion due to "errors in the software of the inertial reference system" (Ariane 5 Flight 501 inquiry board report)
    A study by the National Institute of Standards and Technology found that software errors cost the U.S. economy about $59.5 billion annually
  Testing and debugging are laborious and expensive
    "50% of my company employees are testers, and the rest spends 50% of their time testing!" (Bill Gates, 1995)
Automated Failure Reporting
  End users as beta testers
    Valuable information about failure occurrences in reality
    24.5 million/day in Redmond (if all users send) (John Dvorak, PC Magazine)
  Widely adopted because of its usefulness
    Microsoft Windows, Linux Gentoo, Mozilla applications, …
    Any application can implement this functionality
After Failures Are Collected …
  Failure triage
    Failure prioritization: What are the most severe bugs?
    Failure assignment: Which developers should debug a given set of failures?
  Automated debugging
    Where is the likely bug location?
A Glimpse of Software Bugs
  Crashing bugs
    Symptoms: segmentation faults
    Reasons: memory access violations
    Tools: Valgrind, CCured
  Noncrashing bugs
    Symptoms: unexpected outputs
    Reasons: logic or semantic errors
      if ((m >= 0)) vs. if ((m >= 0) && (m != lastm))
      < vs. <=, > vs. >=, etc.
      j = i vs. j = i + 1
    Tools: no sound tools
Semantic Bugs Dominate
  Bug distribution [Li et al., ICSE'07]
    264 bugs in Mozilla and 98 bugs in Apache manually checked
    29,000 bugs in Bugzilla automatically checked
  Bug categories
    Memory-related bugs: many are detectable
    Concurrency bugs
    Semantic bugs
      Application specific
      Only few detectable
      Mostly require annotations or specifications
    Others
Hacking Semantic Bugs Is HARD
  Major challenge: no crashes!
    No failure signatures
    No debugging hints
  Major methods
    Statistical debugging of semantic bugs [Liu et al., FSE'05, TSE'06]
    Triage of noncrashing failures through statistical debugging [Liu et al., FSE'06]
A Running Example
  130 of 5542 test cases fail, no crashes
  Correct version
    void subline(char *lin, char *pat, char *sub)
    {
        int i, lastm, m;
        lastm = -1;
        i = 0;
        while ((lin[i] != ENDSTR)) {
            m = amatch(lin, i, pat, 0);
            if ((m >= 0) && (lastm != m)) {
                lastm = m;
            }
            if ((m == -1) || (m == i)) {
                i = i + 1;
            } else
                i = m;
        }
    }
  Buggy version: the subclause (lastm != m) is missing, so the first condition reads
            if ((m >= 0)) {
  Predicate evaluation as tossing a coin
    Predicate                     # of true   # of false
    (lin[i] != ENDSTR) == true    5           1
    Other instrumented predicates: Ret_amatch < 0, Ret_amatch == 0, Ret_amatch > 0, (m >= 0) == true, (m == i) == true, (m >= -1) == true
    Remaining (# of true, # of false) counts, in slide order: (5, 1), (1, 5), (4, 2), (2, 4)
Profile Executions as Vectors
  Two passing executions: (5, 1, 4, 2) and (19, 1, 18, 2)
  One failing execution: (9, 1, 8, 2)
  Extreme case
    Always false in passing and always true in failing …
  Generalized case
    Different true probability in passing and failing executions
Estimated Head Probability
  Evaluation bias
    The head probability estimated from every execution
    Specifically, $\pi(P) = \frac{n_t}{n_t + n_f}$, where $n_t$ and $n_f$ are the number of true and false evaluations of predicate $P$ in one execution
    Defined for each predicate and each execution
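The evaluation bias above can be sketched in a few lines of C. This is only an illustration: the function name is hypothetical, and the sketch assumes the predicate was evaluated at least once in the execution (how unevaluated predicates are treated is not stated on the slide).

```c
#include <assert.h>

/* Evaluation bias of one predicate in one execution:
 * (number of true evaluations) / (number of all evaluations).
 * Assumption of this sketch: the predicate was evaluated at least once. */
double evaluation_bias(unsigned long n_true, unsigned long n_false)
{
    unsigned long total = n_true + n_false;
    assert(total > 0);
    return (double)n_true / (double)total;
}
```

For a predicate that was true 5 times and false once in an execution, the function returns 5/6; one such value is computed for each predicate in each passing and failing execution.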
Divergence in Head Probability
  Multiple evaluation biases from multiple executions
  Evaluation biases are treated as generated from models
  [Figure: probability (Prob) over head probability, from 0 to 1]
Major Challenges
  [Figures: two probability (Prob) curves over head probability, from 0 to 1]
  No closed form of either model
  Not enough failing executions to estimate the models
We proposed a hypothesis testing-based indirect approach to quantify the model divergence.
SOBER in Summary
  [Diagram: the source code and a test suite of passing and failing runs are fed to SOBER, which outputs a predicate ranking: Pred2, Pred6, Pred1, Pred3]
Previous State of the Art [Liblit et al., 2005]
  Correlation analysis
    Context(P) = Prob(fail | P ever evaluated)
    Failure(P) = Prob(fail | P ever evaluated as true)
    Increase(P) = Failure(P) - Context(P)
  Increase(P) measures how much more likely the program is to fail when a predicate is ever evaluated true
Liblit05 in Illustration
  [Figure: failing runs (+) and passing runs (O)]
  Context(P) = Prob(fail | P ever evaluated) = 4/10 = 2/5
  Failure(P) = Prob(fail | P ever evaluated as true) = 3/7
  Increase(P) = Failure(P) - Context(P) = 3/7 - 2/5 = 1/35
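The three Liblit05 scores can be reproduced from four run counts per predicate. The sketch below is a minimal illustration, not the authors' implementation; the struct, its field names, and the example constant are hypothetical, and the counts are chosen to match the fractions in the illustration (4 of 10 runs that evaluate P fail, 3 of 7 runs in which P is ever true fail).

```c
#include <assert.h>

/* Run counts for one predicate P (field names are hypothetical). */
typedef struct {
    unsigned fail_observed; /* failing runs in which P was evaluated at least once */
    unsigned pass_observed; /* passing runs in which P was evaluated at least once */
    unsigned fail_true;     /* failing runs in which P was true at least once */
    unsigned pass_true;     /* passing runs in which P was true at least once */
} PredicateCounts;

/* Context(P) = Prob(fail | P ever evaluated) */
double context_score(PredicateCounts c)
{
    assert(c.fail_observed + c.pass_observed > 0);
    return (double)c.fail_observed / (double)(c.fail_observed + c.pass_observed);
}

/* Failure(P) = Prob(fail | P ever evaluated as true) */
double failure_score(PredicateCounts c)
{
    assert(c.fail_true + c.pass_true > 0);
    return (double)c.fail_true / (double)(c.fail_true + c.pass_true);
}

/* Increase(P) = Failure(P) - Context(P) */
double increase_score(PredicateCounts c)
{
    return failure_score(c) - context_score(c);
}

/* Counts matching the illustration: Context = 4/10, Failure = 3/7. */
static const PredicateCounts SLIDE_EXAMPLE = { 4, 6, 3, 4 };
```

With these counts, increase_score(SLIDE_EXAMPLE) evaluates to 3/7 - 2/5 = 1/35, a small positive value, so the predicate is only weakly correlated with failure.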
SOBER in Illustration
  [Figures: distribution (Prob) of evaluation bias over the failing runs (+) and over the passing runs (O)]
Difference between SOBER and Liblit05
  Methodology
    Liblit05: correlation analysis
    SOBER: model-based approach
  Utilized information
    Liblit05: Is the predicate ever true?
    SOBER: What percentage of the evaluations is true?
  Example
         void subline(char *lin, char *pat, char *sub) {
     1     int i, lastm, m;
     2     lastm = -1;
     3     i = 0;
     4     while((lin[i] != ENDSTR)) {
     5       m = amatch(lin, i, pat, 0);
     6       if (m >= 0){
     7         putsub(lin, i, m, sub);
     8         lastm = m;
             }
    11     }
    Liblit05: Line 6 is ever true in most passing and failing executions
    SOBER: Line 6 is prone to be true in failing executions and prone to be false in passing executions
T-Score: Metric of Debugging Quality
  How close is the blamed location to the real bug location?
  T-score = 70%
    void subline(char *lin, char *pat, char *sub)
    {
        int i, lastm, m;
        lastm = -1;
        i = 0;
        while ((lin[i] != ENDSTR)) {
            m = amatch(lin, i, pat, 0);
            if ((m >= 0)) {
                lastm = m;
            }
            if ((m == -1) || (m == i)) {
                i = i + 1;
            } else
                i = m;
        }
    }
A Better Debugging Result
  T-score = 40%
    void subline(char *lin, char *pat, char *sub)
    {
        int i, lastm, m;
        lastm = -1;
        i = 0;
        while ((lin[i] != ENDSTR)) {
            m = amatch(lin, i, pat, 0);
            if ((m >= 0)) {
                lastm = m;
            }
            if ((m == -1) || (m == i)) {
                i = i + 1;
            } else
                i = m;
        }
    }
Evaluation 1: Siemens Program Suite
  130 buggy versions of 7 small (<700 LOC) programs
  What percentage of bugs can be located with no more than a given percentage of the code examined?
  A T-score <= 20% is meaningful
Evaluation 2: Reasonably Large Programs
  Program                  | Bug   | Bug type                                         | Failure number | T-score
  Flex 2.4.7 (8,834 LOC)   | Bug 1 | Misuse of >= for >                               | 163/525        | 0.5%
                           | Bug 2 | Misuse of = for ==                               | 356/525        | 1.6%
                           | Bug 3 | Mis-assign value true for false                  | 69/525         | 7.6%
                           | Bug 4 | Mis-parenthesize ((a||b)&&c) as (a || (b && c))  | 22/525         | 15.4%
                           | Bug 5 | Off-by-one                                       | 92/525         | 45.6%
  Grep 2.2 (11,826 LOC)    | Bug 1 |                                                  | 48/470         | 0.6%
                           | Bug 2 | Subclause-missing                                | 88/470         | 0.2%
  Gzip 1.2 (6,184 LOC)     | Bug 1 |                                                  | 65/217         |
                           | Bug 2 |                                                  | 17/217         | 2.9%
  Source: Software-artifact Infrastructure Repository (SIR)
A Glimpse of Bugs in Flex-2.4.7
A Close Look: Grep-2.2, Bug 1
  11,826 lines of C code
  3,136 predicates instrumented
  48 out of 470 cases fail
Grep-2.2, Bug 2
  11,826 lines of C code
  3,136 predicates instrumented
  88 out of 470 cases fail
No Silver Bullet: Flex Bug 5
  8,834 lines of C code
  2,699 predicates instrumented
  No wrong value in chk[offset - 1]
  chk[offset] is not used here but later
Experiment Result in Summary
  Effective for bugs demonstrating abnormal control flows
  Program                  | Bug   | Bug type                                         | Failure number | T-score
  Flex 2.4.7 (8,834 LOC)   | Bug 1 | Misuse of >= for >                               | 163/525        | 0.5%
                           | Bug 2 | Misuse of = for ==                               | 356/525        | 1.6%
                           | Bug 3 | Mis-assign value true for false                  | 69/525         | 7.6%
                           | Bug 4 | Mis-parenthesize ((a||b)&&c) as (a || (b && c))  | 22/525         | 15.4%
                           | Bug 5 | Off-by-one                                       | 92/525         | 45.6%
  Grep 2.2 (11,826 LOC)    | Bug 1 |                                                  | 48/470         | 0.6%
                           | Bug 2 | Subclause-missing                                | 88/470         | 0.2%
  Gzip 1.2 (6,184 LOC)     | Bug 1 |                                                  | 65/217         |
                           | Bug 2 |                                                  | 17/217         | 2.9%
SOBER Handles Memory Bugs As Well
  bc 1.06: two memory bugs found with SOBER
    One of them is unreported
    The blamed location is NOT the crashing venue
Major Problems in Failure Triage
  Failure prioritization
    Which failures are likely due to the same bug?
    Which bugs are the most severe?
    Worst 1% of bugs = 50% of failures
  Failure assignment
    Which developer should debug which set of failures?
A Solution: Failure Clustering
  Failure indexing
    Identify failures likely due to the same bug
  [Diagram: failure reports (+) plotted on X-Y axes and grouped into clusters labeled most severe, less severe, and least severe; one cluster is annotated "Fault in function initialize()?"]
The Central Question: A Distance Measure between Failures
  Different measures render different clusterings
  [Figures: the same failures (O and +) on X-Y axes, clustered once with the distance defined on the X-axis and once with the distance defined on the Y-axis]
How to Define a Distance
  Previous work [Podgurski et al., 2003]
    T-Proximity: distance defined on literal trace similarity
  Our approach [Liu et al., 2006]
    R-Proximity: distance defined on likely bug location
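Why Our Approach Is Reasonable
  Optimal proximity: defined on root causes (RC)
  Our approach: defined on likely causes (LC)
  [Diagram: failing (F) and passing (P) executions are fed to automated fault localization to obtain the likely cause of each failure (+), plotted on X-Y axes]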
R-Proximity: An Instantiation with SOBER
  Likely causes (LCs) are predicate rankings
  [Diagram: failing (F) and passing (P) executions are fed to SOBER, which outputs a predicate ranking for each failure (+), such as (Pred2, Pred6, Pred1, Pred3) or (Pred2, Pred3, Pred1, Pred6); failures are plotted on X-Y axes]
  A distance between rankings is needed
Distance between Rankings
  Traditional Kendall's tau distance
    Number of preference disagreements
  NOT all predicates need to be considered
    Predicates are uniformly instrumented
    Only fault-relevant predicates count
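The traditional (unweighted) Kendall's tau distance counts the predicate pairs that two rankings order differently. The sketch below shows only that baseline, not the weighted variant used for R-Proximity; the function name, the rank-array representation, and the example constants are assumptions made for illustration. The example encodes two of the rankings shown earlier, (Pred2, Pred6, Pred1, Pred3) and (Pred2, Pred3, Pred1, Pred6).

```c
#include <assert.h>
#include <stddef.h>

/* rank_a[i] and rank_b[i] are the positions (0 = top) of predicate i
 * in two rankings of the same n predicates.
 * Returns the number of predicate pairs ordered differently. */
size_t kendall_tau_distance(const int *rank_a, const int *rank_b, size_t n)
{
    size_t disagreements = 0;
    assert(rank_a != NULL && rank_b != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            int da = rank_a[i] - rank_a[j];
            int db = rank_b[i] - rank_b[j];
            if ((da < 0 && db > 0) || (da > 0 && db < 0))
                disagreements++;
        }
    }
    return disagreements;
}

/* Array index order: Pred1, Pred2, Pred3, Pred6. */
static const int RANKING_A[] = { 2, 0, 3, 1 }; /* Pred2, Pred6, Pred1, Pred3 */
static const int RANKING_B[] = { 2, 0, 1, 3 }; /* Pred2, Pred3, Pred1, Pred6 */
```

For these two rankings the pairs (Pred1, Pred3), (Pred1, Pred6), and (Pred3, Pred6) are ordered differently, so the distance is 3. Because every pair counts equally here, disagreements among fault-irrelevant predicates inflate the distance, which is the motivation for weighting.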
Predicate Weighting in a Nutshell
  Fault-relevant predicates receive higher weights
  Fault-relevance is implied by rankings
    Most-favored predicates receive higher weights
  [Diagram: example ranking Pred2, Pred6, Pred1, Pred3]
11/22/2018 Data Mining: Principles and Algorithms
Automated Failure Assignment
  Most-favored predicates indicate the agreed bug location for a group of failures
  Predicate spectrum graph
  [Figure: predicate spectrum graph over the rankings (Pred2, Pred6, Pred1, Pred3), (Pred2, Pred1, Pred3, Pred6), (Pred2, Pred1, Pred3, Pred6), (Pred2, Pred6, Pred1, Pred3); x-axis: Pred. Index]
Case Study 1: Grep-2.2
  470 test cases in total
  136 cases fail across the two faults, no crashes
    48 fail due to Fault 1, 88 fail due to Fault 2
Failure Proximity Graphs
  [Figures: failure proximity graphs under T-Proximity and R-Proximity]
  For both R-Proximity and T-Proximity, we calculate the pairwise distances between failures and then place the failures in a 2-D space such that the pairwise distances are best preserved
  Red crosses are failures due to Fault 1
  Blue circles are failures due to Fault 2
  Divergent behaviors due to the same fault
  Better clustering result under R-Proximity
Guided Failure Assignment
  What predicates are favored in each group?
  No matter which failures are circled, predicates 1470 and 1484 always dominate
Assign Failures to Appropriate Developers
  The 21 failing cases in Cluster 1 are assigned to the developers responsible for the function grep
  The 112 failing cases in Cluster 2 are assigned to the developers responsible for the function comsub
Case Study 2: Gzip-1.2.3
  217 test cases in total
  82 cases fail across the two faults, no crashes
    65 fail due to Fault 1, 17 fail due to Fault 2
Failure Proximity Graphs
  [Figures: failure proximity graphs under T-Proximity and R-Proximity]
  Red crosses are failures due to Fault 1
  Blue circles are failures due to Fault 2
  Nearly perfect clustering under R-Proximity
  Accurate failure assignment
Mining Copy-Paste Bugs
  Copy-pasting is common
    12% in Linux file system [Kasper2003]
    19% in X Window system [Baker1995]
  Copy-pasted code is error prone
    Among 35 errors in Linux drivers/i2o, 34 are caused by copy-paste [Chou2001]
  Example (simplified from linux-2.6.6/arch/sparc/prom/memory.c)
    void __init prom_meminit(void)
    {
        ……
        for (i=0; i<n; i++) {
            total[i].adr = list[i].addr;
            total[i].bytes = list[i].size;
            total[i].more = &total[i+1];
        }
        for (i=0; i<n; i++) {
            taken[i].adr = list[i].addr;
            taken[i].bytes = list[i].size;
            taken[i].more = &total[i+1];    /* Forgot to change! */
        }
    }
An Overview of Copy-Paste Bug Detection
  1. Parse source code and build a sequence database
  2. Mine for basic copy-pasted segments
  3. Compose larger copy-pasted segments
  4. Prune false positives
Parsing Source Code
  Purpose: building a sequence database
  Idea: map each statement to a number
    Tokenize each component
      Different operators, constants, and keywords map to different tokens
    Handle identifier renaming: identifiers of the same type map to the same token
  Example
    old = 3;    tokenize: 5 61 20    hash: 16
    new = 3;    tokenize: 5 61 20    hash: 16
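The statement-to-number idea can be sketched as follows. This is a simplified illustration, not the tokenizer or hash function of the actual tool: it assumes every identifier collapses to one placeholder token regardless of type, recognizes only a small illustrative subset of C keywords, and uses FNV-1a as a stand-in hash. All names in the block are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IDENT_TOKEN 0x01u /* placeholder byte fed for every identifier */

static uint32_t fnv_step(uint32_t h, unsigned char byte)
{
    return (h ^ byte) * 16777619u; /* one FNV-1a round */
}

/* Illustrative keyword subset; a real tokenizer would know all of them. */
static int is_keyword(const char *s, size_t len)
{
    static const char *const kw[] = { "for", "if", "else", "while", "return" };
    for (size_t k = 0; k < sizeof kw / sizeof kw[0]; k++)
        if (strlen(kw[k]) == len && strncmp(kw[k], s, len) == 0)
            return 1;
    return 0;
}

/* Hash one statement so that renaming identifiers does not change the hash,
 * while different operators, constants, and keywords do. */
uint32_t statement_hash(const char *stmt)
{
    uint32_t h = 2166136261u; /* FNV-1a offset basis */
    const char *p = stmt;
    assert(stmt != NULL);
    while (*p != '\0') {
        unsigned char c = (unsigned char)*p;
        if (isspace(c)) { p++; continue; }      /* layout is ignored */
        if (isalpha(c) || c == '_') {           /* a word: keyword or identifier */
            const char *start = p;
            while (isalnum((unsigned char)*p) || *p == '_') p++;
            if (is_keyword(start, (size_t)(p - start))) {
                for (; start < p; start++)
                    h = fnv_step(h, (unsigned char)*start);
            } else {
                h = fnv_step(h, IDENT_TOKEN);
            }
            continue;
        }
        h = fnv_step(h, c);                     /* operator, constant, punctuation */
        p++;
    }
    return h;
}
```

Under these assumptions, statement_hash("old = 3;") and statement_hash("new = 3;") are equal, mirroring the slide's example in which both statements hash to 16, while "old = 4;" hashes differently because the constant differs.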
Building Sequence Database
  A program is one long sequence of hash values, but mining needs a sequence database
  Cut the long sequence
    Naïve method: fixed length
    Our method: basic block
  Example (hash value, statement)
    65    for (i=0; i<n; i++) {
    16        total[i].adr = list[i].addr;
    16        total[i].bytes = list[i].size;
    71        total[i].more = &total[i+1];
          }
          ……
    65    for (i=0; i<n; i++) {
    16        taken[i].adr = list[i].addr;
    16        taken[i].bytes = list[i].size;
    71        taken[i].more = &total[i+1];
          }
  Final sequence DB: (65) (16, 16, 71) … (65) (16, 16, 71)
Mining for Basic Copy-Pasted Segments
  Apply a frequent sequence mining algorithm to the sequence database
  Modification: constrain the max gap
  Example: frequent subsequence (16, 16, 71)
    total[i].adr = list[i].addr;
    total[i].bytes = list[i].size;
    total[i].more = &total[i+1];
    ……
    taken[i].adr = list[i].addr;
    taken[i].bytes = list[i].size;
    taken[i].more = &total[i+1];
  Exact copy: (16, 16, 71) …… (16, 16, 71)
  Insert 1 statement (gap = 1): (16, 16, 71) …… (16, 16, 10, 71)
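The max-gap constraint can be illustrated with the check a miner would use to decide whether a sequence supports a candidate pattern. The sketch below covers only that check, not a full frequent sequence mining algorithm; the function names and example constants are hypothetical, and "gap" is taken to mean the number of items skipped between two consecutive matched items.

```c
#include <assert.h>
#include <stddef.h>

/* Try to match pat[k..m-1] after position `last`, skipping at most
 * max_gap items between consecutive matches (with backtracking). */
static int match_rest(const int *seq, size_t n, size_t last,
                      const int *pat, size_t m, size_t k, size_t max_gap)
{
    if (k == m)
        return 1;
    for (size_t p = last + 1; p < n && p - last - 1 <= max_gap; p++)
        if (seq[p] == pat[k] && match_rest(seq, n, p, pat, m, k + 1, max_gap))
            return 1;
    return 0;
}

/* Returns 1 if pat occurs in seq as a subsequence whose consecutive
 * matched items are separated by at most max_gap other items. */
int occurs_with_max_gap(const int *seq, size_t n,
                        const int *pat, size_t m, size_t max_gap)
{
    assert(seq != NULL && pat != NULL);
    if (m == 0)
        return 1;
    for (size_t s = 0; s < n; s++)
        if (seq[s] == pat[0] && match_rest(seq, n, s, pat, m, 1, max_gap))
            return 1;
    return 0;
}

static const int PATTERN[]     = { 16, 16, 71 };     /* original block */
static const int EDITED_COPY[] = { 16, 16, 10, 71 }; /* copy with 1 inserted statement */
```

With max_gap = 1 the edited copy (16, 16, 10, 71) still supports the pattern (16, 16, 71); with max_gap = 0 it does not, so an exact-match search would miss the copy-pasted segment.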
Composing Larger Copy-Pasted Segments
  Combine the neighboring copy-pasted segments repeatedly
  Example: in both the original and the copy-pasted code, the segment (65) is combined with its neighbor (16, 16, 71)
    65    for (i=0; i<n; i++) {
    16        total[i].adr = list[i].addr;
    16        total[i].bytes = list[i].size;
    71        total[i].more = &total[i+1];
          }
          ……
    65    for (i=0; i<n; i++) {
    16        taken[i].adr = list[i].addr;
    16        taken[i].bytes = list[i].size;
    71        taken[i].more = &total[i+1];
          }
Pruning False Positives
  Unmappable segments
    Identifier names cannot be mapped to corresponding ones
    Example: f (a1); f (a2); f (a3); versus f1 (b1); f1 (b2); f2 (b3); (conflict: f maps to both f1 and f2)
  Tiny segments
  For more detail, see Zhenmin Li, Shan Lu, Suvda Myagmar, Yuanyuan Zhou. CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code. In Proc. 6th Symp. Operating Systems Design and Implementation, 2004.
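The "unmappable" test can be sketched as a consistency check over paired identifier occurrences. This is an assumption-laden illustration rather than the tool's actual pruning rule: it checks only one direction (each identifier in the first segment must always correspond to the same identifier in the second), and the function name and example arrays are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* first[i] and second[i] are the identifiers at the i-th corresponding
 * position of two candidate copy-pasted segments.
 * Returns 1 if every identifier of the first segment always maps to the
 * same identifier in the second segment, 0 if some mapping conflicts. */
int mapping_is_consistent(const char *const *first,
                          const char *const *second, size_t n)
{
    assert(first != NULL && second != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (strcmp(first[i], first[j]) == 0 &&
                strcmp(second[i], second[j]) != 0)
                return 0;
    return 1;
}

/* Function names from the slide: f, f, f versus f1, f1, f2. */
static const char *const FUNCS_FIRST[]  = { "f", "f", "f" };
static const char *const FUNCS_SECOND[] = { "f1", "f1", "f2" };
/* Argument names from the slide: a1, a2, a3 versus b1, b2, b3. */
static const char *const ARGS_FIRST[]   = { "a1", "a2", "a3" };
static const char *const ARGS_SECOND[]  = { "b1", "b2", "b3" };
```

On the slide's example the argument names map cleanly (a1 to b1, a2 to b2, a3 to b3), but the function name f maps to both f1 and f2, so the check reports a conflict for the function names.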
Some Test Results of C-P Bug Detection
  Software    | # LOC | Verified bugs | Potential bugs (careless programming) | Time    | Space (MB)
  Linux       | 4.4 M | 28            | 21                                    |         | 527
  FreeBSD     | 3.3 M | 23            | 8                                     | 20 mins | 459
  Apache      | 224 K | 5             |                                       | 15 secs | 30
  PostgreSQL  | 458 K | 2             |                                       | 38 secs | 57
Conclusions
  Data mining into software and computer systems
  Identify incorrect executions from program runtime behaviors
    Classification dynamics can give away a "backtrace" for noncrashing bugs without any semantic inputs
  A hypothesis testing-like approach is developed to localize logic bugs in software
    No prior knowledge about the program semantics is assumed
  Lots of other software bug mining methods should be explored
Future Research: Mining into Computer Systems
  Huge volume of data from computer systems
    Persistent state interactions, event logs, network logs, CPU usage, …
  Mining system data for …
    Reliability
    Performance
    Manageability
    …
  Challenges in data mining
    Statistical modeling of computer systems
    Online, scalability, interpretability, …
  Most of these problems are noncrashing failures.
