A Comparative Evaluation of Static Analysis Actionable Alert Identification Techniques
Sarah Heckman and Laurie Williams
Department of Computer Science, North Carolina State University
Motivation
Automated static analysis can find a large number of alerts
– Empirically observed alert density of 40 alerts/KLOC [HW08]
Alert inspection is required to determine whether the developer should (and could) fix each alert
– Developers may fix only 9% [HW08] to 65% [KAY04] of alerts
– Suppose 1000 alerts at 5 minutes of inspection per alert: 10.4 work days to inspect them all
– Potential savings of 3.6-9.5 days by inspecting only the alerts the developer will fix (a sketch follows this slide)
Fixing 3-4 alerts that could lead to field failures justifies the cost of static analysis [WDA08]
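The figures in the inspection-cost bullet can be reproduced with a quick calculation; the sketch below assumes an 8-hour work day and uses the 9% and 65% fix rates cited above.

```python
# Back-of-the-envelope inspection cost, assuming an 8-hour work day.
ALERTS = 1000
MINUTES_PER_INSPECTION = 5
WORK_DAY_MINUTES = 8 * 60

def work_days(num_alerts: float) -> float:
    """Work days needed to inspect num_alerts at 5 minutes each."""
    return num_alerts * MINUTES_PER_INSPECTION / WORK_DAY_MINUTES

inspect_all = work_days(ALERTS)                      # ~10.4 days
for fix_rate in (0.09, 0.65):                        # 9% [HW08] to 65% [KAY04]
    inspect_actionable = work_days(ALERTS * fix_rate)
    print(f"fix rate {fix_rate:.0%}: inspect {inspect_actionable:.1f} days, "
          f"save {inspect_all - inspect_actionable:.1f} days")
# Prints savings of roughly 9.5 days (9% fix rate) and 3.6 days (65% fix rate).
```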
Coding Problem?
Actionable: alerts the developer wants to fix
– Faults in the code
– Conformance to coding standards
– Developer action: fix the alert in the source code
Unactionable: alerts the developer does not want to fix
– Static analysis false positives
– Developer knowledge that the alert is not a problem
– Inconsequential coding problems (style)
– Fixing the alert may not be worth the effort
– Developer action: suppress the alert
Actionable Alert Identification Techniques (AAIT)
Supplement automated static analysis
– Classification: predict actionability
– Prioritization: order alerts by predicted actionability
AAIT utilize additional information about the alert, the code, and other artifacts (a sketch follows this slide)
– Artifact characteristics
Can we determine a "best" AAIT?
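As a concrete illustration of the two uses of an AAIT, here is a minimal sketch; the Alert fields and the predict_actionability() scorer are hypothetical placeholders, not any of the evaluated techniques.

```python
# A minimal sketch (not any author's implementation) of how an AAIT can
# classify and prioritize static analysis alerts.
from dataclasses import dataclass

@dataclass
class Alert:
    alert_type: str       # e.g., a FindBugs bug pattern name
    file_path: str
    line: int
    score: float = 0.0    # predicted probability the alert is actionable

def predict_actionability(alert: Alert) -> float:
    """Placeholder for any trained model over artifact characteristics."""
    return 0.5  # a real AAIT would return a learned probability

def classify(alerts, threshold=0.5):
    """Classification: label each alert actionable/unactionable."""
    return [(a, predict_actionability(a) >= threshold) for a in alerts]

def prioritize(alerts):
    """Prioritization: order alerts so likely-actionable ones are inspected first."""
    for a in alerts:
        a.score = predict_actionability(a)
    return sorted(alerts, key=lambda a: a.score, reverse=True)
```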
Research Objective
To inform the selection of an actionable alert identification technique for ranking the output of automated static analysis through a comparative evaluation of six actionable alert identification techniques.
Related Work
Comparative evaluation of AAIT [AAH12]
– Languages: Java and Smalltalk
– ASA: PMD, FindBugs, SmallLint
– Benchmark: FAULTBENCH
– Evaluation metrics (a sketch follows this slide)
  - Effort: "average number of alerts one must inspect to find an actionable one"
  - Fault detection rate curve: the number of faults detected plotted against the number of alerts inspected
– Selected AAIT: APM, FeedbackRank, LRM, ZRanking, ATL-D, EFindBugs
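The two metrics from [AAH12] can be sketched as follows; this assumes alerts are already in ranked (inspection) order with known labels, uses actionable alerts as a stand-in for detected faults, and takes one simple reading of the quoted effort definition rather than the exact formula from that paper.

```python
def effort(ranked_labels):
    """One simple reading of the quoted definition: alerts inspected per
    actionable alert found when walking the ranked list top to bottom."""
    actionable = sum(1 for label in ranked_labels if label)
    return len(ranked_labels) / actionable if actionable else float("inf")

def fault_detection_curve(ranked_labels):
    """Points (alerts inspected, actionable alerts found) along the ranking."""
    found, curve = 0, []
    for inspected, label in enumerate(ranked_labels, start=1):
        found += label
        curve.append((inspected, found))
    return curve

# Example: a ranking that places two actionable alerts (True) near the top.
ranking = [True, False, True, False, False]
print(effort(ranking))                  # 2.5
print(fault_detection_curve(ranking))   # [(1, 1), (2, 1), (3, 2), (4, 2), (5, 2)]
```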
Comparative Evaluation
Considered the AAIT in the literature [HW11][SFZ11]
Selection criteria
– The AAIT classifies or prioritizes alerts generated by automated static analysis for the Java programming language
– An implementation of the AAIT is described, allowing for replication
– The AAIT is fully automated and does not require manual intervention or inspection of alerts as part of the process
Selected AAIT (1)
Actionable Prioritization Models (APM) [HW08]
– ACs: code location, alert type
Alert Type Lifetime (ATL) [KE07a]
– AC: alert type lifetime (a sketch follows this slide)
– ATL-D: measures the lifetime in days
– ATL-R: measures the lifetime in revisions
Check 'n' Crash (CnC) [CSX08]
– AC: test failures
– Generates tests that try to cause RuntimeExceptions
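A minimal sketch of the lifetime idea behind ATL, assuming we already know, for each alert, the revision at which it appeared and the revision at which it was closed; the observation data and alert type names are made up, and this is not the implementation from [KE07a].

```python
# Sketch: rank alert types by average lifetime, measured in revisions (ATL-R).
# ATL-D would be the same computation with dates instead of revision numbers.
from collections import defaultdict

# Hypothetical observations: (alert_type, revision_opened, revision_closed)
observations = [
    ("NP_NULL_ON_SOME_PATH", 10, 12),
    ("NP_NULL_ON_SOME_PATH", 30, 31),
    ("SE_NO_SERIALVERSIONID", 5, 90),
]

lifetimes = defaultdict(list)
for alert_type, opened, closed in observations:
    lifetimes[alert_type].append(closed - opened)

# Shorter average lifetime: developers tend to remove these alerts quickly,
# so new alerts of that type are ranked as more likely to be actionable.
ranking = sorted(lifetimes, key=lambda t: sum(lifetimes[t]) / len(lifetimes[t]))
print(ranking)  # ['NP_NULL_ON_SOME_PATH', 'SE_NO_SERIALVERSIONID']
```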
Selected AAIT (2)
History-Based Warning Prioritization (HWP) [KE07b]
– ACs: commit messages that identify fault/non-fault fixes
Logistic Regression Models (LRM) [RPM08]
– ACs: 33, including two proprietary/internal ACs (a sketch of this style of model follows this slide)
Systematic Actionable Alert Identification (SAAI) [HW09]
– ACs: 42
– Machine learning
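A minimal sketch of a logistic-regression model over artifact characteristics, in the spirit of LRM (and of SAAI's machine learning) but not the authors' implementation; it assumes scikit-learn is available and uses a tiny, made-up feature matrix.

```python
# Sketch: logistic regression over artifact characteristics (made-up data).
# Requires scikit-learn: pip install scikit-learn
from sklearn.linear_model import LogisticRegression

# Each row is one alert: [alert lifetime in revisions, file churn (LOC),
# method length (NCSS), number of alert modifications]
X_train = [
    [2, 120, 15, 1],
    [40, 5, 80, 0],
    [1, 200, 10, 2],
    [35, 2, 60, 0],
]
y_train = [1, 0, 1, 0]  # 1 = actionable, 0 = unactionable (from the alert oracle)

model = LogisticRegression()
model.fit(X_train, y_train)

# The predicted probability can be used directly for prioritization,
# or thresholded at 0.5 for classification.
X_test = [[3, 90, 20, 1]]
print(model.predict_proba(X_test)[0][1], model.predict(X_test)[0])
```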
FAULTBENCH v0.3
3 subject programs: jdom, runtime, logging
Procedure
1. Gather alert and artifact characteristic data sources
2. Artifact characteristic and alert oracle generation
3. Training and test sets
4. Model building
5. Model evaluation
Gather Data
Download from the repository
Compile
ASA: FindBugs & Check 'n' Crash (ESC/Java)
Source metrics: JavaNCSS
Repository history: CVS & SVN
Difficulties
– Libraries changed over time
– Not every revision would build (especially early ones)
Artifact Characteristics
Independent variables (a sketch follows this slide):
– Alert identifier and history: alert information (type, location); number of alert modifications
– Source code metrics: size and complexity metrics
– Source code history: developers; file creation, deletion, and modification revisions
– Source code churn: added and deleted lines of code
– Aggregate characteristics: alert lifetime, alert counts, staleness
Dependent variable: alert classification
[Diagram: alert information and the surrounding code feed into classifying each alert as actionable or unactionable]
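One way to picture an artifact-characteristic record for a single alert is sketched below; the field names are illustrative groupings of the categories above, not the exact 33 or 42 characteristics used by LRM or SAAI.

```python
# Sketch: one record of artifact characteristics (independent variables) plus
# the alert classification (dependent variable). Field names are illustrative.
from dataclasses import dataclass

@dataclass
class AlertCharacteristics:
    # Alert identifier and history
    alert_type: str
    file_path: str
    line: int
    num_modifications: int
    # Source code metrics
    method_ncss: int
    cyclomatic_complexity: int
    # Source code history
    num_developers: int
    file_revisions: int
    # Source code churn
    lines_added: int
    lines_deleted: int
    # Aggregate characteristics
    alert_lifetime_revisions: int
    open_alert_count: int
    staleness: int
    # Dependent variable: True = actionable, False = unactionable
    actionable: bool
```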
Alert Oracle Generation
Iterate through all revisions, starting with the earliest, and compare alerts between revisions (a sketch follows this slide)
– Closed: actionable
– Filtered: unactionable
– Deleted
– Open
  - Inspection
  - All unactionable
[Diagram: alert states across revisions: open, deleted, closed, filtered]
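A minimal sketch of the oracle idea, assuming we can list the alert identifiers present at each revision and know which alerts were removed by a filter file; it follows the mapping above (closed alerts actionable, filtered alerts unactionable, uninspected open alerts unactionable) but is not the benchmark's actual implementation, and it does not handle deleted files.

```python
def build_oracle(alerts_by_revision, filtered_ids, inspected_actionable_ids=frozenset()):
    """alerts_by_revision: list of sets of alert ids, ordered oldest to newest."""
    oracle = {}
    for current, nxt in zip(alerts_by_revision, alerts_by_revision[1:]):
        for alert_id in current - nxt:          # alert disappeared in the next revision
            if alert_id in filtered_ids:
                oracle[alert_id] = False        # suppressed by a filter: unactionable
            else:
                oracle[alert_id] = True         # closed by a code change: actionable
    for alert_id in alerts_by_revision[-1]:     # still open at the last revision
        oracle[alert_id] = alert_id in inspected_actionable_ids
    return oracle

# Example: alert "a" is closed, "b" is filtered, "c" stays open (unactionable).
print(build_oracle([{"a", "b", "c"}, {"b", "c"}, {"c"}], filtered_ids={"b"}))
# {'a': True, 'b': False, 'c': False}
```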
Training and Test Sets
Simulate how an AAIT would be used in practice (a sketch follows this slide)
Training set: the first X% of revisions, used to train the models
– X = 70%, 80%, and 90%
Test set: the remaining (100-X)% of revisions, used to test the models
Overlapping alerts
– Alerts open at the cutoff revision
Deleted alerts
– If an alert is deleted, the alert is not considered, unless the alert is not deleted within the training set; in that case the alert is used in model building
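A minimal sketch of the chronological split, assuming each alert records the revisions at which it was opened and, if applicable, deleted; the handling of deleted alerts follows the rule above, while assigning alerts that are still open at the cutoff to the training set is an assumption of this sketch.

```python
def split_alerts(alerts, num_revisions, train_fraction=0.7):
    """Split alerts chronologically: alerts from the first train_fraction of
    revisions train the model, later ones test it."""
    cutoff = int(num_revisions * train_fraction)
    train, test = [], []
    for a in alerts:
        # Deleted alerts are dropped unless the deletion falls after the
        # training cutoff, in which case the alert can still train the model.
        if a["deleted_rev"] is not None and a["deleted_rev"] <= cutoff:
            continue
        if a["opened_rev"] <= cutoff:
            train.append(a)   # includes alerts still open at the cutoff (assumption)
        else:
            test.append(a)
    return train, test

alerts = [{"id": 1, "opened_rev": 10, "deleted_rev": None},
          {"id": 2, "opened_rev": 20, "deleted_rev": 40},
          {"id": 3, "opened_rev": 80, "deleted_rev": None}]
print(split_alerts(alerts, num_revisions=100))  # alert 2 is dropped
```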
Model Building & Model Evaluation
Classification statistics (a sketch follows this slide):
– Precision = TP / (TP + FP)
– Recall = TP / (TP + FN)
– Accuracy = (TP + TN) / (TP + TN + FP + FN)

Outcome               Predicted      Actual
True Positive (TP)    Actionable     Actionable
False Positive (FP)   Actionable     Unactionable
False Negative (FN)   Unactionable   Actionable
True Negative (TN)    Unactionable   Unactionable

All AAIT are built using the training data and evaluated by predicting the actionability of the test data
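The classification statistics above can be computed directly from parallel lists of predicted and actual labels, as in this small sketch.

```python
def classification_stats(predicted, actual):
    """Precision, recall, and accuracy from parallel lists of predicted and
    actual labels (True = actionable, False = unactionable)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, accuracy

# Example: 3 of 4 alerts classified correctly.
print(classification_stats([True, True, False, False],
                           [True, False, False, False]))  # (0.5, 1.0, 0.75)
```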
Results - jdom
[Table: accuracy, precision, and recall (%) for APM, ATL-D, ATL-R, CnC, HWP, LRM, and SAAI at the 70%, 80%, and 90% revision cutoffs]
Results - runtime
[Table: accuracy, precision, and recall (%) for APM, ATL-D, ATL-R, HWP, LRM, and SAAI at the 70%, 80%, and 90% revision cutoffs]
Results - logging
[Table: accuracy, precision, and recall (%) for APM, ATL-D, ATL-R, CnC, HWP, LRM, and SAAI at the 70%, 80%, and 90% revision cutoffs]
Threats to Validity
Internal validity
– Automation of data generation, collection, and artifact characteristic generation
– Alert oracle: uninspected alerts are considered unactionable
– Alert closure is not an explicit action by the developer
– Alert continuity is not perfect (a sketch follows this slide)
  - An alert is closed and a new alert opened if both the line number and the source hash of the alert change
– Number of revisions
External validity
– Generalizability of results
– Limitations of the AAIT in the comparative evaluation
Construct validity
– Calculations for artifact characteristics
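A minimal sketch of the alert-continuity rule mentioned above, assuming each alert carries a type, file, line number, and a hash of its surrounding source; the representation is illustrative.

```python
def same_alert(old, new):
    """Treat two alerts of the same type and file as the same alert across
    revisions unless BOTH the line number and the source hash have changed."""
    if old["type"] != new["type"] or old["file"] != new["file"]:
        return False
    return old["line"] == new["line"] or old["source_hash"] == new["source_hash"]

a1 = {"type": "NP_NULL_ON_SOME_PATH", "file": "Foo.java", "line": 42, "source_hash": "abc"}
a2 = {"type": "NP_NULL_ON_SOME_PATH", "file": "Foo.java", "line": 45, "source_hash": "abc"}
print(same_alert(a1, a2))  # True: the code moved, but the hashed source is unchanged
```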
Future Work
Incorporate additional projects into FAULTBENCH
– Emphasis on adding projects that actively use ASA and include filter files
– Allow for evaluation of AAIT with different goals
Identification of the most predictive artifact characteristics
Evaluate different windows for generating test data
– A full project history may not be as predictive as the most recent history
Conclusions
SAAI was found to be the best overall model when considering accuracy
– Highest accuracy, or a tie, for 6 of 9 treatments
ATL-D, ATL-R, and LRM were also predictive when considering accuracy
– CnC also performed well, but only considered alerts from one ASA
LRM and HWP had the highest recall
References
[AAH12] S. Allier, N. Anquetil, A. Hora, S. Ducasse, "A Framework to Compare Alert Ranking Algorithms," 19th Working Conference on Reverse Engineering, Kingston, Ontario, Canada, October 15-18, 2012, pp. 277-285.
[CSX08] C. Csallner, Y. Smaragdakis, and T. Xie, "DSD-Crasher: A Hybrid Analysis Tool for Bug Finding," ACM Transactions on Software Engineering and Methodology, vol. 17, no. 2, pp. 1-36, April 2008.
[HW08] S. Heckman and L. Williams, "On Establishing a Benchmark for Evaluating Static Analysis Alert Prioritization and Classification Techniques," Proceedings of the 2nd International Symposium on Empirical Software Engineering and Measurement, Kaiserslautern, Germany, October 9-10, 2008, pp. 41-50.
[HW09] S. Heckman and L. Williams, "A Model Building Process for Identifying Actionable Static Analysis Alerts," Proceedings of the 2nd IEEE International Conference on Software Testing, Verification and Validation, Denver, CO, USA, 2009, pp. 161-170.
[HW11] S. Heckman and L. Williams, "A Systematic Literature Review of Actionable Alert Identification Techniques for Automated Static Code Analysis," Information and Software Technology, vol. 53, no. 4, April 2011, pp. 363-387.
[KE07a] S. Kim and M. D. Ernst, "Prioritizing Warning Categories by Analyzing Software History," Proceedings of the International Workshop on Mining Software Repositories, Minneapolis, MN, USA, May 19-20, 2007, p. 27.
[KE07b] S. Kim and M. D. Ernst, "Which Warnings Should I Fix First?," Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, Dubrovnik, Croatia, September 3-7, 2007, pp. 45-54.
[KAY04] T. Kremenek, K. Ashcraft, J. Yang, and D. Engler, "Correlation Exploitation in Error Ranking," Proceedings of the 12th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Newport Beach, CA, USA, 2004, pp. 83-93.
[RPM08] J. R. Ruthruff, J. Penix, J. D. Morgenthaler, S. Elbaum, G. Rothermel, "Predicting Accurate and Actionable Static Analysis Warnings: An Experimental Approach," Proceedings of the 30th International Conference on Software Engineering, Leipzig, Germany, May 10-18, 2008, pp. 341-350.
[SFZ11] H. Shen, J. Fang, J. Zhao, "EFindBugs: Effective Error Ranking for FindBugs," 2011 IEEE 4th International Conference on Software Testing, Verification and Validation, Berlin, Germany, March 21-25, 2011, pp. 299-308.
[WDA08] S. Wagner, F. Deissenboeck, M. Aichner, J. Wimmer, M. Schwalb, "An Evaluation of Two Bug Pattern Tools for Java," Proceedings of the 1st International Conference on Software Testing, Verification, and Validation, …