Theory and Practice, Do They Match? A Case with Spectrum-Based Fault Localization Tien-Duy B. Le, Ferdian Thung, and David Lo School of Information Systems.

Presentation transcript:

Theory and Practice, Do They Match? A Case with Spectrum-Based Fault Localization
Tien-Duy B. Le, Ferdian Thung, and David Lo
School of Information Systems, Singapore Management University

Spectrum-Based Fault Localization (SBFL)
Locating buggy program elements by
– Analyzing two sets of execution traces: normal traces and faulty traces
– Assigning suspiciousness scores to program elements
Two well-known SBFL formulas
– Tarantula
– Ochiai

Spectrum-Based Fault Localization
Xie et al., "A Theoretical Analysis of the Risk Evaluation Formulas for Spectrum-based Fault Localization" (TOSEM, 2013)
– Two families of SBFL formulas, ER1 and ER5 (5 formulas in total)
– Theoretically proven to outperform Ochiai and Tarantula
– Under the assumption that test coverage is 100%

Our Goal
Compare, on benchmark programs, the theoretically best SBFL formulas of Xie et al. with popular SBFL formulas.

Notations
For a program element e, the formulas use four counts:
– Number of successful test cases that execute e
– Number of failing test cases that execute e
– Number of successful test cases that do not execute e
– Number of failing test cases that do not execute e
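The four counts above can be derived from per-test coverage and test outcomes. The following is a minimal sketch, not the authors' tooling: the names `Spectrum`, `spectrum_counts`, `ep`, `ef`, `np`, and `nf` are illustrative assumptions (the slide's own symbols are not preserved in this transcript), and the tiny test suite is made up.

```python
from collections import namedtuple

# Hypothetical field names, not the slide's notation:
#   ep / ef = passing / failing tests that execute the element
#   np / nf = passing / failing tests that do not execute it
Spectrum = namedtuple("Spectrum", ["ep", "ef", "np", "nf"])


def spectrum_counts(coverage, passed, element):
    """Count the four spectrum values for one program element.

    coverage: list of sets, the elements executed by each test case
    passed:   list of bools, True if the corresponding test succeeded
    """
    ep = ef = np_ = nf = 0
    for executed, ok in zip(coverage, passed):
        hit = element in executed
        if hit and ok:
            ep += 1
        elif hit:
            ef += 1
        elif ok:
            np_ += 1
        else:
            nf += 1
    return Spectrum(ep, ef, np_, nf)


# Made-up example: four tests over three statements.
coverage = [{"s1", "s2"}, {"s1", "s3"}, {"s2", "s3"}, {"s1"}]
passed = [True, False, False, True]
counts_s1 = spectrum_counts(coverage, passed, "s1")
counts_s3 = spectrum_counts(coverage, passed, "s3")
```

In this toy suite, s3 is executed by both failing tests and by no passing test, which is the pattern SBFL formulas reward with a high suspiciousness score.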

Popular SBFL Formulas
– Tarantula
– Ochiai
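The formula images from this slide are not preserved in the transcript. As a hedged sketch, the standard definitions of Tarantula and Ochiai from the fault localization literature can be written as follows. The parameter names (`ep`, `ef`, `np_`, `nf` for passing/failing tests that do or do not execute the element) and the zero-denominator handling are assumptions of this sketch, not taken from the slides.

```python
import math


def tarantula(ep, ef, np_, nf):
    """Tarantula: share of failing tests that execute the element,
    normalised by the sum of the failing and passing shares."""
    total_pass = ep + np_
    total_fail = ef + nf
    fail_ratio = ef / total_fail if total_fail else 0.0
    pass_ratio = ep / total_pass if total_pass else 0.0
    denom = fail_ratio + pass_ratio
    # Returning 0.0 for an element no test executes is an assumption.
    return fail_ratio / denom if denom else 0.0


def ochiai(ep, ef, np_, nf):
    """Ochiai: ef / sqrt(total failing tests * tests that execute the element)."""
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0


# An element executed by every failing test and no passing test
# gets the maximum score under both formulas.
print(tarantula(0, 2, 2, 0), ochiai(0, 2, 2, 0))
```

Both scores lie between 0 and 1, and elements are inspected in decreasing order of score.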

Theoretically Best SBFL Formulas
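The formulas on this slide are not preserved in the transcript. The sketch below writes out the five formulas usually listed for the ER1 and ER5 families in the literature (Naish1, Naish2, Wong1, Russel and Rao, Binary). Mapping them to the labels ER1a, ER1b, ER5a, ER5b, and ER5c is an assumption of this sketch, as are the parameter names (`ep`, `ef`, `np_`, `nf` for passing/failing tests that do or do not execute the element).

```python
def er1a(ep, ef, np_, nf):
    """Naish1 (assumed ER1a): -1 unless every failing test executes the
    element; otherwise the number of passing tests that do not execute it."""
    return float(np_) if nf == 0 else -1.0


def er1b(ep, ef, np_, nf):
    """Naish2 (assumed ER1b): ef - ep / (total passing tests + 1)."""
    return ef - ep / (ep + np_ + 1)


def er5a(ep, ef, np_, nf):
    """Wong1 (assumed ER5a): number of failing tests that execute the element."""
    return float(ef)


def er5b(ep, ef, np_, nf):
    """Russel and Rao (assumed ER5b): ef divided by the total number of tests."""
    total = ep + ef + np_ + nf
    return ef / total if total else 0.0


def er5c(ep, ef, np_, nf):
    """Binary (assumed ER5c): 1 if every failing test executes the element, else 0."""
    return 1.0 if nf == 0 else 0.0


# Made-up spectrum: 2 passing and 1 failing test execute the element,
# and 1 failing test does not.
example = (2, 1, 0, 1)
scores = {f.__name__: f(*example) for f in (er1a, er1b, er5a, er5b, er5c)}
```

Within one program the total number of tests is constant, so Wong1 and Russel and Rao always produce the same ranking. This is consistent with the identical ER5a and ER5b figures reported in the results.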

Dataset
10 programs, 199 faulty versions
– Siemens test suite
– Space, NanoXML, XML-Security
Evaluation metric: EXAM score (percentage of program elements inspected before the fault is found)
– The lower the EXAM score, the better the performance
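To make the metric concrete, here is a minimal sketch of an EXAM score computation. The function name, the example scores, and the tie-breaking rule (the faulty element is assumed to be inspected last among elements with the same score) are assumptions; the slides do not say how ties were handled.

```python
def exam_score(scores, faulty):
    """Percentage of elements inspected until the faulty one is reached.

    scores: dict mapping element name to suspiciousness score
    faulty: name of the faulty element
    Ties are resolved pessimistically (an assumption of this sketch).
    """
    target = scores[faulty]
    higher = sum(1 for s in scores.values() if s > target)
    tied = sum(1 for s in scores.values() if s == target)  # includes the fault
    return 100.0 * (higher + tied) / len(scores)


# Made-up suspiciousness scores for four statements.
suspiciousness = {"s1": 0.33, "s2": 0.5, "s3": 1.0, "s4": 0.5}
best_case = exam_score(suspiciousness, "s3")
tied_case = exam_score(suspiciousness, "s2")
```

With these made-up scores, a fault in s3 is found after inspecting one of four statements, while a fault in s2 costs three inspections because s2 ties with s4.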

Results

Technique    Average % Inspected    Standard Deviation
Tarantula    23.37%                 23.44%
Ochiai       21.02%                 21.96%
ER1a         33.34%                 35.22%
ER1b         21.09%                 19.48%
ER5a         43.04%                 19.63%
ER5b         43.04%                 19.63%
ER5c         54.95%                 26.83%

Ochiai has the lowest EXAM score (21.02%).

Results (continued)
Tarantula's EXAM score is lower than that of 4 out of the 5 theoretically best SBFL formulas.

Results (continued)
Wilcoxon signed-rank test (significance level of 0.05): Ochiai is statistically significantly better than ER5a, ER5b, and ER5c.
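For readers unfamiliar with the test, the sketch below shows how a two-sided Wilcoxon signed-rank p-value can be computed exactly for a small paired sample. It is illustrative only: the EXAM scores are made up, the function names are hypothetical, and exhaustive enumeration is practical only for small samples. For a study with 199 versions, a statistics library such as `scipy.stats.wilcoxon` would be the usual choice.

```python
from itertools import product


def signed_ranks(xs, ys):
    """Return non-zero paired differences and the ranks of their absolute
    values (ties receive the average rank)."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return diffs, ranks


def wilcoxon_exact_p(xs, ys):
    """Two-sided exact p-value obtained by enumerating all sign assignments."""
    diffs, ranks = signed_ranks(xs, ys)
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    extreme = 0
    for signs in product((False, True), repeat=len(ranks)):
        candidate = sum(r for r, positive in zip(ranks, signs) if positive)
        if min(candidate, total - candidate) <= observed:
            extreme += 1
    return extreme / 2 ** len(ranks)


# Made-up EXAM scores of two formulas on six faulty versions.
formula_a = [10.0, 12.0, 8.0, 15.0, 20.0, 5.0]
formula_b = [11.0, 14.0, 11.0, 19.0, 25.0, 11.0]
p_value = wilcoxon_exact_p(formula_a, formula_b)
significant = p_value < 0.05
```

The test is paired: each faulty version contributes one difference between the two formulas' EXAM scores, which suits comparing formulas evaluated on the same set of versions.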

100% Test Coverage Assumption
For 135 out of the 199 faulty versions
– Test coverage < 100%
Average test coverage of the 199 versions
– 84.97%
=> Theoretically best SBFL formulas cannot outperform popular SBFL formulas

Conclusion
We conduct an empirical study on 10 programs with 199 faulty versions
– Compare the performance of the 5 theoretically best SBFL formulas with Tarantula and Ochiai
We find that:
– Ochiai outperforms all 5 theoretically best formulas
– Tarantula outperforms 4 out of the 5 formulas
– The assumption of 100% test coverage does not hold in many cases

Future Work
– Study in depth how test coverage and other factors affect the effectiveness of SBFL formulas
– Theoretically analyze the performance of SBFL formulas under the assumption that test coverage < 100%

Thank you! Questions? Comments? Advice?
{btdle.2012, ferdiant.2013,