How to Predict More with Less: Defect Prediction Using Machine Learners in an Implicitly Data Starved Domain
Kim Kaminsky, Gary D. Boetticher
Department of Computer Science, University of Houston - Clear Lake, Houston, Texas, USA
Preamble
The maturing of Software Engineering as a discipline requires a better understanding of the complexity of the software process. Empirical modeling is one mechanism for improving that understanding, and thus the management of the software process.
Data Starvation Issues in Software Engineering
- Heavily context dependent: measure A comes from Project X, measure B from Project Y
- Unreliable data due to poor processes
- Organizations do not share data
- Projects are large, so project estimation data occurs infrequently
Implicitly Data Starved Domains
- Lots of this: number of modules
- Little of that: defect counts (most modules have few or no recorded defects)
Equalized Learning
Balance data by replicating sparse instances [Mizuno99]. Example: a dataset with 300 instances of 0 defects, 20 instances of 5 defects, and 10 instances of 9 defects is equalized by replicating the two sparse classes until each defect class contributes 300 instances.
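The replication scheme on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name and the choice to pad every class up to the size of the largest class are assumptions consistent with the slide's 300/20/10 example.

```python
from collections import defaultdict

def equalize(instances, defect_counts):
    """Replicate sparse instances so that every defect-count class
    appears as often as the most common class (a sketch of the
    slide's equalization idea; details are assumed)."""
    by_class = defaultdict(list)
    for inst, d in zip(instances, defect_counts):
        by_class[d].append(inst)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for d, group in by_class.items():
        # Repeat the group until it reaches the target size.
        reps = -(-target // len(group))       # ceiling division
        padded = (group * reps)[:target]
        balanced.extend((inst, d) for inst in padded)
    return balanced

# The slide's example: 300 zero-defect, 20 five-defect,
# and 10 nine-defect modules.
data = [("mod", 0)] * 300 + [("mod", 5)] * 20 + [("mod", 9)] * 10
eq = equalize([x for x, _ in data], [d for _, d in data])
# Each class now contributes 300 samples, 900 in total.
```

Replication (rather than synthetic interpolation) keeps every training sample a real observation, at the cost of the much larger dataset noted in the conclusions.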
Genetic Programming Process - 1
- Each chromosome is an expression tree over the metrics and constants, e.g. A + B
- Fitness value = model performance on the data
- Example: two (of many) chromosomes score 888 out of 1000 and 913 out of 1000
Genetic Programming Process - 2
- Crossover: two parent chromosomes exchange randomly chosen subtrees, e.g. trees such as B + (3 - D) and A * (A + B)
- Mutation: a node within a chromosome is altered, e.g. the constant 3 in A * (3 - D) becomes 3.1
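The two operators can be sketched on the same nested-tuple trees. This is a simplified illustration (root-level subtree swap and constant perturbation only), not the authors' operators, which likely pick crossover and mutation points anywhere in the tree.

```python
import random

# Chromosomes are nested tuples (op, left, right) or terminals.
def crossover(p1, p2):
    """Swap one randomly chosen child subtree between two parent
    trees (root-level crossover; a simplification of full subtree swap)."""
    i = random.choice([1, 2])
    child1 = p1[:i] + (p2[i],) + p1[i + 1:]
    child2 = p2[:i] + (p1[i],) + p2[i + 1:]
    return child1, child2

def mutate(chrom, delta=0.1):
    """Perturb a constant terminal (e.g. 3 -> 3.1, as on the slide);
    otherwise recurse into one randomly chosen child."""
    if isinstance(chrom, (int, float)):
        return chrom + delta
    if isinstance(chrom, tuple):
        i = random.choice([1, 2])
        return chrom[:i] + (mutate(chrom[i], delta),) + chrom[i + 1:]
    return chrom  # variable terminals are left unchanged

# Crossover of A + B with 3 - D may yield A + (3 - D) or 3 + B, etc.
c1, c2 = crossover(("+", "A", "B"), ("-", 3, "D"))
```

Because crossover only recombines existing material, mutation is what introduces genuinely new constants such as the slide's 3.1.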
NASA KC2 Defect Dataset
- Input: product metrics (size, complexity, vocabulary)
- Output: defect count
- 379 unique tuples; equalization produces 3013 samples
Original versus Equalized Data: Experiment Configuration
- 2000 characters (chromosome size limit)
- 1000 chromosomes (population size)
- 50 generations maximum
- 20 trials
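The configuration above maps onto a standard generational GP loop. The sketch below fixes only the numbers the slide gives; the selection scheme (truncation to the top half) and the breeding interface are assumptions, since the slides do not specify them.

```python
import random

POP_SIZE = 1000          # 1000 chromosomes
MAX_GENERATIONS = 50     # 50 generations max.
TRIALS = 20              # 20 trials
MAX_CHROM_CHARS = 2000   # 2000-character chromosome limit

def run_trial(random_chromosome, fitness, breed):
    """One GP trial: evolve POP_SIZE chromosomes for up to
    MAX_GENERATIONS generations. The operators are passed in as
    functions; truncation selection here is an assumption."""
    pop = [random_chromosome() for _ in range(POP_SIZE)]
    for _ in range(MAX_GENERATIONS):
        ranked = sorted(pop, key=fitness, reverse=True)
        survivors = ranked[:POP_SIZE // 2]
        children = [breed(random.choice(survivors), random.choice(survivors))
                    for _ in range(POP_SIZE - len(survivors))]
        # Enforce the 2000-character size cap on offspring.
        pop = survivors + [c for c in children
                           if len(str(c)) <= MAX_CHROM_CHARS]
    return max(pop, key=fitness)

# Toy usage: chromosomes are plain numbers, fitness is identity,
# breeding averages two parents. Over TRIALS runs the best value
# converges toward the maximum the random initializer can produce.
best = run_trial(lambda: random.random(), lambda x: x,
                 lambda a, b: (a + b) / 2)
```

Running 20 such independent trials per dataset supports the t-test comparison on the next slide.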
Original versus Equalized Data: t-test Results
Conclusions
- Equalized learning spawns large datasets
- Equalized learning produces better models
Future Directions
- Apply to other NASA datasets
- Improve performance: distributed GP