1
How to Predict More with Less: Defect Prediction Using Machine Learners in an Implicitly Data Starved Domain
Kim Kaminsky, Gary D. Boetticher
Department of Computer Science, University of Houston - Clear Lake, Houston, Texas, USA
2
Preamble
The maturing of Software Engineering as a discipline requires a better understanding of the complexity of the software process. Empirically based modeling is one mechanism for improving that understanding, and thus the management, of the software process.
3
Data Starvation Issues in Software Engineering
- Heavily context dependent: Measure A from Project X, Measure B from Project Y
- Unreliable data due to poor processes
- Organizations do not share data
- Projects are large, so project estimation data occurs infrequently
4
Implicitly Data Starved Domains
Lots of this: number of modules. Little of that: defect counts.
5
Equalized Learning: Balance Data by Replicating Sparse Instances [Mizuno99]
Example: a raw dataset holds 300 instances with 0 defects, 20 instances with 5 defects, and 10 instances with 9 defects (three different kinds of instances, shown in three colors on the slide). Equalization replicates the two sparse classes until every defect count has 300 instances.
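The replication step can be sketched in a few lines of Python. This is a minimal illustration of the balancing idea credited to [Mizuno99] above, not the authors' implementation; the names equalize and get_defect_count are placeholders.

```python
def equalize(instances, get_defect_count):
    """Replicate sparse instances so that every defect-count class ends up
    as large as the biggest class (sketch; names are illustrative)."""
    by_class = {}
    for inst in instances:
        by_class.setdefault(get_defect_count(inst), []).append(inst)
    target = max(len(group) for group in by_class.values())
    equalized = []
    for group in by_class.values():
        copies, remainder = divmod(target, len(group))
        equalized.extend(group * copies + group[:remainder])
    return equalized
```

With the class sizes on the slide (300, 20, and 10 instances), each class grows to 300 instances after equalization.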
6
Genetic Programming Process - 1
Fitness value = model performance on the data. The slide shows two (of many) chromosomes, each an arithmetic expression tree over the input metrics (e.g., A + B and a tree built from 3 - D), scoring 888 out of 1000 and 913 out of 1000 against the data.
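To make the fitness idea concrete, the sketch below evaluates an expression-tree chromosome and counts how many instances it predicts acceptably. The tuple-based tree representation, the operator set, and the tolerance-based scoring are assumptions for illustration; the slide only states that fitness is model performance on the data.

```python
# A chromosome is either a terminal (a metric name or a numeric constant)
# or a tuple (operator, left_subtree, right_subtree), e.g. ('+', 'A', 'B').
OPS = {'+': lambda a, b: a + b,
       '-': lambda a, b: a - b,
       '*': lambda a, b: a * b}

def evaluate(tree, metrics):
    """Evaluate an expression tree on one module's metric values."""
    if isinstance(tree, tuple):              # internal node: apply operator
        op, left, right = tree
        return OPS[op](evaluate(left, metrics), evaluate(right, metrics))
    if isinstance(tree, str):                # terminal: input metric, e.g. 'A'
        return metrics[tree]
    return tree                              # terminal: numeric constant

def fitness(tree, data, tolerance=0.5):
    """Count the (metrics, defect_count) pairs predicted within tolerance
    (a hypothetical criterion; the slide reports only scores such as
    888 out of 1000 and 913 out of 1000)."""
    return sum(1 for metrics, defect_count in data
               if abs(evaluate(tree, metrics) - defect_count) <= tolerance)
```

For example, fitness(('+', 'A', 'B'), data) counts how many instances in data the chromosome A + B predicts within the tolerance.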
7
Genetic Programming Process - 2
Two parent chromosomes are recombined by crossover (exchanging subtrees between their expression trees) and altered by mutation (a small change to a single node, e.g., the constant 3 becoming 3.1).
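The sketch below shows standard subtree crossover and a constant-perturbing mutation over the same tuple-based trees as the previous sketch. The exact operators and rates used in the paper are not given on the slide; this is only an illustration of the two steps named there.

```python
import random

def subtrees(tree, path=()):
    """Yield (path, subtree) for every node in the expression tree."""
    yield path, tree
    if isinstance(tree, tuple):
        yield from subtrees(tree[1], path + (1,))
        yield from subtrees(tree[2], path + (2,))

def replace(tree, path, new_subtree):
    """Return a copy of tree with the node at `path` swapped out."""
    if not path:
        return new_subtree
    op, left, right = tree
    if path[0] == 1:
        return (op, replace(left, path[1:], new_subtree), right)
    return (op, left, replace(right, path[1:], new_subtree))

def crossover(parent_a, parent_b):
    """Exchange randomly chosen subtrees between two parent chromosomes."""
    path_a, sub_a = random.choice(list(subtrees(parent_a)))
    path_b, sub_b = random.choice(list(subtrees(parent_b)))
    return replace(parent_a, path_a, sub_b), replace(parent_b, path_b, sub_a)

def mutate(tree, jitter=0.1):
    """Nudge one numeric constant, e.g. 3 -> 3.1 as shown on the slide."""
    constants = [(p, s) for p, s in subtrees(tree)
                 if isinstance(s, (int, float))]
    if not constants:
        return tree
    path, value = random.choice(constants)
    return replace(tree, path, value + jitter)
```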
8
NASA KC2 Defect Dataset: 379 unique tuples; equalization produces 3013 samples.
Input: product metrics (size, complexity, vocabulary). Output: defect count.
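One way to represent a single KC2 tuple in code is shown below; the field names are placeholders grouped by the metric categories named on the slide, not the dataset's actual metric names.

```python
from typing import NamedTuple

class KC2Tuple(NamedTuple):
    """One module from the NASA KC2 dataset (illustrative field names)."""
    size: float         # product metric: size
    complexity: float   # product metric: complexity
    vocabulary: float   # product metric: vocabulary
    defect_count: int   # output the learner predicts
```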
9
Original versus Equalized Data Experiment Configuration
- 2000 characters
- 1000 chromosomes
- 50 generations maximum
- 20 trials
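A minimal way to hold this configuration in code, assuming the 2000-character figure bounds chromosome length and the 1000 chromosomes form the population (both readings are inferred from the slide, not stated on it):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GPConfig:
    """GP experiment settings from the slide (field names are illustrative)."""
    max_chromosome_chars: int = 2000  # assumed: per-chromosome length limit
    population_size: int = 1000       # assumed: chromosomes per generation
    max_generations: int = 50
    trials: int = 20
```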
10
Original versus Equalized Data t-test Results
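The results table itself is not reproduced in this transcript. As an illustration of how such a comparison is commonly computed, the sketch below runs an independent two-sample t-test (via SciPy) on the per-trial scores; the paper's actual trial scores, test variant, and significance level are not given here.

```python
from scipy import stats

def compare_trials(original_scores, equalized_scores, alpha=0.05):
    """Test whether GP trials on the equalized data differ significantly in
    mean performance from trials on the original data (illustrative sketch)."""
    t_stat, p_value = stats.ttest_ind(original_scores, equalized_scores)
    return t_stat, p_value, p_value < alpha
```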
11
Conclusions
- Equalized learning spawns large datasets
- Equalized learning produces better models
12
Future Directions
- Apply to other NASA datasets
- Improve performance: distributed GP