Published by Avice Robertson. Modified over 6 years ago.
1
Applying Combinatorial Testing to Data Mining Algorithms
Jaganmohan Chandrasekaran (UTA), Huadong Feng (UTA), Yu Lei (UTA), D. Richard Kuhn (NIST), Raghu Kacker (NIST)
March 13, 2017
Good afternoon, everyone. It's great to see you all here, and thank you for having me. My name is Huadong Feng; friends call me Jack. I am a software engineering PhD student at the University of Texas at Arlington. In this session, we will look at how we apply combinatorial testing (CT) to data mining algorithms (DMA). First, we will look at why we want to do that. Next, we will discuss how our approach applies CT to DMA. Then we will look at the experimental results to see how effective CT is when applied to DMA.
2
Outline
Introduction
Experimental Design
  Research Questions
  Subject Programs
  Datasets
  Input Parameter Modeling
  Test Generation
  Metrics
Experimental Results
  Impact of Datasets
  Branch Coverage Results of T-Way Testing
  Mutation Coverage Results of T-Way Testing
Conclusion & Future Work
3
Introduction
Data Mining Algorithms
  Widely developed and used
  Large amounts of data as input
  Intensive and complex computing
Combinatorial Testing (CT)
  Proven method for more effective software testing at lower cost
How effective is Combinatorial Testing when applied to Data Mining Algorithms?
5
Experimental Design
Research Questions
  How effective is CT when applied to data mining algorithms?
  How do different datasets impact test coverage?
  Is branch coverage a good indicator of fault detection effectiveness for data mining algorithms?
6
Experimental Design
Subject Programs
  Top 5 most influential data mining algorithms*: C4.5, K-Means, SVM, Apriori, EM
  Implementations from WEKA
  CT tests are applied to the configuration options of the subject algorithms
We selected five data mining algorithms for this experiment. They were identified by the IEEE International Conference on Data Mining (ICDM) as the five most influential data mining algorithms. The implementations we used in this experiment are from the WEKA data mining tool. WEKA is a collection of machine learning algorithms for data mining tasks.
7
Experimental Design
Datasets
  51 benchmarking datasets
  Datasets provided by WEKA and the UC Irvine Machine Learning Repository
  Not all datasets are applicable to all algorithms
We have 51 candidate datasets for this experiment. These 51 datasets were provided by the WEKA data mining tool and UC Irvine. Please note that not all datasets are applicable to all the selected algorithms. We analyzed each dataset against each selected data mining algorithm and determined its applicability. These benchmarking datasets contain various types of data, in different formats and sizes.
8
Experimental Design
9
Experimental Design
Input Parameter Modeling (IPM)
  Applied to configuration options
  Equivalence partitioning based on domain knowledge
  Identify representative values of the equivalence partitions
  Constraints
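The modeling steps above can be sketched roughly as follows. This is only an illustration, not the model used in the study: the option names, the representative values, and the single constraint are all hypothetical, loosely styled after the options of a decision tree learner.

```python
from itertools import product

# Hypothetical input parameter model: each configuration option maps to the
# representative values picked from its equivalence partitions (all assumed).
MODEL = {
    "unpruned": [True, False],
    "confidence_factor": [None, 0.05, 0.25, 0.5],  # None = option not set
    "min_instances_per_leaf": [1, 2, 10],
}

def satisfies_constraints(config):
    """Assumed constraint: a confidence factor only makes sense when pruning."""
    if config["unpruned"] and config["confidence_factor"] is not None:
        return False
    return True

def valid_configurations(model):
    """Yield every combination of representative values that meets the constraints."""
    names = list(model)
    for values in product(*(model[name] for name in names)):
        config = dict(zip(names, values))
        if satisfies_constraints(config):
            yield config

configs = list(valid_configurations(MODEL))
print(len(configs), "valid configurations out of", 2 * 4 * 3)
```

Enumerating every valid configuration is feasible only for a model this small; the point of the sketch is how partitions, representative values, and constraints fit together before any test generation happens.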
10
Experimental Design
Test Generation
  1-way to 6-way positive tests
    Generated using ACTS in extend mode
  Negative 1-way tests
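To make the idea of t-way test generation concrete, here is a minimal sketch. It is not the algorithm ACTS uses; it is a naive greedy search over the full Cartesian product, it ignores constraints, and it is practical only for very small models. The model, the option names, and the invalid values are hypothetical. The negative-test helper reflects one common reading of "negative 1-way": each test carries exactly one invalid value while every other option stays valid.

```python
from itertools import combinations, product

# Hypothetical model: option name -> representative valid values (all assumed).
MODEL = {
    "pruning": ["on", "off"],
    "min_leaf": [1, 2, 10],
    "binary_splits": ["yes", "no"],
    "seed": [7, 42],
}

def t_way_targets(model, t):
    """All value combinations of every group of t options that must be covered."""
    targets = set()
    for group in combinations(sorted(model), t):
        for values in product(*(model[name] for name in group)):
            targets.add(tuple(zip(group, values)))
    return targets

def covered_by(test, t):
    """The t-way combinations that a single test covers."""
    return {tuple((name, test[name]) for name in group)
            for group in combinations(sorted(test), t)}

def greedy_t_way(model, t):
    """Repeatedly pick the candidate test that covers the most uncovered targets."""
    names = sorted(model)
    candidates = [dict(zip(names, values))
                  for values in product(*(model[name] for name in names))]
    uncovered = t_way_targets(model, t)
    tests = []
    while uncovered:
        best = max(candidates, key=lambda c: len(covered_by(c, t) & uncovered))
        tests.append(best)
        uncovered -= covered_by(best, t)
    return tests

def negative_one_way(base, invalid_values):
    """One test per invalid value; all other options keep their valid base values."""
    return [{**base, name: bad}
            for name, bad_values in invalid_values.items()
            for bad in bad_values]

pairwise = greedy_t_way(MODEL, 2)
print(len(pairwise), "tests cover all 2-way combinations;",
      "exhaustive testing would need", 2 * 3 * 2 * 2)

# Assumed invalid values, used only to show the shape of negative tests.
negatives = negative_one_way(pairwise[0], {"min_leaf": [-1], "seed": ["abc"]})
```

Raising `t` toward the number of options makes the test set approach the exhaustive one, which is why the strength at which coverage stops improving matters in practice.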
11
Experimental Design
Metrics
  Branch coverage by JaCoCo
    A free code coverage library for Java
  Mutation coverage by PIT
    A mutation testing tool developed by Henry Coles
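Both metrics reduce to simple ratios. The sketch below assumes the raw counts (covered branches, killed mutants, and their totals) have already been read from the tools' reports; it does not parse JaCoCo or PIT output, and the function names are hypothetical.

```python
def coverage_ratio(hit, total):
    """Return hit/total as a fraction in [0, 1]; 0.0 when there is nothing to cover."""
    if total < 0 or not 0 <= hit <= total:
        raise ValueError("counts must satisfy 0 <= hit <= total")
    return hit / total if total else 0.0

def branch_coverage(covered_branches, total_branches):
    """Fraction of branches executed by the test set."""
    return coverage_ratio(covered_branches, total_branches)

def mutation_coverage(killed_mutants, total_mutants):
    """Fraction of generated mutants that the test set detects (kills)."""
    return coverage_ratio(killed_mutants, total_mutants)
```

For example, `branch_coverage(3, 4)` gives 0.75. Mutation coverage needs the test set to be re-run against every mutant, which is what makes it the more expensive of the two measurements.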
13
Impact of Datasets
14
Impact of Datasets
15
Impact of Datasets
Finding: Larger datasets do not necessarily achieve higher branch coverage. In some cases, smaller datasets achieve higher branch coverage than larger ones.
Implication: The size of a dataset is not a dominating factor in determining its test effectiveness. Other characteristics must be considered, e.g., the dataset structure and the relationships between data instances. It is possible to create small datasets that are effective for testing data mining algorithms.
16
Branch Coverage of T-way Testing
17
Branch Coverage of T-way Testing
18
Branch Coverage of T-way Testing
Finding: Branch coverage increases more and more slowly as test strength increases, and it stops increasing at a relatively low test strength.
Implication: Under CT, data mining algorithms behave similarly to general software applications. CT has the potential to be effective for testing data mining algorithms.
19
Mutation Coverage of T-way Testing
20
Mutation Coverage of T-way Testing
21
Branch Coverage of T-way Testing
22
Mutation Coverage of T-way Testing
23
Mutation Coverage of T-way Testing
Finding: Higher branch coverage seems to imply higher mutation coverage, and vice versa.
Implication: Branch coverage could serve as a good indicator of fault detection effectiveness for data mining algorithms. This is useful because mutation coverage is expensive to measure.
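One way to quantify how closely the two metrics move together is a correlation coefficient. The slides do not say which statistic, if any, was used; the sketch below uses Pearson's r as an assumed choice, and the input values are toy numbers, not measurements from the study.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long series with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Toy numbers only (a perfectly linear pair), not results from the experiment.
toy_branch = [0.2, 0.4, 0.6, 0.8]
toy_mutation = [0.1, 0.2, 0.3, 0.4]
print(round(pearson(toy_branch, toy_mutation), 3))
```

A value near 1 across test sets of different strengths would support using branch coverage as a cheaper stand-in; a value near 0 would not.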
25
Conclusion
  Larger datasets do not necessarily achieve higher test coverage than smaller datasets.
  The test coverage of a CT test set increases more and more slowly as test strength increases.
  Branch coverage correlates well with mutation coverage.
The experiment gives us an initial understanding of the effectiveness of CT on data mining algorithms. In particular, the results indicate that data mining algorithms behave in a way that is similar to general software. This suggests that CT has the potential to be applied effectively to data mining algorithms.
26
Future Work
  Detailed code analysis
    Why are some branches not covered by our test cases?
  Apply CT to create or reduce datasets for data mining algorithms
  Further investigation and experiments on negative testing of data mining algorithms