Applying Combinatorial Testing to Data Mining Algorithms
Jaganmohan Chandrasekaran (UTA), Huadong Feng (UTA), Yu Lei (UTA), D. Richard Kuhn (NIST), Raghu Kacker (NIST)
March 13, 2017

Good afternoon, everyone. It's great to see you all here, and thank you for having me. My name is Huadong Feng; friends call me Jack. I am a software engineering PhD student at the University of Texas at Arlington. In this session we will look at how we apply combinatorial testing (CT) to data mining algorithms (DMA). First, we will look at why we want to do that. Next, we will discuss how our approach applies CT to DMA. Finally, we will look at the experimental results to see how effective CT is when applied to DMA.
Outline
- Introduction
- Experimental Design
  - Research Questions
  - Subject Programs
  - Datasets
  - Input Parameter Modeling
  - Test Generation
  - Metrics
- Experimental Results
  - Impact of Datasets
  - Branch Coverage Results of T-Way Testing
  - Mutation Coverage Results of T-Way Testing
- Conclusion & Future Work
Introduction
Data Mining Algorithms
- Widely developed and used
- Large amounts of data as input
- Intensive and complex computing
Combinatorial Testing (CT)
- Proven method for more effective software testing at lower cost
How effective is combinatorial testing when applied to data mining algorithms?
Experimental Design: Research Questions
- How effective is CT when applied to data mining algorithms?
- How do different datasets impact test coverage?
- Is branch coverage a good indicator of fault detection effectiveness for data mining algorithms?
Experimental Design: Subject Programs
- Top 5 most influential data mining algorithms*: C4.5, K-Means, SVM, Apriori, EM
- Implementations from WEKA
- CT tests are applied to the configuration options of the subject algorithms

We selected five data mining algorithms for this experiment. They were identified by the International Conference on Data Mining (ICDM) as the five most influential data mining algorithms. The implementations we used in this experiment come from the WEKA data mining tool. WEKA is a collection of machine learning algorithms for data mining tasks.
Experimental Design: Datasets
- 51 benchmark datasets
- Datasets provided by WEKA and the UC Irvine Machine Learning Repository
- Not all datasets are applicable to all algorithms

We have 51 candidate datasets for this experiment. They were provided by the WEKA data mining tool and UC Irvine. Please note that not all datasets are applicable to all of the selected algorithms. We analyzed each dataset against each selected data mining algorithm and determined its applicability. These benchmark datasets contain various types of data, in different formats and sizes.
Experimental Design: Input Parameter Modeling (IPM)
- Applied to configuration options
- Equivalence partitioning based on domain knowledge
- Identify representative values of the equivalence partitions
- Constraints
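To make the modeling steps concrete, here is a minimal sketch of what an input parameter model could look like. The option names, the representative values, and the constraint are all hypothetical; they are not the models used in the experiment, only an illustration of partitions, representative values, and constraints.

```python
from itertools import product

# Hypothetical model of a clustering algorithm's configuration options.
# Each list holds one representative value per equivalence partition.
MODEL = {
    "num_clusters": [2, 10, 100],           # small / medium / large
    "max_iterations": [1, 500],             # boundary / typical
    "init_method": ["random", "k-means++"],
    "preserve_order": [True, False],
}


def satisfies_constraints(test):
    """Hypothetical constraint: k-means++ init is not combined with 100 clusters."""
    return not (test["init_method"] == "k-means++" and test["num_clusters"] == 100)


def valid_combinations(model, constraint):
    """Enumerate every combination of representative values that passes the constraint."""
    names = list(model)
    tests = []
    for values in product(*(model[name] for name in names)):
        test = dict(zip(names, values))
        if constraint(test):
            tests.append(test)
    return tests


valid = valid_combinations(MODEL, satisfies_constraints)
print(len(valid), "valid combinations")
```

Exhaustive enumeration is only practical for small models like this one; it is shown here to make the effect of the constraint visible, not as a test generation strategy.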
Experimental Design: Test Generation
- 1-way to 6-way positive tests, generated using ACTS in extend mode
- Negative 1-way tests
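As an illustration of what t-way coverage means, the sketch below builds a t-way test set with a simple greedy search over all candidate tests. This is not the algorithm ACTS uses, it ignores constraints, and it only scales to tiny models. The `negative_one_way` helper reflects one common reading of negative 1-way testing (each test contains exactly one invalid value) and is an assumption, as are all function names.

```python
from itertools import combinations, product


def t_way_targets(model, t):
    """Every value combination of every group of t parameters."""
    names = sorted(model)
    targets = set()
    for group in combinations(names, t):
        for values in product(*(model[name] for name in group)):
            targets.add(tuple(zip(group, values)))
    return targets


def covered_by(test, t):
    """The t-way value combinations that a single test covers."""
    names = sorted(test)
    return {tuple((n, test[n]) for n in group) for group in combinations(names, t)}


def greedy_t_way(model, t):
    """Repeatedly pick the candidate test that covers the most uncovered targets."""
    names = sorted(model)
    candidates = [dict(zip(names, vals))
                  for vals in product(*(model[n] for n in names))]
    uncovered = t_way_targets(model, t)
    tests = []
    while uncovered:
        best = max(candidates, key=lambda c: len(covered_by(c, t) & uncovered))
        tests.append(best)
        uncovered -= covered_by(best, t)
    return tests


def negative_one_way(valid_base, invalid_values):
    """One test per invalid value; all other parameters keep valid values."""
    tests = []
    for name, bad_values in invalid_values.items():
        for bad in bad_values:
            test = dict(valid_base)
            test[name] = bad
            tests.append(test)
    return tests
```

For a model with three two-valued parameters, the 2-way set is smaller than the 8 exhaustive tests, while the 3-way set is the full exhaustive set.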
Experimental Design: Metrics
- Branch coverage, measured by JaCoCo, a free code coverage library for Java
- Mutation coverage, measured by PIT, a mutation testing tool developed by Henry Coles
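Both metrics are simple ratios, which the following sketch spells out. The XML fragment is hand-written and only assumed to follow the shape of a JaCoCo XML report (top-level `counter` elements with `missed` and `covered` attributes). The mutation function is a simplification that counts only mutants labeled `KILLED`; the helper names and sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment, assumed to mirror a JaCoCo XML report's top-level counters.
SAMPLE_REPORT = """<report name="example">
  <counter type="INSTRUCTION" missed="10" covered="90"/>
  <counter type="BRANCH" missed="5" covered="15"/>
</report>"""


def branch_coverage(xml_text):
    """Covered branches divided by all branches, from the top-level BRANCH counter."""
    root = ET.fromstring(xml_text)
    for counter in root.findall("counter"):
        if counter.get("type") == "BRANCH":
            covered = int(counter.get("covered"))
            missed = int(counter.get("missed"))
            total = covered + missed
            return covered / total if total else 0.0
    raise ValueError("no BRANCH counter found")


def mutation_coverage(statuses):
    """Killed mutants divided by all generated mutants (simplified)."""
    if not statuses:
        return 0.0
    killed = sum(1 for status in statuses if status == "KILLED")
    return killed / len(statuses)


print(branch_coverage(SAMPLE_REPORT))
print(mutation_coverage(["KILLED", "SURVIVED", "KILLED", "NO_COVERAGE"]))
```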
Impact of Datasets
Finding: Larger datasets do not necessarily achieve higher branch coverage. In some cases, smaller datasets achieve higher branch coverage than larger datasets.
Implication: The size of a dataset is not a dominating factor in determining its test effectiveness. Other characteristics must be considered, e.g., the dataset structure and the relationships between data instances. It is possible to create small datasets that are effective for testing data mining algorithms.
Branch Coverage of T-way Testing
Finding: Branch coverage increases at a progressively slower rate as test strength increases. The increase stops at a relatively low test strength.
Implication: Under CT, data mining algorithms behave similarly to general software applications. CT has the potential to be effective for testing data mining algorithms.
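One way to locate the strength at which coverage stops increasing is sketched below. The coverage numbers are made up for illustration and are not results from the experiment; `saturation_strength` and its `tolerance` parameter are hypothetical names.

```python
def saturation_strength(coverage_by_t, tolerance=0.0):
    """Lowest strength whose coverage is within `tolerance` of the highest strength's."""
    strengths = sorted(coverage_by_t)
    final = coverage_by_t[strengths[-1]]
    for t in strengths:
        if final - coverage_by_t[t] <= tolerance:
            return t
    return strengths[-1]


# Made-up branch coverage for 1-way to 6-way test sets (illustration only).
illustrative = {1: 0.60, 2: 0.72, 3: 0.75, 4: 0.75, 5: 0.75, 6: 0.75}
print(saturation_strength(illustrative))
```

With a nonzero tolerance, the same function reports the strength at which the remaining gain becomes negligible rather than exactly zero.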
Mutation Coverage of T-way Testing
Finding: Higher branch coverage seems to imply higher mutation coverage, and vice versa.
Implication: Branch coverage could be used as a good indicator of fault detection effectiveness for data mining algorithms, since mutation coverage is expensive to measure.
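A relationship like this can be quantified with a correlation coefficient. The sketch below computes Pearson's r by hand; the choice of Pearson's r is an assumption (the talk does not say which measure, if any, was used), and the paired values are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up (branch coverage, mutation coverage) pairs, one per test set.
branch = [0.42, 0.55, 0.61, 0.70]
mutation = [0.30, 0.41, 0.44, 0.52]
print(round(pearson(branch, mutation), 3))
```

A value close to 1 would support using branch coverage as a cheaper proxy; a rank-based measure could be substituted if the relationship is monotonic but not linear.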
Conclusion
- Larger datasets do not necessarily achieve higher test coverage than smaller datasets.
- The test coverage of a CT test set increases at a progressively slower rate as test strength increases.
- Branch coverage correlates well with mutation coverage.

The experiment gives us an initial understanding of the effectiveness of CT on data mining algorithms. In particular, the results indicate that data mining algorithms behave in a way that is similar to general software. This suggests that CT has the potential to be applied effectively to data mining algorithms.
Future Work
- Detailed code analysis: why are some branches not covered by our test cases?
- Apply CT to create or reduce datasets for data mining algorithms.
- Further investigation of, and experiments on, negative testing of data mining algorithms.