Applying Combinatorial Testing to Data Mining Algorithms

Applying Combinatorial Testing to Data Mining Algorithms Jaganmohan Chandrasekaran (UTA), Huadong Feng (UTA), Yu Lei (UTA), D. Richard Kuhn (NIST), Raghu Kacker (NIST) March 13, 2017 Good afternoon, everyone. It's great to see you all here, and thank you for having me. My name is Huadong Feng; friends call me Jack. I am a software engineering PhD student at the University of Texas at Arlington. In this session, we will look at how we apply combinatorial testing (CT) to data mining algorithms (DMA). First, we will look at why we want to do that. Next, we will discuss how our approach applies CT to DMA. Then we will look at the experimental results to see how effective CT is when applied to DMA.

Outline: Introduction; Experimental Design (Research Questions, Subject Programs, Datasets, Input Parameter Modeling, Test Generation, Metrics); Experimental Results (Impact of Datasets, Branch Coverage Results of T-Way Testing, Mutation Coverage Results of T-Way Testing); Conclusion & Future Work

Introduction Data Mining Algorithms: widely developed and used; large amounts of data as input; intensive and complex computing. Combinatorial Testing (CT): a proven method for more effective software testing at lower cost. How effective is Combinatorial Testing when applied to Data Mining Algorithms?

Experimental Design Research Questions: (1) How effective is CT when applied to data mining algorithms? (2) How do different datasets impact test coverage? (3) Is branch coverage a good indicator of fault detection effectiveness for data mining algorithms?

Experimental Design Subject Programs Top 5 most influential data mining algorithms: C4.5, K-Means, SVM, Apriori, EM. Implementations from WEKA. CT tests are applied to the configuration options of the subject algorithms. We selected five data mining algorithms for this experiment. They were identified by the IEEE International Conference on Data Mining (ICDM) as the five most influential data mining algorithms. The implementations we used in this experiment are from the WEKA data mining tool. WEKA is a collection of machine learning algorithms for data mining tasks.

Experimental Design Datasets 51 benchmarking datasets. Datasets provided by WEKA and the UC Irvine Machine Learning Repository. Not all datasets are applicable to all algorithms. We have 51 candidate datasets for this experiment. These 51 datasets were provided by the WEKA data mining tool and UC Irvine. Please note that not all datasets are applicable to all the selected algorithms. We analyzed each dataset against each selected data mining algorithm and determined its applicability. These benchmarking datasets contain various types of data in different formats and sizes.

Experimental Design

Experimental Design Input Parameter Modeling (IPM): applied to configuration options; equivalence partitioning based on domain knowledge; identify representative values of equivalence partitions; constraints.
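The modeling steps on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions: the parameter names, representative values, and the single constraint are hypothetical and are not the input parameter model used in the study.

```python
from itertools import product
from math import prod

# Hypothetical input parameter model for a decision-tree learner.
# Each parameter lists one representative value per equivalence partition.
# Names and values are illustrative assumptions, not the study's model.
PARAMETERS = {
    "pruning": ["unpruned", "confidence", "reduced_error"],
    "confidence": [None, 0.05, 0.25, 0.5],  # None means "option not set"
    "min_instances": [1, 2, 10],
    "binary_splits": [False, True],
}


def satisfies_constraints(cfg):
    """Return True if the configuration respects the (assumed) constraint.

    Assumed constraint: a confidence factor is set if and only if
    confidence-based pruning is selected.
    """
    if cfg["pruning"] == "confidence":
        return cfg["confidence"] is not None
    return cfg["confidence"] is None


def valid_configurations(parameters):
    """Yield every combination of representative values that is valid."""
    names = list(parameters)
    for values in product(*parameters.values()):
        cfg = dict(zip(names, values))
        if satisfies_constraints(cfg):
            yield cfg


configs = list(valid_configurations(PARAMETERS))
total = prod(len(values) for values in PARAMETERS.values())
print(f"{len(configs)} valid configurations out of {total} combinations")
```

Even in this toy model the constraint removes more than half of the raw combinations, which is why constraints are part of the model rather than an afterthought.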

Experimental Design Test Generation: 1-way to 6-way positive tests, generated using ACTS in extend mode; negative 1-way tests.
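To make the idea of a t-way test set concrete, here is a minimal sketch of a greedy generator. It is not the algorithm ACTS uses; it is a simple brute-force greedy strategy that only works for small models, and the parameter names and values are hypothetical.

```python
from itertools import combinations, product


def t_way_targets(parameters, t):
    """All t-way (parameter, value) combinations that must be covered."""
    names = sorted(parameters)
    targets = set()
    for group in combinations(names, t):
        for values in product(*(parameters[n] for n in group)):
            targets.add(tuple(zip(group, values)))
    return targets


def covered_by(test, t):
    """The t-way combinations that a single test covers."""
    names = sorted(test)
    return {tuple((n, test[n]) for n in group) for group in combinations(names, t)}


def greedy_t_way(parameters, t):
    """Pick, one at a time, the test that covers the most uncovered targets."""
    names = sorted(parameters)
    candidates = [
        dict(zip(names, values))
        for values in product(*(parameters[n] for n in names))
    ]
    uncovered = t_way_targets(parameters, t)
    tests = []
    while uncovered:
        best = max(candidates, key=lambda c: len(covered_by(c, t) & uncovered))
        tests.append(best)
        uncovered -= covered_by(best, t)
    return tests


# Hypothetical model: four options with three representative values each.
MODEL = {
    "opt_a": [0, 1, 2],
    "opt_b": ["x", "y", "z"],
    "opt_c": [0.1, 0.5, 0.9],
    "opt_d": ["low", "mid", "high"],
}

pairwise = greedy_t_way(MODEL, 2)
exhaustive = 3 ** 4
print(f"2-way test set: {len(pairwise)} tests (exhaustive: {exhaustive})")
```

Raising `t` toward the number of parameters makes the test set approach the exhaustive one, which is the trade-off the 1-way to 6-way comparison explores.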

Experimental Design Metrics Branch coverage measured with JaCoCo, a free code coverage library for Java. Mutation coverage measured with PIT, a mutation testing tool developed by Henry Coles.
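Both metrics reduce to simple ratios. The sketch below spells them out; the function names are hypothetical, the counts in the usage lines are made up for illustration, and nothing here calls JaCoCo or PIT.

```python
def branch_coverage(covered_branches, total_branches):
    """Fraction of branches executed at least once by the test set."""
    if total_branches <= 0:
        raise ValueError("total_branches must be positive")
    return covered_branches / total_branches


def mutation_coverage(killed_mutants, total_mutants):
    """Fraction of generated mutants detected (killed) by the test set."""
    if total_mutants <= 0:
        raise ValueError("total_mutants must be positive")
    return killed_mutants / total_mutants


# Made-up counts, for illustration only.
print(f"branch coverage:   {branch_coverage(30, 40):.0%}")
print(f"mutation coverage: {mutation_coverage(45, 60):.0%}")
```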

Impact of Datasets

Impact of Datasets Finding: Larger datasets do not necessarily achieve higher branch coverage. In some cases, smaller datasets can achieve higher branch coverage than larger datasets. Implication: The size of a dataset is not a dominating factor in determining the test effectiveness of a dataset. Other characteristics must be considered, e.g., the dataset structure and the relationships between different data instances. It is possible to create small datasets that are effective for testing data mining algorithms.

Branch Coverage of T-way Testing

Branch Coverage of T-way Testing Finding: Branch coverage increases at a progressively slower rate as test strength increases. The coverage increase stops at a relatively low test strength. Implication: Under CT, data mining algorithms behave similarly to general software applications. CT has the potential to be effective for testing data mining algorithms.
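One way to read this finding off a coverage table is to compute the gain per strength step and the strength at which coverage stops increasing. The sketch below does that; the function names are hypothetical and the coverage values are placeholders, not results from the study.

```python
def coverage_gains(coverage_by_strength):
    """Coverage increase from each test strength to the next one."""
    strengths = sorted(coverage_by_strength)
    return [
        round(coverage_by_strength[b] - coverage_by_strength[a], 4)
        for a, b in zip(strengths, strengths[1:])
    ]


def saturation_strength(coverage_by_strength, tolerance=0.0):
    """Smallest strength after which coverage rises by at most `tolerance`."""
    if not coverage_by_strength:
        raise ValueError("no coverage data")
    strengths = sorted(coverage_by_strength)
    for i, t in enumerate(strengths):
        later = [coverage_by_strength[s] for s in strengths[i + 1:]]
        if all(c - coverage_by_strength[t] <= tolerance for c in later):
            return t


# Placeholder values (strength -> branch coverage), NOT measured data.
example = {1: 0.52, 2: 0.61, 3: 0.64, 4: 0.64, 5: 0.64, 6: 0.64}

print("gains per step:", coverage_gains(example))
print("coverage stops increasing at strength", saturation_strength(example))
```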

Mutation Coverage of T-way Testing

Branch Coverage of T-way Testing

Mutation Coverage of T-way Testing

Mutation Coverage of T-way Testing Finding: Higher branch coverage seems to imply higher mutation coverage, and vice versa. Implication: Branch coverage could be used as a good indicator of fault detection effectiveness for data mining algorithms, since mutation coverage is expensive to measure.
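A relationship like this is commonly quantified with a correlation coefficient. The sketch below computes Pearson's r by hand; the slides do not say which statistic the authors used, so this choice is an assumption, and the paired values are placeholders rather than measured results.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


# Placeholder pairs, one per test set: NOT data from the study.
branch = [0.40, 0.55, 0.60, 0.72]
mutation = [0.30, 0.41, 0.50, 0.58]

r = pearson(branch, mutation)
print(f"Pearson r between branch and mutation coverage: {r:.3f}")
```

A value of r close to 1 would support using branch coverage as a cheaper stand-in for mutation coverage; a value near 0 would not.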

Conclusion Larger datasets do not necessarily achieve higher test coverage than smaller datasets. The test coverage of a CT test set increases at a progressively slower rate as test strength increases. Branch coverage correlates well with mutation coverage. The experiment gives us an initial understanding of the effectiveness of CT on data mining algorithms. In particular, the results of our experiment indicate that data mining algorithms behave in a way that is similar to general software. This suggests that CT has the potential to be effectively applied to data mining algorithms.

Future Work Detailed code analysis: why are some branches not covered by our test cases? Apply CT to create or reduce datasets for data mining algorithms. Further investigation and experiments on negative testing of data mining algorithms.