Parameterizing Random Test Data According to Equivalence Classes
Chris Murphy, Gail Kaiser, Marta Arias
Columbia University

What is random testing? (This slide is an aside, not part of the talk itself.)
- Random testing is the notion of using "random" input to test the application
- As opposed to using pre-determined, manually selected "equivalence classes" or "partitions"

Introduction
- We are investigating the quality assurance of Machine Learning (ML) applications
- Currently we are concerned with a real-world application for potential future use in predicting electrical device failures
  - Using ranking instead of classification
- Our concern is not whether an algorithm predicts well, but whether an implementation operates correctly

Data Set Options
- Real-world data sets
  - Not always accessible or available
  - May not contain the separation or combination of traits that we want to test
- Hand-generated data
  - Only useful for small tests
- Random testing
  - Limited by the lack of a reliable test oracle
  - The ML applications of interest fall into the category of "non-testable programs"

Motivation
- Without a reliable test oracle, we can only:
  - Look for obvious faults
  - Consider intermediate results
  - Detect discrepancies in the specification
- We need to restrict some properties of random test data generation

Our Solution: Parameterized Random Test Data Generation
- Automatically generate random data sets, but parameterized to control the range and characteristics of those random values
- Parameterization allows us to create a hybrid between equivalence class partitioning and random testing

Overview
- Machine Learning Background
- Data Generation Framework
- Findings and Results
- Evaluation and Observations
- Conclusions and Future Work

Machine Learning Fundamentals
- Data sets consist of a number of examples, each of which has attributes and a label
- In the first phase ("training"), a model is generated that attempts to generalize how attributes relate to the label
- In the second phase ("validation"), the model is applied to a previously-unseen data set with unknown labels to produce a classification (or, in our case, a ranking)

Problems Faced in Testing
- The testing input should be based on the problem domain
- We need a way to mimic all of the traits of the real-world data sets
- We must also keep in mind that we do not have a reliable test oracle

Analyzing the Problem Domain
- Consider properties of data sets in general:
  - Data set size: number of attributes and examples
  - Range of values: attributes and labels
  - Precision of floating-point numbers
  - Whether values can repeat
- Consider properties of real-world data sets in the domain of interest:
  - How alphanumeric attributes are to be interpreted
  - Whether data values might be missing

Equivalence Classes
- Data sizes of different orders of magnitude
- Repeating vs. non-repeating attribute values
- Missing vs. no missing attribute values
- Categorical vs. non-categorical data
- 0/1 labels vs. non-negative integer labels
- Predictable vs. non-predictable data sets
- We used the data set generator to parameterize the test case selection criteria along these dimensions (a sketch of enumerating the combinations follows below)
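To make the test-selection idea concrete, here is a minimal sketch in Python of enumerating parameter combinations across these classes; the dimension names and the specific choices per class are illustrative assumptions, not taken from the authors' tool:

    from itertools import product

    # One axis per equivalence class listed above; concrete choices are made up for illustration.
    sizes       = [100, 10_000, 1_000_000]    # data sizes of different orders of magnitude
    repeating   = [True, False]               # repeating vs. non-repeating attribute values
    missing_pct = [0, 20]                     # no missing vs. missing attribute values
    categorical = [True, False]               # categorical vs. non-categorical data
    label_kind  = ["zero_one", "nonneg_int"]  # 0/1 labels vs. non-negative integer labels
    predictable = [True, False]               # predictable vs. non-predictable data sets

    # Each combination becomes one parameterization of the data set generator.
    test_configs = list(product(sizes, repeating, missing_pct, categorical, label_kind, predictable))
    print(len(test_configs))  # 3 * 2 * 2 * 2 * 2 * 2 = 96 candidate test configurations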

How Data Are Generated
- M attributes and N examples
- No-repeat mode:
  - Generate a list of the integers from 1 to M*N and then randomly permute them
- Repeat mode:
  - Each value in the data set is simply a random integer between 1 and M*N
  - The tool ensures at least one set of repeating numbers (both modes are sketched below)
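A minimal sketch of the two generation modes in Python/NumPy; the function name and signature are illustrative, not the authors' actual tool:

    import numpy as np

    def generate_values(num_examples, num_attributes, repeat_mode, rng=None):
        """Generate an N x M matrix of attribute values using the two modes above."""
        rng = rng or np.random.default_rng()
        total = num_examples * num_attributes        # M*N values in all
        if repeat_mode:
            # Repeat mode: every cell is an independent random integer in 1..M*N, so
            # duplicates are likely (the real tool also guarantees at least one repeat).
            values = rng.integers(1, total + 1, size=total)
        else:
            # No-repeat mode: a random permutation of 1..M*N, so every value is distinct.
            values = rng.permutation(np.arange(1, total + 1))
        return values.reshape(num_examples, num_attributes)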

Generating Labels
- Specify the percentage of "positive examples" to include in the data set
  - Positive examples have a label of 1
  - Negative examples have a label of 0
- The data generation framework guarantees that the number of positive examples exactly matches the specified percentage, even though the labels are randomly placed throughout the data set
- Labels are never unknown/missing (see the sketch below)
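A sketch of label generation under these rules (a hypothetical helper, not the authors' code):

    import numpy as np

    def generate_labels(num_examples, pct_positive, rng=None):
        """Exactly the requested share of 1-labels, randomly placed, never missing."""
        rng = rng or np.random.default_rng()
        num_positive = round(num_examples * pct_positive / 100)
        labels = np.zeros(num_examples, dtype=int)
        labels[:num_positive] = 1   # exact count of positive examples
        rng.shuffle(labels)         # random placement throughout the data set
        return labels

For example, generate_labels(10, 40) always yields exactly four 1-labels among the ten examples, matching the 40% setting used in the sample data sets shown later.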

Categorical Data
- For some alphanumeric attributes, data pre-processing is used to expand K distinct values into K attributes
  - Same as in the real-world ranking application
- The input parameter to the data generation tool has the format (a1, a2, ..., aK-1, aK, m)
  - a1 through aK give the percentage distribution of those values for the categorical attribute
  - m is the percentage of unknown values (a sketch of consuming this parameter follows)
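A sketch of how such a parameter might be consumed, assuming that a1..aK sum to 100 and that missing values are drawn independently at rate m; the helper function, and representing "?" as NaN in a one-hot matrix, are illustrative assumptions:

    import numpy as np

    def generate_categorical(num_examples, dist_pct, missing_pct, rng=None):
        """Expand one categorical attribute with K values into K one-hot columns."""
        rng = rng or np.random.default_rng()
        probs = np.array(dist_pct, dtype=float)
        probs = probs / probs.sum()                                   # a1..aK as a probability distribution
        choices = rng.choice(len(probs), size=num_examples, p=probs)  # one category per example
        onehot = np.eye(len(probs))[choices]                          # K distinct values -> K attributes
        is_missing = rng.random(num_examples) < missing_pct / 100.0
        onehot[is_missing] = np.nan                                   # the whole categorical value is unknown
        return onehot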

Data Set Generator - Parameters
- Number of examples
- Number of attributes
- Percentage of positive examples (label = 1)
- Percentage of missing values
- Any categorical data
- Repeat / no-repeat mode

Sample Data Sets
10 examples, 10 attributes, 40% positive examples, 20% missing, repeats allowed

Data set 1:
27,81,88,59, ?,16,88, ?,41, ?,0
15,70,91,41, ?, 3, ?, ?, ?,64,0
82, ?,51,47, ?, 4, 1,99, ?,51,0
22,72,11, ?,96,24,44,92, ?,11,1
57,77, ?,86,89,77,61,76,96,98,1
76,11, 4,51,43, ?,79,21,28, ?,0
 6,33, ?, ?,52,63,94,75, 8,26,0
77,36,91, ?,47, 3,85,71,35,45,1
 ?,17,15, 2,90,70, ?, 7,41,42,0
 8,58,42,41,74,87,68,68, 1,15,1

Data set 2:
35, 3,20,41,91, ?,32,11,43, ?,1
19,50,11,57,36,94, ?,96, 7,23,1
24,36,36,79,78,33,34, ?,32, ?,0
 ?,15, ?,19,65,80,17,78,43, ?,0
40,31,89,50,83,55,25, ?, ?,45,1
52, ?, ?, ?, ?,39,79,82,94, ?,0
86,45, ?, ?,74,68,13,66,42,56,0
 ?,53,91,23,11, ?,47,61,79, 8,0
77,11,34,44,92, ?,63,62,51,51,1
21, 1,70,14,16,40,63,94,69,83,0

The Testing Framework
- Data set generator
- Model comparison
- Ranking comparison: includes metrics such as normalized equivalence and AUC (an AUC comparison is sketched below)
- Tracing options: for generating and comparing the output of debugging statements
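As an illustration of the ranking-comparison step, here is a minimal sketch that compares the AUCs two implementations achieve on the same generated data set. The arrays are toy values, and the normalized-equivalence metric is not spelled out in these slides, so it is omitted:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    # Known 0/1 labels of a generated data set, and the ranking scores produced by
    # two implementations under test (toy values for illustration).
    labels   = np.array([1, 0, 1, 0, 0, 1, 0, 1])
    scores_a = np.array([0.9, 0.2, 0.8, 0.3, 0.1, 0.7, 0.4, 0.6])
    scores_b = np.array([0.9, 0.2, 0.8, 0.3, 0.1, 0.4, 0.7, 0.6])

    auc_a = roc_auc_score(labels, scores_a)
    auc_b = roc_auc_score(labels, scores_b)
    print(auc_a, auc_b)  # a gap between the AUCs flags a discrepancy worth investigating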

MartiRank and SVM
- MartiRank was specifically designed for the real-world device failure application
  - Seeks to find the sequence of attributes by which to segment and sort the data to produce the best result
- SVM is typically a classification algorithm
  - Seeks to find a hyperplane that separates examples from different classes
  - SVM-Light has a ranking mode based on the distance from the hyperplane

Findings
- The testing approach and framework were developed for MartiRank and then applied to SVM
- Only the findings most related to parameterized random testing are presented here
  - More details and case studies about the testing of MartiRank can be found in our tech report

Issue #1: Repeating Values
One version of MartiRank did not use "stable" sorting. Consider two examples that tie on the attribute being sorted (both hold the value 3 in the fourth attribute):

91,41,19, 3,57,11,20,64,0
36,73,47, 3,85,71,35,45,1

A stable sort preserves the original relative order of the tied examples; the unstable sort could order them either way, so the same data set produced different orderings (labeled "stable" vs. "unstable" on the slide) and hence different models.
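A minimal NumPy sketch of the difference; the two rows are from the slide, and which order a non-stable sort yields for the tied rows is implementation-dependent:

    import numpy as np

    rows = np.array([
        [91, 41, 19, 3, 57, 11, 20, 64, 0],
        [36, 73, 47, 3, 85, 71, 35, 45, 1],
    ])
    col = 3  # both rows hold the value 3 in this attribute, so they tie

    stable   = rows[np.argsort(rows[:, col], kind="stable")]     # ties keep their original order
    unstable = rows[np.argsort(rows[:, col], kind="quicksort")]  # no stability guarantee for ties
    # If the two orderings differ, MartiRank's later segment-and-sort steps see different
    # inputs and can build different models from identical data.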

Issue #2: Sparse Data Sets
How to sort on an attribute that contains missing values ("?") is not specifically addressed in the specification. Original data set (the fifth attribute is the one being sorted):

41,91, ?,32,11,43, ?,1
57,36,94, ?,96, 7,23,1
79,78,33,34, ?,31, ?,0
19,65,80,17,78,46, ?,0
50,83,55,25, ?, ?,45,1
 ?, ?,39,79,82,94, ?,0

Sort "around" missing values (rows whose sorted attribute is missing keep their positions):

41,91, ?,32,11,43, ?,1
19,65,80,17,78,46, ?,0
79,78,33,34, ?,31, ?,0
 ?, ?,39,79,82,94, ?,0
50,83,55,25, ?, ?,45,1
57,36,94, ?,96, 7,23,1

Put missing values at the end:

41,91, ?,32,11,43, ?,1
19,65,80,17,78,46, ?,0
 ?, ?,39,79,82,94, ?,0
57,36,94, ?,96, 7,23,1
79,78,33,34, ?,31, ?,0
50,83,55,25, ?, ?,45,1

Randomly insert missing values:

41,91, ?,32,11,43, ?,1
50,83,55,25, ?, ?,45,1
19,65,80,17,78,46, ?,0
79,78,33,34, ?,31, ?,0
 ?, ?,39,79,82,94, ?,0
57,36,94, ?,96, 7,23,1
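A minimal sketch of the three treatments, representing "?" as NaN in a float matrix; these are illustrative helpers, not the implementations under test:

    import numpy as np

    def sort_missing_at_end(rows, col):
        """Sort by one attribute; NaN compares as largest, so missing rows move to the end."""
        return rows[np.argsort(rows[:, col], kind="stable")]

    def sort_around_missing(rows, col):
        """Rows with a missing value keep their positions; the rest are sorted into the gaps."""
        out = rows.copy()
        present = ~np.isnan(rows[:, col])
        out[present] = rows[present][np.argsort(rows[present, col], kind="stable")]
        return out

    def sort_random_missing(rows, col, rng=None):
        """Sort the non-missing rows, then re-insert each missing row at a random position."""
        rng = rng or np.random.default_rng()
        present = ~np.isnan(rows[:, col])
        ordered = list(rows[present][np.argsort(rows[present, col], kind="stable")])
        for missing_row in rows[~present]:
            ordered.insert(rng.integers(0, len(ordered) + 1), missing_row)
        return np.array(ordered)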

Issue #3: Categorical Data
- Discovered that refactoring had introduced a bug into an important calculation
  - A global variable was being used incorrectly
- This bug did not appear in any of the tests that had only repeating values or only missing values
- However, categorical data necessarily has repeating values and may also have missing values, so that combination exposed the bug

Issue #4: Permuted Input Data
- Randomly permuting the input data led to different models (and thus different rankings) generated by SVM-Light
- Caused by "chunking" the data for use by an approximating variant of the optimization algorithm

Observations
- Parameterized random testing allowed us to isolate the traits of the data sets
- These traits may appear in real-world data, but not necessarily in the desired combinations
- An algorithm's failure to address specific data set traits can lead to discrepancies

Related Work – Machine Learning
- There has been much research into applying Machine Learning techniques to software testing, but not the other way around
- Reusable real-world data sets and Machine Learning frameworks are available for checking how well a Machine Learning algorithm predicts, but not for testing the correctness of its implementation

Related Work – Random Testing
- In random testing, "parameterization" generally refers to specifying the data type or range of values
- Our work differs from the "structural statistical testing" of Thévenod-Fosse et al. ['91], which focuses on path selection and coverage testing, not system testing
- It also differs from "uniform statistical testing": although we do select random data over a uniform distribution, we parameterize it according to equivalence classes

Limitations and Future Work
- Test suite adequacy (coverage) is not addressed or measured
- Could also consider non-deterministic Machine Learning algorithms
- Could include mutation testing to assess the effectiveness of the generated data sets
- Should investigate creating large data sets that correlate to real-world data

Conclusion
- Our contribution is an approach that combines parameterization and randomness to control the properties of very large data sets
- This is critical for limiting the scope of individual tests and for pinpointing specific issues related to the traits of the input data

Parameterizing Random Test Data According to Equivalence Classes
Chris Murphy, Gail Kaiser, Marta Arias
Columbia University