
1 An Approach to Software Testing of Machine Learning Applications
Chris Murphy, Gail Kaiser, Marta Arias
Columbia University

2 Introduction
We are investigating the quality assurance of Machine Learning (ML) applications
Currently we are concerned with a real-world application for potential future use in predicting electrical device failures
Machine Learning applications fall into a class for which it can be said that there is “no reliable oracle”
– These are also known as “non-testable programs” and could fall into Davis and Weyuker’s class of “programs which were written in order to determine the answer in the first place. There would be no need to write such programs, if the correct answer were known.”

3 Introduction
We have developed an approach to creating test cases for Machine Learning applications:
– Analyze the problem domain and real-world data sets
– Analyze the algorithm as it is defined
– Analyze an implementation’s runtime options
Our approach was designed for MartiRank and then generalized to other ranking algorithms such as Support Vector Machines (SVM)

4 Overview
Machine Learning Background
Testing Approach and Framework
Findings and Results
Evaluation and Observations
Future Work

5 Machine Learning Fundamentals
Data sets consist of a number of examples, each of which has attributes and a label
In the first phase (“training”), a model is generated that attempts to generalize how the attributes relate to the label
In the second phase, the model is applied to a previously-unseen data set (“testing” data) with unknown labels to produce a classification (or, in our case, a ranking)
– This can be used for validation or for prediction
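To make the two phases concrete, the toy sketch below builds a trivial per-attribute model from labeled training examples and then applies it to unseen examples to produce a ranking. It only illustrates the workflow described on this slide; the weighting scheme, data layout, and function names are assumptions, not MartiRank or SVM.

```python
# Toy illustration of the two ML phases: "training" builds a model from labeled
# examples; "testing" applies it to unseen examples to produce a ranking.
# This is NOT MartiRank or SVM -- just a minimal stand-in for the workflow.

def train(examples, labels):
    """Phase 1: learn how attributes relate to the label. The 'model' here is a
    per-attribute weight (average of attribute * label), an illustrative choice."""
    n_attrs = len(examples[0])
    weights = [0.0] * n_attrs
    for attrs, label in zip(examples, labels):
        for j, value in enumerate(attrs):
            weights[j] += value * label
    return [w / len(examples) for w in weights]

def apply_model(model, unseen_examples):
    """Phase 2: apply the model to previously-unseen data; return example indices
    ordered best-first (higher score = ranked closer to the top)."""
    scores = [sum(w * v for w, v in zip(model, attrs)) for attrs in unseen_examples]
    return sorted(range(len(unseen_examples)), key=lambda i: scores[i], reverse=True)

# Example: two attributes, binary labels (1 = device failed, 0 = did not).
training_data = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7], [0.1, 0.9]]
training_labels = [1, 1, 0, 0]
model = train(training_data, training_labels)
print(apply_model(model, [[0.5, 0.4], [0.3, 0.8], [0.95, 0.05]]))  # indices, best first
```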

6 MartiRank and SVM
MartiRank was specifically designed for the device failure application
– Seeks to find the combination of segmenting and sorting the data that produces the best result
SVM is typically a classification algorithm
– Seeks to find a hyperplane that separates examples from different classes
– Different “kernels” use different approaches
– SVM-Light has a ranking mode based on the distance from the hyperplane
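The sketch below shows how a ranking can be derived from an SVM-style separating hyperplane by sorting examples on their signed distance from it, which is the idea behind SVM-Light's ranking mode. The hard-coded hyperplane (w, b) is an assumption for illustration; a real SVM would learn it from training data.

```python
import math

def rank_by_hyperplane_distance(examples, w, b):
    """Rank examples by signed distance from the hyperplane w.x + b = 0
    (larger distance = more confidently on the positive / 'failure' side)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    distances = [(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm for x in examples]
    return sorted(range(len(examples)), key=lambda i: distances[i], reverse=True)

w, b = [1.0, -1.0], 0.0                         # assumed hyperplane for the sketch
examples = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]]
print(rank_by_hyperplane_distance(examples, w, b))   # -> [0, 2, 1]
```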

7 Related Work
There has been much research into applying Machine Learning techniques to software testing, but not the other way around
Reusable real-world data sets and Machine Learning frameworks are available for checking how well a Machine Learning algorithm predicts, but not for testing its correctness

8 Analyzing the Problem Domain
Consider properties of the real-world data sets
– Data set size: number of attributes and examples
– Range of values: attributes and labels
– Precision of floating-point numbers
– Categorical data: how alphanumeric attributes are handled
Also consider repeating or missing data values
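A small profiling pass can collect these properties from a real-world data set before designing tests. The sketch below is hypothetical: the CSV layout (attribute columns followed by a label column) and the choice to treat non-numeric columns as categorical are assumptions, not part of the authors' framework.

```python
import csv

def profile(path):
    """Scan a data set and report size, value ranges, categorical columns,
    and counts of missing and repeated attribute values."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    n_examples, n_attrs = len(rows), len(rows[0]) - 1
    missing = sum(1 for row in rows for v in row[:-1] if v == "")
    categorical, ranges = [], {}
    for j in range(n_attrs):
        values = [row[j] for row in rows if row[j] != ""]
        if not values:
            continue                                  # column is entirely missing
        try:
            nums = [float(v) for v in values]
            ranges[j] = (min(nums), max(nums))
        except ValueError:
            categorical.append(j)                     # alphanumeric attribute
    distinct_cells = {(j, row[j]) for row in rows for j in range(n_attrs) if row[j] != ""}
    repeated = n_examples * n_attrs - missing - len(distinct_cells)
    return {"examples": n_examples, "attributes": n_attrs, "missing_values": missing,
            "categorical_columns": categorical, "value_ranges": ranges,
            "repeated_values": repeated}
```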

9 Analyzing the Algorithm
Look for imprecisions in the specification, not necessarily bugs in the implementation
– How to handle missing attribute values
– How to handle negative labels
Consider how to construct a data set that could cause a “predictable” ranking
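One way to obtain a "predictable" ranking is to generate a data set in which a single attribute is perfectly correlated with the label, so that any reasonable ranker should order the examples by that attribute. The sketch below is one such construction; the 0/1 failure labels and row layout are assumptions, not the authors' exact generator.

```python
import random

def predictable_dataset(n_examples=100, n_attrs=5, failure_rate=0.2, seed=0):
    """Build a data set with a known correct ranking: attribute 0 is monotonically
    tied to the label, the remaining attributes are noise."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_examples):
        label = 1 if i < int(n_examples * failure_rate) else 0
        signal = n_examples - i                       # attribute 0 encodes the expected rank
        noise = [rng.uniform(0, 1) for _ in range(n_attrs - 1)]
        rows.append(([signal] + noise, label))
    rng.shuffle(rows)                                 # hide the ordering from the learner
    expected_ranking = sorted(range(len(rows)),
                              key=lambda i: rows[i][0][0], reverse=True)
    return rows, expected_ranking
```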

10 Analyzing the Runtime Options
Determine how the implementation may manipulate the input data
– Permuting the input order
– Reading the input in “chunks”
Consider configuration parameters
– For example, we disabled anything probabilistic
Need to ensure that results are deterministic and repeatable
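Two of the checks implied here can be expressed directly, assuming a learner exposed as a hypothetical train(examples, labels) function whose returned models can be compared for equality: the same input must always give the same model, and permuting the input order should not change the model unless the implementation reads the data in order-sensitive "chunks".

```python
import random

def check_deterministic(train, examples, labels):
    """Same input twice must produce the same model."""
    return train(examples, labels) == train(examples, labels)

def check_permutation_invariant(train, examples, labels, seed=42):
    """Permuting the input order should not change the model (unless the
    implementation processes the data in order-sensitive chunks)."""
    paired = list(zip(examples, labels))
    random.Random(seed).shuffle(paired)
    shuffled_x, shuffled_y = zip(*paired)
    return train(examples, labels) == train(list(shuffled_x), list(shuffled_y))
```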

11 The Testing Framework
Data set generator: number of examples, number of attributes, % failures, % missing values, any categorical data, repeat/no-repeat modes
Model comparison: specific to MartiRank
Ranking comparison: includes metrics such as normalized equivalence and AUC
Tracing options: for generating and comparing the output of debugging statements
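The sketch below mirrors the generator's parameters as listed on this slide, plus an AUC-style metric for ranking comparison. The parameter names, the 0/1 failure labels, and the row layout (attributes followed by the label) are assumptions about the real framework, not its actual interface.

```python
import random
import string

def generate_dataset(n_examples, n_attrs, pct_failures, pct_missing,
                     categorical_attrs=0, repeat_values=True, seed=0):
    """Generate rows of [attr_1, ..., attr_n, label] with the requested
    fraction of failures (label 1) and missing values."""
    rng = random.Random(seed)
    pool = [round(rng.uniform(0, 100), 2) for _ in range(20)]     # small pool forces repeats
    rows = []
    for i in range(n_examples):
        label = 1 if rng.random() < pct_failures else 0
        attrs = []
        for j in range(n_attrs):
            if rng.random() < pct_missing:
                attrs.append("")                                  # missing value
            elif j < categorical_attrs:
                attrs.append(rng.choice(string.ascii_uppercase))  # categorical value
            elif repeat_values:
                attrs.append(rng.choice(pool))
            else:
                attrs.append(i * n_attrs + j + rng.random())      # effectively unique
        rows.append(attrs + [label])
    return rows

def auc_of_ranking(ranked_labels):
    """Fraction of (failure, non-failure) pairs ordered correctly, top to bottom;
    ties in the ranking are ignored in this simplified version."""
    n_pos = sum(1 for l in ranked_labels if l == 1)
    n_neg = len(ranked_labels) - n_pos
    pos_seen = correct = 0
    for label in ranked_labels:
        if label == 1:
            pos_seen += 1
        else:
            correct += pos_seen
    return correct / (n_pos * n_neg) if n_pos and n_neg else float("nan")
```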

12 Equivalence Classes
Data sizes of different orders of magnitude
Repeating vs. non-repeating attribute values
Missing vs. no missing attribute values
Categorical vs. non-categorical data
0/1 labels vs. non-negative integer labels
Predictable vs. non-predictable data sets
We used the data set generator to parameterize the test case selection criteria
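Parameterizing test-case selection over these classes amounts to enumerating their cross product and generating one data set per combination, for example with the generate_dataset() sketch above (an assumed interface). The specific class values below are illustrative.

```python
import itertools

sizes       = [100, 1_000, 10_000]   # data sizes of different orders of magnitude
repeating   = [True, False]          # repeating vs. non-repeating attribute values
missing     = [0.0, 0.1]             # no missing vs. some missing values
categorical = [0, 2]                 # none vs. a couple of categorical attributes
label_kinds = ["0/1", "non-negative"]
predictable = [True, False]

test_cases = list(itertools.product(sizes, repeating, missing,
                                    categorical, label_kinds, predictable))
print(len(test_cases), "combinations")          # 3 * 2 * 2 * 2 * 2 * 2 = 96
for n, rep, miss, cat, labels, pred in test_cases:
    pass  # generate a data set for this combination and feed it to each implementation
```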

13 Testing MartiRank
Produced a core dump on data sets with a large number of attributes (over 200)
The implementation does not correctly handle negative labels
It does not use a “stable” sorting algorithm
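The stability issue matters because real-world attributes contain repeated values: a stable sort keeps tied examples in their original relative order, so repeated runs and independent implementations agree, while an unstable sort may not. The helper below is a hypothetical check of that property on a ranking produced by some implementation; it is not part of the authors' framework.

```python
def ties_keep_original_order(values, ranking):
    """values[i] is the sort key of example i; ranking is the list of example
    indices output by the implementation. Returns True if, whenever two examples
    have equal keys, the earlier-indexed one appears first in the ranking."""
    for pos_a, a in enumerate(ranking):
        for b in ranking[pos_a + 1:]:
            if values[a] == values[b] and a > b:
                return False
    return True

values = [5.0, 3.0, 5.0, 3.0]
print(ties_keep_original_order(values, [0, 2, 1, 3]))  # True  (stable behaviour)
print(ties_keep_original_order(values, [2, 0, 3, 1]))  # False (ties were reordered)
```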

14 Regression Testing of MartiRank
Creating a suite of testing data allowed us to use it for regression testing
Discovered that refactoring had introduced a bug into an important calculation
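A regression harness over such a suite can be as simple as comparing each new build's rankings against saved known-good output. The sketch below assumes a hypothetical command-line interface (data set path in, one ranked example index per line out) and a JSON file of golden rankings; neither is the authors' actual tooling.

```python
import json
import subprocess

def regression_check(command, dataset_paths, golden_path):
    """Run the current build on each data set and report the data sets whose
    ranking differs from the saved known-good ('golden') ranking."""
    with open(golden_path) as f:
        golden = json.load(f)                    # {dataset path: [ranked indices]}
    failures = []
    for path in dataset_paths:
        out = subprocess.run(command + [path], capture_output=True, text=True, check=True)
        if [int(x) for x in out.stdout.split()] != golden[path]:
            failures.append(path)
    return failures
```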

15 Testing Multiple Implementations of MartiRank
We had three implementations developed by three different coders
Can be used as “pseudo-oracles” for each other
Used to discover a bug in the way one implementation was handling missing values
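The pseudo-oracle idea can be expressed as running every implementation on the same data set and flagging any pair whose rankings disagree. The command-line convention below (program reads a data set path and prints one ranked example index per line) and the exact-equality comparison are assumptions for the sketch; the real framework compared rankings with metrics such as normalized equivalence.

```python
import itertools
import subprocess

def get_ranking(command, dataset_path):
    out = subprocess.run(command + [dataset_path], capture_output=True, text=True, check=True)
    return [int(line) for line in out.stdout.split()]

def cross_check(implementations, dataset_path):
    """Use independently written implementations as pseudo-oracles for each other:
    return the pairs of implementations whose rankings disagree."""
    rankings = {name: get_ranking(cmd, dataset_path) for name, cmd in implementations.items()}
    return [(a, b) for (a, ra), (b, rb) in itertools.combinations(rankings.items(), 2)
            if ra != rb]

# Example (hypothetical program names):
# cross_check({"impl_c": ["./martirank_c"],
#              "impl_perl": ["perl", "martirank.pl"],
#              "impl_py": ["python", "martirank.py"]}, "test_data.csv")
```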

16 Applying the Approach to SVM-Light
Permuting the input data led to different models
– Caused by “chunking” the data for use by an approximating variant of the optimization algorithm
Introducing noise into a data set in some cases caused it not to find a “predictable” ranking
Different kernels also produced different results on “predictable” rankings
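The shape of the permutation check looks like the sketch below. SVM-Light itself is a C program; scikit-learn's SVC is used here purely as a stand-in SVM to keep the example self-contained (an assumption, not what the authors tested). Train on the data and on a permuted copy, then compare the two models' scores and rankings on a held-out set.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic data: 5 attributes, binary label driven mostly by attribute 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)
X_test = rng.normal(size=(50, 5))

# Train once on the original order and once on a permuted copy of the same data.
perm = rng.permutation(len(X))
model_a = SVC(kernel="linear").fit(X, y)
model_b = SVC(kernel="linear").fit(X[perm], y[perm])

# If training is order-insensitive, scores and rankings on unseen data should match.
scores_a = model_a.decision_function(X_test)
scores_b = model_b.decision_function(X_test)
print("max score difference:", np.abs(scores_a - scores_b).max())
print("same ranking:", np.array_equal(np.argsort(-scores_a), np.argsort(-scores_b)))
```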

17 Evaluation and Observations
The testing approach revealed bugs and imprecision in the implementations, as well as discrepancies from the stated algorithms
Inspection of the algorithms led to the creation of “predictable” data sets
What is “predictable” for one algorithm may not lead to a “predictable” ranking in another
An algorithm’s failure to address specific data set traits can lead to incorrect results (and/or inconsistent results across implementations)
The approach can be generalized to other Machine Learning ranking algorithms, as well as to classification

18 Limitations and Future Work
Test suite adequacy (coverage) is not addressed
Could also include mutation testing to assess the effectiveness of the data sets
Should investigate creating large data sets that correlate with real-world data
Could also consider non-deterministic Machine Learning algorithms

19 Questions?