Metamorphic Exploration of an Unsupervised Clustering Program


Metamorphic Exploration of an Unsupervised Clustering Program Sen Yang, Dave Towey School of Computer Science, University of Nottingham Ningbo China, Zhejiang, People’s Republic of China dave.towey@nottingham.edu.cn Zhi Quan Zhou Institute of Cybersecurity and Cryptology, University of Wollongong, Wollongong, NSW 2522, Australia

Acknowledgements
UNNC FoSE, UNNC IDIC, … and others
Thank you to Sen for drafting the PPTs

UNNC China UK Malaysia

UNNC University of Nottingham Ningbo China First Sino-foreign University Established in 2004 English Medium of Instruction (EMI) About 8,000 students ~10% international More than 750 staff (academic and professional) From more than 70 countries and regions around the world “An innovation, and centre for innovation”

Background I
The oracle problem makes it difficult to test Machine Learning (ML) software.
Because of ML's popularity, there are also many (potential) ML users who are not experts.
Most related research addresses supervised ML.

Background II
Evaluating the quality of clustering algorithms can be challenging:
No labels are available for validation.
Users' subjective expectations matter a lot.
Metamorphic Testing (MT) can alleviate the oracle problem.

Background III
ML has been applied in numerous industrial domains, including medicine, economics, automated driving, and decision making.
ML algorithms fall into supervised ML (e.g., classification problems) and unsupervised ML (e.g., clustering problems).

Metamorphic Testing
The idea: it is possible to identify relationships amongst multiple inputs and outputs of the software under test (SUT), even if we do not know the correctness of any individual output.
Metamorphic Relations (MRs) are necessary properties of the SUT.
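As a classic illustration (not from the slides), consider an MR for a sine routine: sin(x) = sin(π − x). We cannot easily judge whether math.sin(2.5) on its own is correct (the oracle problem), but the relation between the two outputs can still be checked. A minimal sketch in Python:

```python
import math

def check_sine_mr(xs, tol=1e-12):
    """Check the metamorphic relation sin(x) == sin(pi - x) for each x.

    We cannot easily verify that math.sin(x) is correct for an arbitrary
    x, but the relation between the two outputs must hold; any x for
    which it fails is returned as a violation.
    """
    violations = []
    for x in xs:
        if abs(math.sin(x) - math.sin(math.pi - x)) > tol:
            violations.append(x)
    return violations

# No violations expected for a correct sine implementation.
print(check_sine_mr([0.1, 1.0, 2.5, -3.0]))  # → []
```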

Metamorphic Exploration
Recently, MRs have been applied to enhance system understanding and use.
These "MRs" need not be necessary properties for software correctness.
They can be hypothesized by the users.
We call them "Hypothesized MRs" (HMRs).

Our Study
A case study of metamorphic exploration using a clustering program:
Weka, one of the most popular data science platforms for ML and data mining.
The K-means clustering algorithm.

K-means Clustering Algorithm
Attempts to (iteratively) partition a dataset into K distinct, non-overlapping subgroups (clusters), where each data point belongs to one and only one group.
Let X = {x_i}, i = 1, …, n, be the set of n d-dimensional points to be clustered into a set of K clusters, C = {c_k}, k = 1, …, K, and let μ_k be the mean of cluster c_k. K-means aims to minimize the following cost function (the sum of squared errors):
J(C) = Σ_{k=1}^{K} Σ_{x_i ∈ c_k} ‖x_i − μ_k‖²
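The iterative procedure can be sketched in a few lines of Python (an illustrative toy implementation, not Weka's SimpleKMeans; the function name and structure are our own):

```python
import random

def kmeans(points, k, seed=10, iters=100):
    """A minimal k-means sketch.

    points: list of coordinate tuples.
    Returns (assignments, centroids, sse), where sse is the sum of
    squared errors that k-means tries to minimize.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    assignments = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k),
                key=lambda j: sum((p - c) ** 2
                                  for p, c in zip(pt, centroids[j])))
            for pt in points
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: move each centroid to the mean of its cluster.
        for j in range(k):
            members = [pt for pt, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    # The cost function J(C): sum of squared distances to assigned centroids.
    sse = sum(sum((p - c) ** 2 for p, c in zip(pt, centroids[a]))
              for pt, a in zip(points, assignments))
    return assignments, centroids, sse
```

With two well-separated groups any initialization converges to the natural partition; the interesting behaviour (and the HMR violations discussed later) arises because less separated data can converge to different local optima.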

K-means Clustering Algorithm Fig. 1. Basic steps of K-means algorithm (k = 2). (a) Original dataset. (b) Random initial cluster centroids. (c-f) Illustration of running two iterations.

Hypothesised MRs
HMR1: Translation of 2D points along a line parallel to the x- or y-axis should not have an impact on the clustering results.
HMR2: Adding a duplicate point should not have an impact on the clustering results.
HMR3: Moving an existing point towards the cluster center should not have an impact on the clustering results.
HMR4: Adding a dimension in which all points have an equal value should not have an impact on the clustering results.
HMR5: Swapping the x- and y-coordinates of each and every point (which is essentially conducting a geometric transformation on the existing points) should not have an impact on the clustering results.
HMR6: Using the same set of points while changing their input order to the SUT should not have an impact on the clustering results.
HMR7: In 2D, flipping the data points along one (x- or y-) axis should not have an impact on the clustering results.
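In general, checking an HMR means running the SUT on the source input and on the transformed follow-up input, then comparing the two clusterings up to a relabeling of the cluster IDs (k-means may number the same clusters differently across runs). A minimal sketch, using HMR5 as the example; `cluster_fn` here is a hypothetical stand-in for the clustering program under test:

```python
def canonical_partition(labels):
    """Map cluster labels to a label-renaming-invariant form:
    a set of sets of point indices."""
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, set()).add(idx)
    return frozenset(frozenset(g) for g in groups.values())

def check_hmr5(cluster_fn, points):
    """HMR5: swapping x and y of every point should not change the clustering.

    cluster_fn is any clustering routine returning one label per point.
    """
    source = cluster_fn(points)
    follow_up = cluster_fn([(y, x) for (x, y) in points])
    return canonical_partition(source) == canonical_partition(follow_up)

# A toy, swap-symmetric "clusterer": threshold on x + y.
cluster_fn = lambda pts: [0 if x + y < 5 else 1 for (x, y) in pts]
print(check_hmr5(cluster_fn, [(0, 0), (1, 2), (4, 4), (6, 1)]))  # → True
```

The `canonical_partition` step is what lets the check tolerate cluster renumbering while still detecting genuine changes in group membership.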

Set-up
The approach was applied to the implementation of the K-means clustering algorithm in Weka 3.8.3.
Initial (source) test case: the Iris.2D.arff dataset.
Default settings, with K = 5 and seed S = 10.
The follow-up test cases were constructed manually.

(Preliminary) Results

Violation of HMR2
Fig. 2. Clustering results for the source test data.
Fig. 3. Clustering results after applying HMR2 (the follow-up test case).

Violation of HMR6
Fig. 2. Clustering results for the source test data.
Fig. 4. Clustering results after applying HMR6 (the follow-up test case).

Analysis & Discussion
Adding a new data point triggers a new round of Euclidean-distance calculations when the cluster centroids are re-located.
It can also influence the selection of the initial cluster centre points.
Changing the data entry order likewise has an impact on the clustering results.

Analysis & Discussion
This reflects a characteristic of the K-means clustering algorithm itself, and is not a defect in the implementation:
the K-means algorithm is sensitive to the selection of the initial cluster centroids.
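This sensitivity is easy to reproduce (an illustrative sketch of Lloyd's iterations, not Weka's implementation): four points at the corners of a rectangle admit two different locally optimal 2-clusterings, and which one K-means converges to depends entirely on the initial centroids.

```python
def lloyd(points, centroids, iters=50):
    """Run Lloyd's iterations from the given initial centroids;
    return the final label assigned to each point."""
    for _ in range(iters):
        labels = [min(range(len(centroids)),
                      key=lambda j: sum((p - c) ** 2
                                        for p, c in zip(pt, centroids[j])))
                  for pt in points]
        new_centroids = []
        for j in range(len(centroids)):
            members = [pt for pt, l in zip(points, labels) if l == j]
            new_centroids.append(
                tuple(sum(v) / len(members) for v in zip(*members))
                if members else centroids[j])
        if new_centroids == centroids:  # converged (a local optimum)
            break
        centroids = new_centroids
    return labels

# Four points at the corners of a 4x2 rectangle: two different
# initializations converge to two different locally optimal partitions.
pts = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
split_lr = lloyd(pts, [(0.0, 1.0), (4.0, 1.0)])  # left/right split
split_tb = lloyd(pts, [(2.0, 0.0), (2.0, 2.0)])  # top/bottom split
print(split_lr)  # [0, 0, 1, 1]
print(split_tb)  # [0, 1, 0, 1]
```

Both results are fixed points of the algorithm, so neither run is "wrong"; they are simply different local optima of the SSE cost.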

Recommendation
If the user needs to add new data containing some duplicates to the original dataset, then it may be better to choose a hierarchical clustering algorithm, to maintain stable clustering performance.
If we want to reproduce an earlier test result, we should keep the original order in which the data points are entered into the system.

Sen's Conclusion
Compared with our understanding of the SUT before the experiment, we have now gained new knowledge and understanding of the system.
ME can be used to help users explore a system.
ME can be used to guide users towards better use of an algorithm.

Discussion & Future Work
Opportunity to embrace ME as a step towards MT, in undergraduate curricula and elsewhere.
Metamorphic Exploration (Journal First): Thursday 30 May, 15:10, Van-Horne.

Thank you!

Q & A