K-medoid-style Clustering Algorithms for Supervised Summary Generation. Nidal Zeidat & Christoph F. Eick. MLMTA, Las Vegas, 2004.

Presentation transcript:

Zeidat&Eick, MLMTA, Las Vegas K-medoid-style Clustering Algorithms for Supervised Summary Generation Nidal Zeidat & Christoph F. Eick Dept. of Computer Science University of Houston

Talk Outline
1. What is Supervised Clustering?
2. Representative-based Clustering Algorithms
3. Benefits of Supervised Clustering
4. Algorithms for Supervised Clustering
5. Empirical Results
6. Conclusion and Areas of Future Work

(Traditional) Clustering
- Partition a set of objects into groups of similar objects. Each group is called a cluster.
- Clustering is used to "detect classes" in a data set ("unsupervised learning").
- Clustering is based on a fitness function that relies on a distance measure and usually tries to minimize the distance between objects within a cluster.

(Traditional) Clustering (continued)
[Figure: scatter plot over Attribute1 and Attribute2 with groups labeled A, B, and C.]

Supervised Clustering
- Assumes that clustering is applied to classified examples.
- The goal of supervised clustering is to identify class-uniform clusters that have a high probability density; that is, it prefers clusters whose members belong to a single class (low impurity).
- We would also like to keep the number of clusters low.

Supervised Clustering (continued)
[Figure: two scatter plots over Attribute 1 and Attribute 2, one labeled Traditional Clustering and one labeled Supervised Clustering.]

A Fitness Function for Supervised Clustering
q(X) := Impurity(X) + β*Penalty(k)
- k: number of clusters used
- n: number of examples in the dataset
- c: number of classes in the dataset
- β: weight for Penalty(k), 0 < β ≤ 2.0
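The transcript does not preserve the definitions of Impurity(X) and Penalty(k). The sketch below assumes the forms commonly given in the authors' related work on supervised clustering: impurity as the fraction of examples that do not belong to the majority class of their cluster, and Penalty(k) = sqrt((k - c) / n) for k ≥ c and 0 otherwise. Both definitions, and all function names, are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

def impurity(cluster_ids, labels):
    """Assumed definition: fraction of examples outside their cluster's majority class."""
    by_cluster = {}
    for cid, label in zip(cluster_ids, labels):
        by_cluster.setdefault(cid, Counter())[label] += 1
    minority = sum(sum(cnt.values()) - max(cnt.values()) for cnt in by_cluster.values())
    return minority / len(labels)

def penalty(k, c, n):
    """Assumed definition: sqrt((k - c) / n) when k >= c, otherwise 0."""
    return sqrt((k - c) / n) if k >= c else 0.0

def q(cluster_ids, labels, beta=0.1):
    """q(X) = Impurity(X) + beta * Penalty(k), as stated on the slide."""
    n = len(labels)
    k = len(set(cluster_ids))
    c = len(set(labels))
    return impurity(cluster_ids, labels) + beta * penalty(k, c, n)

# Toy clustering: 8 examples, 3 clusters, 2 classes, one misplaced example.
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2]
labels = ["a", "a", "b", "b", "b", "b", "a", "a"]
print(round(q(cluster_ids, labels, beta=0.1), 4))  # 0.1604
```

A larger β makes extra clusters more expensive, so the minimum of q(X) shifts toward fewer, less pure clusters.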

Representative-Based Supervised Clustering (RSC)
- Aims at finding a set of objects (called representatives) among all objects in the data set that best represent the objects in the data set. Each representative corresponds to a cluster.
- The remaining objects in the data set are then clustered around these representatives by assigning each object to the cluster of its closest representative.
- Remark: the popular k-medoid algorithm, also called PAM, is a representative-based clustering algorithm.
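The assignment step described above can be sketched as follows. Euclidean distance and the function name are assumptions for illustration; any dissimilarity measure could be substituted.

```python
import math

def assign_to_representatives(points, rep_indices):
    """For each point, return the index of its closest representative."""
    assignment = []
    for p in points:
        # Euclidean distance is an assumption; swap in another dissimilarity if needed.
        nearest = min(rep_indices, key=lambda r: math.dist(p, points[r]))
        assignment.append(nearest)
    return assignment

points = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (0.1, 0.3)]
clusters = assign_to_representatives(points, rep_indices=[0, 2])
print(clusters)  # [0, 0, 2, 2, 0]
```

Because a clustering is fully determined by its set of representatives, searching over clusterings reduces to searching over subsets of the data set.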

Representative-Based Supervised Clustering (continued)
[Figure: scatter plots over Attribute1 and Attribute2, shown across three slides.]
Objective of RSC: find a subset O_R of O such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X).

Why do we use Representative-Based Clustering Algorithms?
- Representatives themselves are useful:
  – can be used for summarization
  – can be used for dataset compression
- Smaller search space compared with algorithms such as k-means.
- Less sensitive to outliers.
- Can be applied to datasets that contain nominal attributes (for which it is not feasible to compute means).

Applications of Supervised Clustering
- Enhance classification algorithms:
  – Use SC for dataset editing to enhance NN-classifiers [ICDM04]
  – Improve simple classifiers [ICDM03]
- Learn sub-classes / summary generation
- Distance function learning
- Dataset compression/reduction
- Measuring the difficulty of a classification task

Representative-Based Supervised Clustering: Dataset Editing
[Figure: (a) dataset clustered using supervised clustering, with clusters labeled A to F; (b) dataset edited using cluster representatives. Axes: Attribute1, Attribute2.]

Representative-Based Supervised Clustering: Enhance Simple Classifiers
[Figure: scatter plot over Attribute1 and Attribute2.]

Representative-Based Supervised Clustering: Learning Sub-classes
[Figure: scatter plot over Attribute1 and Attribute2 with clusters labeled Ford Trucks, Ford Vans, GMC Trucks, and GMC Van; legend: Ford, GMC.]

Clustering Algorithms Currently Investigated
1. Partitioning Around Medoids (PAM), the traditional algorithm.
2. Supervised Partitioning Around Medoids (SPAM).
3. Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR).
4. Top Down Splitting Algorithm (TDS).
5. Supervised Clustering using Evolutionary Computing (SCEC).

Algorithm SRIDHCR
REPEAT r TIMES
  curr := a randomly created set of representatives (with size between c+1 and 2*c)
  WHILE NOT DONE DO
    1. Create new solutions S by adding a single non-representative to curr and by removing a single representative from curr.
    2. Determine the element s in S for which q(s) is minimal (if there is more than one minimal element, randomly pick one).
    3. IF q(s) < q(curr) THEN curr := s
       ELSE IF q(s) = q(curr) AND |s| > |curr| THEN curr := s
       ELSE terminate and return curr as the solution for this run.
Report the best out of the r solutions found.
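A minimal Python sketch of the SRIDHCR loop above, under stated assumptions: Euclidean distance, impurity defined as the fraction of minority examples, and Penalty(k) = sqrt((k - c) / n) for k ≥ c (the slides do not preserve these definitions). The function names and the `seed` parameter are illustrative and not taken from the talk.

```python
import math
import random
from collections import Counter

def q(points, labels, reps, beta):
    """Assumed fitness: impurity + beta * sqrt((k - c) / n) for k >= c, else impurity."""
    n, c, k = len(points), len(set(labels)), len(reps)
    counts = {r: Counter() for r in reps}
    for p, y in zip(points, labels):
        nearest = min(reps, key=lambda r: math.dist(p, points[r]))
        counts[nearest][y] += 1
    minority = sum(sum(cnt.values()) - max(cnt.values())
                   for cnt in counts.values() if cnt)
    pen = math.sqrt((k - c) / n) if k >= c else 0.0
    return minority / n + beta * pen

def sridhcr(points, labels, beta=0.1, r=10, seed=0):
    rng = random.Random(seed)
    n, c = len(points), len(set(labels))
    best, best_q = None, float("inf")
    for _ in range(r):
        # Random initial solution with between c+1 and 2*c representatives.
        size = rng.randint(min(c + 1, n), min(2 * c, n))
        curr = frozenset(rng.sample(range(n), size))
        curr_q = q(points, labels, curr, beta)
        while True:
            # Neighbours: add one non-representative or remove one representative.
            cands = [curr | {i} for i in range(n) if i not in curr]
            if len(curr) > 1:
                cands += [curr - {rep} for rep in curr]
            if not cands:
                break
            scored = [(q(points, labels, s, beta), s) for s in cands]
            min_q = min(v for v, _ in scored)
            # Break ties among minimal neighbours at random.
            s = rng.choice([s for v, s in scored if v == min_q])
            if min_q < curr_q or (min_q == curr_q and len(s) > len(curr)):
                curr, curr_q = s, min_q
            else:
                break
        if curr_q < best_q:
            best, best_q = curr, curr_q
    return best, best_q

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
best, best_q = sridhcr(points, labels, beta=0.1, r=5, seed=1)
print(sorted(best), best_q)
```

On this toy data every run ends with one representative per class, because removing a redundant representative lowers the penalty without raising impurity.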

[Table: example trace of SRIDHCR. For each iteration, trials in the first part (add a non-medoid) list the set of medoids after adding one non-medoid and its q(X); trials in the second part (drop a medoid) list the set of medoids after removing a medoid and its q(X). A summary lists, per run, the set of medoids producing the lowest q(X) in the run, its q(X), and its purity, starting from the initial solution.]

Algorithm SPAM
Build Initial Solution curr (given # of clusters k):
  1. Determine the medoid of the most frequent class in the dataset. Insert that object m into curr.
  2. For k-1 times, add an object v in the dataset to curr (that is not already in curr) that gives the lowest value for q(X) for curr ∪ {v}.
Improve Initial Solution curr:
  DO FOREVER
    FOR ALL representative objects r in curr DO
      FOR ALL non-representative objects o in dataset DO
        1. Create a new solution v by clustering the dataset around the representative set (curr \ {r}) ∪ {o} and insert v into S.
        2. Calculate q(v) for this clustering.
    Determine the element s in S for which q(s) is minimal (if there is more than one minimal element, randomly pick one).
    IF q(s) < q(curr) THEN curr := s
    ELSE TERMINATE returning curr as the final solution.
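The SPAM build and improve phases can be sketched as follows. As assumptions not confirmed by the slides, the sketch uses Euclidean distance, impurity as the fraction of minority examples, and Penalty(k) = sqrt((k - c) / n) for k ≥ c. It also breaks ties deterministically (first minimal element) rather than randomly, which is a simplification; all names are illustrative.

```python
import math
from collections import Counter

def q(points, labels, reps, beta):
    """Assumed fitness: impurity + beta * sqrt((k - c) / n) for k >= c, else impurity."""
    n, c, k = len(points), len(set(labels)), len(reps)
    counts = {r: Counter() for r in reps}
    for p, y in zip(points, labels):
        nearest = min(reps, key=lambda r: math.dist(p, points[r]))
        counts[nearest][y] += 1
    minority = sum(sum(cnt.values()) - max(cnt.values())
                   for cnt in counts.values() if cnt)
    pen = math.sqrt((k - c) / n) if k >= c else 0.0
    return minority / n + beta * pen

def spam(points, labels, k, beta=0.1):
    n = len(points)
    # Build step 1: medoid of the most frequent class.
    top = Counter(labels).most_common(1)[0][0]
    members = [i for i in range(n) if labels[i] == top]
    m = min(members,
            key=lambda i: sum(math.dist(points[i], points[j]) for j in members))
    curr = frozenset([m])
    # Build step 2: k-1 times, greedily add the object giving the lowest q.
    for _ in range(k - 1):
        v = min((i for i in range(n) if i not in curr),
                key=lambda i: q(points, labels, curr | {i}, beta))
        curr = curr | {v}
    curr_q = q(points, labels, curr, beta)
    # Improve: evaluate every (representative, non-representative) swap,
    # take the best one, and stop when it no longer lowers q.
    while True:
        swaps = [(curr - {r}) | {o}
                 for r in curr for o in range(n) if o not in curr]
        if not swaps:
            return curr, curr_q
        s = min(swaps, key=lambda reps: q(points, labels, reps, beta))
        s_q = q(points, labels, s, beta)
        if s_q < curr_q:
            curr, curr_q = s, s_q
        else:
            return curr, curr_q

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
reps, value = spam(points, labels, k=2, beta=0.1)
print(sorted(reps), value)
```

Each improvement pass evaluates |curr| × (n − |curr|) candidate clusterings, which is why a single SPAM run tends to cost far more than a single SRIDHCR run.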

Differences between SPAM and SRIDHCR
1. SPAM tries to improve the current solution by replacing a representative with a non-representative, whereas SRIDHCR improves the current solution by removing a representative or by inserting a non-representative.
2. SPAM keeps the number of clusters k fixed, whereas SRIDHCR searches for a "good" value of k and therefore explores a larger solution space. However, for SRIDHCR, which values of k are good is somewhat restricted by the choice of the parameter β.
3. SRIDHCR is run r times, each time starting from a random initial solution, whereas SPAM is run only once.

Performance Measures for the Experimental Evaluation
The investigated algorithms were evaluated based on the following performance measures:
- Cluster purity (majority %).
- Value of the fitness function q(X).
- Average dissimilarity between all objects and their representatives (cluster tightness).
- Wall-clock time (WCT): the actual time, in seconds, that the algorithm took to finish the clustering task.
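Purity and tightness can both be computed in one pass over the data once the representatives are known. The sketch below assumes Euclidean distance as the dissimilarity and reads "majority %" as the fraction of examples that belong to the majority class of their cluster; the function name is hypothetical.

```python
import math
from collections import Counter

def evaluate(points, labels, reps):
    """Return (purity, tightness) for the clustering induced by the representatives."""
    counts = {r: Counter() for r in reps}
    total_dist = 0.0
    for p, y in zip(points, labels):
        nearest = min(reps, key=lambda r: math.dist(p, points[r]))
        counts[nearest][y] += 1
        total_dist += math.dist(p, points[nearest])
    # Purity: share of examples in the majority class of their cluster.
    majority = sum(max(cnt.values()) for cnt in counts.values() if cnt)
    n = len(points)
    # Tightness: average dissimilarity between objects and their representatives.
    return majority / n, total_dist / n

points = [(0, 0), (0, 2), (4, 0), (4, 2)]
labels = ["a", "a", "b", "a"]
purity, tightness = evaluate(points, labels, reps=[0, 2])
print(purity, tightness)  # 0.75 1.0
```

Wall-clock time could be measured around a clustering call with `time.perf_counter()` from the standard library.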

Table 4: Traditional vs. Supervised Clustering (β=0.1)
[Table: columns Algorithm, Purity, q(X), Tightness(X); rows PAM, SRIDHCR, and SPAM for each of the Iris-Plants (3 clusters), Vehicle (65 clusters), Image-Segment (53 clusters), and Pima-Indian Diabetes (45 clusters) data sets.]

Table 5: Comparative Performance of the Different Algorithms, β=0.1
[Table: columns Algorithm, q(X), Purity, Tightness(X), WCT (sec.); rows PAM, SRIDHCR, and SPAM for each of the IRIS-Flowers (3 clusters), Vehicle (65 clusters), Segmentation (53 clusters), and Pima-Indians-Diabetes (45 clusters) datasets.]

Table 6: Average Comparative Performance of the Different Algorithms, β=0.4
[Table: columns Algorithm, Avg. Purity, Tightness(X), Avg. WCT (sec.); rows PAM, SRIDHCR, and SPAM for each of the IRIS-Flowers (3 clusters), Vehicle (56 clusters), Segmentation (32 clusters), and Pima-Indians-Diabetes (2 clusters) datasets.]

Why is SRIDHCR performing so much better than SPAM?
- SPAM is relatively slow compared with a single run of SRIDHCR, which allows for 5-30 restarts of SRIDHCR using the same resources. This enables SRIDHCR to conduct a more balanced exploration of the search space.
- The fitness landscape induced by q(X) contains many plateau-like structures (q(X1)=q(X2)) and many local minima, and SPAM seems to get stuck more easily.
- The fact that SPAM uses a fixed k-value does not seem beneficial for finding good solutions. For example, SRIDHCR might explore {u1,u2,u3,u4} → … → {u1,u2,u3,u4,v1,v2} → … → {u3,u4,v1,v2}, whereas SPAM might terminate with the suboptimal solution {u1,u2,u3,u4} if neither the replacement of u1 by v1 nor the replacement of u2 by v2 improves q(X).

Table 7: Ties distribution
[Table: columns Dataset, k, β, Ties % Using q(X), Ties % Using Tightness(X); four rows each for the Iris-Plants, Vehicle, Segmentation, and Diabetes datasets.]

Figure 2: How Purity and k Change as β Increases

Conclusions
1. As expected, supervised clustering algorithms produced significantly better cluster purity than traditional clustering. Improvements range between 7% and 19% for different data sets.
2. Algorithms that explore the search space too greedily, such as SPAM, do not seem to be very suitable for supervised clustering. In general, algorithms that explore the search space more randomly seem to be more suitable.
3. Supervised clustering can be used to enhance classifiers, to summarize datasets, and to generate better distance functions.

Future Work
1. Continue work on supervised clustering algorithms
  – Find better solutions
  – Faster
  – Explain some observations
2. Using supervised clustering for summary generation/learning subclasses
3. Using supervised clustering to find "compressed" nearest neighbor classifiers
4. Using supervised clustering to enhance simple classifiers
5. Distance function learning

K-Means Algorithm
[Figure: scatter plots over Attribute1 and Attribute2, shown across two slides.]