1
Using Supervised Clustering to Enhance Classifiers
Christoph F. Eick and Nidal Zeidat
Department of Computer Science, University of Houston

Organization of the Talk
Supervised Clustering
Representative-based Supervised Clustering Algorithms
Applications: Using Supervised Clustering for Dataset Editing, Class Decomposition, Region Discovery in Spatial Datasets
Summary and Future Work
2
List of Persons that Contributed to the Work Presented in Today’s Talk
Tae-Wan Ryu (former PhD student; now faculty member at Cal State Fullerton)
Ricardo Vilalta (colleague at UH since 2002; Co-Director of the UH’s Data Mining and Knowledge Discovery Group)
Murali Achari (former Master student)
Alain Rouhana (former Master student)
Abraham Bagherjeiran (current PhD student)
Chunshen Chen (current Master student)
Nidal Zeidat (current PhD student)
Sujing Wang (current PhD student)
Kim Wee (current MS student)
Zhenghong Zhao (former Master student)
3
Ch. Eick 1. Introduction
There has been a lot of work on traditional and semi-supervised clustering. Traditional clustering is applied to unclassified examples and relies on a notion of closeness. Its focus is to identify groups in datasets by minimizing a given fitness function, e.g. minimizing intra-cluster distances so that clusters are tight.
Objective of Supervised Clustering: minimize cluster impurity while keeping the number of clusters low (expressed by a fitness function q(X)).
4
Motivation: Finding Subclasses using SC
Supervised clustering can also be used to identify subclasses of a given set of classes. Consider cars that are classified as Ford or General Motors cars. A supervised clustering algorithm would identify subclasses such as Ford trucks, Ford vans, Ford SUVs, GMC trucks, GMC vans, and GMC SUVs.
[Figure: example dataset plotted over Attribute1 and Attribute2; the Ford and GMC examples form subclass clusters labeled Ford Trucks, Ford Vans, Ford SUV, GMC Trucks, GMC Van, GMC SUV.]
5
Related Work Supervised Clustering
Sinkkonen’s discriminative clustering [SKN02] and Tishby’s information bottleneck method [TPB99, ST99] can be viewed as probabilistic supervised clustering approaches.
There has also been a lot of work on semi-supervised clustering, which centers on clustering with background information. Although the focus of that work is traditional clustering, the techniques and algorithms it investigates have a lot in common with ours.
There are many possible supervised clustering algorithms; in our work, we investigate representative-based supervised clustering algorithms.
6
2. Representative-Based Supervised Clustering
Representative-based supervised clustering aims at finding a set of objects in the data set (called representatives) that best represent the objects in the data set. Each representative corresponds to a cluster; the remaining objects are clustered around these representatives by assigning each object to the cluster of its closest representative.
Remark: The popular k-medoid algorithm, also called PAM, is a representative-based clustering algorithm.
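The assignment step is straightforward. A minimal sketch in Python (assuming Euclidean distance and numpy arrays; the function name is ours, not from the talk):

    import numpy as np

    def assign_to_representatives(X, rep_indices):
        """Assign every object to the cluster of its closest representative.
        X: (n, d) array of objects; rep_indices: indices of the chosen representatives.
        Returns, for each object, the position of its representative in rep_indices."""
        reps = X[rep_indices]                                        # (k, d)
        d2 = ((X[:, None, :] - reps[None, :, :]) ** 2).sum(axis=2)   # squared distances
        return d2.argmin(axis=1)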
7
Representative-Based Supervised Clustering … (Continued)
[Figure: example dataset plotted over Attribute1 and Attribute2 with four representatives labeled 1-4; every other object is assigned to the cluster of its closest representative.]
8
Representative-Based Supervised Clustering … (continued)
[Figure: the same dataset over Attribute1 and Attribute2 with representatives 1-4 and the clusters they induce.]
Objective of RSC: find a subset O_R of O such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X).
9
SC Algorithms Currently Investigated
Supervised Partitioning Around Medoids (SPAM)
Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR)
Top Down Splitting Algorithm (TDS)
Supervised Clustering using Evolutionary Computing (SCEC)
Agglomerative Hierarchical Supervised Clustering (AHSC)
Grid-Based Supervised Clustering (GRIDSC)
The remainder of this talk centers on algorithms for supervised clustering. We are currently investigating several such algorithms and comparing their performance.
10
A Fitness Function for Supervised Clustering
q(X) := Impurity(X) + β*Penalty(k)
k: number of clusters used
n: number of examples in the dataset
c: number of classes in the dataset
β: weight for Penalty(k), 0 < β ≤ 2.0
This transparency gives a fitness function that is to be minimized by a supervised clustering algorithm. The fitness function is the sum of the impurity of a clustering X and a penalty associated with the number of clusters k used. The parameter β controls this trade-off: a low β value permits a larger number of clusters, whereas a high β value keeps the number of clusters small. Penalty(k) increases sub-linearly in k, because increasing the number of clusters from k to k+1 has a greater effect on the result when k is small than when it is large.
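To make the trade-off concrete, here is a minimal sketch of such a fitness function. Impurity is taken to be the fraction of minority examples; since the exact Penalty(k) formula is not shown on this slide, we assume one sub-linear form, sqrt((k-c)/n) for k >= c, consistent with the description above. Class labels are assumed to be integers 0..c-1.

    import numpy as np

    def impurity(labels, assignment):
        """Fraction of examples that are not in their cluster's majority class."""
        impure = 0
        for c in np.unique(assignment):
            cluster_labels = labels[assignment == c]
            impure += len(cluster_labels) - np.bincount(cluster_labels).max()
        return impure / len(labels)

    def q(labels, assignment, beta, n_classes):
        """q(X) = Impurity(X) + beta * Penalty(k); lower is better."""
        k, n = len(np.unique(assignment)), len(labels)
        penalty = np.sqrt((k - n_classes) / n) if k >= n_classes else 0.0   # assumed form
        return impurity(labels, assignment) + beta * penalty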
11
Algorithm SRIDHCR (Greedy Hill Climbing)
REPEAT r TIMES
    curr := a randomly created set of representatives (with size between c+1 and 2*c)
    WHILE NOT DONE DO
        Create new solutions S by adding a single non-representative to curr and by removing a single representative from curr
        Determine the element s in S for which q(s) is minimal (if there is more than one minimal element, randomly pick one)
        IF q(s) < q(curr) THEN curr := s
        ELSE IF q(s) = q(curr) AND |s| > |curr| THEN curr := s
        ELSE terminate and return curr as the solution for this run
Report the best of the r solutions found.

Highlights: k is not an input parameter; SRIDHCR searches for the best k within the range induced by β. It reports the best clustering found in r runs.
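A compact Python rendering of the loop above (a sketch only: q is any fitness of the kind defined earlier, evaluated on the clustering induced by a set of representative indices; ties are not broken randomly here and no attempt is made at efficiency):

    import random

    def sridhcr(n, q, n_classes, r=10):
        """Greedy hill climbing with randomized restart over sets of representatives.
        n: number of objects; q: fitness of a representative set (lower is better)."""
        best = None
        for _ in range(r):
            size = random.randint(n_classes + 1, 2 * n_classes)
            curr = frozenset(random.sample(range(n), size))
            while True:
                # neighbours: add one non-representative or remove one representative
                neighbours = [curr | {i} for i in range(n) if i not in curr]
                neighbours += [curr - {i} for i in curr if len(curr) > 1]
                s = min(neighbours, key=q)
                if q(s) < q(curr) or (q(s) == q(curr) and len(s) > len(curr)):
                    curr = s
                else:
                    break
            if best is None or q(curr) < q(best):
                best = curr
        return best

Here q(rep_set) would cluster the data around the given representatives (e.g. with the assignment sketch shown earlier) and return Impurity plus the β-weighted penalty.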
12
Supervised Clustering using Evolutionary Computing: SCEC
[Diagram: SCEC evolves an initial generation of candidate solutions into the next generation via mutation, crossover, and copy operations; the best solution of the final generation is returned as the result.]
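The diagram above is the whole description of SCEC on this slide, so the following is only an illustrative evolutionary loop over sets of representative indices; the operators, rates, and population sizes are our assumptions, not the published SCEC design.

    import random

    def scec(n, q, n_classes, pop_size=20, generations=50):
        """Toy evolutionary search over sets of representative indices (lower q is better)."""
        def random_solution():
            return frozenset(random.sample(range(n), random.randint(n_classes, 2 * n_classes)))

        def mutate(s):
            s = set(s)
            if random.random() < 0.5 and len(s) > 1:
                s.remove(random.choice(sorted(s)))       # drop a representative
            else:
                s.add(random.randrange(n))               # add a representative
            return frozenset(s)

        def crossover(a, b):
            pool = sorted(a | b)
            size = min(len(pool), max(1, (len(a) + len(b)) // 2))
            return frozenset(random.sample(pool, size))

        population = [random_solution() for _ in range(pop_size)]
        for _ in range(generations):
            survivors = sorted(population, key=q)[: pop_size // 2]   # copy the best half
            children = []
            while len(survivors) + len(children) < pop_size:
                a, b = random.sample(survivors, 2)
                child = crossover(a, b)                              # crossover
                if random.random() < 0.3:
                    child = mutate(child)                            # mutation
                children.append(child)
            population = survivors + children
        return min(population, key=q)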
13
Supervised Clustering --- Algorithms and Applications
Organization of the Talk
Supervised Clustering
Representative-based Supervised Clustering Algorithms
Applications: Using Supervised Clustering
    for Dataset Editing
    for Class Decomposition
    for Region Discovery in Spatial Datasets
Conclusion and Future Work
14
Nearest Neighbour Rule
Consider a two-class problem where each sample consists of two measurements (x, y).
k = 1: for a given query point q, assign the class of the nearest neighbour.
k = 3: compute the k nearest neighbours and assign the class by majority vote.
Problem: requires a “good” distance function.
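A minimal sketch of this rule (Euclidean distance, numpy arrays; the function name is ours):

    import numpy as np
    from collections import Counter

    def knn_classify(X_train, y_train, query, k=3):
        """Majority vote among the k nearest neighbours of the query point."""
        dists = np.linalg.norm(X_train - query, axis=1)
        nearest = np.argsort(dists)[:k]
        return Counter(y_train[nearest]).most_common(1)[0][0]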
15
3a. Dataset Reduction: Editing
Training data may contain noise and overlapping classes.
Editing seeks to remove noisy points and produce smooth decision boundaries, often by retaining points far from the decision boundaries.
Main goal of editing: enhance the accuracy of the classifier (% of “unseen” examples classified correctly).
Secondary goal of editing: enhance the speed of a k-NN classifier.
16
Wilson Editing [Wilson 1972]
Remove points that do not agree with the majority of their k nearest neighbours.
[Figure: the earlier example and an overlapping-classes example, each shown as original data and after Wilson editing with k=7.]
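A sketch of Wilson editing along these lines (numpy arrays assumed; a point is excluded from its own neighbourhood):

    import numpy as np
    from collections import Counter

    def wilson_edit(X, y, k=7):
        """Keep only points whose class agrees with the majority of their k nearest neighbours."""
        keep = []
        for i in range(len(X)):
            dists = np.linalg.norm(X - X[i], axis=1)
            dists[i] = np.inf                       # a point is not its own neighbour
            nearest = np.argsort(dists)[:k]
            majority = Counter(y[nearest]).most_common(1)[0][0]
            if y[i] == majority:
                keep.append(i)
        return X[keep], y[keep]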
17
RSC Dataset Editing
One application of supervised clustering is dataset editing. Dataset editing removes examples from a training set with the goal of enhancing the accuracy of a classifier. Here, the idea is to train the classifier on the cluster representatives, determined by supervised clustering, instead of on the whole dataset.
[Figure over Attribute1 and Attribute2: (a) dataset clustered using supervised clustering, with representatives labeled A-F; (b) dataset edited using the cluster representatives only.]
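In code, the editing step itself is tiny (a sketch; rep_indices is assumed to come from a supervised clustering run such as SRIDHCR, and scikit-learn is used only for convenience):

    from sklearn.neighbors import KNeighborsClassifier

    def sce_classifier(X_train, y_train, rep_indices):
        """Train a 1-NN classifier on the cluster representatives only."""
        return KNeighborsClassifier(n_neighbors=1).fit(X_train[rep_indices], y_train[rep_indices])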
18
Supervised Clustering vs. Clustering the Examples of Each Class Separately
Approaches to discover subclasses of a given class:
Cluster the examples of each class separately
Use supervised clustering
[Figure 4: supervised clustering editing vs. clustering each class (x and o) separately; the examples of the two classes lie close together along one direction.]
Remark: A traditional clustering algorithm, such as k-medoids, would pick the centrally located o as the cluster representative, because it is “blind” to how the examples of the other classes are distributed, whereas supervised clustering would pick an o that lies farther from the examples of class x. The central o is not a good choice for editing, because it attracts points of class x, which leads to misclassifications.
19
Experimental Evaluation
We compared a traditional 1-NN classifier and Supervised Clustering Editing (SCE). A benchmark consisting of 8 UCI datasets was used for this purpose. Accuracies were computed using 10-fold cross validation. SRIDHCR was used for supervised clustering. SCE was tested at different compression rates by associating different penalties with the number of clusters found (by setting parameter β to 0.4 and 1.0). Compression rates of SCE and Wilson editing were computed as 1 - (k/n), with n being the size of the original dataset and k being the size of the edited dataset.
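The evaluation protocol can be sketched as follows. supervised_clustering is a stand-in for the SRIDHCR run on each training fold and is not defined here; scikit-learn is used only for convenience.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.neighbors import KNeighborsClassifier

    def evaluate_sce(X, y, supervised_clustering, beta=0.4):
        """10-fold cross-validated accuracy and compression rate of SCE."""
        accs, compressions = [], []
        for train, test in StratifiedKFold(n_splits=10, shuffle=True).split(X, y):
            reps = supervised_clustering(X[train], y[train], beta)    # indices into the fold
            clf = KNeighborsClassifier(n_neighbors=1).fit(X[train][reps], y[train][reps])
            accs.append(clf.score(X[test], y[test]))
            compressions.append(1 - len(reps) / len(train))           # 1 - (k/n)
        return np.mean(accs), np.mean(compressions)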
20
Experimental Results (Table 4)
21
Summary SCE vs. 1-NN-classifier
SCE achieved very high compression rates without loss in accuracy for 5 of the 8 datasets tested.
SCE accomplished a significant improvement in accuracy for 3 of the 8 datasets tested.
Surprisingly, many UCI datasets can be compressed to just a single representative per class without a significant loss in accuracy.
SCE, in contrast to other editing techniques, removes both correctly and incorrectly classified examples from the dataset; this explains its much higher compression rates compared to other techniques.
SCE frequently picks representatives that lie in the center of a region dominated by a single class; however, for clusters with more complex shapes, representatives sometimes need to be lined up across from each other to avoid attracting points that belong to neighboring clusters.
22
Complex9 Dataset
23
Supervised Clustering Result for Complex9
24
Diamonds9 dataset clustered using SC algorithm SRIDHCR
25
Future Direction of this Research
[Diagram: a preprocessing step p maps Data Set to Data Set’; applying the learning algorithm IDLA to Data Set yields Classifier C, and applying IDLA to Data Set’ yields Classifier C’.]
Goal: Find p such that C’ is more accurate than C, or C and C’ have approximately the same accuracy but C’ can be learnt more quickly and/or classifies new examples more quickly.
26
3.b Class Decomposition
Simple classifiers: encompass a small class of approximating functions and have limited flexibility in their decision boundaries.
[Figures: example decision boundaries of simple classifiers plotted over Attribute 1 and Attribute 2.]
27
Naïve Bayes vs. Naïve Bayes with Class Decomposition
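The slide compares the two empirically. The following is only a sketch of the class-decomposition idea: each class is decomposed into subclasses by clustering (per-class k-means is used here for brevity, in place of supervised clustering), a simple classifier is trained on the subclass labels, and predictions are mapped back to the original classes. The class name and the fixed clusters_per_class parameter are our own.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.naive_bayes import GaussianNB

    class NBWithClassDecomposition:
        """Naive Bayes trained on subclass labels obtained by clustering each class."""

        def __init__(self, clusters_per_class=3):
            self.clusters_per_class = clusters_per_class

        def fit(self, X, y):
            sub_labels = np.empty(len(y), dtype=int)
            self.sub_to_class = {}
            next_sub = 0
            for cls in np.unique(y):
                idx = np.where(y == cls)[0]
                k = min(self.clusters_per_class, len(idx))
                km = KMeans(n_clusters=k, n_init=10).fit(X[idx])       # subclasses of this class
                for c in range(k):
                    self.sub_to_class[next_sub + c] = cls
                sub_labels[idx] = next_sub + km.labels_
                next_sub += k
            self.nb = GaussianNB().fit(X, sub_labels)
            return self

        def predict(self, X):
            subs = self.nb.predict(X)
            return np.array([self.sub_to_class[s] for s in subs])      # map back to classes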
28
3.c Discovery of Interesting Regions for Spatial Data Mining
Task: 2D/3D datasets are given; discover interesting regions in the dataset that maximize a given fitness function. Examples of region discovery include:
Discover regions that have significant deviations from the prior probability of a class, e.g. regions in the state of Wyoming where people are very poor or not poor at all
Discover regions that have significant variation in income (fitness is defined based on the variance with respect to income in a region)
Discover congested regions for traffic control
Our Approach: We use (supervised) clustering to discover such regions with a fitness function representing a particular measure of interestingness; regions are implicitly defined by the set of points that belong to a cluster.
29
Wyoming Map
30
Household Income in 1999: Wyoming Park County
31
Clusters to Regions. Example: two clusters, shown in red and blue, are given; regions are defined using a Voronoi diagram based on a NN classifier with k=7; the resulting regions are shown in grey and white.
32
An Evaluation Scheme for Discovering Regions that Deviate from the Prior Probability of a Class C
Let prior(C) = |C|/n and let p(c,C) = percentage of examples in cluster c that belong to class C.
Reward(c) is computed from p(c,C), prior(C), and the parameters g1, g2, R+, R- (with g1 ≤ 1 ≤ g2 and R+, R- ≥ 0), using an interpolation function t(p(c,C), prior(C), g1, g2, R+, R-) (e.g. g1=0.8, g2=1.2, R+=1, R-=1).
q_C(X) = Σ_{c∈X} ( t(p(c,C), prior(C), g1, g2, R+, R-) * |c|^β ) / n, with β > 1 (typically 1 < β < 2); the idea is that increases in cluster size are rewarded non-linearly, favoring clusters with more points as long as |c|*t(…) increases.
[Figure: the interpolation function t plotted against p(c,C); the reward is 0 in the band between prior(C)*g1 and prior(C)*g2 and increases towards R- and R+ outside the band.]
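A sketch of this scheme in code. The piecewise-linear form of t is our reading of the interpolation figure (zero inside the band, rising to R+ as p approaches 1 and to R- as p approaches 0), and it assumes 0 < prior*g1 and prior*g2 < 1; unlike q(X) earlier, this reward is to be maximized.

    import numpy as np

    def t(p, prior, g1, g2, r_plus, r_minus):
        """Reward 0 inside [prior*g1, prior*g2]; grows linearly towards R+ as p -> 1
        and towards R- as p -> 0."""
        if p >= prior * g2:
            return r_plus * (p - prior * g2) / (1 - prior * g2)
        if p <= prior * g1:
            return r_minus * (prior * g1 - p) / (prior * g1)
        return 0.0

    def q_C(clusters, labels, target_class, g1=0.8, g2=1.2, r_plus=1.0, r_minus=1.0, beta=1.1):
        """q_C(X) = sum over clusters c of t(p(c,C), prior(C), ...) * |c|**beta / n."""
        n = len(labels)
        prior = np.mean(labels == target_class)
        total = 0.0
        for c in clusters:                      # c: array of example indices in the cluster
            p = np.mean(labels[c] == target_class)
            total += t(p, prior, g1, g2, r_plus, r_minus) * len(c) ** beta
        return total / n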
33
Ch. Eick Example: Discovery of “Interesting Regions” in Wyoming Census 2000 Datasets
34
Supervised Clustering --- Algorithms and Applications
Organization of the Talk
Supervised Clustering
Representative-based Supervised Clustering Algorithms
Applications: Using Supervised Clustering
    for Dataset Editing
    for Class Decomposition
    for Region Discovery in Spatial Datasets
Summary and Future Work
35
4. Summary and Future Work
A novel data mining technique, which we term “supervised clustering”, was introduced. The benefits of using supervised clustering as a preprocessing step to enhance classification algorithms, such as NN classifiers and naïve Bayesian classifiers, were demonstrated. In our current research, we investigate the use of supervised clustering for spatial data mining, distance function learning, and the discovery of subclasses. Moreover, we investigate how to make supervised clustering adaptive with respect to user feedback.
36
An Environment for Adaptive (Supervised) Clustering for Summary Generation Applications
Idea: development of a generic clustering/feedback/adaptation architecture whose objective is to facilitate the search for clusterings that maximize an internally and/or externally given reward function.
[Diagram: the clustering algorithm, an evaluation system, and an adaptation system form a feedback loop; the adaptation system changes the algorithm’s inputs based on evaluation feedback and quality, past experience, a domain expert, and predefined fitness functions q(X), …]
37
Links to 5 Related Papers
[VAE03] R. Vilalta, M. Achari, C. Eick, Class Decomposition via Clustering: A New Framework for Low-Variance Classifiers, in Proc. IEEE International Conference on Data Mining (ICDM), Melbourne, Florida, November 2003.
[EZZ04] C. Eick, N. Zeidat, Z. Zhao, Supervised Clustering --- Algorithms and Benefits, short version of this paper to appear in Proc. International Conference on Tools with AI (ICTAI), Boca Raton, Florida, November 2004.
[ERBV04] C. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta, Using Clustering to Learn Distance Functions for Supervised Similarity Assessment, to appear in Proc. MLDM'05, Leipzig, Germany, July 2005.
[EZV04] C. Eick, N. Zeidat, R. Vilalta, Using Representative-Based Clustering for Nearest Neighbor Dataset Editing, to appear in Proc. IEEE International Conference on Data Mining (ICDM), Brighton, England, November 2004.
[ZSE05] N. Zeidat, S. Wang, and C. Eick, Data Set Editing Techniques: A Comparative Study, submitted for publication.