Supervised Clustering --- Algorithms and Applications


1 Supervised Clustering --- Algorithms and Applications
Christoph F. Eick
Department of Computer Science, University of Houston

Organization of the Talk
- Supervised Clustering
- Supervised Clustering Algorithms
- Applications: Using Supervised Clustering for Dataset Editing, Class Decomposition, Distance Function Learning, and Region Discovery in Spatial Datasets
- Other Activities I am Involved With

2 List of Persons that Contributed to the Work Presented in Today’s Talk
- Tae-Wan Ryu (former PhD student; now faculty member at Cal State Fullerton)
- Ricardo Vilalta (colleague at UH since 2002; Co-Director of UH’s Data Mining and Knowledge Discovery Group)
- Abraham Bagherjeiran (current PhD student)
- Chunshen Chen (current Master’s student)
- Nidal Zeidat (former PhD student)
- Kim Wee (former MS student)
- Zhenghong Zhao (former Master’s student)
- Jing Wang (Master’s student)
- Banafsheh Vaezian (Master’s student)
- Dan Jiang (Master’s student)

3 Traditional Clustering
Partition a set of objects into groups of similar objects; each group is called a cluster. Clustering is used to “detect classes” in a data set (“unsupervised learning”). Clustering is based on a fitness function that relies on a distance measure and usually tries to create “tight” clusters. There has been a lot of work on traditional and semi-supervised clustering. Traditional clustering is applied to unclassified examples and relies on a notion of closeness; its focus is to identify groups in datasets by minimizing a given fitness function, such as the tightness of a clustering.

4 Different Forms of Clustering
Objective of Supervised Clustering: Minimize cluster impurity while keeping the number of clusters low (expressed by a fitness function q(X)). In contrast, traditional clustering is applied to unclassified examples, relies on a notion of closeness, and minimizes a given fitness function such as the tightness of a clustering.

5 Motivation: Finding Subclasses using SC
[Figure: a 2D attribute space (Attribute1 vs. Attribute2) with examples labeled :Ford and :GMC, grouped into subclasses: Ford Trucks, Ford Vans, Ford SUV, GMC Trucks, GMC Van, GMC SUV.]

Finally, supervised clustering can be used to identify subclasses of a given set of classes. Consider cars that are classified as Ford or General Motors cars. A supervised clustering algorithm would identify subclasses, such as Ford Trucks and GMC Vans.

6 Related Work Supervised Clustering
Sinkkonen’s discriminative clustering [SKN02] and Tishby’s information bottleneck method [TPB99, ST99] can be viewed as probabilistic supervised clustering algorithms. There has been a lot of work in the area of semi-supervised clustering that centers on clustering with background information. Although the focus of that work is traditional clustering, there is still a lot of similarity between the techniques and algorithms they investigate and those we investigate.

7 2. Supervised Clustering Algorithms
- Supervised Partitioning Around Medoids (SPAM)
- Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR)
- Supervised Clustering using Evolutionary Computing (SCEC)
- SCEC with Gabriel Graph Based Post-processing (SCECGGP)
- Agglomerative Hierarchical Supervised Clustering (SCAH)
- Hierarchical Grid-based Supervised Clustering (SCHG)
- Supervised Clustering using Multi-Resolution Grids (SCMRG)
- Supervised Clustering using Density Estimation Techniques (SCDE)

The remainder of this talk centers on algorithms for supervised clustering. Currently we are investigating several clustering algorithms and comparing their performance. Remark: For a more detailed discussion of SCEC and SRIDHCR see [EZZ04]; for more details concerning the other algorithms see [EVJW06].

8 Representative-Based Supervised Clustering
Aims at finding a set of objects (called representatives) among all objects in the data set that best represent the objects in the data set. Each representative corresponds to a cluster. The remaining objects in the data set are then clustered around these representatives by assigning each object to the cluster of the closest representative.
Remark: The popular k-medoid algorithm, also called PAM, is a representative-based clustering algorithm. There are many possible supervised clustering algorithms; in our work, we investigate representative-based supervised clustering algorithms that aim at finding representatives that minimize the fitness function q(X).
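The assignment step described above can be sketched as follows; the function and argument names are illustrative, not from the talk:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def assign_to_representatives(objects, representatives):
    """Cluster objects around representatives: each object joins the
    cluster of its closest representative (as in PAM/k-medoids)."""
    clusters = {r: [] for r in representatives}
    for o in objects:
        nearest = min(representatives, key=lambda r: dist(o, r))
        clusters[nearest].append(o)
    return clusters
```

A supervised variant would then score the resulting clustering with q(X) rather than with cluster tightness.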

9 Representative-Based Supervised Clustering … (Continued)
[Figure: four representatives (1-4) in a 2D attribute space (Attribute1 vs. Attribute2); the remaining objects are assigned to the cluster of the closest representative.]

10 Representative-Based Supervised Clustering … (continued)
[Figure: the same four representatives (1-4) in the 2D attribute space (Attribute1 vs. Attribute2), with the clusters they induce.]

The objective of representative-based supervised clustering (RSC) is the following: Find a subset O_R of O such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X).

11 Evaluation Schemes for SC
For Meta-Learning / Clustering for Classification / Distance Function Learning, evaluation is based on:
- Purity
- Number of clusters

For Region Discovery in Spatial Datasets, reward-based evaluation schemes are employed:
- Quality of a clustering: sum of the rewards associated with each cluster
- The reward of a cluster c is computed as Reward(c)*size(c)^β with β>1
- Idea: a cluster's reward increases non-linearly with its size
- Consequence: merging two clusters with equal rewards leads to a better clustering (a^β + b^β < (a+b)^β, with a and b being the sizes of the two clusters)
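The merge consequence can be checked numerically; `size_term` and the sample sizes below are illustrative only:

```python
# With beta > 1 the size term of the reward is superlinear, so merging two
# clusters with equal per-point rewards yields a better clustering:
# a**beta + b**beta < (a + b)**beta for a, b > 0 and beta > 1.
def size_term(size, beta=1.5):
    return size ** beta

a, b = 50, 50
separate = size_term(a) + size_term(b)   # two clusters kept apart
merged = size_term(a + b)                # the merged cluster
```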

12 A Fitness Function for Supervised Clustering
q(X) := Impurity(X) + β*Penalty(k)
k: number of clusters used
n: number of examples in the dataset
c: number of classes in the dataset
β: weight for Penalty(k), 0 < β ≤ 2.0

This transparency gives a fitness function that has to be minimized by a supervised clustering algorithm. The fitness function is the sum of the impurity of a clustering X and a penalty that is associated with the number of clusters k used. The fitness function uses the parameter β; if we are interested in obtaining a large number of clusters, we would use a low β value. Penalty(k) increases sub-linearly, because increasing the number of clusters from k to k+1 has a greater effect on the end result when k is small than when it is large.
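A minimal sketch of this fitness function. Two assumptions are baked in, since the slide does not give the formulas: Impurity(X) is taken to be the fraction of examples outside their cluster's majority class, and Penalty(k) is given one concrete sub-linear form (the slide only states that it grows sub-linearly):

```python
from collections import Counter
from math import sqrt

def fitness_q(clusters, num_classes, beta=0.4):
    """q(X) = Impurity(X) + beta * Penalty(k).

    `clusters` is a list of lists of class labels, one list per cluster.
    Impurity(X): fraction of examples not in their cluster's majority
    class. Penalty(k): ASSUMED here as sqrt((k - c) / n) for k > c and
    0 otherwise -- the slide only says it grows sub-linearly in k.
    """
    n = sum(len(c) for c in clusters)
    k = len(clusters)
    # count examples that disagree with their cluster's majority label
    minority = sum(len(c) - Counter(c).most_common(1)[0][1] for c in clusters)
    impurity = minority / n
    penalty = sqrt((k - num_classes) / n) if k > num_classes else 0.0
    return impurity + beta * penalty
```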

13 Discovery of Interesting Regions for Spatial Data Mining
Task: 2D/3D datasets are given; discover interesting regions in the dataset that maximize a given fitness function. Examples of region discovery include:
- Discover regions that have significant deviations from the prior probability of a class; e.g. regions in the state of Wyoming where people are very poor or not poor at all
- Discover regions that have significant variation in income (fitness is defined based on the variance with respect to income in a region)
- Discover regions for congressional redistricting
- Discover congested regions for traffic control
Remark: We use (supervised) clustering to discover such regions; regions are implicitly defined by the set of points that belong to a cluster.

14 Example: Discovery of “Interesting Regions” in Wyoming Census 2000 Datasets

An Evaluation Scheme for Discovering Regions that Deviate from the Prior Probability of a Class C

Let prior(C) = |C|/n, and let p(c,C) be the percentage of examples in cluster c that belong to class C.
Reward(c) is computed based on p(c,C), prior(C), and the parameters g1, g2, R+, R- (with g1 ≤ 1 ≤ g2 and R+, R- ≥ 0), relying on the interpolation function t(p(c,C), prior(C), g1, g2, R+, R-) (e.g. g1=0.8, g2=1.2, R+=1, R-=1).

qC(X) = Σ_{c∈X} (t(p(c,C), prior(C), g1, g2, R+, R-) * |c|^β) / n^β, with β>1 (typically 1<β<2); the idea is that increases in cluster size are rewarded non-linearly, favoring clusters with more points as long as |c|*t(…) increases.

[Figure: the interpolation function t plotted against p(c,C): it is 0 for p(c,C) between prior(C)*g1 and prior(C)*g2, rises towards R- as p(c,C) approaches 0, and rises towards R+ as p(c,C) approaches 1.]
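The interpolation function t might be sketched as follows. The piecewise-linear form is a reconstruction from the figure, so treat it as an assumption rather than the exact function used:

```python
def t(p, prior, g1=0.8, g2=1.2, r_plus=1.0, r_minus=1.0):
    """Interpolated reward for a cluster whose class-C density p deviates
    from prior(C). Zero reward inside [prior*g1, prior*g2]; rises linearly
    to r_minus as p -> 0 and to r_plus as p -> 1 (a reconstruction of the
    piecewise-linear function sketched on the slide)."""
    low, high = prior * g1, prior * g2
    if p < low:                                  # class under-represented
        return r_minus * (low - p) / low
    if p > high:                                 # class over-represented
        return r_plus * (p - high) / (1.0 - high)
    return 0.0                                   # close to the prior: no reward
```

With this t, a cluster whose class density matches the prior earns nothing, while pure or class-free clusters earn the maximal rewards R+ and R-.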

16 Clusters → Regions. Example: 2 clusters in red and blue are given; regions are defined by using a Voronoi diagram based on a NN classifier with k=7; regions are shown in grey and white.

17 Algorithm SRIDHCR (Greedy Hill Climbing)
REPEAT r TIMES
  curr := a randomly created set of representatives (with size between c+1 and 2*c)
  WHILE NOT DONE DO
    Create new solutions S by adding a single non-representative to curr and by removing a single representative from curr
    Determine the element s in S for which q(s) is minimal (if there is more than one minimal element, randomly pick one)
    IF q(s) < q(curr) THEN curr := s
    ELSE IF q(s) = q(curr) AND |s| > |curr| THEN curr := s
    ELSE terminate and return curr as the solution for this run
Report the best of the r solutions found.

Highlights: k is not an input parameter; SRIDHCR searches for the best k within the range induced by β. It reports the best clustering found in the r runs.
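A runnable sketch of the SRIDHCR loop above. The fitness function q is supplied by the caller, and the function name itself is made up for illustration:

```python
import random

def sridhcr(objects, q, num_classes, runs=10):
    """Greedy hill climbing with randomized restarts (a sketch of SRIDHCR).
    `q` maps a set of representatives to a fitness value; lower is better.
    k is not fixed: solutions grow or shrink by one representative per step."""
    best = None
    for _ in range(runs):
        size = random.randint(num_classes + 1, 2 * num_classes)
        curr = set(random.sample(objects, size))
        while True:
            # neighbours: add one non-representative or drop one representative
            neighbours = [curr | {o} for o in objects if o not in curr]
            neighbours += [curr - {r} for r in curr if len(curr) > 1]
            # among minimal-q neighbours, prefer the larger solution
            s = min(neighbours, key=lambda c: (q(c), -len(c)))
            if q(s) < q(curr) or (q(s) == q(curr) and len(s) > len(curr)):
                curr = s
            else:
                break  # no improving move: this run is done
        if best is None or q(curr) < q(best):
            best = curr
    return best
```

The tie-breaking rule (accept an equal-fitness but larger solution) lets the search climb out of plateaus by temporarily growing the representative set.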

18 Supervised Clustering using Evolutionary Computing: SCEC
[Figure: SCEC evolves a population of solutions from an initial generation through successive generations via mutation, crossover, and copy operators; the best solution of the final generation is returned as the result.]

19 Create next Generation
[The complete flow chart of SCEC: Initialize Solutions; loop N times: Create next Generation (compose population S by mutation (new S'[i]), crossover, and copy, with K-tournament selection); Evaluate the Population (loop PS times: clustering on S[i], evaluation, intermediate result); Record Best Solution and Q; on exit, report the Best Solution, Q, and a Summary.]

20 Complex1 Dataset

21 Supervised Clustering Result

22 Supervised Clustering --- Algorithms and Applications
Organization of the Talk
- Supervised Clustering
- Supervised Clustering Algorithms
- Applications: Using Supervised Clustering for Dataset Editing, for Class Decomposition, for Distance Function Learning, and for Region Discovery in Spatial Datasets
- Other Activities I am Involved With

23 Nearest Neighbour Rule
Consider a two-class problem where each sample consists of two measurements (x, y).
- k = 1: for a given query point q, assign the class of the nearest neighbour.
- k = 3: compute the k nearest neighbours and assign the class by majority vote.
Problem: requires a “good” distance function.
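The majority-vote rule above can be sketched as follows (function and argument names are illustrative):

```python
from collections import Counter
from math import dist

def knn_classify(query, training, k=3):
    """Assign the majority class among the k nearest training samples.
    `training` is a list of ((x, y), label) pairs."""
    neighbours = sorted(training, key=lambda s: dist(s[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```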

24 3a. Dataset Reduction: Editing
- Training data may contain noise and overlapping classes
- Editing seeks to remove noisy points and produce smooth decision boundaries, often by retaining points far from the decision boundaries
- Main goal of editing: enhance the accuracy of a classifier (% of “unseen” examples classified correctly)
- Secondary goal of editing: enhance the speed of a k-NN classifier

25 Wilson Editing [Wilson 1972]
Remove points that do not agree with the majority of their k nearest neighbours.
[Figure: the earlier example and an overlapping-classes example, each shown as the original data and after Wilson editing with k=7.]
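A minimal sketch of Wilson editing as described above; the helper name is illustrative:

```python
from collections import Counter
from math import dist

def wilson_edit(samples, k=3):
    """Wilson editing: discard every sample whose class disagrees with the
    majority vote of its k nearest neighbours (excluding itself).
    `samples` is a list of ((x, y), label) pairs."""
    kept = []
    for i, (point, label) in enumerate(samples):
        others = [s for j, s in enumerate(samples) if j != i]
        neighbours = sorted(others, key=lambda s: dist(s[0], point))[:k]
        majority = Counter(l for _, l in neighbours).most_common(1)[0][0]
        if majority == label:
            kept.append((point, label))
    return kept
```

Points deep inside the wrong class's territory lose the vote and are removed, which smooths the decision boundary.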

26 RSC → Dataset Editing
[Figure: a. the dataset clustered using supervised clustering into clusters A-F with their representatives; b. the dataset edited down to the cluster representatives (2D attribute space, Attribute1 vs. Attribute2).]

One application of supervised clustering is dataset editing. The idea of dataset editing is to remove examples from a training set with the goal of enhancing the accuracy of a classifier: the cluster representatives, determined using supervised clustering, are used instead of the whole dataset when training the classifier.

27 Experimental Evaluation
We compared a traditional 1-NN, 1-NN using Wilson Editing, Supervised Clustering Editing (SCE), and C4.5 (run using its default parameter settings). A benchmark consisting of 8 UCI datasets was used for this purpose. Accuracies were computed using 10-fold cross-validation. SRIDHCR was used for supervised clustering. SCE was tested at different compression rates by associating different penalties with the number of clusters found (by setting the parameter β to 0.1, 0.4, and 1.0). Compression rates of SCE and Wilson Editing were computed as 1-(k/n), with n being the size of the original dataset and k the size of the edited dataset.

28 Table 2: Prediction Accuracy for the four classifiers.
(Per dataset: the β=0.1 row lists NR, Wilson, 1-NN, and C4.5 accuracy; the β=0.4 and β=1.0 rows list the NR accuracy only, where reported.)

Glass (214): β=0.1: NR 0.636, Wilson 0.607, 1-NN 0.692, C4.5 0.677; β=0.4: 0.589; β=1.0: 0.575
Heart-Stat Log (270): β=0.1: NR 0.796, Wilson 0.804, 1-NN 0.767, C4.5 0.782; β=0.4: 0.833; β=1.0: 0.838
Diabetes (768): NR 0.736, Wilson 0.734, 1-NN 0.690, C4.5 0.745
Vehicle (846): NR 0.667, Wilson 0.716, 1-NN 0.700, C4.5 0.723; 0.665
Heart-H (294): NR 0.755, Wilson 0.809, 1-NN 0.783, C4.5 0.802; 0.793
Waveform (5000): NR 0.834, Wilson 0.768, 1-NN 0.781, C4.5 0.841; 0.837
Iris-Plants (150): NR 0.947, Wilson 0.936, 1-NN 0.973, C4.5 0.953
Segmentation (2100): NR 0.938, Wilson 0.966, 1-NN 0.956, C4.5 0.968; 0.919; 0.890

29 Table 3: Dataset Compression Rates for SCE and Wilson Editing.
(Columns: β, Avg. k [Min-Max] for SCE, SCE compression rate (%), Wilson compression rate (%); values are kept in their original order where the mapping is unclear.)

Glass (214): β=0.1: 34 [28-39], 84.3, Wilson 27; β=0.4: 25 [19-29], 88.4; β=1.0: 6 [6-6], 97.2
Heart-Stat Log (270): 15 [12-18], 94.4, Wilson 22.4; 2 [2-2], 99.3
Diabetes (768): 27 [22-33], 96.5, Wilson 30.0; 9 [2-18], 98.8; 99.7
Vehicle (846): 57 [51-65], 97.3, Wilson 30.5; 38 [26-61], 95.5; 14 [9-22], 98.3
Heart-H (294): 14 [11-18], 95.2, Wilson 21.9; 2
Waveform (5000): 104 [79-117], 97.9, Wilson 23.4; 28 [20-39], 99.4; 4 [3-6], 99.9
Iris-Plants (150): 4 [3-8], 6.0; 3 [3-3], 98.0
Segmentation (2100): 57 [48-65], 2.8; 30 [24-37], 98.6; 14

30 Summary SCE and Wilson Editing
Wilson editing enhances the accuracy of a traditional 1-NN classifier for six of the eight datasets tested. It achieved compression rates of approx. 25%, but much lower compression rates for “easy” datasets. SCE achieved very high compression rates without loss in accuracy for 6 of the 8 datasets tested. SCE accomplished a significant improvement in accuracy for 3 of the 8 datasets tested. Surprisingly, many UCI datasets can be compressed by just using a single representative per class without a significant loss in accuracy. SCE tends to pick representatives that are in the center of a region that is dominated by a single class; it removes examples that are classified correctly as well as examples that are classified incorrectly from the dataset. This explains its much higher compression rates. Remark: For a more detailed evaluation of SCE, Wilson Editing, and other editing techniques see [EZV04].

31 Future Direction of this Research
[Figure: a preprocessing step p transforms Data Set into Data Set'; applying the inductive learning algorithm (IDLA) to each yields Classifier C and Classifier C', respectively.]
Goal: Find p such that C' is more accurate than C, or C and C' have approximately the same accuracy but C' can be learnt more quickly and/or classifies new examples more quickly.

32 Supervised Clustering vs. Clustering the Examples of Each Class Separately
Approaches to discover subclasses of a given class:
- Cluster the examples of each class separately
- Use supervised clustering
[Figure 4: Supervised clustering editing vs. clustering each class (x and o) separately.]
Remark: A traditional clustering algorithm, such as k-medoids, would pick the centrally located o as the cluster representative, because it is “blind” to how the examples of the other classes are distributed, whereas supervised clustering would pick a representative that lies away from the examples of class x; the central o is not a good choice for editing, because it attracts points of class x, which leads to misclassifications.

33 3.b Class Decomposition (see also [VAE03])
[Figure: in a 2D attribute space (Attribute 1 vs. Attribute 2), each class is decomposed into several clusters so that a simple classifier can separate the classes.]
Simple classifiers: encompass a small class of approximating functions; limited flexibility in their decision boundaries.

34 Naïve Bayes vs. Naïve Bayes with Class Decomposition

35 Example: How to Find Similar Patients?
3c. Using Clustering in Distance Function Learning
Example: How to Find Similar Patients?
The following relation is given (with tuples): Patient(ssn, weight, height, cancer-sev, eye-color, age, …)
Attribute domains:
- ssn: 9 digits
- weight: between 30 and 650; μweight = 158, σweight = 24.20
- height: between 0.30 and 2.20 meters; μheight = 1.52, σheight = 19.2
- cancer-sev: 4 = serious, 3 = quite serious, 2 = medium, 1 = minor
- eye-color: {brown, blue, green, grey}
- age: between 3 and 100; μage = 45, σage = 13.2
Task: Define Patient Similarity
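One possible form of such a patient distance function, as a sketch only: the numeric normalizers below are taken from the stated attribute domains, while the weighting scheme itself (weighted sum, mismatch indicator for categorical attributes) is an assumption, not the function learned in this work:

```python
def patient_distance(p, q, weights):
    """Weighted distance over heterogeneous patient attributes (illustrative).
    Numeric attributes are range-normalized using the domains stated above;
    categorical attributes contribute 0 on a match, 1 otherwise."""
    ranges = {"weight": 650 - 30, "height": 2.20 - 0.30, "age": 100 - 3}
    total = 0.0
    for attr, w in weights.items():
        if attr in ranges:  # numeric: normalized absolute difference
            total += w * abs(p[attr] - q[attr]) / ranges[attr]
        else:               # categorical: simple mismatch indicator
            total += w * (0.0 if p[attr] == q[attr] else 1.0)
    return total
```

Learning a good distance function then amounts to learning the `weights` dictionary, which is exactly what the following slides are about.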

36 CAL-FULL/UH Database Clustering & Similarity Assessment Environments
[Architecture diagram, components: DBMS; Data Extraction Tool; Training Data; Learning Tool (today's topic); Library of similarity measures; Similarity Measure Tool; Similarity measure; Object View; Clustering Tool; Library of clustering algorithms; A set of clusters; User Interface; Type and weight information; Default choices and domain information.]
For more details, see [RE05].

37 Similarity Assessment Framework and Objectives
Objective: Learn a good distance function q for classification tasks.
Our approach: Apply a clustering algorithm with the distance function q to be evaluated; it returns a number of clusters k. The purer the obtained clusters, the better the quality of q.
Our goal is to learn the weights of an object distance function q such that all the clusters are pure (or as pure as possible); for more details see the [ERBV05] and [BECV05] papers.

38 Idea: Coevolving Clusters and Distance Functions
[Figure: clusterings X and distance functions Q co-evolve: the Weight Updating Scheme / Search Strategy produces a distance function Q, clustering with Q produces a clustering X, and the clustering evaluation q(X) measures the goodness of Q. A “bad” distance function Q1 produces clusters that mix o and x examples; a “good” distance function Q2 produces pure clusters.]

39 Idea Inside/Outside Weight Updating
o := examples belonging to the majority class; x := non-majority-class examples.
Cluster 1, distances with respect to Att1:  xo oo ox
Action: Increase the weight of Att1. Idea: move examples of the majority class closer to each other.
Cluster 1, distances with respect to Att2:  o o xx o o
Action: Decrease the weight of Att2.
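The inside/outside updating rule for a single attribute can be sketched as follows; the multiplicative form and the learning rate are assumptions for illustration:

```python
def update_weight(w, majority_avg_dist, minority_avg_dist, rate=0.1):
    """Inside/outside weight updating for one attribute (a sketch).
    If, along this attribute, majority-class examples of a cluster are
    closer to each other than to non-majority examples, increase the
    attribute's weight; otherwise decrease it."""
    if majority_avg_dist < minority_avg_dist:
        return w * (1 + rate)   # attribute separates the classes: emphasize it
    return w * (1 - rate)       # attribute mixes the classes: de-emphasize it
```

Repeating this per cluster and per attribute moves majority-class examples closer together under the learned distance function.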

40 Sample Run of IOWU for Diabetes Dataset
Graph produced by Abraham Bagherjeiran

41 Research Framework Distance Function Learning
Research framework for distance function learning: an evaluation scheme (K-Means, Supervised Clustering, NN-Classifier, or Adaptive Clustering) is combined with a weight-updating scheme / search strategy (Inside/Outside Weight Updating [ERBV04] or Randomized Hill Climbing); related work by Karypis; other research [BECV05].

42 3d. Using Clustering for Region Discovery
 Separate Set of Transparencies

43 Supervised Clustering --- Algorithms and Applications
Organization of the Talk
- Supervised Clustering
- Supervised Clustering Algorithms
- Applications: Using Supervised Clustering for Dataset Editing, for Class Decomposition, for Distance Function Learning, and for Region Discovery in Spatial Datasets
- Other Activities I am Involved In

44 An Environment for Adaptive (Supervised) Clustering for Summary Generation Applications
[Figure: an Adaptation System changes the clustering algorithm's inputs; an Evaluation System provides feedback and quality assessments based on predefined fitness functions q(X), past experience, and a domain expert.]
Idea: Development of a generic Clustering/Feedback/Adaptation architecture whose objective is to facilitate the search for clusterings that maximize an internally and/or externally given reward function (for some initial ideas see [BECV05]).

45 Clustering Algorithm Inputs
- Data Set (examples and their feature representation)
- Distance Function
- Clustering Algorithm Parameters
- Fitness Function Parameters
- Background Knowledge

46 Remark: Topics that were “covered” in this talk are in blue
Research Topics 2005/2006
- Inductive Learning/Data Mining: decision trees, nearest neighbor classifiers; using clustering to enhance classification algorithms; making sense of data
- Supervised Clustering: learning subclasses; supervised clustering algorithms that learn clusters with arbitrary shape; using supervised clustering for region discovery
- Adaptive clustering
- Tools for Similarity Assessment and Distance Function Learning
- Data Set Compression and Creating Meta Knowledge for Local Learning Techniques: comparative studies; creating maps and other data set signatures for datasets based on editing, SC, and other techniques
- Traditional Clustering
- Data Mining and Information Retrieval for Structured Data
- Other: Evolutionary Computing, File Prediction, Ontologies, Heuristic Search, Reinforcement Learning, Data Models

47 Links to 7 Papers
[VAE03] R. Vilalta, M. Achari, C. Eick, Class Decomposition via Clustering: A New Framework for Low-Variance Classifiers, in Proc. IEEE International Conference on Data Mining (ICDM), Melbourne, Florida, November 2003.
[EZZ04] C. Eick, N. Zeidat, Z. Zhao, Supervised Clustering --- Algorithms and Benefits, short version appeared in Proc. International Conference on Tools with AI (ICTAI), Boca Raton, Florida, November 2004.
[EZV04] C. Eick, N. Zeidat, R. Vilalta, Using Representative-Based Clustering for Nearest Neighbor Dataset Editing, in Proc. IEEE International Conference on Data Mining (ICDM), Brighton, England, November 2004.
[RE05] T. Ryu, C. Eick, A Clustering Methodology and Tool, in Information Sciences 171(1-3), 2005.
[ERBV04] C. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta, Using Clustering to Learn Distance Functions for Supervised Similarity Assessment, in Proc. MLDM'05, Leipzig, Germany, July 2005.
[BECV05] A. Bagherjeiran, C. Eick, C.-S. Chen, R. Vilalta, Adaptive Clustering: Obtaining Better Clusters Using Feedback and Past Experience, in Proc. IEEE International Conference on Data Mining (ICDM), Houston, Texas, November 2005.
[EVJW06] C. Eick, B. Vaezian, D. Jiang, J. Wang, Discovery of Interesting Regions in Spatial Data Sets Using Supervised Clustering, submitted to PKDD 2006; long version to be available as a UH Technical Report by June 15, 2006.

