Using Supervised Clustering to Enhance Classifiers


Using Supervised Clustering to Enhance Classifiers
Christoph F. Eick and Nidal Zeidat
Department of Computer Science, University of Houston

Organization of the Talk
1. Supervised Clustering
2. Representative-based Supervised Clustering Algorithms
3. Applications: Using Supervised Clustering
   a. for Dataset Editing
   b. for Class Decomposition
   c. for Region Discovery in Spatial Datasets
4. Summary and Future Work

List of Persons that Contributed to the Work Presented in Today’s Talk
- Tae-Wan Ryu (former PhD student; now faculty member at Cal State Fullerton)
- Ricardo Vilalta (colleague at UH since 2002; Co-Director of UH’s Data Mining and Knowledge Discovery Group)
- Murali Achari (former Master’s student)
- Alain Rouhana (former Master’s student)
- Abraham Bagherjeiran (current PhD student)
- Chunshen Chen (current Master’s student)
- Nidal Zeidat (current PhD student)
- Sujing Wang (current PhD student)
- Kim Wee (current MS student)
- Zhenghong Zhao (former Master’s student)

Ch. Eick: 1. Introduction
There has been a lot of work on traditional and semi-supervised clustering. Traditional clustering is applied to unclassified examples and relies on a notion of closeness; its focus is to identify groups in datasets by minimizing a given fitness function, such as the tightness of a clustering.
Objective of supervised clustering: minimize cluster impurity while keeping the number of clusters low (expressed by a fitness function q(X)).

Motivation: Finding Subclasses using SC
Supervised clustering can be used to identify subclasses of a given set of classes. Suppose we have cars that are classified as Ford or GMC cars. A supervised clustering algorithm would identify subclasses such as Ford trucks, Ford vans, Ford SUVs, GMC trucks, GMC vans, and GMC SUVs.
[Figure: example clusters in Attribute1/Attribute2 space, labeled Ford Trucks, Ford Vans, Ford SUV, GMC Trucks, GMC Van, GMC SUV; legend distinguishes Ford and GMC examples.]

Related Work: Supervised Clustering
Sinkkonen’s discriminative clustering [SKN02] and Tishby’s information bottleneck method [TPB99, ST99] can be viewed as probabilistic supervised clustering algorithms.
There has been a lot of work in the area of semi-supervised clustering, which centers on clustering with background information. Although the focus of that work is traditional clustering, there is still a lot of similarity between the techniques and algorithms it investigates and those we investigate.
There are many possible supervised clustering algorithms. In our work, we investigate representative-based supervised clustering algorithms, which are introduced next.

2. Representative-Based Supervised Clustering
Aims at finding a set of objects in the data set (called representatives) that best represent the objects in the data set. Each representative corresponds to a cluster. The remaining objects are then clustered around these representatives by assigning each object to the cluster of the closest representative.
Remark: The popular k-medoid algorithm, also called PAM, is a representative-based clustering algorithm.

Representative-Based Supervised Clustering … (Continued)
[Figure: a dataset in Attribute1/Attribute2 space with four representatives, labeled 1–4.]

Representative-Based Supervised Clustering … (Continued)
[Figure: the same dataset, with every object assigned to the cluster of its closest representative 1–4.]
Objective of RSC: Find a subset O_R of O such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X).

SC Algorithms Currently Investigated
1. Supervised Partitioning Around Medoids (SPAM)
2. Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR)
3. Top Down Splitting Algorithm (TDS)
4. Supervised Clustering using Evolutionary Computing (SCEC)
5. Agglomerative Hierarchical Supervised Clustering (AHSC)
6. Grid-Based Supervised Clustering (GRIDSC)
The remainder of this talk centers on algorithms for supervised clustering. We are currently investigating several clustering algorithms and comparing their performance.

A Fitness Function for Supervised Clustering
q(X) := Impurity(X) + β·Penalty(k)
k: number of clusters used
n: number of examples in the dataset
c: number of classes in the dataset
β: weight for Penalty(k), 0 < β ≤ 2.0
This transparency gives a fitness function that has to be minimized by a supervised clustering algorithm. The fitness function is the sum of the impurity of a clustering X and a penalty associated with the number of clusters k used. The fitness function uses a parameter β: if we are interested in obtaining a large number of clusters, we would use a low β value, since high β values penalize clusterings with many clusters. Penalty(k) increases sub-linearly, because increasing the number of clusters from k to k+1 has a greater effect on the end result when k is small than when it is large.
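
The slide leaves Impurity(X) and Penalty(k) unspecified. One concrete choice consistent with the description above (no penalty at k = c, sub-linear growth in k) is the following; it is a hedged restatement in the spirit of [EZZ04], not necessarily the exact formula used in every experiment:

```latex
q(X) = \mathrm{Impurity}(X) + \beta \cdot \mathrm{Penalty}(k),\quad
\mathrm{Impurity}(X) = \frac{\#\,\text{minority examples}}{n},\quad
\mathrm{Penalty}(k) =
\begin{cases}
\sqrt{\frac{k-c}{n}} & k \ge c\\
0 & k < c
\end{cases}
```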

Algorithm SRIDHCR (Greedy Hill Climbing)
REPEAT r TIMES
  curr := a randomly created set of representatives (with size between c+1 and 2c)
  WHILE NOT DONE DO
    Create new solutions S by adding a single non-representative to curr and by removing a single representative from curr
    Determine the element s in S for which q(s) is minimal (if there is more than one minimal element, randomly pick one)
    IF q(s) < q(curr) THEN curr := s
    ELSE IF q(s) = q(curr) AND |s| > |curr| THEN curr := s
    ELSE terminate and return curr as the solution for this run
Report the best of the r solutions found.
Highlights: k is not an input parameter; SRIDHCR searches for the best k within the range induced by β, and reports the best clustering found in r runs.
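
For concreteness, here is a minimal Python sketch of SRIDHCR under the fitness function above. The helper names (fitness, sridhcr), the Euclidean distance, and the square-root penalty are our assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def fitness(X, y, reps, beta):
    """q(X) = Impurity(X) + beta * Penalty(k); the sqrt penalty is an assumption (see above)."""
    # Cluster every example with its closest representative.
    d = np.linalg.norm(X[:, None, :] - X[reps][None, :, :], axis=2)
    assign = d.argmin(axis=1)
    # Impurity: fraction of examples that are not in their cluster's majority class.
    minority = 0
    for j in range(len(reps)):
        labels = y[assign == j]
        if len(labels):
            minority += len(labels) - np.bincount(labels).max()
    k, n, c = len(reps), len(X), len(np.unique(y))
    penalty = np.sqrt((k - c) / n) if k >= c else 0.0
    return minority / n + beta * penalty

def sridhcr(X, y, beta=0.4, r=10, rng=np.random.default_rng(0)):
    """Greedy hill climbing over representative sets, restarted r times."""
    n, c = len(X), len(np.unique(y))
    best, best_q = None, np.inf
    for _ in range(r):
        size = int(rng.integers(c + 1, 2 * c + 1))        # random initial size in [c+1, 2c]
        curr = list(rng.choice(n, size=size, replace=False))
        curr_q = fitness(X, y, curr, beta)
        while True:
            # Neighbors: add one non-representative, or drop one representative.
            neighbors = [curr + [i] for i in range(n) if i not in curr]
            if len(curr) > 2:
                neighbors += [[i for i in curr if i != j] for j in curr]
            # Prefer lower q; break ties in favor of more representatives.
            s_q, s = min(((fitness(X, y, s, beta), s) for s in neighbors),
                         key=lambda t: (t[0], -len(t[1])))
            if s_q < curr_q or (s_q == curr_q and len(s) > len(curr)):
                curr, curr_q = s, s_q
            else:
                break                                     # local optimum: end this run
        if curr_q < best_q:
            best, best_q = curr, curr_q
    return best, best_q
```

Note that k is not an input: the add/remove search, steered by β through the penalty term, settles on the number of representatives by itself.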

Supervised Clustering using Evolutionary Computing: SCEC
[Diagram: an initial generation of solutions evolves generation by generation via mutation, crossover, and copying; the best solution of the final generation is returned as the result.]
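
The flowchart compresses a standard evolutionary loop. Below is one plausible realization over representative sets, reusing fitness() from the SRIDHCR sketch; the population size, selection scheme, and operator details are illustrative guesses, since the slide does not specify them.

```python
import numpy as np  # reuses fitness() from the SRIDHCR sketch above

def scec(X, y, beta=0.4, pop_size=20, generations=50, rng=np.random.default_rng(0)):
    """Evolve a population of representative sets; return the fittest individual."""
    n, c = len(X), len(np.unique(y))
    pop = [list(rng.choice(n, size=int(rng.integers(c + 1, 2 * c + 1)), replace=False))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda s: fitness(X, y, s, beta))
        survivors = pop[:pop_size // 2]                   # copy: the best half survives
        children = []
        while len(survivors) + len(children) < pop_size:
            i, j = rng.choice(len(survivors), size=2, replace=False)
            merged = list(set(survivors[i]) | set(survivors[j]))  # crossover: union of parents,
            child = list(rng.choice(merged, size=max(2, len(merged) // 2), replace=False))
            if rng.random() < 0.3:                        # mutation: swap in a random object
                child[int(rng.integers(len(child)))] = int(rng.integers(n))
            children.append(list(set(child)))             # de-duplicate after mutation
        pop = survivors + children
    return min(pop, key=lambda s: fitness(X, y, s, beta))
```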

Supervised Clustering --- Algorithms and Applications
Organization of the Talk
1. Supervised Clustering
2. Representative-based Supervised Clustering Algorithms
3. Applications: Using Supervised Clustering
   a. for Dataset Editing
   b. for Class Decomposition
   c. for Region Discovery in Spatial Datasets
4. Conclusion and Future Work

Nearest Neighbour Rule
Consider a two-class problem where each sample consists of two measurements (x, y).
k = 1: for a given query point q, assign the class of the nearest neighbour.
k = 3: compute the k nearest neighbours and assign the class by majority vote.
Problem: requires a “good” distance function.
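
As a baseline for the editing techniques that follow, here is a minimal k-NN classifier in the same Python style; Euclidean distance stands in for the “good” distance function the slide asks for.

```python
import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    d = np.linalg.norm(X_train - query, axis=1)    # Euclidean distance to every training point
    nearest = np.argsort(d)[:k]                    # indices of the k closest neighbours
    return np.bincount(y_train[nearest]).argmax()  # majority class (k=1: the nearest's class)
```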

3a. Dataset Reduction: Editing
Training data may contain noise, overlapping classes.
Editing seeks to remove noisy points and produce smooth decision boundaries, often by retaining points far from the decision boundaries.
Main goal of editing: enhance the accuracy of the classifier (% of “unseen” examples classified correctly).
Secondary goal of editing: enhance the speed of a k-NN classifier.

Wilson Editing [Wilson 1972]
Remove points that do not agree with the majority of their k nearest neighbours.
[Figure: two examples (the earlier example and an overlapping-classes case), each shown as original data and after Wilson editing with k = 7.]
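
A minimal sketch of Wilson editing under the same assumptions (integer class labels, Euclidean distance); the name wilson_edit is ours, not from the slide.

```python
import numpy as np

def wilson_edit(X, y, k=3):
    """Remove every point whose class disagrees with the majority of its k nearest neighbours."""
    keep = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                              # a point is not its own neighbour
        nearest = np.argsort(d)[:k]
        if np.bincount(y[nearest], minlength=int(y.max()) + 1).argmax() == y[i]:
            keep.append(i)                         # the point agrees with its neighbourhood
    return X[keep], y[keep]
```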

RSC → Dataset Editing
One application of supervised clustering is dataset editing. The idea of dataset editing is to remove examples from a training set with the goal of enhancing the accuracy of a classifier: the cluster representatives, determined using supervised clustering, are used instead of the whole dataset when training the classifier.
[Figure: (a) dataset clustered using supervised clustering, with representatives A–F; (b) dataset edited using the cluster representatives; both in Attribute1/Attribute2 space.]

Supervised Clustering vs. Clustering the Examples of Each Class Separately
Approaches to discover subclasses of a given class:
1. Cluster the examples of each class separately
2. Use supervised clustering
[Figure 4: supervised clustering editing vs. clustering each class (x and o) separately.]
Remark: A traditional clustering algorithm, such as k-medoids, would pick the centrally located o as the cluster representative, because it is “blind” to how the examples of other classes are distributed, whereas supervised clustering would pick a different o. The central o is not a good choice for editing, because it attracts points of class x, which leads to misclassifications.

Experimental Evaluation
We compared a traditional 1-NN classifier with Supervised Clustering Editing (SCE). A benchmark consisting of 8 UCI datasets was used for this purpose. Accuracies were computed using 10-fold cross-validation. SRIDHCR was used for supervised clustering. SCE was tested at different compression rates by associating different penalties with the number of clusters found (by setting parameter β to 0.4 and 1.0). Compression rates of SCE and Wilson editing were computed as 1 − k/n, with n being the size of the original dataset and k the size of the edited dataset.

Experimental Results (Table 4)

Summary: SCE vs. 1-NN Classifier
SCE achieved very high compression rates without loss in accuracy for 5 of the 8 datasets tested.
SCE accomplished a significant improvement in accuracy for 3 of the 8 datasets tested.
Surprisingly, many UCI datasets can be compressed to just a single representative per class without a significant loss in accuracy.
SCE, in contrast to other editing techniques, removes examples that are classified correctly as well as examples that are classified incorrectly from the dataset. This explains its much higher compression rates compared to other techniques.
SCE frequently picks representatives that are in the center of a region dominated by a single class; sometimes, however, for clusters with more complex shapes, representatives need to be lined up across from each other to avoid attracting points in neighboring clusters.

Complex9 Dataset

Supervised Clustering Result for Complex9

Diamonds9 dataset clustered using SC algorithm SRIDHCR

Future Direction of this Research
[Diagram: a preprocessing step p maps Data Set to Data Set′; the learning algorithm IDLA is applied to Data Set to obtain classifier C and to Data Set′ to obtain classifier C′.]
Goal: Find p such that C′ is more accurate than C, or such that C and C′ have approximately the same accuracy but C′ can be learnt more quickly and/or classifies new examples more quickly.

3.b Class Decomposition
Simple classifiers encompass a small class of approximating functions and have limited flexibility in their decision boundaries.
[Figure: decision boundaries of a simple classifier before and after class decomposition, in Attribute1/Attribute2 space.]

Naïve Bayes vs. Naïve Bayes with Class Decomposition
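
The comparison on this slide can be reproduced in spirit with a few lines of scikit-learn: decompose each class into subclasses (here with k-means, a simplification; the talk uses supervised clustering), train Naïve Bayes on the subclass labels, and map predictions back to the parent classes. The dataset, cluster counts, and helper names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

def nb_with_class_decomposition(X_train, y_train, X_test, subclasses_per_class=3):
    """Train Naive Bayes on subclass labels, then map predictions back to parent classes."""
    sub_labels = np.empty(len(X_train), dtype=int)
    sub_to_class, next_id = {}, 0
    for cls in np.unique(y_train):
        idx = np.where(y_train == cls)[0]
        km = KMeans(n_clusters=subclasses_per_class, n_init=10, random_state=0).fit(X_train[idx])
        for j in range(subclasses_per_class):
            sub_to_class[next_id + j] = cls       # remember each subclass's parent class
        sub_labels[idx] = km.labels_ + next_id
        next_id += subclasses_per_class
    model = GaussianNB().fit(X_train, sub_labels)
    return np.array([sub_to_class[s] for s in model.predict(X_test)])
```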

3.c Discovery of Interesting Regions for Spatial Data Mining
Task: given 2D/3D datasets, discover interesting regions in the dataset that maximize a given fitness function. Examples of region discovery include:
- Discover regions that have significant deviations from the prior probability of a class; e.g. regions in the state of Wyoming where people are very poor or not poor at all
- Discover regions that have significant variation in income (fitness is defined based on the variance with respect to income in a region)
- Discover congested regions for traffic control
Our approach: We use (supervised) clustering to discover such regions, with a fitness function representing a particular measure of interestingness; regions are implicitly defined by the set of points that belong to a cluster.

Wyoming Map

Household Income in 1999: Wyoming Park County

Clusters → Regions
Example: two clusters, in red and blue, are given; regions are defined by using a Voronoi diagram based on a NN classifier with k = 7; regions are shown in grey and white.

An Evaluation Scheme for Discovering Regions that Deviate from the Prior Probability of a Class C
Let prior(C) = |C|/n, and let p(c,C) = percentage of examples in cluster c that belong to class C.
Reward(c) is computed based on p(c,C), prior(C), and the parameters γ1, γ2, R+, R− (γ1 ≤ 1 ≤ γ2; R+, R− ≥ 0), relying on an interpolation function t(p(c,C), prior(C), γ1, γ2, R+, R−) (e.g. γ1 = 0.8, γ2 = 1.2, R+ = 1, R− = 1).
q_C(X) = Σ_{c∈X} ( t(p(c,C), prior(C), γ1, γ2, R+, R−) · |c|^β ) / n, with β > 1 (typically 1.0001 < β < 2); the idea is that increases in cluster size are rewarded non-linearly, favoring clusters with more points as long as |c|·t(…) increases.
[Figure: Reward(c) plotted against p(c,C), with reference points prior(C)·γ1, prior(C), prior(C)·γ2, and 1 on the x-axis and R+ and R− on the y-axis.]
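
The shape of t is only shown as a plot on the slide; the following piecewise-linear form is a hedged reconstruction consistent with the plotted reference points (zero reward in a corridor around the prior, rising linearly to R− for strongly under-represented and to R+ for strongly over-represented classes), in the spirit of Eick et al.'s region-discovery work rather than a verbatim definition.

```latex
t(p,\pi,\gamma_1,\gamma_2,R^+,R^-) =
\begin{cases}
R^- \cdot \dfrac{\gamma_1\pi - p}{\gamma_1\pi} & 0 \le p < \gamma_1\pi \\[6pt]
0 & \gamma_1\pi \le p \le \gamma_2\pi \\[6pt]
R^+ \cdot \dfrac{p - \gamma_2\pi}{1 - \gamma_2\pi} & \gamma_2\pi < p \le 1
\end{cases}
\qquad \text{where } p = p(c,C),\ \pi = \mathrm{prior}(C).
```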

Ch. Eick Example: Discovery of “Interesting Regions” in Wyoming Census 2000 Datasets

Supervised Clustering --- Algorithms and Applications
Organization of the Talk
1. Supervised Clustering
2. Representative-based Supervised Clustering Algorithms
3. Applications: Using Supervised Clustering
   a. for Dataset Editing
   b. for Class Decomposition
   c. for Region Discovery in Spatial Datasets
4. Summary and Future Work

4. Summary and Future Work
A novel data mining technique, which we term “supervised clustering”, was introduced. The benefits of using supervised clustering as a preprocessing step to enhance classification algorithms, such as NN classifiers and naïve Bayesian classifiers, were demonstrated. In our current research, we investigate the use of supervised clustering for spatial data mining, distance function learning, and discovering subclasses. Moreover, we investigate how to make supervised clustering adaptive with respect to user feedback.

An Environment for Adaptive (Supervised) Clustering for Summary Generation Applications
[Diagram: a clustering algorithm receives inputs and predefined fitness functions q(X), …; an evaluation system assesses clustering quality using feedback from a domain expert and past experience, and an adaptation system changes the algorithm’s inputs accordingly.]
Idea: Development of a generic clustering/feedback/adaptation architecture whose objective is to facilitate the search for clusterings that maximize an internally and/or externally given reward function.

Links to 5 Related Papers
[VAE03] R. Vilalta, M. Achari, C. Eick, Class Decomposition via Clustering: A New Framework for Low-Variance Classifiers, in Proc. IEEE International Conference on Data Mining (ICDM), Melbourne, Florida, November 2003. http://www.cs.uh.edu/~ceick/kdd/VAE03.pdf
[EZZ04] C. Eick, N. Zeidat, Z. Zhao, Supervised Clustering --- Algorithms and Benefits, short version of this paper to appear in Proc. International Conference on Tools with AI (ICTAI), Boca Raton, Florida, November 2004. http://www.cs.uh.edu/~ceick/kdd/EZZ04.pdf
[ERBV04] C. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta, Using Clustering to Learn Distance Functions for Supervised Similarity Assessment, to appear in Proc. MLDM'05, Leipzig, Germany, July 2005. http://www.cs.uh.edu/~ceick/kdd/ERBV04.pdf
[EZV04] C. Eick, N. Zeidat, R. Vilalta, Using Representative-Based Clustering for Nearest Neighbor Dataset Editing, to appear in Proc. IEEE International Conference on Data Mining (ICDM), Brighton, England, November 2004. http://www.cs.uh.edu/~ceick/kdd/EZV04.pdf
[ZSE05] N. Zeidat, S. Wang, and C. Eick, Data Set Editing Techniques: A Comparative Study, submitted for publication. http://www.cs.uh.edu/~ceick/kdd/ZSE04.pdf