1
Using Clustering to Enhance Classifiers
Christoph F. Eick
Organization of the Talk:
1. Brief Introduction to KDD
2. Using Clustering
   a. for Nearest Neighbour Editing
   b. for Distance Function Learning
   c. for Class Decomposition
3. Representative-Based Supervised Clustering Algorithms
4. Summary and Conclusion
2
Objectives of Today's Presentation
Goal: To give you a flavor of the kinds of questions and techniques investigated by my/our current research
Brief introduction to KDD
Not discussed:
– Why is KDD/classification/clustering important?
– Example applications for KDD/classification/clustering.
– Evaluation of the presented techniques (if you are interested in how the techniques presented here compare with other approaches, you can read [VAE03], [EZZ04], [ERBV04], [EZV04], [RE05]).
– Literature survey
3
1. Knowledge Discovery in Data [and Data Mining] (KDD)
Let us find something interesting!
Definition: "KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" (Fayyad)
Frequently, the term data mining is used to refer to KDD.
Many commercial and experimental tools and tool suites are available (see http://www.kdnuggets.com/siftware.html).
4
KDD: Confluence of Multiple Disciplines
KDD draws on database technology, statistics, machine learning, information science, visualization, and other disciplines.
5
Popular KDD Tasks
– Classification (learn how to classify)
– Clustering (finding groups of similar objects)
– Estimation and prediction (learn a function that predicts the value of a continuous output variable from a set of input variables)
– Deviation and fraud detection
– Concept description: characterization and discrimination
– Trend and evolution analysis
– Mining for associations and correlations
– Text mining
– Web mining
– Visualization
– Data transformation and data cleaning
– Data integration and data warehousing
6
Important KDD Conferences
– KDD (500-900 participants, strong industrial presence, KDD-Cup, controlled by ACM)
– ICDM (receives approx. 500 papers each year, controlled by IEEE)
– PKDD (European KDD conference)
7
2. Clustering for Classification
Assumption: We have a data set containing classified examples.
Goal: We want to learn a function (a classifier) that classifies an example based on its characteristics (attributes).
Example: http://www2.cs.uh.edu/~wxstrong/AI/nba.data and http://www2.cs.uh.edu/~wxstrong/AI/nba.names
Topic for the next 40 minutes: presentation of 3 different approaches that use clustering to obtain better classifiers.
8
People Who Contributed to the Work Presented in Today's Presentation
Tae-Wan Ryu, Ricardo Vilalta, Murali Achari, Alain Rouhana, Abraham Bagherjeiran, Chunshen Chen, Nidal Zeidat, Zhenghong Zhao
9
Nearest Neighbour Rule
Consider a two-class problem where each sample consists of two measurements (x, y).
– k = 1: for a given query point q, assign the class of the nearest neighbour.
– k = 3: compute the k nearest neighbours and assign the class by majority vote.
Problem: requires a "good" distance function.
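To make the rule concrete, here is a minimal sketch (not from the slides) of a k-nearest-neighbour classifier in Python, assuming plain Euclidean distance and majority voting:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - query, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest points
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# toy two-class problem with two measurements (x, y)
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [3.0, 3.0], [3.1, 2.9], [2.9, 3.2]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([1.1, 1.0]), k=1))   # -> 0
print(knn_predict(X, y, np.array([2.5, 2.5]), k=3))   # -> 1
```

With k = 1 the decision boundary follows individual training points, which is why noisy points hurt and why the editing techniques discussed next help.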
10
2a. Dataset Reduction: Editing
Training data may contain noise and overlapping classes.
Editing seeks to remove noisy points and produce smooth decision boundaries, often by retaining points far from the decision boundaries.
Main goal of editing: enhance the accuracy of the classifier (% of "unseen" examples classified correctly).
Secondary goal of editing: enhance the speed of a k-NN classifier.
11
Wilson Editing (Wilson 1972)
Remove points that do not agree with the majority of their k nearest neighbours.
[Figures: original data vs. Wilson editing with k = 7, for the earlier example and for overlapping classes]
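A minimal sketch of the Wilson editing step, assuming Euclidean distance and leave-one-out neighbourhoods (an illustration, not the exact implementation used in the experiments):

```python
import numpy as np
from collections import Counter

def wilson_edit(X, y, k=3):
    """Return indices of points whose class agrees with the majority of their k nearest neighbours."""
    keep = []
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                          # leave the point itself out
        neighbours = np.argsort(dists)[:k]
        majority = Counter(y[neighbours]).most_common(1)[0][0]
        if majority == y[i]:
            keep.append(i)                         # point agrees with its neighbourhood, so keep it
    return np.array(keep)

# usage: idx = wilson_edit(X, y, k=7); X_edited, y_edited = X[idx], y[idx]
```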
12
Traditional Clustering
Partition a set of objects into groups of similar objects; each group is called a cluster.
Clustering is used to "detect classes" in a data set ("unsupervised learning").
Clustering is based on a fitness function that relies on a distance measure and usually tries to create "tight" clusters.
13
Objective of Supervised Clustering: Minimize cluster impurity while keeping the number of clusters low (expressed by a fitness function q(X)).
14
Representative-Based Supervised Clustering (RSC)
Aims at finding a subset of the objects in the data set (called representatives) that best represents all the objects; each representative corresponds to a cluster.
The remaining objects in the data set are then clustered around these representatives by assigning each object to the cluster of the closest representative.
Remark: The popular k-medoid algorithm, also called PAM, is a representative-based clustering algorithm.
15
Representative-Based Supervised Clustering (continued)
[Figure: example dataset in the Attribute1/Attribute2 plane with four representatives labelled 1-4]
16
Representative-Based Supervised Clustering (continued)
[Figure: the same dataset clustered around representatives 1-4]
Objective of RSC: Find a subset O_R of O such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X).
17
RSC Dataset Editing
a. Dataset clustered using supervised clustering (representatives A-F).
b. Dataset edited using the cluster representatives.
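The editing step itself is simple once the representatives are known; the sketch below assumes the representative indices have already been produced by a supervised clustering algorithm and shows how the edited set is used by a 1-NN classifier:

```python
import numpy as np

def edit_with_representatives(X, y, rep_idx):
    """RSC-style editing: replace the training set by its cluster representatives.

    `rep_idx` (indices of the representatives) is assumed to come from a
    supervised clustering run; each representative keeps its class label."""
    return X[rep_idx], y[rep_idx]

def one_nn(X_reps, y_reps, query):
    """Classify a query with 1-NN against the (much smaller) edited set."""
    dists = np.linalg.norm(X_reps - query, axis=1)
    return y_reps[np.argmin(dists)]

# usage (rep_idx is the hypothetical output of a supervised clustering algorithm):
#   X_ed, y_ed = edit_with_representatives(X, y, rep_idx)
#   label = one_nn(X_ed, y_ed, new_example)
```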
18
Experimental Evaluation

Dataset (size)        β     NR     Wilson  1-NN   C4.5
Glass (214)           0.1   0.636  0.607   0.692  0.677
                      0.4   0.589  0.607   0.692  0.677
                      1.0   0.575  0.607   0.692  0.677
Heart-StatLog (270)   0.1   0.796  0.804   0.767  0.782
                      0.4   0.833  0.804   0.767  0.782
                      1.0   0.838  0.804   0.767  0.782
Diabetes (768)        0.1   0.736  0.734   0.690  0.745
                      0.4   0.736  0.734   0.690  0.745
                      1.0   0.745  0.734   0.690  0.745
Vehicle (846)         0.1   0.667  0.716   0.700  0.723
                      0.4   0.667  0.716   0.700  0.723
                      1.0   0.665  0.716   0.700  0.723
Waveform (5000)       0.1   0.834  0.796   0.768  0.781
                      0.4   0.841  0.796   0.768  0.781
                      1.0   0.837  0.796   0.768  0.781
19
General Direction of this Research
[Diagram: the same inductive learning algorithm (IDLA) produces classifier C from the original Data Set and classifier C' from a preprocessed Data Set']
Goal: Find a mapping from Data Set to Data Set' such that
– C' is more accurate than C, or
– C and C' have approximately the same accuracy, but C' can be learnt more quickly and/or C' classifies new examples more quickly.
20
2b. Using Clustering in Distance Function Learning
Example: How to find similar patients?
The following relation is given (with 10000 tuples):
Patient(ssn, weight, height, cancer-sev, eye-color, age, …)
Attribute domains:
– ssn: 9 digits
– weight: between 30 and 650; μ_weight = 158, σ_weight = 24.20
– height: between 0.30 and 2.20 metres; μ_height = 1.52, σ_height = 19.2
– cancer-sev: 4 = serious, 3 = quite serious, 2 = medium, 1 = minor
– eye-color: {brown, blue, green, grey}
– age: between 3 and 100; μ_age = 45, σ_age = 13.2
Task: Define patient similarity.
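One way to make the task concrete is a weighted per-attribute distance over Patient tuples; the sketch below is illustrative only: the weights, the normalisation by standard deviation, and the unit scale for cancer-sev are assumptions, while the σ values are the ones quoted above.

```python
def patient_distance(p, q, weights):
    """Weighted per-attribute distance between two patients given as dicts.

    Numeric attributes are compared by absolute difference scaled by the
    standard deviations quoted on the slide; categorical attributes by 0/1
    mismatch. Treating cancer-sev as numeric with unit scale is an assumption."""
    sigma = {"weight": 24.20, "height": 19.2, "age": 13.2, "cancer-sev": 1.0}
    d = 0.0
    for att, w in weights.items():
        if att in sigma:                               # numeric attribute
            d += w * abs(p[att] - q[att]) / sigma[att]
        else:                                          # categorical attribute (e.g. eye-color)
            d += w * (0.0 if p[att] == q[att] else 1.0)
    return d

p1 = {"weight": 160, "height": 1.70, "age": 40, "cancer-sev": 2, "eye-color": "blue"}
p2 = {"weight": 150, "height": 1.65, "age": 55, "cancer-sev": 3, "eye-color": "brown"}
print(patient_distance(p1, p2, {"weight": 1.0, "height": 1.0, "age": 1.0,
                                "cancer-sev": 2.0, "eye-color": 0.5}))
```

The open question, addressed in the following slides, is how to choose the attribute weights; this is where clustering comes in.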
21
CAL-FULL/UH Database Clustering & Similarity Assessment Environment
[Architecture diagram; components include: DBMS, Data Extraction Tool, Object View, Similarity Measure Tool (with a library of similarity measures, default choices and domain information, and type and weight information), Clustering Tool (with a library of clustering algorithms), Learning Tool (with training data; today's topic), User Interface, and the resulting similarity measure and set of clusters]
For more details see [RE05].
22
Similarity Assessment Framework and Objectives
Objective: Learn a good distance function θ for classification tasks.
Our approach: Apply a clustering algorithm with the distance function θ to be evaluated, which returns a number of clusters k. The more pure the obtained clusters are, the better the quality of θ.
Our goal is to learn the weights of an object distance function θ such that all clusters are pure (or as pure as possible); for more details see the [ERBV04] paper.
23
Idea: Coevolving Clusters and Distance Functions
[Diagram: a loop in which the current distance function drives the clustering X, a clustering evaluation step computes the goodness q(X) of the distance function, and a weight updating scheme / search strategy revises the distance function]
[Illustration: with a "bad" distance function the clusters mix x and o examples; with a "good" distance function each cluster is dominated by one class]
24
Idea: Inside/Outside Weight Updating
Idea: Move examples of the majority class closer to each other (o := examples belonging to the majority class, x := non-majority-class examples).
[Illustration for Cluster 1: with respect to Att1 the majority-class examples are already close together, so the weight of Att1 is increased; with respect to Att2 they are spread out, so the weight of Att2 is decreased]
25
Sample Run of IOWU for the Diabetes Dataset
[Graph produced by Abraham Bagherjeiran]
26
Research Framework: Distance Function Learning
– Weight-updating scheme / search strategy: Random Search, Randomized Hill Climbing, Inside/Outside Weight Updating, …
– Distance function evaluation: K-Means, Supervised Clustering, NN-Classifier, …
– Other work
27
2c. Using Clustering for Class Decomposition
[Illustration in the Attribute1/Attribute2 plane: the Ford class decomposes into Ford Trucks, Ford SUV, and Ford Vans sub-clusters, and the GMC class into GMC Trucks, GMC SUV, and GMC Van sub-clusters]
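A minimal sketch of the class-decomposition idea: each class is clustered separately into sub-classes, a simple classifier is trained on the sub-class labels, and predictions are mapped back to the original classes. The choice of k-means and logistic regression below is mine for illustration; [VAE03] describes the actual framework.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def decompose_classes(X, y, clusters_per_class=2, seed=0):
    """Split every original class into sub-classes by clustering it separately."""
    sub_y = np.empty(len(y), dtype=int)
    sub_to_class = {}
    next_id = 0
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        labels = KMeans(n_clusters=clusters_per_class, random_state=seed,
                        n_init=10).fit_predict(X[idx])
        for l in np.unique(labels):
            sub_y[idx[labels == l]] = next_id
            sub_to_class[next_id] = c      # remember which original class the sub-class belongs to
            next_id += 1
    return sub_y, sub_to_class

def fit_decomposed(X, y):
    sub_y, sub_to_class = decompose_classes(X, y)
    clf = LogisticRegression(max_iter=1000).fit(X, sub_y)   # simple (linear) classifier on sub-classes
    return clf, sub_to_class

def predict_decomposed(clf, sub_to_class, X_new):
    return np.array([sub_to_class[s] for s in clf.predict(X_new)])  # map sub-class back to class

if __name__ == "__main__":
    X = np.array([[0, 0], [0.2, 0.1], [5, 5], [5.1, 4.9],      # class 0: two sub-groups
                  [0, 5], [0.1, 5.2], [5, 0], [4.9, 0.2]])      # class 1: two sub-groups
    y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
    clf, mapping = fit_decomposed(X, y)
    print(predict_decomposed(clf, mapping, np.array([[0.1, 0.0], [5.0, 0.1]])))   # -> [0 1]
```

The point of the decomposition is that a simple (e.g. linear) classifier that cannot fit the original multi-modal classes can often fit the sub-classes.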
28
RSC Enhances Simple Classifiers
[Illustration: a dataset in the Attribute1/Attribute2 plane with sub-clusters A, B, C, D]
29
3. SC Algorithms Currently Investigated
1. Supervised Partitioning Around Medoids (SPAM)
2. Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR)
3. Top Down Splitting Algorithm (TDS)
4. Supervised Clustering using Evolutionary Computing (SCEC)
5. Agglomerative Hierarchical Supervised Clustering (AHSC)
30
A Fitness Function for Supervised Clustering
q(X) := Impurity(X) + β · Penalty(k)
where
– k: number of clusters used
– n: number of examples in the dataset
– c: number of classes in the dataset
– β: weight for Penalty(k), 0 < β ≤ 2.0
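A sketch of the fitness function; the exact forms of Impurity(X) and Penalty(k) are not reproduced on the slide, so the instantiation below (impurity = fraction of examples outside their cluster's majority class, penalty = sqrt((k - c)/n) once k exceeds c) is an assumption based on the cited supervised-clustering papers.

```python
import math
from collections import Counter

def q(clusters, n, c, beta=0.4):
    """Fitness of a clustering X: q(X) = Impurity(X) + beta * Penalty(k).

    `clusters` is a list of lists of class labels, one list per cluster.
    Assumed instantiation (see [EZZ04]):
      Impurity(X) = (# examples not in their cluster's majority class) / n
      Penalty(k)  = sqrt((k - c) / n) if k > c, else 0
    """
    k = len(clusters)
    minority = sum(len(cl) - Counter(cl).most_common(1)[0][1] for cl in clusters)
    impurity = minority / n
    penalty = math.sqrt((k - c) / n) if k > c else 0.0
    return impurity + beta * penalty

# e.g. two clusters over a 2-class dataset of 6 examples:
print(q([[0, 0, 1], [1, 1, 1]], n=6, c=2, beta=0.4))   # impurity = 1/6, no penalty
```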
31
Applications of Supervised Clustering
– Enhance classification algorithms:
  – use SC for dataset editing to enhance NN-classifiers [ICDM04]
  – improve simple classifiers [ICDM03]
– Learning sub-classes
– Distance function learning [ERBV04]
– Dataset compression/reduction
– Redistricting
– Meta learning / creating signatures for datasets
32
4. Summary
We gave a brief introduction to KDD.
We demonstrated how clustering can be used to obtain "better" classifiers.
We introduced a new form of clustering, called supervised clustering, for this purpose.
33
Research Topics 2004-2005
– Inductive Learning / Data Mining
  – Decision trees, nearest neighbour classifiers
  – Using clustering to enhance classification algorithms
  – Making sense of data
– Supervised Clustering
  – Learning subclasses
  – Supervised clustering algorithms that learn clusters with arbitrary shape
  – Redistricting algorithms
– Tools for Similarity Assessment and Distance Function Learning
– Data Set Compression and Creating Meta Knowledge for Local Learning Techniques
  – Comparative study involving traditional editing and condensing and unusual techniques
  – Creating maps and other data set signatures for datasets based on editing, SC, and other techniques
– Traditional Clustering
– Data Mining and Information Retrieval for Structured Data
– Other: Evolutionary Computing, File Prediction, Ontologies, Heuristic Search, Reinforcement Learning, Data Models
Remark: Topics that were "covered" in this talk are in blue.
34
Where to Find References?
Data mining and KDD (SIGKDD member CD-ROM):
– Conference proceedings: KDD, ICDM, PKDD, etc.
– Journal: Data Mining and Knowledge Discovery
Database field (SIGMOD member CD-ROM):
– Conference proceedings: ACM-SIGMOD, VLDB, ICDE, EDBT, DASFAA
– Journals: ACM-TODS, J. ACM, IEEE-TKDE, JIIS, etc.
AI and Machine Learning:
– Conference proceedings: ICML, AAAI, IJCAI, etc.
– Journals: Machine Learning, Artificial Intelligence, etc.
Statistics:
– Conference proceedings: Joint Stat. Meeting, etc.
– Journals: Annals of Statistics, etc.
Visualization:
– Conference proceedings: CHI, etc.
– Journals: IEEE Trans. Visualization and Computer Graphics, etc.
35
Links to 5 Papers
[VAE03] R. Vilalta, M. Achari, C. Eick, Class Decomposition via Clustering: A New Framework for Low-Variance Classifiers, in Proc. IEEE International Conference on Data Mining (ICDM), Melbourne, Florida, November 2003. http://www.cs.uh.edu/~ceick/kdd/VAE03.pdf
[EZZ04] C. Eick, N. Zeidat, Z. Zhao, Supervised Clustering --- Algorithms and Benefits, short version to appear in Proc. International Conference on Tools with AI (ICTAI), Boca Raton, Florida, November 2004. http://www.cs.uh.edu/~ceick/kdd/EZZ04.pdf
[ERBV04] C. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta, Using Clustering to Learn Distance Functions for Supervised Similarity Assessment, in revision, to be submitted to MLDM'05, Leipzig, Germany, July 2005. http://www.cs.uh.edu/~ceick/kdd/ERBV04.pdf
[EZV04] C. Eick, N. Zeidat, R. Vilalta, Using Representative-Based Clustering for Nearest Neighbor Dataset Editing, to appear in Proc. IEEE International Conference on Data Mining (ICDM), Brighton, England, November 2004. http://www.cs.uh.edu/~ceick/kdd/EZV04.pdf
[RE05] T.-W. Ryu, C. Eick, A Database Clustering Methodology and Tool, to appear in Information Sciences, Spring 2005. http://www.cs.uh.edu/~ceick/kdd/RE05.doc
36
Weight Adjustment within a Cluster (Work at UH)
Let w_i be the current weight of the i-th attribute.
Let D_i be the average distance, with respect to attribute i, of the examples that belong to the cluster.
Let d_i be the average distance, with respect to attribute i, of the examples that belong to the majority class of the cluster.
Learning: the weights are then adjusted as follows with respect to a particular cluster:
    w_i' = w_i + α (D_i − d_i)
or, better, with a relative and capped adjustment,
    w_i' = w_i + w_i · min(γ, max(−γ, α (D_i − d_i)))
where α is the learning rate and γ is the maximal adjustment per weight per cluster (e.g. γ = 0.2 if a weight can be increased/decreased by at most 20%).
Remark: If the cluster is "pure" or does not contain 2 or more elements of a particular class, no weight adjustment takes place.
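A sketch of the inside/outside weight update for one cluster. The names D_i, d_i, α, and γ stand in for symbols that did not survive on the slide, and using the average pairwise distance between members is one possible reading of "average distance with respect to attribute i":

```python
import numpy as np
from collections import Counter
from itertools import combinations

def adjust_weights(X, y, members, weights, alpha=0.3, gamma=0.2):
    """Inside/outside weight updating for a single cluster.

    For each attribute i, compare the average pairwise distance of all cluster
    members (D_i) with that of the majority-class members only (d_i), and move
    the weight by a capped amount proportional to (D_i - d_i)."""
    labels = y[members]
    maj, maj_count = Counter(labels).most_common(1)[0]
    if len(members) < 2 or maj_count < 2 or maj_count == len(members):
        return weights                      # pure cluster or too few majority examples: no update
    maj_members = [m for m in members if y[m] == maj]

    def avg_dist(idx, att):
        pairs = list(combinations(idx, 2))
        return np.mean([abs(X[a, att] - X[b, att]) for a, b in pairs])

    new_w = weights.copy()
    for i in range(X.shape[1]):
        delta = alpha * (avg_dist(members, i) - avg_dist(maj_members, i))
        delta = np.clip(delta, -gamma, gamma)        # cap the relative adjustment (e.g. +/- 20%)
        new_w[i] = weights[i] * (1.0 + delta)
    return new_w

# usage: for each cluster found with the current weights,
#   weights = adjust_weights(X, y, cluster_member_indices, weights)
```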