1
Using Clustering to Learn Distance Functions for Supervised Similarity Assessment
Christoph F. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta
Department of Computer Science, University of Houston
Organization of the Talk
1. Similarity Assessment
2. A Framework for Distance Function Learning
3. Inside/Outside Weight Updating
4. Distance Function Learning Research at UH-DMML
5. Experimental Evaluation
6. Other Distance Function Learning Research
7. Summary
2
1. Similarity Assessment
Definition: Similarity assessment is the task of determining which objects are similar to each other and which are dissimilar to each other.
Goal of similarity assessment: construct a distance function!
Applications of similarity assessment:
– Case-based reasoning
– Classification techniques that rely on distance functions
– Clustering
– ...
Complications:
– Usually, there is no universal "good" distance function for a set of objects; the usefulness of a distance function depends on the task it is used for ("no free lunch" in similarity assessment either).
– Defining the distance between objects is more an art than a science.
3
Motivating Example: How To Find Similar Patients?
The following relation is given (with 10,000 tuples):
Patient(ssn, weight, height, cancer-sev, eye-color, age, ...)
Attribute domains:
– ssn: 9 digits
– weight: between 30 and 650; μ_weight = 158, σ_weight = 24.20
– height: between 0.30 and 2.20 (in meters); μ_height = 1.52, σ_height = 19.2
– cancer-sev: 4 = serious, 3 = quite serious, 2 = medium, 1 = minor
– eye-color: {brown, blue, green, grey}
– age: between 3 and 100; μ_age = 45, σ_age = 13.2
Task: Define patient similarity.
4
CAL-FULL/UH Database Clustering & Similarity Assessment Environments
[Architecture diagram showing: a DBMS and data extraction tool providing an object view; a similarity measure tool that combines default choices and domain information, type and weight information, and a library of similarity measures; a learning tool that uses training data to derive the similarity measure (today's topic); and a clustering tool with a library of clustering algorithms and a user interface that produces a set of clusters.]
For more details, see [RE05].
5
2. A Framework for Distance Function Learning
Assumption: The distance between two objects is computed as the weighted sum of the distances with respect to their attributes.
Objective: Learn a "good" distance function for classification tasks.
Our approach: Apply a clustering algorithm, using the object distance function to be evaluated, that returns k clusters. Our goal is to learn the weights of an object distance function such that pure clusters are obtained (or clusters that are as pure as possible) --- a pure cluster contains examples belonging to a single class.
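As a concrete reading of the weighted-sum assumption, here is a minimal Python sketch of such a per-attribute weighted distance, using the Patient attributes from the earlier slide. The normalizations and the 0/1 distance for eye-color are illustrative choices, not taken from the paper.

```python
import numpy as np

def weighted_distance(x, y, w, attr_dist):
    """Distance between objects x and y as the weighted sum of
    per-attribute distances: d(x, y) = sum_i w_i * d_i(x_i, y_i)."""
    return sum(w_i * d_i(x_i, y_i)
               for w_i, d_i, x_i, y_i in zip(w, attr_dist, x, y))

# Per-attribute distances for the Patient example (illustrative choices:
# numeric attributes normalized by their range, 0/1 distance for eye-color).
attr_dist = [
    lambda a, b: abs(a - b) / 620.0,        # weight: range 30..650
    lambda a, b: abs(a - b) / 1.9,          # height: range 0.30..2.20
    lambda a, b: 0.0 if a == b else 1.0,    # eye-color
]
w = np.array([1.0, 1.0, 1.0])               # attribute weights to be learned
print(weighted_distance((158, 1.52, "blue"), (180, 1.70, "brown"), w, attr_dist))
```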
6
Idea: Coevolving Clusters and Distance Functions
[Diagram: a feedback loop in which a clustering X is produced with the current distance function, a clustering evaluation computes the goodness q(X) of the distance function, and a weight-updating scheme / search strategy modifies the distance function. Example clusterings illustrate a "bad" distance function (clusters mixing x and o examples) and a "good" distance function (pure clusters).]
7
3. Inside/Outside Weight Updating
Idea: Move examples of the majority class closer to each other.
o := examples belonging to the majority class; x := non-majority-class examples.
[Figure: distances in Cluster1 with respect to Att1 (majority-class examples lie close together) and with respect to Att2 (majority-class examples are spread out).]
Action: Increase the weight of Att1.
Action: Decrease the weight of Att2.
8
Inside/Outside Weight Updating Algorithm
1. Cluster the dataset with k-means using a given weight vector w = (w_1, ..., w_p).
2. FOR EACH cluster-attribute pair DO: modify w using inside/outside weight updating.
3. IF NOT DONE, CONTINUE with Step 1; OTHERWISE, RETURN w.
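A minimal Python sketch of this loop, assuming scikit-learn's KMeans. The concrete rule inside the inner loop is an assumption chosen only to illustrate the inside/outside idea (grow a weight when the cluster's majority-class examples are close together on that attribute); it is not the paper's exact formula.

```python
import numpy as np
from sklearn.cluster import KMeans

def inside_outside_weight_updating(X, y, k, alpha=0.3, iterations=200):
    """Sketch of the IOWU loop. X: (n, p) feature matrix,
    y: integer class labels, k: number of clusters."""
    n, p = X.shape
    w = np.ones(p)
    for _ in range(iterations):
        # Step 1: cluster with the current weights; scaling each feature by
        # sqrt(w_i) makes k-means use the weighted squared Euclidean distance.
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X * np.sqrt(w))
        # Step 2: for each cluster-attribute pair, adjust the weight.
        for c in range(k):
            members = labels == c
            if members.sum() < 2:
                continue
            majority = y[members] == np.bincount(y[members]).argmax()
            for i in range(p):
                vals = X[members, i]
                inside = np.abs(vals[majority][:, None] - vals[majority]).mean()
                overall = np.abs(vals[:, None] - vals).mean()
                if overall > 0:
                    # Assumed update: reward attributes on which the cluster's
                    # majority-class examples are closer together than average.
                    w[i] *= 1 + alpha * (overall - inside) / overall
        w *= p / w.sum()          # rescale so the weights stay comparable
    return w
```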
9
Inside/Outside Weight Updating Heuristic
[Figure: two example clusters (Example 1 and Example 2) shown as 1-D strips of majority-class (o) and non-majority-class (x) examples, together with the weight update formula.]
The weight of the i-th attribute w_i is updated as follows for a given cluster:
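The update formula itself appears only as an image on the slide. Purely to illustrate the inside/outside idea, and not as the authors' exact rule, an update of the following form would have the described effect:

```latex
% Illustrative form only -- not the formula from the paper.
% d_i^{in}:  average distance on attribute i among the cluster's majority-class examples
% d_i^{all}: average distance on attribute i among all examples of the cluster
% \alpha:    learning rate
w_i \leftarrow w_i \left( 1 + \alpha \, \frac{d_i^{\mathrm{all}} - d_i^{\mathrm{in}}}{d_i^{\mathrm{all}}} \right)
```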
10
Idea: Inside/Outside Weight Updating
[Figure: a cluster k of six examples plotted against Attribute1, Attribute2, and Attribute3.]
Initial weights: w_1 = w_2 = w_3 = 1; updated weights: w_1 = 1.14, w_2 = 1.32, w_3 = 0.84.
11
Illustration: Net Effect of Weight Adjustments
[Figure: the six examples of cluster k, comparing the old object distances with the new object distances after the weight adjustments.]
12
A Slightly Enhanced Weight Update Formula
13
Sample Run of IOWU for the Diabetes Dataset
14
4. Distance Function Learning Research at UH-DMML
[Diagram: weight-updating schemes / search strategies (inside/outside weight updating [ERBV04], randomized hill climbing, adaptive clustering [BECV05], work by Karypis, ...) combined with distance function evaluation methods (k-means, supervised clustering [EZZ04], NN-classifier, ...); other and current research directions are indicated.]
15
5. Experimental Evaluation
Used a benchmark consisting of 7/15 UCI datasets.
Inside/outside weight updating was run for 200 iterations; the learning rate was set to 0.3.
Evaluation (10-fold cross-validation, repeated 10 times, was used to determine accuracy):
– Used a 1-NN classifier as the baseline classifier.
– Used the learned distance function for a 1-NN classifier.
– Used the learned distance function for an NCC classifier (new!).
16
NCC Classifier
Idea: the training set is replaced by k (centroid, majority class) pairs that are computed using k-means; the dataset generated this way is then used to classify the examples in the test set.
[Figure: (a) the dataset clustered by k-means into clusters A-F over Attribute1 and Attribute2; (b) the dataset edited down to the cluster centroids, each carrying the class label of its cluster's majority class.]
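A minimal Python sketch of such a nearest-cluster-centroid classifier, assuming scikit-learn's KMeans; function and parameter names are illustrative, and the optional weights parameter stands in for a learned distance function.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_ncc(X_train, y_train, k, weights=None):
    """Replace the training set by k (centroid, majority-class) pairs."""
    Xw = X_train if weights is None else X_train * np.sqrt(weights)
    km = KMeans(n_clusters=k, n_init=10).fit(Xw)
    centroids, labels = km.cluster_centers_, km.labels_
    majority = np.array([np.bincount(y_train[labels == c]).argmax()
                         for c in range(k)])
    return centroids, majority

def predict_ncc(centroids, majority, X_test, weights=None):
    """Classify each test example with the class of its nearest centroid."""
    Xw = X_test if weights is None else X_test * np.sqrt(weights)
    nearest = np.argmin(((Xw[:, None, :] - centroids) ** 2).sum(-1), axis=1)
    return majority[nearest]
```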
17
Experimental Evaluation
Dataset          n    k    1-NN    LW1NN   NCC     C4.5
DIABETES         768  35   70.62   68.89   73.07   74.49
VEHICLE          846  64   69.59   69.86   65.94   72.28
HEART-STATLOG    270  10   76.15   77.52   81.07   78.15
GLASS            214  30   69.95   73.5    66.41   67.71
HEART-C          303  25   76.06   76.39   78.77   76.94
HEART-H          294  25   78.33   77.55   81.54   80.22
IONOSPHERE       351  10   87.1    91.73   86.73   89.74
Remark: statistically significant improvements are shown in red in the original slide.
18
DF-Learning With Randomized Hill Climbing
Generate R solutions in the neighborhood of the current weight vector w and pick the best one to be the new w.
[Figure/legend: each weight is perturbed by a random number drawn from the rate-of-change interval, for example [-0.3, 0.3].]
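A minimal Python sketch of this neighborhood search; `evaluate` stands for whatever clustering-based fitness function scores a weight vector and is an assumption here, as are the parameter defaults.

```python
import numpy as np

def randomized_hill_climbing(w0, evaluate, R=30, rate=0.3, iterations=50, seed=0):
    """Repeatedly sample R weight vectors around w and move to the best one.
    `evaluate` returns a fitness value where higher is better."""
    rng = np.random.default_rng(seed)
    w, best = w0.copy(), evaluate(w0)
    for _ in range(iterations):
        # Neighborhood: multiply each weight by a factor in [1 - rate, 1 + rate].
        candidates = w * (1 + rng.uniform(-rate, rate, size=(R, w.size)))
        scores = np.array([evaluate(c) for c in candidates])
        if scores.max() > best:
            best, w = scores.max(), candidates[scores.argmax()]
    return w
```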
19
Accuracy: IOWA and Randomized Hill Climbing
Dataset                      RHC(1c)  RHC(2c)  RHC(5c)  IOWA(1c)  IOWA(2c)  IOWA(5c)
autos                        48.21    46.66    38.32    40.94     45.70     41.39
breast-cancer                70.09    73.05    71.04    71.85     73.21     71.49
wisconsin-breast-cancer      94.47    96.24    95.06    94.41     96.67     94.03
credit-rating                53.17    47.17    44.59    53.28     49.14     45.88
pima_diabetes                71.56    73.91    73.24    72.11     73.80     74.22
german_credit                69.50    71.31    72.48    67.41     68.89     70.47
Glass                        61.24    64.56    62.32    61.16     63.38     61.41
cleveland-14-heart-diseas    77.89    74.87    71.20    77.33     73.39     67.30
hungarian-14-heart-diseas    80.94    80.09    78.45    79.77     79.62     76.78
heart-statlog                82.33    81.67    76.37    82.15     81.78     77.52
ionosphere                   82.74    85.75    86.17    85.25     89.72     89.57
sonar                        70.70    71.97    73.68    71.70     72.67     73.43
vehicle                      56.25    56.25    58.31    53.51     56.36     55.48
vote                         94.67    90.54    88.84    93.68     94.21     89.05
zoo                          78.97    67.19    56.11    79.20     68.75     53.80
20
Distance Function Learning With Adaptive Clustering
– Uses reinforcement learning to adapt distance functions for k-means clustering.
– Employs a search strategy that explores multiple paths in parallel. The algorithm maintains an open list with maximum size |L|; bad performers are dropped from the open list. Currently, beam search is used: it creates 2p successors (increasing and decreasing the weight of each attribute exactly once), evaluates those 2p*|L| successors, and keeps the best |L| of them (see the sketch after this list).
– Discretizes the search space, whose states are tuples, into a grid, and memorizes and updates the fitness values of the grid; value iteration is limited to "interesting states" by employing prioritized sweeping.
– Weights are updated by increasing/decreasing the weight of an attribute by a randomly chosen percentage that falls within an interval [min-change, max-change]; our current implementation uses [25%, 50%].
– Employs entropy H(X) as the fitness function (low entropy corresponds to pure clusters).
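A minimal Python sketch of the beam-search successor generation described above; `evaluate` is again an assumed fitness function over weight vectors (higher is better), and the change interval defaults to the [25%, 50%] mentioned on the slide.

```python
import numpy as np

def beam_search_step(open_list, evaluate, beam_size, change=(0.25, 0.50), seed=0):
    """One beam-search step over weight vectors: each of the |L| current
    solutions spawns 2p successors (each attribute's weight increased and
    decreased once by a random percentage in [min-change, max-change]);
    the best |L| successors form the next open list."""
    rng = np.random.default_rng(seed)
    successors = []
    for w in open_list:
        for i in range(len(w)):
            for sign in (+1, -1):
                w_new = w.copy()
                w_new[i] *= 1 + sign * rng.uniform(*change)
                successors.append(w_new)
    successors.sort(key=evaluate, reverse=True)   # higher fitness first
    return successors[:beam_size]
```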
21
6. Related Distance Function Learning Research
– Interactive approaches that use user feedback and reinforcement learning to derive a good distance function.
– Other work uses randomized hill climbing and neural networks to learn distance functions for classification tasks; mostly, NN-queries are used to evaluate the quality of a clustering.
– Other work, mostly in the area of semi-supervised clustering, adapts object distances to cope with constraints.
22
7. Summary
– Described an approach that employs clustering for distance function evaluation.
– Introduced an attribute weight updating heuristic called inside/outside weight updating and evaluated its performance.
– The inside/outside weight updating approach enhanced a 1-NN classifier significantly for some UCI datasets, but not for all data sets that were tested.
– The quality of the employed approach depends on the number of clusters k, which is an input parameter; our current research centers on determining k automatically with a supervised clustering algorithm [EZZ04].
– The general idea of replacing a dataset by cluster representatives to enhance NN-classifiers shows a lot of promise in this research (as exemplified by the NCC classifier) and in other research we are currently conducting.
– Distance function learning is quite time-consuming; one run of 200 iterations of inside/outside weight updating takes between 5 seconds and 5 minutes, depending on dataset size and k-value. Other techniques we are currently investigating are significantly slower; therefore, we are moving to high-performance computing facilities for the empirical evaluation of the distance function learning approaches.
23
Links to 4 Papers
1. [EZZ04] C. Eick, N. Zeidat, Z. Zhao, Supervised Clustering --- Algorithms and Benefits, short version appeared in Proc. International Conference on Tools with AI (ICTAI), Boca Raton, Florida, November 2004. http://www.cs.uh.edu/~ceick/kdd/EZZ04.pdf
2. [RE05] T. Ryu and C. Eick, A Clustering Methodology and Tool, Information Sciences 171(1-3): 29-59 (2005). http://www.cs.uh.edu/~ceick/kdd/RE05.doc
3. [ERBV04] C. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta, Using Clustering to Learn Distance Functions for Supervised Similarity Assessment, in Proc. MLDM'05, Leipzig, Germany, July 2005. http://www.cs.uh.edu/~ceick/kdd/ERBV05.pdf
4. [BECV05] A. Bagherjeiran, C. Eick, C.-S. Chen, R. Vilalta, Adaptive Clustering: Obtaining Better Clusters Using Feedback and Past Experience, submitted for publication. http://www.cs.uh.edu/~ceick/kdd/BECV05.pdf
24
Questions?
25
Randomized Hill Climbing
Fast start: the algorithm starts from a small neighborhood size until it cannot find any better solutions. It then increases its neighborhood size threefold, hoping that a better solution can be found by trying more points.
Shoulder condition: when the algorithm has moved onto a shoulder or a flat hill, it will keep getting solutions with the same fitness value. Our algorithm terminates when it has tried 3 times and still gets the same result; this prevents it from being trapped on a shoulder forever.
26
Randomized Hill Climbing
[Figure: objective function plotted over the state space, illustrating a shoulder and a flat hill.]
27
Purity in Clusters Obtained (Internal), Test 2.2 (Beta = 0.4)
Inside/outside weight updating (repeated 200 times); SCEC parameters: PS = 200, n = 30.
Learning Rate (%)  Diabetes  Vehicle  HeartStatlog  Glass    Heart-C  Heart-H  IONOSPHERE
10                 0.23177   0.35142  0.13333       0.24230  0.33003  0.14286  0.11252
35                 0.22135   0.33870  0.14074       0.23832  0.33003  0.14625  0.08717
50                 0.21354   0.36213  0.14074       0.26099  0.33333  0.13265  0.08717
70                 0.21745   0.35545  0.14074       0.23871  0.33333  0.13605  0.08717
28
Purity in Clusters Obtained (Internal), Test 2.2 (Beta = 0.4)
Randomized hill climbing (p = 30); SCEC parameters: PS = 200, n = 30.
Learning Rate r(%)  Diabetes  Vehicle  HeartStatlog  Glass    Heart-C  Heart-H  IONOSPHERE
5                   0.2174    0.3532   0.1407        0.2804   0.3399   0.1361   0.1196
15                  0.2227    0.3550   0.1296        0.2407   0.3366   0.1020   0.1150
30                  0.2174    0.3515   0.1148        0.2323   0.3333   0.1259   0.1207
50                  0.2174    0.3320   0.1111        0.2330   0.3333   0.1259   0.1054
65                  0.2214    0.3108   0.1148        0.2323   0.3135   0.1190   0.0957
80                  0.2083    0.3092   0.1148        0.2196   0.3300   0.1361   0.1082
90                  0.2057    0.3108   0.1296        0.2349   0.3201   0.1088   0.0872
29
Different Forms of Clustering
Objectives of supervised clustering: minimize cluster impurity while keeping the number of clusters low (expressed by a fitness function q(X)).
30
A Fitness Function for Supervised Clustering
q(X) := Impurity(X) + β * Penalty(k)
k: number of clusters used
n: number of examples in the dataset
c: number of classes in the dataset
β: weight for Penalty(k), 0 < β ≤ 2.0
Penalty(k) increases sub-linearly, because increasing the number of clusters from k to k+1 has a greater effect on the end result when k is small than when it is large; hence the penalty formula (shown as an image in the original slide).
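The concrete Penalty(k) term appears only as an image on the slide. A sub-linear penalty consistent with the description, and of the form used in the cited supervised clustering work [EZZ04] to the best of my recollection, would be the following; treat this as an assumption rather than the slide's formula.

```latex
% Assumed form, consistent with the slide's description -- not reproduced from the slide itself.
q(X) = \mathrm{Impurity}(X) + \beta \cdot \mathrm{Penalty}(k), \qquad
\mathrm{Impurity}(X) = \frac{\#\text{ of minority examples}}{n}, \qquad
\mathrm{Penalty}(k) =
  \begin{cases}
    \sqrt{(k - c)/n} & k \ge c \\
    0 & k < c
  \end{cases}
```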