An Architecture and Algorithms for Multi-Run Clustering
Rachsuda Jiamthapthaksin, Christoph F. Eick and Vadeerat Rinsurongkawong
Computer Science Department, University of Houston, TX

Outline
1. Motivation
2. Goals
3. Overview
4. Related work
5. An architecture and algorithms for multi-run clustering
6. Experimental results
7. Conclusion and future work

1. Motivation
Region discovery framework: a family of clustering algorithms and a family of plug-in fitness functions.
Domain experts: manually select parameters of clustering algorithms.
Multi-run clustering: rely on active learning to automatically select parameters of clustering algorithms.
Cougar^2: Open Source Data Mining and Machine Learning Framework

2. Goals
Given a spatial dataset $O = \{o_1, \ldots, o_n\}$, a clustering algorithm seeks a clustering $X$ that maximizes a fitness function $q(X)$, where $X = \{x_1, x_2, \ldots, x_k\}$, $x_i \subseteq O$, and $x_i \cap x_j = \emptyset$ for $i \neq j$.
The goal is to automatically find a set of distinct, high-quality clusters that originate from different runs.

3. Overview of Multi-Run Clustering – 1
Key hypothesis: better clustering results can be obtained by combining clusters that originate from multiple runs of a clustering algorithm.

3. Overview of Multi-Run Clustering – 2
Challenges:
  Selecting appropriate parameters for an arbitrary clustering algorithm.
  Determining which clusters to store as candidate clusters.
  Generating a final clustering from the candidate clusters.
  Alternative clusters, e.g. hotspots in spatial datasets at different granularities.

4. Related Work
Meta clustering [Caruana et al]: first creates diverse clusterings, then clusters them into groups, and finally lets users choose the group of clusterings that best fits their needs.
Ensemble clustering [Gionis et al. 2005; Zeng et al]: aggregates different clusterings into one consolidated clustering.
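
Definition of a State
A state $s$ in a state space $S$ ($S \subseteq \mathbb{R}^{2m}$) holds a lower and an upper bound for each of the $m$ parameters: $s = \{s_{1,\min}, s_{1,\max}, \ldots, s_{m,\min}, s_{m,\max}\}$.
A state $s$ for CLEVER: $s = \{k'_{\min}, k'_{\max}, p_{\min}, p_{\max}, p'_{\min}, p'_{\max}\}$.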
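A state can be held as one closed interval per parameter, from which concrete parameter values are drawn. A minimal Python sketch, assuming integer-valued parameters and uniform sampling within each interval; the names `State` and `sample_parameters` are illustrative and do not come from the slides:

```python
import random
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class State:
    """One [min, max] interval per parameter (hypothetical representation)."""
    ranges: Tuple[Tuple[str, int, int], ...]  # (parameter name, min, max)

    def sample_parameters(self, rng: Optional[random.Random] = None) -> Dict[str, int]:
        # Assumption: each parameter is drawn uniformly from its closed interval.
        rng = rng or random.Random()
        return {name: rng.randint(lo, hi) for name, lo, hi in self.ranges}


# CLEVER state s_2 as listed on the Step 1 slide.
s2 = State((("k'", 11, 20), ("p", 41, 50), ("p'", 31, 40)))
params = s2.sample_parameters(random.Random(0))
```

Any draw from `s2` stays inside its three intervals, as does the selection {k' = 12, p = 45, p' = 40} shown on the Step 1 slide.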

5. An Architecture of the Multi-Run Clustering System
Components: State Utility Learning, Clustering Algorithm, Storage Unit, Cluster Summarization Unit.
Diagram labels: Parameters, X (clustering), M (cluster list), M' (summarized clusters).
Steps in multi-run clustering:
S1: Parameter selection.
S2: Run a clustering algorithm.
S3: Compute a state feedback.
S4: Update the state utility table.
S5: Update the cluster list M.
S6: Summarize the discovered clusters into M'.
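The six steps can be read as one loop. A hedged sketch in Python, with every component passed in as a callable; the function name `multi_run`, the `n_runs` budget, and the choice to summarize once after the last run are assumptions, not details given on the slide:

```python
from typing import Any, Callable, List, Tuple


def multi_run(
    n_runs: int,
    select_parameters: Callable[[], Tuple[Any, Any]],         # S1
    run_clustering: Callable[[Any], List[Any]],               # S2
    state_feedback: Callable[[List[Any], List[Any]], float],  # S3
    update_utility: Callable[[Any, float], None],             # S4
    update_cluster_list: Callable[[List[Any], List[Any]], List[Any]],  # S5
    summarize: Callable[[List[Any]], List[Any]],              # S6
) -> List[Any]:
    """Run the S1..S5 loop n_runs times, then summarize M into M'."""
    M: List[Any] = []  # stored cluster list
    for _ in range(n_runs):
        state, params = select_parameters()   # S1: pick a state and parameters
        X = run_clustering(params)            # S2: obtain a clustering X
        feedback = state_feedback(X, M)       # S3: score X against M
        update_utility(state, feedback)       # S4: update the utility table
        M = update_cluster_list(X, M)         # S5: maintain the cluster list
    return summarize(M)                       # S6: final clustering M'
```

Passing the components in keeps the loop independent of the clustering algorithm, which matches the slide's separation of the four units.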

Pre-processing Step. Compute Necessary Statistics to Set Up the Multi-Run Clustering System.
S0: run m rounds of CLEVER with randomly selected k', p and p'.

Step 1. Select Parameters of a Clustering Algorithm.
S1 chooses a state using one of three policies:
1. Randomly select a state.
2. Choose a state with the maximum state utility value.
3. Choose a state in the neighborhood of the state having the maximum state utility value.
Fig. 2. Examples of the policies: P(policy 1) = 0.2, P(policy 2) = 0.6, P(policy 3) = 0.2.
$s_1 = \{k'_{\min} = 1, k'_{\max} = 10, p_{\min} = 1, p_{\max} = 10, p'_{\min} = 11, p'_{\max} = 20\}$
$s_2 = \{k'_{\min} = 11, k'_{\max} = 20, p_{\min} = 41, p_{\max} = 50, p'_{\min} = 31, p'_{\max} = 40\}$
Selected state (within the ranges of $s_2$): $\{k' = 12, p = 45, p' = 40\}$
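The three policies and their probabilities from Fig. 2 can be combined into one selection routine. A sketch, assuming state utilities are kept in a dictionary and that a caller-supplied `neighbors` function defines the neighborhood of a state (the slides do not define it):

```python
import random
from typing import Callable, Dict, Hashable, List, Sequence


def select_state(
    utilities: Dict[Hashable, float],
    neighbors: Callable[[Hashable], List[Hashable]],
    rng: random.Random,
    probs: Sequence[float] = (0.2, 0.6, 0.2),
) -> Hashable:
    """Pick a state with policy 1, 2, or 3, chosen with the given probabilities."""
    best = max(utilities, key=utilities.get)  # state with maximum utility
    policy = rng.choices((1, 2, 3), weights=probs, k=1)[0]
    if policy == 1:
        return rng.choice(list(utilities))    # policy 1: random state
    if policy == 2:
        return best                           # policy 2: best state
    near = neighbors(best)                    # policy 3: neighbor of best state
    # Assumption: fall back to the best state when it has no neighbors.
    return rng.choice(near) if near else best
```

With the default probabilities, the routine exploits the best-known state most of the time and spends the remaining 40% on random and local exploration.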

Step 2. Run CLEVER to Generate a Clustering with Respect to the Given Parameters.
S2 parameters: k' = 12, p = 45, p' = 40.
Fitness function: q(X)

Step 3. Compute a State Utility.
S3 uses a relative clustering quality function (RCQ):
$RCQ(X, M) = Novelty(X, M) \times \|Speed(X)\| \times \|q(X)\|$
$Novelty(X, M) = (1 - similarity(X, M)) \times Enhancement(X, M)$
where $X = \{x_1, \ldots, x_k\}$ and $y_i$ is the most similar cluster in the stored cluster list $M$ to $x_i \in X$.
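The RCQ formula multiplies three factors. A small sketch of that arithmetic, reading the Novelty line as a product and assuming the ||·|| terms are speed and fitness already normalized to [0, 1]; how similarity, Enhancement, and the normalization are computed is not given on the slide, so they are plain inputs here:

```python
def novelty(similarity: float, enhancement: float) -> float:
    """Novelty(X, M) = (1 - similarity(X, M)) * Enhancement(X, M)."""
    return (1.0 - similarity) * enhancement


def rcq(similarity: float, enhancement: float,
        speed_norm: float, quality_norm: float) -> float:
    """RCQ(X, M) = Novelty(X, M) * ||Speed(X)|| * ||q(X)||."""
    return novelty(similarity, enhancement) * speed_norm * quality_norm


# Hypothetical inputs: X is fairly new (similarity 0.25) with moderate enhancement.
value = rcq(similarity=0.25, enhancement=0.5, speed_norm=0.5, quality_norm=0.5)
```

Under this reading, a clustering identical to what is already stored (similarity 1) gets an RCQ of 0 regardless of its speed or fitness.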

Step 4. Update a State Utility.
S4: utility update, producing the new utility U'.

Step 5. Update Cluster Lists to Maintain a Set of Distinct and High-Quality Clusters.
Let
  M be the current set of multi-run clusters,
  X be a new clustering to be processed for updating M,
  θ_sim be a similarity threshold,
  r_th be a reward storage threshold.
X is processed as follows:
  FOR c ∈ X DO
    Let m be the most similar cluster in M to c.
    IF sim(m, c) > θ_sim AND Reward(m) < Reward(c) THEN replace(m, c, M)
    ELSE IF Reward(c) > r_th THEN insert(c, M)
    ELSE discard(c);
Fig. 3. Cluster List Management algorithm (CLM)
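The CLM pseudocode of Fig. 3 translates almost line by line into Python. A sketch, assuming clusters are sets of object ids and using Jaccard similarity as a stand-in for `sim`, which the slide does not define; the `reward` function is supplied by the caller:

```python
from typing import Callable, FrozenSet, Iterable, List

Cluster = FrozenSet[int]


def jaccard(a: Cluster, b: Cluster) -> float:
    """Assumed similarity measure: |a ∩ b| / |a ∪ b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def update_cluster_list(
    X: Iterable[Cluster],
    M: List[Cluster],
    reward: Callable[[Cluster], float],
    sim_threshold: float,
    reward_threshold: float,
    sim: Callable[[Cluster, Cluster], float] = jaccard,
) -> List[Cluster]:
    """Return the updated cluster list after processing clustering X."""
    M = list(M)  # work on a copy so the caller's list is untouched
    for c in X:
        # Index of the stored cluster most similar to c (None when M is empty).
        i = max(range(len(M)), key=lambda j: sim(M[j], c), default=None)
        if i is not None and sim(M[i], c) > sim_threshold and reward(M[i]) < reward(c):
            M[i] = c                      # replace(m, c, M)
        elif reward(c) > reward_threshold:
            M.append(c)                   # insert(c, M)
        # otherwise: discard(c)
    return M
```

Following the pseudocode literally, a cluster that is similar to a stored one but has a lower reward is still inserted when its reward exceeds the storage threshold.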

Step 6. Generate a Final Clustering.
S6 applies the Dominance-guided Cluster Reduction algorithm (DCR) to reduce the cluster list M to the final clustering M'.
Dominance graphs: each graph links a dominant cluster to the clusters it dominates (figure with example clusters A to F).
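The slide does not spell out DCR, so the following is not the authors' algorithm. It is a simple greedy stand-in that illustrates the general idea of producing a final clustering by restricting cluster overlap: clusters are visited by descending reward, and a cluster is kept only if it overlaps little with every cluster already kept. The overlap measure, the function names, and the greedy order are all assumptions; the default threshold of 0.2 is the overlap threshold quoted in the experimental evaluation:

```python
from typing import Callable, FrozenSet, Iterable, List

Cluster = FrozenSet[int]


def overlap(a: Cluster, b: Cluster) -> float:
    """Assumed overlap measure: shared objects relative to the smaller cluster."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def reduce_by_overlap(
    clusters: Iterable[Cluster],
    reward: Callable[[Cluster], float],
    max_overlap: float = 0.2,
) -> List[Cluster]:
    """Greedy stand-in for cluster reduction: higher-reward clusters win."""
    kept: List[Cluster] = []
    for c in sorted(clusters, key=reward, reverse=True):
        # Keep c only if it overlaps little with every cluster kept so far.
        if all(overlap(c, d) <= max_overlap for d in kept):
            kept.append(c)
    return kept
```

In this stand-in, a kept high-reward cluster plays the role of a dominant cluster, and the clusters it pushes out play the role of the clusters it dominates.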

6. Experimental Evaluation – 1
Evaluation of multi-run clustering on an earthquake dataset*:
  Show how multi-run clustering can discover interesting and alternative clusters in spatial data.
  We are interested in areas where deep earthquakes are in close proximity to shallow earthquakes.
  We use the High Variance function (i(c)) [Rinsurongkawong 2008] to find such regions.
*: The earthquake dataset is available on the website of the U.S. Geological Survey Earthquake Hazards Program.

6. Experimental Evaluation – 2
Fig. 6. Top 5 clusters of X_TheBestRun (ordered by reward).
Fig. 7. Multi-run clustering results: clusters in M'.

6. Experimental Evaluation – 3
Our system can find 70% of the new, high-quality clusters that do not exist in the best single run.
With an overlap threshold of 0.2, 43% of the positive-reward clusters of the best run are not in M'.

6. Experimental Evaluation – 4
Fig. 8. The multi-run clustering result (in color) overlaid with the top 5 clusters by reward of the best run (in black).

7. Conclusion – 1
We propose an architecture and a concrete system for multi-run clustering that copes with parameter selection for a clustering algorithm and obtains alternative clusters in a highly automated fashion.
The system uses active learning to automate parameter selection, and various techniques to find clusters that are both different and good on the fly.
We propose the Dominance-guided Cluster Reduction algorithm, which post-processes clusters from the multiple runs to generate a final clustering by restricting cluster overlap.

7. Conclusion – 2
The experimental result on the earthquake dataset supports our claim that multi-run clustering outperforms single-run clustering with respect to clustering quality.
Multi-run clustering can discover additional novel, alternative, high-quality clusters and enhance the quality of clusters found using single-run clustering.

7. Future Work
Systematically evaluate the use of utility learning in choosing parameters of a clustering algorithm.
The ultimate goal is to construct multi-run, multi-objective clustering in one system.

Thank You