Correspondence Clustering: An Approach to Cluster Multiple Related Spatial Datasets
Vadeerat Rinsurongkawong and Christoph F. Eick
Department of Computer Science, University of Houston, USA

Organization
1. Motivation
2. Analyzing Related Datasets
3. Correspondence Clustering
   - Definition
   - Frameworks
   - Representative-based Correspondence Clustering Algorithms
4. Assessing Agreement between Related Datasets
5. Experimental Evaluation
6. Conclusion and Future Work

1. Motivation
Clustering related datasets has many applications:
- Relating the habitats of animals to their sources of food
- Understanding changes in ozone concentrations due to industrial emissions of other pollutants
- Analyzing changes in water temperature
However, traditional clustering algorithms that cluster each dataset separately are not well suited to clustering related datasets:
- They do not consider correspondence between the datasets.
- The variance inherent in most clustering algorithms complicates the analysis of related datasets.
Rinsurongkawong & Eick: Correspondence Clustering, PAKDD’10

2. Analyzing Related Datasets
Subtopics:
- Disparity Analysis/Emergent Pattern Discovery (“how do two groups differ with respect to their patterns?”)
- Change Analysis (“what is new/different?”) in temporal datasets; e.g., [CKT06] and [CSZHT07] utilize a concept of temporal smoothness, which states that the clustering results of data in two consecutive time frames should not be dramatically different.
- Relational clustering [BBM07] clusters different types of objects based on their properties and relationships.
- Co-clustering [DMM03] partitions the rows and columns of a data matrix simultaneously to create clusters for two sets of objects.
- Correspondence clustering centers on “mining good clusters in different datasets with interesting relationships between them”.

Clustering with Plug-in Fitness Functions
- In the last 5 years, my research group has developed families of clustering algorithms that find contiguous spatial clusters by maximizing a plug-in fitness function.
- This work is motivated by a mismatch between the evaluation measures of traditional clustering algorithms (such as cluster compactness) and what domain experts are actually looking for.
- The presented paper generalizes this work to mine multiple datasets.

3. Correspondence Clustering
Definition: A correspondence clustering algorithm clusters data in two or more datasets $O=\{O_1,\ldots,O_n\}$ and generates clustering results $X=\{X_1,\ldots,X_n\}$ such that, for $1 \le i \le n$, $X_i$ is created from $O_i$. The algorithm seeks clusterings $X_i$ such that each $X_i$ maximizes the interestingness $i(X_i)$ with respect to $O_i$ and also maximizes the correspondence measure $\mathrm{Corr}(X_1,\ldots,X_n)$ between itself and the other clusterings $X_j$ for $1 \le j \le n$, $j \ne i$.

Example Correspondence Clustering
[Figure: two panels, O1 (Earthquakes 86-91) and O2 (Earthquakes)]
Example: Analyze changes with respect to regions of high variance of earthquake depth. Find clusters X1 for O1 and X2 for O2 that maximize the following objective function:
$\tilde{q}(X_1,X_2) = \alpha\,(i(X_1)+i(X_2)) + (1-\alpha)\,\mathrm{Agreement}(X_1,X_2)$
where $i(X)$ measures the interestingness of $X$ based on the variance in earthquake depth of the earthquakes in the clusters of $X$, and $\alpha$ determines the relative importance of dataset cluster quality and agreement between the two clusterings.

What is unique about Correspondence Clustering?
- It relies on clustering algorithms that support plug-in fitness functions, allowing for non-distance-based notions of interestingness.
- It is geared towards analyzing spatial datasets; spatial attributes serve as the glue that relates different spatial datasets.
- It can be viewed as a multi-objective optimization problem in which we try to obtain good clusters in multiple datasets that also fit well with respect to a given correspondence relationship.

Algorithms for Correspondence Clustering
Two groups of algorithms can be distinguished:
- Iterative algorithms that improve the clustering of one dataset while keeping the clusters of the other datasets fixed.
- Concurrent algorithms that cluster all datasets in parallel.
In the following, an iterative representative-based correspondence clustering algorithm, C-CLEVER-I, is briefly discussed.

Representative-based Clustering
[Figure: plot with axes labeled Attribute and Attribute2]
Objective: Find a set of objects $O_R$ such that the clustering X obtained by using the objects in $O_R$ as representatives minimizes q(X).
Characteristic: Clusters are formed by assigning objects to the closest representative.
Popular algorithms: K-means, K-medoids, CLEVER, …

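The assignment step that defines a representative-based clustering can be sketched in a few lines of Python. This is only an illustration: it assumes Euclidean distance and objects given as coordinate tuples, and the function name `assign_to_representatives` is hypothetical, not taken from the slides.

```python
import math

def assign_to_representatives(objects, representatives):
    """Form clusters by assigning each object to its closest representative."""
    clusters = [[] for _ in representatives]
    for obj in objects:
        # Index of the representative with the smallest Euclidean distance.
        nearest = min(range(len(representatives)),
                      key=lambda r: math.dist(obj, representatives[r]))
        clusters[nearest].append(obj)
    return clusters

# Toy data: two well-separated groups, one representative in each.
objects = [(0.0, 0.0), (1.0, 0.5), (9.0, 9.0), (10.0, 8.5)]
representatives = [(0.0, 0.0), (10.0, 8.5)]
clusters = assign_to_representatives(objects, representatives)
print(clusters)
```

Because the clustering is fully determined by the dataset and the chosen representatives, a search procedure only has to explore sets of representatives and score each resulting clustering with q.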
CLEVER [ACM-GIS’08]
- Is a representative-based clustering algorithm, similar to PAM.
- Searches a variable number of clusters and uses larger neighborhood sizes to battle premature termination; uses randomized hill climbing and adaptive sampling to reduce complexity.
- In general, new clusterings are generated in the neighborhood of the current solution by replacing, inserting, and deleting representatives.
- Searches for the optimal number of clusters.

C-CLEVER-I
Inputs: O1 and O2, TCond, k’, neighborhood-size, p, p’
Outputs: X1, X2, q(X1), q(X2), q̃(X1,X2), Corr(X1,X2)
Algorithm:
1. Run CLEVER on dataset O1 with fitness function q and get clustering result X1 and a set of representatives R1:
   (X1,R1) := Run CLEVER(O1, q)
2. Repeat until the termination condition TCond is met:
   a. Run CLEVER on dataset O2 with compound fitness function q̃2, which uses the representatives R1 to calculate Corr(X1,X2):
      (X2,R2) := Run CLEVER(O2, R1, q̃2)
   b. Run CLEVER on dataset O1 with compound fitness function q̃1, which uses the representatives R2 to calculate Corr(X1,X2):
      (X1,R1) := Run CLEVER(O1, R2, q̃1)
Outputs and fitness functions:
- X1 and X2 are clusterings of O1 and O2.
- q is a single-dataset fitness function.
- $\tilde{q}(X_1,X_2) = \alpha\,(q(X_1)+q(X_2)) + (1-\alpha)\,\mathrm{Corr}(X_1,X_2)$
- $\tilde{q}_1(X_1) = \alpha\,q(X_1) + (1-\alpha)\,\mathrm{Corr}(X_1,X_2)$, i.e., the compound fitness with $X_2$ fixed
- $\tilde{q}_2(X_2) = \alpha\,q(X_2) + (1-\alpha)\,\mathrm{Corr}(X_1,X_2)$, i.e., the compound fitness with $X_1$ fixed

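The alternation in C-CLEVER-I can be illustrated with a small Python sketch. Everything named here is an assumption made for illustration: `search` is a hypothetical stand-in for a CLEVER run (an exhaustive search over pairs of objects), `toy_q` and `toy_corr` are toy plug-in functions, and the termination condition TCond is simplified to a fixed number of rounds. Only the control flow follows the algorithm on the slide.

```python
import math
from itertools import combinations

def compound(dataset, q, corr, alpha, fixed_reps):
    """Compound fitness alpha * q + (1 - alpha) * Corr with the other side fixed."""
    def fitness(reps):
        return alpha * q(dataset, reps) + (1 - alpha) * corr(reps, fixed_reps)
    return fitness

def c_clever_i(o1, o2, q, corr, alpha, search, rounds=3):
    # Step 1: cluster O1 alone with the single-dataset fitness q.
    r1 = search(o1, lambda reps: q(o1, reps))
    r2 = None
    for _ in range(rounds):  # TCond simplified to a fixed round count (assumption)
        r2 = search(o2, compound(o2, q, corr, alpha, r1))  # step 2a: R1 held fixed
        r1 = search(o1, compound(o1, q, corr, alpha, r2))  # step 2b: R2 held fixed
    return r1, r2

# Toy plug-ins, hypothetical and chosen only to make the sketch runnable.
def toy_q(dataset, reps):
    # Compactness: negative total distance of objects to their closest representative.
    return -sum(min(math.dist(o, r) for r in reps) for o in dataset)

def toy_corr(reps_a, reps_b):
    # Higher when every representative in reps_a has a nearby one in reps_b.
    return -sum(min(math.dist(a, b) for b in reps_b) for a in reps_a)

def exhaustive_search(dataset, fitness, k=2):
    # Stand-in for CLEVER: tries every k-subset of objects as representatives.
    return max(combinations(dataset, k), key=fitness)

o1 = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
o2 = [(0.0, 1.0), (1.0, 1.0), (10.0, 1.0), (11.0, 1.0)]
r1, r2 = c_clever_i(o1, o2, toy_q, toy_corr, alpha=0.5, search=exhaustive_search)
print(r1, r2)
```

Passing the fixed representatives into a closure mirrors the way q̃1 and q̃2 each treat the other dataset's clustering as a constant during one CLEVER run.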
4. Assessing Agreement between Clusterings for Related Datasets
- We assume that the two datasets share the same spatial attributes.
- The challenge of this task is that we do not have object identity; otherwise, we could directly compare the co-occurrence matrices $M_{X_1}$ and $M_{X_2}$ of the two clusterings.
- Key idea: We can use the representatives of the clusters in one dataset to cluster the other dataset. For each dataset, we can then compute the similarity of its original clustering with the clustering obtained using the representatives of the other dataset. Finally, we assess agreement by averaging over these two similarities.

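The key idea can be sketched in Python as follows. The similarity used here, the fraction of object pairs whose co-occurrence (same cluster or not) is identical in both clusterings, is an assumption: the slides only state that co-occurrence matrices are compared and that the two similarities are averaged. The function names and the use of Euclidean distance are likewise illustrative.

```python
import math
from itertools import combinations

def labels_from_representatives(objects, representatives):
    """Cluster label of each object = index of its closest representative."""
    return [min(range(len(representatives)),
                key=lambda r: math.dist(obj, representatives[r]))
            for obj in objects]

def cooccurrence_similarity(labels_a, labels_b):
    """Fraction of object pairs on which two clusterings of one dataset agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    same = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
               for i, j in pairs)
    return same / len(pairs)

def agreement(o1, r1, o2, r2):
    # Recluster each dataset with the other dataset's representatives,
    # compare with its own clustering, then average the two similarities.
    sim1 = cooccurrence_similarity(labels_from_representatives(o1, r1),
                                   labels_from_representatives(o1, r2))
    sim2 = cooccurrence_similarity(labels_from_representatives(o2, r2),
                                   labels_from_representatives(o2, r1))
    return (sim1 + sim2) / 2

o1 = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
o2 = [(0.0, 1.0), (1.0, 1.0), (10.0, 1.0), (11.0, 1.0)]
r1 = [(0.0, 0.0), (10.0, 0.0)]
r2 = [(1.0, 1.0), (10.0, 1.0)]
print(agreement(o1, r1, o2, r2))
```

Since both clusterings being compared are always over the same dataset, no object identity across datasets is needed; only the shared spatial attributes are used.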
5. Experimental Evaluation
[Figure: two panels, O1 and O2]
Example: Analyze changes with respect to regions of high variance of earthquake depth. Find clusters X1 for O1 and X2 for O2 that maximize the following objective function:
$\tilde{q}(X_1,X_2) = \alpha\,(i(X_1)+i(X_2)) + (1-\alpha)\,\mathrm{Agreement}(X_1,X_2)$
What is done in the experimental evaluation?
1. We compare running C-CLEVER-I with running CLEVER.
2. We analyze the potential of using agreement as a correspondence function to reduce the variance of clustering results.
3. We analyze different initialization strategies and parameter settings.

Comparing CLEVER and C-CLEVER-I
Table 5-3. Comparison of average results of CLEVER and C-CLEVER-I
Columns: CLEVER | C-CLEVER-I (α=1.0e-5) | C-CLEVER-I (α=2.0e-6)
Rows:
- Fitness q(X1)
- Fitness q(X2)
- q(X1) + q(X2)
- Agreement(X1,X2)
- Agreement X1’s
- Agreement X2’s
- Computation Time: 5.48E E E+06

Different Initialization Strategies
Table 5-3. Comparison of average results of CLEVER and C-CLEVER-I
Comparison between Different Initialization Strategies
The following initialization strategies for C-CLEVER-I have been explored:
(1) Use random initial representatives (C-CLEVER-I-R)
(2) Use the nearest neighbors of the final representatives of the last iteration for the other dataset (C-CLEVER-I-C)
(3) Use the final representatives from the previous iteration for the same dataset (C-CLEVER-I-O)
Columns: C-CLEVER-I-C (2) | C-CLEVER-I-O (3) | C-CLEVER-I-R (1)
Rows:
- Compound Fitness q̃(X1,X2)
- Fitness q(X1): 2.3E E E+08
- Fitness q(X2): 2.14E E E+08
- q(X1) + q(X2): 4.44E E E+08
- Agreement(X1,X2)
- Computation Time: 3.23E E E+06

6. Conclusion
- A representative-based correspondence clustering framework has been introduced that relies on plug-in clustering and correspondence functions.
- Correspondence clustering algorithms that are generalizations of an algorithm called CLEVER were presented.
- Our experimental results suggest that correspondence clustering can reduce the variance inherent to representative-based clustering algorithms. Since the two datasets are related to each other, using one dataset to supervise the clustering of the other dataset can lead to more reliable clusterings.
- As a by-product, a novel agreement assessment method for comparing representative-based clusterings that originate from different datasets has been introduced.

Future Work
- What about other correspondence measures, besides agreement and disagreement?
- What about applications that look for other forms of correspondence between clusters originating from different spatial datasets?
- Several implementation strategies for concurrent correspondence clustering are possible:
  - Cluster each dataset for a few iterations and switch
  - Find clusters for both datasets using some spatial traversal approach, creating clusters for subregions
  - …

CLEVER
Inputs: Dataset O, k’, neighborhood-size, p, p’
Outputs: Clustering X, fitness q
Algorithm:
1. Create a current solution by randomly selecting k’ representatives from O.
2. Create p neighbors of the current solution randomly using the given neighborhood definition.
3. If the best neighbor improves the fitness q, it becomes the current solution. Go back to step 2.
4. If the fitness does not improve, the solution neighborhood is re-sampled by generating p’ more neighbors. If re-sampling does not lead to a better solution, terminate and return the current solution; otherwise, go back to step 2, replacing the current solution with the best solution found by re-sampling.
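The hill-climbing loop with re-sampling can be sketched in Python as shown below. This is a simplified illustration rather than the authors' implementation: each neighbor is produced by a single randomly chosen insert, delete, or replace operation (so the neighborhood-size parameter is omitted), the fitness is maximized, and the names `clever_sketch` and `demo_fitness` as well as the per-cluster penalty in the demo fitness are assumptions.

```python
import math
import random

def clever_sketch(objects, fitness, k_init, p, p_resample, seed=0):
    rng = random.Random(seed)

    def random_neighbor(reps):
        # Apply one random operator to a copy of the current representatives.
        reps = list(reps)
        outside = [o for o in objects if o not in reps]
        op = rng.choice(["insert", "delete", "replace"])
        if op == "insert" and outside:
            reps.append(rng.choice(outside))
        elif op == "delete" and len(reps) > 1:
            reps.pop(rng.randrange(len(reps)))
        elif op == "replace" and outside:
            reps[rng.randrange(len(reps))] = rng.choice(outside)
        return reps

    # Step 1: random initial solution with k_init representatives.
    current = rng.sample(objects, k_init)
    current_fit = fitness(current)
    while True:
        improved = False
        # Steps 2-3 sample p neighbors; step 4 re-samples p_resample more on failure.
        for sample_size in (p, p_resample):
            best = max((random_neighbor(current) for _ in range(sample_size)),
                       key=fitness)
            best_fit = fitness(best)
            if best_fit > current_fit:
                current, current_fit = best, best_fit
                improved = True
                break
        if not improved:
            return current, current_fit

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
          (10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]

def demo_fitness(reps):
    # Toy plug-in fitness: compact clusters, with a penalty per representative.
    spread = sum(min(math.dist(o, r) for r in reps) for o in points)
    return -spread - 2.0 * len(reps)

reps, fit = clever_sketch(points, demo_fitness, k_init=2, p=5, p_resample=10)
print(reps, fit)
```

Because insert and delete operators change the number of representatives, the search can settle on a cluster count different from k’, which is why the plug-in fitness in the demo includes a term that discourages adding representatives without bound.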