
1  MOSAIC: A Proximity Graph Approach for Agglomerative Clustering
Jiyeon Choo, Rachsuda Jiamthapthaksin, Chun-shen Chen, Ulvi Celepcikay, Christian Guisti, and Christoph F. Eick
Department of Computer Science, University of Houston (DaWaK 2007, Regensburg)

Organization
1. Motivation
   - Scope of the research: Region Discovery, Traditional Clustering
   - Clustering with Plug-In Fitness Functions
   - Shape-aware Clustering Algorithms
   - Ideas of MOSAIC
2. Background
3. The MOSAIC Algorithm
4. Experimental Evaluation
5. Related Work
6. Conclusion and Future Work

2  1.1 Motivation: Examples of Region Discovery
Applications of the region discovery (RD) algorithm:
- Application 1: Hot-spot Discovery [EVDW06]
- Application 2: Find Interesting Regions with respect to a Continuous Variable
- Application 3: Find "representative" regions (Sampling)
- Application 4: Regional Co-location Mining
- Application 5: Regional Association Rule Mining [DEWY06]
- Application 6: Regional Association Rule Scoping [EDYKN07]
Figure: wells in Texas (green: safe well with respect to arsenic; red: unsafe well), shown for the parameter settings β = 1.01 and β = 1.04.

3  Region Discovery Framework
The algorithms we currently investigate solve the following problem:
Given:
- A dataset O with a schema R
- A distance function d defined on instances of R
- A fitness function q(X) that evaluates a clustering X = {c_1, …, c_k} as follows:
    q(X) = Σ_{c∈X} reward(c) = Σ_{c∈X} interestingness(c) · size(c)^β, with β > 1
Objective: Find c_1, …, c_k ⊆ O such that:
1. c_i ∩ c_j = ∅ if i ≠ j
2. X = {c_1, …, c_k} maximizes q(X)
3. All clusters c_i ∈ X are contiguous
4. c_1 ∪ … ∪ c_k ⊆ O
5. c_1, …, c_k are usually ranked based on the reward each cluster receives, and low-reward clusters are frequently not reported
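To make the plug-in fitness concrete, the sketch below (ours, not the paper's code) evaluates q for a clustering given a user-supplied interestingness function; clusters are represented as sequences of object indices, and the names q and interestingness mirror the slide's notation.

```python
from typing import Callable, List, Sequence

def q(clustering: List[Sequence[int]],
      interestingness: Callable[[Sequence[int]], float],
      beta: float = 1.01) -> float:
    """q(X) = sum over c in X of interestingness(c) * size(c)**beta, beta > 1.
    The superlinear size term rewards growing clusters that stay interesting."""
    assert beta > 1.0, "the framework requires beta > 1"
    return sum(interestingness(c) * len(c) ** beta for c in clustering)
```

With β only slightly above 1 (compare the 1.01 and 1.04 settings on the previous slide), the size reward is mildly superlinear, so merging pays off only when interestingness does not drop much.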

4  1.2 Clustering with Plug-In Fitness Functions
Clustering algorithms, grouped by how they treat the fitness function:
- No fitness function: DBSCAN, Hierarchical Clustering
- Fixed (implicit) fitness function: K-Means, PAM
- Plug-in fitness function: CHAMELEON, MOSAIC

5  1.3 Shape-aware Clustering
Shape is a significant characteristic in traditional clustering and region discovery.
Examples:
- Fig. 1: some chain-like patterns in the Volcano dataset
- Fig. 2: arbitrarily shaped regions of high (low) arsenic concentration in Texas wells

6  1.4 Ideas Underlying MOSAIC
MOSAIC provides a generic framework that integrates representative-based clustering, agglomerative clustering, and proximity graphs, and approximates arbitrarily shaped clusters using unions of small convex polygons.
Fig. 6: An illustration of MOSAIC's approach: (a) input, (b) output

7  Talk Organization
1. Motivation
2. Background
   - Representative-based clustering
   - Agglomerative clustering
   - Proximity graphs
3. The MOSAIC Algorithm
4. Experimental Evaluation
5. Related Work
6. Conclusion and Future Work

8  2.1 Representative-based Clustering
(Figure: a 2D attribute space partitioned into four clusters, one per representative.)
Objective: Find a set of objects O_R such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X).
Properties: cluster shapes are convex polygons.
Popular algorithms: K-means, K-medoids, SCEC
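Step 1 of MOSAIC (see the pseudocode later) over-partitions the data with such an algorithm. A minimal sketch using scikit-learn's KMeans (an implementation choice for illustration, not the paper's code; the dataset and the value 100 are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Over-partition the data: each K-means cluster is a convex Voronoi cell
# around its centroid, which MOSAIC later stitches into arbitrary shapes.
X = np.random.rand(2000, 2)                 # stand-in for a spatial dataset
km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X)
representatives = km.cluster_centers_       # the set O_R from this slide
labels = km.labels_                         # micro-cluster membership
```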

9  2.2 MOSAIC and Agglomerative Clustering
Advantages of MOSAIC over traditional agglomerative clustering:
- Wider search: considers all neighbouring clusters
- Plug-in fitness function
- Clusters are always contiguous
- The expensive algorithm is only run for 20-1000 iterations
- Highly generic algorithm

10  2.3 Proximity Graphs
How can we identify neighbouring clusters for representative-based clustering algorithms? Proximity graphs provide various definitions of "neighbour":
- NNG = Nearest Neighbour Graph
- MST = Minimum Spanning Tree
- RNG = Relative Neighbourhood Graph
- GG = Gabriel Graph
- DT = Delaunay Triangulation (the neighbours of a 1NN classifier)

11  Proximity Graphs: Delaunay
- The Delaunay Triangulation is the dual of the Voronoi diagram.
- Three points are each other's neighbours if the sphere through them contains no other points.
- Complete: captures all neighbouring clusters.
- Expensive to compute in high dimensions.

12  Proximity Graphs: Gabriel
- The Gabriel graph is a subgraph of the Delaunay Triangulation (some decision boundaries might be missed).
- Two points are neighbours only if their (diametral) sphere of influence is empty.
- Can be computed more efficiently: O(k³).
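The diametral test gives a direct O(k³) construction. Below is a minimal sketch (ours, not the paper's code) over the cluster representatives: i and j are Gabriel neighbours iff no third point r falls inside the sphere whose diameter is the segment between them, i.e. d(i,r)² + d(j,r)² ≥ d(i,j)² for all r.

```python
import numpy as np

def gabriel_graph(reps: np.ndarray) -> set:
    """Edges of the Gabriel graph over k representatives, in O(k^3).
    (i, j) is an edge iff no other representative r lies strictly inside
    the sphere with diameter reps[i]-reps[j], i.e. for every r:
    d(i,r)^2 + d(j,r)^2 >= d(i,j)^2."""
    d2 = ((reps[:, None, :] - reps[None, :, :]) ** 2).sum(-1)  # squared dists
    k = len(reps)
    edges = set()
    for i in range(k):
        for j in range(i + 1, k):
            if all(d2[i, r] + d2[j, r] >= d2[i, j]
                   for r in range(k) if r != i and r != j):
                edges.add((i, j))
    return edges
```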

13  3. MOSAIC
Fig. 10: Gabriel graph for clusters generated by a representative-based clustering algorithm

14  Pseudo Code of MOSAIC
1. Run a representative-based clustering algorithm to create a large number of clusters.
2. Read the representatives of the obtained clusters.
3. Create a merge-candidate relation using proximity graphs.
4. WHILE there are merge candidates (C_i, C_j) left BEGIN
       Merge the pair of merge candidates (C_i, C_j) that enhances the fitness function q the most into a new cluster C'
       Update merge candidates:
           ∀C: Merge-Candidate(C', C) ⇔ Merge-Candidate(C_i, C) ∨ Merge-Candidate(C_j, C)
   END
RETURN the best clustering X found.
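One possible reading of step 4 and the final RETURN in runnable form; this is our sketch under stated assumptions, not the authors' implementation. Clusters are a dict mapping integer ids to lists of point indices, merge candidates are frozenset pairs of ids (e.g. {frozenset(e) for e in gabriel_graph(representatives)}), and fitness is any plug-in such as the q function sketched earlier.

```python
def mosaic(clusters, candidates, fitness):
    """Greedy agglomeration: repeatedly merge the candidate pair that
    improves the plug-in fitness the most; return the best clustering seen.
    clusters:   dict cluster_id (int) -> list of point indices
    candidates: set of frozenset({i, j}) merge-candidate id pairs
    fitness:    callable on a list of clusters, higher is better."""
    best = dict(clusters)
    best_score = fitness(list(clusters.values()))
    next_id = max(clusters) + 1
    while candidates:
        def merged_score(pair):
            # Fitness of the clustering obtained by merging this pair.
            i, j = tuple(pair)
            trial = [v for k, v in clusters.items() if k not in pair]
            trial.append(clusters[i] + clusters[j])
            return fitness(trial)
        pair = max(candidates, key=merged_score)     # best merge this round
        i, j = tuple(pair)
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)   # C'
        # C' inherits the merge candidates of C_i and C_j.
        candidates = {frozenset({next_id} | (p - pair)) if p & pair else p
                      for p in candidates if p != pair}
        score = fitness(list(clusters.values()))
        if score > best_score:
            best, best_score = dict(clusters), score
        next_id += 1
    return best
```

Note that every trial merge recomputes the fitness from scratch, which is exactly the assumption behind the complexity bound on the next slide; an incremental, additive fitness would be faster.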

15  Complexity of MOSAIC
Let n be the number of objects in the dataset and k be the number of clusters returned by the representative-based algorithm.
Complexity of MOSAIC: O(k³ + k² · O(q(X)))
Remarks:
- The formula above assumes that the fitness is computed from scratch whenever a new clustering is obtained.
- Lower complexities can be obtained by incrementally reusing the results of previous fitness computations.
- Our current implementation assumes that only additive fitness functions are used.
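The slide does not derive the bound; a plausible accounting (our reconstruction): building the Gabriel graph over k representatives costs O(k³) (Slide 12), the graph is planar in 2D and so has O(k) edges, and each of the at most k-1 merge iterations evaluates O(k) candidate merges at O(q(X)) apiece:

```latex
T(k) = \underbrace{O(k^3)}_{\text{Gabriel graph}}
     + \underbrace{(k-1)}_{\text{merges}}
       \cdot \underbrace{O(k)}_{\text{candidates per merge}}
       \cdot O(q(X))
     = O\!\bigl(k^3 + k^2 \cdot O(q(X))\bigr)
```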

16  4. Experimental Evaluation for Traditional Clustering
- Compared MOSAIC with DBSCAN and K-means.
- Used silhouette as q(X) when running MOSAIC; silhouette considers cohesion and separation (measured as the distance to the nearest cluster).
- Used the 9-Diamonds, Volcano, Diabetes, Ionosphere, and Vehicle datasets in the experimental evaluation.
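For concreteness, here is one way silhouette could be wired in as the plug-in q(X); this is our sketch using scikit-learn's silhouette_score, not the paper's code. It assumes the clustering is a list of clusters, each a list of row indices into the data matrix, as in the mosaic sketch above.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def make_silhouette_fitness(X: np.ndarray):
    """Return a plug-in fitness q: clustering -> silhouette in [-1, 1]."""
    def q(clustering):
        if len(clustering) < 2:            # silhouette needs >= 2 clusters
            return -1.0
        labels = np.empty(len(X), dtype=int)
        for cid, cluster in enumerate(clustering):
            labels[cluster] = cid          # assign the cluster id to its rows
        return silhouette_score(X, labels)
    return q
```

With the earlier sketches this composes as best = mosaic(clusters, candidates, make_silhouette_fitness(X)).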

17  Experimental Results
- Finding a good parameter setting for DBSCAN turned out to be problematic for the 9-Diamonds and Volcano spatial datasets.
- Neither DBSCAN nor MOSAIC was able to identify all chain-like patterns in the Volcano dataset.
- We compared MOSAIC and K-means on the high-dimensional Ionosphere, Diabetes, and Vehicle datasets; cluster quality was measured using silhouette. MOSAIC outperformed K-means on these datasets.

18  Volcano Dataset Result: MOSAIC

19  Volcano Dataset Result: DBSCAN

20  Open Issues: What Is a Good Fitness Function for Traditional Clustering?
- The use of plug-in fitness functions within traditional clustering algorithms is not very common.
- Using existing cluster evaluation measures as fitness functions, such as cohesion, separation, and silhouette, does not lead to very good clusterings when confronted with arbitrarily shaped clusters [Choo07].
- Question: Can we find better cluster evaluation measures, or is finding good evaluation measures for traditional clustering a hopeless undertaking?

21  5. Related Work
- CURE integrates a partitioning algorithm with an agglomerative hierarchical algorithm [GRS98].
- CHAMELEON [KHK99] provides a sophisticated two-phase clustering algorithm: a multilevel graph-partitioning algorithm followed by an agglomerative clustering algorithm on a k-nearest-neighbour sparse graph.

22  Related Work Continued
- Lin and Zhong [LC02, ZG03] propose hybrid clustering algorithms that combine representative-based and agglomerative clustering methods.
- Surdeanu [STA05] proposes a hybrid clustering approach that combines an agglomerative clustering algorithm with the Expectation Maximization (EM) algorithm.

23  6. Conclusion
- A new clustering algorithm was introduced that approximates arbitrarily shaped clusters through unions of convex polygons.
- The algorithm performs a wider search by considering "all" neighbouring clusters as merge candidates; Gabriel graphs are used to determine neighbouring clusters.
- The algorithm is generic in that it can be used with any initial merge-candidate relation, any fitness function, and any representative-based algorithm.
- MOSAIC can also be seen as a generalization of agglomerative grid-based clustering algorithms.
- We mainly use MOSAIC in the region discovery project mentioned earlier.

24  Future Work: Learn Fitness Functions Based on Feedback
Idea: employ machine learning techniques to learn a fitness function from the feedback of a domain expert.
Pros:
- A more adaptive approach: the fitness function can be tailored to the domain expert's changing requirements.
- The process of finding an appropriate fitness function becomes automatic.
Cons:
- Feature selection is non-trivial.
- Learning the function is a difficult machine learning task.

