© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance Computing Research Center Department of Computer Science University of Minnesota http://www.cs.umn.edu/~kumar
Hierarchical Clustering
• Two main types of hierarchical clustering algorithms.
  – Agglomerative
      Start with the points as individual clusters
      Merge clusters until only one is left
  – Divisive
      Start with all the points as one cluster
      Split clusters until only singleton clusters remain
  – Agglomerative is more popular
• Traditional hierarchical algorithms use a similarity or distance matrix.
  – Merge or split one cluster at a time

Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree.
• Can be visualized as a dendrogram
  – Tree-like diagram
  – Records the sequences of merges or splits
• Can ‘cut’ the dendrogram to get a partitional clustering

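Cutting the dendrogram to obtain k clusters amounts to replaying the recorded merges and stopping when k clusters remain. The following is a minimal sketch of that idea, not a method taken from the slides: the function name `cut_dendrogram` and the merge-record layout (a list of point-index pairs in merge order) are assumptions made for illustration.

```python
def cut_dendrogram(n_points, merges, k):
    """Replay the first n_points - k merges and return the resulting clusters.

    merges is a hypothetical record of the hierarchy: a list of (a, b) pairs,
    in merge order, where a and b are point indices taken from the two
    clusters being merged.
    """
    if not 1 <= k <= n_points:
        raise ValueError("k must be between 1 and n_points")

    parent = list(range(n_points))  # union-find forest, one tree per cluster

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in merges[: n_points - k]:
        parent[find(a)] = find(b)

    clusters = {}
    for p in range(n_points):
        clusters.setdefault(find(p), []).append(p)
    return sorted(clusters.values())


# Made-up merge order for five points: 0 with 1, 3 with 4, {0,1} with 2, then all.
merges = [(0, 1), (3, 4), (1, 2), (2, 4)]
print(cut_dendrogram(5, merges, 2))  # [[0, 1, 2], [3, 4]]
```

Choosing a larger k stops the replay earlier and yields a finer partition of the same hierarchy.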
Basic Agglomerative Clustering Algorithm
• Algorithm is straightforward
  1. Compute the proximity matrix, if necessary
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the proximity matrix
  6. Until only a single cluster remains
• Key operation is the computation of the proximity of two clusters.
• Different approaches to defining the distance between clusters distinguish the different algorithms.

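The six steps above can be sketched in a few lines of Python. This is a deliberately naive illustration rather than the course's implementation: instead of storing and updating a proximity matrix, it recomputes cluster proximities on every pass, and the names `agglomerative` and `single_link` as well as the sample points are assumptions.

```python
import math


def agglomerative(points, cluster_distance):
    """Naive agglomerative clustering; returns the merges in order.

    cluster_distance(a, b, points) is a caller-supplied function (an
    assumption of this sketch) giving the proximity of two clusters,
    each represented as a list of point indices.
    """
    clusters = [[i] for i in range(len(points))]  # step 2: one cluster per point
    merges = []
    while len(clusters) > 1:                      # steps 3 and 6
        # Step 4: find the two closest clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], points)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        # Step 5: no stored matrix here; proximities are recomputed next pass.
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


def single_link(a, b, points):
    """MIN proximity: distance between the two closest points."""
    return min(math.dist(points[p], points[q]) for p in a for q in b)


# Made-up 2D points for illustration.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (11.0, 0.0)]
for left, right, d in agglomerative(points, single_link):
    print(left, right, round(d, 3))
```

Passing a different `cluster_distance` function changes the algorithm's behaviour without touching the merge loop, which is the point made in the last bullet.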
Agglomerative Hierarchical Clustering: Starting Situation
• For agglomerative hierarchical clustering we start with clusters of individual points and a proximity matrix.
[Figure: points p1 to p5 and the proximity matrix indexed by p1, p2, p3, p4, p5, ...]

Agglomerative Hierarchical Clustering: Intermediate Situation
• After some merging steps, we have some clusters.
[Figure: clusters C1 to C5 and the proximity matrix indexed by C1, C2, C3, C4, C5]

Agglomerative Hierarchical Clustering: Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1 to C5 and the proximity matrix indexed by C1, C2, C3, C4, C5]

Agglomerative Hierarchical Clustering: After Merging
• The question is “How do we update the proximity matrix?”
[Figure: clusters C1, C3, C4 and C2 U C5; the proximity matrix entries for C2 U C5 are marked “?”]

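One common way to fill in the unknown entries is to derive the new row from the two old rows. For single link the new distance to any other cluster is the minimum of the two old distances, for complete link the maximum, and for group average the size-weighted mean. The sketch below illustrates this; the function name `merged_row`, the dict-of-dicts matrix layout, and the numeric distances are assumptions made for illustration.

```python
def merged_row(dist, sizes, a, b, rule="min"):
    """Distances from the merged cluster a U b to every other cluster.

    dist is a dict-of-dicts proximity matrix keyed by cluster name and
    sizes maps each cluster name to its number of points; both layouts
    are assumptions of this sketch.
    """
    row = {}
    for c in dist:
        if c in (a, b):
            continue
        da, db = dist[a][c], dist[b][c]
        if rule == "min":          # single link
            row[c] = min(da, db)
        elif rule == "max":        # complete link
            row[c] = max(da, db)
        elif rule == "average":    # group average, weighted by cluster size
            row[c] = (sizes[a] * da + sizes[b] * db) / (sizes[a] + sizes[b])
        else:
            raise ValueError(f"unknown rule: {rule}")
    return row


# Made-up distances for illustration only.
dist = {
    "C1": {"C1": 0.0, "C2": 4.0, "C5": 6.0},
    "C2": {"C1": 4.0, "C2": 0.0, "C5": 1.0},
    "C5": {"C1": 6.0, "C2": 1.0, "C5": 0.0},
}
sizes = {"C1": 2, "C2": 3, "C5": 1}
print(merged_row(dist, sizes, "C2", "C5", "min"))      # {'C1': 4.0}
print(merged_row(dist, sizes, "C2", "C5", "max"))      # {'C1': 6.0}
print(merged_row(dist, sizes, "C2", "C5", "average"))  # {'C1': 4.5}
```

Because only one row and column change per merge, the rest of the matrix can be kept as it is.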
How to Define Inter-Cluster Similarity
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
  – Ward’s Method uses squared error
[Figure: points p1 to p5 and the proximity matrix indexed by p1, p2, p3, p4, p5, ...]

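The first four definitions can be written directly as functions of two point sets. This is a small sketch assuming Euclidean distance between points; the function names and the sample clusters are hypothetical and chosen only for illustration.

```python
import math
from itertools import product


def d_min(a, b):
    """MIN (single link): distance between the two closest points."""
    return min(math.dist(p, q) for p, q in product(a, b))


def d_max(a, b):
    """MAX (complete link): distance between the two most distant points."""
    return max(math.dist(p, q) for p, q in product(a, b))


def d_group_average(a, b):
    """Group average: mean of all pairwise distances between the clusters."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def d_centroid(a, b):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(a), centroid(b))


# Made-up clusters for illustration.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
for name, fn in [("MIN", d_min), ("MAX", d_max),
                 ("group average", d_group_average), ("centroid", d_centroid)]:
    print(name, round(fn(a, b), 3))
```

By construction the group average always lies between the MIN and MAX values for the same pair of clusters.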
Cluster Similarity: MIN or Single Link
• Similarity of two clusters is based on the two closest points in the different clusters.
  – Determined by one pair of points, i.e., by one link in the proximity graph.
• Can handle non-elliptical shapes.
• Sensitive to noise and outliers.

Hierarchical Clustering: MIN
[Figure: nested clusters and the corresponding dendrogram]

Strength of MIN
[Figure: original points and the two clusters found]

Limitations of MIN
[Figure: original points and the two clusters found]

Cluster Similarity: MAX or Complete Linkage
• Similarity of two clusters is based on the two most distant points in the different clusters.
  – Determined by all pairs of points in the two clusters.
• Tends to break large clusters.
• Less susceptible to noise and outliers.
• Biased towards globular clusters.

Hierarchical Clustering: MAX
[Figure: nested clusters and the corresponding dendrogram]

Strength of MAX
[Figure: original points and the two clusters found]

Limitations of MAX
[Figure: original points and the two clusters found]

Cluster Similarity: Group Average
• Distance between two clusters is the average of the pairwise distances between points in the two clusters.
• Compromise between Single and Complete Link.
• Need to use average connectivity for scalability since total connectivity favors large clusters.
• Less susceptible to noise and outliers.
• Biased towards globular clusters.

Hierarchical Clustering: Group Average
[Figure: nested clusters and the corresponding dendrogram]

Cluster Similarity: Ward’s Method
• Similarity of two clusters is based on the increase in squared error when two clusters are merged.
  – Similar to group average if distance between points is distance squared.
• Less susceptible to noise and outliers.
• Biased towards globular clusters.
• Hierarchical analogue of K-means
  – But Ward’s method does not correspond to a local minimum
  – Can be used to initialize K-means

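The increase in squared error can be computed directly from the definition, or from the well-known closed form n_a * n_b / (n_a + n_b) times the squared distance between the two centroids. The sketch below checks that both agree on a small example; the function names and the sample clusters are assumptions made for illustration.

```python
def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in cluster)


def ward_increase(a, b):
    """Increase in total squared error caused by merging clusters a and b."""
    return sse(a + b) - sse(a) - sse(b)


def ward_increase_closed_form(a, b):
    """Same quantity via n_a * n_b / (n_a + n_b) * ||c_a - c_b||^2."""
    ca, cb = centroid(a), centroid(b)
    gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * gap


# Made-up clusters for illustration.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(4.0, 1.0)]
print(round(ward_increase(a, b), 3), round(ward_increase_closed_form(a, b), 3))
```

The closed form needs only the cluster sizes and centroids, so the merge cost can be evaluated without revisiting every point.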
Hierarchical Clustering: Comparison
[Figure: nested clusters produced by MIN, MAX, Group Average, and Ward’s Method on the same six points]

Hierarchical Clustering: Time and Space Requirements
• O(N^2) space since it uses the proximity matrix.
  – N is the number of points.
• O(N^3) time in many cases.
  – There are N steps, and at each step the proximity matrix, of size N^2, must be updated and searched.
  – By being careful, the complexity can be reduced to O(N^2 log(N)) time for some approaches.

Hierarchical Clustering: Problems and Limitations
• Once a decision is made to combine two clusters, it cannot be undone.
• No objective function is directly minimized.
• Different schemes have problems with one or more of the following:
  – Sensitivity to noise and outliers.
  – Difficulty handling different sized clusters and non-convex shapes.
  – Breaking large clusters.

DBSCAN
• DBSCAN is a density-based algorithm.
  – Density = number of points within a specified radius (Eps)
  – A point is a core point if it has more than a specified number of points (MinPts) within Eps.
      These are points that are at the interior of a cluster.
  – A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
  – A noise point is any point that is not a core point or a border point.

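The three point types can be assigned with a direct, quadratic-time pass over the data. This is a minimal sketch that follows the slide's wording (a core point has more than MinPts points within Eps). Whether the point itself is counted varies between descriptions of DBSCAN; excluding it here is an assumption, as are the function name `classify_points` and the sample data.

```python
import math


def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'.

    A core point has more than min_pts other points within eps (the point
    itself is not counted, which is an assumption of this sketch).
    """
    n = len(points)
    neighbors = [
        [j for j in range(n)
         if j != i and math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbors[i]) > min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors[i]):
            labels.append("border")   # not dense itself, but near a core point
        else:
            labels.append("noise")
    return labels


# Made-up points: a tight square, one point on its fringe, one far away.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (2.2, 1), (8, 8)]
print(classify_points(points, eps=1.5, min_pts=2))
# ['core', 'core', 'core', 'core', 'border', 'noise']
```

Only the labelling step is shown; forming clusters by connecting core points that lie within Eps of each other is left out.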
DBSCAN: Core, Border, and Noise Points
[Figure: core, border, and noise points]

When DBSCAN Works Well
[Figure: original points and the resulting clusters]

DBSCAN: Core, Border and Noise Points
[Figure: original points and their point types (core, border and noise); Eps = 10, MinPts = 4]

DBSCAN: Determining Eps and MinPts
• The idea is that for points in a cluster, their k-th nearest neighbors are at roughly the same distance.
• Noise points have their k-th nearest neighbor at a farther distance.
• So, plot the sorted distance of every point to its k-th nearest neighbor (k = 4 used for 2D points).

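The values behind such a plot can be computed with a brute-force nearest-neighbor search. The sketch below returns the sorted k-th nearest neighbor distances and leaves the plotting itself out; the function name `sorted_k_distances` and the sample data (a 3 x 3 grid plus one distant point) are assumptions made for illustration.

```python
import math


def sorted_k_distances(points, k=4):
    """Distance from each point to its k-th nearest neighbor, sorted ascending."""
    if k >= len(points):
        raise ValueError("k must be smaller than the number of points")
    k_dists = []
    for i, p in enumerate(points):
        # Distances to all other points, nearest first.
        others = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        k_dists.append(others[k - 1])
    return sorted(k_dists)


# Made-up data: a dense 3 x 3 grid and one far-away point.
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
k_dists = sorted_k_distances(points, k=4)
print([round(d, 2) for d in k_dists])
```

In this example the grid points have 4th-neighbor distances of at most 2, while the distant point stands out with a much larger value. A sharp jump of this kind in the sorted curve is what suggests a candidate value for Eps.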
When DBSCAN Does NOT Work Well
[Figure: original points and the clusters found with (MinPts=4, Eps=9.75) and with (MinPts=4, Eps=9.92)]