
1 Clustering – Part II
COSC 526, Class 13
Arvind Ramanathan, Computational Science & Engineering Division, Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266, E-mail: ramanathana@ornl.gov
Slides inspired by: Andrew Moore (CMU), Andrew Ng (Stanford), Tan, Steinbach and Kumar

2 Last Class
K-means Clustering + Hierarchical Clustering
–Basic Algorithms
–Modifications for streaming data
More today:
–Modifications for big data!
DBSCAN Algorithm
Measures for validating clusters…

3 Project Schedule Updates…
No.  Milestone                        Due date
1    Selection of Topics              1/27/2015
2    Project Description & Approach   3/3/2015
3    Initial Project Report           3/31/2015
4    Project Demonstrations           4/16/2015 – 4/19/2015
5    Project Report                   4/23/2015
6    Posters                          4/23/2015
Reports 2–3 will be at most 1–2 pages; the final report will be 5–6 pages, typically in NIPS format.
More to come on posters.

4 Assignment 1 updates
Dr. Sukumar will be here on Tue (Mar 3) to give you an overview of the Python tools.
The deadline for the assignment remains Mar 10.
Assignment 2 will go out on Mar 10.

5 Density-based Spatial Clustering of Applications with Noise (DBSCAN)

6 Preliminaries
Density is defined as the number of points within a radius ε
–In the figure's example, density = 9
A core point has more than a specified number of points (minPts) within radius ε
–Core points are interior to a cluster
A border point has fewer than minPts within radius ε, but lies in the neighborhood of a core point
A noise point is neither a core point nor a border point
(Figure: example with minPts = 4, illustrating a core point, a border point, and a noise point)

7 DBSCAN Algorithm
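A minimal plain-Python sketch of the DBSCAN procedure described in the preliminaries; the function and variable names are illustrative, not taken from the slides:

```python
def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i itself)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [j for j, p in enumerate(points) if dist2(points[i], p) <= eps ** 2]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    labels = [None] * len(points)           # None = not yet visited
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:         # not a core point (may later become a border point)
            labels[i] = -1
            continue
        cluster_id += 1                      # start a new cluster from this core point
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:              # previously labeled noise -> border point
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also a core point: expand through it
                seeds.extend(j_neighbors)
    return labels
```

For example, dbscan([(0, 0), (0, 1), (1, 0), (10, 10)], eps=1.5, min_pts=3) returns [0, 0, 0, -1]: the first three points form one cluster and the last is noise.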

8 Illustration of DBSCAN: Assignment of Core, Border and Noise Points

9 DBSCAN: Finding Clusters

10 Advantages and Limitations
Resistant to noise
Can handle clusters of different sizes and shapes
Eps and MinPts are dependent on each other
–Can be difficult to specify
Clusters of differing density within the same data set can be difficult to find

11 Advantages and Limitations (continued)
Difficult cases:
–Varying-density data
–High-dimensional data

12 How to determine Eps and MinPts
For points within a cluster, the k-th nearest neighbors are at roughly the same distance
Noise points are generally farther from their k-th nearest neighbor
Heuristic: sort the distance of every point to its k-th nearest neighbor and plot the sorted distances; the "knee" of the curve suggests Eps (see the sketch below)
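One way to put this heuristic into practice is a k-distance plot. The sketch below assumes numpy and matplotlib are available and uses a brute-force distance computation, so it is only meant for modest data sizes; the names are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

def k_distance_plot(X, k=4):
    """Plot sorted distances to each point's k-th nearest neighbor.

    A sharp "knee" in this curve is a common heuristic for Eps,
    with MinPts set to k (or k + 1, counting the point itself).
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances (O(n^2) memory; fine for modest n).
    diffs = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=-1))
    # For each point, the k-th smallest distance to another point
    # (column 0 is the distance to the point itself, which is 0).
    kth = np.sort(d, axis=1)[:, k]
    plt.plot(np.sort(kth))
    plt.xlabel("points sorted by k-distance")
    plt.ylabel(f"distance to {k}-th nearest neighbor")
    plt.show()
```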

13 Modifying clustering algorithms to work with large datasets…

14 Last class: MapReduce Approach for K-means…
In the map step:
–Read the cluster centers into memory from a SequenceFile
–For each input key/value pair, iterate over the cluster centers, measure the distances, and keep the nearest center (the one with the lowest distance to the vector)
–Write that cluster center with its vector to the filesystem
In the reduce step (we get the associated vectors for each center):
–Iterate over the value vectors and calculate the average vector (sum the vectors and divide each component by the number of vectors received). This is the new center; save it into a SequenceFile
–Check convergence between the cluster center stored in the key object and the new center; if they are not equal, increment an update counter
What do we do if we do not have sufficient memory, i.e., we cannot hold all of the data?
–Practical scenario: we cannot hold a million high-dimensional data points at a given time
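Before turning to the memory question, here is a compact sketch of the map and reduce logic above, simulated with plain Python functions rather than actual Hadoop/SequenceFile code; the names are illustrative:

```python
from collections import defaultdict

def kmeans_map(point, centers):
    """Map: emit (index of nearest center, point)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(range(len(centers)), key=lambda i: dist2(point, centers[i]))
    return nearest, point

def kmeans_reduce(center_id, points):
    """Reduce: average the vectors assigned to one center to get the new center."""
    n = len(points)
    dim = len(points[0])
    new_center = tuple(sum(p[d] for p in points) / n for d in range(dim))
    return center_id, new_center

def kmeans_iteration(data, centers):
    """One full map/shuffle/reduce pass; returns new centers and a convergence flag."""
    groups = defaultdict(list)
    for point in data:                      # "map" + "shuffle"
        cid, p = kmeans_map(point, centers)
        groups[cid].append(p)
    new_centers = list(centers)
    for cid, pts in groups.items():         # "reduce"
        _, new_centers[cid] = kmeans_reduce(cid, pts)
    converged = all(tuple(nc) == tuple(c) for nc, c in zip(new_centers, centers))
    return new_centers, converged
```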

15 Bradley–Fayyad–Reina (BFR) Algorithm
BFR is a variant of k-means designed to work with very large, disk-resident data
Assumes clusters are normally distributed around a centroid in a Euclidean space
–Standard deviation in different dimensions may vary
–Clusters are axis-aligned ellipses
Efficient way to summarize clusters
–Memory required should be O(clusters), not O(data)

16 How does BFR work?
Rather than keeping all of the data, BFR maintains only summary statistics:
–Cluster summaries
–Outliers
–Points still to be clustered
The memory needed therefore scales with the number of clusters rather than the number of points

17 Details of the BFR Algorithm
1. Initialize k clusters
2. Load a bag of points from disk
3. Assign new points to one of the k original clusters if they are within some distance threshold of the cluster
4. Cluster the remaining points, creating new clusters
5. Try to merge the new clusters from step 4 with any of the existing clusters
6. Repeat steps 2–5 until all points have been examined

18 Details, details, details of BFR
Points are read from disk one main-memory load at a time
Most points from previous memory loads are summarized by simple statistics
From the initial load, we select k centroids in one of three ways (see the sketch of the third option below):
–Take k random points
–Take a small random sample and cluster it optimally
–Take a sample; pick a random point, and then k−1 more points, each as far as possible from the previously selected points
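The third initialization option might look like the following; this is a plain-Python sketch with illustrative names, not the BFR authors' code:

```python
import random

def farthest_point_init(sample, k, seed=0):
    """Pick k initial centroids from a sample: one at random, then repeatedly
    the sample point whose distance to its nearest chosen centroid is largest."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    rng = random.Random(seed)
    centroids = [rng.choice(sample)]
    while len(centroids) < k:
        farthest = max(sample, key=lambda p: min(dist2(p, c) for c in centroids))
        centroids.append(farthest)
    return centroids
```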

19 Three classes of points…
3 sets of points, which we keep track of:
Discard Set (DS):
–Points close enough to a centroid to be summarized
Compression Set (CS):
–Groups of points that are close together but not close to any existing centroid
–These points are summarized, but not assigned to a cluster
Retained Set (RS):
–Isolated points waiting to be assigned to a compression set

20 Using the Galaxies Picture to learn BFR
(Figure: points and centroids, with regions marked as discard set, compressed set, and retained/reject set)
Discard set (DS): close enough to a centroid to be summarized
Compression set (CS): summarized, but not assigned to a cluster
Retained set (RS): isolated points

21 Summarizing the sets of points
For each cluster, the discard set (DS) is summarized by:
–The number of points, N
–The vector SUM, whose i-th component is the sum of the point coordinates in the i-th dimension
–The vector SUMSQ, whose i-th component is the sum of squares of the coordinates in the i-th dimension

22 More details on summarizing the points
2d + 1 values represent any cluster, where d is the number of dimensions
The average in each dimension (the centroid) can be calculated as SUM_i / N, where SUM_i is the i-th component of SUM
The variance of a cluster's discard set in dimension i is SUMSQ_i / N − (SUM_i / N)^2, and the standard deviation is its square root
Note: dropping the "axis-aligned clusters" assumption would require storing a full covariance matrix to summarize the cluster; SUMSQ would then be a d × d matrix instead of a d-dimensional vector, which is too big!
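These 2d + 1 statistics, the centroid, and the per-dimension variance translate directly into a small data structure; a minimal sketch in plain Python with illustrative names:

```python
class ClusterSummary:
    """Summary of a BFR discard set: N, SUM and SUMSQ per dimension (2d + 1 numbers)."""

    def __init__(self, d):
        self.n = 0
        self.sum = [0.0] * d
        self.sumsq = [0.0] * d

    def add(self, point):
        """Fold one point into the summary; the point itself can then be discarded."""
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # Var_i = SUMSQ_i / N - (SUM_i / N)^2
        return [sq / self.n - (s / self.n) ** 2 for sq, s in zip(self.sumsq, self.sum)]

    def merge(self, other):
        # Merging two summaries is just component-wise addition of the statistics.
        self.n += other.n
        self.sum = [a + b for a, b in zip(self.sum, other.sum)]
        self.sumsq = [a + b for a, b in zip(self.sumsq, other.sumsq)]
```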

23 The “Memory Load” of Points and Clustering
Step 3: Find the points that are “sufficiently close” to a cluster centroid and add them to that cluster and to the DS
–These points are so close to the centroid that they can be summarized and then discarded
Step 4: Use any in-memory clustering algorithm to cluster the remaining points together with the old RS
–Clusters found here go to the CS; points left on their own go to the RS

24 More on Memory Load of points
DS set: Adjust the statistics of the clusters to account for the new points
–Add the Ns, SUMs, and SUMSQs
Consider merging compressed sets in the CS
If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster

25 A few more details…
How do we decide whether a point is “close enough” to a cluster to add it to that cluster (and discard it)?
BFR suggests two criteria:
–The Mahalanobis distance to the nearest centroid is below a threshold
–The point has a high likelihood of belonging to the currently nearest centroid

26 Second way to define a “close” point
Normalized Euclidean distance from the centroid
For point (x1, x2, …, xd) and centroid (c1, c2, …, cd):
–Normalize in each dimension: y_i = (x_i − c_i) / σ_i, where σ_i is the cluster's standard deviation in dimension i
–Take the sum of the squares of the y_i
–Take the square root
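A sketch of this computation, assuming the per-dimension standard deviations have been derived from the cluster summary (e.g., the ClusterSummary sketch above); the threshold test shown is one plausible reading of the “2 standard deviations” rule on the next slide:

```python
import math

def normalized_distance(point, centroid, std):
    """Mahalanobis-style distance assuming axis-aligned (diagonal) covariance:
    sqrt( sum_i ((x_i - c_i) / sigma_i)^2 )."""
    return math.sqrt(sum(((x - c) / s) ** 2
                         for x, c, s in zip(point, centroid, std)))

def close_enough(point, centroid, std, threshold_sds=2.0):
    """Accept the point if its distance is below threshold_sds 'standard deviations',
    where one standard deviation corresponds to sqrt(d) after normalization."""
    d = len(point)
    return normalized_distance(point, centroid, std) < threshold_sds * math.sqrt(d)
```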

27 More on Mahalanobis Distance
If clusters are normally distributed in d dimensions, then after this normalization one standard deviation corresponds to a distance of sqrt(d)
–i.e., about 68% of a cluster's points lie within a Mahalanobis distance of sqrt(d)
Accept a point for a cluster if its Mahalanobis distance is below some threshold, e.g., 2 standard deviations

28 Should 2 CS clusters be combined?
Compute the variance of the combined subcluster
–N, SUM, and SUMSQ allow us to make that calculation quickly
Combine if the combined variance is below some threshold
Many alternatives: treat dimensions differently, consider density
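Because the N, SUM, and SUMSQ of the combined subcluster are just the sums of the two summaries, the test needs no access to the original points; a sketch with illustrative names, reusing the summary fields from earlier:

```python
def combined_variance(n1, sum1, sumsq1, n2, sum2, sumsq2):
    """Per-dimension variance of the union of two summarized subclusters."""
    n = n1 + n2
    return [
        (q1 + q2) / n - ((s1 + s2) / n) ** 2
        for q1, q2, s1, s2 in zip(sumsq1, sumsq2, sum1, sum2)
    ]

def should_merge(summary_a, summary_b, max_variance):
    """Merge two CS subclusters if no dimension of the combined variance exceeds
    the threshold (one simple rule among the alternatives mentioned above)."""
    var = combined_variance(summary_a.n, summary_a.sum, summary_a.sumsq,
                            summary_b.n, summary_b.sum, summary_b.sumsq)
    return max(var) <= max_variance
```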

29 Limitations of BFR…
Makes strong assumptions about the data:
–Data must be normally distributed
–Does not work with non-linearly separable data
–Works only with axis-aligned clusters
Real-world datasets are rarely this way

30 Clustering Using Representatives (CURE)

31 CURE: Efficient clustering approach
A robust clustering approach that handles outliers better
Employs a hierarchical clustering approach:
–A middle ground between the centroid-based and all-points (MAX) extremes
Can handle different types of data

32 CURE Algorithm
CURE(points, k)
1. Similar to a hierarchical clustering approach, but it uses a set of scattered sample points as the cluster representative rather than every point in the cluster
2. First set a target sample size c; then select c well-scattered sample points from the cluster
3. The chosen scattered points are shrunk toward the centroid by a fraction α, where 0 < α < 1

33 CURE clustering procedure
4. These representative points are used as the cluster's points in a d_min (nearest-representative) cluster-merging approach
5. After each merge, c sample points are selected from the original representatives of the merged clusters to represent the new cluster
6. Cluster merging stops when the target of k clusters is reached
(Figure: repeated nearest-cluster merges)
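Steps 2 and 3 of the procedure (select c well-scattered points, then shrink them toward the centroid by a fraction α) might be sketched as follows in plain Python; the names are illustrative and this is not the original CURE implementation:

```python
def cure_representatives(cluster_points, c=10, alpha=0.3):
    """Return c scattered points of a cluster, each shrunk toward the centroid by alpha."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    n = len(cluster_points)
    dim = len(cluster_points[0])
    centroid = tuple(sum(p[i] for p in cluster_points) / n for i in range(dim))
    # Farthest-first selection gives well-scattered representatives.
    scattered = [max(cluster_points, key=lambda p: dist2(p, centroid))]
    while len(scattered) < min(c, n):
        scattered.append(max(cluster_points,
                             key=lambda p: min(dist2(p, r) for r in scattered)))
    # Shrink each representative toward the centroid by a fraction alpha (0 < alpha < 1).
    return [tuple(x + alpha * (cx - x) for x, cx in zip(p, centroid))
            for p in scattered]
```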

34 Other tweaks in the CURE algorithm
Use random sampling:
–The data cannot all be stored in memory at once
–Similar to the core-set idea
Partitioning and two-pass clustering (reduces compute time):
–First, divide the n data points into p partitions, each containing n/p points
–Then pre-cluster each partition until n/(pq) clusters remain in it, for some q > 1
–The clusters from this first pass are then used as the input to a second clustering pass that forms the final clusters

35 Space and Time Complexity
Worst-case time complexity:
–O(n^2 log n)
Space complexity:
–O(n), using a k-d tree for insertion/updating

36 Example of how it works

37 How to validate clustering approaches?

38 Cluster validity
For supervised learning:
–We had a class label, which meant we could identify how good our training and testing errors were
–Metrics: accuracy, precision, recall
For clustering:
–How do we measure the “goodness” of the resulting clusters?

39 Clustering random data (overfitting)
If you ask a clustering algorithm to find clusters, it will find some.

40 Different aspects of validating clusters
Determining the clustering tendency of a set of data, i.e., whether non-random structure actually exists in the data (e.g., to avoid overfitting)
External validation: comparing the results of a cluster analysis to externally known class labels (ground truth)
Internal validation: evaluating how well the results of a cluster analysis fit the data without reference to external information
Comparing clusterings to determine which is better
Determining the ‘correct’ number of clusters

41 Measures of cluster validity
External index: used to measure the extent to which cluster labels match externally supplied class labels
–Entropy, purity, Rand index
Internal index: used to measure the goodness of a clustering structure without respect to external information
–Sum of Squared Error (SSE), silhouette coefficient
Relative index: used to compare two different clusterings or clusters
–Often an external or internal index is used for this function, e.g., SSE or entropy

42 Measuring Cluster Validity with Correlation
Proximity matrix vs. incidence matrix:
–The incidence matrix K_ij is 1 if points i and j belong to the same cluster, 0 otherwise
Compute the correlation between the two matrices:
–Only n(n−1)/2 entries need to be computed
–High values indicate that points in the same cluster are close to each other
Not well suited for density-based clustering
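A sketch of this correlation measure, assuming numpy; it uses similarity (negative distance) as the proximity so that a high positive correlation indicates that points in the same cluster are close together:

```python
import numpy as np

def cluster_label_correlation(X, labels):
    """Correlation between pairwise proximity and same-cluster incidence."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    diffs = X[:, None, :] - X[None, :, :]
    proximity = -np.sqrt((diffs ** 2).sum(axis=-1))            # similarity = -distance
    incidence = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices(len(X), k=1)                          # only the n(n-1)/2 unique pairs
    return np.corrcoef(proximity[iu], incidence[iu])[0, 1]
```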

43 Another approach: use the similarity matrix for cluster validation

44 Internal Measures: SSE
SSE is also a good measure of how good a clustering is
–Lower SSE → better clustering
Can be used to estimate the number of clusters
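SSE follows directly from its definition: the sum of squared distances from each point to its assigned centroid. A minimal sketch assuming numpy, with a note on using it to pick the number of clusters (names are illustrative):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    labels = np.asarray(labels)
    return float(((X - centroids[labels]) ** 2).sum())

# Typical use: run k-means for several values of k, plot sse(...) against k, and
# look for the "elbow" where further increases in k give only small reductions in SSE.
```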

45 More on Clustering a little later…
We will discuss other forms of clustering in the following classes
Next class:
–Please bring your brief write-up on the two papers
–We will discuss frequent itemset mining and a few other aspects of clustering
–Move on to Dimensionality Reduction

