
1 Clustering COSC 526 Class 12 Arvind Ramanathan Computational Science & Engineering Division Oak Ridge National Laboratory, Oak Ridge Ph: 865-576-7266 E-mail: ramanathana@ornl.gov Slides inspired by: Andrew Moore (CMU), Andrew Ng (Stanford), Tan, Steinbach and Kumar

2 2 Assignment 1: Your first hand at random walks The write-up is here… Pair up and do the assignment –It helps to work in small teams –Maximize your productivity Most of the assignment and its notes are in the handouts (class web-page)

3 3 Clustering: Basics…

4 4 Clustering Finding groups of items (or objects) in a data set that are related to one another and different from the items in other groups Inter-cluster distances are maximized Intra-cluster distances are minimized

5 5 Applications Grouping regions together based on precipitation Grouping genes together based on expression patterns in cells Finding ensembles of folded/unfolded protein structures

6 6 What is not clustering? Supervised classification –uses class label information Simple segmentation –Dividing students into different registration groups (alphabetically, by major, etc.) Results of a query –Grouping is the result of an external specification Graph partitioning –Some overlap with clustering, but the areas are not identical… Take Home Message: Clustering is essentially driven by the data at hand!! The meaning or interpretation of the clusters should also be driven by the data!!

7 7 The notion of a cluster can be ambiguous How do we decide between 8 clusters and 2 clusters for the same data?

8 8 Types of Clustering Partitional Clustering –A division of the data into non-overlapping subsets (clusters) such that each data point is in exactly one subset Hierarchical Clustering –A set of nested clusters organized as a hierarchical tree (figure: points p1–p6 shown as a partitional clustering and as a nested, hierarchical clustering)

9 9 Other types of distinctions… Exclusive vs. Non-exclusive: –In non-exclusive clustering, points may belong to multiple clusters Fuzzy vs. Non-fuzzy: –A point may belong to every cluster with a weight between 0 and 1 –Similar to probabilistic clustering Partial vs. Complete: –We may want to cluster only some of the data Heterogeneous vs. Homogeneous: –Clusters of widely different sizes…

10 10 Types of Clusters Well-separated clusters Center-based clusters Contiguous clusters Density based clusters Property/conceptual Described by an objective function set of points such that any point in a cluster is closer to every other point in the cluster than to any point not in the cluster

11 11 Types of Clusters Well-separated clusters Center-based clusters Contiguous clusters Density based clusters Property/conceptual Described by an objective function A cluster is a set of objects such that each object in the cluster is closer to the center of its own cluster (the centroid) than to the center of any other cluster…

12 12 Types of Clusters Well-separated clusters Center-based clusters Contiguous clusters Density based clusters Property/conceptual Described by an objective function Nearest neighbor or transitive…

13 13 Types of Clusters Well-separated clusters Center-based clusters Contiguous clusters Density based clusters Property/conceptual Described by an objective function A cluster is a dense region of points separated by low-density regions Used when clusters are irregular and when noise/outliers are present

14 14 Types of Clusters Well-separated clusters Center-based clusters Contiguous clusters Density based clusters Property/conceptual Described by an objective function Find clusters that share a common property or representation, e.g., taste, smell, …

15 15 Types of Clusters Well-separated clusters Center-based clusters Contiguous clusters Density based clusters Property/conceptual Described by an objective function Find clusters that minimize or maximize an objective function –Enumerate all possible ways of dividing the points into clusters and evaluate the goodness of each candidate set of clusters with the objective function (an NP-hard problem) –Global vs. local objectives: hierarchical clustering algorithms typically have local objectives, while partitional algorithms typically have global objectives

16 16 More on objective functions… (1) Objective functions tend to map the clustering problem to a different domain and solve a related problem: –E.g., defining a proximity matrix as a weighted graph –Clustering is equivalent to breaking the graph into connected components –Minimize the edge weight between clusters and maximize edge weight within clusters

17 17 More on objective functions… (2) Best clustering usually minimizes/maximizes an objective function Mixture models assume that the data is a mixture of a number of parametric statistical distributions (e.g., Gaussians)

18 18 Characteristics of input data Type of proximity or density measure –derived measure, central to clustering Sparseness: –Dictates type of similarity –Adds to efficiency Attribute Type: –Dictates type of similarity Type of data: –Dictates type of similarity Dimensionality Noise and outliers Type of distribution

19 19 Clustering Algorithms: K-means Clustering Hierarchical Clustering Density-based Clustering

20 20 K-means Clustering Partitional clustering: –Each cluster is associated with a centroid –Each point is assigned to the cluster with the closest centroid –We need to specify the total number of clusters, k, as one of the inputs Simple Algorithm K-means Algorithm 1: Select K points as the initial centroids 2: repeat 3: Form K clusters by assigning every point to its closest centroid 4: Recompute the centroid of each cluster 5: until the centroids don't change
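A minimal NumPy sketch of the algorithm above (not code from the course; the function name, tolerance, and random seeding are illustrative assumptions):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Lloyd's algorithm on an (n, d) array X with k clusters."""
    rng = np.random.default_rng(seed)
    # 1: select K points as the initial centroids
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 3: form K clusters by assigning each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 4: recompute the centroid of each cluster (keep the old one if a cluster is empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # 5: stop when the centroids no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

Empty clusters keep their old centroid in this sketch; a common alternative is to re-seed them with the point farthest from its current centroid.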

21 21 K-means Clustering Initial centroids are chosen randomly: –the resulting clusters can vary depending on how you started The centroid is the mean of the points in the cluster "Closeness" is usually measured with Euclidean distance K-means will typically converge quickly –Points stop changing assignments –Another stopping criterion: only a few points change clusters Time complexity O(nKId) –n: number of points; K: number of clusters –I: number of iterations; d: number of attributes

22 22 K-means example

23 23 How to initialize (seed) K-means? If there are K "real" clusters, then the chance of randomly selecting exactly one centroid from each cluster is small –The chance becomes very small as K grows –Assume the clusters all have the same size, say m points each –For K = 10, P ≈ 0.00036 (really small!!) The choice of initial centroids can have a deep impact on the clusters that are found…
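Where the 0.00036 comes from (a standard back-of-the-envelope count, assuming K equal-sized clusters of m points each and seeds drawn uniformly at random):

$$
P = \frac{\text{ways to select one centroid from each cluster}}{\text{ways to select } K \text{ centroids}}
  = \frac{K!\, m^{K}}{(K m)^{K}} = \frac{K!}{K^{K}},
\qquad K = 10 \;\Rightarrow\; P = \frac{10!}{10^{10}} \approx 0.00036 .
$$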

24 24 Choosing K

25 25 What are the solutions for this problem? Multiple runs!! –Usually helps Sample the points so that you can guesstimate the number of clusters –Depends on how we have sampled –We may have sampled outliers in the data Select more than k initial centroids and then pick k of them –Choose the k centroids that are most widely separated

26 26 How to evaluate k-means clusters The most common measure is the sum of squared errors (SSE) Given two clustering outputs from k-means, we can choose the one with the lower error Only compare clusterings with the same K Important side note: K-means is a heuristic for minimizing SSE
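The SSE formula itself was an image on the slide and did not survive the transcript; the standard definition (with C_i the i-th cluster, m_i its centroid, and dist the Euclidean distance) is:

$$
\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \operatorname{dist}(m_i, x)^2
$$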

27 27 Pre-processing and Post-processing Pre-processing: –normalize the data (e.g., scale the data to unit standard deviation) –eliminate outliers Post-processing: –Eliminate small clusters that may represent outliers –Split clusters that have a high SSE –Merge clusters that have a low SSE

28 28 Limitations of using K-means K-means can have problems when the data has: –different sizes –different densities –non-globular shapes –outliers!

29 29 How does this scale… (for MapReduce) In the map step: –Read the cluster centers into memory from a SequenceFile –For each input key/value pair, iterate over the cluster centers, measure the distances, and keep the nearest center (the one with the lowest distance to the vector) –Write that cluster center with its vector to the file system In the reduce step (we receive the associated vectors for each center): –Iterate over the value vectors and calculate the average vector (sum the vectors and divide each component by the number of vectors received). This is the new center; save it into a SequenceFile –Check convergence between the cluster center stored in the key object and the new center; if they are not equal, increment an update counter
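A plain-Python sketch of those two steps (this is not Mahout or actual Hadoop code; SequenceFile I/O, keys, and counters are replaced by ordinary Python objects for illustration):

```python
import numpy as np

def kmeans_map(point, centers):
    """Map step: emit (index of the nearest center, the point itself)."""
    dists = np.linalg.norm(centers - point, axis=1)
    return int(dists.argmin()), point

def kmeans_reduce(center_id, points, old_centers, tol=1e-6):
    """Reduce step: average the vectors assigned to one center and
    check whether that center has converged."""
    new_center = np.mean(points, axis=0)   # sum the vectors, divide by their count
    converged = np.allclose(new_center, old_centers[center_id], atol=tol)
    return new_center, converged

# One simulated iteration over a toy dataset
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
groups = {}
for x in X:
    cid, vec = kmeans_map(x, centers)
    groups.setdefault(cid, []).append(vec)
new_centers = np.array([kmeans_reduce(cid, pts, centers)[0]
                        for cid, pts in sorted(groups.items())])
```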

30 30 Making k-means streaming Two broad approaches: –Solving k-means online as the points arrive: Guha, Mishra, Motwani, O'Callaghan (2001) Charikar, O'Callaghan, and Panigrahy (2003) Braverman, Meyerson, Ostrovsky, Roytman, Shindler, and Tagiku (2011) –Solving k-means using weighted coresets: Select a small sample of points that are weighted Weights are chosen so that the k-means solution on the subset is similar to the solution on the original dataset

31 31 Fast Streaming K-means Shindler, Wong, Meyerson, NIPS (2011) Shindler, NIPS presentation (2011)

32 32 Fast Streaming K-means Intuition on why this works: the probability that point x starts a new cluster is proportional to its distance from the nearest existing center –referred to as a "facility" here Costliest step: measuring δ –Use approximate nearest-neighbor algorithms Space complexity: Ω(k log n) –You are only storing neighborhood info –Use hashing and metric embedding (not discussed) Time complexity: o(nk) Shindler, Wong, Meyerson, NIPS (2011)

33 33 Hierarchical Clustering

34 34 Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be conveniently visualized as a dendrogram: –a tree-like representation that records the sequence of merges (or splits)

35 35 Types of Hierarchical Clustering Agglomerative Clustering: –Start with each point as an individual cluster (a leaf) –At each step, merge the closest pair of clusters, until only one cluster (or k clusters) remains Divisive Clustering: –Start with one, all-inclusive cluster –At each step, split a cluster, until each cluster contains a single point (or there are k clusters) Traditional hierarchical clustering: –uses a similarity or distance matrix –merges or splits one cluster at a time

36 36 Agglomerative Clustering One of the more popular algorithms Basic algorithm is straightforward Agglomerative Clustering Algorithm 1: Compute the distance matrix 2: Let each data point be a cluster 3: repeat 4: Merge the two closest clusters 5: Update the distance matrix 6: until only a single cluster remains Key operation is the computation of the proximity of two clusters → different approaches to defining the distance between clusters distinguish the different algorithms
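A short SciPy illustration of the same loop (SciPy's linkage performs steps 1–6 internally; the toy data and the choice of average linkage are just for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster  # dendrogram also available

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)),   # two well-separated blobs
               rng.normal(3.0, 0.3, (20, 2))])

# Agglomerative clustering; `method` selects how inter-cluster distance is defined
# ('single', 'complete', 'average' (UPGMA), 'ward', ...)
Z = linkage(X, method='average')

# Cut the tree to obtain a flat clustering with 2 clusters
labels = fcluster(Z, t=2, criterion='maxclust')
```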

37 37 Starting Situation Start with clusters of individual data points and a distance matrix (figure: a distance matrix with one row and column per point p1, p2, …, pN)

38 38 Next step: Group points… After merging a few of these data points, we have intermediate clusters C1–C5 and a distance matrix defined over those clusters (figure: clusters C1–C5 and the corresponding 5×5 distance matrix)

39 39 Next step: Merge clusters… We now merge the two closest of the clusters C1–C5 and must update the distance matrix (figure: clusters C1–C5, with the two closest clusters being merged)

40 40 How to merge and update the distance matrix? Measures of similarity: –Min –Max –Group average –Distance between centroids –Other methods driven by an objective function How do these choices affect the resulting clustering?

41 41 Defining inter-cluster similarity Min (single link) Max (complete link) Group Average (average link) Distance between centroids
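For two clusters A and B with point-level distance d(x, y), these four criteria are usually written as:

$$
\begin{aligned}
d_{\min}(A,B)  &= \min_{x \in A,\; y \in B} d(x,y) && \text{(single link)}\\
d_{\max}(A,B)  &= \max_{x \in A,\; y \in B} d(x,y) && \text{(complete link)}\\
d_{\mathrm{avg}}(A,B) &= \frac{1}{|A|\,|B|} \sum_{x \in A} \sum_{y \in B} d(x,y) && \text{(group average)}\\
d_{\mathrm{cent}}(A,B) &= d(\mu_A, \mu_B) && \text{(distance between centroids)}
\end{aligned}
$$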

42 42 Single Link Can handle non-spherical/non-convex clusters, but is sensitive to noise and outliers

43 43 Complete Link Clustering Better suited for datasets with noise Tends to form smaller clusters Biased toward more globular clusters

44 44 Average link / Unweighted Pair Group Method using Arithmetic Averages (UPGMA) Compromise between single and complete linkage Works generally well in practice

45 45 How do we say when two clusters are similar? Ward's method: –Similarity of two clusters is based on the increase in SSE when the two clusters are merged Advantages: –Less susceptible to errors/outliers in the data –Hierarchical analogue of K-means –Can be used to initialize K-means Disadvantage: –Biased toward more globular clusters
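For Euclidean data, the increase in SSE from merging clusters A and B has a convenient closed form (with μ_A and μ_B the cluster means):

$$
\Delta(A,B) = \mathrm{SSE}(A \cup B) - \mathrm{SSE}(A) - \mathrm{SSE}(B)
            = \frac{|A|\,|B|}{|A| + |B|}\,\lVert \mu_A - \mu_B \rVert^2
$$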

46 46 Space and Time Complexity Space complexity: O(N²) –N is the number of data points –N² entries in the distance matrix Time complexity: O(N³) –In many cases: N steps for tree construction, and at each step a distance matrix with O(N²) entries must be updated –Complexity can be reduced to O(N² log N) in some cases

47 47 Let's talk about Scaling! A specific type of hierarchical clustering algorithm: –UPGMA (average linkage) –Most widely used in the bioinformatics literature However, it is impractical to scale to an entire genome! –Needs the whole distance/dissimilarity matrix in memory: O(N²)! –How can we exploit sparsity?

48 48 Problem of interest… Given a large number of sequences, and a way to determine how similar two or more sequences are, we have a pairwise matrix → a dissimilarity matrix Build a hierarchical clustering routine for understanding how proteins (or other bio-molecules) have evolved

49 49 The problem with UPGMA: The distance matrix computation is expensive We are computing the arithmetic mean of the distances between sequences –This is not well defined when the inputs are sparse The triangle inequality is not satisfied, given how we have defined the way clusters are built…

50 50 Strategy to scale this up for Big Data Two aspects to handle: –Missing edges –Sparsity in the distance matrix Use a detection threshold ψ for the missing edge data: we complete the "missing" values in the dissimilarity matrix D using ψ!

51 51 Sparse UPGMA: Speeding things up Space: O(E), noting E ≪ N² Time: O(E log V) Still expensive, since E can be arbitrarily large! How do we deal with this?

52 52 Streaming for Sparsity: Multi-round Memory Constrained UPGMA (MC-UPGMA) Two components needed: –Memory-constrained clustering unit: Holds only the subset of the edges E that needs to be processed in the current round –Memory-constrained merging unit: Ensures we get only valid edges Space is only O(N), depending on how many sequences we have to load at any given time… Time: O(E log V)

53 53 Limitations of Hierarchical Clustering Greedy: once we make a merge decision, it usually cannot be undone –Or it can be expensive to undo –Methods exist to alleviate this No global objective function is being minimized or maximized Different hierarchical clustering schemes have their own limitations: –Sensitivity to noise and outliers –Difficulty in handling clusters of different shapes –Chaining, breaking of clusters…

54 54 Density-based Spatial Clustering of Applications with Noise (DBSCAN)

55 55 Preliminaries Density is defined as the number of points within a radius ε of a point –In the example figure, density = 9 A core point has more than a specified number of points (minPts) within ε –Core points are interior to a cluster A border point has fewer than minPts within ε, but lies in the neighborhood of a core point A noise point is any point that is neither a core point nor a border point (figure: ε-neighborhoods with minPts = 4, illustrating a core point, a border point, and a noise point)

56 56 DBSCAN Algorithm
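The algorithm listing on this slide was an image; as a stand-in, here is how the same clustering can be run with scikit-learn's DBSCAN (the eps, min_samples, and toy data are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, (50, 2)),     # dense blob
               rng.normal(4.0, 0.2, (50, 2)),     # second dense blob
               rng.uniform(-2.0, 6.0, (10, 2))])  # scattered noise

db = DBSCAN(eps=0.5, min_samples=4).fit(X)
labels = db.labels_                               # label -1 marks noise points
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True         # True for core points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```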

57 57 Illustration of DBSCAN: Assignment of Core, Border and Noise Points

58 58 DBSCAN: Finding Clusters

59 59 Advantages and Limitations Resistant to noise Can handle clusters of different sizes and shapes Eps and MinPts depend on each other –and can be difficult to specify Clusters of differing density within the same data set can be difficult to find

60 60 Advantages and Limitations DBSCAN struggles with: –data of varying density –high-dimensional data

61 61 How to determine Eps and MinPts For points within a cluster, the k-th nearest neighbors are at roughly the same distance Noise points are farther away in general So: plot the sorted distance of every point to its k-th nearest neighbor and look for the sharp increase
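A sketch of that k-distance plot with scikit-learn (the data and k = MinPts = 4 are illustrative; note the extra neighbor requested so that each point's zero distance to itself is skipped):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, (50, 2)),
               rng.normal(4.0, 0.2, (50, 2)),
               rng.uniform(-2.0, 6.0, (10, 2))])

k = 4                                            # use k = MinPts
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own nearest neighbor
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, k])                    # distance of every point to its k-th neighbor
# Plot k_dist; the "knee" where the curve rises sharply is a reasonable Eps:
# import matplotlib.pyplot as plt; plt.plot(k_dist); plt.show()
```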

62 62 How do we validate clusters?

63 63 Cluster validity For supervised learning: –we had a class label, –which meant we could identify how good our training and testing errors were –Metric: Accuracy, Precision, Recall For clustering: –How do we measure the “goodness” of the resulting clusters?

64 64 Clustering random data (overfitting) If you ask a clustering algorithm to find clusters, it will find some

65 65 Different aspects of validating clusters Determine the clustering tendency of a set of data, i.e., whether non-random structure actually exists in the data (e.g., to avoid overfitting) External validation: compare the results of a cluster analysis to externally known class labels (ground truth) Internal validation: evaluate how well the results of a cluster analysis fit the data without reference to external information Compare clusterings to determine which is better Determine the 'correct' number of clusters

66 66 Measures of cluster validity External Index: Used to measure the extent to which cluster labels match externally supplied class labels. –Entropy, Purity, Rand Index Internal Index: Used to measure the goodness of a clustering structure without respect to external information. – Sum of Squared Error (SSE), Silhouette coefficient Relative Index: Used to compare two different clusterings or clusters. –Often an external or internal index is used for this function, e.g., SSE or entropy
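A small scikit-learn example of one external and one internal index (adjusted Rand index and silhouette; the synthetic data and k = 3 are assumptions for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("External (needs ground-truth labels): ARI =", adjusted_rand_score(y_true, labels))
print("Internal (data only): silhouette =", silhouette_score(X, labels))
```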

67 67 Measuring Cluster Validity with Correlation Proximity matrix vs. incidence matrix: –The incidence matrix K has K_ij = 1 if points i and j belong to the same cluster, and 0 otherwise Compute the correlation between the two matrices: –Only n(n-1)/2 values need to be computed –A high correlation (in magnitude) indicates that points in the same cluster are close to each other Not well suited for density-based clustering
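A sketch of the computation (the helper name is hypothetical; since it uses distances rather than similarities, a good clustering shows up as a strongly negative correlation):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def clustering_correlation(X, labels):
    """Correlation between the proximity (distance) matrix and the incidence
    matrix (1 if two points share a cluster, else 0), over the n(n-1)/2 pairs."""
    labels = np.asarray(labels)
    prox = pdist(X)                                      # condensed distance matrix
    same = (labels[:, None] == labels[None, :]).astype(float)
    incidence = squareform(same, checks=False)           # condensed form, diagonal dropped
    return np.corrcoef(prox, incidence)[0, 1]
```

On well-separated data with sensible labels the value is strongly negative; random labels give a value near zero.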

68 68 Another approach: use similarity matrix for cluster validation

69 69 Internal Measures: SSE SSE is also a good measure of how good the clustering is –Lower SSE → better clustering Can be used to estimate the number of clusters
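A quick sketch of using SSE to pick the number of clusters (scikit-learn's inertia_ is exactly this SSE; the data and the range of k are assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
sse = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 11)}
# Plot sse vs. k and look for the "elbow" where adding more clusters
# stops reducing the error substantially.
```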

70 70 More on Clustering a little later… We will discuss other forms of clustering in the following classes Next class: –please bring your brief write up on the two papers

