Clustering – Part III: Spectral Clustering
COSC 526, Class 14
Arvind Ramanathan, Computational Science & Engineering Division, Oak Ridge National Laboratory, Oak Ridge
Slides inspired by Andrew Moore (CMU) and Jure Leskovec (Stanford); several slides adapted from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets
2 Last Class
DBSCAN:
–Practical clustering algorithm that works even with difficult datasets
–Scales to large datasets
–Works well with noise
3 Spectral Clustering
4 Graph Partitioning
Undirected graph G(V, E)
Bi-partitioning task:
–Divide the vertices into two disjoint groups A and B
Questions:
–How can we define a "good" partition of G?
–How can we efficiently identify such a partition?
5 Graph Partitioning
What makes a good partition?
–Maximize the number of within-group connections
–Minimize the number of between-group connections
6 Graph Cuts
Express partitioning objectives as a function of the "edge cut" of the partition.
Cut: set of edges with exactly one endpoint in each group:
cut(A,B) = Σ_{i∈A, j∈B} w_ij
7 Graph Cut Criterion
Criterion: minimum cut
–Minimize the weight of connections between groups: arg min_{A,B} cut(A,B)
Degenerate case: the minimum cut often just slices off a single node
Problem:
–Only considers external cluster connections
–Does not consider internal cluster connectivity
[Figure: "optimal cut" vs. degenerate "minimum cut"]
8 Graph Cut Criteria [Shi-Malik]
Criterion: normalized cut [Shi-Malik, 1997]
–Connectivity between groups relative to the density of each group:
ncut(A,B) = cut(A,B)/vol(A) + cut(A,B)/vol(B)
–vol(A): total weight of the edges with at least one endpoint in A: vol(A) = Σ_{i∈A} d_i
Intuition for this criterion:
–Produces more balanced partitions
How do we efficiently find a good partition?
–Problem: computing the optimal cut is NP-hard
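To make the criterion concrete, here is a minimal numpy sketch (not from the slides; the example graph and helper name are mine) that evaluates ncut for a candidate bi-partition:

```python
import numpy as np

def ncut(W, mask):
    """W: symmetric weight matrix; mask: boolean vector, True for nodes in A."""
    A, B = mask, ~mask
    cut = W[A][:, B].sum()    # total weight crossing the partition
    vol_A = W[A, :].sum()     # total weight of edges touching A
    vol_B = W[B, :].sum()
    return cut / vol_A + cut / vol_B

# Two triangles joined by a single edge: a natural 2-way split.
W = np.zeros((6, 6))
for i, j in [(0,1),(1,2),(0,2),(3,4),(4,5),(3,5),(2,3)]:
    W[i, j] = W[j, i] = 1.0

mask = np.array([True, True, True, False, False, False])
print(ncut(W, mask))   # small ncut: the split respects the two triangles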
9 Spectral Graph Partitioning
A: adjacency matrix of an undirected graph G
–A_ij = 1 if (i,j) is an edge, else 0
x: a vector in ℝⁿ with components (x₁, …, xₙ)
–Think of it as a label/value on each node of G
What is the meaning of y = A·x?
–Entry y_i = Σ_{(i,j)∈E} x_j is the sum of the labels x_j of the neighbors of i
10 What is the meaning of Ax?
j-th coordinate of A·x:
–Sum of the x-values of the neighbors of j
–Make this the new value at node j
Spectral Graph Theory:
–Analyze the "spectrum" of a matrix representing G
–Spectrum: the eigenvectors x_i of the graph, ordered by the magnitude (strength) of their corresponding eigenvalues λ_i
11 Matrix Representations
Adjacency matrix (A):
–n × n matrix
–A = [a_ij], a_ij = 1 if there is an edge between nodes i and j
Important properties:
–Symmetric matrix
–Eigenvectors are real and orthogonal
12 Matrix Representations
Degree matrix (D):
–n × n diagonal matrix
–D = [d_ii], d_ii = degree of node i
13 Matrix Representations
Laplacian matrix (L):
–n × n symmetric matrix: L = D − A
What is the trivial eigenpair?
–x = (1, …, 1) with λ = 0, since L·x = (D − A)·x = d − d = 0
Important properties:
–Eigenvalues are non-negative real numbers
–Eigenvectors are real and orthogonal
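A small numpy check (my own example graph, not from the slides) of these properties: the all-ones vector is the trivial eigenvector with eigenvalue 0, and every eigenvalue of L is non-negative.

```python
import numpy as np

A = np.array([[0,1,1,0],
              [1,0,1,0],
              [1,1,0,1],
              [0,0,1,0]], dtype=float)
D = np.diag(A.sum(axis=1))      # degree matrix
L = D - A                       # unnormalized Laplacian

ones = np.ones(4)
print(L @ ones)                 # -> all zeros: L·1 = (D-A)·1 = d - d = 0

evals = np.linalg.eigvalsh(L)   # eigvalsh: solver for symmetric matrices
print(evals)                    # all >= 0 (up to floating-point error)
```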
14 Spectral Clustering: Graph = Matrix
W·v₁ = v₂ "propagates weights from neighbors" [Shi & Meila, 2002]
[Figure: eigenvectors e₁, e₂, e₃ of a block-structured matrix; components separate the node groups labeled x, y, and z]
15 Spectral Clustering: Graph = Matrix
W·v₁ = v₂ "propagates weights from neighbors"
If W is connected but roughly block diagonal with k blocks, then:
–the "top" eigenvector is a constant vector
–the next k eigenvectors are roughly piecewise constant, with "pieces" corresponding to blocks
16 Spectral Clustering: Graph = Matrix
W·v₁ = v₂ "propagates weights from neighbors"
If W is connected but roughly block diagonal with k blocks, then the "top" eigenvector is a constant vector and the next k eigenvectors are roughly piecewise constant, with "pieces" corresponding to blocks.
Spectral clustering:
–Find the top k+1 eigenvectors v₁, …, v_{k+1}
–Discard the "top" one
–Replace every node a with the k-dimensional vector x_a = ⟨v₂(a), …, v_{k+1}(a)⟩
–Cluster the x_a with k-means
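As a concrete illustration of this recipe, here is a short numpy/scikit-learn sketch (my own, not from the slides). It uses the row-stochastic variant W = D⁻¹A and the standard symmetric-similarity trick so a plain symmetric eigensolver can be used; the function name and parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_embed_and_cluster(A, k):
    """A: symmetric adjacency/weight matrix. Returns k cluster labels."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # S = D^{-1/2} A D^{-1/2} is symmetric and similar to W = D^{-1} A,
    # so it has the same (real) eigenvalues; if S u = λ u then
    # W (D^{-1/2} u) = λ (D^{-1/2} u).
    S = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    evals, U = np.linalg.eigh(S)           # eigenvalues in ascending order
    # Take the top k+1 eigenvectors of W (recovered via v = D^{-1/2} u),
    # drop the very top (constant) one, and keep the next k.
    V = d_inv_sqrt[:, None] * U[:, ::-1][:, 1:k+1]   # row a is x_a
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)
```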
17 Spectral Clustering: Graph = Matrix
W·v₁ = v₂ "propagates weights from neighbors"
Smallest eigenvectors of D − A are the largest eigenvectors of A; smallest eigenvectors of I − W are the largest eigenvectors of W.
Suppose each y(i) = +1 or −1:
–Then y is a cluster indicator that splits the nodes into two
–What is yᵀ(D − A)y?
18
yᵀ(D − A)y = Σ_{(i,j)∈E} w_ij (y_i − y_j)² = 4 × cut(y) when y(i) = ±1, i.e., the size of CUT(y)
NCUT: roughly, minimize the ratio of transitions between classes vs. transitions within classes
19 So far…
How to define a "good" partition of a graph?
–Minimize a given graph cut criterion
How to efficiently identify such a partition?
–Approximate using information provided by the eigenvalues and eigenvectors of a graph
Spectral Clustering
20 Spectral Clustering Algorithms
Three basic stages:
–1) Pre-processing: construct a matrix representation of the graph
–2) Decomposition: compute the eigenvalues and eigenvectors of the matrix; map each point to a lower-dimensional representation based on one or more eigenvectors
–3) Grouping: assign points to two or more clusters based on the new representation
21 Spectral Partitioning Algorithm
1) Pre-processing:
–Build the Laplacian matrix L of the graph
2) Decomposition:
–Find the eigenvalues λ and eigenvectors x of the matrix L
–Map each vertex to its component in the eigenvector x₂ of the second-smallest eigenvalue λ₂
How do we now find the clusters?
22 Spectral Partitioning
3) Grouping:
–Sort the components of the reduced 1-dimensional vector
–Identify clusters by splitting the sorted vector in two
How to choose a splitting point?
–Naïve approaches: split at 0 or at the median value
–More expensive approaches: attempt to minimize the normalized cut in 1 dimension (sweep over the ordering of nodes induced by the eigenvector), as in the sketch below
Split at 0:
–Cluster A: positive points
–Cluster B: negative points
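The sweep approach might look like the following sketch (helper name and structure are my assumptions): sort the nodes by their x₂ value and test every prefix of the ordering as cluster A, keeping the split with the smallest normalized cut.

```python
import numpy as np

def sweep_cut(W, x2):
    """W: symmetric weight matrix; x2: second eigenvector of the Laplacian."""
    order = np.argsort(x2)            # nodes ordered by eigenvector value
    n = len(order)
    best_ncut, best_split = np.inf, 1
    for i in range(1, n):             # candidate split after position i
        mask = np.zeros(n, dtype=bool)
        mask[order[:i]] = True
        cut = W[mask][:, ~mask].sum()
        vol_a, vol_b = W[mask].sum(), W[~mask].sum()
        if vol_a > 0 and vol_b > 0:
            nc = cut / vol_a + cut / vol_b
            if nc < best_ncut:
                best_ncut, best_split = nc, i
    mask = np.zeros(n, dtype=bool)
    mask[order[:best_split]] = True   # True = cluster A, False = cluster B
    return mask, best_ncut
```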
23 Example: Spectral Partitioning
[Figure: value of x₂ plotted against rank in x₂]
24 Example: Spectral Partitioning
[Figure: value of x₂ vs. rank in x₂, alongside the components of x₂ plotted per node]
25 Example: Spectral partitioning
[Figure: components of x₁ and components of x₃]
26 k-Way Spectral Clustering
How do we partition a graph into k clusters?
Two basic approaches:
–Recursive bi-partitioning [Hagen et al., '92]
Recursively apply the bi-partitioning algorithm in a hierarchical, divisive manner
Disadvantages: inefficient, unstable
–Cluster multiple eigenvectors [Shi-Malik, '00]
Build a reduced space from multiple eigenvectors
Commonly used in recent papers
A preferable approach…
27 Why use multiple eigenvectors?
Approximates the optimal cut [Shi-Malik, '00]
–Can be used to approximate the optimal k-way normalized cut
Emphasizes cohesive clusters
–Increases the unevenness in the distribution of the data
–Associations between similar points are amplified; associations between dissimilar points are attenuated
–The data begins to "approximate a clustering"
Well-separated space
–Transforms the data to a new "embedded space" consisting of k orthogonal basis vectors
Multiple eigenvectors prevent instability due to information loss
28 Spectral Clustering: Graph = Matrix
W·v₁ = v₂ "propagates weights from neighbors"
Q: How do I pick v to be an eigenvector for a block-stochastic matrix?
Smallest eigenvectors of D − A are the largest eigenvectors of A; smallest eigenvectors of I − W are the largest eigenvectors of W.
30 Spectral Clustering: Graph = Matrix
W·v₁ = v₂ "propagates weights from neighbors"
Smallest eigenvectors of D − A are the largest eigenvectors of A; smallest eigenvectors of I − W are the largest eigenvectors of W.
Suppose each y(i) = +1 or −1:
–Then y is a cluster indicator that cuts the nodes into two
–What is yᵀ(D − A)y? The cost of the graph cut defined by y
–What is yᵀ(I − W)y? Also the cost of a graph cut defined by y
–How to minimize it? It turns out that to minimize yᵀXy / (yᵀy), we find the smallest eigenvector of X
–But this will not be ±1, so it is a "relaxed" solution
31 Spectral Clustering: Graph = Matrix
W·v₁ = v₂ "propagates weights from neighbors" [Shi & Meila, 2002]
[Figure: eigenvalue spectrum λ₁, λ₂, λ₃, λ₄, λ₅, … and eigenvectors e₁, e₂, e₃ — the "eigengap" between consecutive eigenvalues indicates how many well-separated blocks the matrix has]
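A minimal sketch (assumed helper name, not from the slides) of the eigengap heuristic the figure refers to: compute the smallest Laplacian eigenvalues and pick k where the gap between consecutive eigenvalues is largest.

```python
import numpy as np

def choose_k_by_eigengap(L, k_max=10):
    """L: graph Laplacian. Returns a suggested number of clusters."""
    evals = np.linalg.eigvalsh(L)[:k_max]   # smallest eigenvalues, ascending
    gaps = np.diff(evals)                   # gaps λ_{i+1} − λ_i
    # With k well-separated clusters, the first k eigenvalues are near 0
    # and the largest gap sits between λ_k and λ_{k+1}.
    return int(np.argmax(gaps)) + 1
```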
32 Some more terms
If A is an adjacency matrix (possibly weighted) and D is a (diagonal) matrix giving the degree of each node:
–Then D − A is the (unnormalized) Laplacian
–W = A·D⁻¹ is a probabilistic adjacency matrix
–I − W is the (normalized, or random-walk) Laplacian
–More definitions are possible!
The largest eigenvectors of W correspond to the smallest eigenvectors of I − W
–So sometimes people talk about the "bottom eigenvectors of the Laplacian", as in the example below
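Written out in numpy (variable names are mine), the definitions above and the eigenvector correspondence look like this:

```python
import numpy as np

A = np.array([[0,1,1,0],
              [1,0,0,1],
              [1,0,0,1],
              [0,1,1,0]], dtype=float)   # a 4-cycle
D = np.diag(A.sum(axis=1))

L_unnorm = D - A                          # unnormalized Laplacian
W = A @ np.linalg.inv(D)                  # probabilistic adjacency A·D^{-1}
L_rw = np.eye(4) - W                      # random-walk Laplacian I - W

# Largest eigenvectors of W <-> smallest eigenvectors of I - W:
# if W v = λ v, then (I - W) v = (1 - λ) v, with the same eigenvector v.
```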
33
[Figure: adjacency matrix A and weight matrix W for two constructions — a k-NN graph (easy, sparse) and a fully connected graph weighted by distance; see the sketch below]
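Two common ways to build the similarity graphs the figure contrasts, sketched with scikit-learn (the data and parameter values are illustrative choices):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.rand(100, 2)                       # toy 2-D points

# Sparse k-NN adjacency (symmetrized, since k-NN is not symmetric).
A_knn = kneighbors_graph(X, n_neighbors=5, mode='connectivity').toarray()
A_knn = np.maximum(A_knn, A_knn.T)

# Fully connected graph with Gaussian (RBF) weights based on distance.
sigma = 0.5
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
A_full = np.exp(-sq_dists / (2 * sigma ** 2))
np.fill_diagonal(A_full, 0.0)                    # no self-loops
```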
36 Spectral Clustering: Pros and Cons
Elegant, and well-founded mathematically
Works quite well when relations are approximately transitive (like similarity)
Very noisy datasets cause problems
–"Informative" eigenvectors need not be in the top few
–Performance can drop suddenly from good to terrible
Expensive for very large datasets
–Computing eigenvectors is the bottleneck
37 Use cases and runtimes
K-Means
–FAST
–"Embarrassingly parallel"
–Not very useful on anisotropic data
Spectral clustering
–Excellent quality under many different data forms
–Much slower than K-Means (see the comparison sketch below)
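A quick comparison (a hedged sketch; the dataset and parameters are illustrative choices) on anisotropic "two moons" data, where K-Means fails but spectral clustering succeeds:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                        n_neighbors=10, random_state=0).fit_predict(X)

print("K-Means ARI: ", adjusted_rand_score(y, km))   # well below 1.0
print("Spectral ARI:", adjusted_rand_score(y, sc))   # typically 1.0
```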
38 Further Reading
Spectral Clustering Tutorial (von Luxburg, 2007):
hamburg.de/ML/contents/people/luxburg/publications/Luxburg07_tutorial.pdf
39 How to validate clustering approaches?
40 Cluster validity
For supervised learning:
–we had a class label,
–which meant we could measure how good our training and testing errors were
–Metrics: accuracy, precision, recall
For clustering:
–How do we measure the "goodness" of the resulting clusters?
41 Clustering random data (overfitting)
If you ask a clustering algorithm to find clusters, it will find some.
42 Different aspects of validating clusters
–Determining the clustering tendency of a set of data, i.e., whether non-random structure actually exists in the data (e.g., to avoid overfitting)
–External validation: compare the results of a cluster analysis to externally known class labels (ground truth)
–Internal validation: evaluate how well the results of a cluster analysis fit the data without reference to external information
–Comparing clusterings to determine which is better
–Determining the "correct" number of clusters
43 Measures of cluster validity
External index: used to measure the extent to which cluster labels match externally supplied class labels.
–Entropy, Purity, Rand Index
Internal index: used to measure the goodness of a clustering structure without reference to external information.
–Sum of Squared Error (SSE), silhouette coefficient
Relative index: used to compare two different clusterings or clusters.
–Often an external or internal index is used for this function, e.g., SSE or entropy
44 Measuring Cluster Validity with Correlation
Proximity matrix vs. incidence matrix:
–Incidence matrix: K_ij = 1 if points i and j belong to the same cluster; 0 otherwise
Compute the correlation between the two matrices:
–Only n(n−1)/2 values need to be computed (the matrices are symmetric)
–High correlation (in magnitude) indicates that points in the same cluster are close to each other
Not suited for density-based clustering
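A sketch of this correlation check (function and variable names are mine): build the incidence matrix from the cluster labels and correlate its upper triangle with that of the proximity (here, distance) matrix.

```python
import numpy as np

def matrix_correlation(dist, labels):
    """dist: n x n distance matrix; labels: integer cluster labels."""
    labels = np.asarray(labels)
    # Incidence matrix: K[i,j] = 1 if i and j share a cluster.
    K = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices(len(labels), k=1)   # only the n(n-1)/2 upper entries
    # For a *distance* matrix, a good clustering gives a strong negative
    # correlation: same-cluster pairs are the close pairs.
    return np.corrcoef(dist[iu], K[iu])[0, 1]
```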
45 Another approach: use the similarity matrix for cluster validation
–Order the rows and columns of the similarity matrix by cluster label; a good clustering shows up as well-separated blocks along the diagonal
46 Internal Measures: SSE
SSE is also a good measure of how good a clustering is
–Lower SSE ⇒ better clustering (for a fixed number of clusters)
–Can be used to estimate the number of clusters: look for an "elbow" in the SSE curve, as sketched below
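A small sketch of the "elbow" use of SSE (the dataset and range of k are illustrative): scikit-learn's inertia_ attribute is exactly the SSE of a K-Means fit.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(1, 9):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(sse, 1))   # SSE drops sharply until k=4, then flattens
```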
47 More on Clustering a little later…
We will discuss other forms of clustering in the following classes
Next class:
–please bring your brief write-up on the two papers
–We will discuss frequent itemset mining and a few other aspects of clustering
–Then move on to dimensionality reduction
48 Summary
We saw spectral clustering techniques:
–only a broad overview
–more details next class
Speeding up spectral clustering can be challenging
49 Clustering Big Data
Taxonomy of clustering algorithms:
–Partitioning: K-means
–Hierarchy-based: CURE, ROCK
–Density-based: DBSCAN
–Grid-based: WaveCluster
–Model-based: EM, COBWEB
50 Categorizing Big Data Clustering Algorithms