Clustering – Part III: Spectral Clustering (COSC 526, Class 14)


Clustering – Part III: Spectral Clustering. COSC 526, Class 14. Arvind Ramanathan, Computational Science & Engineering Division, Oak Ridge National Laboratory, Oak Ridge. Slides inspired by Andrew Moore (CMU) and Jure Leskovec (Stanford); several slides are adapted from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets.

2 Last Class. DBSCAN: a practical clustering algorithm that works even with difficult datasets, scales to large datasets, and works well with noise.

3 Spectral Clustering

4 Graph Partitioning. Undirected graph G. Bi-partitioning task: divide the vertices into two disjoint groups A and B. Questions: How can we define a "good" partition of G? How can we efficiently identify such a partition?

5 Graph Partitioning. What makes a good partition? Maximize the number of within-group connections; minimize the number of between-group connections.

6 Graph Cuts. Express partitioning objectives as a function of the "edge cut" of the partition. Cut: the set of edges with only one vertex in a group: cut(A,B) = Σ_{i∈A, j∈B} w_ij.

7 Graph Cut Criterion. Criterion: minimum cut – minimize the weight of connections between groups: arg min_{A,B} cut(A,B). Degenerate case: the minimum cut may simply split off a single node ("optimal cut" vs. minimum cut). Problem: the criterion only considers external cluster connections and does not consider internal cluster connectivity.

8 Graph Cut Criteria [Shi-Malik]. Criterion: normalized cut [Shi-Malik 1997] – connectivity between groups relative to the density of each group: ncut(A,B) = cut(A,B)/vol(A) + cut(A,B)/vol(B), where vol(A) is the total weight of the edges with at least one endpoint in A: vol(A) = Σ_{i∈A} d_i. Intuition for this criterion: it produces more balanced partitions. How do we efficiently find a good partition? Problem: computing the optimal cut is NP-hard.
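For concreteness, a minimal numpy sketch of this objective (not from the slides; the toy graph and the helper name ncut are illustrative assumptions):

```python
import numpy as np

def ncut(W, mask):
    """Normalized cut of the partition (A = mask, B = ~mask) of an
    undirected graph with weighted adjacency matrix W."""
    A, B = mask, ~mask
    cut = W[np.ix_(A, B)].sum()    # total weight crossing the partition
    vol_A = W[A, :].sum()          # vol(A) = sum of degrees of nodes in A
    vol_B = W[B, :].sum()
    return cut / vol_A + cut / vol_B

# Toy graph (an assumption): two triangles joined by a single bridge edge.
# The natural split along the bridge has a low normalized cut.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
print(ncut(W, np.array([True, True, True, False, False, False])))  # 2/7 ≈ 0.286
```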

9 Spectral Graph Partitioning. A: adjacency matrix of an undirected graph G; A_ij = 1 if (i,j) is an edge, else 0. x is a vector in ℝ^n with components (x_1, …, x_n); think of it as a label/value on each node of G. What is the meaning of A·x? Entry y_i = Σ_j A_ij x_j is the sum of the labels x_j of the neighbors of i.

10 What is the meaning of Ax? The j-th coordinate of A·x is the sum of the x-values of the neighbors of j; make this the new value at node j. Spectral Graph Theory: analyze the "spectrum" of the matrix representing G. Spectrum: the eigenvectors x_i of a graph, ordered by the magnitude (strength) of their corresponding eigenvalues λ_i.
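A small numpy illustration of this (a sketch; the path graph is an arbitrary example):

```python
import numpy as np

# Path graph 0-1-2-3 as an adjacency matrix.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

x = np.array([10.0, 20.0, 30.0, 40.0])  # a value on each node
y = A @ x                               # y[j] = sum of x-values of j's neighbors
print(y)                                # [20. 40. 60. 30.]

# The "spectrum": eigenvectors ordered by eigenvalue magnitude.
vals, vecs = np.linalg.eigh(A)          # eigh, since A is symmetric
order = np.argsort(-np.abs(vals))
print(vals[order])                      # strongest eigenvalues first
```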

11 Matrix Representations. Adjacency matrix (A): an n × n matrix, A = [a_ij], with a_ij = 1 if there is an edge between nodes i and j. Important properties: A is symmetric, and its eigenvectors are real and orthogonal.

12 Matrix Representations. Degree matrix (D): an n × n diagonal matrix, D = [d_ii], with d_ii = degree of node i.

13 Matrix Representations. Laplacian matrix (L): an n × n symmetric matrix, L = D − A. What is the trivial eigenpair? x = (1, …, 1) with λ = 0, since every row of L sums to zero. Important properties: the eigenvalues are non-negative real numbers, and the eigenvectors are real and orthogonal.
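A quick numpy check of these properties (a sketch, using an assumed toy graph of two triangles joined by a bridge):

```python
import numpy as np

A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

D = np.diag(A.sum(axis=1))  # degree matrix
L = D - A                   # unnormalized Laplacian

# Trivial eigenpair: x = (1, ..., 1) with eigenvalue 0, since rows sum to 0.
print(L @ np.ones(6))       # ~[0. 0. 0. 0. 0. 0.]

vals, _ = np.linalg.eigh(L)
print(vals)                 # all non-negative, smallest is 0
```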

14 Spectral Clustering: Graph = Matrix. W·v_1 = v_2 "propagates weights from neighbors" [Shi & Meila, 2002]. [Figure: data points plotted by their components in the eigenvectors e_1, e_2, e_3; the three groups of points (x, y, z) separate in this spectral space.]

15 Spectral Clustering: Graph = Matrix. W·v_1 = v_2 "propagates weights from neighbors". If W is connected but roughly block-diagonal with k blocks, then the top eigenvector is a constant vector, and the next k eigenvectors are roughly piecewise constant, with "pieces" corresponding to the blocks.

16 Spectral Clustering: Graph = Matrix. W·v_1 = v_2 "propagates weights from neighbors". If W is connected but roughly block-diagonal with k blocks, then the "top" eigenvector is a constant vector, and the next k eigenvectors are roughly piecewise constant, with "pieces" corresponding to the blocks. Spectral clustering: find the top k+1 eigenvectors v_1, …, v_{k+1}; discard the "top" one; replace every node a with the k-dimensional vector x_a = (v_2(a), …, v_{k+1}(a)); cluster the x_a with k-means.

17 Spectral Clustering: Graph = Matrix. W·v_1 = v_2 "propagates weights from neighbors". The smallest eigenvectors of D − A are the largest eigenvectors of A; the smallest eigenvectors of I − W are the largest eigenvectors of W. Suppose each y(i) = +1 or −1: then y is a cluster indicator that splits the nodes into two. What is y^T (D − A) y?

18 y^T (D − A) y = Σ_{(i,j)∈E} w_ij (y_i − y_j)^2 = 4 · cut(y), i.e., (proportional to) the size of CUT(y). NCUT: roughly, minimize the ratio of transitions between classes to transitions within classes.
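This identity is easy to verify numerically; a sketch on the same assumed toy graph:

```python
import numpy as np

edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

y = np.array([1, 1, 1, -1, -1, -1])  # +/-1 cluster indicator
cut = sum(1 for i, j in edges if y[i] != y[j])
print(y @ L @ y, 4 * cut)            # both 4: one edge crosses the cut
```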

19 So far… How do we define a "good" partition of a graph? Minimize a given graph-cut criterion. How do we efficiently identify such a partition? Approximate it using the information provided by the eigenvalues and eigenvectors of the graph: spectral clustering.

20 Spectral Clustering Algorithms. Three basic stages: 1) Pre-processing: construct a matrix representation of the graph. 2) Decomposition: compute the eigenvalues and eigenvectors of the matrix; map each point to a lower-dimensional representation based on one or more eigenvectors. 3) Grouping: assign points to two or more clusters based on the new representation.

21 Spectral Partitioning Algorithm. 1) Pre-processing: build the Laplacian matrix L of the graph. 2) Decomposition: find the eigenvalues λ and eigenvectors x of L; map each vertex to its component in x_2, the eigenvector of the second-smallest eigenvalue λ_2. How do we now find the clusters?

22 Spectral Partitioning. 3) Grouping: sort the components of the reduced 1-dimensional vector; identify clusters by splitting the sorted vector in two (a sketch follows below). How to choose a splitting point? Naïve approaches: split at 0 or at the median value. More expensive approaches: attempt to minimize the normalized cut in 1 dimension (sweep over the ordering of nodes induced by the eigenvector). Split at 0: cluster A = positive points, cluster B = negative points.
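A minimal sketch of the whole bi-partitioning pipeline (assuming the unnormalized Laplacian and the naïve split at 0; the helper name fiedler_bipartition is illustrative):

```python
import numpy as np

def fiedler_bipartition(A):
    """Bi-partition a connected graph by the sign of the Fiedler vector,
    the eigenvector of the second-smallest Laplacian eigenvalue."""
    L = np.diag(A.sum(axis=1)) - A
    _, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    fiedler = vecs[:, 1]          # skip the trivial constant eigenvector
    return fiedler >= 0           # naive split at 0

A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
print(fiedler_bipartition(A))     # the two triangles, e.g. [T T T F F F]
```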

23 Example: Spectral Partitioning. [Figure: components of x_2 plotted against their rank; x-axis: rank in x_2, y-axis: value of x_2.]

24 Example: Spectral Partitioning. [Figure: rank in x_2 vs. value of x_2, together with the components of x_2 on the graph.]

25 Example: Spectral Partitioning. [Figure: components of x_1 and components of x_3.]

26 k-Way Spectral Clustering. How do we partition a graph into k clusters? Two basic approaches: Recursive bi-partitioning [Hagen et al., '92]: recursively apply the bi-partitioning algorithm in a hierarchical, divisive manner. Disadvantages: inefficient, unstable. Cluster multiple eigenvectors [Shi-Malik, '00]: build a reduced space from multiple eigenvectors. Commonly used in recent papers; the preferable approach (a sketch follows below).
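A compact sketch of the multiple-eigenvector approach (an assumed implementation using the unnormalized Laplacian plus scikit-learn's k-means, not the exact Shi-Malik normalized-cut algorithm):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(A, k):
    """k-way spectral clustering: embed each node with the k Laplacian
    eigenvectors after the trivial one, then cluster with k-means."""
    L = np.diag(A.sum(axis=1)) - A
    _, vecs = np.linalg.eigh(L)        # ascending eigenvalues
    X = vecs[:, 1:k + 1]               # skip the constant eigenvector
    return KMeans(n_clusters=k, n_init=10).fit_predict(X)

A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
print(spectral_clustering(A, 2))       # e.g. [0 0 0 1 1 1]
```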

27 Why use multiple eigenvectors? Approximates the optimal cut [Shi-Malik, '00]: can be used to approximate the optimal k-way normalized cut. Emphasizes cohesive clusters: increases the unevenness in the distribution of the data; associations between similar points are amplified, associations between dissimilar points are attenuated; the data begins to "approximate a clustering". Well-separated space: transforms the data to a new "embedded space" consisting of k orthogonal basis vectors. Multiple eigenvectors prevent instability due to information loss.

28 Spectral Clustering: Graph = Matrix. W·v_1 = v_2 "propagates weights from neighbors". Q: How do I pick v to be an eigenvector for a block-stochastic matrix? Recall: the smallest eigenvectors of D − A are the largest eigenvectors of A; the smallest eigenvectors of I − W are the largest eigenvectors of W.

29 Spectral Clustering: Graph = Matrix. W·v_1 = v_2 "propagates weights from neighbors". How do I pick v to be an eigenvector for a block-stochastic matrix?

30 Spectral Clustering: Graph = Matrix. W·v_1 = v_2 "propagates weights from neighbors". The smallest eigenvectors of D − A are the largest eigenvectors of A; the smallest eigenvectors of I − W are the largest eigenvectors of W. Suppose each y(i) = +1 or −1: then y is a cluster indicator that cuts the nodes into two. What is y^T (D − A) y? The cost of the graph cut defined by y. What is y^T (I − W) y? Also the cost of a graph cut defined by y. How do we minimize it? It turns out that to minimize y^T X y / (y^T y) we take the smallest eigenvector of X; but this will not be ±1-valued, so it is a "relaxed" solution.

31 Spectral Clustering: Graph = Matrix. W·v_1 = v_2 "propagates weights from neighbors" [Shi & Meila, 2002]. [Figure: the eigenvalue spectrum λ_1, λ_2, λ_3, λ_4, λ_{5,6,7,…} and eigenvectors e_1, e_2, e_3; the gap between successive eigenvalues is the "eigengap".]

32 Some more terms. If A is an adjacency matrix (maybe weighted) and D is the (diagonal) matrix giving the degree of each node, then: D − A is the (unnormalized) Laplacian; W = A D^{-1} is a probabilistic adjacency matrix; I − W is the (normalized, or random-walk) Laplacian. More definitions are possible! The largest eigenvectors of W correspond to the smallest eigenvectors of I − W, so sometimes people talk about the "bottom eigenvectors of the Laplacian".
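A quick numerical check of this correspondence (a sketch on an assumed toy graph):

```python
import numpy as np

A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
W = A @ np.diag(1.0 / A.sum(axis=1))  # probabilistic adjacency matrix A D^-1

# I - W has the same eigenvectors as W, with eigenvalues 1 - lambda, so the
# largest eigenvectors of W are exactly the smallest eigenvectors of I - W.
vals_W, _ = np.linalg.eig(W)                  # eig: W is not symmetric
vals_L, _ = np.linalg.eig(np.eye(6) - W)
print(np.sort(vals_W.real))
print(np.sort(1.0 - vals_L.real))             # the same multiset of values
```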

33 [Figure: adjacency matrix A and weight matrix W for two graph constructions – a k-NN graph (easy), and a fully connected graph weighted by distance.] A sketch of both constructions follows below.
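A sketch of both constructions with scikit-learn (the parameter choices n_neighbors=5 and gamma=1.0 are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))   # 100 points in the plane

# (1) k-NN graph: sparse 0/1 adjacency, symmetrized to be undirected.
A = kneighbors_graph(X, n_neighbors=5, mode='connectivity').toarray()
A = np.maximum(A, A.T)

# (2) Fully connected graph, weighted by a Gaussian of the distance:
#     W_ij = exp(-gamma * ||x_i - x_j||^2)
W = rbf_kernel(X, gamma=1.0)
np.fill_diagonal(W, 0.0)        # no self-loops
```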


36 Spectral Clustering: Pros and Cons. Elegant and well-founded mathematically. Works quite well when relations are approximately transitive (like similarity). Very noisy datasets cause problems: the "informative" eigenvectors need not be in the top few, and performance can drop suddenly from good to terrible. Expensive for very large datasets: computing eigenvectors is the bottleneck.

37 Use cases and runtimes. K-means: fast, "embarrassingly parallel", but not very useful on anisotropic data. Spectral clustering: excellent quality under many different data forms, but much slower than k-means. (A small comparison follows below.)
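To see both behaviors, a small comparison on a two-moons dataset (a sketch; the dataset and parameters are illustrative assumptions):

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

y_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
y_sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                          n_neighbors=10, random_state=0).fit_predict(X)

# k-means cuts each anisotropic moon in half; spectral clustering recovers them.
print("k-means ARI: ", adjusted_rand_score(y_true, y_km))   # well below 1
print("spectral ARI:", adjusted_rand_score(y_true, y_sc))   # close to 1
```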

38 Further Reading. Spectral Clustering Tutorial (Ulrike von Luxburg, Statistics and Computing, 2007): hamburg.de/ML/contents/people/luxburg/publications/Luxburg07_tutorial.pdf

39 How to validate clustering approaches?

40 Cluster validity. For supervised learning we had a class label, which meant we could measure how good our training and testing errors were; metrics: accuracy, precision, recall. For clustering: how do we measure the "goodness" of the resulting clusters?

41 Clustering random data (overfitting). If you ask a clustering algorithm to find clusters, it will find some, even in purely random data (a demonstration follows below).
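A short demonstration (a sketch; uniform noise and k = 4 are arbitrary choices):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))   # pure noise: there are no real clusters

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(np.bincount(km.labels_))   # four well-populated "clusters" anyway
print(km.inertia_)               # and a perfectly reasonable-looking SSE
```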

42 Different aspects of validating clusters. Determine the clustering tendency of a set of data, i.e., whether non-random structure actually exists in the data (e.g., to avoid overfitting). External validation: compare the results of a cluster analysis to externally known class labels (ground truth). Internal validation: evaluate how well the results of a cluster analysis fit the data without reference to external information. Compare clusterings to determine which is better. Determine the 'correct' number of clusters.

43 Measures of cluster validity. External index: used to measure the extent to which cluster labels match externally supplied class labels. Examples: entropy, purity, Rand index. Internal index: used to measure the goodness of a clustering structure without respect to external information. Examples: sum of squared error (SSE), silhouette coefficient. Relative index: used to compare two different clusterings or clusters; often an external or internal index is used for this function, e.g., SSE or entropy.
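An example of computing one external and one internal index with scikit-learn (a sketch; note that scikit-learn provides the adjusted variant of the Rand index):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# External index: compares cluster labels against ground-truth classes.
print("adjusted Rand index:", adjusted_rand_score(y_true, y_pred))
# Internal index: uses only the data and the clustering itself.
print("silhouette coefficient:", silhouette_score(X, y_pred))
```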

44 Measuring cluster validity with correlation. Proximity matrix vs. incidence matrix: the incidence matrix K has K_ij = 1 if points i and j belong to the same cluster, 0 otherwise. Compute the correlation between the two matrices: since both are symmetric, only n(n−1)/2 entries need to be computed; a high correlation indicates that points in the same cluster are close to each other. Not suited for density-based clustering.
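A sketch of this computation (the dataset, the RBF similarity, and gamma are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

S = rbf_kernel(X, gamma=0.1)                             # proximity matrix
K = (labels[:, None] == labels[None, :]).astype(float)   # incidence matrix

iu = np.triu_indices(len(X), k=1)      # the n(n-1)/2 off-diagonal pairs
print(np.corrcoef(S[iu], K[iu])[0, 1]) # high correlation = good clustering
```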

45 Another approach: use similarity matrix for cluster validation

46 Internal Measures: SSE. SSE is also a good measure of how good the clustering is: lower SSE ⇒ better clustering. It can also be used to estimate the number of clusters, by looking for an "elbow" in the curve of SSE versus number of clusters (see the sketch below).
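A sketch of this "elbow" heuristic (the blob dataset with 4 true centers is an assumption):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# SSE (inertia) vs. number of clusters: the "elbow" suggests the right k.
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))    # drops sharply until k = 4, then flattens
```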

47 More on clustering a little later… We will discuss other forms of clustering in the following classes. Next class: please bring your brief write-up on the two papers; we will discuss frequent itemset mining and a few other aspects of clustering, then move on to dimensionality reduction.

48 Summary. We saw spectral clustering techniques: only a broad overview; more details next class. Speeding up spectral clustering techniques can be challenging.

49 Clustering Big Data. A taxonomy of clustering approaches: Partitioning: k-means. Hierarchy-based: CURE, ROCK. Density-based: DBSCAN. Grid-based: WaveCluster. Model-based: EM, COBWEB.

50 Categorizing Big Data Clustering Algorithms