CNT 6805 Network Science and Applications
Lecture 6: Graph Partitioning, Part II: Spectral Clustering
Dr. Dapeng Oliver Wu, University of Florida, Department of Electrical and Computer Engineering, Fall 2016
Many figures are from Kolaczyk, Eric D. Statistical Analysis of Network Data: Methods and Models (2009) Springer Science+Business Media LLC.
Outline
Similarity graph
Laplacian matrix of a graph
Spectral clustering algorithms
Why does spectral clustering work?
  Perspective from graph cut
  Perspective from random walk
  Perspective from perturbation theory
Graph Notations (1)
Let G = (V, E) be an undirected graph with vertex set $V = \{v_1, \dots, v_n\}$. The weighted adjacency matrix of the graph is the matrix $W = [w_{ij}]$, $i, j = 1, \dots, n$, with $w_{ij} \ge 0$ and $w_{ij} = w_{ji}$. The degree of a vertex $v_i$ is defined as $d_i = \sum_{j=1}^{n} w_{ij}$. The degree matrix D is defined as the diagonal matrix with the degrees $d_1, \dots, d_n$ on the diagonal.
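Graph Notations (2)
For two not necessarily disjoint sets A and B in V, we define the crosstalk between A and B by $W(A, B) = \sum_{i \in A,\, j \in B} w_{ij}$. Consider two different ways of measuring the "size" of a subset A in V: $|A|$, the number of vertices in A, and $\mathrm{vol}(A) = \sum_{i \in A} d_i$, the sum of the degrees of the vertices in A.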
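The notation above (degree, crosstalk between two vertex sets, and the two notions of subset size) can be made concrete on a small example. The following is a minimal sketch; the 4-vertex weight matrix and the function names `degrees`, `crosstalk`, and `volume` are illustrative assumptions, not part of the lecture.

```python
# Hypothetical toy graph: a triangle 0-1-2 plus vertex 3 attached to vertex 2.
W = [
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.5],
    [0.0, 0.0, 0.5, 0.0],
]

def degrees(W):
    """Degree of each vertex: the row sums of the weight matrix."""
    return [sum(row) for row in W]

def crosstalk(W, A, B):
    """Total weight of the pairs (i, j) with i in A and j in B."""
    return sum(W[i][j] for i in A for j in B)

def volume(W, A):
    """Sum of the degrees of the vertices in A."""
    d = degrees(W)
    return sum(d[i] for i in A)

A, B = {0, 1, 2}, {3}
print("degrees:", degrees(W))
print("crosstalk(A, B):", crosstalk(W, A, B))
print("|A| =", len(A), " vol(A) =", volume(W, A))
```

The cardinality ignores edge weights entirely, while the volume grows with how strongly the vertices of A are connected; this difference is what separates the two normalized cut criteria used later.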
Similarity Graph
Given a set of data points $x_1, \dots, x_n$ with pairwise similarities $s_{ij} \ge 0$ or pairwise distances $d_{ij}$, we can transform the data points into a so-called similarity graph G = (V, E) as follows. Each vertex $v_i$ in this graph represents a data point $x_i$. Two vertices are connected if the similarity $s_{ij}$ between the corresponding data points $x_i$ and $x_j$ is positive or larger than a certain threshold, and the edge is weighted by $s_{ij}$.
Four Types of Similarity Graphs (1)
ε-neighborhood unweighted graph: Connect all points whose pairwise distances are smaller than ε.
k-nearest neighbor weighted graph: Connect $v_i$ and $v_j$ with an undirected edge if $v_i$ is among the k nearest neighbors of $v_j$ or if $v_j$ is among the k nearest neighbors of $v_i$. The weight of an edge equals the similarity between the edge's endpoints.
Mutual k-nearest neighbor weighted graph: Connect vertices $v_i$ and $v_j$ if both $v_i$ is among the k nearest neighbors of $v_j$ and $v_j$ is among the k nearest neighbors of $v_i$.
Four Types of Similarity Graphs (2)
Fully connected weighted graph: Connect each pair of points having positive similarity and weight the edges by $s_{ij}$. Similarity between any two points $x_i$ and $x_j$ can be measured by the Gaussian similarity function $s(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$, where the parameter σ controls the width of the neighborhoods.
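The k-nearest neighbor and mutual k-nearest neighbor constructions differ only in whether the neighbor relation must hold in one direction or in both. The sketch below illustrates this with Gaussian similarities; the function names, the tuple representation of points, and the four sample points are assumptions made for the illustration.

```python
import math

def gaussian_similarity(x, y, sigma=1.0):
    """s(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for points given as tuples."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def knn_graph(points, k, sigma=1.0, mutual=False):
    """Weight matrix of the (mutual) k-nearest neighbor similarity graph."""
    n = len(points)
    S = [[gaussian_similarity(points[i], points[j], sigma) for j in range(n)]
         for i in range(n)]
    # nbrs[i] holds the k points most similar to point i (excluding i itself).
    nbrs = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: -S[i][j])
        nbrs.append(set(order[:k]))
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            i_picks_j, j_picks_i = j in nbrs[i], i in nbrs[j]
            keep = (i_picks_j and j_picks_i) if mutual else (i_picks_j or j_picks_i)
            if keep:
                W[i][j] = W[j][i] = S[i][j]  # edge weighted by the similarity
    return W

def edges(W):
    return [(i, j) for i in range(len(W)) for j in range(i + 1, len(W)) if W[i][j] > 0]

points = [(0.0,), (1.0,), (1.5,), (5.0,)]  # hypothetical 1-D data
print("kNN edges:       ", edges(knn_graph(points, k=1)))
print("mutual kNN edges:", edges(knn_graph(points, k=1, mutual=True)))
```

With k = 1, the isolated point at 5.0 still gets an edge in the ordinary k-nearest neighbor graph (it picks its nearest neighbor), but not in the mutual graph, since no other point picks it back.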
Unnormalized Laplacian Matrix (1)
Unnormalized Laplacian of a graph: $L = D - W$.
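A useful way to get a feel for the unnormalized Laplacian L = D - W is to check two of its standard properties numerically: every row sums to zero, and the quadratic form f'Lf equals half the weighted sum of squared differences of f across all pairs. The sketch below does this on a hypothetical 4-vertex graph; the helper names are illustrative, and stating these two properties here is an assumption about what the slide covered.

```python
def laplacian(W):
    """Unnormalized Laplacian L = D - W."""
    n = len(W)
    d = [sum(row) for row in W]
    return [[(d[i] if i == j else 0.0) - W[i][j] for j in range(n)] for i in range(n)]

def quadratic_form(L, f):
    """f' L f."""
    n = len(f)
    return sum(f[i] * L[i][j] * f[j] for i in range(n) for j in range(n))

def pairwise_form(W, f):
    """(1/2) * sum_ij w_ij (f_i - f_j)^2."""
    n = len(f)
    return 0.5 * sum(W[i][j] * (f[i] - f[j]) ** 2 for i in range(n) for j in range(n))

W = [
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.5],
    [0.0, 0.0, 0.5, 0.0],
]
L = laplacian(W)
f = [1.0, -2.0, 0.5, 3.0]  # arbitrary test vector
print("row sums of L:", [sum(row) for row in L])
print("f'Lf          =", quadratic_form(L, f))
print("pairwise form =", pairwise_form(W, f))
```

The second identity shows that f'Lf is small exactly when f varies little across heavily weighted edges, which is why small eigenvalues of L are tied to cluster structure.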
Unnormalized Laplacian Matrix (2)
Proposition 2 (Number of connected components and the spectrum of L): Let G be an undirected graph with non-negative weights. Then the multiplicity k of the eigenvalue 0 of L equals the number of connected components $A_1, \dots, A_k$ in the graph. The eigenspace of eigenvalue 0 is spanned by the indicator vectors $\mathbf{1}_{A_1}, \dots, \mathbf{1}_{A_k}$ of those components.
Remark: The all-ones vector $\mathbf{1}$ is also an eigenvector corresponding to eigenvalue 0, since $\mathbf{1}$ is a linear combination (actually, the sum) of $\mathbf{1}_{A_1}, \dots, \mathbf{1}_{A_k}$. It therefore adds no new dimension: the multiplicity of the eigenvalue 0 is k, not k + 1.
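One direction of Proposition 2 is easy to verify numerically: the indicator vector of each connected component is mapped to zero by L. The sketch below finds the components with a depth-first search and checks this; the 5-vertex graph and the helper names are assumptions chosen for the illustration.

```python
def laplacian(W):
    n = len(W)
    d = [sum(row) for row in W]
    return [[(d[i] if i == j else 0.0) - W[i][j] for j in range(n)] for i in range(n)]

def connected_components(W):
    """Connected components of the graph, found by depth-first search."""
    n, seen, comps = len(W), set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in range(n):
                if W[u][v] > 0 and v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(sorted(comp))
    return comps

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

# Hypothetical graph with two components: a triangle {0, 1, 2} and an edge {3, 4}.
W = [
    [0.0, 1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 2.0],
    [0.0, 0.0, 0.0, 2.0, 0.0],
]
L = laplacian(W)
comps = connected_components(W)
residuals = []
for comp in comps:
    indicator = [1.0 if i in comp else 0.0 for i in range(len(W))]
    residuals.append(matvec(L, indicator))  # should be the zero vector
print("components:", comps)
print("L * indicator:", residuals)
```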
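Normalized Laplacian Matrix (1)
Normalized Laplacians of a graph:
Symmetric normalized Laplacian: $L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$.
Random-walk normalized Laplacian: $L_{\mathrm{rw}} = D^{-1} L = I - D^{-1} W$, where $D^{-1} W$ is the transition matrix of the random walk on the graph.
Relation between them: $L_{\mathrm{sym}} = D^{1/2} L_{\mathrm{rw}} D^{-1/2}$.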
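The relation between the two normalized Laplacians follows in one line from their definitions, and it implies that they share eigenvalues. A short derivation, assuming every vertex has positive degree so that D is invertible:

```latex
\begin{align*}
D^{1/2} L_{\mathrm{rw}} D^{-1/2}
  &= D^{1/2} \left( D^{-1} L \right) D^{-1/2}
   = D^{-1/2} L D^{-1/2}
   = L_{\mathrm{sym}}, \\
L_{\mathrm{rw}} u = \lambda u
  &\iff L_{\mathrm{sym}} \left( D^{1/2} u \right) = \lambda \left( D^{1/2} u \right), \\
L_{\mathrm{rw}} u = \lambda u
  &\iff L u = \lambda D u .
\end{align*}
```

So the two matrices are similar: an eigenvector u of L_rw corresponds to the eigenvector D^{1/2} u of L_sym with the same eigenvalue, and the eigenpairs of L_rw are exactly the solutions of the generalized eigenproblem Lu = λDu.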
Normalized Laplacian Matrix (2)
Normalized Laplacian Matrix (3)
Proposition 4 (Number of connected components and spectra of $L_{\mathrm{sym}}$ and $L_{\mathrm{rw}}$): Let G be an undirected graph with non-negative weights. Then the multiplicity k of the eigenvalue 0 of both $L_{\mathrm{rw}}$ and $L_{\mathrm{sym}}$ equals the number of connected components $A_1, \dots, A_k$ in the graph. For $L_{\mathrm{rw}}$, the eigenspace of 0 is spanned by the indicator vectors $\mathbf{1}_{A_i}$ of those components. For $L_{\mathrm{sym}}$, the eigenspace of 0 is spanned by the vectors $D^{1/2} \mathbf{1}_{A_i}$.
Spectral Clustering Algorithms
Unnormalized spectral clustering
Normalized spectral clustering
  Normalized cuts algorithm (Shi and Malik)
  G-cut algorithm (Ng, Jordan, and Weiss)
Unnormalized Spectral Clustering
Normalized Spectral Clustering: Normalized Cuts (Shi and Malik)
Normalized Spectral Clustering According to Ng, Jordan, and Weiss
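The three algorithm slides survive here only as titles. The sketch below shows one common formulation of the three variants, stated as an assumption rather than as the lecture's exact pseudocode: the unnormalized variant embeds vertices with the first k eigenvectors of L; the Shi and Malik variant uses the generalized eigenvectors of Lu = λDu (computed here as D^{-1/2} times eigenvectors of L_sym); the Ng, Jordan, and Weiss variant uses eigenvectors of L_sym with rows normalized to unit length. To stay dependency-free, it uses a small Jacobi eigen-solver and a K-means with deterministic farthest-point initialization; both are stand-ins chosen for the illustration, and all function names are hypothetical.

```python
import math

def jacobi_eigh(A, sweeps=100, tol=1e-18):
    """Eigenpairs of a small symmetric matrix via cyclic Jacobi rotations.
    Returns eigenvalues in ascending order and the matching eigenvectors."""
    n = len(A)
    A = [row[:] for row in A]
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                diff = A[q][q] - A[p][p]
                theta = (0.5 * math.atan(2.0 * A[p][q] / diff) if diff != 0.0
                         else math.copysign(math.pi / 4.0, A[p][q]))
                c, s = math.cos(theta), math.sin(theta)
                for M in (A, V):  # right-multiply by the rotation: columns p, q
                    for r in range(n):
                        mp, mq = M[r][p], M[r][q]
                        M[r][p], M[r][q] = c * mp - s * mq, s * mp + c * mq
                for r in range(n):  # left-multiply A by the transpose: rows p, q
                    ap, aq = A[p][r], A[q][r]
                    A[p][r], A[q][r] = c * ap - s * aq, s * ap + c * aq
    order = sorted(range(n), key=lambda i: A[i][i])
    return [A[i][i] for i in order], [[V[r][i] for r in range(n)] for i in order]

def dist2(p, c):
    return sum((a - b) ** 2 for a, b in zip(p, c))

def kmeans(points, k, iters=100):
    """Lloyd's algorithm with farthest-point initialization (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        new = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new.append([sum(x) / len(members) for x in zip(*members)] if members
                       else centers[c])
        if new == centers:
            break
        centers = new
    return labels

def spectral_clustering(W, k, variant="unnormalized"):
    """variant: 'unnormalized', 'shi_malik', or 'njw'. Assumes positive degrees."""
    n = len(W)
    d = [sum(row) for row in W]
    L = [[(d[i] if i == j else 0.0) - W[i][j] for j in range(n)] for i in range(n)]
    if variant == "unnormalized":
        M = L
    else:  # L_sym = D^{-1/2} L D^{-1/2}
        M = [[L[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]
    _, vecs = jacobi_eigh(M)
    rows = [[vecs[m][i] for m in range(k)] for i in range(n)]  # one row per vertex
    if variant == "shi_malik":  # u = D^{-1/2} v solves L u = lambda D u
        rows = [[x / math.sqrt(d[i]) for x in r] for i, r in enumerate(rows)]
    elif variant == "njw":      # normalize each row to unit length
        rows = [[x / (math.sqrt(sum(y * y for y in r)) or 1.0) for x in r] for r in rows]
    return kmeans(rows, k)

# Hypothetical test graph: two triangles joined by one weak edge (2, 3).
W = [[0.0] * 6 for _ in range(6)]
for i, j, w in [(0, 1, 1), (0, 2, 1), (1, 2, 1), (3, 4, 1), (3, 5, 1), (4, 5, 1), (2, 3, 0.1)]:
    W[i][j] = W[j][i] = float(w)
for variant in ("unnormalized", "shi_malik", "njw"):
    print(variant, spectral_clustering(W, 2, variant))
```

On this graph all three variants should separate the two triangles. The Jacobi solver is only practical for small matrices; for real data sets a library eigen-solver would be the natural replacement.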
Example: 10-nearest-neighbor graph, first 4 eigenvectors, K-means with K = 4
200 samples are drawn from a mixture of 4 Gaussians. Each eigenvector has dimension 200 (one entry per sample).
Example: fully connected graph, first 4 eigenvectors, K-means with K = 4
[Figure: channel-decoding view of the result; the nodes are labeled with the binary codewords 1100, 1001, and 1010.]
Why Use Eigenvectors for Clustering? (1)
The dominant eigenvector of the adjacency matrix can be used as a centrality measure for each node, which is called eigenvector centrality.
From Propositions 2 and 4, the multiplicity k of the eigenvalue 0 of the unnormalized/normalized Laplacian equals the number of connected components $A_1, \dots, A_k$ in the graph. For L and $L_{\mathrm{rw}}$, the eigenspace of 0 is spanned by the indicator vectors $\mathbf{1}_{A_1}, \dots, \mathbf{1}_{A_k}$ of those components. For $L_{\mathrm{sym}}$, the eigenspace of 0 is spanned by the vectors $D^{1/2} \mathbf{1}_{A_i}$.
So a strategy for clustering is to find the eigenvectors corresponding to the k smallest eigenvalues (assuming k is known); for brevity, these are called the k smallest eigenvectors. Ideally, the k smallest eigenvectors will be $\mathbf{1}_{A_1}, \dots, \mathbf{1}_{A_k}$.
Why Use Eigenvectors for Clustering? (2)
Usually, the k smallest eigenvectors are not the indicator vectors $\mathbf{1}_{A_1}, \dots, \mathbf{1}_{A_k}$.
Two ways to convert eigenvectors to indicator vectors: (1) For each eigenvector, if the sign of an entry $f_i$ of the eigenvector is positive, let $f_i = 1$; otherwise, let $f_i = 0$. (2) Use K-means to create K clusters out of n points of dimension K.
The reason why we call this method "spectral clustering" is that the scheme uses the eigenvectors (i.e., spectral information of a Laplacian matrix) for clustering.
Soft Decoding vs. Hard Decoding (1)
Spectral clustering (finding the K smallest eigenvectors + K-means) is similar to soft decoding in a channel decoder.
We can also do hard decoding when solving for the K smallest eigenvectors. That is, we can force the K smallest eigenvectors to be indicator vectors $\mathbf{1}_{A_1}, \dots, \mathbf{1}_{A_K}$ or binary-valued vectors; each node is then associated with a codeword (its entries across the K binary vectors), and all nodes having the same codeword belong to the same cluster. In this way, we do not need K-means for post-processing.
Soft Decoding vs. Hard Decoding (2)
From communication theory, soft decoding achieves better performance than hard decoding.
Hard decoding into binary-valued vectors may suffer from the same greediness problem as recursive bisection. For example, given three clusters of 1-D data points (i.e., a left, a middle, and a right cluster), it is possible that the first binary-valued eigenvector splits the middle cluster in two, i.e., some of its points merge with the left cluster and some merge with the right cluster. Then the points in the middle cluster will not be detected as points of the same cluster, no matter what values the entries of the second and third eigenvectors take.
Soft Decoding vs. Hard Decoding (3)
But spectral clustering (finding the K smallest eigenvectors + K-means) may relieve this problem of hard decoding by combining the K smallest eigenvectors into K-dimensional vectors for K-means clustering.
Experimental results are needed to validate the above hypotheses.
Limitations of Spectral Clustering
We use the first M eigenvectors of the Laplacian (corresponding to the M smallest eigenvalues) for K-means clustering. But how do we determine the value of M? A possible way is to choose the M smallest eigenvalues such that their differences are less than a threshold.
In K-means clustering, why are the M eigenvectors treated equally? Assume that different eigenvectors are associated with different eigenvalues. If the gap between the selected eigenvalues is large, we should use the Mahalanobis distance instead of the Euclidean distance; then how do we determine the weights? If the gap between the selected eigenvalues is small, using the Euclidean distance is justifiable.
Approximation Algorithms for NP-Hard Problems
Exact algorithm: computationally infeasible in practice.
Approximation algorithm: achieves theoretically proven good performance, e.g., close to the optimum; runs in polynomial time but may have high complexity.
Heuristic algorithm: lower complexity than an approximation algorithm but no performance guarantee.
Compressive Sensing Approach for NP-Hard Problems (1)
Compressive sensing (CS) is a technique for acquiring and reconstructing a signal utilizing the prior knowledge that it is sparse or compressible.
A new paradigm to avoid NP-hardness: (1) convert a mixed-integer program to a nonlinear program; (2) solve the nonlinear program; (3) decoding: map the real-valued solution to an integer-valued feasible solution (a valid codeword).
Compressive Sensing Approach for NP-Hard Problems (2)
CS = model selection = penalized least squares.
We can apply CS to the graph learning problem and to the parameter estimation problem of a large-scale network.
Graph Cuts (1)
Minimum cut: For a given number k of subsets, the min-cut approach chooses a partition $A_1, \dots, A_k$ that minimizes the total crosstalk $\mathrm{cut}(A_1, \dots, A_k) = \frac{1}{2} \sum_{i=1}^{k} W(A_i, \bar{A}_i)$, where $W(A, B)$ denotes the crosstalk between A and B and $\bar{A}_i$ is the complement of $A_i$ in V.
However, in practice min-cut often does not lead to satisfactory partitions. The problem is that in many cases, the solution of min-cut simply separates one individual vertex from the rest of the graph. A community should consist of multiple nodes rather than one node.
To mitigate the problem, use ratio cut and normalized cut.
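Graph Cuts (2)
Ratio cut: minimize cardinality-normalized crosstalk, $\mathrm{RatioCut}(A_1, \dots, A_k) = \sum_{i=1}^{k} W(A_i, \bar{A}_i) / |A_i|$.
Normalized cut: minimize volume-normalized crosstalk, $\mathrm{Ncut}(A_1, \dots, A_k) = \sum_{i=1}^{k} W(A_i, \bar{A}_i) / \mathrm{vol}(A_i)$.
Unfortunately, minimizing RatioCut and minimizing Ncut are both NP-hard problems. We need approximation algorithms to solve them.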
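The difference between the three criteria is easiest to see by exhaustive search on a graph small enough to enumerate. The sketch below compares min-cut, RatioCut, and Ncut for k = 2 on a hypothetical 7-vertex graph (two triangles joined by a bridge, plus one weakly attached pendant vertex). For k = 2 it uses RatioCut = W(A, Ā)(1/|A| + 1/|Ā|) and Ncut = W(A, Ā)(1/vol(A) + 1/vol(Ā)); the graph, the function names, and this normalization convention are assumptions made for the illustration.

```python
from itertools import combinations

def cut(W, A, B):
    """Crosstalk W(A, B) between the two sides of a bipartition."""
    return sum(W[i][j] for i in A for j in B)

def vol(W, A):
    return sum(sum(W[i]) for i in A)

def ratio_cut(W, A, B):
    return cut(W, A, B) * (1.0 / len(A) + 1.0 / len(B))

def ncut(W, A, B):
    return cut(W, A, B) * (1.0 / vol(W, A) + 1.0 / vol(W, B))

def best_bipartition(W, objective):
    """Brute force over all bipartitions; A is the side containing vertex 0."""
    n = len(W)
    best_val, best_A = None, None
    for r in range(n - 1):  # A = {0} plus r other vertices, never all of V
        for extra in combinations(range(1, n), r):
            A = {0, *extra}
            B = set(range(n)) - A
            val = objective(W, A, B)
            if best_val is None or val < best_val:
                best_val, best_A = val, A
    return best_A, best_val

# Two triangles {0,1,2} and {3,4,5}, bridge (2,3) of weight 0.6,
# and a pendant vertex 6 attached to vertex 5 with weight 0.5.
n = 7
W = [[0.0] * n for _ in range(n)]
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0), (3, 4, 1.0), (3, 5, 1.0),
                (4, 5, 1.0), (2, 3, 0.6), (5, 6, 0.5)]:
    W[i][j] = W[j][i] = w

for name, objective in [("min-cut", cut), ("RatioCut", ratio_cut), ("Ncut", ncut)]:
    A, val = best_bipartition(W, objective)
    print(name, "-> A =", sorted(A), " value =", round(val, 4))
```

Min-cut isolates the pendant vertex because its single edge is the cheapest thing to cut, whereas both normalized criteria prefer cutting the bridge between the two triangles. The search visits every bipartition, so its cost doubles with each added vertex; this is the practical face of the NP-hardness mentioned above.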
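Approximating RatioCut for k = 2 (1)
Goal: minimize $\mathrm{RatioCut}(A, \bar{A})$ over all subsets $A \subset V$.
For the unnormalized Laplacian L, the RatioCut objective can be written as a quadratic form $f' L f$ in a vector f whose entries take one of two values, depending on whether the vertex lies in A or in $\bar{A}$. Hence, minimizing RatioCut becomes an integer program, referred to as (3).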
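The equations on this slide did not survive extraction. The following is the standard derivation from the spectral clustering literature, given under the assumption that the slide follows it and that RatioCut(A, Ā) = W(A, Ā)(1/|A| + 1/|Ā|); the label (3) for the final program is taken from the next slide's reference to it.

```latex
\begin{align*}
&\text{For } A \subset V \text{ define } f = (f_1, \dots, f_n)' \text{ by }
f_i =
\begin{cases}
\sqrt{|\bar{A}| / |A|} & \text{if } v_i \in A, \\
-\sqrt{|A| / |\bar{A}|} & \text{if } v_i \in \bar{A}.
\end{cases} \\[4pt]
f' L f
  &= \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2
   = W(A, \bar{A}) \left( \sqrt{\frac{|\bar{A}|}{|A|}} + \sqrt{\frac{|A|}{|\bar{A}|}} \right)^{2} \\
  &= W(A, \bar{A}) \left( \frac{|A| + |\bar{A}|}{|A|} + \frac{|A| + |\bar{A}|}{|\bar{A}|} \right)
   = n \cdot \mathrm{RatioCut}(A, \bar{A}), \\[4pt]
\sum_{i=1}^{n} f_i
  &= |A| \sqrt{\frac{|\bar{A}|}{|A|}} - |\bar{A}| \sqrt{\frac{|A|}{|\bar{A}|}} = 0
  \quad (\text{i.e., } f \perp \mathbf{1}), \qquad
\|f\|^2 = |\bar{A}| + |A| = n, \\[4pt]
&\min_{A \subset V} f' L f
  \quad \text{s.t.} \quad f \perp \mathbf{1}, \;\; \|f\| = \sqrt{n}, \;\;
  f_i \text{ as defined above.} \tag{3}
\end{align*}
```

The discreteness sits entirely in the requirement that each entry of f take one of two prescribed values; dropping that requirement is what turns (3) into an eigenvalue problem.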
Approximating RatioCut for k = 2 (2)
(3) is an NP-hard problem; an exhaustive search would have to examine $2^n - 2$ possible configurations for the subset A (every non-empty proper subset of V). We can use an approximation algorithm to solve (3).
The approximation technique we use is called relaxation, i.e., replacing the integer-valued constraint by a real-valued constraint. The following relaxed problems are equivalent:
$\min_f f' L f$ s.t. $\|f\| = \sqrt{n}$ $\iff$ $\min_f \frac{f' L f}{n}$ s.t. $\|f\| = \sqrt{n}$ $\iff$ $\min_f \frac{f' L f}{f' f}$.
By the Rayleigh-Ritz theorem, the solution f of $\min_f \frac{f' L f}{f' f}$ is the eigenvector corresponding to the smallest eigenvalue of L. Then, for $\min_f \frac{f' L f}{f' f}$ s.t. $f \perp \mathbf{1}$, the solution f is the eigenvector corresponding to the second smallest eigenvalue of L.
Approximating RatioCut for k = 2 (3)
However, in order to obtain a partition of the graph, we need to transform the real-valued solution vector f of the relaxed problem back into a discrete indicator vector.
Two ways to convert the eigenvector f to an indicator vector: (1) If the sign of an entry $f_i$ of the eigenvector is positive, let $f_i = 1$; otherwise, let $f_i = 0$. (2) Use K-means to create 2 clusters out of the n 1-D points specified by f.
This is exactly the unnormalized spectral clustering algorithm for the case of k = 2.
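For k = 2 the whole pipeline (relax, solve for the second eigenvector, threshold by sign) fits in a few lines. The sketch below approximates the eigenvector of the second smallest eigenvalue of L by power iteration on cI - L restricted to vectors orthogonal to the all-ones vector; using power iteration, the shift c = 2 max(d_i), the starting vector, and the test graph are all assumptions made to keep the example dependency-free.

```python
import math

def fiedler_vector(W, iters=500):
    """Approximate eigenvector of the second smallest eigenvalue of L = D - W."""
    n = len(W)
    d = [sum(row) for row in W]
    c = 2.0 * max(d)  # upper bound on the eigenvalues of L (Gershgorin)
    f = [math.sin(i + 1.0) for i in range(n)]  # arbitrary deterministic start
    for _ in range(iters):
        # g = (c I - L) f = (c - d_i) f_i + sum_j w_ij f_j
        g = [(c - d[i]) * f[i] + sum(W[i][j] * f[j] for j in range(n))
             for i in range(n)]
        mean = sum(g) / n
        g = [x - mean for x in g]  # project out the all-ones direction
        norm = math.sqrt(sum(x * x for x in g))
        f = [x / norm for x in g]
    return f

def ratio_cut(W, A, B):
    w = sum(W[i][j] for i in A for j in B)
    return w * (1.0 / len(A) + 1.0 / len(B))

# Hypothetical test graph: two triangles joined by one weak edge (2, 3).
W = [[0.0] * 6 for _ in range(6)]
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0), (3, 4, 1.0), (3, 5, 1.0),
                (4, 5, 1.0), (2, 3, 0.1)]:
    W[i][j] = W[j][i] = w

f = fiedler_vector(W)
A = frozenset(i for i, x in enumerate(f) if x > 0)  # sign thresholding
B = frozenset(range(len(W))) - A
print("A =", sorted(A), " B =", sorted(B))
print("RatioCut =", ratio_cut(W, A, B))
```

Which triangle ends up as A depends on the sign of the computed eigenvector, which is arbitrary; only the partition itself is meaningful.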
Approximating RatioCut & Ncut for arbitrary k
Similarly, we can show that the unnormalized spectral clustering algorithm (for arbitrary k) produces an approximate solution to the problem of minimizing RatioCut for arbitrary k.
We can also show that the normalized spectral clustering algorithm (for the normalized Laplacian $L_{\mathrm{rw}}$) produces an approximate solution to the problem of minimizing Ncut for arbitrary k.
Comments on Relaxation Approach
For some polynomial-time approximation algorithms, including polynomial-time approximation schemes (PTAS), it is possible to prove how far the value obtained by the approximate solution is from the optimum value.
For example, in the case of a ρ-approximation algorithm A, it has been proven that the value f(x) of the approximate solution A(x) to an instance x will not be more (resp. less) than a factor ρ times the value, OPT, of a minimum (resp. maximum) solution.
For minimization: $f(x) \le \rho \cdot \mathrm{OPT}$, with $\rho > 1$.
For maximization: $f(x) \ge \rho \cdot \mathrm{OPT}$, with $\rho < 1$.
Hardness of approximation
But there is no guarantee on the quality of the solution of the relaxed problem for minimizing RatioCut/Ncut, compared to the exact solution.
For some NP-hard problems, it is impossible to design a polynomial-time r-approximation algorithm with small r (and hence a PTAS), unless P = NP.
For minimization problems, r = ρ; for maximization problems, r = 1/ρ. The performance ratio satisfies r > 1 for both minimization and maximization problems, while the approximation ratio satisfies ρ > 1 for minimization and ρ < 1 for maximization.
It has been proven that the approximation problem for minimizing RatioCut or Ncut with a small r is itself NP-hard; this is what hardness of approximation means.
Determining the number of clusters
Information criteria (also used for model selection): Akaike information criterion (AIC), Bayesian information criterion (BIC), minimum description length (MDL).
Ad hoc measures such as the ratio of within-cluster to between-cluster similarities.
Eigen-gap heuristic: The eigen-gap (i.e., the gap between two adjacent eigenvalues) can be used as a quality criterion for spectral clustering and as a criterion for choosing the number of clusters to construct.
Eigen-gap heuristic
Method: choose the number k such that all eigenvalues $\lambda_1, \dots, \lambda_k$ are very small, but $\lambda_{k+1}$ is relatively large.
A justification for this procedure is based on perturbation theory. In the ideal case of k completely disconnected clusters, the eigenvalue 0 has multiplicity k, and then there is a gap to the (k + 1)th eigenvalue $\lambda_{k+1} > 0$. For a slightly perturbed data set (e.g., adding a few links between disconnected clusters), $\lambda_1, \dots, \lambda_k$ are still very small, and $\lambda_{k+1}$ is still relatively large. So the gap between $\lambda_k$ and $\lambda_{k+1}$ is still relatively large.
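One simple way to turn the heuristic into a rule is to pick the k at which the gap between consecutive sorted eigenvalues is largest. This "largest gap" rule is only one possible reading of "very small" versus "relatively large", and both the function name and the sample eigenvalues below are assumptions made for the illustration.

```python
def eigengap_k(eigenvalues):
    """Return (k, gap) where gap = lambda_{k+1} - lambda_k is the largest
    difference between consecutive eigenvalues in ascending order."""
    lam = sorted(eigenvalues)
    gaps = [lam[i + 1] - lam[i] for i in range(len(lam) - 1)]
    k = max(range(len(gaps)), key=lambda i: gaps[i]) + 1
    return k, gaps[k - 1]

# Hypothetical spectrum: three eigenvalues near 0, then a clear jump.
spectrum = [0.0, 0.02, 0.05, 0.81, 0.9, 1.0]
k, gap = eigengap_k(spectrum)
print("suggested number of clusters:", k, " eigen-gap:", gap)
```

When clusters overlap heavily, the spectrum tends to rise smoothly and no single gap stands out, in which case this rule gives an unreliable answer.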
Example
Figure: Three data sets and the 10 smallest eigenvalues of the Laplacian $L_{\mathrm{rw}}$.
Remarks (1)
Other relaxation approaches: Bie et al. proposed a semi-definite programming (SDP) based relaxation for approximating Ncut.
The reason why the spectral relaxation is so appealing is not that it leads to particularly good solutions. Its popularity is mainly due to its simplicity (finding the smallest eigenvectors + K-means).
Remarks (2)
Spectral clustering is similar to kernel K-means in the sense that we can first use a kernel to generate a similarity matrix from data samples, then use spectral clustering (eigen-decomposition + K-means).
Both kernel K-means and spectral clustering can be used to identify clusters that are non-linearly separable in input space.
Sometimes, we prefer to map L-dimensional points to a similarity graph and apply spectral clustering instead of directly applying K-means to the L-dimensional points; this is because spectral clustering is similar to kernel K-means and can achieve better performance than K-means for nonlinear structure embedded in the points.
Spectral Clustering: Flatten a Curved Space
Use a kernel to map points $x_i$ in $\mathbb{R}^M$ to a similarity graph.
An embedding function maps the vertices $v_i$ of the graph to points $y_i$ in $\mathbb{R}^K$, where $y_i$ is row i of the matrix formed by the first K eigenvectors of the Laplacian matrix as column vectors.
Now, points $x_i$ embedded in a nonlinear structure are mapped to $y_i$ in $\mathbb{R}^K$ so that geodesic distance is replaced by Euclidean distance and K-means can be applied to $\{y_i\}$. In this way, a curved space is flattened by this process.
Random Walk on a Graph (1)
For any weighted graph or digraph in which every vertex has positive degree, we can define a random walk on the graph.
The transition probability of jumping in one step from vertex $v_i$ to vertex $v_j$ is proportional to the edge weight $w_{ij}$ and is given by $p_{ij} := w_{ij} / d_i$. The transition matrix $P = [p_{ij}]$, $i, j = 1, \dots, n$, of the random walk is thus defined by $P = D^{-1} W$.
If the graph is connected and non-bipartite, then the random walk always possesses a unique stationary distribution $\pi = (\pi_1, \dots, \pi_n)^T$, where $\pi_i = d_i / \mathrm{vol}(V)$.
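The claim that the degrees, normalized by the total volume, form a stationary distribution can be checked directly: one step of the walk must leave that distribution unchanged. The sketch below does this on a hypothetical connected, non-bipartite 4-vertex graph; the helper names are illustrative.

```python
def transition_matrix(W):
    """P = D^{-1} W, i.e., p_ij = w_ij / d_i (assumes every degree is positive)."""
    d = [sum(row) for row in W]
    return [[w / d[i] for w in row] for i, row in enumerate(W)]

def stationary_distribution(W):
    """pi_i = d_i / vol(V)."""
    d = [sum(row) for row in W]
    total = sum(d)
    return [x / total for x in d]

def step(pi, P):
    """One step of the walk applied to a distribution: (pi P)_j = sum_i pi_i p_ij."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

# Triangle 0-1-2 plus vertex 3 attached to vertex 2: connected and non-bipartite.
W = [
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.5],
    [0.0, 0.0, 0.5, 0.0],
]
P = transition_matrix(W)
pi = stationary_distribution(W)
print("row sums of P:", [sum(row) for row in P])
print("pi   =", pi)
print("pi P =", step(pi, P))
```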
Random Walk on a Graph (2)
There is a tight relationship between the Laplacian $L_{\mathrm{rw}}$ and P, as $L_{\mathrm{rw}} = I - P$. Hence, λ is an eigenvalue of $L_{\mathrm{rw}}$ with eigenvector u if and only if 1 − λ is an eigenvalue of P with eigenvector u.
Therefore, the eigenvectors of P with the largest eigenvalues, which are the eigenvectors of $L_{\mathrm{rw}}$ with the smallest eigenvalues, can be used to describe cluster properties of the graph.
Random Walk vs. Ncut
Proposition 5 implies that minimizing Ncut is equivalent to finding a cut that minimizes the total probability of a random walk transition from A to $\bar{A}$ and from $\bar{A}$ to A.
That is, when minimizing Ncut, we actually look for a cut through the graph such that a random walk seldom transitions from A to $\bar{A}$ and vice versa.
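Proposition 5 itself is not reproduced in the text. Assuming it is the standard identity Ncut(A, Ā) = P(Ā | A) + P(A | Ā) for a walk started in the stationary distribution, with Ncut(A, Ā) = W(A, Ā)(1/vol(A) + 1/vol(Ā)), it can be derived as follows:

```latex
\begin{align*}
&\text{Let } X_0 \sim \pi \text{ with } \pi_i = d_i / \mathrm{vol}(V),
 \text{ and } P(B \mid A) := P(X_1 \in B \mid X_0 \in A). \\[4pt]
P(X_0 \in A,\, X_1 \in B)
  &= \sum_{i \in A,\, j \in B} \pi_i \, p_{ij}
   = \sum_{i \in A,\, j \in B} \frac{d_i}{\mathrm{vol}(V)} \cdot \frac{w_{ij}}{d_i}
   = \frac{W(A, B)}{\mathrm{vol}(V)}, \\
P(X_0 \in A)
  &= \sum_{i \in A} \pi_i = \frac{\mathrm{vol}(A)}{\mathrm{vol}(V)}, \\
P(B \mid A)
  &= \frac{W(A, B) / \mathrm{vol}(V)}{\mathrm{vol}(A) / \mathrm{vol}(V)}
   = \frac{W(A, B)}{\mathrm{vol}(A)}, \\[4pt]
\mathrm{Ncut}(A, \bar{A})
  &= \frac{W(A, \bar{A})}{\mathrm{vol}(A)} + \frac{W(\bar{A}, A)}{\mathrm{vol}(\bar{A})}
   = P(\bar{A} \mid A) + P(A \mid \bar{A}).
\end{align*}
```

The last line uses the symmetry W(A, Ā) = W(Ā, A), which holds because the graph is undirected.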
Perturbation Theory
Perturbation theory studies the question of how eigenvalues and eigenvectors of a matrix A change if we add a small perturbation H, that is, we consider the perturbed matrix $\tilde{A} := A + H$.
Most perturbation theorems state that a certain distance between eigenvalues or eigenvectors of A and $\tilde{A}$ is bounded by a constant times a norm of H.
The constant usually depends on which eigenvalue we are looking at, and how far this eigenvalue is separated from the rest of the spectrum.
How do Eigenvectors Change with Perturbation?
The Davis-Kahan theorem tells us that the eigenspaces corresponding to the first k eigenvalues of the ideal matrix L and the first k eigenvalues of the perturbed matrix $\tilde{L}$ are very close to each other if $\|H\|$ is small and the eigen-gap $\delta = |\lambda_{k+1} - \lambda_k|$ is large.
Homework 3
Write a Matlab program to implement three methods for spectral clustering, i.e., unnormalized spectral clustering, normalized spectral clustering according to Shi and Malik (normalized cuts), and normalized spectral clustering according to Ng, Jordan, and Weiss; and apply them to detect communities in the weighted graph specified by the Karate Club data.
Please submit 1) a report (in a *.doc or *.pdf file) that includes the results produced by your programs, and 2) your Matlab programs in *.m files. Submit your homework as a WORD file, Matlab file, or pdf through the E-Learning web site under the directory of Homework 3.
Reading Assignment
E. Kolaczyk, "Statistical Analysis of Network Data," Chapter 4