Clustering (Part II) 11/26/07
Spectral Clustering
Represent data similarity by a graph
For example: connect two data points if their similarity is greater than a threshold, and weight each edge inversely proportional to the distance between the points.
Similarity Matrix
Each edge is weighted by the similarity between the two data points it connects. The similarity matrix W = (w_ij) contains all the weights.
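As a concrete illustration, here is a minimal Python/NumPy sketch of the graph construction just described (connect points whose distance is below a cutoff, weight edges inversely proportional to distance); the function name and the cutoff parameter are illustrative choices, not part of the original slides.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def similarity_graph(X, cutoff):
    # Pairwise Euclidean distances between the rows (data points) of X.
    D = squareform(pdist(X))
    W = np.zeros_like(D)
    # Connect two points if they are close enough (similarity above threshold),
    # and weight the edge inversely proportional to the distance.
    mask = (D > 0) & (D < cutoff)
    W[mask] = 1.0 / D[mask]
    return W   # symmetric similarity matrix W = (w_ij)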
Spectral Clustering
Mincut: minimize the cut size, where cutsize = total weight of the edges cut by the partition, subject to a constraint on cluster sizes, for example |A| = |B|. (Chris Ding)
Why this might be useful
K-means clustering (k = 2) cannot separate the red points from the green points, but spectral clustering separates the two groups naturally.
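A small runnable comparison in the same spirit, assuming scikit-learn and using the classic two-concentric-rings dataset as a stand-in for the red/green figure; parameters such as n_neighbors=10 are illustrative.

from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, SpectralClustering

# Two concentric rings: not linearly separable, so k-means fails to split them.
X, y_true = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sc_labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               n_neighbors=10, random_state=0).fit_predict(X)
# sc_labels recovers the two rings; km_labels cuts straight through both of them.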
Partition into two clusters: the graph Laplacian
Let q_i = +1 if point i is assigned to A and q_i = -1 if it is assigned to B. Then cutsize = (1/4) q^T L q, where L = D - W and D is the diagonal matrix of weighted degrees. Minimize this objective, allowing q_i to take not just discrete but also continuous values; the relaxed solution is an eigenvector of L = D - W.
Properties of the graph Laplacian
L is positive semi-definite: y^T L y >= 0 for any y. The first eigenvector is q_1 = (1, ..., 1), with eigenvalue lambda_1 = 0. The second eigenvector is the desired solution; a smaller lambda_2 means a better partitioning.
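A short NumPy sketch of these properties: form L = D - W and take its second eigenvalue/eigenvector pair, which is the relaxed partition indicator q; the helper name is illustrative.

import numpy as np

def second_eigenvector(W):
    # Graph Laplacian L = D - W, with D the diagonal matrix of weighted degrees.
    D = np.diag(W.sum(axis=1))
    L = D - W
    # L is symmetric positive semi-definite, so use eigh; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(L)
    # eigvals[0] is ~0 with eigenvector ~(1, ..., 1); the next pair is (lambda_2, q).
    return eigvals[1], eigvecs[:, 1]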
Convert q to a partition
Method 1: A = {i | q_i >= 0}. But this does not satisfy the size constraint.
Note that J is not changed if each q_i is replaced by q_i + c.
Method 2: A = {i | q_i >= c}. Find c so that |A| = |B|.
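A minimal sketch of Method 2 under the balance constraint |A| = |B|: since J is unchanged by shifting q by a constant, thresholding at the median of q splits the points into two (nearly) equal halves. The helper name and the median choice are illustrative.

import numpy as np

def balanced_partition(q):
    # Choose c as the median of q so that about half the points land on each side.
    c = np.median(q)
    A = np.flatnonzero(q >= c)
    B = np.flatnonzero(q < c)
    return A, B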
Partition into more than two clusters
Recursively apply the 2-cluster procedure, or use higher-order eigenvectors.
More general size constraints
Ratio cut: J = cut(A,B)/|A| + cut(A,B)/|B|
Normalized cut: J = cut(A,B)/deg(A) + cut(A,B)/deg(B), where deg(A) is the total weighted degree (volume) of A
Min-max cut: J = cut(A,B)/W(A,A) + cut(A,B)/W(B,B), where W(A,A) is the total edge weight within A
Solution
Ratio cut – 2nd eigenvector of L = D - W
Normalized cut – solution is the 2nd eigenvector of the generalized problem (D - W) q = lambda D q
Min-max cut – solution is the 2nd eigenvector of the generalized problem W q = lambda D q
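A sketch of the normalized-cut solution above, assuming SciPy: scipy.linalg.eigh solves the symmetric generalized eigenproblem (D - W) q = lambda D q directly; the function name is illustrative.

import numpy as np
from scipy.linalg import eigh

def normalized_cut_vector(W):
    D = np.diag(W.sum(axis=1))   # assumes every node has positive degree
    L = D - W
    # Generalized symmetric eigenproblem L q = lambda D q, eigenvalues ascending.
    eigvals, eigvecs = eigh(L, D)
    # The first eigenvector is constant (lambda = 0); the second is the relaxed solution.
    return eigvecs[:, 1]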
A simple example
More than 2 clusters
The objectives generalize by summing the two-cluster term over all k clusters:
Ratio cut: J = sum_k cut(A_k, rest)/|A_k|
Normalized cut: J = sum_k cut(A_k, rest)/deg(A_k)
Min-max cut: J = sum_k cut(A_k, rest)/W(A_k, A_k)
Solution
The relaxed solution lies in the subspace spanned by the first k eigenvectors.
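A sketch of the k-cluster case, assuming the common practice of running k-means on the rows of the first k eigenvectors (the slides only state that the solution lies in that subspace); scikit-learn and the names below are assumptions.

import numpy as np
from sklearn.cluster import KMeans

def spectral_kway(W, k):
    D = np.diag(W.sum(axis=1))
    L = D - W
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]   # subspace spanned by the first k eigenvectors
    # Cluster the rows of the embedding; each row represents one data point.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)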
Applications
Lymphoma cancer data (Alizadeh et al. 2000): 4025 genes in total; 900 genes selected by variable-selection methods. (Chris Ding)
Affinity Propagation
Main Idea
Data points are either exemplars (cluster centers) or non-exemplars (all other data points). Messages are passed between candidate exemplars and non-exemplar data points. The total number of clusters is found automatically by the algorithm.
Responsibility r(j, k)
Data point j informs candidate exemplar k how well-suited k is to serve as j's exemplar.
Availability a(j, k)
Candidate exemplar k informs data point j how appropriate it would be for j to choose k as its exemplar.
Self-availability a(k, k)
Candidate exemplar k accumulates evidence for whether it is a good exemplar, based on the support it receives from the other data points.
An iterative procedure
Update r(j, k):  r(j, k) <- s(j, k) - max_{k' != k} [ a(j, k') + s(j, k') ], where s(j, k) is the similarity between data point j and candidate exemplar k.
Update a(j, k), j != k:  a(j, k) <- min{ 0, r(k, k) + sum_{j' not in {j, k}} max(0, r(j', k)) }.
Update a(k, k):  a(k, k) <- sum_{j' != k} max(0, r(j', k)).
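A compact NumPy sketch of the three update rules above, with damping added for numerical stability as in Frey and Dueck 2007; the fixed iteration count, the damping value, and the function name are illustrative choices rather than part of the slides.

import numpy as np

def affinity_propagation(S, n_iter=200, damping=0.5):
    # S: n x n similarity matrix s(j, k); its diagonal holds the "preferences",
    # which control how many exemplars (clusters) emerge.
    n = S.shape[0]
    R = np.zeros((n, n))   # responsibilities r(j, k)
    A = np.zeros((n, n))   # availabilities a(j, k)
    idx = np.arange(n)
    for _ in range(n_iter):
        # r(j,k) <- s(j,k) - max_{k' != k} [a(j,k') + s(j,k')]
        AS = A + S
        kmax = AS.argmax(axis=1)
        first = AS[idx, kmax]
        AS[idx, kmax] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[idx, kmax] = S[idx, kmax] - second
        R = damping * R + (1 - damping) * R_new

        # a(j,k) <- min(0, r(k,k) + sum_{j' not in {j,k}} max(0, r(j',k)))
        # a(k,k) <- sum_{j' != k} max(0, r(j',k))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())     # keep r(k,k) itself, even if negative
        A_new = Rp.sum(axis=0)[None, :] - Rp   # drop j's own contribution to column k
        diag = A_new.diagonal().copy()         # this is exactly a(k,k)
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)          # self-availability is not clipped at 0
        A = damping * A + (1 - damping) * A_new

    # Exemplars: points k whose combined evidence r(k,k) + a(k,k) is positive.
    exemplars = np.flatnonzero((R + A).diagonal() > 0)
    labels = exemplars[np.argmax(S[:, exemplars], axis=1)]
    labels[exemplars] = exemplars
    return exemplars, labels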
Step-by-step affinity propagation
Applications
Multi-exon gene detection in mouse: expression levels at different exons within a gene are co-regulated across different tissue types. 37 mouse tissues involved; 12 tiling arrays. (Frey et al. 2005)
Biclustering
Motivation
Gene expression data form a matrix of genes x conditions.
1D approach: to identify condition clusters, all genes are used, but probably only a few genes are differentially expressed.
1D approach: to identify gene clusters, all conditions are used, but a set of genes may be expressed only under a few conditions.
Biclustering objective: to isolate genes that are co-expressed under a specific set of conditions.
Coupled Two-Way Clustering
An iterative procedure involving the following two steps:
– Within a cluster of conditions, search for gene clusters.
– Using features from a cluster of genes, search for condition clusters.
(Getz et al. 2001)
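A rough sketch of one coupled round, with k-means standing in for the superparamagnetic clustering actually used by Getz et al. 2001; the function name, the cluster counts, and the choice of k-means are assumptions made only for illustration.

import numpy as np
from sklearn.cluster import KMeans

def ctwc_round(X, condition_subset, gene_subset, n_gene_clusters=2, n_cond_clusters=2):
    # X: genes x conditions expression matrix.
    # Step 1: within a cluster of conditions, search for gene clusters.
    gene_labels = KMeans(n_clusters=n_gene_clusters, n_init=10,
                         random_state=0).fit_predict(X[:, condition_subset])
    # Step 2: using features from a cluster of genes, search for condition clusters.
    cond_labels = KMeans(n_clusters=n_cond_clusters, n_init=10,
                         random_state=0).fit_predict(X[gene_subset, :].T)
    return gene_labels, cond_labels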
SAMBA – A bipartite graph model
V = Genes, U = Conditions.
E = "respond" edges: gene v is differentially expressed in condition u.
Cluster = subgraph (U', V', E') = a subset of co-regulated genes V' in a subset of conditions U'.
Tanay et al. 2002
SAMBA -- algorithm
Goal: find the "heaviest" subgraphs H = (U', V', E'). Missing edges within the subgraph lower its weight. Tanay et al. 2002
SAMBA -- algorithm
p_{u,v} -- probability of edge (u, v) expected at random; p_c -- probability of an edge within a cluster. Compute a weight score for H = (U', V', E'). Tanay et al. 2002
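A sketch of that weight score, assuming the standard SAMBA log-likelihood-ratio form: edges of H contribute log(p_c / p_{u,v}) and missing (u, v) pairs contribute log((1 - p_c) / (1 - p_{u,v})); the function and argument names are illustrative.

import numpy as np

def samba_weight(edges, P_random, p_c):
    # edges: boolean |U'| x |V'| matrix, True where (u, v) is in E'.
    # P_random: matrix of p_{u,v}, the random-model edge probabilities.
    # p_c: probability of an edge inside a true cluster.
    contrib = np.where(edges,
                       np.log(p_c / P_random),
                       np.log((1.0 - p_c) / (1.0 - P_random)))
    return contrib.sum()   # log L(H): present edges add weight, missing edges subtract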
SAMBA -- algorithm
Finding the heaviest subgraph is an NP-hard problem; a polynomial-time heuristic is used to search for heavy subgraphs efficiently. Tanay et al. 2002
Significance of weight
Let H = (U', V', E') be a subgraph. Fix U' and randomly select a new V'' of the same size as V'. The weights of the resulting subgraphs (U', V'', E'') give a background distribution. Estimate a p-value by comparing log L(H) with this background distribution.
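A sketch of that permutation test; weight_of is a hypothetical callable that scores the subgraph induced by the fixed U' and a given gene set, and the +1 counting convention is a common but assumed choice.

import numpy as np

def empirical_pvalue(observed_log_L, weight_of, gene_pool, subset_size,
                     n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    # Background distribution: resample V'' of the same size as V' with U' fixed.
    background = np.array([
        weight_of(rng.choice(gene_pool, size=subset_size, replace=False))
        for _ in range(n_perm)
    ])
    # Fraction of random gene subsets at least as heavy as the observed cluster.
    return (np.sum(background >= observed_log_L) + 1) / (n_perm + 1)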
Model evaluation
Examine the p-value distribution of the top candidate clusters. If biological classification data are available, evaluate the purity of class membership within each bicluster.
Reading List
von Luxburg 2006 (pages 1-12) – A tutorial on spectral clustering
Frey and Dueck 2007 – Affinity propagation
Tanay et al. 2002 – SAMBA for biclustering