
1 Cluster Analysis II 10/03/2012

2 K-means clustering in microarray data
A common situation for gene clustering in microarray data (k=10, k=15, k=30): K-means clustering looks informative, but a closer look finds lots of noise in each cluster.

3 Challenge: Lots of scattered genes
i.e. genes not belonging to any tight cluster of biological function.

4 Methods that deal with scattered points
Tight Clustering Penalized K-means

5 Tight Clustering vs. Traditional Clustering
Traditional clustering: Estimate the number of clusters, k (except for hierarchical clustering). Perform clustering by assigning all genes into clusters.
Tight clustering: Directly identify informative, tight, and stable clusters of reasonable size, say 20~60 genes. No need to estimate k! No need to assign all genes into clusters.

6 Basic Idea
[Figure: 2-D scatter plot (x vs. y) of the whole data]

7 Tight Clustering workflow
Original data X → random sub-sample X' → K-means → cluster centers C(X', k) = (C1, …, Ck) → co-membership matrix D[C(X', k), X]

8 Tight Clustering notation
• X = {xij}n×d : data to be clustered.
• X' = {x'ij}(n/2)×d : random sub-sample.
• C(X', k) = (C1, C2, …, Ck) : the cluster centers obtained from clustering X' into k clusters.
• D[C(X', k), X] : an n×n matrix denoting the co-membership relations of X classified by C(X', k) (Tibshirani 2001): D[C(X', k), X]ij = 1 if points i and j are in the same cluster, and 0 otherwise.

9 Algorithm A
Take a random subsample X' from the original data X, say with 70% of the original sample size. Apply K-means with the pre-specified k on X' to obtain the cluster centers C(X', k) = (C1, C2, …, Ck). Use the clustering result C(X', k) as a classifier to cluster the original data X according to the distances from each point to the cluster centers. The resulting clustering is represented by a co-membership matrix D[C(X', k), X], where D[C(X', k), X]ij, the element of the matrix in row i and column j, takes value 1 if points i and j are in the same cluster and 0 otherwise.

10 Algorithm A cont’d
Repeat independent random subsampling B times to obtain subsamples X'(1), X'(2), …, X'(B). The average co-membership matrix is defined as D̄ = (1/B) Σb D[C(X'(b), k), X]. Search for a set of points V = {v1, …, vm} such that D̄vi,vj ≥ 1 − α for all i, j, where α is a constant close to 0. Order sets with this property by size to obtain Vk1, Vk2, …. These V sets are candidates for tight clusters.
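The averaging and candidate-search steps of Algorithm A can be sketched as follows. The two hand-written co-membership matrices stand in for the B subsampling rounds, and the greedy search is a simplification (the real algorithm orders all qualifying sets by size); all names are illustrative:

```python
def average_comembership(matrices):
    """Entry-wise mean of B co-membership matrices."""
    B, n = len(matrices), len(matrices[0])
    return [[sum(m[i][j] for m in matrices) / B for j in range(n)]
            for i in range(n)]

def tight_sets(Dbar, alpha=0.1):
    """Greedy sketch: grow point sets whose pairwise average
    co-membership stays at least 1 - alpha, largest first."""
    n = len(Dbar)
    remaining = set(range(n))
    found = []
    while remaining:
        seed = min(remaining)
        V = {seed}
        for j in sorted(remaining - {seed}):
            if all(Dbar[i][j] >= 1 - alpha for i in V):
                V.add(j)
        found.append(sorted(V))
        remaining -= V
    return sorted(found, key=len, reverse=True)

# Two mock co-membership matrices for 4 points: both rounds agree that
# points 0 and 1 belong together; points 2 and 3 are unstable.
m1 = [[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [0, 0, 0, 1]]
m2 = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
Dbar = average_comembership([m1, m2])
candidates = tight_sets(Dbar, alpha=0.1)
```

Only the stable pair {0, 1} survives as a tight-cluster candidate; the unstable points fall out as singletons.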

11 [Figure: example average co-membership matrices for 11 points, with entries of 1.0 within stable clusters and 0.5 for unstable co-memberships]

12 Sequential identification of tight and stable clusters
After a tight and stable cluster is identified, it is removed from the whole data, and the same procedure is repeated to identify the next tightest cluster. Define the similarity of two sets of genes V1 and V2 as s(V1, V2) = |V1 ∩ V2| / |V1 ∪ V2|.
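The set-similarity used here can be sketched as the Jaccard-type ratio (assumed, since the formula itself did not survive extraction); the gene names are illustrative:

```python
def set_similarity(V1, V2):
    """Jaccard-type similarity of two gene sets:
    |intersection| / |union|, in [0, 1]."""
    V1, V2 = set(V1), set(V2)
    return len(V1 & V2) / len(V1 | V2)

# Two candidate clusters sharing 3 of 4 genes:
s = set_similarity({"g1", "g2", "g3"}, {"g1", "g2", "g3", "g4"})
```

A value near 1 indicates the same candidate cluster recurring across consecutive k, which is the stability signal the sequential procedure looks for.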

13 Sequential identification of tight and stable clusters cont’d
Start with a suitable k0. Apply Algorithm A on consecutive k starting from k0. Choose the top q tight cluster candidates for each k, namely V1(k), …, Vq(k). Stop when s(Vi(k), Vj(k+1)) ≥ β for some i, j ≤ q, where β is a constant close to 1. Select Vj(k+1) to be the tightest cluster.

14 Sequential identification of tight and stable clusters cont’d
Identify the tightest cluster and remove it from the whole data. Decrease k0 by 1. Repeat steps 1–3 to identify the next tight cluster. Remark: α, β, and k0 determine the tightness and size of the resulting clusters.

15 Tight Clustering Algorithm: (relax estimation of k)
[Figure: example average co-membership matrix entries; values near 1 (e.g. 0.95) indicate stable co-clustering, values near 0 (e.g. 0.01) indicate scattered points]

16 Example: a simple simulation in 2-D
14 normally distributed clusters (50 points each) plus 175 scattered points. Stdev = 0.1, 0.2, …, 1.4.
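The simulated data can be generated as sketched below; the cluster-center locations and the range of the scattered noise points are illustrative choices, not specified on the slide:

```python
import random

def simulate(seed=0):
    """2-D toy data: 14 Gaussian clusters (50 points each,
    stdev 0.1 ... 1.4) plus 175 uniformly scattered points."""
    rng = random.Random(seed)
    points, labels = [], []
    for c in range(14):
        cx, cy = rng.uniform(-10, 10), rng.uniform(-10, 10)
        sd = 0.1 * (c + 1)                 # stdev = 0.1, 0.2, ..., 1.4
        for _ in range(50):
            points.append((rng.gauss(cx, sd), rng.gauss(cy, sd)))
            labels.append(c)
    for _ in range(175):                   # scattered (noise) points
        points.append((rng.uniform(-10, 10), rng.uniform(-10, 10)))
        labels.append(-1)                  # -1 marks "no cluster"
    return points, labels

pts, labs = simulate()
```

The -1 label marks the scattered genes that tight clustering should leave unassigned, while ordinary K-means is forced to absorb them into some cluster.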

17 Example: tight clustering on the simulated data

18 Example:

19 Tight clustering: real example
Gene expression during the life cycle of Drosophila melanogaster. (2002) Science 297: 4028 genes monitored. The reference sample is pooled from all samples. 66 sequential time points spanning the embryonic (E), larval (L), pupal (P), and adult (A) periods. Filter genes without a significant pattern (1100 genes) and standardize each gene to have mean 0 and stdev 1.
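The per-gene standardization step described above (center each expression profile to mean 0 and scale to stdev 1) is a minimal computation; this sketch uses the population standard deviation, an assumption since the slide does not say which:

```python
import math

def standardize(profile):
    """Center a gene's expression profile to mean 0, stdev 1."""
    n = len(profile)
    mean = sum(profile) / n
    var = sum((v - mean) ** 2 for v in profile) / n   # population variance
    return [(v - mean) / math.sqrt(var) for v in profile]

z = standardize([1.0, 2.0, 3.0, 4.0])
```

Standardizing makes Euclidean distance between genes reflect the shape of their temporal patterns rather than their absolute expression levels.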

20 Tight clustering example
Comparison of various K-means and tight clustering: seven mini-chromosome maintenance (MCM) deficient genes. [Figures: K-means k=30; K-means k=50]

21 Tight clustering example cont’d [Figures: K-means k=70; K-means k=100; tight clustering]

22 Penalized and weighted K-means (PW-Kmeans)

23 PW-Kmeans

24 PW-Kmeans

25 PW-Kmeans

26 PW-Kmeans a special case

27 Relationship to classification likelihood

28 Relationship to classification likelihood

29 Estimate k and λ

30 Estimate k and λ
Divide the dataset into training and testing sets. Cluster the training data. Cluster the testing data. Measure how well the training-set result predicts co-memberships in the testing set by calculating, for each cluster in the test data, the proportion of all pairs of objects in it that are also assigned to the same cluster by the training cluster centroids.
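The co-membership check just described (in the spirit of Tibshirani's prediction strength) can be sketched as below; the function names and the toy train/test split are illustrative assumptions:

```python
from itertools import combinations

def nearest(x, centers):
    """Index of the nearest training centroid (squared Euclidean)."""
    d = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
    return d.index(min(d))

def co_membership_agreement(test_points, test_labels, train_centers):
    """For each cluster found in the test data: the proportion of its
    point pairs that the training centroids also place together."""
    pred = [nearest(p, train_centers) for p in test_points]
    out = {}
    for k in set(test_labels):
        idx = [i for i, lab in enumerate(test_labels) if lab == k]
        pairs = list(combinations(idx, 2))
        if not pairs:                    # singleton clusters agree trivially
            out[k] = 1.0
            continue
        out[k] = sum(pred[i] == pred[j] for i, j in pairs) / len(pairs)
    return out

# A test clustering that matches the training centroids perfectly:
train_centers = [(0.0, 0.0), (5.0, 5.0)]
test_points = [(0.1, 0.0), (-0.2, 0.1), (5.1, 4.8), (4.9, 5.2)]
test_labels = [0, 0, 1, 1]
scores = co_membership_agreement(test_points, test_labels, train_centers)
```

Low per-cluster agreement suggests the chosen k (or λ) does not generalize, which is the criterion used to tune both parameters.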

31

32

33 Application: Yeast cell cycle array data

34 Application: Yeast cell cycle array data

35 Penalized K-means

36 PW-Kmeans

37 Weight function

38 PW-Kmeans

39 Evaluate the clustering results

40 Evaluate the clustering results

41 Evaluate the clustering results

42 Conclusion
P-Kmeans is generally better than K-means. P-Kmeans makes fewer predictions than K-means but produces much higher accuracy. A smaller λ results in smaller clusters and fewer predictions made, but with better accuracy.

