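Machine Learning for Biological Data Mining. 장병탁 (Byoung-Tak Zhang), School of Computer Science and Engineering, Seoul National University. Modified by Kim Hye Jin, Department of Computer Science and Engineering, POSTECH (Pohang University of Science and Technology), IM.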


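1 Machine Learning for Biological Data Mining
장병탁 (Byoung-Tak Zhang), School of Computer Science and Engineering, Seoul National University. Modified by Kim Hye Jin, Department of Computer Science and Engineering, POSTECH, IM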

2 Clustering
Searching for groups (clusters) in the data
 - In two or three dimensions, clusters can be visualized
 - With more than three dimensions, we need some kind of analytical assistance
The categories of cluster analysis
 - Partitioning algorithms: K-means, partitioning around medoids, fuzzy partitioning
 - Hierarchical algorithms: hierarchical clustering, model-based hierarchical clustering

3 An example of a partitioning algorithm - K-means clustering
Once the number of clusters (k) is fixed, select k initial seeds.

4 Assignment - K-means clustering
Assign each case to the cluster of its nearest seed. Then compute the mean of each cluster and move that cluster's seed to the mean.

5 Reassignment - K-means clustering
Repeat the previous steps until the seeds move only very slightly.
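
The three K-means slides can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code behind the slides: the function name `kmeans`, the Euclidean distance, the random choice of seeds among the data points, and the tolerance `tol` are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, tol=1e-6, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Slide 3: select k initial seeds (assumption: k distinct data points).
    seeds = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Slide 4: assign each case to the cluster of its nearest seed.
        labels = [min(range(k), key=lambda j: math.dist(p, seeds[j]))
                  for p in points]
        # Slide 4: move each seed to the mean of the cases assigned to it.
        new_seeds = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_seeds.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_seeds.append(seeds[j])  # an empty cluster keeps its seed
        # Slide 5: stop once the seeds have almost stopped moving.
        shift = max(math.dist(a, b) for a, b in zip(seeds, new_seeds))
        seeds = new_seeds
        if shift < tol:
            break
    return labels, seeds

# Toy data: two well-separated groups of three cases each.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, seeds = kmeans(points, k=2)
print(labels, seeds)
```

On data this well separated the result does not depend on which seeds are drawn; on real expression data the outcome can vary with the initial seeds, so several restarts are commonly compared.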

6 Hierarchical algorithm
Producing a hierarchical structure displaying the order in which groups are merged or divided
 - Agglomerative methods: starting with each observation in a separate group, and proceeding until all observations are in a single group
 - Divisive methods: the reverse of agglomerative methods
The between-cluster dissimilarity
 - Group average method
 - Single linkage method
 - Complete linkage method
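
A minimal sketch of the agglomerative case, with the three between-cluster dissimilarities named on the slide. The function name `agglomerative`, the Euclidean distance, and the brute-force search over all cluster pairs are assumptions made for readability; they are not taken from the slides.

```python
import math

def agglomerative(points, n_clusters, linkage="average"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # each observation starts alone
    merges = []                                   # order in which groups merged

    def dissimilarity(a, b):
        d = [math.dist(points[i], points[j]) for i in a for j in b]
        if linkage == "single":      # closest pair of members
            return min(d)
        if linkage == "complete":    # farthest pair of members
            return max(d)
        if linkage == "average":     # group average over all pairs
            return sum(d) / len(d)
        raise ValueError("unknown linkage: " + linkage)

    while len(clusters) > n_clusters:
        pairs = [(dissimilarity(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)
        merges.append((list(clusters[i]), list(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

# Toy data: five observations on a line.
points = [(0.0,), (1.0,), (3.0,), (10.0,), (11.0,)]
clusters, merges = agglomerative(points, n_clusters=2, linkage="single")
print(clusters)
for left, right, d in merges:
    print(left, "+", right, "at dissimilarity", d)
```

The `merges` list records the merge order, which is the information a clustering tree displays. Running down to `n_clusters=1` gives the full hierarchy.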

7 A part of the yeast data

8 Clustering Tree by hierarchical clustering

9 Image of clustered result

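10 Questions
 - Which clusters are significant, and how many are there?
 - Should conventional clustering techniques still be applied when each slide comes from a different cell line?
 - In other words, which cluster analysis suits the experiment and its purpose?
Gene Shaving, by Hastie, Tibshirani,… (2000)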

11 Motivation and Details
We favor subsets of genes that
 - all behave in a similar manner (coherence), and
 - all show large variation across the cell lines.
Given an expression array, we seek a sequence of nested gene clusters S_k of size k. Each S_k has the property that the variance of its cluster mean is the maximum over all clusters of size k.

12 Gene Shaving Algorithm-1
STEP 1. Start with the entire expression data X, each row centered to have zero mean.
STEP 2. Compute the leading principal component of the rows of X.
STEP 3. Shave off the proportion alpha (typically 10%) of the rows having smallest inner-product with the leading principal component.
STEP 4. Repeat steps 2 and 3 until only one gene remains.
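
Steps 1 to 4 can be sketched as follows. This is a hedged illustration, not the authors' implementation. Several choices are assumptions made for the example: the names `shaving_sequence` and `leading_pc`, the use of power iteration to find the leading principal component, ranking genes by the absolute value of the inner product (so that anti-correlated genes are kept), and always shaving at least one gene per pass. The toy data are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def center_rows(X):
    # STEP 1: give every row (gene) zero mean.
    return [[x - sum(row) / len(row) for x in row] for row in X]

def leading_pc(rows, n_iter=200):
    # STEP 2: leading principal component of the rows, approximated by
    # power iteration on X^T X (an assumed method, chosen for brevity).
    v = list(max(rows, key=lambda r: dot(r, r)))  # start from the largest row
    for _ in range(n_iter):
        scores = [dot(r, v) for r in rows]                    # X v
        w = [sum(s * r[j] for s, r in zip(scores, rows))      # X^T (X v)
             for j in range(len(v))]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def shaving_sequence(X, alpha=0.1):
    """Return the nested list of gene-index sets produced by shaving."""
    X = center_rows(X)
    genes = list(range(len(X)))
    sequence = [genes]
    while len(genes) > 1:
        pc = leading_pc([X[g] for g in genes])
        # STEP 3: shave the alpha fraction with the smallest |inner product|
        # (assumption: at least one gene, and never the last one).
        n_drop = min(max(1, int(alpha * len(genes))), len(genes) - 1)
        ranked = sorted(genes, key=lambda g: abs(dot(X[g], pc)), reverse=True)
        genes = sorted(ranked[:len(genes) - n_drop])
        sequence.append(genes)  # STEP 4: repeat until one gene remains
    return sequence

# Hypothetical toy data: genes 0-3 share a strong pattern, genes 4-11 are noise.
rng = random.Random(0)
pattern = [3.0] * 4 + [-3.0] * 4
X = [[(pattern[j] if g < 4 else 0.0) + rng.gauss(0, 0.3) for j in range(8)]
     for g in range(12)]
sequence = shaving_sequence(X)
for cluster in sequence:
    print(len(cluster), cluster)
```

In this toy run the noise genes are shaved first, so the cluster of size 4 consists of the four patterned genes.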

13 Gene Shaving Algorithm-2
STEP 5. This produces a sequence of nested gene clusters S_k, where S_k denotes a cluster of k genes. Estimate the optimal cluster size.
STEP 6. Orthogonalize each row of X with respect to the average gene in the cluster of optimal size.
STEP 7. Repeat steps 1-5 above with the orthogonalized data to find the second optimal cluster. This process is continued until a maximum of M clusters are found, with M chosen a priori.
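
The orthogonalization in step 6 amounts to removing from every row its projection onto the average gene of the selected cluster. A minimal sketch, assuming plain lists for the data; the helper names `average_gene` and `orthogonalize_rows` and the small matrix are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def average_gene(X, cluster):
    """Column-wise mean of the rows (genes) whose indices are in `cluster`."""
    p = len(X[0])
    return [sum(X[g][j] for g in cluster) / len(cluster) for j in range(p)]

def orthogonalize_rows(X, avg):
    """Subtract from each row its projection onto the average gene `avg`."""
    denom = dot(avg, avg)
    if denom == 0.0:
        return [list(row) for row in X]  # nothing to remove
    out = []
    for row in X:
        coef = dot(row, avg) / denom
        out.append([x - coef * a for x, a in zip(row, avg)])
    return out

# Hypothetical 3-gene example; suppose genes 0 and 1 form the selected cluster.
X = [[1.0, -2.0, 1.0],
     [2.0, -3.5, 1.5],
     [0.5, 1.0, -1.5]]
avg = average_gene(X, [0, 1])
X_orth = orthogonalize_rows(X, avg)
print([round(dot(row, avg), 12) for row in X_orth])
```

After this step every row has zero inner product with the average gene, so a second pass of shaving is pushed toward a cluster whose mean pattern differs from the first.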

14 Principal Component of the rows
[Figure: the genes x slides expression matrix; deriving the first principal component Z1 gives a single profile across the slides, the "super-gene".]

15 The Gap estimate of cluster size
We then select, as the optimal number of genes, the value k that produces the largest gap.
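
The formula itself did not survive in this transcript, so the sketch below rests on the definition as recalled from the gene shaving paper, and should be read as an assumption rather than as the slide's content: D_k is the percent of variance explained by cluster S_k, R^2 = 100 * V_B / (V_B + V_W), where V_B is the between variance of the cluster's column means and V_W the within variance around them; D*_k is the same quantity on data whose entries have been permuted within each row; and Gap(k) = D_k - mean over permutations of D*_k. All function names are hypothetical, and the D values in the usage example are made up purely to show the selection step.

```python
import random

def percent_variance_explained(rows):
    """R^2 = 100 * V_B / (V_B + V_W) for a cluster given as gene rows."""
    k, p = len(rows), len(rows[0])
    col_mean = [sum(r[j] for r in rows) / k for j in range(p)]
    grand = sum(col_mean) / p
    v_between = sum((m - grand) ** 2 for m in col_mean) / p
    v_within = sum(sum((r[j] - col_mean[j]) ** 2 for r in rows) / k
                   for j in range(p)) / p
    total = v_between + v_within
    return 100.0 * v_between / total if total > 0 else 0.0

def permute_within_rows(X, rng):
    """Null data: shuffle the entries of each row independently."""
    return [rng.sample(row, len(row)) for row in X]

def gap_curve(d_observed, d_permuted):
    """d_observed: {k: D_k}; d_permuted: {k: [D*_k for each permutation]}."""
    return {k: d_observed[k] - sum(d_permuted[k]) / len(d_permuted[k])
            for k in d_observed}

def best_cluster_size(gaps):
    return max(gaps, key=gaps.get)

# A tight two-gene cluster explains almost all of its variance.
cluster_rows = [[2.0, 2.0, -2.0, -2.0],
                [2.2, 1.8, -2.1, -1.9]]
r2 = percent_variance_explained(cluster_rows)

# Made-up D values, only to illustrate picking the largest gap.
d_observed = {8: 60.0, 4: 85.0, 2: 90.0}
d_permuted = {8: [30.0, 32.0], 4: [50.0, 54.0], 2: [80.0, 84.0]}
gaps = gap_curve(d_observed, d_permuted)
best = best_cluster_size(gaps)
print(r2, gaps, best)
```

In a full run, each D*_k would come from repeating the whole shaving process on a permuted copy of X, once per permutation.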

16 Variances

18 Gene Shaving Process

20 Simulated Data-formula Generated data with N=100, p=60, ~ N(0,1) where for and for

21 Simulated Data-image
[Figure: image of the simulated data; axes: slides (p=60) and genes (N=100).]

22 Clusters by hierarchical clustering-1 Looks good, but…

23 Clusters by hierarchical clustering-2
Gene # 6,9,10,1,4,5,8,7,2,3
Gene # 17,11,15,20,16,12,14,13,19,18,70,84,100,65,79

24 Running Gene Shaving
Alpha = 0.1, B = 20, number of clusters M = 4

25 Clusters by gene shaving - cluster #1
[Figure: cluster #1, ordered by column mean. Gene #: 36 8 2 9 1 7]

26 Clusters by gene shaving - cluster #2
[Figure: cluster #2, ordered by column mean. Gene #: 121913 16 171418 11]

27 Clusters by gene shaving - cluster #3
[Figure: cluster #3, ordered by column mean. Gene #: 672245754107]

28 Clusters by gene shaving - cluster #4
[Figure: cluster #4, ordered by column mean. Gene #: 1816]

29 Gap Statistics
The gap statistic drops sharply from the third cluster onward, so two clusters can be judged the most appropriate.
