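Machine Learning for Biological Data Mining. 장병탁 (Byoung-Tak Zhang), School of Computer Science and Engineering, Seoul National University. Modified by Kim Hye Jin, Department of Computer Science and Engineering, POSTECH (Pohang University of Science and Technology), IM.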


Clustering
Searching for groups (clusters) in the data
–In two or three dimensions, clusters can be visualized
–With more than three dimensions, we need some kind of analytical assistance
The categories of cluster analysis
–Partitioning algorithms: K-means, partitioning around medoids, fuzzy partitioning
–Hierarchical algorithms: hierarchical clustering, model-based hierarchical clustering

An example of a partitioning algorithm - K-means clustering: once the number of clusters (k) is fixed, k initial seeds are selected.

Assignment - K-means clustering: each case is assigned to the cluster of its nearest seed. The mean of each cluster is then computed, and the cluster's seed is moved to that mean.

Reassignment - K-means clustering: the preceding steps are repeated until the seeds move only very slightly.
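The seed, assignment, and reassignment steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind the slides: the function name `k_means`, the random choice of initial seeds, the tolerance `tol`, and the two-blob demonstration data are all assumptions made for the example.

```python
import numpy as np

def k_means(X, k, tol=1e-6, max_iter=100, seed=0):
    """Plain k-means on the rows of X; returns (labels, seeds)."""
    rng = np.random.default_rng(seed)
    # Initial seeds: k distinct cases drawn at random (one common choice).
    seeds = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Assignment: each case joins the cluster of its nearest seed.
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: move each seed to the mean of its cluster
        # (an empty cluster keeps its previous seed).
        new_seeds = seeds.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_seeds[j] = members.mean(axis=0)
        shift = np.linalg.norm(new_seeds - seeds, axis=1).max()
        seeds = new_seeds
        # Reassignment loop ends once the seeds barely move.
        if shift < tol:
            break
    return labels, seeds

# Two well-separated blobs of 20 points each (made-up demonstration data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(5.0, 0.3, size=(20, 2))])
labels, seeds = k_means(X, k=2)
```

The result depends on the initial seeds, so in practice the procedure is often restarted from several seed choices.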

Hierarchical algorithm
Producing a hierarchical structure displaying the order in which groups are merged or divided
–Agglomerative methods: start with each observation in a separate group and proceed until all observations are in a single group
–Divisive methods: the reverse of agglomerative methods
The between-cluster dissimilarity can be measured by
–the group average method
–the single linkage method
–the complete linkage method
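A naive agglomerative procedure with the three between-cluster dissimilarities named above might look like the following. It is a sketch for small data only (it rescans all pairs at every merge); the function name `agglomerative`, the use of Euclidean distance, and the five-point example are assumptions for illustration.

```python
import numpy as np

def agglomerative(X, n_clusters, linkage="average"):
    """Naive agglomerative clustering on the rows of X.

    Starts with every observation in its own group and repeatedly merges
    the two groups with the smallest between-cluster dissimilarity.
    """
    # Pairwise Euclidean distances between observations.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # single = closest pair, complete = farthest pair, average = mean pair.
    reducers = {"single": np.min, "complete": np.max, "average": np.mean}
    reduce_fn = reducers[linkage]
    clusters = [[i] for i in range(len(X))]
    merges = []  # (members of group a, members of group b, dissimilarity)
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = reduce_fn(D[np.ix_(clusters[a], clusters[b])])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], float(d)))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters, merges

# Five points on a line (made-up data): two tight groups far apart.
X = np.array([[0.0], [0.2], [0.4], [5.0], [5.3]])
clusters, merges = agglomerative(X, n_clusters=2, linkage="single")
groups = sorted(sorted(c) for c in clusters)
```

The recorded `merges` list is the merge order that a clustering tree (dendrogram) displays; running down to `n_clusters=1` gives the full hierarchy.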

Part of the yeast data

Clustering Tree by hierarchical clustering

Image of clustered result

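Questions
Which clusters are significant, and how many are there?
Should conventional clustering techniques still be applied when each slide comes from a different cell line?
In other words, which cluster analysis suits the experiment and its purpose?
Gene Shaving, by Hastie, Tibshirani,… (2000)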

Motivation and Details
We favor subsets of genes that
–all behave in a similar manner (coherence)
–and all show large variance across the cell lines.
Given an expression array, we seek a sequence of nested gene clusters $S_k$ of size k, where $S_k$ has the property that the variance of its cluster mean is the maximum over all clusters of size k.

Gene Shaving Algorithm-1
STEP 1. Start with the entire expression data X, each row centered to have zero mean.
STEP 2. Compute the leading principal component of the rows of X.
STEP 3. Shave off the proportion alpha (typically 10%) of the rows having the smallest absolute inner product with the leading principal component.
STEP 4. Repeat steps 2 and 3 until only one gene remains.

Gene Shaving Algorithm-2
STEP 5. This produces a sequence of nested gene clusters, where $S_k$ denotes a cluster of k genes. Estimate the optimal cluster size $\hat{k}$.
STEP 6. Orthogonalize each row of X with respect to the average gene in $S_{\hat{k}}$.
STEP 7. Repeat steps 1-5 above with the orthogonalized data to find the second optimal cluster. This process is continued until a maximum of M clusters are found, with M chosen a priori.
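Steps 1-4 and step 6 can be sketched with NumPy as below. This is an illustrative reading of the steps, not the authors' implementation: the helper names (`leading_pc`, `shave`, `orthogonalize`), the rule of dropping at least one row per pass, the use of the absolute inner product, and the toy data are assumptions. Step 5 (choosing the cluster size) is left out here, and the toy example simply picks the cluster of size 5.

```python
import numpy as np

def leading_pc(X):
    """First principal component of the rows of X: a unit vector over slides."""
    # Right singular vector belonging to the largest singular value.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[0]

def shave(X, alpha=0.1):
    """Steps 1-4: nested clusters as arrays of row indices, largest first."""
    Xc = X - X.mean(axis=1, keepdims=True)      # Step 1: centre each row
    idx = np.arange(X.shape[0])
    nested = [idx.copy()]
    while len(idx) > 1:
        pc = leading_pc(Xc[idx])                # Step 2
        score = np.abs(Xc[idx] @ pc)            # |inner product| with the PC
        n_drop = max(1, int(alpha * len(idx)))  # Step 3: drop at least one row
        keep = np.argsort(score)[n_drop:]
        idx = idx[np.sort(keep)]
        nested.append(idx.copy())               # Step 4: repeat on what is left
    return nested

def orthogonalize(X, cluster):
    """Step 6: remove from every row its component along the average gene."""
    mean_gene = X[cluster].mean(axis=0)
    coef = (X @ mean_gene) / (mean_gene @ mean_gene)
    return X - np.outer(coef, mean_gene)

# Toy data (made up): 30 genes x 12 slides, genes 0-4 share one strong profile.
rng = np.random.default_rng(0)
pattern = np.where(np.arange(12) < 6, 1.0, -1.0)
X = rng.normal(0.0, 0.3, size=(30, 12))
X[:5] += 3.0 * pattern
X = X - X.mean(axis=1, keepdims=True)

nested = shave(X, alpha=0.1)
cluster5 = next(c for c in nested if len(c) == 5)
X_orth = orthogonalize(X, cluster5)             # input for the next round (Step 7)
```

After orthogonalization every row is uncorrelated with the first cluster's average gene, which is what pushes the next round of shaving toward a different cluster.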

Principal Component of the rows
[Figure: the genes × slides expression matrix, from which the first principal component $Z_1$ is derived; $Z_1$ is a single profile across slides, the "super-gene".]

The Gap estimate of cluster size
We then select as the optimal number of genes the value of k producing the largest gap: $\hat{k} = \arg\max_k \mathrm{Gap}(k)$

Variances
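The variances behind the gap estimate can be sketched as follows. The sketch assumes the definitions usually given for gene shaving: the within variance V_W measures the spread of a cluster's genes around the cluster average, the between variance V_B measures the spread of the cluster average across slides, R² = 100·V_B/(V_B + V_W), and Gap(k) is the R² of the size-k cluster minus the average R² obtained on data in which each row has been permuted independently. The function names, the compact `shave` helper, and the toy data are assumptions, and the signed averaging of negatively correlated genes is ignored for brevity.

```python
import numpy as np

def percent_variance(X, cluster):
    """R^2 = 100 * V_B / (V_B + V_W) for the rows of X listed in `cluster`."""
    block = X[cluster]
    col_mean = block.mean(axis=0)                     # cluster average per slide
    v_within = ((block - col_mean) ** 2).mean()       # genes around the average
    v_between = ((col_mean - block.mean()) ** 2).mean()
    return 100.0 * v_between / (v_between + v_within)

def shave(X, alpha=0.1):
    """Nested clusters from repeated shaving (rows of X assumed centred)."""
    idx = np.arange(X.shape[0])
    nested = [idx.copy()]
    while len(idx) > 1:
        pc = np.linalg.svd(X[idx], full_matrices=False)[2][0]
        order = np.argsort(np.abs(X[idx] @ pc))
        idx = idx[np.sort(order[max(1, int(alpha * len(idx))):])]
        nested.append(idx.copy())
    return nested

def gap_curve(X, alpha=0.1, B=20, seed=0):
    """Cluster sizes and Gap(k) = observed R^2 minus mean R^2 over B permutations."""
    rng = np.random.default_rng(seed)
    nested = shave(X, alpha)
    observed = np.array([percent_variance(X, c) for c in nested])
    null = np.zeros((B, len(nested)))
    for b in range(B):
        # Permute the entries within each row to destroy shared structure.
        Xp = np.array([rng.permutation(row) for row in X])
        null[b] = [percent_variance(Xp, c) for c in shave(Xp, alpha)]
    sizes = np.array([len(c) for c in nested])
    return sizes, observed - null.mean(axis=0)

# Toy data (made up): 30 genes x 12 slides, genes 0-4 share one strong profile.
rng = np.random.default_rng(0)
pattern = np.where(np.arange(12) < 6, 1.0, -1.0)
X = rng.normal(0.0, 0.3, size=(30, 12))
X[:5] += 3.0 * pattern
X = X - X.mean(axis=1, keepdims=True)

sizes, gap = gap_curve(X, alpha=0.1, B=20)
k_hat = int(sizes[np.argmax(gap)])                    # size with the largest gap
```

Because the sequence of cluster sizes depends only on the number of rows and on alpha, the observed and permuted curves can be compared size by size.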

Gene Shaving Process

Simulated Data-formula
Generated data with N=100 genes and p=60 slides, with terms distributed ~ N(0,1). [Equation not recoverable from the transcript.]

Simulated Data-image
[Figure: image of the simulated data; axes are slides (p=60) and genes (N=100).]

Clusters by hierarchical clustering-1 Looks good, but…

Clusters by hierarchical clustering-2
Gene #: 6, 9, 10, 1, 4, 5, 8, 7, 2, 3
Gene #: 17, 11, 15, 20, 16, 12, 14, 13, 19, 18, 70, 84, 100, 65, 79

Running Gene Shaving
Alpha = 0.1, B = 20, number of clusters M = 4

Clusters by gene shaving – cluster #1 (ordered by column mean)
Gene #: 36 8 2 9 1 7

Clusters by gene shaving – cluster #2 (ordered by column mean)
Gene #: 121913 16 171418 11

Clusters by gene shaving – cluster #3 (ordered by column mean)
Gene #: 672245754107

Clusters by gene shaving – cluster #4 (ordered by column mean)
Gene #: 1816

Gap Statistics
The gap statistic drops sharply from the third cluster onward, so two clusters can be judged the most appropriate.
