Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 Dept. Computer Science and Information Engineering National Cheng Kung University Taiwan, R.O.C. August 13, 2001
2 Outline: Microarray Techniques; Goal of Microarray Data Mining; Clustering Methods; Efficient Microarray Data Mining; Conclusions
3 Current Status: The Human Genome Project is in its finishing stage, revealing that there are about 30,000 functional genes in a human cell. For more than 90% of these genes, we know little about their real functions.
4 Microarray Techniques: Main advantage of microarray techniques: they allow the simultaneous study of the expression of thousands of genes in a single experiment. Microarray Process: Arrayer; Experiments: Hybridization; Image Capturing of Results; Analysis
5 Goal of Microarray Mining: Multi-Conditions Expression Analysis [Table: expression matrix indexed by gene and test (A, B, C, …)]
7 Sample Clustering Results
8 Clustering Methods: Types of Clustering Methods
Partitioning: K-Means, K-Medoids, PAM, CLARA, …
Hierarchical: HAC, BIRCH, CURE, ROCK
Density-based: CAST, DBSCAN, OPTICS, CLIQUE, …
Grid-based: STING, CLIQUE, WaveCluster, …
Model-based: COBWEB, SOM, CLASSIT, AutoClass, …
9 Clustering Methods (cont.): Partitioning; Hierarchical
10 Clustering Methods (cont.): Density-based; Grid-based
11 CAST Clustering
Input
S: a symmetric n × n similarity matrix, S(i, j) ∈ [0, 1]
t: affinity threshold (0 < t < 1)
Method
1. Choose a seed for generating a new cluster
2. ADD: add qualified items to the cluster
3. REMOVE: remove unqualified items from the stable cluster
4. Repeat Steps 1-3 until no more clusters can be generated
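The CAST steps above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation. It assumes the commonly described CAST rule that an item "qualifies" when its affinity (the sum of its similarities to the current cluster members) is at least t times the cluster size. The seed choice (lowest-index unassigned item) and the `max_moves` guard against oscillation are hypothetical choices made here for illustration.

```python
import numpy as np

def cast(S, t, max_moves=10_000):
    """CAST-style clustering on a symmetric similarity matrix S with threshold t."""
    S = np.asarray(S, dtype=float)
    unassigned = set(range(S.shape[0]))
    clusters = []
    while unassigned:
        # Step 1: pick a seed (assumption: lowest-index unassigned item).
        seed = min(unassigned)
        unassigned.discard(seed)
        cluster = {seed}
        # affinity[i] = sum of S(i, j) over all j currently in the cluster
        affinity = S[:, seed].copy()
        for _ in range(max_moves):  # hypothetical cap to avoid endless add/remove cycles
            changed = False
            # Step 2 (ADD): add the unassigned item with the highest affinity if it qualifies.
            if unassigned:
                best = max(unassigned, key=lambda i: affinity[i])
                if affinity[best] >= t * len(cluster):
                    cluster.add(best)
                    unassigned.discard(best)
                    affinity += S[:, best]
                    changed = True
            # Step 3 (REMOVE): drop the member with the lowest affinity if it no longer qualifies.
            if not changed and len(cluster) > 1:
                worst = min(cluster, key=lambda i: affinity[i])
                if affinity[worst] < t * len(cluster):
                    cluster.discard(worst)
                    unassigned.add(worst)
                    affinity -= S[:, worst]
                    changed = True
            if not changed:
                break  # the cluster is stable; close it
        # Step 4: record the cluster and start a new one from the remaining items.
        clusters.append(sorted(cluster))
    return clusters

# Toy similarity matrix with two obvious groups: {0, 1} and {2, 3}.
S_demo = [[1.0, 0.9, 0.1, 0.1],
          [0.9, 1.0, 0.1, 0.1],
          [0.1, 0.1, 1.0, 0.8],
          [0.1, 0.1, 0.8, 1.0]]
print(cast(S_demo, t=0.5))
```

Note that the number of clusters is not an input: it falls out of the threshold t, which is why choosing t well matters.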
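12 Similarity Measurements: Correlation Coefficients. The most popular correlation coefficient is the Pearson correlation coefficient (1892). The correlation between X = {X_1, X_2, …, X_n} and Y = {Y_1, Y_2, …, Y_n} is
$$ r_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} $$
where $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$.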
13 Similarity Measurements: Correlation Coefficients (cont.) It captures the similarity of the “shapes” of two expression profiles and ignores differences between their magnitudes.
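A short Python sketch makes the "shape, not magnitude" property concrete. It uses the standard definition of the Pearson correlation coefficient; the function name `pearson` and the toy profiles are illustrative only.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

profile = [1.0, 3.0, 2.0, 5.0]             # a toy expression profile
scaled = [2.0 * v + 10.0 for v in profile]  # same shape, different magnitude and offset
flipped = [-v for v in profile]             # mirrored shape

print(pearson(profile, scaled))   # close to +1: identical shape
print(pearson(profile, flipped))  # close to -1: opposite shape
```

Because the coefficient ranges over [-1, 1], it would need to be mapped into [0, 1] before being used as a similarity S(i, j) for CAST; the slides do not say which mapping was used.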
14 Problems in Microarray Mining: How can microarray data be clustered with the following requirements met simultaneously? Efficiency; Accuracy; Automation
15 Problems in Microarray Mining (cont.): How can microarray data be clustered with the following requirements met simultaneously? Efficiency; Accuracy; Automation. Good Clustering Methods + Validation Techniques
16 Efficient Microarray Mining: Improved CAST algorithm for clustering; Hubert’s Γ statistic for validation; Iterative sampled computation for automatic clustering
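To show how Hubert's Γ statistic can score a clustering result, here is a minimal Python sketch. The slides do not say which variant is used. This sketch assumes the normalized form: the correlation, over all item pairs, between the similarity S(i, j) and an indicator that is 1 when i and j share a cluster and 0 otherwise. The function name `hubert_gamma` and the choice to return 0.0 for degenerate inputs are assumptions made here.

```python
import numpy as np

def hubert_gamma(S, labels):
    """Normalized Hubert's Gamma between similarities and cluster co-membership."""
    S = np.asarray(S, dtype=float)
    labels = np.asarray(labels)
    iu = np.triu_indices(len(labels), k=1)  # each unordered pair (i < j) once
    x = S[iu]                                # pairwise similarities
    y = (labels[:, None] == labels[None, :])[iu].astype(float)  # 1 if same cluster
    if x.std() == 0 or y.std() == 0:
        return 0.0  # assumption: treat the undefined case as "no agreement"
    return float(np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std()))

S_demo = [[1.0, 0.9, 0.1, 0.1],
          [0.9, 1.0, 0.1, 0.1],
          [0.1, 0.1, 1.0, 0.8],
          [0.1, 0.1, 0.8, 1.0]]
print(hubert_gamma(S_demo, [0, 0, 1, 1]))  # clustering that matches the structure
print(hubert_gamma(S_demo, [0, 1, 0, 1]))  # clustering that cuts across it
```

A higher value means the clustering agrees better with the similarity matrix, so results obtained at different thresholds can be compared without manual inspection.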
17 Reduce the Computation: 1. Narrow down the threshold range. 2. Split and Conquer: find a “nearly-best” result. [Figure: threshold axis from 0 to 100% with a Left Margin (LM) and a Right Margin (RM); m = 4]
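The slide gives only the outline of the search, so the following Python sketch is one plausible reading rather than the method itself. It assumes that m is the number of equal sub-intervals into which the current range [LM, RM] is split, that the validation score is evaluated at each split point, and that the range is then narrowed to the neighbours of the best point. The names `narrow_search`, `score`, and `rounds` are hypothetical.

```python
def narrow_search(score, lo=0.0, hi=1.0, m=4, rounds=3):
    """Search [lo, hi] for a threshold with a nearly-best score.

    score(t) is any callable returning a quality value for threshold t;
    in the clustering setting it would run the clustering at t and return
    the validation statistic (an assumption about how the pieces connect).
    """
    best_t, best_s = None, float("-inf")
    evaluations = 0
    for _ in range(rounds):
        step = (hi - lo) / m
        points = [lo + k * step for k in range(m + 1)]  # m sub-intervals, m + 1 points
        scores = [score(p) for p in points]
        evaluations += len(points)
        k = max(range(m + 1), key=scores.__getitem__)   # index of the best point
        if scores[k] > best_s:
            best_t, best_s = points[k], scores[k]
        # Narrow the range to the neighbours of the best point.
        lo, hi = points[max(k - 1, 0)], points[min(k + 1, m)]
    return best_t, best_s, evaluations

# Toy score with a single peak at t = 0.6, standing in for a real validation score.
t, s, n = narrow_search(lambda t: -(t - 0.6) ** 2)
print(t, s, n)
```

The result is only "nearly-best" because the search assumes the score is reasonably smooth around its peak; a score with several separated peaks could lead it to the wrong sub-interval.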
18 Experimental Results: Dataset Source: Lawrence Berkeley National Lab (LBNL), Michael Eisen's Lab ( ). Microarray expression data of the yeast Saccharomyces cerevisiae, containing 6221 genes under 80 conditions. The similarity matrix was obtained in advance.
19 Experimental Results (cont.)
Without range narrowing: Executions: 19; Execution Time: 246 sec; Γ statistic:
With range narrowing: Executions: 13; Execution Time: 27 sec; Γ statistic:
20 Experimental Results (cont.): Comparison. Columns: Method | Execution Time (sec) | Cluster Number | Best Γ Statistic. Rows: Our Method; K-means (k = 3 to 21); K-means (k = 3 to 39)
21 Conclusions: Microarray data analysis is an emerging field that needs the support of data mining techniques: Accuracy; Efficiency; Automation