Published by Loreen McDonald. Modified over 6 years ago.
Subspace Clustering for Microarray Data Analysis:
Multiple Criteria and Significance Assessment

Biotechnology Center at the University of Illinois at Urbana-Champaign
Hui Fang, ChengXiang Zhai and Lei Liu
University of Illinois at Urbana-Champaign

Introduction

Microarray Data: A microarray is a powerful tool for monitoring the expression patterns of thousands of genes simultaneously.

[Figure: expression-level matrix with genes g1 … gn and conditions c1 … cm (sample entries 0.9, -0.2, -0.1, 0.5). Labels: "Repeated several times"; "the mean of several replicates".]

Goal: Find gene clusters that are truly meaningful biologically.

Challenges:
How to explore multiple criteria in clustering?
How to assess the significance and quality of clusters?

Experiment Results

Importance of multiple criteria (GO-based measure)

[Figure: GO hierarchy excerpt with nodes Biological Process, Physiological Processes, Cellular Process, Metabolism, Cell Communication, and Cell Growth and/or Maintenance. Labels: "Common Parent Node with Single criteria"; "Common Parent Node with Multiple criteria"; "22 genes"; "14 genes".]

Effectiveness of Confidence-based Measure
Our idea: Use a synthetic data set to show that the noisy clusters are always ranked lower than other clusters.

Correlation between Two Types of Measures
There is a strong correlation between the GO-based measure and the confidence-based measure.

GO-based Measure

The quality of a cluster is defined as the depth of the most specific common parent node in the GO hierarchy shared by the genes in the cluster. The more specific the node, the better the cluster. For example, for the cluster of genes {YBL028C, YCR100C, YFL044C, YHR143W-A, YIL069C, YIL131C, YJR133W, YKL173W, YKR056W, YLL028W, YLL056C, YLR361C, YNL159C, YNL166C, YNR038W, YOL092W}, the GO-based measure is 5.
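As an illustration of the GO-based measure, the sketch below scores a cluster by the depth of the most specific node shared by all of its genes. The toy hierarchy, the term names (root, a, b, ...), the function names, and the conventions that the root has depth 0 and that depth follows the longest path to the root are assumptions made for this example; they are not taken from the poster or from the real Gene Ontology.

```python
from typing import Dict, Iterable, List, Optional, Set

# Hypothetical toy hierarchy (NOT the real Gene Ontology):
# each term maps to the list of its parent terms.
TOY_PARENTS: Dict[str, List[str]] = {
    "root": [],
    "a": ["root"],
    "b": ["a"],
    "c": ["b"],
    "d": ["b"],
    "e": ["a"],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return the term itself plus every node reachable through parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def depth(term: str, parents: Dict[str, List[str]]) -> int:
    """Depth of a term; assumption: root = 0, longest path to the root counts."""
    node_parents = parents.get(term, [])
    if not node_parents:
        return 0
    return 1 + max(depth(p, parents) for p in node_parents)


def go_measure(gene_terms: Iterable[Set[str]],
               parents: Dict[str, List[str]]) -> Optional[int]:
    """Depth of the deepest node shared by all genes, or None if none is shared.

    gene_terms holds one set of annotated terms per gene in the cluster.
    """
    common: Optional[Set[str]] = None
    for terms in gene_terms:
        gene_ancestors: Set[str] = set()
        for term in terms:
            gene_ancestors |= ancestors(term, parents)
        common = gene_ancestors if common is None else common & gene_ancestors
    if not common:
        return None
    return max(depth(t, parents) for t in common)


# Genes annotated with the same specific term share a deep node ...
print(go_measure([{"c"}, {"c"}], TOY_PARENTS))
# ... sibling terms share their parent ...
print(go_measure([{"c"}, {"d"}], TOY_PARENTS))
# ... and distant terms share only a node near the root.
print(go_measure([{"c"}, {"e"}], TOY_PARENTS))
```

Under these assumptions a higher score means the genes agree on a more specific annotation, which matches the rule that a more specific common node indicates a better cluster.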
[Figure: Gene Ontology hierarchy excerpt, Cellular Component branch. Node labels: Gene Ontology; Cellular Component; Cell; Intracellular; Nucleus; Nucleolus; Nucleoplasm; DNA-directed RNA polymerase I complex; DNA-directed RNA polymerase III complex; DNA-directed RNA polymerase II, holoenzyme; DNA-directed RNA polymerase II core complex.]

Objectives

(1) Perform subspace clustering
(2) Study whether multiple criteria improve the quality of clusters
(3) Design the quality measure of clusters based on:
    Domain knowledge (e.g. gene ontology)
    Significance (e.g. variance in replicates)

[Figure: two panels, Raw Data and Subspace Clusters; axis labels: Expression Level, Conditions (tick labels a to j).]

Confidence-based Measure

Motivation: A gene may fall into a cluster purely because of high variance in the replicates rather than biological relevance.
Goal: Give priority to the clusters that are not generated by chance.
Our idea: Measure the quality of each generated cluster based on the original variances of each data point.
Method 1: Compute the confidence level by taking the average of the standard deviations of the data points in the cluster.
Method 2: Compute how likely the cluster is to satisfy the constraints.
The higher the variance, the lower the confidence.

True Clusters:
cluster1: {11, 20, 21, 35, 54}
cluster2: {4, 62, 70, 74, 92}
cluster3: {14, 15, 58, 71, 79}
cluster4: {45, 48, 52, 83, 94}

Found Clusters by Methods:
cluster1: {11, 20, 21, 35, 54}
cluster2: {4, 62, 70, 74, 92}
cluster3: {14, 15, 58, 71, 79}
cluster4: {45, 48, 52, 83, 94}
cluster5: {6, 43, 48, 51, 85}

cluster1: {4, 14, 53, 54, 76}
cluster2: {25, 29, 31, 49, 80}
cluster3: {1, 6, 61, 85, 92}
cluster4: {2, 86, 89, 93, 97}

cluster1: {4, 14, 53, 54, 76}
cluster2: {25, 29, 31, 49, 80}
cluster3: {1, 6, 61, 85, 92}
cluster4: {2, 57, 86, 89, 93, 97}

Our Solution: Multiple Criteria

Combine the fluctuation constraint and the trend constraint.

Fluctuation Constraint: The difference of expression levels between two genes needs to be similar over every condition.
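One way the fluctuation constraint could be checked for a candidate cluster is sketched below: for each pair of genes, take the per-condition difference in expression level and require that these differences vary by no more than a threshold. The poster does not say how "similar" is quantified, so the threshold delta, the function name, the nested-dictionary layout, and the sample values are all assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, Sequence

# Assumed layout: expr[gene][condition] -> expression level.
Expression = Dict[str, Dict[str, float]]


def satisfies_fluctuation(expr: Expression,
                          genes: Sequence[str],
                          conditions: Sequence[str],
                          delta: float) -> bool:
    """Check that every gene pair keeps a similar difference over the conditions.

    "Similar" is interpreted (assumption) as: the spread between the largest
    and smallest per-condition difference is at most delta.
    """
    if not conditions:
        return True  # nothing to violate on an empty subspace
    for g1, g2 in combinations(genes, 2):
        diffs = [expr[g1][c] - expr[g2][c] for c in conditions]
        if max(diffs) - min(diffs) > delta:
            return False
    return True


# Hypothetical values: g1 and g2 stay about 0.4 to 0.5 apart on c1 and c2,
# but diverge strongly on c3.
expr: Expression = {
    "g1": {"c1": 1.0, "c2": 2.0, "c3": 0.5},
    "g2": {"c1": 1.5, "c2": 2.4, "c3": 5.0},
}

in_subspace = satisfies_fluctuation(expr, ["g1", "g2"], ["c1", "c2"], delta=0.2)
in_full_space = satisfies_fluctuation(expr, ["g1", "g2"], ["c1", "c2", "c3"], delta=0.2)
print(in_subspace, in_full_space)
```

With these values the pair passes on the subspace {c1, c2} and fails once c3 is included, which is the situation subspace clustering is meant to capture: genes can be coherent on a subset of conditions only.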
Trend Constraint: When the expression level of one gene goes up under some condition, the expression levels of the correlated genes should also go up accordingly.

Why do we need both?

[Figure (Subspace Clustering Model): two panels plotting expression level against conditions c1 and c2 for genes g1 and g2. One panel: trend constraint satisfied, fluctuation constraint unsatisfied. Other panel: trend constraint unsatisfied, fluctuation constraint satisfied.]

[Figure: ranking lists based on the confidence-based measure.]

[Figure: true distribution and sample mean of replicates; axes: expression level versus conditions.]

Conclusions

Propose a multiple-criteria model to help biologists discover meaningful gene clusters.
Experiment results show the importance of multiple criteria.
Define two ways to measure the quality of clusters: based on statistical significance, and based on their biological meaning.
Synthetic data sets show the effectiveness of the confidence-based measure.

Future Work:
Approximation algorithm to discover subspace clusters
Employ multiple types of genomic data

Acknowledgement

The work is supported in part by the NIH grant No. 2 P30 AR to Lei Liu.
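The trend constraint from the subspace clustering model can be illustrated with a small check that compares the direction of change of each gene between every pair of conditions. Checking all condition pairs, treating "no change" as its own direction, the function names, and the sample values are assumptions for this sketch; the poster states the constraint only in words.

```python
from itertools import combinations
from typing import Dict, Sequence

# Assumed layout: expr[gene][condition] -> expression level.
Expression = Dict[str, Dict[str, float]]


def sign(x: float) -> int:
    """Return 1 for an increase, -1 for a decrease, 0 for no change."""
    return (x > 0) - (x < 0)


def satisfies_trend(expr: Expression,
                    genes: Sequence[str],
                    conditions: Sequence[str]) -> bool:
    """Check that all genes move in the same direction between any two conditions."""
    for ca, cb in combinations(conditions, 2):
        directions = {sign(expr[g][cb] - expr[g][ca]) for g in genes}
        if len(directions) > 1:
            return False
    return True


# Hypothetical values: both genes rise from c1 to c2, but by very different
# amounts, so the trend agrees even though the gap between the genes changes.
same_direction: Expression = {
    "g1": {"c1": 1.0, "c2": 1.2},
    "g2": {"c1": 1.0, "c2": 3.0},
}

# Hypothetical values: the genes stay close together, but g1 rises slightly
# while g2 falls slightly, so the trend disagrees.
opposite_direction: Expression = {
    "g1": {"c1": 1.0, "c2": 1.1},
    "g2": {"c1": 1.05, "c2": 1.0},
}

trend_a = satisfies_trend(same_direction, ["g1", "g2"], ["c1", "c2"])
trend_b = satisfies_trend(opposite_direction, ["g1", "g2"], ["c1", "c2"])
print(trend_a, trend_b)
```

The two data sets are meant to echo the two cases named in the model figure: in the first the genes agree in direction but not in magnitude, and in the second they stay close in value while moving in opposite directions, so neither constraint alone covers both cases.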