Subspace Clustering for Microarray Data Analysis:


Subspace Clustering for Microarray Data Analysis: Multiple Criteria and Significance Assessment
Hui Fang, ChengXiang Zhai and Lei Liu
Biotechnology Center at the University of Illinois at Urbana-Champaign
University of Illinois at Urbana-Champaign

1. Introduction
Microarray data: A microarray is a powerful tool for monitoring the expression patterns of thousands of genes simultaneously.
[Figure: expression-level matrix with genes g1 … gn and conditions c1 … cm (sample values 0.9, -0.2, -0.1, 0.5). Each measurement is repeated several times, and the matrix holds the mean of several replicates.]
Goal: Find gene clusters that are truly meaningful biologically.
Challenges:
How to explore multiple criteria in clustering?
How to assess the significance and quality of clusters?

4. GO-based Measure
The quality of a cluster is defined as the depth, in the GO hierarchy, of the common parent node shared by the genes in the cluster. The more specific the node, the better the cluster. For example, for the following cluster of genes the GO-based measure is 5:
{YBL028C, YCR100C, YFL044C, YHR143W-A, YIL069C, YIL131C, YJR133W, YKL173W, YKR056W, YLL028W, YLL056C, YLR361C, YNL159C, YNL166C, YNR038W, YOL092W}

6. Experiment Results
Importance of multiple criteria (GO-based measure)
[Figure: GO subtree under Biological Process (GO:0008150) with the nodes Physiological Processes (GO:0007582), Cellular Process (GO:0009987), Metabolism (GO:0008152), Cell Communication, and Cell Growth and/or Maintenance. It marks the common parent node with single criteria and the common parent node with multiple criteria; the cluster sizes shown are 22 genes and 14 genes.]
Effectiveness of the confidence-based measure
Our idea: Use a synthetic data set to show that the noisy clusters are always ranked lower than the other clusters.
Correlation between the two types of measures
There is a strong correlation between the GO-based measure and the confidence-based measure.
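As a rough illustration of the GO-based measure defined in section 4, the sketch below computes the depth of the deepest GO term shared by all genes in a cluster. Several things here are assumptions and not taken from the poster: the hierarchy is a small single-parent toy tree (the real GO is a directed acyclic graph, and levels elided on the poster are simply left out), the root is given depth 0, each gene carries exactly one annotation, and the gene names `geneA`, `geneB`, `geneC` and the function names are hypothetical.

```python
# Toy GO fragment (assumption): child term -> parent term, one parent per term.
PARENT = {
    "cellular_component": "gene_ontology",
    "cell": "cellular_component",
    "intracellular": "cell",
    "nucleus": "intracellular",
    "nucleolus": "nucleus",
    "nucleoplasm": "nucleus",
}


def path_from_root(term):
    """Return the list of terms from the root down to `term`."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]


def go_based_measure(cluster, annotation):
    """Depth (root = 0, an assumed convention) of the deepest term shared by all genes."""
    paths = [path_from_root(annotation[gene]) for gene in cluster]
    depth = -1
    for level in zip(*paths):
        # Stop at the first level where the genes' paths diverge.
        if len(set(level)) > 1:
            break
        depth += 1
    return depth


# Hypothetical annotations, one term per gene.
ANNOTATION = {"geneA": "nucleolus", "geneB": "nucleoplasm", "geneC": "cell"}

print(go_based_measure(["geneA", "geneB"], ANNOTATION))  # 4 (common parent: nucleus)
print(go_based_measure(["geneA", "geneC"], ANNOTATION))  # 2 (common parent: cell)
```

A deeper common parent gives a larger value, which matches the statement that a more specific node indicates a better cluster.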
[Figure (GO hierarchy): Gene Ontology (GO:0003673) … Cellular Component (GO:0005575) … Cell (GO:0005623) … Intracellular (GO:0005622) … Nucleus (GO:0005634); Nucleolus (GO:0005730); Nucleoplasm (GO:0005654); DNA-directed RNA polymerase I complex (GO:0005736); DNA-directed RNA polymerase III complex (GO:0005666); DNA-directed RNA polymerase II, holoenzyme (GO:0016591); DNA-directed RNA polymerase II core complex (GO:0005665).]

2. Objectives
(1) Perform subspace clustering.
[Figure: raw data and subspace clusters, plotting expression level against conditions a to j.]
(2) Study whether multiple criteria improve the quality of clusters.
(3) Design quality measures for clusters based on:
    domain knowledge (e.g. gene ontology)
    significance (e.g. variance in replicates)

Synthetic data, first set
True clusters:
    cluster1: {11, 20, 21, 35, 54}
    cluster2: {4, 62, 70, 74, 92}
    cluster3: {14, 15, 58, 71, 79}
    cluster4: {45, 48, 52, 83, 94}
Clusters found by the method (the number after each cluster is the one shown beside it on the poster):
    cluster1: {11, 20, 21, 35, 54}    3
    cluster2: {4, 62, 70, 74, 92}     2
    cluster3: {14, 15, 58, 71, 79}    1
    cluster4: {45, 48, 52, 83, 94}    4
    cluster5: {6, 43, 48, 51, 85}     5

Synthetic data, second set
True clusters:
    cluster1: {4, 14, 53, 54, 76}
    cluster2: {25, 29, 31, 49, 80}
    cluster3: {1, 6, 61, 85, 92}
    cluster4: {2, 86, 89, 93, 97}
Clusters found by the method:
    cluster1: {4, 14, 53, 54, 76}         1
    cluster2: {25, 29, 31, 49, 80}        3
    cluster3: {1, 6, 61, 85, 92}          2
    cluster4: {2, 57, 86, 89, 93, 97}     4

5. Confidence-based Measure
Motivation: A gene may fall into a cluster purely because of high variance in the replicates rather than biological relevance.
Goal: Give priority to the clusters that are not generated by chance.
Our idea: Measure the quality of each generated cluster based on the original variances of each data point.
Method 1: Compute the confidence level by taking the average of the standard deviations of the data points in the cluster.
Method 2: Compute how likely it is that the cluster satisfies the constraints.
The higher the variance, the lower the confidence.
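Method 1 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each data point is assumed to be stored as a list of replicate values, the sample standard deviation is used, and clusters are ranked so that a lower average standard deviation means higher confidence. The data values, gene and condition names, and the function names are made up for the example.

```python
from statistics import mean, stdev


def cluster_mean_std(replicates, genes, conditions):
    """Average, over all data points in the cluster, of the replicate standard deviation."""
    return mean(stdev(replicates[g][c]) for g in genes for c in conditions)


def rank_by_confidence(replicates, clusters):
    """Return cluster names ordered from highest confidence (lowest average std) to lowest."""
    scores = {
        name: cluster_mean_std(replicates, genes, conditions)
        for name, (genes, conditions) in clusters.items()
    }
    return sorted(scores, key=scores.get)


# Hypothetical replicate measurements: replicates[gene][condition] -> replicate values.
replicates = {
    "g1": {"c1": [0.9, 1.0, 1.1], "c2": [0.4, 0.5, 0.6]},
    "g2": {"c1": [0.8, 0.9, 1.0], "c2": [0.3, 0.4, 0.5]},
    "g3": {"c1": [-1.0, 0.5, 2.0], "c2": [-0.9, 0.1, 1.4]},
}

# Each cluster is a (genes, conditions) pair, i.e. a subspace cluster.
clusters = {
    "tight": (["g1", "g2"], ["c1", "c2"]),
    "noisy": (["g2", "g3"], ["c1", "c2"]),
}

print(rank_by_confidence(replicates, clusters))  # ['tight', 'noisy']
```

The cluster containing the high-variance gene `g3` is ranked last, in line with the rule that higher variance means lower confidence.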
3. Subspace Clustering Model
Our solution, multiple criteria: combine the fluctuation constraint and the trend constraint.
Fluctuation constraint: The difference in expression levels between two genes needs to be similar over every condition.
Trend constraint: When the expression level of one gene goes up under some condition, the expression levels of the correlated genes should also go up accordingly.
Why are both needed?
[Figure: two panels plotting expression level against conditions c1 and c2 for genes g1 and g2. In one panel the trend constraint is satisfied and the fluctuation constraint is unsatisfied; in the other the trend constraint is unsatisfied and the fluctuation constraint is satisfied.]

[Figure: ranking lists based on the confidence-based measure.]
[Figure: true distribution and sample mean of replicates, plotting expression level against conditions.]

7. Conclusions
Conclusion:
Propose a multiple-criteria model to help biologists discover meaningful gene clusters. Experiment results show the importance of multiple criteria.
Define two ways to measure the quality of clusters: based on statistical significance and based on their biological meaning. Synthetic data sets show the effectiveness of the confidence-based measure.
Future work:
Approximation algorithm to discover subspace clusters
Employ multiple types of genomic data

8. Acknowledgement
The work is supported in part by NIH grant No. 2 P30 AR41940-10 to Lei Liu.
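The fluctuation and trend constraints from the Subspace Clustering Model section can be illustrated with a small sketch. The poster does not give exact formulas, so the definitions below are assumptions: the fluctuation constraint is taken to hold when the gene-to-gene difference varies by at most a tolerance `delta` across conditions, and the trend constraint is taken to hold when both genes move in the same direction between every pair of conditions. The function names, the value of `delta`, and the expression values are hypothetical.

```python
from itertools import combinations


def fluctuation_ok(x, y, delta):
    """Assumed form: the difference x - y stays within a band of width delta over all conditions."""
    diffs = [a - b for a, b in zip(x, y)]
    return max(diffs) - min(diffs) <= delta


def _sign(value):
    """Return 1, 0, or -1 according to the sign of value."""
    return (value > 0) - (value < 0)


def trend_ok(x, y):
    """Assumed form: both genes change in the same direction between every pair of conditions."""
    return all(
        _sign(x[j] - x[i]) == _sign(y[j] - y[i])
        for i, j in combinations(range(len(x)), 2)
    )


# Two genes over conditions c1 and c2 (hypothetical values).
# Case A: both genes go up, but by very different amounts.
a_g1, a_g2 = [1.0, 2.0], [1.5, 4.0]
print(trend_ok(a_g1, a_g2), fluctuation_ok(a_g1, a_g2, delta=0.5))  # True False

# Case B: the difference stays nearly constant, but the genes move in opposite directions.
b_g1, b_g2 = [1.0, 1.2], [2.1, 2.0]
print(trend_ok(b_g1, b_g2), fluctuation_ok(b_g1, b_g2, delta=0.5))  # False True
```

The two cases mirror the two figure panels: each constraint alone accepts a pair that the other rejects, which is one way to read the poster's question of why both are needed.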