Cluster validation Integration ICES Bioinformatics.

Presentation transcript:

Cluster validation Integration ICES Bioinformatics

Overview INTRODUCTION MICROARRAY ANALYSIS VALIDATION OF THE RESULTS Statistical validation Biological validation INTEGRATION

Cluster validation
Why cluster validation?
– Different algorithms and parameter settings
– Intrinsic properties of the dataset (sensitivity to noise, to outliers)
[Figure: Preprocessing 1 and 2, Clustering Algorithm 1 to 3, and Parameter Setting 1 to 3 all lead to Validation]

STATISTICAL VALIDATION
Sensitivity analysis
– Leave-one-out cross validation (FOM)
– Sensitivity analysis (Gaussian noise, ANOVA)
Cluster coherence testing
– Euclidean distance score
– Gap statistics
Statistical validation Validation

Figure of Merit (sensitivity towards an experiment)
– The tested cluster algorithm is applied to all experimental conditions except the left-out condition
– Hypothesis: if the cluster algorithm is robust, it can predict the measured values of the left-out condition
– To estimate the predictive power of the algorithm, the FOM is calculated
– The FOM is the root mean square deviation, in the left-out condition e, of the individual gene expression levels relative to their cluster means
– This is repeated for all conditions and the average FOM is calculated
Yeung et al., 2001
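The FOM procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of Yeung et al.: the function names, the toy data, and the stand-in clustering function `sign_clusters` are all assumptions made for the example, and any real clustering algorithm that returns one label per gene could be passed in instead.

```python
import math

def fom_for_condition(data, left_out, cluster_fn):
    """RMS deviation, in the left-out condition, of each gene from its cluster mean."""
    # Cluster the genes using every condition except the left-out one.
    reduced = [[v for j, v in enumerate(row) if j != left_out] for row in data]
    labels = cluster_fn(reduced)
    # Collect the left-out values of the genes, grouped by cluster label.
    groups = {}
    for row, label in zip(data, labels):
        groups.setdefault(label, []).append(row[left_out])
    squared = 0.0
    for values in groups.values():
        cluster_mean = sum(values) / len(values)
        squared += sum((v - cluster_mean) ** 2 for v in values)
    return math.sqrt(squared / len(data))

def average_fom(data, cluster_fn):
    """Repeat the leave-one-out step for every condition and average the FOMs."""
    n_conditions = len(data[0])
    return sum(fom_for_condition(data, e, cluster_fn)
               for e in range(n_conditions)) / n_conditions

def sign_clusters(rows):
    # Hypothetical stand-in for a clustering algorithm: two clusters,
    # split by the sign of the mean expression level.
    return [0 if sum(r) / len(r) >= 0 else 1 for r in rows]

# Toy data: rows are genes, columns are experimental conditions.
data = [[1.0, 1.2, 0.9], [1.1, 0.8, 1.0], [-1.0, -1.1, -0.9], [-0.9, -1.2, -1.0]]
print(average_fom(data, sign_clusters))
```

A lower average FOM indicates that the clusters found without a condition still predict the expression levels in that condition well.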

Sensitivity analysis towards the signal-to-noise ratio
Sensitivity analysis = a way of assigning confidence to the cluster membership
– Create new in silico replicas of the dataset of interest by adding a small amount of noise to the original data
– Treat the new datasets as the original one and cluster them
– Genes consistently clustered together over all in silico replicas are considered robust towards added noise
How to determine the noise?

How to determine the noise?
– Gaussian noise with mean 0 and standard deviation σ, estimated as the median standard deviation of the log-ratios for all genes across the different experiments (Bittner et al.)
How to generate simulated datasets?
– Noise based on the appropriate ANOVA model
– ε describes the noise term
– The values are the estimates from the original fit
– The ε are drawn with replacement from the studentized residuals of the original fit
– Clustering is repeated on the simulated datasets
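The Gaussian-noise variant can be sketched as follows. The function names and the toy log-ratios are assumptions for illustration, and the use of the sample standard deviation (`statistics.stdev`) is also an assumption, since the slide does not say which estimator is meant. The ANOVA-based variant is not shown because the model itself is not given in the transcript.

```python
import random
import statistics

def noise_sigma(data):
    """Median, over genes, of the per-gene standard deviation across experiments."""
    return statistics.median([statistics.stdev(row) for row in data])

def noisy_replicas(data, n_replicas, seed=0):
    """Create in silico replicas by adding zero-mean Gaussian noise to every value."""
    rng = random.Random(seed)
    sigma = noise_sigma(data)
    return [
        [[value + rng.gauss(0.0, sigma) for value in row] for row in data]
        for _ in range(n_replicas)
    ]

# Toy log-ratios: rows are genes, columns are experiments.
log_ratios = [[0.0, 1.0, 2.0], [0.0, 2.0, 4.0], [0.0, 3.0, 6.0]]
replicas = noisy_replicas(log_ratios, n_replicas=5)
print(noise_sigma(log_ratios), len(replicas))
```

Each replica would then be clustered exactly like the original dataset, and the resulting labels compared across replicas.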

Comparing cluster results
Approximate the confidence in the clustering output of a gene
– Cluster label known: determine the stability of a gene, i.e. the percentage of bootstrap cluster experiments in which the gene is assigned to the same cluster
– Cluster label unknown: identify pairs of genes that cluster together in C^ and count the frequency with which such pairs cluster together in the bootstrapped clusters C^*. When each pair of genes clusters together reliably, stable clusters will emerge
Measures of agreement: Rand index (Yeung et al. 2001), Jaccard coefficient (Ben-Hur et al. 2002)
[Figure: cluster experiments 1 to 4, each yielding clusters C1, C2, C3, ...]
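For the case where cluster labels are unknown, the pair-counting idea can be sketched as below. The function name `pair_stability` and the toy label lists are hypothetical; the sketch assumes each clustering is given as one label per gene, in the same gene order.

```python
from itertools import combinations

def pair_stability(reference, replicates):
    """For every gene pair co-clustered in the reference clustering, return the
    fraction of replicate clusterings in which the pair is co-clustered again."""
    frequency = {}
    for i, j in combinations(range(len(reference)), 2):
        if reference[i] == reference[j]:
            together = sum(1 for labels in replicates if labels[i] == labels[j])
            frequency[(i, j)] = together / len(replicates)
    return frequency

# Labels of four genes in the original clustering and in three bootstrapped ones.
# Only co-membership matters, so label names may differ between clusterings.
reference = [0, 0, 1, 1]
replicates = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 1, 1]]
print(pair_stability(reference, replicates))
```

Pairs with a frequency close to 1 are the ones that cluster together reliably.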

Rand index
– Statistic designed to assess the degree of agreement between two partitions
– Usually an unknown partition against an external standard
– Adjusted Rand index: adjusted so that the expected value of the Rand index between two random partitions is zero
Pair counts
– a: the number of object pairs that are clustered together in dataset 1 and in dataset 2
– b: the number of object pairs that are clustered together in dataset 1 but not in dataset 2
– c: the number of object pairs that are clustered together in dataset 2 but not in dataset 1
– d: the number of object pairs that are put in different clusters in both datasets
– a, d: agreement between cluster results
– b, c: disagreement between cluster results
The Rand index is defined as the fraction of agreement: the number of pairs of objects that are either in the same group in both partitions (a) or in different groups in both partitions (d), divided by the total number of pairs of objects (a + b + c + d), i.e. Rand index = (a + d) / (a + b + c + d). The Rand index lies between 0 and 1.
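The pair counts a, b, c, d and the resulting Rand index can be computed directly from two label lists. This is a small sketch with hypothetical function names; it computes the plain Rand index only, not the adjusted version.

```python
from itertools import combinations

def pair_counts(labels1, labels2):
    """Return (a, b, c, d) for two clusterings of the same objects."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels1)), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        if same1 and same2:
            a += 1          # together in both
        elif same1:
            b += 1          # together in clustering 1 only
        elif same2:
            c += 1          # together in clustering 2 only
        else:
            d += 1          # apart in both
    return a, b, c, d

def rand_index(labels1, labels2):
    a, b, c, d = pair_counts(labels1, labels2)
    return (a + d) / (a + b + c + d)

print(pair_counts([0, 0, 1, 1], [0, 0, 1, 2]))
print(rand_index([0, 0, 1, 1], [0, 0, 1, 2]))
```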

Jaccard coefficient
– Based on the clusters of one dataset, a binary pair vector is calculated, in which each element corresponds to a unique pair of genes and has the value one if both genes were clustered into the same cluster and zero otherwise
– From two such pair vectors, one derived from the first dataset and the other from the second dataset, the Jaccard coefficient is computed
– This coefficient measures the agreement between the two binary pair vectors
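A minimal sketch of the pair-vector construction follows. The function names are hypothetical, and the coefficient is taken in its usual form: the number of pairs that are one in both vectors divided by the number of pairs that are one in at least one vector. Returning 0.0 when no pair is co-clustered in either dataset is a convention chosen for this example.

```python
from itertools import combinations

def pair_vector(labels):
    """Binary vector with one entry per gene pair: 1 if co-clustered, else 0."""
    return [1 if labels[i] == labels[j] else 0
            for i, j in combinations(range(len(labels)), 2)]

def jaccard_coefficient(labels1, labels2):
    v1, v2 = pair_vector(labels1), pair_vector(labels2)
    both = sum(1 for x, y in zip(v1, v2) if x and y)
    either = sum(1 for x, y in zip(v1, v2) if x or y)
    # Assumed convention: no co-clustered pairs at all gives 0.0.
    return both / either if either else 0.0

print(pair_vector([0, 0, 1, 1]))
print(jaccard_coefficient([0, 0, 1, 1], [0, 0, 1, 2]))
```

Unlike the Rand index, pairs that are separated in both clusterings do not contribute to this coefficient.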

Cluster coherence testing: Euclidean distance
– k points (genes) in the cluster
– p experiments (dimensions)
– Average profile of cluster j
– Vw: variance of the genes about the cluster average, averaged over all experiments
– Minimizing Vw maximizes the coherence of the genes within a cluster

Gap score
– p experiments
– Cluster average profile
– VB: describes how the average at each experimental point oscillates around the mean of the cluster average profile
– Maximizes variance across experiments
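The two variance terms can be illustrated as follows. The original formulas were images and did not survive in the transcript, so the normalizations below (dividing by k·p for Vw and by p for VB) are assumptions, as are the function names. The sketch stops at Vw and VB and does not attempt to reproduce the score function that combines them.

```python
def average_profile(cluster):
    """Mean expression of the cluster at each experiment (column-wise mean)."""
    k = len(cluster)
    return [sum(column) / k for column in zip(*cluster)]

def within_variance(cluster):
    """Vw: variance of the genes about the cluster average, averaged over experiments."""
    profile = average_profile(cluster)
    k, p = len(cluster), len(profile)
    squared = sum((x - m) ** 2 for gene in cluster for x, m in zip(gene, profile))
    return squared / (k * p)

def between_variance(cluster):
    """VB: variance of the cluster average profile around its own mean."""
    profile = average_profile(cluster)
    grand_mean = sum(profile) / len(profile)
    return sum((m - grand_mean) ** 2 for m in profile) / len(profile)

# Toy cluster: rows are genes (k = 2), columns are experiments (p = 2).
cluster = [[0.0, 2.0], [2.0, 4.0]]
print(within_variance(cluster), between_variance(cluster))
```

A tight cluster with a strongly varying profile would show a small Vw together with a large VB.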

Gap statistics
Score function: R2
– Select clusters containing tightly co-expressed genes (minimal Vw) that show a highly variable profile (high VB) across the experiments (i.e. affected by the signal studied)
– The score is compared to a similar score calculated for a randomly generated cluster (bootstrapping)
– The difference between the score of the randomly generated cluster and that of the cluster of interest is calculated (gap statistic)

Overview INTRODUCTION MICROARRAY ANALYSIS VALIDATION OF THE RESULTS Statistical validation Biological validation INTEGRATION

Biological validation
– Dataset → small clusters: contain genes with highly similar profiles (+), but some information is given up in the first step (-)
– Validate the "core" clusters: motif finding (DNA level), literature/knowledge
– Extend the clusters → big clusters: contain all real positives (+), but an increasing number of false positives (-)

Microarrays and text mining
Rationale:
– Clustering yields accession numbers (AC0020, D11428)
– Manual query of SRS, Medline, GeneCards, ...: huge task
– Data → literature/knowledge
– Controlled vocabularies

Cumulative hypergeometric distribution
– p-value that this degree of enrichment could have occurred by chance (implemented in Onto-Express)
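As an illustration of the enrichment p-value, the upper tail of the hypergeometric distribution can be computed with the standard library alone. The function and parameter names are hypothetical, and this is a sketch of the general calculation rather than the code used by any particular tool.

```python
from math import comb

def enrichment_p_value(n_chip, n_category, n_selected, n_hits):
    """P(X >= n_hits) when n_selected genes are drawn without replacement from
    n_chip genes, of which n_category belong to the functional category."""
    total = comb(n_chip, n_selected)
    upper = min(n_category, n_selected)
    tail = sum(
        comb(n_category, x) * comb(n_chip - n_category, n_selected - x)
        for x in range(n_hits, upper + 1)
    )
    return tail / total

# 10 genes on the chip, 5 in the category, 5 selected, all 5 selected are in the category.
print(enrichment_p_value(10, 5, 5, 5))
```

A small p-value means that finding at least this many category members among the selected genes would be unlikely by chance.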

χ² test or Fisher exact test (as implemented in the FatiGO software)
– N1: number of genes on the chip
– N2: number of differentially expressed genes
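The χ² statistic for a 2×2 contingency table can be sketched as below. The layout of the table (differentially expressed versus remaining genes in the rows, inside versus outside a functional category in the columns) is an assumption for the example, as is the function name. No continuity correction is applied, and the Fisher exact test is not shown.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic (1 degree of freedom) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column must have a non-zero total")
    return n * (a * d - b * c) ** 2 / denominator

# Assumed layout: rows = differentially expressed / other genes,
# columns = in category / not in category.
print(chi_square_2x2(10, 20, 30, 40))
```

With one degree of freedom, a statistic above about 3.84 corresponds to p < 0.05.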

Microarrays and motif finding
[Figure: workflow linking cDNA arrays, preprocessing of the data, clustering, upstream regions (EMBL, BLAST), and motif finding (Gibbs sampling)]

Overview INTRODUCTION MICROARRAY ANALYSIS VALIDATION OF THE RESULTS Statistical validation Biological validation INTEGRATION IT level Algorithmic level

Integration

Need for integrated tool

Overview INTRODUCTION MICROARRAY ANALYSIS VALIDATION OF THE RESULTS Statistical validation Biological validation INTEGRATION IT level Algorithmic level

Integration
Need for integrated algorithms

– Retain high sensitivity (minimize the number of false negatives)
– Reduce the level of noise (minimize the number of false positives)
– Incorporate a priori information
– Combine data from different sources that can mutually confirm each other
Example: sequence information and expression profiles
Server rMotif (Lapidot and Pilpel, 2003) selects genes from a microarray if they
– contain a motif
– have a highly correlated expression profile

Motif diagnosis tool
– Measures the extent to which a set of genes that contain a given motif in their promoter display expression profiles similar to each other at a given set of conditions (analyzed by microarrays)
– The score (EC, expression coherence) of a set of N genes is defined as the number p of pairs of genes in the set for which the Euclidean distance between the mean- and variance-normalized profiles falls below a threshold D, divided by the total number of pairs in the set
EC = p / [0.5 N (N - 1)]
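The EC score can be sketched directly from this definition. The function names and toy profiles are hypothetical, the threshold D is a free parameter, and the use of the population standard deviation for normalization is an assumption. Profiles are assumed to be non-constant, since a constant profile cannot be variance-normalized.

```python
import math
from itertools import combinations
from statistics import mean, pstdev

def normalize(profile):
    """Scale a profile to mean 0 and variance 1 (assumes it is not constant)."""
    m, s = mean(profile), pstdev(profile)
    return [(value - m) / s for value in profile]

def expression_coherence(profiles, threshold):
    """EC = p / [0.5 N (N - 1)], where p counts the gene pairs whose
    normalized profiles lie closer than the threshold D."""
    normalized = [normalize(profile) for profile in profiles]
    n = len(normalized)
    close_pairs = sum(
        1 for x, y in combinations(normalized, 2) if math.dist(x, y) < threshold
    )
    return close_pairs / (0.5 * n * (n - 1))

# Three genes sharing a motif: two with the same shape, one with the opposite trend.
profiles = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
print(expression_coherence(profiles, threshold=0.5))
```

Because of the normalization, the first two profiles count as a close pair even though their absolute levels differ.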
