Discovery of differentially expressed genes by statistical methods Esa Uusipaikka Department of Statistics University of Turku Microarray Bioinformatics.



Discovery of differentially expressed genes by statistical methods Esa Uusipaikka Department of Statistics University of Turku Microarray Bioinformatics Seminar DataCity Turku, May 6-7, 2003

Molecular portraits and the family tree of cancer

Overview 1. Statistical issues 2. Design of experiment 3. ‘Low-level’ analysis

Overview 4. ‘High-level’ analysis - fold-change with fixed cut-off - model for fold-change - standard statistical tests - permutation tests - multiple testing - False Discovery Rate (FDR) - time-series analysis

Statistical issues 1. Design of experiment 2. ‘Low-level’ analysis: data-cleaning

Statistical issues 3. ‘High-level’ analysis: (a) select differentially expressed (DE) genes; (b) find groups of genes whose expression profiles can reliably classify the different RNA sources into meaningful groups

Experimental design Kerr, M. K., and Churchill, G. A. (2001). Experimental design for gene expression microarrays. Biostatistics 2. Glonek, G. F. V., and Solomon, P. J. (2002). Factorial designs for microarray experiments. Technical Report, Department of Applied Mathematics, University of Adelaide, Australia. Both apply ideas from optimal experimental design to suggest efficient designs for some of the common microarray experiments.

Experimental design Pan, W., Lin, J. and Le, C. (2002). How many replicates of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach. Genome Biology 3(5): research. This paper considers sample size.

Experimental design Speed, T. P., and Yang, Y. H. (2002). Direct versus indirect designs for cDNA microarray experiments. Technical Report 616, Department of Statistics, University of California, Berkeley. This report examines the efficiency of using a reference sample as against direct comparison.

Experimental design It is not possible to give universal recommendations appropriate for all situations, but the general principles of statistical experimental design apply to microarray experiments. Churchill, G.A. Fundamentals of experimental design for cDNA microarrays. Nature Genet. 32, (2002). Yang, Y.H. & Speed, T. Design issues for cDNA microarray experiments. Nature Rev. Genet. 3, (2002).

Image analysis and data-cleaning Yang, Y. H., Buckley, M. J., Dudoit, S., and Speed, T. P. (2002). Comparison of methods for image analysis on cDNA microarray data. Journal of Computational and Graphical Statistics 11. The authors compare various segmentation and background estimation methods.

Image analysis and data-cleaning Kerr, M. K., Martin, M., and Churchill, G. A. (2000). Analysis of variance for gene expression microarray data. Journal of Computational Biology 7, and Wolfinger, R. D., Gibson, G., Wolfinger, E. D., Bennett, L., Hamadeh, H., Bushel, P., Afshari, C., and Paules, R. S. (2001). Assessing gene significance from cDNA microarray expression data via mixed models. Journal of Computational Biology 8, have proposed the use of ANOVA models for normalization.

Image analysis and data-cleaning Quackenbush, J. Microarray data normalization and transformation. Nature Genet. 32, (2002).

Selecting differentially expressed genes 1. Simply generating the data is not enough; one must be able to extract from it meaningful information about the system being studied. 2. There is no one-size-fits-all solution for the analysis and interpretation of genome-wide expression data.

Selecting differentially expressed genes 3. Statistical methods for interpreting the data have proliferated. 4. There are now so many options available that choosing among them is challenging. 5. Understanding of both the biology and the computational methods is essential for tackling the associated ‘data mining’ tasks.

Selecting differentially expressed genes One of the core goals of microarray data analysis is to identify which of the genes show good evidence of being DE. This goal has two parts. 1. The first is to select a statistic which will rank the genes in order of evidence for differential expression, from strongest to weakest evidence. 2. The second is to choose a critical value for the ranking statistic above which any value is considered to be significant.

k-fold change 1. A simple measure of differential expression is the ratio of expression levels between two samples. 2. Genes with ratios above a fixed cut-off k, that is, those whose expression underwent a k-fold change, were said to be differentially expressed. 3. This is not a statistical test, and there is no associated value that can indicate the level of confidence in the designation of genes as differentially expressed or not differentially expressed.

k-fold change 4. Replication is essential in experimental design because it allows an estimate of variability. 5. The ability to assess such variability allows identification of biologically reproducible changes in gene expression levels.
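The fixed cut-off rule above can be illustrated with a minimal Python sketch. The gene names and intensities are hypothetical, and treating a k-fold decrease the same way as a k-fold increase is an assumption made here, not something the slides specify.

```python
import math

def fold_change_calls(treated, control, k=2.0):
    """Flag genes whose expression ratio changed at least k-fold.

    Assumption: a k-fold decrease counts as well as a k-fold increase,
    so the comparison is made on the absolute log2 ratio.
    """
    calls = {}
    for gene in treated:
        log_ratio = math.log2(treated[gene] / control[gene])
        calls[gene] = abs(log_ratio) >= math.log2(k)
    return calls

# Hypothetical background-corrected intensities for three genes.
treated = {"geneA": 800.0, "geneB": 210.0, "geneC": 50.0}
control = {"geneA": 200.0, "geneB": 200.0, "geneC": 200.0}

calls = fold_change_calls(treated, control, k=2.0)
for gene, is_de in calls.items():
    print(gene, "DE" if is_de else "not DE")
```

With k = 2 this flags geneA and geneC. As the slides point out, the rule attaches no measure of confidence to either call, which is why replicated measurements are needed.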

Model for fold-change 1. A model that accounts for random, array- and probe-specific noise. 2. Evaluation of whether the 90% confidence interval for each gene’s fold-change excludes 1 (no change). 3. This method incorporates available information about variability in the gene-expression measurements. 4. It can suffer when the data set is either too small or too heterogeneous to give reliable estimates of variation.

Model for fold-change Li, C. & Hung Wong, W. Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol. 2, research0032 (2001). Roberts, C.J. et al. Signaling and circuitry of multiple MAPK pathways revealed by a matrix of global gene expression profiles. Science 287, (2000). Ideker, T., Thorsson, V., Siegel, A.F. & Hood, L.E. Testing for differentially expressed genes by maximum-likelihood analysis of microarray data. J. Comput. Biol. 7, (2000).

Standard statistical tests 1. More typically, researchers now rely on variants of common statistical tests. 2. These generally involve two parts: calculating a test statistic and determining the significance of the observed statistic. 3. A standard statistical test for detecting significant change between repeated measurements of a variable in two groups is the t-test. 4. This can be generalized to multiple groups via the ANOVA F statistic.
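As a minimal sketch of the first of the two parts, the classical equal-variance two-sample t statistic for a single gene can be computed as follows. The log-expression values are hypothetical, and determining significance is deliberately left out of this sketch.

```python
import math
from statistics import mean, variance

def student_t(x, y):
    """Two-sample t statistic with a pooled (equal-variance) estimate."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    standard_error = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return (mean(x) - mean(y)) / standard_error

# Hypothetical log2 expression values of one gene in two groups of arrays.
group_a = [7.1, 7.4, 6.9, 7.3]
group_b = [8.2, 8.0, 8.5, 8.1]

t_value = student_t(group_a, group_b)
print(round(t_value, 2))
```

Applying the same function to every gene gives a ranking statistic; how to attach significance to it is the subject of the following slides.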

Standard statistical tests variations on the t-test statistic (often called ‘t-like tests’) for microarray analysis are abundant Tusher, V.G., Tibshirani, R. & Chu, G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA 98, (2001). Golub, T.R. et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, (1999). Model, F., Adorjan, P., Olek, A. & Piepenbrock, C. Feature selection for DNA methylation based cancer classification. Bioinformatics 17 Suppl 1, S157-S164 (2001).

Standard statistical tests 1. Use of non-parametric rank-based statistics is also common, via both traditional statistical methods and ad hoc ones designed specifically for microarray data. Zhan, F. et al. Global gene expression profiling of multiple myeloma, monoclonal gammopathy of undetermined significance, and normal bone marrow plasma cells. Blood 99, (2002). Ben-Dor, A., Friedman, N. & Yakhini, Z. Scoring genes for relevance. Technical Report (Institute of Computer Science, Hebrew University, Jerusalem, 2000). Park, P.J., Pagano, M. & Bonetti, M. A nonparametric scoring algorithm for identifying informative genes from microarray data. Pac. Symp. Biocomput (2001).

Standard statistical tests 1. For most practical cases, computing a standard t or F statistic is appropriate, although referring to the t or F distributions to determine significance is often not. 2. The main hazard in using such methods occurs when there are too few replicates to obtain an accurate estimate of experimental variances. In such cases, modeling methods that use pooled variance estimates may be helpful.
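One way to picture the idea of borrowing variance information when replicates are few is the sketch below. It blends each gene's own variance with a variance pooled across all genes; the blending weight and the pooled value are purely illustrative assumptions and do not reproduce any specific published method.

```python
import math
from statistics import mean, variance

def shrunken_t(x, y, pooled_var, weight=0.5):
    """t-like statistic whose variance is a blend of gene-specific and pooled variance.

    `weight` is a hypothetical tuning constant: 1.0 gives the ordinary
    equal-variance t statistic, 0.0 uses only the pooled variance.
    """
    nx, ny = len(x), len(y)
    gene_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    blended_var = weight * gene_var + (1 - weight) * pooled_var
    return (mean(x) - mean(y)) / math.sqrt(blended_var * (1 / nx + 1 / ny))

# Hypothetical gene with only two replicates per group and an
# accidentally tiny within-group variance.
x = [5.00, 5.01]
y = [5.10, 5.11]

ordinary = shrunken_t(x, y, pooled_var=0.04, weight=1.0)
moderated = shrunken_t(x, y, pooled_var=0.04, weight=0.5)
print(round(ordinary, 2), round(moderated, 2))
```

The ordinary statistic is inflated by the small variance estimate, while the blended version is far less extreme, which is the behaviour that pooled variance estimates are meant to provide.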

Standard statistical tests Xiangqin Cui and Gary A Churchill (2003). Statistical tests for differential expression in cDNA microarray experiments. Genome Biology 4(4):

Standard statistical tests 1. Regardless of the test statistic used, one must determine its significance. 2. Standard interpretations of t-like tests assume that the data are sampled from normal populations with equal variances. 3. Expression data may fail to satisfy either or both of these constraints.

Standard statistical tests 4. Although log transformation can improve normality and help equalize variances, ultimately the best estimates of the data’s distribution come from the data themselves. Quackenbush, J. Microarray data normalization and transformation. Nature Genet. 32, (2002).

Permutation tests Permutation tests, generally carried out by repeatedly scrambling the samples’ class labels and computing t statistics for all genes in the scrambled data, best capture the unknown structure of the data. Tusher, V.G., Tibshirani, R. & Chu, G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA 98, (2001). Golub, T.R. et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, (1999). Dudoit, S., Yang, Y.-H., Callow, M.J. & Speed, T.P. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical Report 578 (Department of Statistics, University of California at Berkeley, Berkeley, CA, 2000).

Permutation tests Such permutation tests are ideal when the number of arrays is sufficient to offer the desired degree of confidence.
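A minimal sketch of the label-scrambling idea for a single gene is given below. It uses randomly sampled permutations with a fixed seed and a two-sided comparison of absolute t statistics; the data, the number of permutations and the seed are all hypothetical choices.

```python
import math
import random
from statistics import mean, variance

def t_stat(x, y):
    """Equal-variance two-sample t statistic."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var * (1 / nx + 1 / ny))

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value obtained by scrambling group labels."""
    rng = random.Random(seed)
    observed = abs(t_stat(x, y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # equivalent to reassigning the class labels
        if abs(t_stat(pooled[:len(x)], pooled[len(x):])) >= observed:
            hits += 1
    # Adding 1 counts the observed labelling itself and avoids a p-value of 0.
    return (hits + 1) / (n_perm + 1)

# Hypothetical log2 expression values for one gene, five arrays per group.
sensitive = [1.0, 1.2, 0.9, 1.1, 1.3]
resistant = [2.1, 2.4, 1.9, 2.2, 2.6]

p_value = permutation_p_value(sensitive, resistant)
print(p_value)
```

With five arrays per group there are only 252 distinct ways to split the samples, so the smallest attainable p-value is limited by the number of arrays, which is the point made on the slide above.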

Multiple testing 1. One advantage of permutation methods is that they allow more reliable correction for multiple testing. 2. The issue of multiple tests is crucial, as microarrays typically monitor the expression levels of thousands of genes. 3. Standard Bonferroni correction (that is, multiplying the uncorrected p-value by the number of genes tested) is overly restrictive.

Multiple testing 1. Step-down methods designed to minimize this overcorrection are little better for thousands of genes. 2. Both methods are overly strict because they ignore the correlation between genes, in effect treating each gene as a separate, independent test. 3. In fact, the correlation structure between gene-expression patterns is significant and complex. Holm, S. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6, (1979).
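The Bonferroni correction and Holm's step-down procedure cited above can be sketched in a few lines. The four p-values are hypothetical; in a real experiment there would be one per gene.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def holm(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted

# Hypothetical unadjusted p-values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.005]
bonferroni_p = bonferroni(raw_p)
holm_p = holm(raw_p)
print(bonferroni_p)
print(holm_p)
```

Holm's adjusted values are never larger than the Bonferroni ones, but with thousands of genes the multipliers remain close to the number of genes, which is why the gain is small.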

Multiple testing To capture this structure, Dudoit et al. propose a permutation-based approximation of Westfall and Young’s method. Dudoit, S., Yang, Y.-H., Callow, M.J. & Speed, T.P. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical Report 578 (Department of Statistics, University of California at Berkeley, Berkeley, CA, 2000). C code is available online.

Multiple testing A package of R functions for other techniques evaluated in Dudoit et al is available at zarray/Software/smacode.html

Multiple testing The advantage of permutation-based adjustment for multiple testing. The x-axis shows unadjusted p-values derived from independent t tests for each gene to detect differential expression between sensitive and resistant cell lines. The y-axis shows the adjusted p-values using Bonferroni correction (black circles) and Westfall and Young’s permutation-based method (blue squares). At the adjusted cutoff of 0.05, the permutation method finds 11 significantly changing genes (instead of 7 without permutation).
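The permutation-based adjustment can be pictured with the simplified sketch below. It implements a single-step "max |t|" adjustment, which is a simplification assumed here for brevity rather than the step-down procedure evaluated by Dudoit et al.; the gene names, data and number of permutations are hypothetical.

```python
import math
import random
from statistics import mean, variance

def t_stat(x, y):
    """Equal-variance two-sample t statistic."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var * (1 / nx + 1 / ny))

def abs_t_by_gene(data, labels):
    """Absolute t statistic of every gene for a given labelling of the arrays."""
    stats = {}
    for gene, values in data.items():
        x = [v for v, g in zip(values, labels) if g == 0]
        y = [v for v, g in zip(values, labels) if g == 1]
        stats[gene] = abs(t_stat(x, y))
    return stats

def max_t_adjusted(data, labels, n_perm=1000, seed=0):
    """Single-step max-|t| adjusted p-values from random label permutations."""
    rng = random.Random(seed)
    observed = abs_t_by_gene(data, labels)
    exceed = {gene: 0 for gene in data}
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # the same scrambled labels are used for all genes
        max_t = max(abs_t_by_gene(data, shuffled).values())
        for gene in data:
            if max_t >= observed[gene]:
                exceed[gene] += 1
    return {gene: (exceed[gene] + 1) / (n_perm + 1) for gene in data}

# Hypothetical log2 expression values: 5 sensitive arrays, then 5 resistant.
labels = [0] * 5 + [1] * 5
data = {
    "up":    [1.0, 1.1, 0.9, 1.2, 1.05, 3.0, 3.1, 2.9, 3.2, 3.05],
    "flat1": [2.0, 2.3, 1.9, 2.1, 2.2, 2.1, 2.0, 2.25, 1.95, 2.15],
    "flat2": [0.5, 0.7, 0.6, 0.55, 0.65, 0.62, 0.52, 0.68, 0.58, 0.6],
}

adjusted = max_t_adjusted(data, labels)
print(adjusted)
```

Because every gene is recomputed under the same scrambled labels, the distribution of the maximum statistic reflects the correlation between genes, which a Bonferroni-style multiplier cannot do.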

False discovery rate 1. All these approaches focus on controlling the ‘family-wise error rate,’ the overall chance that at least one gene is incorrectly identified as differentially expressed. 2. For microarray studies focusing on finding sets of predictive genes, it may instead be acceptable to bound the ‘false discovery rate’ (FDR), the expected proportion of false positives among the genes identified as differentially expressed.

False discovery rate 3. A simple method for bounding the FDR is proposed by Benjamini and Hochberg. 4. While this, too, assumes that each gene is an independent test, a permutation-based approximation of this method is implemented in the SAM (Significance Analysis of Microarrays) program by Tusher et al. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B 57, (1995). Tusher, V.G., Tibshirani, R. & Chu, G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA 98, (2001).
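A minimal sketch of the Benjamini and Hochberg step-up procedure, written as adjusted p-values, is shown below. The four p-values are hypothetical and the function name is chosen for illustration only.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Walk from the largest p-value down to the smallest.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # rank of this p-value in ascending order
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical unadjusted p-values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.005]
bh_p = benjamini_hochberg(raw_p)
significant = [p <= 0.05 for p in bh_p]
print(bh_p)
print(significant)
```

Calling genes significant when the adjusted value is at most 0.05 bounds the FDR at 5% under the procedure's assumptions. For these four values all genes pass, whereas a Bonferroni multiplier of 4 would retain only the two smallest.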

False discovery rate Efron, B., Storey, J. & Tibshirani, R. Microarrays, Empirical Bayes Methods, and False Discovery Rates. (2001). Storey, J., Taylor, J. & Siegmund, D. Strong Control, Conservative Point Estimation, and Simultaneous Conservative Consistency of False Discovery Rates: A Unified Approach. (2003).

Comparison of SAM to conventional methods for analyzing microarrays Falsely significant genes plotted against number of genes called significant. Of the 57 genes most highly ranked by the fold change method, 5 were included among the 46 genes most highly ranked by SAM. Of the 38 genes most highly ranked by the pairwise fold change method, 11 were included among the 46 genes most highly ranked by SAM. These results were consistent with the FDR of SAM compared to the FDRs of the fold change and pairwise fold change methods.

False discovery rate 5. A more permissive permutation-based approach to bounding the FDR appears in the Whitehead Institute’s GeneCluster software package. Golub, T.R. et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, (1999).

False discovery rate Although in some data sets even the lowest FDR may be prohibitively high, this can be a valuable approach to finding some valid leads when more stringent analyses find none.

Time series analysis 1. The canonical time-series data in the field come from two experiments following the yeast cell cycle. 2. Spellman’s analysis incorporates a Fourier transform to test the periodicity of individual genes in three separate data sets, before combining these into a single significance score used to rank the genes. Cho, R.J. et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell 2, (1998). Spellman, P.T. et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 9, (1998).

Time series analysis 3. Later analyses of the same data sets look at other time-warping or phase-shifting algorithms to test periodicity. 4. Software for several of these is available online. Aach, J. & Church, G.M. Aligning gene expression time series with time warping algorithms. Bioinformatics 17, (2001). Filkov, V., Skiena, S. & Zhi, J. Analysis techniques for microarray time-series data. J. Comput. Biol. 9, (2002).

Time series analysis 5. Evaluating or modifying time-series analysis methods for the microarray domain, particularly given the difficulty of taking sufficiently frequent array measurements to monitor many processes of interest, is an area ripe for additional attention. 6. Also of interest is the suitability of such methods for analysis of samples related in other ways, such as cells exposed to different doses of a drug, or expression patterns from related bacterial strains.
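The Fourier-based periodicity test mentioned for the cell-cycle data can be pictured with the sketch below. It scores one gene by the magnitude of its Fourier component at a single assumed period; the sampling times, the 60-minute period and the expression profiles are hypothetical, and the published analysis involved further steps that are not reproduced here.

```python
import math

def periodicity_score(values, times, period):
    """Normalised magnitude of the Fourier component at the given period."""
    omega = 2 * math.pi / period
    centre = sum(values) / len(values)
    centred = [v - centre for v in values]  # remove the constant offset
    cos_part = sum(v * math.cos(omega * t) for v, t in zip(centred, times))
    sin_part = sum(v * math.sin(omega * t) for v, t in zip(centred, times))
    return math.hypot(cos_part, sin_part) / len(values)

# Hypothetical sampling every 10 minutes over two assumed 60-minute cycles.
times = list(range(0, 120, 10))
periodic_gene = [math.sin(2 * math.pi * t / 60) for t in times]
constant_gene = [1.0 for _ in times]

periodic_score = periodicity_score(periodic_gene, times, period=60)
constant_score = periodicity_score(constant_gene, times, period=60)
print(round(periodic_score, 3), round(constant_score, 3))
```

Ranking genes by such a score favours profiles that oscillate at the assumed period. Assessing whether a given score is larger than expected by chance would again call for a permutation or similar procedure.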

Other approaches - Bayes / posterior odds (Newton et al.) - Bayesian networks (Friedman et al.) - Empirical Bayes (Tibshirani) - Support vector machines (Brown et al.) - Mixed model (MacKay & Miskin) - Parametric bootstrap (van der Laan & Bryan)

Sources Slonim, D.K. From patterns to pathways: gene expression data analysis comes of age. Nature Genet. 32, (2002). Churchill, G.A. Fundamentals of experimental design for cDNA microarrays. Nature Genet. 32, (2002). Yang, Y.H. & Speed, T. Design issues for cDNA microarray experiments. Nature Rev. Genet. 3, (2002). Quackenbush, J. Microarray data normalization and transformation. Nature Genet. 32, (2002).