Relating Gene Expression to a Phenotype and External Biological Information Richard Simon, D.Sc. Chief, Biometric Research Branch, NCI


Good Microarray Studies Have Clear Objectives
Gene Finding
  –Class Comparison: find genes whose expression differs among predetermined classes
  –Find genes whose expression is correlated with quantitative measure or survival
Class Prediction
  –Prediction of predetermined class (phenotype) using information from gene expression profile
Class Discovery
  –Discover clusters of specimens having similar expression profiles
  –Discover clusters of genes having similar expression profiles

Class Comparison and Class Prediction
Not clustering problems
  –Global similarity measures generally used for clustering arrays may not distinguish classes
  –Clustering methods don't control multiplicity or distinguish data used for classifier development from data used for classifier evaluation
Supervised methods
Require multiple biological samples from each class

Major Flaws Found in 40 Studies Published in 2004
Cluster analysis of samples
  –13/28 studies invalidly claimed that expression clusters based on differentially expressed genes could help distinguish clinical outcomes
Outcome related gene finding
  –9/23 studies had unclear or inadequate methods to deal with false positives
  –10,000 genes × 0.05 significance level = 500 false positives
Supervised prediction
  –12/28 reported a misleading estimate of prediction accuracy
50% of studies contained one or more major flaws

Levels of Replication
Technical replicates
  –RNA sample divided into multiple aliquots and re-arrayed
Biological replicates
  –Multiple subjects
  –Replication of the tissue culture experiment

Biological conclusions generally require independent biological replicates.
Analyses should distinguish biological replicates from technical replicates.
The power of statistical methods for finding differentially expressed genes depends on the number of biological replicates.
For class comparison with a common reference design, dye swap technical references are not needed.

Common Reference Design

         Array 1   Array 2   Array 3   Array 4
RED      A1        A2        B1        B2
GREEN    R         R         R         R

Ai = ith specimen from class A
Bi = ith specimen from class B
R = aliquot from reference pool

The reference generally serves to control variation in the size of corresponding spots on different arrays and variation in sample distribution over the slide. The reference provides a relative measure of expression for a given gene in a given sample that is less variable than an absolute measure. The reference is not the object of comparison. The relative measure of expression will be compared among biologically independent samples from different classes.

Class Comparison Blocking
Paired data
  –Pre-treatment and post-treatment samples of same patient
  –Tumor and normal tissue from the same patient
Blocking
  –Multiple animals in same litter
  –Any feature thought to influence gene expression (sex of patient, batch of arrays)

Technical Replicates
Multiple arrays on the same RNA sample
Analyses should distinguish biological replicates from technical replicates
  –Select the best quality technical replicate, or
  –Average expression values over technical replicates
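The averaging option can be sketched in a few lines of Python. This is only an illustration: the function name, the sample identifiers and the expression values are hypothetical, and it assumes each array is labeled with the RNA sample it came from.

```python
from collections import defaultdict
from statistics import mean


def average_technical_replicates(arrays):
    """Collapse technical replicates to one value per biological sample.

    `arrays` is a list of (sample_id, log_expression) pairs for one gene;
    arrays sharing a sample_id are assumed to come from the same RNA sample.
    """
    by_sample = defaultdict(list)
    for sample_id, value in arrays:
        by_sample[sample_id].append(value)
    # One averaged value per biological sample, so later tests count
    # biological replicates only.
    return {sample_id: mean(values) for sample_id, values in by_sample.items()}


# Hypothetical values for one gene: patient P1 was arrayed twice.
arrays = [("P1", 1.5), ("P1", 2.5), ("P2", 0.5), ("P3", 1.0)]
per_sample = average_technical_replicates(arrays)
print(per_sample)
```

After this step the number of observations equals the number of biological samples, which is the sample size that matters for class comparison.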

t-test Comparisons of Expression for Gene j
$x_j \sim N(\mu_{j1}, \sigma_j^2)$ for class 1
$x_j \sim N(\mu_{j2}, \sigma_j^2)$ for class 2
$H_{0j}: \mu_{j1} = \mu_{j2}$
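Under this model (a common within-class variance for gene j), the usual test of the null hypothesis is the pooled-variance two-sample t statistic. The sketch below computes it for a single gene with the Python standard library; the function name and the expression values are hypothetical, and the p value (which needs the t distribution) is left out.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(class1, class2):
    """Two-sample t statistic assuming equal variance in both classes."""
    n1, n2 = len(class1), len(class2)
    df = n1 + n2 - 2
    # Pooled estimate of the common within-class variance for this gene.
    pooled_var = ((n1 - 1) * variance(class1) + (n2 - 1) * variance(class2)) / df
    std_err = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(class1) - mean(class2)) / std_err, df


# Hypothetical log-expression values for gene j in two classes.
class1 = [1.0, 2.0, 3.0]
class2 = [4.0, 5.0, 6.0]
t_stat, df = pooled_t_statistic(class1, class2)
print(round(t_stat, 3), df)
```

With only three samples per class the statistic has 4 degrees of freedom, which illustrates why per-gene variance estimates are imprecise when the number of samples is small.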

Estimation of Within-Class Variance
Estimate separately for each gene
  –Limited degrees-of-freedom (precision) unless number of samples is large
  –Gene list dominated by genes with small fold changes and small variances
Assume all genes have same variance
  –Poor assumption
Random (hierarchical) variance model
  –Wright G.W. and Simon R. Bioinformatics 19: , 2003
  –Variances are independent samples from a common distribution; inverse gamma distribution used
  –Results in exact F (or t) distribution of test statistics with increased degrees of freedom for error variance
  –For any normal linear model

Simple Control for Multiple Testing
If each gene is tested for significance at level $\alpha$ and there are n genes, then the expected number of false discoveries is $n\alpha$.
  –e.g. if n = 1000 and $\alpha$ = 0.001, then 1 false discovery
  –To control E(FD) $\le$ u
  –Conduct each of k tests at level $\alpha = u/k$
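The arithmetic above is easy to check directly. The helper names below are made up for illustration; the numbers are the ones quoted on the slides.

```python
def expected_false_discoveries(n_tests, alpha):
    """Expected number of false discoveries when every test uses level alpha."""
    return n_tests * alpha


def per_test_level(u, k):
    """Per-test significance level that keeps E(FD) <= u over k tests."""
    return u / k


# 1000 genes tested at 0.001: about 1 false discovery expected.
print(expected_false_discoveries(1000, 0.001))
# 10,000 genes tested at 0.05: about 500 false positives expected.
print(expected_false_discoveries(10_000, 0.05))
# To expect at most 1 false discovery among 10,000 tests:
print(per_test_level(1, 10_000))
```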

False Discovery Rate (FDR)
FDR = expected proportion of false discoveries among the tests declared significant
Studied by Benjamini and Hochberg (1995):

                         Not rejected   Rejected               Total
True null hypotheses                    False discoveries      900
False null hypotheses    10             90 True discoveries

If you analyze n probe sets and select as “significant” the k genes whose p ≤ p*, then
FDR ~ n p* / k
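A minimal sketch of this approximation follows. The function name and the p values are hypothetical, and two choices are assumptions of the sketch rather than part of the slide: the estimate is capped at 1, and `None` is returned when no gene is selected.

```python
def estimated_fdr(p_values, p_star):
    """Approximate FDR as n * p_star / k for the genes with p <= p_star."""
    n = len(p_values)
    k = sum(1 for p in p_values if p <= p_star)
    if k == 0:
        return None  # no genes selected, so the ratio is undefined
    # n * p_star is the expected number of false discoveries among n tests.
    return min(1.0, n * p_star / k)


# Hypothetical p values for 10 probe sets; 4 of them fall below p* = 0.05.
p_values = [0.001, 0.004, 0.02, 0.03, 0.2, 0.35, 0.5, 0.6, 0.8, 0.9]
fdr = estimated_fdr(p_values, 0.05)
print(fdr)
```

Here n p* = 0.5 expected false discoveries are spread over k = 4 selected genes, giving an estimate of 0.125.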

Limitations of Simple Procedures Based on Univariate p values
p values based on normal theory are not accurate in the extreme tails of the distribution
Difficult to achieve stringent significance levels for permutation p values of individual genes with small numbers of samples
Multiple comparisons controlled by adjustment of univariate p values do not take account of correlation among genes

Additional Procedures
“SAM” - Significance Analysis of Microarrays
  –Tusher et al., 2001
Multivariate permutation tests
  –Korn et al., 2001
  –Control number or proportion of false discoveries
  –Can specify confidence level of control

Multivariate Permutation Test (Korn et al., 2001)
Allows statements like:
FD Procedure: We are 95% confident that the (actual) number of false discoveries is no greater than 5.
FDP Procedure: We are 95% confident that the (actual) proportion of false discoveries does not exceed 0.10.

Biological Annotations of Differentially Expressed Genes
Types of annotations
  –GO, pathways
  –PubMed citations
  –Published signatures
  –TF targets
Built-in annotations in statistical software used to generate the list of differentially expressed genes
Submitting the list of differentially expressed genes to a website or program that does annotations

Over-Representation Analysis
10,000 genes on array
100 genes found differentially expressed between phenotype classes
O = observed number of differentially expressed genes in specified GO set
  –e.g. O = genes on array in specified GO set
E = expected number of differentially expressed genes in specified GO set
  –E = (200/10,000)*100 = 2.0
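The O/E comparison can be sketched as follows, using the counts from the slide (10,000 genes on the array, 100 differentially expressed, 200 in the GO set). The observed count O = 8 is a made-up value for illustration, and the hypergeometric tail probability is one common way to judge over-representation, not necessarily the test the slides had in mind.

```python
from math import comb


def over_representation(n_array, n_de, n_set, observed):
    """Return (E, O/E, upper-tail hypergeometric p value) for one gene set."""
    expected = n_set * n_de / n_array
    # P(X >= observed) when n_de genes are drawn at random from the array
    # and X counts how many of them fall in the gene set.
    total = comb(n_array, n_de)
    tail = sum(
        comb(n_set, x) * comb(n_array - n_set, n_de - x)
        for x in range(observed, min(n_de, n_set) + 1)
    )
    return expected, observed / expected, tail / total


# O = 8 is hypothetical; the other counts come from the slide.
expected, ratio, p_value = over_representation(10_000, 100, 200, observed=8)
print(expected, ratio, p_value)
```

With E = 2.0, an observed count of 8 gives O/E = 4 and a small tail probability, which is the kind of evidence an over-representation analysis looks for.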

Limitation of Over-Representation Analysis
Gene list is usually based on stringent significance threshold
  –Number of individual genes is large
  –Statistical power for identifying differentially expressed genes is limited and therefore list is often incomplete
Construction of list of differentially expressed genes based on univariate analysis of individual genes does not permit results for genes in set to reinforce each other for detecting differentially expressed gene set

Gene Set Enrichment Analysis and Variants
Compute p value of differential expression for each gene in a gene set (k = number of genes)
Compute a summary (S) of these p values
  –Average of log p values
  –Kolmogorov-Smirnov statistic; largest distance between the cumulative distribution of the p values and the uniform distribution expected if none of the genes were differentially expressed
  –Modified K-S statistic
  –Average of t statistics
  –P value for regression model on all genes in set under assumption that regression coefficients come from common N(0,v) distribution
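Two of the summaries listed above are simple enough to sketch with the standard library: the average of log p values and the Kolmogorov-Smirnov distance from the uniform distribution. Function names and p values are hypothetical; the modified K-S statistic and the regression-based summary are not covered here.

```python
from math import log


def average_log_p(p_values):
    """Average of natural-log p values over the genes in the set."""
    return sum(log(p) for p in p_values) / len(p_values)


def ks_distance_from_uniform(p_values):
    """Largest gap between the empirical CDF of the p values and Uniform(0, 1)."""
    ordered = sorted(p_values)
    k = len(ordered)
    distance = 0.0
    for i, p in enumerate(ordered, start=1):
        # The empirical CDF jumps from (i - 1) / k to i / k at p;
        # the uniform CDF at p is p itself.
        distance = max(distance, i / k - p, p - (i - 1) / k)
    return distance


# Hypothetical gene set with several small p values.
enriched = [0.001, 0.01, 0.02, 0.04, 0.5]
# Hypothetical gene set whose p values look uniform.
flat = [0.1, 0.3, 0.5, 0.7, 0.9]
print(ks_distance_from_uniform(enriched), ks_distance_from_uniform(flat))
print(average_log_p(enriched), average_log_p(flat))
```

The set with many small p values sits far from the uniform distribution and has a much more negative average log p than the flat set.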

Null Hypotheses for Gene Set Enrichment Analyses
Determine whether the value of S is more extreme than would be expected if none of the genes in the set were differentially expressed
  –Permute class labels randomly and re-calculate p values and summary S
  –Repeat for all or many permutations and generate the distribution of S under the null hypothesis
  –Compute p* = the proportion of the random permutations that gave a value of S at least as great as with the true class labels
Determine whether the value of S is more extreme than would be expected from a random sample of k genes on that platform
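The first null hypothesis (permuting class labels) can be sketched as below. Several things are assumptions of this sketch: S is taken to be the average pooled-variance t statistic over the genes in the set, the comparison is one-sided ("at least as great"), every relabeling is enumerated because the example is tiny, and all names and data are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def t_statistic(x1, x2):
    """Pooled-variance two-sample t statistic."""
    n1, n2 = len(x1), len(x2)
    pooled = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    return (mean(x1) - mean(x2)) / sqrt(pooled * (1 / n1 + 1 / n2))


def summary_s(gene_set, class1_idx, class2_idx):
    """S = average t statistic over all genes in the set."""
    return mean(
        [t_statistic([g[i] for i in class1_idx], [g[i] for i in class2_idx])
         for g in gene_set]
    )


def permutation_p_value(gene_set, class1_idx):
    """Return (observed S, p*) by enumerating every relabeling of the samples."""
    all_idx = list(range(len(gene_set[0])))
    class2_idx = [i for i in all_idx if i not in class1_idx]
    observed = summary_s(gene_set, class1_idx, class2_idx)
    at_least_as_great = total = 0
    for perm1 in combinations(all_idx, len(class1_idx)):
        perm2 = [i for i in all_idx if i not in perm1]
        total += 1
        if summary_s(gene_set, list(perm1), perm2) >= observed:
            at_least_as_great += 1
    return observed, at_least_as_great / total


# Hypothetical set of 2 genes measured on 6 samples; samples 0-2 are class 1.
gene_set = [
    [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],
    [4.1, 5.2, 6.3, 1.4, 2.5, 3.6],
]
observed_s, p_star = permutation_p_value(gene_set, [0, 1, 2])
print(observed_s, p_star)
```

With 3 samples per class there are only 20 distinct relabelings, so the smallest attainable p* is 1/20 = 0.05. For larger studies a random subset of permutations would be drawn instead of enumerating all of them.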

Gene Set Expression Comparison
p value for significance of summary statistic usually need not be as extreme as 0.001, because the number of gene sets analyzed is usually much less than the number of individual genes analyzed
Conclusions of significance are for gene sets in this tool, not for individual genes

Comparison of Gene Set Expression Comparison to O/E Analysis in Class Comparison
Gene set expression tool is based on all genes in a set, not just on those significant at some threshold value

P Pavlidis, DP Lewis, WS Noble. Pac Symp Biocomp, , 2002
VK Mootha, CM Lindgren, KF Eriksson, A Subramanian, et al. Nature Genetics 34:267-73, 2003
P Pavlidis, J Qin, V Arango, JJ Mann, E Sibille. Neurochemical Research 29: , 2004
JJ Goeman, SA van de Geer, F de Kort, HC van Houwelingen. Bioinformatics 20:93-99, 2004
A Subramanian, P Tamayo, VK Mootha, et al. PNAS 102: , 2005
WT Barry, AB Nobel, FA Wright. Bioinformatics 21: , 2005
L Tian, SA Greenberg, SW Kong, J Altschuler, IS Kohane, PJ Park. PNAS 102: , 2005
SW Kong, WT Pu, PJ Park. Bioinformatics 22: , 2006