Analysis of gene expression data (Nominal explanatory variables) Shyamal D. Peddada Biostatistics Branch National Inst. Environmental Health Sciences (NIH)


Analysis of gene expression data (Nominal explanatory variables) Shyamal D. Peddada Biostatistics Branch National Inst. Environmental Health Sciences (NIH) Research Triangle Park, NC

Outline of the talk Two types of explanatory variables (“experimental conditions”) Some scientific questions of interest A brief discussion on false discovery rate (FDR) analysis Some existing statistical methods for analyzing microarray data

Types of explanatory variables

Types of explanatory variables (“experimental conditions”) Nominal variables: – No intrinsic order among the levels of the explanatory variable(s). – No loss of information if we permuted the labels of the conditions. E.g. Comparison of gene expression of samples from “normal” tissue with those from “tumor” tissue.

Types of explanatory variables (“experimental conditions”) Ordinal/interval variables: – Levels of the explanatory variables are ordered. – E.g. Comparison of gene expression of samples from different stages of severity of lesions such as “normal”, “hyperplasia”, “adenoma” and “carcinoma” (categorically ordered). Time-course/dose-response experiments (numerically ordered).

Focus of this talk: Nominal explanatory variables

Types of microarray data Independent samples – E.g. comparison of gene expression of independent samples drawn from normal patients versus independent samples from tumor patients. Dependent samples – E.g. comparison of gene expression of samples drawn from normal tissues and tumor tissues from the same patient.

Possible questions of interest Identify significant “up/down” regulated genes for a given “condition” relative to another “condition” (adjusted for other covariates). Identify genes that discriminate between various “conditions” and predict the “class/condition” of a future observation. Cluster genes according to patterns of expression over “conditions”. Other questions?

Challenges Small sample size but a large number of genes. Multiple testing – Since each microarray has thousands of genes/probes, several thousand hypotheses are being tested. This impacts the overall Type I error rates. Complex dependence structure between genes and possibly among samples. – Difficult to model and/or account for the underlying dependence structures among genes.

Multiple Testing: Type I Errors - False Discovery Rates …

The Decision Table (usual Benjamini-Hochberg notation)
                        Number not rejected   Number rejected   Total
Number of true H0       U                     V                 m0
Number of false H0      T                     S                 m − m0
Total                   m − R                 R                 m
The only observable values are R and m.

Strong and weak control of type I error rates Strong control: control the type I error rate under any combination of true and false null hypotheses. Weak control: control the type I error rate only when all null hypotheses are true. Since we do not know a priori which hypotheses are true, we will focus on strong control of the type I error rate.

Consequences of multiple testing Suppose we test each hypothesis at the 5% level of significance. – Suppose n = 10 independent tests are performed and all null hypotheses are true. Then the probability of declaring at least 1 of the 10 tests significant is $1 - 0.95^{10} \approx 0.40$. – If 50,000 independent tests are performed, as in Affymetrix microarray data, then you should expect $50{,}000 \times 0.05 = 2500$ false positives!
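The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes independent tests and that every null hypothesis is true; the function names are illustrative and are not part of the talk.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """P(at least one rejection) for n independent tests when every null is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of false rejections when every null is true."""
    return n_tests * alpha


p10 = prob_at_least_one_false_positive(10)
e50k = expected_false_positives(50_000)
print(f"P(at least one false positive in 10 tests) = {p10:.3f}")
print(f"Expected false positives in 50,000 tests   = {e50k:.0f}")
```

Even ten tests already give roughly a 40% chance of at least one false positive, which is why per-test significance levels are not enough at microarray scale.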

Types of errors in the context of multiple testing Per-Family Error “Rate” (PFER): $E(V)$ – Expected number of false rejections of $H_0$. Per-Comparison Error Rate (PCER): $E(V)/m$ – Expected proportion of false rejections of $H_0$ among all $m$ hypotheses. Family-Wise Error Rate (FWER): $P(V > 0)$ – Probability of at least one false rejection of $H_0$ among all $m$ hypotheses.

Types of errors in the context of multiple testing False Discovery Rate (FDR): $E(V/R)$ – Expected proportion of Type I errors among all rejected hypotheses. Benjamini-Hochberg (BH): Set $V/R = 0$ if $R = 0$. Storey: Only interested in the case $R > 0$ (positive FDR): $\mathrm{pFDR} = E(V/R \mid R > 0)$.

Some useful inequalities

Conclusion It is conservative to control FWER rather than FDR! It is conservative to control pFDR rather than FDR!
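The two statements in this conclusion follow from short inequalities. The derivation below is a sketch using $V$ (number of false rejections) and $R$ (total number of rejections), with the BH convention $V/R = 0$ when $R = 0$; the indicator notation $\mathbf{1}\{\cdot\}$ is an assumption of this sketch and is not taken from the slides.

```latex
\begin{align*}
\mathrm{FDR}
  &= E\!\left[\tfrac{V}{R}\,\mathbf{1}\{R>0\}\right]
   = E\!\left[\tfrac{V}{R}\,\middle|\,R>0\right] P(R>0)
   = \mathrm{pFDR}\cdot P(R>0) \;\le\; \mathrm{pFDR}, \\[4pt]
\mathrm{FDR}
  &= E\!\left[\tfrac{V}{R}\,\mathbf{1}\{R>0\}\right]
   \;\le\; E\!\left[\mathbf{1}\{V>0\}\right]
   = P(V>0) = \mathrm{FWER}.
\end{align*}
```

The second line uses $V \le R$, so the ratio is at most 1 and equals 0 whenever $V = 0$. Any procedure that keeps FWER or pFDR below $\alpha$ therefore also keeps FDR below $\alpha$.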

Some useful inequalities

However, in most applications such as microarrays, one expects In general, there is no proof of the statement

q-values versus p-values. Suppose and suppose we are interested in a one-sided test. Suppose is the value of the test statistic for a given data set.

q-values versus p-values. The pFDR can be rewritten as Suppose is the value of the test statistic for a given data set. Then the q-value is the posterior-Bayesian p-value

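Some popular Type I error controlling procedures Let $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$ denote the ordered p-values for the $m$ tests that are being performed, and let $H_{(1)}, \dots, H_{(m)}$ denote the corresponding null hypotheses. Let $\alpha_1 \le \alpha_2 \le \dots \le \alpha_m$ denote the ordered levels of significance used for testing the $m$ null hypotheses, respectively.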

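Some popular controlling procedures Step-down procedure: Start with the smallest p-value. Reject the hypothesis $H_{(i)}$ with the $i$-th smallest p-value $p_{(i)}$ as long as $p_{(i)} \le \alpha_i$, where $\alpha_i$ is the $i$-th critical constant; stop at the first $i$ with $p_{(i)} > \alpha_i$ and retain $H_{(i)}, \dots, H_{(m)}$.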

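Some popular controlling procedures Step-up procedure: Start with the largest p-value. Retain the hypothesis $H_{(i)}$ with the $i$-th smallest p-value $p_{(i)}$ as long as $p_{(i)} > \alpha_i$, where $\alpha_i$ is the $i$-th critical constant; at the first (i.e. largest) $i$ with $p_{(i)} \le \alpha_i$, stop and reject $H_{(1)}, \dots, H_{(i)}$.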

Some popular controlling procedures Single-step procedure: A stepwise procedure with the same critical constant for all $m$ hypotheses, i.e. $\alpha_1 = \alpha_2 = \dots = \alpha_m$.

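Some typical stepwise procedures: FWER controlling procedures
– Bonferroni: A single-step procedure with $\alpha_i = \alpha/m$.
– Sidak: A single-step procedure with $\alpha_i = 1 - (1 - \alpha)^{1/m}$.
– Holm: A step-down procedure with $\alpha_i = \alpha/(m - i + 1)$.
– Hochberg: A step-up procedure with $\alpha_i = \alpha/(m - i + 1)$.
– minP method: A resampling-based single-step procedure with $\alpha_i = c_\alpha$, where $c_\alpha$ is the $\alpha$ quantile of the distribution of the minimum p-value.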
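The four non-resampling procedures differ only in their critical constants and in the direction of the stepwise search. The sketch below uses the standard textbook constants (Bonferroni $\alpha/m$, Sidak $1-(1-\alpha)^{1/m}$, Holm and Hochberg $\alpha/(m-i+1)$); the function names and the example p-values are illustrative assumptions, and the minP method is omitted because it requires resampling.

```python
def critical_constants(method, m, alpha=0.05):
    """Return alpha_1..alpha_m for the ordered p-values p_(1) <= ... <= p_(m)."""
    if method == "bonferroni":
        return [alpha / m] * m
    if method == "sidak":
        return [1.0 - (1.0 - alpha) ** (1.0 / m)] * m
    if method in ("holm", "hochberg"):
        return [alpha / (m - i + 1) for i in range(1, m + 1)]
    raise ValueError(f"unknown method: {method}")


def step_down(sorted_p, consts):
    """Count rejections: walk up from the smallest p-value, stop at the first failure."""
    k = 0
    for p, a in zip(sorted_p, consts):
        if p > a:
            break
        k += 1
    return k


def step_up(sorted_p, consts):
    """Count rejections: the largest i with p_(i) <= alpha_i (0 if there is none)."""
    k = 0
    for i, (p, a) in enumerate(zip(sorted_p, consts), start=1):
        if p <= a:
            k = i
    return k


def n_rejected(pvalues, method, alpha=0.05):
    """Number of hypotheses rejected (those with the smallest p-values)."""
    sorted_p = sorted(pvalues)
    consts = critical_constants(method, len(sorted_p), alpha)
    if method == "hochberg":
        return step_up(sorted_p, consts)
    # With identical constants, step-down reduces to a single-step procedure.
    return step_down(sorted_p, consts)


example_p = [0.001, 0.02, 0.022, 0.024, 0.04]  # made-up p-values
for name in ("bonferroni", "sidak", "holm", "hochberg"):
    print(name, n_rejected(example_p, name))
```

On these made-up p-values Holm stops at the second hypothesis, while Hochberg rejects all five because the largest p-value already falls below its constant. This illustrates why a step-up rule with the same constants is never less powerful than the step-down rule.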

Comments on the methods Bonferroni: Very general but can be too conservative for a large number of hypotheses. Sidak: More powerful than Bonferroni, but applicable only when the test statistics are independent or have certain types of positive dependence.

Comments on the methods Holm: More powerful than Bonferroni and applicable for any type of dependence structure between test statistics. Hochberg: More powerful than Holm’s procedure, but the test statistics should either be independent or have the MTP2 property.

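Comments on the methods Multivariate Total Positivity of Order 2 (MTP2): A joint density $f$ is MTP2 if $f(x)\,f(y) \le f(x \wedge y)\,f(x \vee y)$ for all $x, y$, where $\wedge$ and $\vee$ denote the componentwise minimum and maximum.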

Some typical stepwise procedures: FDR controlling procedure Benjamini-Hochberg: A step-up procedure with $\alpha_i = i\alpha/m$.
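A minimal sketch of the Benjamini-Hochberg step-up rule, using the standard constants $\alpha_i = i\alpha/m$. The function name and the two sets of p-values are made up for illustration.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the indices (into pvalues) rejected by the BH step-up rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by increasing p-value
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            k = rank  # remember the largest rank that passes its threshold
    return sorted(order[:k])


p_a = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
p_b = [0.01, 0.04, 0.03, 0.045]
print(benjamini_hochberg(p_a))  # only the two smallest p-values pass
print(benjamini_hochberg(p_b))  # the largest p-value passes, so all are rejected
```

The second example shows the step-up behaviour: the middle p-values exceed their own thresholds, but they are still rejected because a larger p-value satisfies its threshold.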

An Illustration Lobenhofer et al. (2002) data: Breast cancer cells were exposed to estradiol for 1 hour or for 12, 24, or 36 hours. Number of genes on the cDNA 2 spot array. Number of samples per time point: 8. Compare 1 hour with 12, 24, and 36 hours using a two-sided bootstrap t-test.

Some Popular Methods of Analysis

1. Fold-change

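1. Fold-change in gene expression For gene “g” compute the fold change between two conditions (e.g. treatment and control): $R_g = \bar{X}_{g,\mathrm{trt}} / \bar{X}_{g,\mathrm{ctl}}$, the ratio of the mean expression of gene “g” under treatment to its mean expression under control.

1. Fold-change in gene expression $c_1 > 1 > c_2 > 0$: pre-defined constants. $R_g \ge c_1$: gene “g” is “up-regulated”. $R_g \le c_2$: gene “g” is “down-regulated”.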

1. Fold-change in gene expression Strengths: – Simple to implement. – Biologists find it very easy to interpret. – It is widely used. Drawbacks: – Ignores variability in mean gene expression. – Genes with subtle gene expression values can be overlooked. i.e. potentially high false negative rates – Conversely, high false positive rates are also possible.
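The fold-change rule is easy to sketch in code, and doing so makes the main drawback visible: only the means enter the decision. The cut-offs 2 and 0.5 below are a common convention assumed for illustration, not values given in the talk, and the intensities are made up (positive, not log-transformed).

```python
from statistics import mean


def fold_change(treatment, control):
    """Ratio of mean expression under treatment to mean expression under control."""
    return mean(treatment) / mean(control)


def call_gene(treatment, control, c_up=2.0, c_down=0.5):
    """Label a gene using fold change alone; replicate variability is ignored."""
    ratio = fold_change(treatment, control)
    if ratio >= c_up:
        return "up"
    if ratio <= c_down:
        return "down"
    return "unchanged"


control = [100, 90, 110]
stable_gene = [300, 310, 290]  # consistent three-fold increase
noisy_gene = [20, 30, 850]     # same mean, driven by one replicate
print(call_gene(stable_gene, control), call_gene(noisy_gene, control))
```

Both genes receive the same call even though the second one owes its fold change to a single replicate, which is the false-positive risk listed under the drawbacks.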

2. t-test type procedures

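2.1 Permutation t-test For each gene “g” compute the standard two-sample t-statistic: $t_g = \dfrac{\bar{X}_{g1} - \bar{X}_{g2}}{S_g \sqrt{1/n_1 + 1/n_2}}$, where $\bar{X}_{g1}, \bar{X}_{g2}$ are the sample means, $S_g$ is the pooled sample standard deviation, and $n_1, n_2$ are the sample sizes of the two conditions.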

2.1 Permutation t-test Statistical significance of a gene is determined by computing the null distribution of the t-statistic using either a permutation or a bootstrap procedure.

2.1 Permutation t-test Strengths: – Simple to implement. – Biologists find it very easy to interpret. – It is widely used. Drawback: – Potentially, for some genes the pooled sample standard deviation could be very small and hence it may result in inflated Type I errors and inflated false discovery rates.
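A permutation t-test for a single gene can be sketched as follows. This is a minimal two-sided version that uses random label permutations rather than full enumeration; the `(hits + 1) / (n_perm + 1)` p-value convention, the function names, and the expression values are assumptions of this sketch.

```python
import random
from math import sqrt
from statistics import mean, variance


def pooled_t(x, y):
    """Standard two-sample t-statistic with pooled variance."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided p-value from randomly permuting the condition labels."""
    rng = random.Random(seed)
    observed = abs(pooled_t(x, y))
    pooled = list(x) + list(y)
    n1 = len(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Re-split the shuffled values into two pseudo-groups of the original sizes.
        if abs(pooled_t(pooled[:n1], pooled[n1:])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


normal = [5.1, 5.3, 4.9, 5.2, 5.0, 5.4]  # made-up log-expression values
tumor = [6.1, 6.4, 5.9, 6.2, 6.0, 6.3]
p_gene = permutation_pvalue(normal, tumor)
print(p_gene)
```

The sketch assumes the replicate values are not all identical, since a pooled standard deviation of zero would make the statistic undefined. That is the extreme case of the small-variance problem noted under the drawback.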

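2.2 SAM procedure (Significance Analysis of Microarrays) (Tusher et al., PNAS 2001) For each gene “g” modify the standard two-sample t-statistic as: $d_g = \dfrac{\bar{X}_{g1} - \bar{X}_{g2}}{s_g + s_0}$, where $\bar{X}_{g1}, \bar{X}_{g2}$ are the sample means and $s_g$ is the gene-specific standard error of their difference. The “fudge” factor $s_0$ is obtained such that the coefficient of variation in the above test statistic is minimized.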
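The effect of the fudge factor can be seen with a small sketch. Here `s0` is passed in as a fixed number, which is a simplification: the data-driven choice of the fudge factor described on the slide is not implemented, and the function name and data are illustrative.

```python
from math import sqrt
from statistics import mean, variance


def sam_statistic(x, y, s0):
    """SAM-type statistic: difference in means over (standard error + s0)."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    s = sqrt(sp2 * (1 / n1 + 1 / n2))  # standard error of the difference in means
    return (mean(x) - mean(y)) / (s + s0)


# A gene with a tiny difference but an even tinier variance (made-up values).
x = [1.00, 1.01, 1.02]
y = [1.10, 1.11, 1.12]
d_plain = sam_statistic(x, y, s0=0.0)    # ordinary pooled t-statistic
d_shrunk = sam_statistic(x, y, s0=0.05)  # s0 = 0.05 is a hypothetical value
print(d_plain, d_shrunk)
```

With `s0 = 0` the statistic is the ordinary t-statistic, which is very large here only because the estimated variance is small; a positive `s0` pulls such genes back toward zero.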

3. F-test and its variations for more than 2 nominal conditions Usual F-test: the P-values can be obtained by a suitable permutation procedure. Regularized F-test: Generalization of the Baldi and Long methodology to multiple groups. – It controls the false discovery rate better, and its power is comparable to that of the F-test. Cui and Churchill (2003) is a good review paper.

4. Linear fixed effects models Effects: – Array (A) - sample – Dye (D) – Variety (V) – test groups – Genes (G) – Expression (Y)

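4. Linear fixed effects models (Kerr, Martin, and Churchill, 2000) Linear fixed effects model: $\log(Y_{ijkg}) = \mu + A_i + D_j + V_k + G_g + (AG)_{ig} + (VG)_{kg} + \varepsilon_{ijkg}$, where $(AG)$ and $(VG)$ denote the array-by-gene and variety-by-gene interactions.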

4. Linear fixed effects models All effects are assumed to be fixed effects. Main drawback – all genes have same variance!

5. Linear mixed effects models (Wolfinger et al. 2001) Stage 1 (Global normalization model) Stage 2 (Gene specific model)

5. Linear mixed effects models Assumptions:

5. Linear mixed effects models (Wolfinger et al. 2001) Perform inferences on the interaction term

A popular graphical representation: The Volcano Plots A scatter plot of $-\log_{10}(\text{p-value})$ vs $\log_2(\text{fold change})$. Genes with large fold change will lie outside a pair of vertical “threshold” lines. Further, genes which are highly significant with large fold change will lie in either the upper right-hand or the upper left-hand corner.
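The coordinates of a volcano plot, and the rule for picking out the two upper corners, can be sketched without any plotting library. The axes are assumed to follow the common convention (log2 fold change horizontally, −log10 p-value vertically), and the cut-offs of a 2-fold change and p = 0.01 are hypothetical.

```python
from math import log2, log10


def volcano_point(fold_change, p_value):
    """Map a gene to its (x, y) position on a volcano plot."""
    return log2(fold_change), -log10(p_value)


def in_upper_corner(fold_change, p_value, fc_cut=2.0, p_cut=0.01):
    """True if the gene lies outside the vertical lines and above the horizontal one."""
    x, y = volcano_point(fold_change, p_value)
    return abs(x) >= log2(fc_cut) and y >= -log10(p_cut)


genes = {
    "g1": (4.0, 0.001),   # large increase, significant
    "g2": (1.2, 0.001),   # significant but small change
    "g3": (0.2, 0.0001),  # large decrease, significant
    "g4": (4.0, 0.5),     # large change, not significant
}
selected = sorted(name for name, (fc, p) in genes.items() if in_upper_corner(fc, p))
print(selected)
```

Taking logarithms makes up- and down-regulation symmetric around zero, so a single absolute-value threshold covers both vertical lines.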

A useful review article Cui, X. and Churchill, G (2003), Genome Biology. Software: R package: statistics for microarray analysis. SAM: Significance Analysis of Microarray

Supervised classification algorithms

Discriminant analysis based methods A. Linear and Quadratic Discriminant analysis based methods: Strength: – Well studied in the classical statistics literature Limitations: – Based on normality – Imposes constraints on the covariance matrices. Need to be concerned about the singularity issue. – No convenient strategy has been proposed in the literature to select the “best” discriminating subset of genes.

Discriminant analysis based methods B. Nonparametric classification using Genetic Algorithm and K-nearest neighbors. – Li et al. (Bioinformatics, 2001) Strengths: – Entirely nonparametric – Takes into account the underlying dependence structure among genes – Does not require the estimation of a covariance matrix Weakness: – Computationally very intensive

GA/KNN methodology – very brief description Computes the Euclidean distance between all pairs of samples based on a sub-vector of, say, 50 genes. Assigns each sample to a treatment group (i.e. condition) based on its K nearest neighbors. Computes a fitness score for each subset of genes based on how many samples are correctly classified. This is the objective function. The objective function is optimized using a Genetic Algorithm.
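The KNN part of this description can be sketched as a leave-one-out fitness score for a given gene subset. This is a simplified illustration, not the implementation of Li et al.: it assumes a plain majority vote among the k nearest neighbors, and the function names and the tiny data set are made up.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training samples closest to `query` (Euclidean)."""
    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_fitness(samples, labels, genes, k=3):
    """Number of samples classified correctly by leave-one-out KNN using only `genes`."""
    sub = [[s[g] for g in genes] for s in samples]
    correct = 0
    for i in range(len(sub)):
        rest_x = sub[:i] + sub[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        if knn_predict(rest_x, rest_y, sub[i], k) == labels[i]:
            correct += 1
    return correct


# Eight samples, three genes; only gene 0 separates the two conditions.
samples = [
    [1.0, 5.0, 0.3], [1.2, 1.0, 0.9], [0.9, 3.0, 0.1], [1.1, 4.0, 0.7],
    [3.0, 4.5, 0.2], [3.2, 1.5, 0.8], [2.9, 3.5, 0.4], [3.1, 2.0, 0.6],
]
labels = ["ctl"] * 4 + ["trt"] * 4
fit_good = loo_fitness(samples, labels, genes=[0])
fit_poor = loo_fitness(samples, labels, genes=[1])
print(fit_good, fit_poor)
```

A subset containing the informative gene classifies every sample correctly, while a subset of uninformative genes scores lower; this score is what the genetic algorithm tries to maximize.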

[Figure: K-nearest neighbors classification (k=3). Axes: expression levels of gene 1 vs. expression levels of gene 2.]

[Figure: Subcategories within a class. Axes: expression levels of gene 1 vs. expression levels of gene 2.]

Advantages of KNN approach Simple, performs as well as or better than more complex methods Free from assumptions such as normality of the distribution of expression levels Multivariate: takes account of dependence in expression levels Accommodates or even identifies distinct subtypes within a class

Expression data: many genes and few samples There may be many subsets of genes that can statistically discriminate between the treated and untreated. There are too many possible subsets to look at. With 3,000 genes, there are about $7 \times 10^{71}$ ways to make subsets of size 30.
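The size of the search space is a binomial coefficient, which Python can evaluate exactly. The variable names are illustrative.

```python
from math import comb

n_genes = 3000
subset_size = 30
n_subsets = comb(n_genes, subset_size)  # exact integer: "3000 choose 30"
n_digits = len(str(n_subsets))
print(f"{n_subsets:.2e} possible subsets ({n_digits} digits)")
```

Exhaustive evaluation of that many subsets is out of the question, which motivates a stochastic search such as a genetic algorithm.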

The genetic algorithm Computer algorithm (John Holland) that works by mimicking Darwin's natural selection Has been applied to many optimization problems ranging from engine design to protein folding and sequence alignment Effective in searching high dimensional space

GA works by mimicking evolution Randomly select sets (“chromosomes”) of 30 genes from all the genes on the chip Evaluate the “fitness” of each “chromosome” – how well can it separate the treated from the untreated? Pass “chromosomes” randomly to next generation, with preference for the fittest
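The three steps above can be sketched as a small mutation-only genetic search. This is a simplified illustration and not the algorithm of Li et al.: the population size, the mutation rate, fitness-proportional selection, keeping the single best chromosome, and the toy fitness function are all assumptions. In GA/KNN the fitness would be the number of samples correctly classified by KNN.

```python
import random


def genetic_search(n_genes, subset_size, fitness, pop_size=20,
                   generations=30, mutation_rate=0.2, seed=0):
    """Search for a high-fitness gene subset; `fitness` must return a value >= 0."""
    rng = random.Random(seed)
    # Initial population: random "chromosomes", each a sorted tuple of gene indices.
    population = [tuple(sorted(rng.sample(range(n_genes), subset_size)))
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Small offset keeps the weights valid even if every fitness is zero.
        weights = [fitness(c) + 1e-9 for c in population]
        next_gen = [best]  # always carry the fittest chromosome forward
        while len(next_gen) < pop_size:
            # Fitter chromosomes are more likely to be passed on.
            child = list(rng.choices(population, weights=weights, k=1)[0])
            if rng.random() < mutation_rate:
                # Mutation: replace one gene with a gene not yet in the chromosome.
                position = rng.randrange(subset_size)
                candidates = [g for g in range(n_genes) if g not in child]
                child[position] = rng.choice(candidates)
            next_gen.append(tuple(sorted(child)))
        population = next_gen
        best = max(population, key=fitness)
    return best


def toy_fitness(chromosome):
    """Hypothetical fitness: how many of the 'informative' genes 0, 1, 2 are included."""
    return sum(1 for g in chromosome if g < 3)


best_subset = genetic_search(n_genes=50, subset_size=3, fitness=toy_fitness)
print(best_subset, toy_fitness(best_subset))
```

Because the best chromosome is always copied into the next generation, the best fitness never decreases from one generation to the next. The search is still stochastic, so it is not guaranteed to reach the optimum.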

Summary Pay attention to the multiple testing problem. – Prefer FDR over FWER for large data sets such as gene expression microarrays. Linear mixed effects models may be used for comparing expression data between groups. For classification problems, one may want to consider the GA/KNN approach.