Download presentation
Presentation is loading. Please wait.
Published byRandolph Wilkinson Modified over 9 years ago
1
Differential Expression II Adding power by modeling all the genes Oct 06
2
Linear Model for Several Conditions Y ij = + i + error ij is the mean expression for the gene over the entire experiment. i is the deviation of the mean of the i th condition from the overall mean i i =0 Notice that this is a model for 1 gene at a time.
3
A basic feature of models The more items we estimate from the data, the less precisely we can do the estimation. The fewer items we estimate from the data, the more we have to rely on the correctness of the model. e.g. Y ij = + i + error ij Because we are interested in hypotheses about the means, and are less interested in the variability, we often assume that the variability does not depend on the treatment. When we do an analysis, we check this assumption, but we do not worry about small violations.
4
A basic feature of models Because we are interested in hypotheses about the means, and are less interested in the variability, we often assume that the variability does not depend on the treatment. When we do an analysis, we check this assumption, but we do not worry about small violations. For the 2-sample t-test, we have only 2 means, so if we have sufficient observations, we may allow the variability to depend on the treatment. (This is the default.) If we have only a few observations per treatment, we might assume that the variances are the same ("pooled variance") which gives more power (if the assumption is correct.)
5
Precision and Degrees of Freedom The precision of t and F tests is determined in part by the (error) degrees of freedom. Assuming constant variance gives more d.f. F - distributions t - distributions
6
Adding Power by Modeling The main idea: Since differential expression focuses on differences in means, use a model for the variance that uses data from all the genes. 1) all genes have the same variance under all treatments (frequentist) 2) the variances have a known distribution (Bayesian) 3) the variances have a distribution that can be estimated (Empirical Bayes) 4) regularization or shrinkage - combine information from individual genes with information from all the genes
7
Adding Power by Modeling 1) all genes have the same variance under all treatments (frequentist) expression log 2 (expression)
8
Adding Power by Modeling as a function of mean We could assume that Var(y ij ) = ij Then the t-test would be where we would use a pooled estimate over all genes to estimate 2
9
Adding Power by Modeling We could assume that Var(y ij ) = 2 or Var(y ij ) = ij E.g. with 2 treatments and 2 samples per treatment, we have only (n+m-2)=2 d.f. for the usual t-test. If we use the pooled estimate of variance, we obtain 2 x number genes d.f. for the test (practically infinite)
10
Adding Power by Modeling as a distribution This leads to Bayes and Empirical Bayes models. Bayes: We assume that we know the distribution before we even see the data. Empirical Bayes: We assume a shape for the distribution (e.g. Normal or a multiple of a chi-square) and estimate parameters such as the mean and variance or we use a smoothed version of the sample histogram
11
The Effect of (Empirical) Bayes Let S p 2 be the (estimate of the) mean of the prior. Then the (empirical) Bayes estimate of variance of X is a weighted average of S x 2 and S p 2 and similarly for Y. The associated d.f. for is a weighted average of the d.f. of S x 2 (sample size -1) and the pooled variance (the d.f. depend on the model for the prior). The t-test then becomes: I am not aware of any software that uses the formula here, as the software assumes that the variance depends on the gene, but not on the treatment
12
Adding Power by Regularization The idea comes from what is called regularization in matrix inversion - adding a small constant before taking inverses to avoid dividing by numbers close to zero. So, we might replace S x 2 by S x 2 +s where s is some small number. What number should be chosen? Usually s is based on the histogram of within gene variances, so this method is similar to empirical Bayes But: there are no d.f. associated with the estimate, so resampling (permutation or bootstrap) is used to estimate the null distribution.
13
Some Available Software MAANOVA: available for Matlab and in Bioconductor Fits a very general ANOVA model with fixed and random effects with the following options: a) a variance for each gene b) all genes have the same variance c) regularized variance estimate
14
Some Available Software SAM: available for Excel samr: in Bioconductor Fits a few ANOVA models and a survival model using a regularized test statistic and a null distribution based on permutations (simple interface in Excel)
15
Some Available Software CyberT: http://visitor.ics.uci.edu/genex/cybert/ Fits paired, 2-sample and one-way ANOVA models using a full Bayesian model Can be run on a web interface, or you can download the R code.
16
Some Available Software in Bioconductor Limma: Fits an empirical Bayes model based on modeling the variances for each gene as S p 2 Chi-squared, where S p 2 and the d.f. are estimated from the data. Allows 1 random effect estimated as a fraction of the variance.
17
Some Available Software in Bioconductor LPE: Estimates the variance of each gene as a function of the mean expression level multtest: does permutation tests based on t and F statistics (including regularized versions) geneTS: for time-course experiments
18
Some Other DE Software in Bioconductor EBayes: Bioconductor LMGene:
19
Multiple Comparisons Regardless of our choice of analysis method, the end result is a measure of statistical significance for each gene. For frequentist methods, the measure is: P(observed test statistic| H 0 is true) For Bayesian methods, the measure is P(H 0 is true | observed data)
20
Multiple Comparisons To understand the multiple comparisons problem, we start with the p-value for the i th gene: P(observed i th test statistic| H 0i is true)=p i Typically, we select a cut-off such as =.05 or =.01 and declare that gene i differentially expresses (reject H 0i ) if p i ≤
21
Types of Error AcceptRejectTotal H 0 trueTVm0m0 H A trueUSm-m 0 Totalm-RRm If we test at level , E(V)= m 0 If is the average power is , E(U)= m-m 0 ) e.g. =0.05 =0.8 m=10,000 m 0 =9000 E(V)=0.05*9000=450 E(U)=0.2*1000=200 Most MC procedures focus on V m=number of genes m 0 = number of genes that do not differentially express R=number of genes for which we reject H 0
22
3.5 main ideas (x 2) 1. Family-wise error rate: adjust the value of to force P(V>0) ≤ some predefined level such as.05 or.01 2. False discovery rate: choose a value of (estimated from the data) to keep E(V/R) ≤ some predefined level such as.05 or.01 3. Estimate m 0 (or 0 = m 0 /m) from the data, and then apply 1 or 2. 3.5. Pick and report the estimate of AcceptRejectTotal H 0 trueTVm0m0 H A trueUSm-m 0 Totalm-RRm
23
3.5 main ideas (x 2) 1. Family-wise error rate: adjust the value of to force P(V>0) ≤ some predefined level such as.05 or.01 2. False discovery rate: choose a value of (estimated from the data) to keep E(V/R) ≤ some predefined level such as.05 or.01 3. Estimate m 0 (or 0 = m 0 /m) from the data, and then apply 1 or 2. 3.5. Pick and report the estimate of x2 Apply the same ideas to the false negatives AcceptRejectTotal H 0 trueTVm0m0 H A trueUSm-m 0 Totalm-RRm
24
Family-wise Error Rate adjust the value of to force P(V>0) ≤ some predefined level such as.05 or.01 The best-known method is the Bonferroni method. If you are doing m tests, use m. Problem: This is much to conservative and explodes the U, the number of non-detections. e.g. =0.05 m=10,000 * /10000=.0000005 The adjusted p-value is p i /m
25
Family-wise Error Rate e.g. =0.05 m=10,000 * /10000=.0000005 With 5 d.f., reject for p 2.57 reject for p 20.59 If we can estimate m 0 we only need to adjust by dividing by m 0 which provides a slight improvement. e.g. m 0 =9000 reject when |t*|>20.16
26
False Discovery Rate FDR=E(V/R) where V/R==0 if R=0. There is a related method called the positive FDR pFDR=E(V/R | R>0)
27
Controlling False Discovery Rate Benjamini and Hochberg (1995) If the p-values are independent, then the following procedure controls the FDR at level Sort the p-values: p (1) ≤p (2)... ≤p (m) Let k be the largest value of i for which p (i) ≤ i /m. We then reject all the hypotheses with p-value ≤ p (k) (BH method) Benjamini and Hochberg (2000) Same procedure with an estimate of m 0 Benjamini and Yekutieli (2001) The BH procedure works with some types of dependent tests. The BH procedure works with arbitrary dependency if we replace by i/m) ≈ /ln(m) (Compare with Bonferroni)
28
Estimating pFDR Storey and Tibshirani (2001) Let n 0 be an estimate of m 0. An estimate of pFDR = n 0 /R if we reject for p< They also introduced the q-value - for each observed value of p, the q-value is the largest estimated pFDR if we reject for smaller p-values. The q-value (or the FDR equivalent) is sometimes called an adjusted p-value, but it is not a p-value=prob(observed|H 0 ) The pFDR cannot be "controlled" because Pr(R=0)≠0, but it is readily estimated.
29
Estimating m 0 When m 0 =m, all the genes are independent, and the test statistics follow the theoretical null distributions, the p- values are uniformly distributed. When m 0 <m the differentially expressed genes tend to have smaller p-values
30
Estimating m 0 When m 0 <m the differentially expressed genes tend to have smaller p-values. The flat part of the histogram is used to estimate m 0. There are several methods, and they work quite well as long as the peak near zero is sharp.
32
Problems in estimating m 0 and FDR When the power is low (small sample size or noisy data) we may estimate m 0 =m and FDR>50% even for the smallest p-value. Even if we estimate m 0 <m (so m-m 0 genes DE) we will not "detect m-m 0 genes. If m 0 is very small, there may not be a large enough flat region of the histogram for estimation. Occasionally, the distribution of p-values is bi-modal - probably due to the distribution of expression too far from normal.
33
Bayesian Methods and Local FDR So far, we have discussed Bayesian ANOVA methods which use a prior for the expression variance. To adjust for multiple comparisons, Bayesians model the log2(expression ratio), t-value or p- value. As an example, I will show how the method works for modeling p-values.
34
Distribution of p-values The p-values of the nonDE genes is uniform. The p- values of the DE genes are skewed so that small p- values are more prevalent. The percentage of nonDE genes is 0 =m 0 /m.
35
Distribution of p-values The distribution of p-values is o f o (p) + (1- o )f 1 (p). f 0 (p) f 1 (p) 0 f 0 (p)+(1- 0 )f 1 (p)
36
Bayesian Interpretation If you reject when p i < the FDR will be Socan be thought of as a local FDR It seems like we should reject H 0i when local FDR is small more importantly than when FDR is small
37
Frequentists, Bayesians and FDR In the same Bayesian context, and with we find that So, pFDR, the rejection region, power and the Bayesian posterior probability that the gene does not differentially expression are all connected in a way that makes intuitive sense.
38
Frequentists, Bayesians and q-values For each observed value of p, the q-value is the largest estimated pFDR if we reject for smaller p-values. In the same Bayesian context, the q-value is the largest probability that H 0 is true, if we reject for smaller p- values. Storey 2003 Efron 2006 Obtains e-Bayes estimates of the distributions of the test statistic under the null and alternative hypotheses, defines local FDR using this Bayesian interpretation and notes the connection between local FDR and q- values.
39
Testing for DE and Multiple Comparisons Adjustments Most of the software previously mentioned includes the capability to incorporation FWER, FDR or some variant. Each uses a slightly different definition of these quantities. The qvalue library in Bioconductor computes variants of FDR from a list of p-values.
40
Number of genes = 68665 estimated 0 = 0.5669248 estimated m-m 0 =29737 number of rejections if pFDR≈0.05 2619 0.1 9202 pFDR if we reject at =0.05 pFDR ≈ 0.14
41
Number of genes = 68665 estimated 0 = 0.934315 estimated m-m 0 =4510 number of rejections if pFDR≈0.05 0 0.1 0 pFDR if we reject at =0.05 pFDR ≈ 0.78
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.