Fewer permutations, more accurate P-values Theo A. Knijnenburg 1,*, Lodewyk F. A. Wessels 2, Marcel J. T. Reinders 3 and Ilya Shmulevich 1 1Institute for Systems Biology, Seattle, WA, USA, 2Bioinformatics and Statistics, The Netherlands Cancer Institute, Amsterdam and 3Information and Communication Theory Group, Delft University of Technology, Delft, The Netherlands Bioinformatics (12):i161-i168
How to obtain accurate P-values with fewer permutations?
The P-value of a permutation test is the probability of obtaining a result at least as extreme as the test statistic, given that the null hypothesis is true.
Null hypothesis: the labels assigning samples to classes are interchangeable.
[Figure: expression table for five genes; columns Array 1 to Array 8 under the groups "Normal conditions" and "Stress conditions", plus a final column T holding each gene's test statistic]
[Figure: the same table after permuting the labels; columns in the order Array 8, 4, 6, 3, 5, 1, 7, 2 under the groups "Normal conditions" and "Stress conditions", plus a final column T.perm holding each gene's permutation value]
For each test (gene), the P-value is assessed by performing all possible permutations and computing the fraction of permutation values that are at least as extreme as the test statistic obtained from the unpermuted data. In practice, because performing all permutations may be infeasible, only a subset of the Nall possible permutations is computed.
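The procedure above can be sketched for a single gene as follows. This is a minimal illustration, not the authors' code: the Welch-style t statistic, the two-sided comparison, the function names and the expression values are all assumptions made for the example.

```python
import random
from statistics import mean, variance


def t_statistic(a, b):
    """Welch-style two-sample t statistic (an assumed choice of statistic)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se


def permutation_p_value(normal, stress, n_perm=1000, seed=0):
    """Fraction of random label permutations whose |t| is at least the observed |t|."""
    rng = random.Random(seed)
    observed = abs(t_statistic(normal, stress))
    pooled = list(normal) + list(stress)
    n = len(normal)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign samples to the two classes at random
        if abs(t_statistic(pooled[:n], pooled[n:])) >= observed:
            exceed += 1
    return exceed / n_perm


# Hypothetical expression values for one gene: 4 normal arrays, 4 stress arrays.
normal = [5.1, 4.8, 5.3, 5.0]
stress = [7.9, 8.4, 8.1, 7.7]
print(permutation_p_value(normal, stress))
```

Because the permutations are sampled at random rather than enumerated, the result is an estimate of the exact permutation P-value and varies with the seed.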
Problems:
- Permutation-obtained P-values need N permutations to achieve 1/N accuracy.
- The smallest achievable P-value is 1/N. Example: 6 samples, 2 conditions -> N = 20, so p_min = 1/20 = 0.05.
- Multiple-testing adjustment of P-values leads to even bigger (less meaningful) P-values; in the most conservative adjustment, p_adj = p * N_tests.
- A large number of permutations may be computationally intensive, infeasible or impossible.
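The arithmetic behind the example can be checked with a short sketch. It assumes the 6 samples are split 3 versus 3 (the only split that gives N = 20), and the number of tests (100) is a hypothetical value chosen for illustration.

```python
from math import comb


def min_p_value(n_samples, n_group):
    """Number of distinct label assignments and the smallest achievable P-value."""
    n_all = comb(n_samples, n_group)
    return n_all, 1 / n_all


def bonferroni(p, n_tests):
    """Most conservative adjustment: p_adj = p * N_tests, capped at 1."""
    return min(1.0, p * n_tests)


n_all, p_min = min_p_value(6, 3)   # assumed 3 vs 3 split
print(n_all, p_min)                # 20 0.05
print(bonferroni(p_min, 100))      # 1.0 (hypothetical 100 tests)
```

With so few samples, even the most extreme gene cannot survive the adjustment, which is the motivation for estimating the tail rather than counting.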
The authors propose to estimate small P-values from a permutation test using extreme value theory (Gumbel, 1958). The set of extreme (very large or very small) permutation values that forms the tail of the distribution of permutation values is modeled as a generalized Pareto distribution (GPD).
- z: the exceedances, z_i = t_i - t_0
- a: the scale parameter
- k: the shape parameter
Maximum likelihood (ML) estimation is employed to estimate a and k given the set of exceedances Z. k 1 impossible
[Figure panels: original distribution of the statistic; generalized Pareto distribution fitted to the extreme values]
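For reference, the GPD cumulative distribution function can be written as below, using the sign convention under which k < 0 gives a heavy tail. The tail P-value formula is a sketch under two assumptions not spelled out on the slide: t_0 is the threshold above which values count as exceedances, and x_0 is the observed test statistic, with N_exc exceedances among N permutation values.

```latex
F(z) =
\begin{cases}
1 - \left(1 - \dfrac{k z}{a}\right)^{1/k}, & k \neq 0, \\[2ex]
1 - \exp\left(-\dfrac{z}{a}\right), & k = 0,
\end{cases}
\qquad
P_{\mathrm{gpd}} = \frac{N_{\mathrm{exc}}}{N}\,\bigl(1 - F(x_0 - t_0)\bigr)
```

The factor N_exc/N rescales the fitted tail so that it accounts only for the fraction of permutation values lying above the threshold.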
GPD tail approximation of an F distribution.
(a) From the PDF of the F distribution, 5000 samples are drawn. Samples that exceed 5 are defined as the exceedances and are modeled using a GPD. The GPD approximation of the tail (scaled to the interval [1 - Nexc/N, 1]) is depicted alongside the theoretical CDF.
(b) The theoretical P-value, which is derived from the CDF of the F distribution (Pf), is compared with the ECDF approximation (Pecdf) and the GPD approximation (Pgpd) for values of x0 > 5.
PDF: probability density function; GPD: generalized Pareto distribution; CDF: cumulative distribution function; ECDF: empirical cumulative distribution function.
Selection of the threshold
- Too low: the values are not extreme and cannot be modeled by a GPD.
- Too high: only a few samples are available, giving large errors in the estimates.
Procedure:
- Perform a certain number of permutations.
- Treat the 250 most extreme permutation values as exceedances.
- Perform a goodness-of-fit test to assess whether these 250 values follow a GPD.
- If not, decrease the number of exceedances iteratively by 10 until a good fit to the GPD is reached.
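The iterative threshold search can be sketched as a loop around a goodness-of-fit test. The test itself is passed in as `fits_gpd`, a hypothetical callback standing in for whatever test is actually used. Placing the threshold midway between the last exceedance and the next permutation value is also an assumption of this sketch.

```python
def select_exceedances(perm_values, fits_gpd, start=250, step=10):
    """Shrink the exceedance set until the supplied goodness-of-fit test accepts it.

    fits_gpd is a hypothetical callback: it takes a list of exceedances and
    returns True when they are judged to follow a GPD.
    """
    ordered = sorted(perm_values, reverse=True)
    n_exc = min(start, len(ordered) - 1)
    while n_exc > 0:
        # Assumed threshold: midway between the smallest exceedance and the next value.
        threshold = (ordered[n_exc - 1] + ordered[n_exc]) / 2
        exceedances = [v - threshold for v in ordered[:n_exc]]
        if fits_gpd(exceedances):
            return threshold, exceedances
        n_exc -= step  # too many values included: retry with fewer exceedances
    return None  # no acceptable fit was found


# Usage with a stand-in test that accepts any set of at most 200 exceedances.
result = select_exceedances(list(range(1000)), lambda z: len(z) <= 200)
print(result[0], len(result[1]))
```

Keeping the goodness-of-fit test as a parameter makes the loop independent of which test is chosen.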
When to use the GPD
The GPD can only be used when the test statistic is extreme, i.e. in the tail of the distribution. If, say, 50 of the 100 permutation values exceed the test statistic, the GPD tail approximation is useless and the standard empirical method is adequate.
Criterion: if at least 10 permutation values are >= the test statistic, compute the standard empirical P-value; otherwise use the GPD.
This is because:
- The number of extreme permutation values M follows a binomial distribution (Bernoulli trials with success probability Pperm), with M = Pecdf * N.
- According to the central limit theorem, if M >= 10, the binomial distribution of M can be approximated by a normal distribution.
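The criterion amounts to a simple switch on the exceedance count M. The sketch below only decides which estimator applies and returns the empirical P-value when it is adequate; the function name and return format are illustrative choices, and the GPD fit itself is left out.

```python
def choose_estimator(perm_values, test_statistic, min_exceed=10):
    """Return ("ecdf", P-value) when M >= min_exceed, otherwise ("gpd", None)."""
    m = sum(1 for v in perm_values if v >= test_statistic)  # exceedance count M
    n = len(perm_values)
    if m >= min_exceed:
        return "ecdf", m / n   # standard empirical P-value, Pecdf = M / N
    return "gpd", None         # too few exceedances: fit the tail instead


perm = list(range(100))             # hypothetical permutation values 0..99
print(choose_estimator(perm, 50))   # ('ecdf', 0.5)
print(choose_estimator(perm, 95))   # ('gpd', None)
```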
Minimum number of permutations (Nc) required for convergence to the correct P-value
Results on 7 theoretical distributions:
- The GPD always requires fewer permutations than the ECDF.
- The difference between the methods is bigger for smaller P-values and for distributions with heavier tails.
The theoretical permutation test P-value is obtained by evaluating the CDF at the value of the test statistic.
[Figure: Nc for the ECDF and GPD methods; panels labeled "Light-tailed" and "Heavy-tailed"]
Pecdf and Pgpd for an F distribution
- The ECDF approximation converges to the correct P-value linearly with the number of permutations, N.
- The GPD approximation converges with far fewer permutations: a decent estimate of Pperm is obtained with 10^4 permutation values.
- However, when N >> 1/Pperm, there is a lot of variability in Pgpd.
Application to differential gene expression analysis
- Chose 132 relevant genes and computed the t-statistic for differential expression.
- Performed permutations until M > 25 (or N = 10^9), then estimated the *true* Pperm from these permutations.
- Computed Pecdf and Pgpd for different N, up to N = 10^5.
- Repeated 200 times.
Application to differential gene expression analysis
- Transformed the test statistic and its permutation values such that k < 0, i.e. the tail becomes heavier.
- Raised all test statistics and the corresponding permutation values to the power of three.
- Result: a better estimate with much less variance.
- A reasonable estimate of P-values < 10^-7 can now be made using only 10^5 permutations.
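Cubing is order-retaining, so it leaves the empirical P-value untouched and changes only the shape of the tail that the GPD is fitted to. The first half of that statement can be checked with a small sketch; the permutation values and test statistic below are hypothetical.

```python
def ecdf_p_value(perm_values, test_statistic):
    """Empirical P-value: fraction of permutation values >= the test statistic."""
    m = sum(1 for v in perm_values if v >= test_statistic)
    return m / len(perm_values)


def cube(values):
    """Order-retaining transformation: raise every value to the power of three."""
    return [v ** 3 for v in values]


perm = [-2.1, -0.4, 0.3, 1.2, 2.8, 3.5]   # hypothetical permutation values
t = 2.5                                   # hypothetical test statistic

p_raw = ecdf_p_value(perm, t)
p_cubed = ecdf_p_value(cube(perm), t ** 3)
print(p_raw == p_cubed)                   # True: the ranking is unchanged
```

An odd power is used because it is monotonic over negative values as well, whereas an even power would not preserve the ordering.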
Application to GSEA
- Test statistics: t-statistics for differentially expressed gene sets.
- Pperm: generated permutation values until M > 25 (no more than 10^6 permutations).
- Chose 89 gene sets for which M > 25 within the 10^6 permutations and which had Pperm < 0.01.
- Compared the correctly ordered list of 89 gene sets based on Pperm with the ordered lists based on Pgpd and Pecdf (Spearman rank correlation).
[Figure: comparison of Cecdf and Cgpd for different values of N]
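The ranking comparison can be sketched with a plain implementation of the Spearman rank correlation. This version assumes there are no tied values (ties would need averaged ranks), and the P-values in the usage example are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation via the squared rank-difference formula (no ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical P-values for four gene sets: reference ordering vs. an estimate.
p_perm = [1e-5, 3e-4, 2e-3, 8e-3]
p_est = [2e-5, 1e-3, 9e-4, 7e-3]
print(spearman(p_perm, p_est))
```

A correlation close to 1 means the estimated P-values order the gene sets almost exactly as the reference P-values do.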
Convergence criteria, developed to suit practical applications:
1. Convergence: little variation of Pest with increasing N, from Nc/10 to Nc.
2. Accuracy: the 25th to 75th confidence bounds of the P-value estimate deviate < 10% from Pest (criterion only for GPD).
where Pest(N) is the estimated P-value (either Pecdf or Pgpd) after N permutations, P est (N) is the % confidence bound on the estimated P-value, and Nc is the minimum number of permutations at which these criteria are met.
Application to differential gene expression analysis
- Applied these convergence criteria to 5 exemplary genes from the differential gene expression analysis, chosen so that the range of P-values is large.
- Increased the number of permutations until the convergence criteria were met (or N reached 10^6).
- Repeated 25 times.
- Examined the effect of order-retaining transformations on the test statistic and its permutation values.
[Table: the 25th and 75th percentile values of Nc and the corresponding P-value estimates (Pecdf and Pgpd) for five different genes; Z is the power to which the statistic is raised]
Summary
- Usually, the same number of permutations is performed for each test statistic, but different test statistics require different numbers of permutations.
- In most applications, the large majority of test statistics will require only a small number of permutations to reliably compute their large (and hence insignificant) P-values, while only a small fraction of the test statistics will be significant, i.e. they will require many permutations to reliably estimate their small P-values.
- Simple convergence criteria and confidence bounds on the estimate can be used to indicate when enough permutations have been performed to have a certain statistical confidence in the P-value estimate.
- Such an approach can decrease the total number of permutations, and thus the computation time, while producing more accurate P-value estimates.
- A web interface for the proposed method is under development.
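The idea of spending permutations only where they are needed can be illustrated with a simple adaptive loop. This sketch is not the authors' convergence criteria: it stops once M >= 10 exceedances have been seen or a cap is reached. The batch size, the cap and the mean-difference statistic are all assumptions of the example.

```python
import random
from statistics import mean


def mean_difference(a, b):
    """Assumed test statistic for the sketch: difference of group means."""
    return mean(a) - mean(b)


def adaptive_permutations(statistic, group_a, group_b,
                          min_exceed=10, batch=100, max_perm=10_000, seed=0):
    """Permute in batches until min_exceed exceedances are seen or max_perm is reached."""
    rng = random.Random(seed)
    observed = statistic(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_done = exceed = 0
    while exceed < min_exceed and n_done < max_perm:
        for _ in range(batch):
            rng.shuffle(pooled)
            if statistic(pooled[:n_a], pooled[n_a:]) >= observed:
                exceed += 1
        n_done += batch
    return n_done, exceed


# A clearly insignificant gene needs few permutations; a significant one needs more.
print(adaptive_permutations(mean_difference, [1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5]))
print(adaptive_permutations(mean_difference, [10.0, 11.0, 12.0, 13.0], [1.0, 2.0, 3.0, 4.0]))
```

Tests that hit the cap with fewer than 10 exceedances are the ones for which a tail approximation such as the GPD would be considered.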