Differential Methylation Analysis Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews
A basic question…
Factors to consider
- Number of observations
- Magnitude of effect
- Technical considerations
- Biological variability
- Biological common sense
The problem of power…
- Ideally want to cover every cytosine (CpG)
- Have to correct for the number of tests
- There’s no way you’ll collect enough data to analyse each C and have p-values which survive multiple testing correction
- Stats have to find a way to work round this
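To make the cost of correcting for the number of tests concrete, here is a minimal sketch of two common corrections. The slides do not name a specific method, so the choice of Bonferroni and Benjamini-Hochberg here is an assumption, and the function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff needed to keep the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # With a million tests, a single test must reach p <= 5e-8 under Bonferroni
    print(bonferroni_threshold(0.05, 1_000_000))
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

The more positions are tested individually, the smaller the raw p-value each one needs, which is why the following slides look at ways of reducing the number of tests.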
Maximising power
Options:
- Analyse in windows
- Pre-filter
- Hierarchical or adaptive filtering
Window sizes
Small windows:
- Good resolution
- Specific biological effects
- High multiple testing correction (MTC) burden
- Few observations
- High p-values
Large windows:
- Lots of data
- High statistical power
- Low MTC burden
- Low p-values
- Effect averaging
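The window trade-off can be seen by pooling per-CpG calls into fixed-size windows. This is a minimal sketch; the input layout (position, methylated count, unmethylated count) and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def window_counts(calls, window_size):
    """Pool (position, meth_count, unmeth_count) calls into fixed-size windows.

    Returns a dict mapping window start -> (meth_count, unmeth_count).
    """
    windows = defaultdict(lambda: [0, 0])
    for position, meth, unmeth in calls:
        # Start coordinate of the window this position falls into
        start = (position // window_size) * window_size
        windows[start][0] += meth
        windows[start][1] += unmeth
    return {start: tuple(counts) for start, counts in sorted(windows.items())}


if __name__ == "__main__":
    calls = [(10, 3, 1), (90, 2, 2), (150, 0, 4), (1020, 5, 5)]
    print(window_counts(calls, 100))   # more windows, fewer calls in each
    print(window_counts(calls, 1000))  # fewer windows, more calls in each
```

Larger windows give fewer tests and more observations per test, at the cost of averaging the methylated and unmethylated stretches together.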
Simple Statistical Approach
Is the proportion of methylated calls different between two samples, given the number of observations?
Meth count A | Unmeth count A | Meth count B | Unmeth count B | % change | Significant?
2 100 No 200 198 5 1.5 50 75 60 11 Probably
Contingency tests
- Chi-square / G-test / Fisher’s exact test
  - Differ only at low observations
  - Significant changes require enough observations that any of these should give the same answer
- Operates on single replicates
- Technical measure of difference
Meth A   Unmeth A
Meth B   Unmeth B
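A minimal sketch of two of these tests on the 2x2 table of methylated and unmethylated calls, written with the standard library only. The function names are hypothetical, the chi-square version uses no continuity correction, and it assumes every row and column total is non-zero.

```python
import math


def chi_square_2x2(meth_a, unmeth_a, meth_b, unmeth_b):
    """Pearson chi-square test (1 d.f., no continuity correction).

    Assumes all row and column totals are non-zero.
    """
    table = [[meth_a, unmeth_a], [meth_b, unmeth_b]]
    row = [meth_a + unmeth_a, meth_b + unmeth_b]
    col = [meth_a + meth_b, unmeth_a + unmeth_b]
    total = row[0] + row[1]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    # Upper tail of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def fisher_exact_2x2(meth_a, unmeth_a, meth_b, unmeth_b):
    """Two-sided Fisher's exact test: sum of all tables no more likely than observed."""
    row_a = meth_a + unmeth_a
    row_b = meth_b + unmeth_b
    col_meth = meth_a + meth_b
    denom = math.comb(row_a + row_b, col_meth)

    def prob(k):
        # Hypergeometric probability of k methylated calls in sample A
        return math.comb(row_a, k) * math.comb(row_b, col_meth - k) / denom

    observed = prob(meth_a)
    lowest = max(0, col_meth - row_b)
    highest = min(row_a, col_meth)
    total = sum(
        prob(k) for k in range(lowest, highest + 1)
        if prob(k) <= observed * (1 + 1e-7)  # tolerance for floating point ties
    )
    return min(1.0, total)


if __name__ == "__main__":
    print(chi_square_2x2(20, 30, 30, 20))
    print(fisher_exact_2x2(20, 30, 30, 20))
```

Both functions treat each sample as a single pool of calls, so they measure a technical difference between two libraries rather than biological variation.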
Chi-Square results
Biological considerations
- Minimum relevant effect size?
  - Balance power vs change
  - What makes biological sense (what would you follow up?)
- Minimum coverage worth testing
  - No point testing poorly covered regions
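A pre-filter along these lines might look like the sketch below. The thresholds (10 calls, 10 percentage points) and the field names are placeholders chosen for illustration, not recommendations from the slides.

```python
def prefilter(regions, min_coverage=10, min_diff=10.0):
    """Keep regions with enough calls in both samples and a large enough change.

    Each region is a dict with meth_a, unmeth_a, meth_b and unmeth_b counts.
    min_coverage and min_diff (percentage points) are illustrative defaults.
    """
    kept = []
    for region in regions:
        cov_a = region["meth_a"] + region["unmeth_a"]
        cov_b = region["meth_b"] + region["unmeth_b"]
        # No point testing poorly covered regions
        if cov_a < min_coverage or cov_b < min_coverage:
            continue
        pct_a = 100.0 * region["meth_a"] / cov_a
        pct_b = 100.0 * region["meth_b"] / cov_b
        # Drop changes too small to be worth following up
        if abs(pct_a - pct_b) < min_diff:
            continue
        kept.append(region)
    return kept


if __name__ == "__main__":
    regions = [
        {"name": "r1", "meth_a": 2, "unmeth_a": 1, "meth_b": 0, "unmeth_b": 3},
        {"name": "r2", "meth_a": 50, "unmeth_a": 50, "meth_b": 52, "unmeth_b": 48},
        {"name": "r3", "meth_a": 80, "unmeth_a": 20, "meth_b": 30, "unmeth_b": 70},
    ]
    print([r["name"] for r in prefilter(regions)])
```

Only the regions that pass the filter are tested, so the multiple testing correction is applied to that smaller number.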
Effect of pre-filtering
Distribution of methylation
The chi-square test relies on a normal approximation to the counts, and methylation data isn’t normally distributed
Beta binomial distribution More relevant statistics than chi-square. Need to fit custom model to actual data.
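For reference, the beta-binomial probability of seeing k methylated calls out of n can be written down directly. This sketch only evaluates the distribution for given shape parameters; fitting those parameters to real data, which the slide says is needed, is not shown. The names alpha and beta are the usual shape parameters, not values taken from the slides.

```python
import math


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(k, n, alpha, beta):
    """P(k methylated calls out of n) when the methylation level is Beta(alpha, beta)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(
        log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)
    )


if __name__ == "__main__":
    n = 10
    # alpha = beta = 0.5 puts most of the weight near 0% and 100% methylation
    for k in range(n + 1):
        binomial = math.comb(n, k) * 0.5 ** n
        print(k, round(beta_binomial_pmf(k, n, 0.5, 0.5), 4), round(binomial, 4))
```

Compared with a plain binomial at 50% methylation, the beta-binomial with small shape parameters gives far more probability to fully methylated or fully unmethylated outcomes, which is the extra variability the model is meant to capture.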
Implications of a beta distribution
- Many summaries assume normality
  - Mean
  - Standard deviation
  - Boxplots
- None of these is strictly appropriate when looking at methylation data
Dealing with replicates
- Simple approach
  - Merge data from replicates together
  - Single test, high power
  - Post-hoc test for consistency
- Explicitly account for batch effects
  - Logistic regression
  - Measures batch effects and excludes them from final significance calculation
- Work with methylation values
  - Normalise percentage methylation values
  - Use conventional statistics (t-tests etc.) for comparing groups
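The simple approach can be sketched as pooling the counts and then checking that the individual replicates agree. The consistency rule used here (every replicate in one group lies above every replicate in the other) is one hypothetical choice of post-hoc check; the slides do not specify which test to use.

```python
def merge_replicates(replicates):
    """Pool a list of (meth_count, unmeth_count) replicates into one pair of counts."""
    meth = sum(m for m, _ in replicates)
    unmeth = sum(u for _, u in replicates)
    return meth, unmeth


def percent_methylation(meth, unmeth):
    return 100.0 * meth / (meth + unmeth)


def consistent_direction(group_a, group_b):
    """True if the two groups' replicate methylation levels do not overlap.

    Assumes every replicate has at least one call.
    """
    pct_a = [percent_methylation(m, u) for m, u in group_a]
    pct_b = [percent_methylation(m, u) for m, u in group_b]
    return min(pct_a) > max(pct_b) or max(pct_a) < min(pct_b)


if __name__ == "__main__":
    group_a = [(40, 10), (35, 15), (45, 5)]
    group_b = [(20, 30), (25, 25), (15, 35)]
    print(merge_replicates(group_a), merge_replicates(group_b))
    print(consistent_direction(group_a, group_b))
```

The pooled counts would go into a single contingency test; the consistency check then guards against a hit driven by one outlying replicate.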
Hierarchical testing
- Test larger regions
  - Windows / features etc.
- Take significant hits and subdivide
  - Smaller windows
  - Individual CpGs
  - Correct only for these tests
- Assemble hits together to make up DMRs
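A two-stage version of this idea can be sketched as follows. The chi-square test, the Bonferroni correction at each stage and the data layout are all assumptions made for illustration, and the final step of assembling neighbouring hits into DMRs is left out.

```python
import math


def chi_square_p(meth_a, unmeth_a, meth_b, unmeth_b):
    """P-value of a 2x2 Pearson chi-square test (1 d.f., no continuity correction)."""
    rows = (meth_a + unmeth_a, meth_b + unmeth_b)
    cols = (meth_a + meth_b, unmeth_a + unmeth_b)
    if 0 in rows or 0 in cols:
        return 1.0  # nothing to compare
    total = rows[0] + rows[1]
    stat = total * (meth_a * unmeth_b - unmeth_a * meth_b) ** 2 / (
        rows[0] * rows[1] * cols[0] * cols[1]
    )
    return math.erfc(math.sqrt(stat / 2))


def hierarchical_test(regions, alpha=0.05):
    """regions maps a region name to a list of per-CpG count tuples
    (meth_a, unmeth_a, meth_b, unmeth_b).

    Returns {region: [indices of significant CpGs]} for regions passing stage 1.
    """
    if not regions:
        return {}
    # Stage 1: pool counts per region, correct for the number of regions only
    stage1_cutoff = alpha / len(regions)
    hits = []
    for name, cpgs in regions.items():
        pooled = [sum(column) for column in zip(*cpgs)]
        if chi_square_p(*pooled) <= stage1_cutoff:
            hits.append(name)
    # Stage 2: test individual CpGs, correcting only for the CpGs in the hits
    n_stage2 = sum(len(regions[name]) for name in hits)
    result = {}
    for name in hits:
        result[name] = [
            i for i, cpg in enumerate(regions[name])
            if chi_square_p(*cpg) <= alpha / n_stage2
        ]
    return result


if __name__ == "__main__":
    regions = {
        "cgi_1": [(90, 10, 10, 90), (50, 50, 50, 50)],
        "cgi_2": [(50, 50, 50, 50), (48, 52, 51, 49)],
    }
    print(hierarchical_test(regions))
```

The saving comes from the second stage: the individual CpGs are corrected against the handful of positions inside significant regions rather than against every CpG in the genome.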
Hierarchical testing
[Figure: genome and CGI levels, with X marks on some regions]
Statistically ‘creative’ solution to not having enough data
Methylation statistics packages
- swDMR (Perl/R package): sliding window DMR finding (choose between t-test, Kolmogorov, Fisher, ChiSquare, Wilcoxon for n = 2; ANOVA, Kruskal for n > 2)
- methylKit* (R package by A. Akalin et al.): sliding window, Fisher’s exact test or logistic regression. Adjusts p-values to q-values using the SLIM method.
- bsseq* (R/Bioconductor by K.D. Hansen): implements the BSmooth smoothing algorithm. Numerous CpG-wise t-tests and a p-value cutoff to define DMRs. Outperforms Fisher’s exact test. Requires biological replicates for DMR detection.
- BiSeq* (R/Bioconductor by K. Hebestreit et al.): beta regression model, impractical for very large data other than RRBS or targeted BS-Seq
- RnBeads* (R package by F. Mueller et al.): works for 450K arrays, BS-Seq, MeDIP or MBD-Seq data
- DMAP* (C command line tool by P. Stockwell et al.): RRBS fragment or fixed window approach, Fisher’s exact test, Chi-squared or ANOVA
- RADMeth (C++ command line tool by E. Dolzhenko and A.D. Smith): beta-binomial regression analysis to find DMCs or DMRs, local likelihood, adjusts for neighbouring CpGs
- MOABS* (C++ command line tool by D. Sun et al.): beta-binomial hierarchical model to capture sampling and biological variation; Credible Methylation Difference (CDIF), a single metric that combines biological and statistical significance
- ComMet (Y. Saito et al., 2014): part of the Bisulfighter suite; DMR detection based on hidden Markov models (HMMs) that enable automated adjustment of DMC chaining criteria. Does not require biological replicates.
- DSS (R/Bioconductor by Feng et al., 2014): constructs a genome-wide prior distribution for the beta-binomial dispersion. Bayesian hierarchical model to detect differentially methylated loci.
- more appearing every other week…
* interface well with
Tool: bsseq
  Statistical test: Sample-wise smoothing, then group differences via CpG-wise t-tests (p-value cutoff to define adjacent CpG sites as DMRs)
  Suitable for: WGBS; not designed for targeted BS-Seq or RRBS
  Implementation: R package / Bioconductor
  Notes: Outperforms Fisher’s exact test; intended to compare 2 groups; replicates required
Tool: BiSeq
  Statistical test: Define CpG clusters, smooth methylation data, model and test group effect (fitting a beta regression model to smoothed methylation levels and testing for group effect using the Wald test), hierarchical testing procedure on CpG clusters, then define DMR boundaries
  Suitable for: RRBS; targeted BS-Seq
  Implementation: R package / Bioconductor
  Notes: Very computationally intensive for WGBS; not limited to 2 groups
Tool: methylKit
  Statistical test: Models CpG methylation within a logistic regression. Sliding linear model (SLIM) to correct for multiple testing
  Suitable for: (e)RRBS
  Implementation: R package
* WGBS = whole genome BS-Seq; (e)RRBS = (enhanced) reduced representation BS-Seq
bsseq – for whole genome BS-Seq
- Smoothing of low coverage BS-Seq first to get reliable semi-local methylation estimates
  - Not suitable for captured or restricted data
- After smoothing it uses biological replicates to estimate biological variation and identify differentially methylated regions (DMRs)
- Smoothing suitable for even a single sample
- Works for CpG context in humans, will probably not scale to 2x585M Cs in non-CG context
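The idea of borrowing information from neighbouring CpGs can be illustrated with a plain kernel-weighted average. This is not the BSmooth implementation, which fits a local-likelihood model; it is a much simpler stand-in, and the tricube kernel, bandwidth and function names are assumptions made for the sketch.

```python
def tricube(u):
    """Tricube kernel weight: 1 at distance 0, falling to 0 at |u| >= 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def smooth_methylation(positions, meth, coverage, bandwidth):
    """Kernel-weighted methylation estimate at each CpG position.

    positions: CpG coordinates; meth: methylated calls; coverage: total calls.
    Neighbours within `bandwidth` bases contribute, weighted by distance.
    """
    smoothed = []
    for centre in positions:
        weighted_meth = 0.0
        weighted_cov = 0.0
        for pos, m, c in zip(positions, meth, coverage):
            weight = tricube((pos - centre) / bandwidth)
            weighted_meth += weight * m
            weighted_cov += weight * c
        # No usable calls within the bandwidth: no estimate possible
        smoothed.append(weighted_meth / weighted_cov if weighted_cov > 0 else float("nan"))
    return smoothed


if __name__ == "__main__":
    positions = [100, 150, 200, 5000]
    meth = [1, 0, 3, 2]
    coverage = [1, 2, 4, 4]
    print(smooth_methylation(positions, meth, coverage, bandwidth=1000))
```

At low coverage a single CpG with one methylated call out of one reads as 100% methylated; pooling with weighted neighbours pulls that estimate towards the local level, which is why smoothing makes low-coverage whole genome data usable.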
BSmooth algorithm black: 25x (Lister) pink: 4x (Lister)
BSmooth t-values