A Different Paradigm to Detect Differential Abundance of Taxa in Microbiome Data
Mateen Shaikh and Joseph Beyene
McMaster University
December
The microbiome is a microscopic collection of organisms that both influences and is influenced by its environment. One of Nutrigen’s objectives is to determine how the infant gut microbiome both affects and is affected by a variety of factors. One factor that influences the infant gut microbiome is breastfeeding.
Table: number of samples by breastfeeding status (rows: Breastfeeding, Not Breastfeeding, Total) and ethnicity (columns: South Asian, White European, Total).
Samples were processed in Mike Surette’s lab. The analysis picks up at the OTU table.
OTU table excerpt: samples SA10, SA11, SA12, SA13, SA14, SA15, SA16, SA17, SA18, …
Main Question: Which (few) taxa exhibit the most considerable differential abundance between samples collected while the child was breastfeeding and samples collected while the child was not?
Typically we:
1. Estimate the differential abundance of each taxon independently
2. Perform some test using the estimate
3. Perform a multiple-testing correction
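The three typical steps can be sketched as follows. This is a minimal illustration rather than the analysis behind these slides: the counts are hypothetical, the test is a normal approximation to a conditional comparison of two Poisson rates, and Benjamini-Hochberg stands in for whichever multiple-testing correction is actually used. Pooling counts within each group also ignores sample-to-sample variability.

```python
import math

def poisson_rate_test(count_a, count_b, depth_a, depth_b):
    """Two-sided test that a taxon has the same rate in groups A and B.

    Conditional on the total count, count_a is Binomial(total, p0) with
    p0 = depth_a / (depth_a + depth_b) when the rates are equal; a normal
    approximation to that binomial gives the p-value.
    """
    total = count_a + count_b
    if total == 0:
        return 1.0
    p0 = depth_a / (depth_a + depth_b)
    z = (count_a - total * p0) / math.sqrt(total * p0 * (1 - p0))
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical pooled counts per taxon: (breastfeeding, not breastfeeding)
counts = {"taxon_A": (900, 300), "taxon_B": (480, 520), "taxon_C": (20, 80)}
depth_bf = sum(c[0] for c in counts.values())
depth_not = sum(c[1] for c in counts.values())
names = list(counts)

# Step 1: estimate, here the log2 ratio of relative abundances
log2_fc = {
    n: math.log2((counts[n][0] / depth_bf) / (counts[n][1] / depth_not))
    for n in names
}
# Step 2: test each taxon independently
raw_p = [poisson_rate_test(counts[n][0], counts[n][1], depth_bf, depth_not)
         for n in names]
# Step 3: correct for multiple testing
adj_p = benjamini_hochberg(raw_p)

for n, p in zip(names, adj_p):
    print(n, round(log2_fc[n], 2), p)
```

Each taxon is handled on its own here, which is the feature of the typical workflow that the talk contrasts with its alternative.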
The Poisson is the basic distribution for counts when there is no set maximum (0, 1, 2, 3, …; e.g. the number of conifers in a forest).
There are some concerns about how well it and its variants model real microbiome data.
Modelling improves if the average is allowed to vary systematically, as in regression.
Imagine wanting to know the botanical composition of several forests:
Samples from different forests will vary in size.
Compositions can differ, though we don’t know what differs ‘significantly’ (our objective).
Varying parameters by sample size and incidence is an established technique.
edgeR [1] and DESeq [2] use this approach for RNA-Seq.
An extension of this for differential expression [3] has also been introduced.
With appropriate changes for abundance data, the approach should give the same modelling performance.
1. Robinson, MD, McCarthy, DJ, Smyth, GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26,
2. Anders S and Huber W (2010). Differential expression analysis for sequence count data. Genome Biology, 11, pp. R
3. Witten, Daniela M. (2011). Classification and clustering of sequencing data using a Poisson model. Annals of Applied Statistics 5, no. 4, 2493–2518.
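As a rough sketch of what varying the Poisson mean by sample size and incidence can look like: under a null model the expected count for sample i and taxon j is N[i][j] = s_i * g_j, estimated by (row total × column total) / grand total, and a group effect is the ratio of observed to expected counts within a group. This follows the general form of the Poisson log-linear model in reference 3, without any smoothing of the ratios; the function names, the counts, and the group labels are hypothetical, and it is an assumption that the talk's model has this exact form.

```python
def expected_counts(table):
    """Null expected counts N[i][j] = s_i * g_j for a samples-by-taxa table.

    s_i reflects the size of sample i and g_j the incidence of taxon j;
    the estimate is row_total_i * col_total_j / grand_total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

def group_effects(table, labels):
    """Ratio d[k][j] of observed to expected counts of taxon j in group k.

    Assumes every taxon has a positive total count, so no expected
    count is zero.
    """
    expected = expected_counts(table)
    n_taxa = len(table[0])
    effects = {}
    for k in sorted(set(labels)):
        rows = [i for i, lab in enumerate(labels) if lab == k]
        effects[k] = [
            sum(table[i][j] for i in rows) / sum(expected[i][j] for i in rows)
            for j in range(n_taxa)
        ]
    return effects

# Hypothetical counts: 4 samples (rows) by 3 taxa (columns)
table = [[30, 10, 10], [60, 20, 20], [10, 30, 10], [20, 60, 20]]
labels = ["bf", "bf", "not", "not"]
d = group_effects(table, labels)
print(d)
```

A ratio near 1 means the taxon is about as abundant in that group as sample size and overall incidence alone would predict; ratios far from 1 flag candidate differences.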
Response was a measure of disease progression after one year. Covariates included

#   variable  usual p-value
1   age       0.86
2   sex
3   BMI       4.3e-14
4   MAP       1.0e-6
5   tc        0.06
6   ldl       0.16
7   hdl       0.63
8   tch       0.27
9   ltg       1.6e-5
10  glu       0.31

Efron, Bradley; Hastie, Trevor; Johnstone, Iain; Tibshirani, Robert. Least angle regression. Ann. Statist. 32 (2004), no. 2, 407–499.
Something is fishy about the results from the Poisson model: at the phylum level, the order of the most considerable differences matches the order of abundance.
The approach works at any taxonomic level, even when there are more taxa than samples*.
Different taxonomic levels can even be mixed in a single model, but the taxa should be disjoint (e.g. don’t include a phylum if one of its subclasses is in the model).
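The disjointness requirement can be checked mechanically before fitting. The sketch below assumes each selected taxon is written as its lineage from the root down, so that one taxon contains another exactly when its lineage is a prefix of the other's. The function name and the example selection are hypothetical.

```python
def overlapping_taxa(lineages):
    """Return (ancestor, descendant) pairs among the selected taxa.

    Each taxon is a tuple of names from the root down, e.g.
    ("Firmicutes", "Clostridia"). Two taxa overlap when one lineage is
    a prefix of the other; identical lineages also count as overlapping.
    """
    pairs = []
    for i, a in enumerate(lineages):
        for j, b in enumerate(lineages):
            if i != j and len(a) <= len(b) and b[:len(a)] == a:
                pairs.append((a, b))
    return pairs

# Hypothetical selection mixing a phylum with classes
selected = [
    ("Firmicutes",),
    ("Firmicutes", "Clostridia"),
    ("Bacteroidetes", "Bacteroidia"),
]
clashes = overlapping_taxa(selected)
print(clashes)  # Firmicutes contains Firmicutes/Clostridia
```

An empty result means the selection is disjoint and the taxa can be used together in one model.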
There is still a threshold to choose (we have a ranking of several hundred).
Many inferences, e.g. p-values, can still be obtained in this paradigm.
There are yet more flexible Poisson-based models to consider.
Agreement between models has been promising so far.
Not discussed: different implementations can lead to different results. So far, there has also been broad agreement between different implementations.
This approach cannot determine whether ethnicity or diet is driving differences. At best, it chooses one over the other for numerical reasons.
The model can be forced into a regression framework, permitting other covariates (continuous, categorical, etc.), e.g.:
ethnicity
time since weaning
diet
This is often of greater interest and, fortunately, less challenging once the details of the approach are finalized.
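One way such a regression form could be written is given below. This is a sketch under assumptions, not the authors' stated model: X_{ij} is the count of taxon j in sample i, s_i and g_j are sample-size and incidence terms, x_i is the covariate vector for sample i (e.g. ethnicity, time since weaning, diet), and \beta_j holds the taxon-specific coefficients. All of these symbols are introduced here for illustration.

```latex
\[
  X_{ij} \sim \mathrm{Poisson}(\mu_{ij}), \qquad
  \log \mu_{ij} = \log s_i + \log g_j + x_i^{\top} \beta_j
\]
```

With a single binary covariate for breastfeeding status, \exp(\beta_j) is the multiplicative change in the expected count of taxon j between the two groups, after accounting for sample size and incidence.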