A Different Paradigm to Detect Differential Abundance of Taxa in Microbiome Data
Mateen Shaikh and Joseph Beyene, McMaster University, December 18, 2015



The microbiome is a collection of microscopic organisms which both influences and is influenced by its environment. One of Nutrigen’s objectives is to determine how the infant gut microbiome both impacts and is impacted by a variety of factors. One factor which influences the infant gut microbiome is breastfeeding.

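[Table: samples cross-tabulated by breastfeeding status (Breastfeeding, Not Breastfeeding, Total) and ethnicity (South Asian, White European, Total); cell counts not preserved.]
Samples were processed in Mike Surette’s lab. The analysis picks up at the OTU table.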

[OTU table excerpt: sample columns SA10 to SA18; cell values not preserved.]

Main Question: Which (few) taxa exhibit the most considerable differential abundance between samples collected while the child was (or was not) breastfeeding?
Typically we:
1. Estimate differential abundance of taxa independently
2. Perform some test using the estimate
3. Perform multiple-testing correction
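The three typical steps can be sketched in a few lines. This is only an illustration under stated assumptions: the taxon names and relative abundances are made up, the test is a simple permutation test on the difference in group means, and the correction is Benjamini-Hochberg. The slides do not say which estimate, test, or correction is usually used.

```python
import random

def permutation_pvalue(values, labels, n_perm=2000, seed=0):
    """Steps 1-2: estimate a mean difference and test it by permutation."""
    rng = random.Random(seed)

    def mean_diff(lbls):
        a = [v for v, l in zip(values, lbls) if l]
        b = [v for v, l in zip(values, lbls) if not l]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = mean_diff(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign group labels at random
        if mean_diff(shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def benjamini_hochberg(pvalues):
    """Step 3: Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical relative abundances; True marks a breastfeeding sample.
breastfeeding = [True, True, True, True, False, False, False, False]
abundance = {
    "taxon_A": [0.40, 0.42, 0.38, 0.45, 0.10, 0.12, 0.09, 0.11],
    "taxon_B": [0.20, 0.25, 0.22, 0.21, 0.23, 0.20, 0.24, 0.22],
}

# Each taxon is tested independently of the others.
raw = {t: permutation_pvalue(v, breastfeeding) for t, v in abundance.items()}
adjusted = dict(zip(raw, benjamini_hochberg(list(raw.values()))))
print(raw, adjusted)
```

Note that each taxon is handled on its own here, which is the point of contrast with a model that considers all taxa jointly.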

The Poisson is the basic distribution for counts when there is no set maximum (0, 1, 2, 3, …; e.g. the number of conifers in a forest). There are some concerns about how well it and its variants model real microbiome data. Modelling improves if the average is allowed to vary non-randomly, as in regression. Imagine wanting to know the botanical composition of several forests: samples from different forests will vary in size, and compositions can differ, though we don’t know what differs ‘significantly’ (our objective).
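For reference, the Poisson model with a mean that varies by sample and taxon can be written as follows. The symbols are assumptions introduced here, not taken from the slides: X_{ij} is the count of taxon j in sample i, s_i reflects the size of sample i, and g_j the overall incidence of taxon j.

```latex
\[
X_{ij} \sim \mathrm{Poisson}(\mu_{ij}), \qquad
P(X_{ij} = x) = \frac{e^{-\mu_{ij}}\,\mu_{ij}^{x}}{x!}, \quad x = 0, 1, 2, \dots
\]
\[
\mu_{ij} = s_i \, g_j
\quad\Longleftrightarrow\quad
\log \mu_{ij} = \log s_i + \log g_j
\]
```

In the forest analogy, s_i accounts for how large a sample was taken from forest i, and g_j for how common species j is overall.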

Varying parameters by sample size and incidence is an established technique. edgeR [1] and DESeq [2] use this approach for RNA-Seq. Extending this for differential expression [3] has also been introduced. With appropriate changes for abundance data, the modelling performance should be the same.
1. Robinson, MD, McCarthy, DJ, Smyth, GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26.
2. Anders S and Huber W (2010). Differential expression analysis for sequence count data. Genome Biology, 11, pp. R
3. Witten, Daniela M. Classification and clustering of sequencing data using a Poisson model. Annals of Applied Statistics. 5 (2011), no. 4, 2493–2518.
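A minimal sketch of what varying the parameters by sample size and incidence can look like for an OTU table. It is loosely modelled on the Poisson parameterisation of the Witten (2011) reference as recalled here, not on anything the slides confirm: the expected count under no differential abundance is (sample total × taxon total) / grand total, and a per-class ratio of observed to expected counts, smoothed by a hypothetical constant `beta`, indicates over- or under-abundance. The table and labels are made up.

```python
def expected_counts(table):
    """Expected counts N[i][j] = row_total_i * col_total_j / grand_total.

    table[i][j] is the observed count of taxon j in sample i.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def class_ratios(table, labels, beta=1.0):
    """Smoothed observed/expected ratio per class and taxon.

    A ratio above 1 suggests the taxon is over-represented in that class.
    beta is a hypothetical smoothing constant, not a value from the slides.
    """
    expected = expected_counts(table)
    n_taxa = len(table[0])
    ratios = {}
    for k in sorted(set(labels)):
        members = [i for i, label in enumerate(labels) if label == k]
        ratios[k] = [
            (sum(table[i][j] for i in members) + beta)
            / (sum(expected[i][j] for i in members) + beta)
            for j in range(n_taxa)
        ]
    return ratios

# Hypothetical OTU table: 4 samples (rows) by 2 taxa (columns).
otu = [[10, 90], [20, 80], [50, 50], [60, 40]]
status = ["bf", "bf", "not", "not"]

expected = expected_counts(otu)
ratios = class_ratios(otu, status)
print(ratios)
```

Ranking taxa by how far these ratios sit from 1 is one way to order them by apparent differential abundance.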

Response was a measure of disease progression after one year. Covariates included:

#    variable   usual p-value
1    age        0.86
2    sex        (missing)
3    BMI        4.3e-14
4    MAP        1.0e-6
5    tc         0.06
6    ldl        0.16
7    hdl        0.63
8    tch        0.27
9    ltg        1.6e-5
10   glu        0.31

Efron, Bradley; Hastie, Trevor; Johnstone, Iain; Tibshirani, Robert. Least angle regression. Ann. Statist. 32 (2004), no. 2, 407–499.

Something is fishy about the results from the Poisson model. At the phylum level, the order of the most considerable differences matches the order of abundance.

The approach works at any level, even when there are more taxa than samples*. It can even mix different taxonomic levels in a single model, but they should be disjoint (e.g. don’t include a phylum if one of its subclasses is in the model).

There is still a threshold to choose (we have a ranking of several hundred). Many inferences, e.g. p-values, can still be found in this paradigm. There are yet more flexible Poisson-based models to consider, and agreement between models has been promising so far. Not discussed: different implementations can lead to different results, although so far there has also been large agreement between implementations. This approach cannot determine whether ethnicity or diet is driving differences; at best, it chooses one over the other for numerical reasons.

The model can be forced into a regression framework, permitting other covariates (continuous, categorical, etc.), e.g. ethnicity, time since weaning, and diet. This is often of greater interest but, fortunately, less challenging once the details of the approach are finalized.
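As an illustration of the regression framing, the sketch below fits a Poisson regression for a single taxon, with one covariate and the sample's total read count as an offset, using Newton's method. It is a generic Poisson log-linear fit under assumed names and made-up data, not the authors' model; the function `fit_poisson_offset` and all values are hypothetical.

```python
import math

def fit_poisson_offset(y, x, exposure, iters=50, tol=1e-12):
    """Fit log(mu_i) = log(exposure_i) + b0 + b1 * x_i by Newton's method."""
    b0 = math.log(sum(y) / sum(exposure))  # start at the overall rate
    b1 = 0.0
    for _ in range(iters):
        mu = [n * math.exp(b0 + b1 * xi) for n, xi in zip(exposure, x)]
        # Gradient of the Poisson log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # Entries of the 2x2 information matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for mi, xi in zip(mu, x))
        h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system directly.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Hypothetical data for one taxon: counts, a 0/1 covariate, and read totals.
counts = [12, 30, 7, 45, 20, 9]
covariate = [0, 1, 0, 1, 1, 0]
read_totals = [1000, 1500, 800, 2000, 1200, 900]

b0, b1 = fit_poisson_offset(counts, covariate, read_totals)
print(b0, b1)
```

With a binary covariate the fit reproduces the two group rates exactly; a continuous covariate such as time since weaning would go through the same code unchanged, and further covariates would enlarge the linear system.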