Noha Youssef, Mostafa Elshahed


Species Richness in Soil Bacterial Communities: A Proposed Approach to Overcome Sample-Size Bias

Noha Youssef, Mostafa Elshahed
Oklahoma State University, Stillwater, OK

Funding provided by: NSF Microbial Observatories Program; JGI-Laboratory sequencing program; OSU start-up funds to Mostafa Elshahed

N-229

Abstract
Estimates of species richness based on 16S rRNA gene clone libraries are increasingly used to gauge the level of bacterial diversity within various ecosystems. However, previous studies have indicated that, regardless of the method used, the species richness estimates obtained depend on the size of the clone libraries analyzed. We here propose an approach to overcome sample-size bias in species richness estimates in complex microbial communities. Parametric (maximum likelihood-based and rarefaction curve-based) and nonparametric approaches were used to estimate species richness in a library of 13,001 near full-length 16S rRNA gene clones derived from soil, as well as in multiple subsets of the original library. The species richness estimates obtained increased with library size. To obtain a sample size-unbiased estimate of species richness, we calculated the theoretical clone library sizes required to encounter the estimated species richness at various clone library sizes, used curve fitting to determine the theoretical clone library size required to encounter the "true" species richness, and subsequently determined the corresponding sample size-unbiased species richness value. Using this approach, sample size-unbiased estimates of 17,230, 15,571, and 33,912 were obtained for the ML-based, rarefaction curve-based, and ACE-1 estimators, compared to bias-uncorrected values of 15,009, 11,913, and 20,909, respectively.

Methods:
DNA extraction: From 0.5 g of soil, using a modified lysis and bead-beating protocol.
PCR: A near-complete 16S rRNA gene fragment was amplified using primer pair 27F and 1391R.
Cloning: Invitrogen TOPO-TA cloning kit.
Sequencing: 13,486 near-complete 16S rRNA gene clones were generated and sequenced at the Department of Energy Joint Genome Institute.
Alignment/distance matrix: Greengenes.
OTU assignment: DOTUR.
Chimera check: Using Mallard, 485 sequences were identified as potential chimeras and removed from the dataset. The remaining 13,001 sequences were designated Kessler Farm Soil clones (KFS clones).

Effect of sample size on various species richness estimates
Subsets of 100, 500, 1,000, and 3,289 clones were randomly drawn from the 13,001-clone KFS dataset and treated as smaller clone libraries. Species richness was estimated for all subset clone libraries using all of the parametric and nonparametric methods described. Regardless of the model used, the estimate of species richness increased with sample size. The plot of estimate versus sample size was generally linear up to the 3,289-clone sample size, after which it started to approach an asymptote.

Towards a sample-size unbiased estimate of species richness
Since the species richness estimate increases with clone library size, the estimated species richness is only a fraction of the true richness. To obtain a sample size-unbiased estimate of species richness, we propose the following approach. Using parametric ML-based species richness estimates (SRest) obtained at different actual clone library sizes (CLact), we calculated the theoretical clone library sizes required to observe the absolute majority (99, 99.9, or 99.99%) of SRest (CLth-99, CLth-99.9, or CLth-99.99, respectively). Repeating the procedure described above with the rarefaction curve-based and ACE-1 estimators, values of 15,571 and 33,912, respectively, were obtained as sample size-unbiased SR estimates. The library size at which CLact = CLth, i.e., the point at which the theoretical clone library effort is met, was 6.3 × 10^6.
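The sample-size dependence described above can be illustrated with a small simulation. The sketch below is not the poster's analysis: it draws a synthetic clone library from an assumed long-tailed (1/rank) community, subsamples it at the subset sizes used on the poster, and applies the standard bias-corrected Chao1 formula, which is assumed here to correspond to the "Chao_bc" estimator. The function names, the community model, and the OTU count of 20,000 are hypothetical choices made for illustration only.

```python
import random
from collections import Counter


def chao1_bc(otu_labels):
    """Bias-corrected Chao1 estimate from one OTU label per clone."""
    abundances = Counter(otu_labels)        # OTU -> number of clones observed
    freq = Counter(abundances.values())     # clone count -> number of OTUs
    s_obs = len(abundances)                 # observed richness
    f1 = freq.get(1, 0)                     # singletons
    f2 = freq.get(2, 0)                     # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def simulate_library(n_clones, n_otus, rng):
    """Draw a synthetic clone library from an assumed 1/rank community."""
    weights = [1.0 / rank for rank in range(1, n_otus + 1)]
    return rng.choices(range(n_otus), weights=weights, k=n_clones)


def richness_by_subset(library, subset_sizes, rng):
    """Observed and estimated richness for random subsets (no replacement)."""
    results = {}
    for size in subset_sizes:
        subset = rng.sample(library, size)
        results[size] = (len(set(subset)), chao1_bc(subset))
    return results


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for the 13,001-clone library; not the KFS data.
    library = simulate_library(n_clones=13001, n_otus=20000, rng=rng)
    sizes = [100, 500, 1000, 3289, len(library)]
    for size, (s_obs, est) in richness_by_subset(library, sizes, rng).items():
        print(f"{size:>6} clones: observed {s_obs:>5}, Chao1-bc {est:>9.1f}")
```

Running the script prints one line per subset size, so the observed and estimated richness can be compared as the library grows. Swapping in real OTU assignments (one label per clone) for the simulated library is the only change needed to apply the same functions to actual data.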
The plot of ML-based SRest vs. CLth-99 suggested 17,230 as a sample size-unbiased SR estimate.

Species richness estimates used
Parametric:
  Rarefaction curve-based. Fitting models: Michaelis-Menten (MM), MM with intercept, double MM, negative exponential, and negative exponential with intercept.
  ML-based (http://www.stat.cornell.edu/~bunge/): Poisson, Negative Binomial, Pareto-mixed Poisson, Lognormal-mixed Poisson, Inverse Gaussian-mixed Poisson, Mixed exponential-mixed Poisson.
Nonparametric (EstimateS): ACE, ACE-1, Chao, Chao_bc.

Introduction
Culture-independent 16S rRNA gene-based analysis has been used to study bacterial, archaeal, and eukaryotic diversity in various habitats. Collectively, these studies demonstrated that the scope of microbial diversity is far broader than previously implied by cultivation-based analysis. The sizes of analyzed clone libraries have been increasing steadily, which has allowed the use of various statistical approaches in evaluating microbial diversity (Dunbar, 2002; Schloss, 2006; Jeon, 2006; Hong, 2006; Hughes, 2001). Species richness estimation methods are either parametric or nonparametric (O'Hara, 2005). Nonparametric estimators have been more commonly used in microbial ecology (Stach, 2004; Schloss, 2005; Hill, 2003). Parametric methods (maximum likelihood (ML)-based and rarefaction curve-based) have recently been used to estimate bacterial and microeukaryotic diversity in complex microbial communities (Jeon, 2006; Hong, 2006; Behnke, 2006; Stoeck, 2007; Zuendorf, 2006; Roesch, 2007). Regardless of the approach used for species richness determination, it has been observed that the estimated species richness value (SRest) depends on the size of the clone library used in the calculation (Dunbar, 2002; Roesch, 2007; Schloss, 2006). The problem could theoretically be rectified by a sampling effort adequate to encounter all bacterial species in the sample.
However, in complex microbial communities, a large fraction of the bacterial species is present in low abundance, and even with recent advances in sequencing technologies (Margulies, 2005), a complete census of all bacterial species remains a daunting task.

Study Site
An undisturbed tallgrass prairie preserve at the Kessler Farm Field Laboratory biological research station in central Oklahoma.

Conclusions and Discussions
We used a large 16S rRNA gene clone library (13,001 clones) to demonstrate the problems associated with SR determination. In spite of the large size of the library, all SRest values continued to increase regardless of the estimation method used.
We used ML-based, rarefaction curve-based, and nonparametric estimates for our sample size-unbiased approach. The ML-based sample size-unbiased estimate (17,230) is still 15% higher than the SRest value obtained using the entire 13,001-clone library (15,009).
The approach highlights the value of using large clone library datasets in SR estimates. The larger the clone library, the closer the rarefaction curve, the CLth vs. CLact plot, and the CLth vs. SRest plot will be to reaching an asymptote.
Pyrosequencing-based approaches could be useful for obtaining large 16S rRNA gene datasets from various habitats.

References
Behnke A, et al. 2006. Appl. Environ. Microbiol. 72, 3623-3636.
Dunbar J, et al. 2002. Appl. Environ. Microbiol. 68, 3035-3045.
Hill TCJ, et al. 2003. FEMS Microbiol. Ecol. 43, 1-11.
Hong S-H, et al. 2006. Proc. Natl. Acad. Sci. USA 103, 117-122.
Hughes JB, et al. 2001. Appl. Environ. Microbiol. 67, 4399-4406.
Jeon S-O, et al. 2006. Appl. Environ. Microbiol. 72, 6578-6583.
Margulies M, et al. 2005. Nature 437, 376-380.
O'Hara RB. 2005. J. Animal Ecol. 74, 375-386.
Roesch LFW, et al. 2007. ISME J. 1, 283-290.
Schloss PD & Handelsman J. 2005. Appl. Environ. Microbiol. 71, 1501-1506.
Schloss PD & Handelsman J. 2006. PLoS Comput. Biol. 2, 786-793.
Stach EM & Bull AT. 2004. Antonie van Leeuwenhoek 87, 3-9.
Stoeck T, et al. 2007. PLoS ONE 8, e278.
Zuendorf A, et al. 2006. FEMS Microbiol. Ecol. 58, 476-491.

Tables
Comparison of mixed-Poisson parametric models used to fit the Kessler Farm Soil dataset frequency data.
Comparison of different models used to fit the rarefaction curve in the Kessler Farm Soil dataset.
Nonparametric estimators of species richness for the Kessler Farm Soil dataset.
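The rarefaction curve-based estimators on this poster fit saturating models such as Michaelis-Menten (MM) to the rarefaction curve and read the asymptote as the richness estimate. The sketch below shows that idea for the plain MM model only, S(n) = s_max * n / (k + n). It swaps in a simple double-reciprocal (1/S versus 1/n) least-squares fit rather than whatever fitting procedure the authors used, and the helper `clones_needed` is a hypothetical illustration of the "library size needed to observe 99% of the estimate" idea under the MM model, not the poster's CLth calculation. All data in the example are synthetic.

```python
def fit_michaelis_menten(sizes, otus):
    """Fit S(n) = s_max * n / (k + n) by least squares on 1/S versus 1/n."""
    xs = [1.0 / n for n in sizes]
    ys = [1.0 / s for s in otus]
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # equals k / s_max
    intercept = y_mean - slope * x_mean    # equals 1 / s_max
    s_max = 1.0 / intercept                # asymptote = richness estimate
    k = slope * s_max                      # half-saturation library size
    return s_max, k


def clones_needed(k, fraction):
    """Library size at which the MM curve reaches `fraction` of s_max.

    Solving s_max * n / (k + n) = fraction * s_max gives
    n = fraction * k / (1 - fraction).
    """
    return fraction * k / (1.0 - fraction)


if __name__ == "__main__":
    # Synthetic, noise-free rarefaction points generated from known values.
    true_s_max, true_k = 1000.0, 500.0
    sizes = [100, 500, 1000, 3000]
    otus = [true_s_max * n / (true_k + n) for n in sizes]

    s_max, k = fit_michaelis_menten(sizes, otus)
    print(f"fitted s_max = {s_max:.1f}, k = {k:.1f}")
    print(f"clones needed for 99% of s_max: {clones_needed(k, 0.99):.0f}")
```

The double-reciprocal fit is chosen here only because it needs no external libraries; it weights small libraries heavily, so a nonlinear least-squares fit is usually preferable on real, noisy rarefaction data.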