Published by Lora Johns. Modified over 5 years ago.
Toward Accurate and Quantitative Comparative Metagenomics
Stephen Nayfach, Katherine S. Pollard Cell Volume 166, Issue 5, Pages (August 2016) DOI: /j.cell Copyright © 2016 Elsevier Inc. Terms and Conditions
Figure 1 Challenges Associated with Estimating the Composition of a Microbial Community from Shotgun DNA Sequencing
(A) A sample from a microbial community composed of four different microbial species. Colored cells (blue, red, green) indicate “known” species that have at least one genome sequence in reference databases. The green cell indicates a species that is rare within the microbial community. DNA contamination includes DNA from the host, laboratory environment, or experimental reagents.
(B) DNA is extracted from the microbial cells in the sample. Extraction efficiency varies for different taxa, depending on the experimental protocol. The amount of DNA extracted per cell depends on growth rate: actively dividing cells yield more genomic DNA, which accumulates at the origin of replication.
(C) Extracted DNA is broken into fragments by mechanical or enzymatic methods. Certain sequences are more likely to be breakpoints.
(D) A library is prepared from DNA fragments and sequenced. DNA fragments with high or low GC% are under-represented in the sequencing reads. Typically millions of short (e.g., 150 bp) reads are generated per sample.
(E) Bioinformatics quality-control steps may be performed to eliminate duplicate reads, trim low-quality bases from read ends, and remove reads from contamination sources or with low-quality scores.
(F) To infer the composition of the microbial community, high-quality reads are either compared to reference sequences or assembled de novo. Reference-based classification cannot account for unknown species and overestimates the abundances of known species. Metagenomic assembly may not detect rare species and overestimates the abundance of abundant species.
Figure 2 Parameters Used for Taxonomic and Functional Profiling
When computing the abundance of taxa and genes, it is important to think about what parameter of the underlying community one wishes to quantify.
(A) A community with ten cells composed of three taxa with different subsets of four different gene families (colored arrows). Two cellular abundance parameters and four gene abundance parameters are defined by examples.
(B) A comparison of gene relative abundance, average genomic copy number, and absolute abundance across three communities (top, middle, and bottom). The red gene is present at one copy per cell and has constant absolute abundance in all communities, but its relative abundance decreases with increasing genome size. The copy number of the blue gene increases with genome size, but its relative abundance is constant.
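The contrast described in panel (B) can be reproduced with a toy calculation. The sketch below is illustrative only and is not taken from the paper: the function name `gene_parameters`, the genome sizes, and the simplification that every cell in a community carries the same genome are all assumptions. Genome size is expressed here as the total number of genes per genome.

```python
def gene_parameters(copies_per_cell, genes_per_genome, n_cells):
    """Return three abundance parameters for one gene family.

    Toy assumption: every cell in the community has the same number of
    copies of the gene and the same total number of genes.
    """
    absolute = copies_per_cell * n_cells      # copies of this gene in the community
    total_genes = genes_per_genome * n_cells  # copies of all genes in the community
    return {
        "absolute_abundance": absolute,
        "relative_abundance": absolute / total_genes,
        "average_copy_number": absolute / n_cells,
    }


# Three hypothetical communities of ten cells with increasing genome size.
genome_sizes = [1000, 2000, 4000]

# "Red" gene: always one copy per cell.
red = [gene_parameters(1, size, 10) for size in genome_sizes]

# "Blue" gene: copy number grows in proportion to genome size.
blue = [gene_parameters(size // 1000, size, 10) for size in genome_sizes]

for size, r, b in zip(genome_sizes, red, blue):
    print(size, r["relative_abundance"], b["relative_abundance"])
```

Under these assumptions the red gene keeps the same absolute abundance while its relative abundance halves each time the genome size doubles, and the blue gene shows the opposite pattern: a rising copy number with a flat relative abundance.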
Figure 3 Differences in Functional Profiles due to Read Length, Library Size, and Quality Control Are Small Compared to Biological Variation
Publicly available metagenomes often differ in their library sizes, read lengths, and quality-control measures, which raises the question of how comparable metagenomes from different studies are. Twenty-six human gut metagenomes of varying quality were processed using different quality-control methods, and the resulting reads were used to estimate the relative abundance of KEGG Orthology Groups (KOs). We compared the variation introduced by these factors (top) with the variation observed between a large set of technical (N = 1,474), biological (N = 144), and non-replicate gut metagenomes (N = 179) from the Human Microbiome Project (Consortium, 2012) that contained at least one million reads (bottom).
Reads were trimmed from their 5′ ends to simulate libraries of different read lengths, metagenomes were downsampled by 95% to simulate libraries of different sizes, and fastq-mcf (Aronesty, 2011) was used for de-duplication and quality filtering. To estimate the average genomic copy number of functional groups, reads were mapped to the integrated catalog of reference genes in the human gut microbiome (Li et al., 2014a, 2014b) using bowtie2 (Langmead and Salzberg, 2012) and normalized by the median coverage of 30 universal single-copy genes (Wu et al., 2013).
The percent variation between two metagenomes was computed by (1) summing the absolute deviations across KOs, (2) dividing this sum by the total abundance of KOs in both metagenomes, and (3) multiplying the result by 100.
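The two calculations described in this caption, normalizing by the median coverage of single-copy marker genes and the three-step percent variation, can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the dictionary-based profile format, and the choice to treat a KO missing from one profile as having abundance zero are assumptions.

```python
from statistics import median


def normalize_by_marker_median(ko_coverage, marker_coverages):
    """Convert per-KO coverage to an estimated average genomic copy number.

    ko_coverage: dict mapping KO identifier -> read coverage.
    marker_coverages: coverages of universal single-copy marker genes
    (the caption uses 30 of them; any non-empty list works here).
    """
    scale = median(marker_coverages)
    if scale <= 0:
        raise ValueError("median marker coverage must be positive")
    return {ko: cov / scale for ko, cov in ko_coverage.items()}


def percent_variation(profile_a, profile_b):
    """Percent variation between two KO abundance profiles.

    (1) sum absolute deviations across KOs,
    (2) divide by the total abundance of KOs in both profiles,
    (3) multiply by 100.
    KOs absent from one profile are assumed to have abundance 0.
    """
    kos = set(profile_a) | set(profile_b)
    abs_dev = sum(abs(profile_a.get(k, 0.0) - profile_b.get(k, 0.0)) for k in kos)
    total = sum(profile_a.values()) + sum(profile_b.values())
    if total == 0:
        raise ValueError("both profiles are empty")
    return 100.0 * abs_dev / total


# Hypothetical profiles for two metagenomes.
sample_a = {"K00001": 2.0, "K00002": 1.0}
sample_b = {"K00001": 1.0, "K00003": 1.0}
print(percent_variation(sample_a, sample_b))
```

With this definition, identical profiles give 0 and profiles that share no KOs give 100, so the measure is bounded and does not depend on the overall scale of the two profiles.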
Figure 4 The Presence of Duplicated Reads Is Largely a Function of Library Size and Microbial Diversity
The percent of duplicated reads, estimated with FASTQC across 181 human gut metagenomes from the Human Microbiome Project, was compared to (A) library size and (B) species-level alpha diversity, measured with the Shannon diversity index (Keylock, 2005). Species abundance of bacteria and archaea was estimated with mOTU (Sunagawa et al., 2013). Together, library size and Shannon diversity explain 63% of the variation in sequence duplication rates.
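For readers unfamiliar with the Shannon diversity index used in panel (B), a minimal sketch follows. It uses the standard definition H = -Σ p_i ln p_i with the natural logarithm; the caption does not state the log base used in the figure, so that choice, the function name, and the example abundances are assumptions.

```python
import math


def shannon_diversity(abundances):
    """Shannon diversity index H = -sum(p_i * ln(p_i)).

    abundances: non-negative species abundances (counts or proportions).
    They are rescaled to proportions here; zero entries contribute nothing.
    """
    total = sum(abundances)
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    proportions = [a / total for a in abundances if a > 0]
    return -sum(p * math.log(p) for p in proportions)


# Hypothetical species abundance profiles for two samples.
even_community = [25, 25, 25, 25]    # four equally abundant species
skewed_community = [97, 1, 1, 1]     # one dominant species

print(shannon_diversity(even_community))
print(shannon_diversity(skewed_community))
```

An even community of four species reaches the maximum value ln(4), while a community dominated by one species scores much lower, which is the sense in which the index captures alpha diversity.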
Figure 5 Growth of Shotgun Metagenome Data in the NCBI Sequence Read Archive
Cumulative size, in terabases, of publicly available shotgun metagenomic data in the NCBI Sequence Read Archive (SRA). Sequencing runs were identified using the SRAdb database (Zhu et al., 2013) with the following filters: library_source = “METAGENOMIC”, study_type = “Metagenomics”, and library_strategy = “WGS”.
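The selection and accumulation behind this figure can be sketched on a toy list of run records. This does not query SRAdb itself: the three filter fields and their values come from the caption, while the record layout, the `date` and `bases` keys, and the example records are hypothetical.

```python
from itertools import accumulate

# The three filter fields and values named in the caption.
FILTERS = {
    "library_source": "METAGENOMIC",
    "study_type": "Metagenomics",
    "library_strategy": "WGS",
}


def cumulative_terabases(runs):
    """Return (date, cumulative terabases) pairs for runs passing FILTERS.

    runs: iterable of dicts holding the filter fields plus a hypothetical
    ISO-format "date" string and a "bases" count.
    """
    selected = [r for r in runs if all(r.get(k) == v for k, v in FILTERS.items())]
    selected.sort(key=lambda r: r["date"])  # ISO dates sort chronologically
    totals = accumulate(r["bases"] for r in selected)
    return [(r["date"], t / 1e12) for r, t in zip(selected, totals)]


# Hypothetical run records; only the first and third pass all three filters.
toy_runs = [
    {"library_source": "METAGENOMIC", "study_type": "Metagenomics",
     "library_strategy": "WGS", "date": "2013-05-01", "bases": 10**12},
    {"library_source": "GENOMIC", "study_type": "Whole Genome Sequencing",
     "library_strategy": "WGS", "date": "2012-06-01", "bases": 5 * 10**12},
    {"library_source": "METAGENOMIC", "study_type": "Metagenomics",
     "library_strategy": "WGS", "date": "2012-01-01", "bases": 2 * 10**12},
]

print(cumulative_terabases(toy_runs))
```

Requiring all three fields to match is a conservative reading of the caption; runs with incomplete or inconsistent metadata would be excluded under this scheme.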