Metagenomics: From Bench to Data Analysis 19-23rd September 2016 16S rRNA-based surveys for Community Analysis: How Quantitative are they? Dr Mark Alston Computational Biologist Organisms and Ecosystems Group mark.alston@earlham.ac.uk
Outline: compare sequencing platforms and 16S rRNA regions; amplicon choice (amplicons vs. full-length rRNA sequencing); bias and quantification; comparison to WGS approaches.
16S Microbial Community Profiling. [Figure: 16S rRNA gene sequence, showing conserved (green) and hypervariable (blue) regions.] Most common phylogenetic marker: the ‘gold standard’ in molecular surveys of bacterial and archaeal diversity. Pros: ubiquitous, highly conserved, evolutionarily stable. Cons: often multiple copy, little resolution at/below species level.
Comparing Different Platforms and Target Regions. ‘A comprehensive benchmarking study of protocols and sequencing platforms for 16S rRNA community profiling’, DOI: 10.1186/s12864-015-2194-9. Sequencing platforms compared: MiSeq (Illumina), Pacific Biosciences RSII, 454 GS-FLX/+ (Roche), IonTorrent (Life Technologies). Target regions were also compared. Performance was assessed via synthetic microbial communities: gDNA mixed from 49 bacterial and 10 archaeal species, in even and uneven distributions. [Table: summary of primers and platforms used.]
Ability of Different Platforms and Regions to Reconstruct the Synthetic Community. Even synthetic community: platform had a significant effect; species’ frequencies highly unbalanced. Possible causes: primer mismatches, rRNA copy number, amplification bias (associated with target length). [Figure axes: bacterial species, target region.]
How do Different rRNA Regions reflect Composition? ‘Comparative metagenomic and rRNA microbial diversity characterization using archaeal and bacterial synthetic communities’, DOI: 10.1111/1462-2920.12086. Synthetic bacterial community. [Figure: heat map of the accuracy ratio; perfect agreement has a value of 1; legend: underestimated abundance, overestimated abundance.] Regions suffer from substantial bias.
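The accuracy ratio behind the heat map can be sketched in a few lines. This is only an illustration, assuming the ratio is the observed relative abundance divided by the expected relative abundance (the slide states only that perfect agreement equals 1). The taxon names and read counts are invented.

```python
def accuracy_ratios(observed_counts, expected_fractions):
    """Return observed/expected relative abundance for each taxon.

    A value of 1 means perfect agreement, below 1 means the taxon is
    underestimated, above 1 means it is overestimated.
    """
    total = sum(observed_counts.values())
    ratios = {}
    for taxon, expected in expected_fractions.items():
        observed = observed_counts.get(taxon, 0) / total
        ratios[taxon] = observed / expected
    return ratios


# Hypothetical even community of four taxa (25% each) with made-up read counts.
expected = {"taxon_A": 0.25, "taxon_B": 0.25, "taxon_C": 0.25, "taxon_D": 0.25}
reads = {"taxon_A": 500, "taxon_B": 250, "taxon_C": 125, "taxon_D": 125}

ratios = accuracy_ratios(reads, expected)
for taxon, ratio in ratios.items():
    label = "over" if ratio > 1 else "under" if ratio < 1 else "exact"
    print(f"{taxon}: accuracy ratio {ratio:.2f} ({label})")
```

Heat maps of this kind are often drawn on a log scale, so that two-fold over- and underestimation sit at equal distances from perfect agreement.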
Which Region Should I Choose? [Figure: 16S rRNA gene sequence, showing conserved (green) and hypervariable (blue) regions.] Most common approach: V4, V3–V4 or V4–V5 primers on Illumina platforms, ~250–430 bp read length. E.g. 16S for V4 on MiSeq: http://www.illumina.com/content/dam/illumina-marketing/documents/products/appnotes/appnote_miseq_16S.pdf
Full-length vs. Amplicon 16S Sequencing. Factors affecting taxon abundance estimates and tree placement: sequencing platform, primer choice, read length, environmental source, reference database, assignment method (or a combination). New technologies: short reads sequence ~15–30% of the full 16S rRNA gene; more quantitative information but reduced taxonomic resolution; species-level assignment can be elusive, with implications for inferring metabolic traits in various ecosystems. Use full-length 16S rRNA sequencing?
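The ~15–30% figure can be checked against the ~250–430 bp amplicon lengths quoted earlier for V4, V3–V4 and V4–V5. The snippet below is a back-of-envelope sketch that assumes a full-length 16S rRNA gene of roughly 1,500 bp. That length is an approximation introduced here, not a number from the slides.

```python
FULL_LENGTH_16S_BP = 1500  # assumed approximate length of the full 16S rRNA gene


def fraction_covered(amplicon_bp, gene_bp=FULL_LENGTH_16S_BP):
    """Fraction of the full-length gene spanned by an amplicon of amplicon_bp."""
    return amplicon_bp / gene_bp


# Shortest and longest amplicon lengths quoted for the common Illumina regions.
for amplicon_bp in (250, 430):
    print(f"{amplicon_bp} bp amplicon covers {fraction_covered(amplicon_bp):.1%} of the gene")
```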
Full-length 16S rRNA Sequencing. PacBio: long-read, single-molecule real-time (SMRT) technology; average read lengths > 8 kb at ~87% read accuracy; only been used for a few environmental surveys (‘High-resolution phylogenetic microbial community profiling’, DOI: 10.1038/ismej.2015.24). MinION™: USB stick-sized device; per-base sequencing accuracy ~85% for 2D reads; additional read length helps resolve 16S rRNA to species level (‘Species level resolution of 16S rRNA gene amplicons sequenced through MinION™ portable nanopore sequencer’, DOI: 10.1186/s13742-016-0111-z).
Full-length 16S rRNA Sequencing and Gene Variability. Non-homogeneous distribution of mutations: varies across different phylogenetic groups; leads to both over- and underestimation of community diversity.
Full-length 16S rRNA Sequencing and Gene Variability. Underestimation example: 2 Salmonella spp. are 97.4% identical across the gene but 100% identical across the V4 region, so a V4-only survey cannot separate them and underestimates community diversity.
Full-length 16S rRNA Sequencing and Gene Variability. Overestimation example: where mutations have accumulated in the V4 region, sequences look more divergent across V4 than across the whole gene, so a V4-only survey overestimates community diversity.
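The two cases above come down to where the mutations fall relative to the sequenced region. The sketch below uses invented 40-column toy alignments and a made-up "region" window (not real 16S sequences or real V4 coordinates). It shows the same number of mutations giving different identities depending on whether they land inside or outside the window.

```python
def percent_identity(seq_a, seq_b):
    """Percent identity between two pre-aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def mutate(seq, positions):
    """Return a copy of seq with a substitution at each given position."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    chars = list(seq)
    for pos in positions:
        chars[pos] = swap[chars[pos]]
    return "".join(chars)


REGION = slice(10, 20)  # hypothetical stand-in for one hypervariable region

reference = "ACGT" * 10
variants = {
    "mutations outside region": mutate(reference, [2, 35]),
    "mutations inside region": mutate(reference, [12, 15]),
}

results = {}
for name, variant in variants.items():
    full = percent_identity(reference, variant)
    region = percent_identity(reference[REGION], variant[REGION])
    results[name] = (full, region)
    print(f"{name}: full length {full:.1f}%, region only {region:.1f}%")
```

With mutations outside the window, the region-only comparison sees two identical sequences (diversity underestimated). With the same number of mutations inside it, the region looks more divergent than the gene as a whole (diversity overestimated).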
Compare FL vs. V4 [Sakinaw Lake samples]. [Figure: community composition profile at genus level; colour pairs denote samples of the same depth; bubble sizes indicate read abundance.]
Compare FL vs. V4 [Sakinaw Lake samples]. FL vs. V4 discrepancies are highlighted by boxes, e.g. Bacillus is greatly underrepresented by V4 compared with PacBio (PB) in the 50 m samples. But it looks possible to draw the same conclusions, because the two profiles have a lot in common. ‘High-resolution phylogenetic microbial community profiling’, DOI: 10.1038/ismej.2015.24
Platforms and Regions Suffer from Substantial Bias. The observed relative frequencies do not reflect the true species frequencies in the community. But the observed differences between samples could still reflect true differences. Can we have a quantitative method despite the bias?
Can 16S rRNA Sequencing be Quantitative? ‘A comprehensive benchmarking study of protocols and sequencing platforms for 16S rRNA community profiling’, DOI: 10.1186/s12864-015-2194-9. Assembled 2 synthetic communities, one with even distribution and one uneven; take pairs of samples; sequence on MiSeq and PacBio platforms.
Can 16S rRNA Sequencing be Quantitative? For each species, compare the true ratio of frequencies [known mixtures] with the observed ratio of frequencies. Highly significant correlation between the two ratios [blue line] and a slope of 1 [red line]. Implies 16S rRNA sequencing is strongly quantitative despite being biased. MiSeq more quantitative than PacBio. [Figure: ratio of observed frequencies against ratio of true frequencies; panels: MiSeq, PacBio.]
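One way to see why biased measurements can still be quantitative is a toy model in which each species carries its own multiplicative bias that is the same in every sample. The sketch below is not the benchmarking study's analysis. The bias factors and community compositions are invented, and real data add noise on top. Under this assumption the per-species bias cancels when two samples are compared, leaving a slope of 1 on a log-log plot with a constant offset from renormalisation.

```python
import math


def normalise(values):
    """Scale values so that they sum to 1 (relative frequencies)."""
    total = sum(values)
    return [v / total for v in values]


def observe(true_freqs, bias):
    """Apply a per-species multiplicative bias, then renormalise to fractions."""
    return normalise([t * b for t, b in zip(true_freqs, bias)])


def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical per-species bias (e.g. primer mismatch, copy number), same in both samples.
bias = [4.0, 1.0, 0.25, 2.0, 0.5]
even = normalise([1, 1, 1, 1, 1])
uneven = normalise([16, 8, 4, 2, 1])

obs_even = observe(even, bias)
obs_uneven = observe(uneven, bias)

# Per-species ratio between the two samples, on a log2 scale.
log_true = [math.log2(u / e) for u, e in zip(uneven, even)]
log_obs = [math.log2(u / e) for u, e in zip(obs_uneven, obs_even)]

fitted_slope = slope(log_true, log_obs)
offsets = [o - t for o, t in zip(log_obs, log_true)]

print("observed even community:", [round(f, 3) for f in obs_even])
print("fitted slope:", round(fitted_slope, 6))
print("offset spread:", round(max(offsets) - min(offsets), 12))
```

The observed frequencies within the even community are far from 20% each, yet the between-sample ratios track the true ratios. Any departure from a slope of 1 in real data points to noise or to bias that differs between samples.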
MiSeq more quantitative than PacBio. Which species are responsible for this difference? Which are more accurately quantified on one platform relative to the other? [Figure: ratio of observed frequencies against ratio of true frequencies; panels: MiSeq, PacBio.]
MiSeq vs. PacBio. Species with significantly different quantification accuracies: MiSeq is the better platform, except for strain resolution, where full-length 16S rRNA sequencing is of benefit (e.g. Shewanella baltica OS223 and Shewanella baltica OS185).
16S Microbial Community Profiling. [Figure: 16S rRNA gene sequence, showing conserved (green) and hypervariable (blue) regions.] Most common approach: V4, V3–V4 or V4–V5 primers on Illumina platforms, ~250–430 bp read length. Economy of scale: a single MiSeq run gives > 10 million reads. High base-calling accuracy. E.g. 16S for V4 on MiSeq: http://www.illumina.com/content/dam/illumina-marketing/documents/products/appnotes/appnote_miseq_16S.pdf
Compare Error Rates Across Platforms. Even synthetic community: platform had a significant effect; MiSeq has the most accurate sequence reads.
Impact of Overlapping Reads on MiSeq V4 Error Rates. Even synthetic community: overlapping forward and reverse reads greatly reduces errors. [Figure labels: MiSeq Dual Index barcode; Illumina barcodes on both reads; ‘stitched’ reads.]
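Why overlapping reads reduce errors can be illustrated with a deliberately simple merge rule: wherever the two reads disagree, keep the base with the higher quality score. This is a sketch only, not the procedure used in the benchmarking study or by any particular read-merging tool. It assumes the reverse read has already been reverse-complemented and that the two reads overlap completely. The sequences and Phred scores are invented.

```python
def merge_overlap(fwd_seq, fwd_qual, rev_seq, rev_qual):
    """Merge two fully overlapping, already aligned reads base by base.

    At each position the base with the higher Phred quality is kept;
    when the reads agree, either base gives the same result.
    """
    if not (len(fwd_seq) == len(fwd_qual) == len(rev_seq) == len(rev_qual)):
        raise ValueError("reads and quality lists must all have the same length")
    merged = []
    for f_base, f_q, r_base, r_q in zip(fwd_seq, fwd_qual, rev_seq, rev_qual):
        merged.append(f_base if f_q >= r_q else r_base)
    return "".join(merged)


# Hypothetical 8 bp amplicon; each read has one error at its low-quality end.
truth = "ACGTACGT"
fwd_seq = "ACGTACGA"   # error at the last base, where forward quality is lowest
fwd_qual = [38, 38, 38, 38, 30, 25, 20, 8]
rev_seq = "TCGTACGT"   # error at the first base, where reverse quality is lowest
rev_qual = [9, 20, 25, 30, 38, 38, 38, 38]

merged = merge_overlap(fwd_seq, fwd_qual, rev_seq, rev_qual)
print("merged:", merged, "matches truth:", merged == truth)
```

The rule works in this toy case because quality tends to fall towards the end of each read, so the weak end of one read is covered by the strong start of the other.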
Shotgun Metagenomics vs. Amplicon Sequencing. ‘Comparative metagenomic and rRNA microbial diversity characterization using archaeal and bacterial synthetic communities’, DOI: 10.1111/1462-2920.12086. Compares amplicon sequencing with Illumina [HiSeq] and 454 metagenomic sequencing. Metagenomic data tend to outperform amplicon sequencing.
Shotgun Metagenomics vs. Amplicon Sequencing. ‘A comprehensive benchmarking study of protocols and sequencing platforms for 16S rRNA community profiling’, DOI: 10.1186/s12864-015-2194-9. [Figure legend: MiSeq, MG sample, expected.] The metagenome (MG) sample benchmark should be relatively unbiased, as there are fewer PCR amplification steps in library construction. WGS gives the most accurate species estimations.
Is 16S “Metagenomics”? Many papers talk about “metagenomics analysis based on microbial 16S rRNA gene sequencing”, “16S metagenomic studies” etc. But rRNA surveys focus on a single gene, not genomes. Is this due to a fear of not getting funded if you don’t include a word containing ‘Meta*omics’? “Referring to 16S surveys as metagenomics is misleading and annoying #badomics #OmicMimicry”: http://phylogenomics.blogspot.co.uk/2012/08/referring-to-16s-surveys-as.html
In Summary. Many sources of bias when we sequence 16S rRNA, e.g. platform, region etc. Can still be quantitative. MiSeq V4 is a good ‘all-round bet’, though prior knowledge of taxa may suggest otherwise; combinations of primers? Full-length for strain resolution. Whole genome shotgun gives better estimations of species abundances.