Published by Blaise Sullivan. Modified over 8 years ago.
1
Canadian Bioinformatics Workshops www.bioinformatics.ca
3
Module 3 Metagenomic Taxonomic Composition
4
Module 3 bioinformatics.ca
Learning Objectives of Module
– Understand the pros and cons of 16S versus metagenomic sequencing
– Understand different approaches for determining the taxonomic composition of a metagenomic sample
– Be able to run MetaPhlAn2 on one or more samples
– Be able to determine statistically significant differences in taxonomic abundance across sample groups using STAMP
5
16S vs Metagenomics
16S is targeted sequencing of a single gene that acts as a marker for identification
Pros
– Well established
– Sequencing is relatively cheap (~10,000 reads/sample)
– Only amplifies what you want (no host contamination)
Cons
– Primer choice can bias results towards certain organisms
– Usually not enough resolution to identify to the strain level
– Different primers are usually needed for archaea and eukaryotes (18S)
– Doesn't identify viruses
6
16S vs Metagenomics
Metagenomics: sequencing ALL the DNA in a sample
Pros
– Less bias from sequencing
– Can identify all microbes (eukaryotes, viruses, etc.)
– Provides functional information ("What are they doing?")
Cons
– Host/site contamination can be significant
– Expensive (more sequencing depth is required)
– May not be able to sequence "rare" microbes
– Complex bioinformatics
7
Metagenomics: Who is there?
Goal: Identify the relative abundance of different microbes in a sample using metagenomics
Problems:
– Reads are all mixed together
– Reads can be short (~100 bp)
– Lateral gene transfer
Two broad approaches
1. Binning based
2. Marker based
8
Binning Based
Attempts to "bin" reads into the genome from which they originated
Composition-based
– Uses GC composition or k-mers (e.g. Naïve Bayes Classifier)
– Generally not very precise and not recommended
Sequence-based
– Compare reads to a large reference database using BLAST (or some other similarity search method)
– Reads are assigned based on a "Best-hit" or "Lowest Common Ancestor" approach
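The composition features mentioned above (GC content and k-mer counts) are simple to compute. The following is a minimal sketch of the feature extraction only, not of any particular classifier; the function names `gc_content` and `kmer_profile` and the example read are made up for illustration.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of bases in the read that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_profile(seq: str, k: int = 4) -> Counter:
    """Count every overlapping k-mer in the read."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Hypothetical short read, for illustration only.
read = "ATGCGCGATATCGCGC"
print(gc_content(read))
print(kmer_profile(read, k=2).most_common(3))
```

A composition-based binner would compare such profiles against profiles of reference genomes. With ~100 bp reads the profiles are noisy, which is one reason these methods tend to be imprecise.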
9
LCA: Lowest Common Ancestor
Use all BLAST hits above a threshold and assign taxonomy at the lowest level in the tree which covers these taxa.
Notable examples:
– MEGAN: http://ab.inf.uni-tuebingen.de/software/megan5/
   One of the first metagenomic tools
   Does functional profiling too!
– MG-RAST: https://metagenomics.anl.gov/
   Web-based pipeline (might need to wait a while for results)
– Kraken: https://ccb.jhu.edu/software/kraken/
   Fastest binning approach to date and very accurate
   Large computing requirements (e.g. >128 GB RAM)
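The LCA rule can be sketched in a few lines. This is an illustrative toy, not the implementation used by MEGAN, MG-RAST, or Kraken: it assumes each hit arrives as a (lineage, bit score) pair with lineages listed from kingdom downwards, and the threshold `min_bitscore`, the function names, and the example hits are all hypothetical.

```python
def lowest_common_ancestor(lineages):
    """Return the longest prefix of ranks shared by every lineage."""
    if not lineages:
        return []
    shared = []
    for ranks in zip(*lineages):
        if all(r == ranks[0] for r in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


def assign_read(hits, min_bitscore=50.0):
    """Keep hits at or above the threshold, then take their LCA."""
    lineages = [lineage for lineage, score in hits if score >= min_bitscore]
    return lowest_common_ancestor(lineages)


# Hypothetical hits for one read: (lineage from kingdom to genus, bit score).
hits = [
    (["Bacteria", "Proteobacteria", "Gammaproteobacteria",
      "Enterobacterales", "Enterobacteriaceae", "Escherichia"], 210.0),
    (["Bacteria", "Proteobacteria", "Gammaproteobacteria",
      "Enterobacterales", "Enterobacteriaceae", "Shigella"], 205.0),
    (["Bacteria", "Firmicutes", "Bacilli",
      "Lactobacillales", "Lactobacillaceae", "Lactobacillus"], 30.0),
]
print(assign_read(hits))
```

Here the two strong hits disagree at the genus level, so the read is assigned to the family Enterobacteriaceae. The weak third hit falls below the threshold and does not pull the assignment up to the kingdom level.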
10
Marker Based
Single gene
– Identify and extract reads hitting a single marker gene (e.g. 16S, cpn60, or other "universal" genes)
– Use an existing bioinformatics pipeline (e.g. QIIME)
Multiple gene
– Several universal genes
   PhyloSift (Darling et al, 2014)
   » Uses 37 universal single-copy genes
– Clade-specific markers
   MetaPhlAn (Segata et al, 2012)
11
Marker or Binning?
Binning approaches
– May be too computationally intensive
– May not adequately reflect organism abundances due to differences in genome size
Marker approaches
– Don't allow functions to be linked directly to organisms
– Genome reconstruction is not possible
– Very sensitive to choice of markers
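The genome-size issue with binning can be shown with a small calculation: an organism with a larger genome contributes more reads per cell, so raw read counts overstate it. The sketch below divides read counts by genome length before converting to proportions. The function name, organism labels, and all numbers are hypothetical, and real tools may correct for this differently.

```python
def normalize_by_genome_size(read_counts, genome_sizes):
    """Convert binned read counts into proportions corrected for genome length."""
    coverage = {org: read_counts[org] / genome_sizes[org] for org in read_counts}
    total = sum(coverage.values())
    return {org: cov / total for org, cov in coverage.items()}


# Two hypothetical organisms present in equal cell numbers,
# but organism A has a genome five times larger than organism B.
read_counts = {"A": 5000, "B": 1000}
genome_sizes = {"A": 5_000_000, "B": 1_000_000}

raw_total = sum(read_counts.values())
raw = {org: n / raw_total for org, n in read_counts.items()}
print(raw)  # organism A appears to dominate the sample
print(normalize_by_genome_size(read_counts, genome_sizes))
```

After correction both organisms come out at equal proportions, which matches the equal cell numbers assumed in the example.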
12
Why MetaPhlAn?
– Fast (marker database is considerably smaller)
– Markers for bacteria, archaea, eukaryotes, and viruses (since MetaPhlAn2 was released)
– Being continuously updated and supported
– Used by the Human Microbiome Project
– Generally accepted as a robust method for taxonomy assignment
Main disadvantage: not all reads are assigned a taxonomic label
13
MetaPhlAn
– Uses "clade-specific" gene markers
– A clade represents a set of genomes that can be as broad as a phylum or as specific as a species
– Uses ~1 million markers derived from 17,000 genomes: ~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic
– Can identify down to the species level (and possibly even strain level)
– Can handle millions of reads on a standard computer within a few minutes
14
MetaPhlAn
Open-source:
– https://bitbucket.org/biobakery/metaphlan2
15
MetaPhlAn Marker Selection
16
MetaPhlAn Marker Selection (continued)
17
Using MetaPhlAn
– MetaPhlAn uses Bowtie2 for sequence similarity searching (nucleotide sequences vs. nucleotide database)
– Paired-end data can be used directly
– Each sample is processed individually, and multiple samples can then be combined at the last step
– Output is relative abundances at different taxonomic levels
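The final "combine the samples" step amounts to joining per-sample abundance tables on the clade name. The sketch below shows that idea only and is not MetaPhlAn's own merge utility. The function name `merge_profiles`, the sample names, and the abundance values are hypothetical, and the clade labels merely imitate MetaPhlAn-style lineage strings.

```python
def merge_profiles(profiles):
    """Join per-sample {clade: relative abundance (%)} dicts into one table.

    Clades missing from a sample are filled with 0.0.
    """
    samples = list(profiles)
    clades = sorted({clade for profile in profiles.values() for clade in profile})
    table = [["clade"] + samples]
    for clade in clades:
        table.append([clade] + [profiles[s].get(clade, 0.0) for s in samples])
    return table


# Hypothetical per-sample profiles (relative abundance in percent).
profiles = {
    "sample1": {"k__Bacteria": 100.0, "k__Bacteria|p__Firmicutes": 60.0},
    "sample2": {"k__Bacteria": 100.0, "k__Bacteria|p__Proteobacteria": 100.0},
}
for row in merge_profiles(profiles):
    print("\t".join(str(cell) for cell in row))
```

Filling absent clades with zero keeps every sample column the same length, which downstream statistics tools generally expect.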
18
Absolute vs. Relative Abundance
Absolute abundance: numbers represent the real abundance of the thing being measured (e.g. the actual quantity of a particular gene or organism)
Relative abundance: numbers represent the proportion of the thing being measured within a sample
In almost all cases microbiome studies are measuring relative abundance
– This is because DNA amplification during sequencing library preparation is not quantitative
19
Relative Abundance Use Case
Sample A:
– Has 10^8 bacterial cells (but we don't know this from sequencing)
– 25% of the microbiome from this sample is classified as Shigella
Sample B:
– Has 10^6 bacterial cells (but we don't know this from sequencing)
– 50% of the microbiome from this sample is classified as Shigella
"Sample B contains twice as much Shigella as Sample A"
– WRONG! (If we quantified it, we would find that Sample A has more Shigella)
"Sample B contains a greater proportion of Shigella compared to Sample A"
– Correct!
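The arithmetic behind this use case can be checked directly. The snippet below uses only the cell counts and percentages given above; the function name `absolute_count` is made up for illustration.

```python
def absolute_count(total_cells, fraction):
    """Absolute number of cells of a taxon, given total cells and its proportion."""
    return total_cells * fraction


shigella_a = absolute_count(10**8, 0.25)  # Sample A: 25% of 10^8 cells
shigella_b = absolute_count(10**6, 0.50)  # Sample B: 50% of 10^6 cells

print(shigella_a)               # 25000000.0
print(shigella_b)               # 500000.0
print(shigella_a / shigella_b)  # 50.0
```

Sample A holds 50 times more Shigella cells than Sample B even though its Shigella proportion is half as large. Sequencing alone reports only the proportions.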
20
Visualization and Statistics
Various tools are available to determine statistically significant taxonomic differences across groups of samples
– Excel
– SigmaPlot
– R
– MeV (MultiExperiment Viewer)
– Python (matplotlib)
– LEfSe & GraPhlAn (Huttenhower Group)
– STAMP
21
STAMP
23
STAMP Plots
24
STAMP Input
1. "Profile file": table of features (samples by OTUs, samples by functions, etc.)
   Features can form a hierarchy (e.g. Phylum, Class, Order, etc.) to allow data to be collapsed within the program
2. "Group file": contains metadata for grouping samples
   Can be two groups (e.g. Healthy vs Sick) or multiple groups (e.g. water depth at 2 m, 4 m, and 6 m)
Output
– PCA, heatmap, box, and bar plots
– Tables of significantly different features
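Collapsing a hierarchical profile means summing the counts of all features that share the same labels down to a chosen level. STAMP does this internally. The sketch below only illustrates the idea, and the row layout (hierarchy labels first, then one count per sample), the function name `collapse`, and the data are assumptions made for the example.

```python
def collapse(rows, n_levels, level):
    """Sum per-sample counts over all rows sharing the first `level` labels.

    rows     -- lists of `n_levels` hierarchy labels followed by sample counts
    n_levels -- number of hierarchy columns in each row
    level    -- depth to collapse to (1 = top level, e.g. phylum)
    """
    totals = {}
    for row in rows:
        key = tuple(row[:level])
        counts = row[n_levels:]
        previous = totals.get(key, [0] * len(counts))
        totals[key] = [p + c for p, c in zip(previous, counts)]
    return totals


# Hypothetical profile: Phylum, Class, Order, then counts for two samples.
profile = [
    ["Firmicutes", "Bacilli", "Lactobacillales", 10, 4],
    ["Firmicutes", "Clostridia", "Clostridiales", 5, 6],
    ["Proteobacteria", "Gammaproteobacteria", "Enterobacterales", 7, 20],
]
print(collapse(profile, n_levels=3, level=1))
```

Collapsing to level 1 merges the two Firmicutes rows into a single phylum-level feature, while the Proteobacteria row is left unchanged.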
25
Questions?
26
We are on a Coffee Break & Networking Session