Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops

Module #: Title of Module
2

Module 4 Metagenomic Taxonomic Composition
Morgan Langille Analysis of Metagenomic Data June 22-24, 2016

Learning Objectives of Module
Contrast 16S and metagenomic sequencing Be able to describe several approaches for taxonomic composition of a metagenomics sample Be able to run Metaphlan2 on one or more samples Be able to determine statistically significant differences in taxonomic abundance across sample groups using STAMP

16S vs Metagenomics 16S is targeted sequencing of a single gene which acts as a marker for identification Pros Well established Sequencing costs are relatively cheap (~50,000 reads/sample) Only amplifies what you want (no host contamination) Cons Primer choice can bias results towards certain organisms Usually not enough resolution to identify to the strain level Need different primers usually for archaea & eukaryotes (18S) Doesn’t identify viruses Consequtive basepairs

16S vs Metagenomics Metagenomics: sequencing all the DNA in a sample
Pros No primer bias Can identify all microbes (euks, viruses, etc.) Provides functional information (“What are they doing?”) Cons More expensive (millions of sequences needed) Host/site contamination can be significant May not be able to sequence “rare” microbes Complex bioinformatics

Who is there? Taxonomic Profiles

Metagenomics: Who is there?
Goal: Identify the relative abundance of different microbes in a sample given using metagenomics Problems: Reads are all mixed together Reads can be short (~100bp) Lateral gene transfer Two broad approaches Binning Based Marker Based

Binning Based Attempts to group or “bin” reads into the genome from which they originated Composition-based Uses sequence composition such as GC%, k-mers (e.g. Naïve Bayes Classifier) Fast Sequence-based Compare reads to large reference database using BLAST (or some other similarity search method) Reads are assigned based on “Best-hit” or “Lowest Common Ancestor” approach

LCA: Lowest Common Ancestor
Use all BLAST hits above a threshold and assign taxonomy at the lowest level in the tree which covers these taxa. Notable Examples: MEGAN: One of the first metagenomic tools Does functional profiling too! MG-RAST: Web-based pipeline (might need to wait awhile for results) Kraken: Fastest binning approach to date and very accurate. Large computing requirements (e.g. >128GB RAM)

Marker Based Single Gene Multiple Gene
Identify and extract reads hitting a single marker gene (e.g. 16S, cpn60, or other “universal” genes) Use existing bioinformatics pipeline (e.g. QIIME, etc.) Multiple Gene Several universal genes PhyloSift (Darling et al, 2014) Uses 37 universal single-copy genes Clade specific markers MetaPhlAn2 (Truong et al., 2015)

Marker or Binning? Binning approaches Marker approaches
Similarity search is computationally intensive Varying genome sizes and LGT can bias results Marker approaches Doesn’t allow functions to be linked directly to organisms Genome reconstruction/assembly is not possible Dependent on choice of markers

Why MetaPhlAn? Fast (marker database is considerably smaller)
Markers for bacteria, archaea, eukaryotes, and viruses (since MetaPhlAn2 was released) Being continuously updated and supported Used by the Human Microbiome Project Generally accepted as a robust method for taxonomy assignment Main Disadvantage: not all reads are assigned a taxonomic label

MetaPhlAn Uses “clade-specific” gene markers
A clade represents a set of genomes that can be as broad as a phylum or as specific as a species Uses ~1 million markers derived from 17,000 genomes ~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic Can identify down to the species level (and possibly even strain level) Can handle millions of reads on a standard computer within a few minutes

MetaPhlAn Marker Selection

Using MetaPhlan MetaPhlan uses Bowtie2 for sequence similarity searching (nucleotide sequences vs. nucleotide database) Paired-end data can be used directly (but are treated as independent reads) Each sample is processed individually and then multiple sample can be combined together at the last step Output is relative abundances at different taxonomic levels

Absolute vs. Relative Abundance
Absolute abundance: Numbers represent real abundance of thing being measured (e.g. the actual quantity of a particular gene or organism) Relative abundance: Numbers represent proportion of thing being measured within sample In almost all cases microbiome studies are measuring relative abundance This is due to DNA amplification during sequencing library preparation not being quantitative

Relative Abundance Use Case
Sample A: Has 108 bacterial cells (but we don’t know this from sequencing) 25% of the microbiome from this sample is classified as Shigella Sample B: Has 106 bacterial cells (but we don’t know this from sequencing) 50% of the microbiome from this sample is classified as Shigella “Sample B contains twice as much Shigella as Sample A” WRONG! (If quantified it we would find Sample A has more Shigella) “Sample B contains a greater proportion of Shigella compared to Sample A” Correct!

Visualization and statistics
What is important? Visualization and statistics

Visualization and Statistics
Various tools are available to determine statistically significant taxonomic differences across groups of samples Excel SigmaPlot R MeV (MultiExperiment Viewer) Python (matplotlib) LefSe & Graphlan (Huttenhower Group) STAMP

Visualization and Statistics
Various tools are available to determine statistically significant taxonomic differences across groups of samples Excel SigmaPlot Past R (many libraries) Python (matplotlib) STAMP

STAMP Plots

STAMP Input “Profile file”: Table of features (samples by OTUs, samples by functions, etc.) Features can form a heirarchy (e.g. Phylum, Order, Class, etc) to allow data to be collapsed within the program “Group file”: Contains different metadata for grouping samples Can be two groups: (e.g. Healthy vs Sick) or multiple groups (e.g. Water depth at 2M, 4M, and 6M) Output PCA, heatmap, box, and bar plots Tables of significantly different features

Metagenomics Workflow
Putting it all together Metagenomics Workflow

Microbiome Helper Microbiome Helper is an open resource aimed at helping researchers process and analyze their microbiome data Combines bioinformatic methods from different groups and is updated as newer and better methods are released Scripts to wrap and integrate existing tools Available as an Ubuntu Virtualbox Tutorials/Walkthroughs

Microbiome Helper Wiki

Microbiome Helper Vbox

Integrated Microbiome Resource
Sequencing and Bioinformatics Support Integrated Microbiome Resource

IMR: Sequencing and bioinformatics service for microbiome projects

(16S / 18S amplicons on the Illumina MiSeq)
Microbiome Amplicon Sequencing Workflow (16S / 18S amplicons on the Illumina MiSeq) DNA extraction 16S (V6-V8) or 18S (V4) PCR Gel verification PCR clean-up & library normalization Illumina MiSeq sequencing Method/kit appropriate to specific samples (ex: stool, urine, etc.) Time = 1-3 d QC Duplicate with template dilutions Multiplexing to 384 samples/run Only 1 PCR w/fusion primers: i5 index F primer R i7 P5 adapter P7 adapter 16S/18S sequence Time = 0.5 d Invitrogen E-gel 96-well high-throughput method QC Time = 1 h Invitrogen SequalPrep 96-well high-throughput method Time = 1.5 h QC bp paired-end reads ~25 M reads = ~15 Gb ~65 k reads/sample (for 384) Time = ~3 d QC Quality-control check/step Total Time = 1 week CGEB-IMR.ca • DalhousieU • March 2015

IMR Achievements

IMR Collaborators & Clients

Pricing

Questions?

Canadian Bioinformatics Workshops

Similar presentations

Presentation on theme: "Canadian Bioinformatics Workshops"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Canadian Bioinformatics Workshops

Similar presentations

Presentation on theme: "Canadian Bioinformatics Workshops"— Presentation transcript:

Similar presentations

About project

Feedback