Microbiome 16S analysis Morgan G. I. Langille, PhD Assistant Professor

Slides:



Advertisements
Similar presentations
Clostridium difficile Colitis or Dysbiosis. Symbiostasis/Dysbiosis.
Advertisements

Metabarcoding 16S RNA targeted sequencing
 Aim in building a phylogenetic tree is to use a knowledge of the characters of organisms to build a tree that reflects the relationships between them.
Molecular Evolution Revised 29/12/06
Practical Bioinformatics Community structure measures for meta-genomics István Albert Bioinformatics Consulting Center Penn State.
The Microbiome and Metagenomics
Metagenomic Analysis Using MEGAN4
From Metagenomic Sample to Useful Visual Anna Shcherbina 01/10/ Anna Shcherbina Bioinformatics Challenge Day 02/02/2013 From Metagenomic Sample to.
Accurate estimation of microbial communities using 16S tags Julien Tremblay, PhD
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Current Challenges in Metagenomics: an Overview Chandan Pal 17 th December, GoBiG Meeting.
Metagenomic Analysis Using MEGAN4 Peter R. Hoyt Director, OSU Bioinformatics Graduate Certificate Program Matthew Vaughn iPlant, University of Texas Super.
Abstract Our current understanding of the taxonomic and phylogenetic diversity of cellular organisms, especially the bacteria and archaea, is mostly based.
Analyzing Time Course Data: How can we pick the disappearing needle across multiple haystacks? IEEE-HPEC Bioinformatics Challenge Day Dr. C. Nicole Rosenzweig.
The Microbiome and Metagenomics
Accurate estimation of microbial communities using 16S tags
__________________________________________________________________________________________________ Fall 2015GCBA 815 __________________________________________________________________________________________________.
Ayesha M.Khan Spring Phylogenetic Basics 2 One central field in biology is to infer the relation between species. Do they possess a common ancestor?
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
MEGAN analysis of metagenomic data Daniel H. Huson, Alexander F. Auch, Ji Qi, et al. Genome Res
Convenience Sample of 4 Adults and 6 Infants. Adults 4 visits over 2 weeks; infants 2 visits over 2 weeks Adult specimens: 1) plaque (by method, teeth,
Date of download: 6/23/2016 Copyright © 2016 McGraw-Hill Education. All rights reserved. Pipeline for culture-independent studies of a microbiota. (A)
Discussion on Genomic/Metagenomic Data for ANGUS Course Adina Howe.
Canadian Bioinformatics Workshops
Date of download: 7/7/2016 Copyright © 2016 McGraw-Hill Education. All rights reserved. Pipeline for culture-independent studies of a microbiota. (A) DNA.
ESPRIT. Taxonomy ● Works very well and gives accurate results ● Requires a previous blast search that may take long to complete ● When in doubt goes one.
Robert Edgar Independent scientist
Canadian Bioinformatics Workshops
16S rRNA Experimental Design
Unsupervised Learning
Quantitative Phylogenetic Assessment of Microbial Communities in Diverse Environments Xinjun Zhang.
16S RNA sequencing analysis
Canadian Bioinformatics Workshops
Metagenomic Species Diversity.
Introduction to Bioinformatics Resources for DNA Barcoding
Metagenomics: From Bench to Data Analysis 19-23rd September S rRNA-based surveys for Community Analysis: How Quantitative are they? Dr.
JMP Discovery Summit 2016 Janet Alvarado
Micelle PCR reduces artifact formation in 16S microbiota profiling
MGmapper A tool to map MetaGenomics data
Preprocessing Data Rob Schmieder.
Quality Control & Preprocessing of Metagenomic Data
Peter Sterk EBI Metagenomics Course 2014
Presented By: Chinua Umoja
PNAS 2012 Alpha diversity: how many species are in each sample?
An Artificial Intelligence Approach to Precision Oncology
RNA-Seq analysis in R (Bioconductor)
A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets Ashok Sharma, Robert Podolsky, Jieping.
Research in Computational Molecular Biology , Vol (2008)
Workshop on the analysis of microbial sequence data using ARB
Design and Analysis of Single-Cell Sequencing Experiments
Rosie Coates-Brown Final year Bioinformatics trainee
Microbiome: 16S rRNA Sequencing
H = -Σpi log2 pi.
Summary and Recommendations
Dr Tan Tin Wee Director Bioinformatics Centre
Volume 17, Issue 3, Pages (March 2015)
Volume 21, Issue 8, Pages (August 2014)
Intervention effects on taxonomic and functional pathway diversity of the intestinal microbiome. Intervention effects on taxonomic and functional pathway.
Data Mining, Machine Learning, Data Analysis, etc. scikit-learn
Skin Microbiome Surveys Are Strongly Influenced by Experimental Design
Volume 10, Issue 4, Pages (October 2011)
Independent scientist
Volume 14, Issue 6, Pages (December 2013)
Summary and Recommendations
Sequence Analysis - RNA-Seq 2
Statistics for genomics
Unsupervised Learning
Toward Accurate and Quantitative Comparative Metagenomics
Presentation transcript:

Microbiome 16S analysis Morgan G. I. Langille, PhD Assistant Professor Canada Research Chair in Human Microbiomics Dalhousie University Halifax, Nova Scotia, Canada Oct. 19th, 2016

Interests PICRUSt 2.0 Multi-omics and machine learning Integrating knowledge from hundreds of microbiome studies Dabbling with MinIon sequencing Exercise and the gut microbiome Drug metabolism by the gut microbiome Aging/frailty Pediatric IBD Pediatric cancer Pediatric UTIs Welder’s lungs Wild blueberry agriculture Beer

Module #: Title of Module 3

Acknowledgements Will Hsiao and the Canadian Bioinformatics Workshop Many slides in this presentation Gavin Douglas, PhD student Microbiome Helper Software and Tutorials Andre Comeau, PhD 16S sequencing and Integrated Microbiome Resource Various bioinformatic developer groups QIIME, PEAR, STAMP, etc.

Learning Objectives Understand and Perform marker based microbiome analysis Analyze 16S rRNA marker gene data Use 16S rRNA marker gene to profile and compare microbomes Select suitable parameters for marker gene analysis Explain the advantages and disadvantages of marker based microbiome analysis

Generating data 16S: Samples to DNA

rRNAs – the universal phylogenetic markers Ribosomal RNAs are present in all living organisms rRNAs play critical roles in protein translation rRNAs are relatively conserved and thought to be rarely acquired horizontally Behave like a molecular clock Useful for phylogenetic analysis Used to build tree-of-life (placing organisms in a single phylogenetic tree) 16S rRNA most commonly used Consequtive basepairs Universal phylogenetic tree based on SSU rRNA sequences -N. Pace, Science, 1997

Other Marker Genes Used Eukaryotic Organisms (protists, fungi) 18S (http://www.arb-silva.de) ITS (http://www.mothur.org/wiki/UNITE_ITS_database) Bacteria CPN60 (http://www.cpndb.ca/cpnDB/home.php) ITS (Martiny, Env Micro 2009) RecA gene Viruses Gp23 for T4-like bacteriophage RdRp for picornaviruses Faster evolving markers used for strain-level differentiation

A Few Considerations for Selecting Marker Genes Sufficient resolution to differentiate the different communities you want to study (16S differentiates genera and some species; e.g. HSP65 may be more useful for Mycobacterium) A (good) reference database is needed for taxonomic assignment When comparing across studies, need to standardize markers (different V regions of 16S gene), experimental protocols, and bioinformatic pipelines

DNA Extraction Many protocols/kits available Mo Bio Power Soil/Fecal/Food (now owned by Qiagen) DIY still popular Different extraction methods will give different results HMP and EMP standardize DNA extraction protocol DNA Extraction can also be done after fractionation to separate out different organisms based on cell size and other characteristics Gels Flow Cytometry

Contaminations Lab: Extraction process can introduce contamination from the lab Reagents may be contaminated with bacterial DNA Include Extraction Negative Control in your experiments! Especially crucial if samples have low DNA yield! Host / Environment: Host DNA often ends up in the microbiome sample More of a problem with shotgun metagenomics Low ratio of amplicon to other DNA can sometimes cause amplification problems

Target Selection and Bias 16S rRNA contains 9 hypervariable regions (V1-V9) Different V regions have different phylogenetic resolutions – giving rise to slightly different community composition results Jumpstart Consortium HMP, PLoS One, 2012 http://themicrobiome.com/media/16S_viewer.cfm

Variable Region Comparison d William et al, 2015 mSystems Comeau et al. in review

Target Amplification Adapter barcode primer Comeau et al. in review PCR Primers designed to target specific region of a genome Source: Illumina application Note Comeau et al. in review

Stitching improves read quality Comeau et al. in review

Sample Multiplexing “Multiplexing” means combining samples on the same run Unique DNA barcodes can be incorporated into your amplicons to differentiate samples Dual-indexing uses 16 forward x 24 reverse = 384 samples Amplicon sequences are quite homogeneous Necessary to combine different amplicon types or spike in PhiX control reads to diversify the sequences Not as big a problem with recent firmware release (only 5% PhiX needed)

Processing data 16S: DNA to OTU tables

Available Marker Gene Analysis Platforms QIIME (http://qiime.org) Mothur (http://www.mothur.org)

Comparison Between QIIME and Mothur A python interface to glue together many programs Single program with minimal external dependency Wrappers for existing programs Reimplementation of popular algorithms Large number of dependencies / VM available; work best on single multi-core server with lots of memory Easy to install and setup; work best on single multi-core server with lots of memory More scalable Less scalable Steeper learning curve but more flexible workflow if you can write your own scripts Easy to learn but workflow works the best with built-in tools http://www.ncbi.nlm.nih.gov/pubmed/24060131 http://www.mothur.org/wiki/MiSeq_SOP

Overall Bioinformatics Workflow 1) Preprocessing: remove primers, demultiplex, stitch reads, quality filter 2) OTU Picking / Representative Sequences Sequence data (fastq) 3) Taxonomic Assignment 5) Sequence Alignment Metadata about samples (mapping file) 4) Build OTU Table (BIOM file) 6) Phylogenetic Analysis Inputs Processed Data OTU Table Phylogenetic Tree 7) Downstream analysis and Visualization – knowledge discovery Outputs

Sample de-multiplexing 1) Preprocessing Sample de-multiplexing Reads need to be linked back to the samples they came from using the unique barcodes De-multiplexing removes the barcodes and the primer sequences and separates reads into individual sample files Some machines will demultiplex for you (MiSeq) QIIME has several different scripts to help you prepare your sequence files for analysis http://qiime.org/tutorials/processing_illumina_data.html

Read Stitching Paired-end reads result in a forward and reverse read from the same sequence

Quality Filtering QIIME filtering by: 1) Preprocessing Quality Filtering QIIME filtering by: Maximal number of consecutive low-quality bases ( -r=3 ) Minimum Phred quality score ( -q=3 ) Minimal length of consecutive high-quality bases (as % of total read length) ( -p=0.75 ) Maximal number of ambigous bases (N’s) (-n=0 ) Other quality filtering tools available Cutadapt (https://github.com/marcelm/cutadapt) Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic) Sickle (https://github.com/najoshi/sickle) Sequence quality summary using FASTQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Chimera Sequence Removal 2) OTU Picking Chimera Sequence Removal PCR amplification process can generate chimeric sequences (an artificially joined sequence from >1 templates) Chimeras represent ~1% of the reads Detection based on the identification of a 3-way alignment of non-overlapping sub-sequences to 2 sequences in the search database Source: http://drive5.com/usearch/manual/chimera_formation.html

OTU Picking OTUs: formed arbitrarily based on sequence identity 97% of sequence similarity ≈ species QIIME supports 3 approaches (each with several algorithms) De novo clustering Closed-reference Open-reference More details about OTU picking http://qiime.org/tutorials/otu_picking.html Current recommendation outlined in http://msystems.asm.org/content/1/1/e00003-15

De novo Clustering Groups sequences based on sequence identity Pair-wise comparisons needed Hierarchical clustering commonly used Requires a lot of disk/memory (hundreds Gb) and time Suitable if no reference database is available or if a large proportion of novel taxa Navas-Molina et al 2013, Meth. Enzym

De novo clustering Exhaustive de novo clustering (pair-wise distance matrix followed by hierarchical clustering) is not scalable (e.g. 1000 sequences requires 1000x1000 comparisons) Greedy algorithms (e.g. UCLUST, sumaclust) achieves time/space saving by clustering incrementally using a subset of sequences as “centroids” (centroid = seed of a cluster) Input order of sequences affects centroids picking so input seqs are usually pre-sorted by length or abundance

Closed-Reference Matches sequences to an existing database of reference sequences Unmatched sequences discarded Fast and can be parallelized Suitable if accurate and comprehensive reference is available Allow taxonomic comparison across different markers and different datasets Novel organisms missed/discarded Navas-Molina et al 2013, Meth. Enzym

Open-Reference First matches sequences to reference database (like closed-reference) Unmatched sequences then go through de-novo clustering Best of both worlds Suitable if a mixture of novel sequences and known sequences is expected Results could be biased by the reference database used QIIME Recommended approach Navas-Molina et al 2013, Meth. Enzym

No OTU picking! Collapsing sequences at 97% removes information However, collapsing to 100% identity would allow a single nucleotide error to result in a spurious taxa Use location of nucleotide differences to distinguish sequencing errors from real mutations

Representative Sequence 2) OTU Picking Representative Sequence Once sequences have been clustered into OTUs, a representative sequence is picked for each OTU. This means downstream analyses treats all organisms in one OTU as the same! Representative sequence can be picked based on Abundance (default) Centroid used (open-reference OTU or de novo) Length Existing reference sequence (for closed-reference OTU picking) Random First sequence

3) Taxonomic Assignment OTUs don’t have names but humans have the desire to refer to things by names. I.e. we want to refer to a group of organism as Bacteroides rather than OTU1. Closed-reference: Transfer taxonomy of the reference sequence to the query read Open-reference + de-novo clustering: Match OTUs to a reference data using “similarity” searching algorithms It is important to report how the matching is done in your publication and which taxonomy database is used

Taxonomic Assignment 3) Taxonomic Assignment Taxonomy DBs Bacteria; Bacteroidetes/Chlorobi group; Bacteroidetes; Bacteroidia; Bacteroidales; Bacteroidaceae; Bacterroides Bacteria; Firmicutes; Bacilli; Lactobacillales; Lactobacillaceae; Lactobacillus Etc. RDP Classifier RDP NCBI BLAST GreenGenes rtax Silva tax2tree Specify a matching algorithm Specify a target taxonomy database OTUs Ranked taxonomy names

Taxonomy Databases RDP (Cole et al 2009) 3) Taxonomic Assignment Taxonomy Databases RDP (Cole et al 2009) Most similar to NCBI Taxonomy Has a rapid classification tool (RDP-Classifier) GreenGenes (DeSantis et al 2006) Preferred by QIIME Silva (Quast et al. 2013) Preferred by Mothur Choose DB based on existing datasets you want to compare to and community preference

OTU Table OTU table is a sample-by-observation matrix 4) Build OTU Table OTU Table OTU table is a sample-by-observation matrix OTUs can have corresponding taxonomy information Extremely rare OTUs (singletons or frequencies <0.1% of total reads) can be filtered out to improve downstream analysis These OTUs may be due to sequencing errors and chimeras Encoded in Biological observation Matrix (BIOM) format (http://biom-format.org/ Sample1 Sample2 Sample3 Sample4 OTU1 10 14 33 OTU2 5 54 2 OTU3…

BIOM Format Advantages: Efficient: Able to handle large amount of data; sparse matrix Encapsulation: Able to keep metadata (e.g. sample information) and observational data (e.g. OTU counts) in one file Supported: Have software applications to manipulate the content, check the content to minimize editing errors Standardization: BIOM is a GSC (http://gensc.org/recognized standard; many tools use this file format (e.g. QIIME, MG-RAST, PICRUSt, Mothur, phyloseq, MEGAN, VAMPS, metagenomeSeq, Phinch, RDP Classifier)

Rarefaction As we multiplex samples, the sequencing depths can vary from sample to sample Many richness and diversity measures and downstream analyses are sensitive to sampling depth “more reads sequenced, more species/OTUs will be found in a given environment” So we need to rarefy the samples to the same level of sampling depth for more fair comparison Alternative approach: variance stabilizing transformations (e.g. DeSeq2) McMurdie and Holmes, PLOS CB, 2013 Liu, Hsiao et al, Bioinformatics, 2011

5) Sequence Alignment Sequence Alignment Sequences must be aligned to infer a phylogenetic tree Phylogenetic trees can be used for diversity analyses Traditional alignment programs (clustalw, muscle etc) are slow (Larkin et al 2007, Edgar 2004) Template-based aligners such as PyNAST (Caporaso et al 2010) and Infernal (Nawrocki, 2009) are faster and more scalable for large number of sequences

Phylogenetic Tree Reconstruction 6) Phylogenetic Analysis Phylogenetic Tree Reconstruction After sequences are aligned, phylogenetic trees can be constructed from the multiple alignments Examples: FastTree (Price et al 2009) – recommended by QIIME Clearcut (Evans et al 2006) Clustalw RAXML (Stamatakis et al 2005) Muscle

Visualization and statistics What is important? Visualization and statistics

Alpha Diversity Analysis 7) Downstream Analysis Alpha Diversity Analysis Alpha-diversity: diversity of organisms in one sample / environment Richness: # of species/taxa observed or estimated Evenness: relative abundance of each taxon α-Diversity: taking both evenness and richness into account Phylogenetic distance (PD) Shannon entropy dozens of different measures in QIIME

Beta Diversity Analysis 7) Downstream Analysis Beta Diversity Analysis Beta-diversity: differences in diversity across samples or environments UniFrac (Lozupone et al, AEM, 2005) (phylogenetic) Bray-Curtis dissimilarity measure (OTU abundance) Jaccard similarity coefficient (OTU presence/absence) Large number of other measures available in QIIME (and in Mothur)

Beta Diversity (UniFrac) 7) Downstream Analysis Beta Diversity (UniFrac) all OTUs are mapped to a phylogenetic tree summation of branch lengths unique to a sample constitutes the UniFrac score between the samples shared nodes either ignored (unweighted) or discounted (weighted)

Different beta-diversity measurements provide different information Parks and Beiko 2012, ISMEJ

Beta Diversity Analysis 7) Downstream Analysis Beta Diversity Analysis First, paired-wise distances/similarity of the samples calculated Then the matrix (high-dimensional data) is transformed into a lower-dimensional display for visualization Shows you which samples cluster together in 2D or 3D space PCoA Hierarchical clustering Sample 1 Sample 2 Sample 3 1 0.4 0.8 0.7

Principal Coordinate Analysis (PCoA) 7) Downstream Analysis Principal Coordinate Analysis (PCoA) When there are multiple samples, the UniFrac scores are calculated for each pair of samples. In order to view the multidimensional data, the distances are projected onto 2 (or more) dimensions using PCoA

PCoA 7) Downstream Analysis Each dot represents a sample Dots are colored based on the attributes provided in the mapping (metadata) file Source: Navas-Molina et al 2015

Is it statistically significant? You might see separation on PCoA plot, but is it statistically significant? QIIME has several tests within the compare_categories.py script Anosim Adonis Mantel?

Visualization and Statistics Various tools are available to determine statistically significant taxonomic differences across groups of samples Excel SigmaPlot Past R (many libraries) Python (matplotlib) STAMP

STAMP

STAMP Plots

STAMP Input “Profile file”: Table of features (samples by OTUs, samples by functions, etc.) Features can form a heirarchy (e.g. Phylum, Order, Class, etc) to allow data to be collapsed within the program “Group file”: Contains different metadata for grouping samples Can be two groups: (e.g. Healthy vs Sick) or multiple groups (e.g. Water depth at 2M, 4M, and 6M) Output PCA, heatmap, box, and bar plots Tables of significantly different features

Machine Learning Fits model (Random Forests) to data Selects features (species) Classifies samples (control or exercise) Samples Categories Machine learning fits a computer model, in this case one known as random forests, to the data. It selects the most important features, which for our data are species, that distinguish between data categories. It then classifies samples as being from either exercise or control mice based on those features. It trains on one subset of the data, and then it tests the model on another subset, to see how accurately it can predict the category of a sample based on those selected species. Explain table. As a control for the model, we look at whether the prediction accuracy drops if we randomly select the species that it uses for classification

Machine learning identifies differences between exercise and controls PC2 (15.31%) PC3 (8.34%) PC1 (28.56%) Exercise Control Y-axis: Prediction accuracy, or the model’s ability to tell which sample is from which group. X-axis: We ran the model for all time points. We looked at accuracy using the 30 most important species that were selected for classification, and we also looked at the accuracy using 30 randomly selected species. We found that accuracy was much better with the selected features (%80) than the random ones (less than %60). You can see that the difference in accuracy gets larger at later time points, indicating that there is a shift in the microbiome. So though we did not see a difference in species richness or shared species across samples, machine learning shows us that there is a difference in microbiome relationships in response to exercise. So we do see a difference in this method compared to other methods Future: narrow down important species Applying model to other datasets

Using Machine Learning Can run Random Forests within QIIME Script: supervised_learning.py More options within either: R using the caret package Python using the scikit-learn package

Putting it all together Microbiome Helper

Microbiome Helper Microbiome Helper is an open resource aimed at helping researchers process and analyze their microbiome data Combines bioinformatic methods from different groups and is updated continuously Scripts to wrap and integrate existing tools Available as an Ubuntu Virtualbox Standard Operating Procedures for 16S and metagenomics Tutorials with example datasets

Microbiome Helper Wiki https://github.com/mlangill/microbiome_helper/wiki

Microbiome Helper Vbox

Questions?

Tutorial Microbiome Helper https://github.com/mlangill/microbiome_helper/wiki/16S-tutorial-(chemerin)