Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops www.bioinformatics.ca

2Module #: Title of Module

Module 2 Marker Gene Based Analysis William.hsiao@bccdc.ca @wlhsiao Universal phylogenetic tree based on SSU rRNA sequences -N. Pace, Science, 1997

Module 2 bioinformatics.ca Learning Objectives of Module At the end of this module, you will be able to: Understand and Perform marker based microbiome analysis Analyze 16S rRNA marker gene data Use 16S rRNA marker gene to profile and compare microbomes Select suitable parameters for marker gene analysis Explain the advantages and disadvantages of marker based microbiome analysis

Module 2 bioinformatics.ca Marker genes Extract DNA Amplify with targeted primers Filter errors, build clusters Diversity analysis Source: Rob Beiko

Module 2 bioinformatics.ca rRNAs – the universal phylogenetic markers Ribosomal RNAs are present in all living organisms rRNAs play critical roles in protein translation rRNAs are relatively conserved and rarely acquired horizontally Behave like a molecular clock – Useful for phylogenetic analysis – Used to build tree-of-life (placing organisms in a single phylogenetic tree) 16S rRNA most commonly used

Module 2 bioinformatics.ca rRNA – the lens into Life A tool for placing organisms on a phylogenetic tree – Tree of life A tool for understanding the composition of a microbial community – Profiling of a community (alpha-diversity) A tool for relating one microbial community to another – Comparison of communities (beta-diversity) A tool for reading out the properties of a host or environment – Obese vs. lean; IBD vs. no IBD (classification) -McDonald et al, RNA, 2015

Module 2 bioinformatics.ca Other Marker Genes Used Eukaryotic Organisms (protists, fungi) – 18S (http://www.arb-silva.de)http://www.arb-silva.de – ITS (http://www.mothur.org/wiki/UNITE_ITS_database)http://www.mothur.org/wiki/UNITE_ITS_database Bacteria – CPN60 (http://www.cpndb.ca/cpnDB/home.php)http://www.cpndb.ca/cpnDB/home.php – ITS (Martiny, Env Micro 2009) – RecA gene Viruses – Gp23 for T4-like bacteriophage – RdRp for picornaviruses Faster evolving markers used for strain-level differentiation

Module 2 bioinformatics.ca A Few Considerations for Selecting Marker Genes The marker should have sufficient resolution to differentiate the different communities you want to study (16S differentiates genera and some species; e.g. HSP65 may be more useful for Mycobacterium) A reference database is needed for taxonomic assignment and for binning by reference so availability of a good reference DB for the samples you study is necessary When comparing across studies, need to standardize markers (different V regions of 16S gene), experimental protocols, and bioinformatic pipelines

Module 2 bioinformatics.ca DNA Extraction Many protocols/kits available but kit-bias exists HMP and other major projects standardized DNA extraction protocol - http://www.earthmicrobiome.org/emp-standard- protocols/dna-extraction-protocol/ http://www.earthmicrobiome.org/emp-standard- protocols/dna-extraction-protocol/ DNA Extraction can also be done after fractionation to separate out different organisms based on cell size and other characteristics - http://www.jove.com/video/52685/automated-gel-size- selection-to-improve-quality-next-generation http://www.jove.com/video/52685/automated-gel-size- selection-to-improve-quality-next-generation

Module 2 bioinformatics.ca Contaminations Lab: Extraction process can introduce contamination from the lab – Reagents may be contaminated with bacterial DNA Include Extraction Negative Control in your experiments! Especially crucial if samples have low DNA yield! Host / Environment (metagenomics sequencing): Host DNA often ends up in the microbiome sample – http://hmpdacc.org/doc/HumanSequenceRemoval_SOP.pdf http://hmpdacc.org/doc/HumanSequenceRemoval_SOP.pdf Unwanted fractions (e.g. eukaryotes) can be filtered by cell size-selection prior to DNA extraction Unwanted DNA can be removed by subtractive hybridization

Module 2 bioinformatics.ca Target Amplification Source: Illumina application Note Adapter primer barcode

Module 2 bioinformatics.ca Target Amplification PCR primers designed to amplify specific regions of a genome – In a “dirty” sample, inhibitors can interfere with PCR – In a complex sample, non-specific amplification can occur – Check your products using BioAnalyzer or run a gel Gel size selection may be necessary to clean up the PCR products – http://www.jove.com/video/52685/automated-gel-size- selection-to-improve-quality-next-generation http://www.jove.com/video/52685/automated-gel-size- selection-to-improve-quality-next-generation Dilution may be necessary to reduce inhibition

Module 2 bioinformatics.ca Target Selection and Bias 16S rRNA contains 9 hypervariable regions (V1-V9) V4 was chosen because of its size (suitable for Illumina 150bp paired-end sequencing) and phylogenetic resolution Different V regions have different phylogenetic resolutions – giving rise to slightly different community composition results – Jumpstart Consortium HMP, PLoS One, 2012

Module 2 bioinformatics.ca Sample Multiplexing MiSEQ capacity (~20M paired-end reads) allows multiple samples to be combined into a single run – Number of reads needed to differentiate samples depends on the nature of the studies Unique DNA barcodes can be incorporated into your amplicons to differentiate samples Amplicon sequences are quite homogeneous – Necessary to combine different amplicon types or spike in PhiX control reads to diversify the sequences

Module 2 bioinformatics.ca One Step vs. Two Step Amplification One step amplificationTwo step amplification Barcodes and sequencing adaptors included in the amplicon primers Barcodes and adaptors separated from the amplicon primers Single step PCR reactionPCR followed by DNA-ligation or more PCR Long primers; barcodes can interfere with amplification primers Target primers tested independently from barcodes/adaptors Not suitable for degenerate primer amplification Compatible with random and degenerate primer amplification Suitable if working with one amplicon type from many samples Suitable if working with many types of amplicons or metagenomics in a single study Rapid protocol and little loss of biomaterial Longer protocols with more biomaterial loss Barcoded primers for 16S available from Rob Knight’s Lab Tested barcodes from Illumina, BioO, NEB, etc. available

Module 2 bioinformatics.ca Available Marker Gene Analysis Platforms QIIME (http://qiime.org)http://qiime.org Mothur (http://www.mothur.org)http://www.mothur.org

Module 2 bioinformatics.ca Comparison Between QIIME and Mothur QIIMEMothur A python interface to glue together many programs Single program with minimal external dependency Wrappers for existing programsReimplementation of popular algorithms Large number of dependencies / VM available; work best on single multi-core server with lots of memory Easy to install and setup; work best on single multi-core server with lots of memory More scalableLess scalable Steeper learning curve but more flexible workflow if you can write your own scripts Easy to learn but workflow works the best with built-in tools http://www.ncbi.nlm.nih.gov/pubmed/240 60131 http://www.mothur.org/wiki/MiSeq_SOP

Module 2 bioinformatics.ca Overall Bioinformatics Workflow Sequence data (fastq) Metadata about samples (mapping file) 1) Preprocessing: remove primers, demultiplex, quality filter, decontamination 2) OTU Picking / Representative Sequences 5) Sequence Alignment 3) Taxonomic Assignment 4) Build OTU Table (BIOM file) 6) Phylogenetic Analysis Phylogenetic Tree OTU Table 7) Downstream analysis and Visualization – knowledge discovery Processed Data Inputs Outputs

Module 2 bioinformatics.ca Sample de-multiplexing MiSEQ capacity (~20M paired-end reads) allows multiple samples to be combined into a single run Reads need to be linked back to the samples they came from using the unique barcodes – Error-correcting barcodes available (~2000 unique barcodes from Caporaso et al ISME 2012) De-multiplexing removes the barcodes and the primer sequences QIIME has several different scripts to help you prepare your sequence files for analysis – http://qiime.org/tutorials/processing_illumina_data.html http://qiime.org/tutorials/processing_illumina_data.html 1) Preprocessing

Module 2 bioinformatics.ca Quality Filtering QIIME filtering by: – Maximal number of consecutive low-quality bases ( -r=3 ) – Minimum Phred quality score ( -q=3 ) – Minimal length of consecutive high-quality bases (as % of total read length) ( -p=0.75 ) – Maximal number of ambigous bases (N’s) (-n=0 ) Other quality filtering tools available – Cutadapt (https://github.com/marcelm/cutadapt)https://github.com/marcelm/cutadapt – Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic)http://www.usadellab.org/cms/?page=trimmomatic – Sickle (https://github.com/najoshi/sickle)https://github.com/najoshi/sickle Sequence quality summary using FASTQC – http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ 1) Preprocessing

Module bioinformatics.ca QIIME’s Quality Filtering Workflow and Observations http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531572/ Adjusting p,q, and c (minimal reads per OTU) significantly affect OTU count Filtering mainly affect low abundance OTUs – difficult to differentiate sequencing errors and rare OTUs Downstream analyses more robust to QF changes than raw OTU counts

Module 2 bioinformatics.ca Decontamination (in-silico) In situations where host and other non-bacterial DNA greatly outnumber the bacterial DNA, non-specific PCR amplification can occur Useful to check a subset of your sequences to make sure they match 16S reference sequences Two Strategies: – Query a subset of your sequences against 16S database(s) to confirm - use BLAST for higher sensitivity – mapping reads to databases containing suspected contaminants (host sequences, known contaminating sequences, etc.) – use short read aligners 1) Preprocessing

Module 2 bioinformatics.ca OTU Picking OTUs: formed arbitrarily based on sequence identity – 97% of sequence similarity over the entire 16S gene ≈ species QIIME supports 3 approaches (each with several algorithms) – De novo clustering – Closed-reference – Open-reference More details about OTU picking – http://qiime.org/tutorials/otu_picking.html http://qiime.org/tutorials/otu_picking.html – Current recommendation outlined in http://msystems.asm.org/content/1/1/e00003-15 http://msystems.asm.org/content/1/1/e00003-15 2) OTU Picking

Module 2 bioinformatics.ca De novo Clustering Groups sequences based on sequence identity Pair-wise comparisons needed Hierarchical clustering commonly used Requires a lot of disk/memory (hundreds Gb) and time Suitable if no reference database is available According to Schloss (mothur), average-linkage clustering is the most robust to changes in input data and algorithm parameters – “generated OTUs that were more likely to represent the actual distances between sequences” – Recommend first group input sequences into broad taxonomic bins (e.g. class or family) then cluster within each bin (https://peerj.com/articles/1487/)https://peerj.com/articles/1487/ Navas-Molina et al 2013, Meth. Enzym

Module 2 bioinformatics.ca De novo clustering Exhaustive de novo clustering (pair-wise distance matrix followed by hierarchical clustering) is not scalable (e.g. 1000 sequences requires 1000x1000 comparisons) Greedy algorithms (e.g. UCLUST, sumaclust) achieves time/space saving by clustering incrementally using a subset of sequences as “centroids” (centroid = seed of a cluster) Input order of sequences affects centroids picking so input seqs are usually pre-sorted by length or abundance

Module 2 bioinformatics.ca Closed-Reference Matches sequences to an existing database of reference sequences Unmatched sequences discarded Fast and can be parallelized Suitable if accurate and comprehensive reference is available Allow taxonomic comparison across different markers and different datasets Novel organisms missed/discarded Navas-Molina et al 2013, Meth. Enzym

Module 2 bioinformatics.ca Open-Reference First matches sequences to reference database (like closed-reference) Unmatched sequences then go through de-novo clustering Best of both worlds Suitable if a mixture of novel sequences and known sequences is expected Results could be biased by the reference database used QIIME Recommended approach Navas-Molina et al 2013, Meth. Enzym

Module 2 bioinformatics.ca Representative Sequence Once sequences have been clustered into OTUs, a representative sequence is picked for each OTU. This means downstream analyses treats all organisms in one OTU as the same! Representative sequence can be picked based on – Abundance (default) – Centroid used (open-reference OTU or de novo) – Length – Existing reference sequence (for closed-reference OTU picking) – Random – First sequence 2) OTU Picking

Module 2 bioinformatics.ca Chimera Sequence Removal PCR amplification process can generate chimeric sequences (an artificially joined sequence from >1 templates) Chimeras represent ~1% of the reads Detection based on the identification of a 3-way alignment of non-overlapping sub-sequences to 2 sequences in the search database Source: http://drive5.com/usearch/manual/chimera_formation.html 2) OTU Picking

Module 2 bioinformatics.ca Taxonomic Assignment OTUs don’t have names but humans have the desire to refer to things by names. – I.e. we want to refer to a group of organism as Bacteroides rather than OTU1. Closed-reference: – Transfer taxonomy of the reference sequence to the query read Open-reference + de-novo clustering: – Match OTUs to a reference data using “similarity” searching algorithms – It is important to report how the matching is done in your publication and which taxonomy database is used 3) Taxonomic Assignment

Module 2 bioinformatics.ca Taxonomic Assignment RDP GreenGenes Silva Taxonomy DBs Specify a target taxonomy database Specify a matching algorithm OTUs RDP Classifier NCBI BLAST rtax tax2tree Ranked taxonomy names Bacteria; Bacteroidetes/Chl orobi group; Bacteroidetes; Bacteroidia; Bacteroidales; Bacteroidaceae; Bacterroides Bacteria; Firmicutes; Bacilli; Lactobacillales; Lactobacillaceae; Lactobacillus Etc. 3) Taxonomic Assignment

Module 2 bioinformatics.ca Taxonomy Databases RDP (Cole et al 2009) – Most similar to NCBI Taxonomy – Has a rapid classification tool (RDP-Classifier) GreenGenes (DeSantis et al 2006) – Preferred by QIIME Silva (Quast et al. 2013) – Preferred by Mothur Choose DB based on existing datasets you want to compare to and community preference Caveat: only 11% of the ~150 human associated bacterial genera have species that fall in the 95%-98.7% 16S rRNA identity (Tamisier et al, IJSEM, 2015) 3) Taxonomic Assignment

Module 2 bioinformatics.ca Taxonomy Summary Source: Navas-Molina et al 2015 Different body sites have different taxonomy composition shown as stacked bar graphs 3) Taxonomic Assignment

Module 2 bioinformatics.ca OTU Table OTU table is a sample-by-observation matrix OTUs can have corresponding taxonomy information Extremely rare OTUs (singletons or frequencies <0.005% of total reads) can be filtered out to improve downstream analysis – These OTUs may be due to sequencing errors and chimeras Encoded in Biological observation Matrix (BIOM) format (http://biom-format.org/http://biom-format.org/ OTU1OTU2OTU3OTU4OTU5... Sample1101403323 Sample25054212 4) Build OTU Table

Module 2 bioinformatics.ca BIOM Format Advantages: – Efficient: Able to handle large amount of data; sparse matrix – Encapsulation: Able to keep metadata (e.g. sample information) and observational data (e.g. OTU counts) in one file – Supported: Have software applications to manipulate the content, check the content to minimize editing errors – Standardization: BIOM is a GSC (http://gensc.org/recognized standard; many tools use this file format (e.g. QIIME, MG-RAST, PICRUSt, Mothur, phyloseq, MEGAN, VAMPS, metagenomeSeq, Phinch, RDP Classifier)http://gensc.org/

Module 2 bioinformatics.ca Sequence Alignment Sequences must be aligned to infer a phylogenetic tree Phylogenetic trees can be used for diversity analyses Traditional alignment programs (clustalw, muscle etc) are slow (Larkin et al 2007, Edgar 2004) Template-based aligners such as PyNAST (Caporaso et al 2010) and Infernal (Nawrocki, 2009) are faster and more scalable for large number of sequences 5) Sequence Alignment

Module 2 bioinformatics.ca Phylogenetic Tree Reconstruction After sequences are aligned, phylogenetic trees can be constructed from the multiple alignments Examples: – FastTree (Price et al 2009) – recommended by QIIME – Clearcut (Evans et al 2006) – Clustalw – RAXML (Stamatakis et al 2005) – Muscle 6) Phylogenetic Analysis

Module 2 bioinformatics.ca Rarefaction As we multiplex samples, the sequencing depths can vary from sample to sample Many richness and diversity measures and downstream analyses are sensitive to sampling depth – “more reads sequenced, more species/OTUs will be found in a given environment” So we need to rarefy the samples to the same level of sampling depth for more fair comparison Alternative approach: variance stabilizing transformations – McMurdie and Holmes, PLOS CB, 2013 – Liu, Hsiao et al, Bioinformatics, 2011

Module 2 bioinformatics.ca Alpha Diversity Analysis Alpha-diversity: diversity of organisms in one sample / environment – Richness: # of species/taxa observed or estimated – Evenness: relative abundance of each taxon – α-Diversity: taking both evenness and richness into account Phylogenetic distance (PD) Shannon entropy dozens of different measures in QIIME 7) Downstream Analysis

Module 2 bioinformatics.ca Alpha Diversity (PD) Closed-referenceOpen-referenceDe-novo Rarefaction curves different based on the OTU-picking method used. De-novo approach and Open-reference generated higher diversity than Closed-reference approach Source: Navas-Molina et al 2015 7) Downstream Analysis

Module 2 bioinformatics.ca Beta Diversity Analysis Beta-diversity: differences in diversities across samples or environments – UniFrac (Lozupone et al, AEM, 2005) (phylogenetic) – Bray-Curtis dissimilarity measure (OTU abundance) – Jaccard similarity coefficient (OTU presence/absence) – Large number of other measures available in QIIME (and in Mothur) 7) Downstream Analysis

Module 2 bioinformatics.ca Beta Diversity Analysis First, paired-wise distances/similarity of the samples calculated Then the matrix (high-dimensional data) is transformed into a lower-dimensional display for visualization – Shows you which samples cluster together in 2D or 3D space – PCoA – Hierarchical clustering Sample 1Sample 2Sample 3 Sample 110.40.8 Sample 210.7 Sample 31 7) Downstream Analysis

Module 2 bioinformatics.ca Beta Diversity (UniFrac) all OTUs are mapped to a phylogenetic tree summation of branch lengths unique to a sample constitutes the UniFrac score between the samples shared nodes either ignored (unweighted) or discounted (weighted) Weighted UniFrac sensitive to experimental bias Unweighted UniFrac often proven more robust but sensitive to the effects of rare OTUs 7) Downstream Analysis

Module 2 bioinformatics.ca Principal Coordinate Analysis (PCoA) When there are multiple samples, the UniFrac scores are calculated for each pair of samples. In order to view the multidimensional data, the distances are projected onto 2-dimensions using PCoA 7) Downstream Analysis

Module 2 bioinformatics.ca PCoA Each dot represents a sample Dots are colored based on the attributes provided in the mapping (metadata) file Source: Navas-Molina et al 2015 7) Downstream Analysis

Module 2 bioinformatics.ca Hierarchical Clustering Each leaf is a sample Samples “forced” into a bifurcating tree Source: Navas-Molina et al 2015 7) Downstream Analysis

Module 2 bioinformatics.ca Marker Genes vs. Shotgun Metagenomics Marker Gene ProfilingShotgun Metagenomics Profiling Less expensive (~$100 per sample)Still very expensive (~$1000 per sample) Computational needs can be met by desktop / small server computers Usually requires huge computational resources (cluster of computers) Provides mainly taxonomic profilingProvides both taxonomic and functional profiling For 16S, majority of genes can be assigned at least to phylum level Many more unassigned gene fragments (“wasted” data) Relatively free of host DNA contaminationProne to host DNA contamination

Module 2 bioinformatics.ca Functional Stability vs. OTU Stability Source: Turnbaugh, P, et al. 2008. Nature

Module 2 bioinformatics.ca PICRUSt Source: Xu et al ISME Journal 2014 Predicts gene content of microbiome based on marker gene survey data Analysis showed that in poorly- characterized community types, 16S profile works better than (predicted) functional profile to classify samples

Module bioinformatics.ca We are on a Coffee Break & Networking Session

Canadian Bioinformatics Workshops

Similar presentations

Presentation on theme: "Canadian Bioinformatics Workshops"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Canadian Bioinformatics Workshops

Similar presentations

Presentation on theme: "Canadian Bioinformatics Workshops"— Presentation transcript:

Similar presentations

About project

Feedback