Canadian Bioinformatics Workshops

Slides:



Advertisements
Similar presentations
16S sequencing for microbiome studies Nicola Segata and Nick Loman
Advertisements

Metabarcoding 16S RNA targeted sequencing
Peter Tsai Bioinformatics Institute, University of Auckland
Utilizing Fuzzy Logic for Gene Sequence Construction from Sub Sequences and Characteristic Genome Derivation and Assembly.
The Microbiome and Metagenomics
Metagenomic Analysis Using MEGAN4
Discussion on Metagenomic Data for ANGUS Course Adina Howe.
Molecular Microbial Ecology
Discovery of new biomarkers as indicators of watershed health and water quality Anamaria Crisan & Mike Peabody.
H = -Σp i log 2 p i. SCOPI Each one of the many microbial communities has its own structure and ecosystem, depending on the body environment it exists.
Probes can be designed in an evolutionary hierarchy.
Accurate estimation of microbial communities using 16S tags Julien Tremblay, PhD
Analysis of Complex Proteomic Datasets Using Scaffold Free Scaffold Viewer can be downloaded at:
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
Metagenomic Analysis Using MEGAN4 Peter R. Hoyt Director, OSU Bioinformatics Graduate Certificate Program Matthew Vaughn iPlant, University of Texas Super.
Analyzing Time Course Data: How can we pick the disappearing needle across multiple haystacks? IEEE-HPEC Bioinformatics Challenge Day Dr. C. Nicole Rosenzweig.
Computational Approaches for Biomarker Discovery SubbaLakshmiswetha Patchamatla.
Accurate estimation of microbial communities using 16S tags
Canadian Bioinformatics Workshops
MEGAN analysis of metagenomic data Daniel H. Huson, Alexander F. Auch, Ji Qi, et al. Genome Res
Canadian Bioinformatics Workshops
Date of download: 6/23/2016 Copyright © 2016 McGraw-Hill Education. All rights reserved. Pipeline for culture-independent studies of a microbiota. (A)
Canadian Bioinformatics Workshops
From Reads to Results Exome-seq analysis at CCBR
Date of download: 7/7/2016 Copyright © 2016 McGraw-Hill Education. All rights reserved. Pipeline for culture-independent studies of a microbiota. (A) DNA.
16S rRNA Experimental Design
Canadian Bioinformatics Workshops
Presented By: Emily Lamoureux
Metagenomic Species Diversity.
Metagenomics: From Bench to Data Analysis 19-23rd September S rRNA-based surveys for Community Analysis: How Quantitative are they? Dr.
Micelle PCR reduces artifact formation in 16S microbiota profiling
MGmapper A tool to map MetaGenomics data
Results for all features Results for the reduced set of features
An Artificial Intelligence Approach to Precision Oncology
RNA-Seq analysis in R (Bioconductor)
Gene expression.
Research in Computational Molecular Biology , Vol (2008)
Welcome to Introduction to Bioinformatics
The African Soil Microbiology project
Workshop on the analysis of microbial sequence data using ARB
Evaluating classifiers for disease gene discovery
Taxonomic profiling with MetaPhlAn2
Identifying personal microbiomes using metagenomic codes
Content and Labeling of Tests Marketed as Clinical “Whole-Exome Sequencing” Perspectives from a cancer genetics clinician and clinical lab director Allen.
Systematic Characterization and Analysis of the Taxonomic Drivers of Functional Shifts in the Human Microbiome  Ohad Manor, Elhanan Borenstein  Cell Host.
Taxonomic profiling with MetaPhlAn2
Rosie Coates-Brown Final year Bioinformatics trainee
Metagenomics and metatranscriptomics: Windows on CF-associated viral and microbial communities  Yan Wei Lim, Robert Schmieder, Matthew Haynes, Dana Willner,
Discovery tools for human genetic variations
H = -Σpi log2 pi.
Exploring and Understanding ChIP-Seq data
Metagenomics and metatranscriptomics: Windows on CF-associated viral and microbial communities  Yan Wei Lim, Robert Schmieder, Matthew Haynes, Dana Willner,
Daniel A. Peterson, Daniel N. Frank, Norman R. Pace, Jeffrey I. Gordon 
Dr Tan Tin Wee Director Bioinformatics Centre
Bioinformatics for plant biosecurity and surveillance systems
Volume 20, Issue 5, Pages (November 2014)
Microbiome studies for microbial disease pathogenesis research
Skin Microbiome Surveys Are Strongly Influenced by Experimental Design
Volume 10, Issue 4, Pages (October 2011)
Bo Li, Akshay Tambe, Sharon Aviran, Lior Pachter  Cell Systems 
Summarized by Sun Kim SNU Biointelligence Lab.
Daniel A. Peterson, Daniel N. Frank, Norman R. Pace, Jeffrey I. Gordon 
Volume 20, Issue 5, Pages (November 2014)
Basic Local Alignment Search Tool
A typical current computational meta'omic pipeline to analyze and contrast microbial communities. A typical current computational meta'omic pipeline to.
Volume 20, Issue 4, Pages (October 2016)
Sequence Analysis - RNA-Seq 2
Toward Accurate and Quantitative Comparative Metagenomics
Rapid Detection of HIV-1 subtype C Integrase resistance mutations by the Use of High-Resolution Melting Analysis Tendai Washaya BSc, Msc. Pre-PhD Student.
Host-Associated Quantitative Abundance Profiling Reveals the Microbial Load Variation of Root Microbiome  Xiaoxuan Guo, Xiaoning Zhang, Yuan Qin, Yong-Xin.
Presentation transcript:

Canadian Bioinformatics Workshops www.bioinformatics.ca

Module #: Title of Module 2

Module 7 Microbiome biomarker discovery Fiona Brinkman Analysis of Metagenomic Data June 22-24, 2016 Thank you! Anamaria Crisan, Mike Peabody, Thea Van Rossum, John Parkinson

Learning objectives of the module Appreciate what biomarkers are and their utility Know the basics of how one can identify biomarkers from microbiome data Be aware of examples, including a case study, of biomarkers identified from microbiome data Appreciate the importance of careful, conservative microbiome/biomarker analysis

What are biomarkers? bi·o·mark·er Functional biomarkers /ˈbīōˌmärkər/ Measureable biological property that can be indicative of some phenomena, such as an infection, disease, or environmental disturbance Functional biomarkers Biological functions (genes, proteins, metabolites…) specific to a single organism or shared among multiple organisms Taxonomic biomarkers Can be a specific species, or a category of organisms – including an Operational Taxonomic Unit (OTU).

Why identify biomarkers? Detect/diagnose a phenotype more quickly, cheaply and/or accurately versus whole metagenomics sequencing. “Bugs as Drugs” Growing success stories…

Biomarker success stories - Asthma

Biomarker applications are many… Bowel: Differentiating Inflammatory Bowel Disease from related diseases and detecting increasing disease severity; Detecting colorectal cancer Lung: Progression of chronic obstructive pulmonary disease (COPD) Breast: Human milk microbiome protecting against mastitis Environmental biomarkers (pollution, ecosystem health, …)

What is biomarker selection? The process of removing non-informative or redundant OTUs/taxa/gene sequence – identifying sequences that are differential between two conditions

DISCOVERY VALIDATION How do we find biomarkers? Bioinformatics Software - Takes the raw digitized genomic data and performs QC and quantification of different sequences (based on taxa or genes) Use math – Applied statistical methods can help us find useful biomarkers VALIDATION Design Primers – biological “hooks” that pick out our biomarker sequence of interest from a sample qPCR – measures how many times primers (hooks) manage to snag our biomarker of interest

What are biomarkers and how do we find them? 1 Plan Analysis: “bio”? “mark”? 2 Obtain Biological Samples 3 Extract & Sequence DNA 4 Identify Potential Biomarkers (Taxa, OTUs or Functional Genes more/less abundant in test versus control condition) What bio? what mark? 5 Validate Potential Biomarkers (In silico then in vitro) 6 Iteratively Further Optimize Biomarkers

Microbiome analyses options for biomarker ID A) Bacteria Shotgun or 16S amplicon Best studied, most methods developed B) Viruses Shotgun or amplicon (RdRp, g23) Can be challenging to get enough DNA Host-specificity and population “bursts” hold promise C) Eukaryotes Combinations! Amplicon (18S, ITS) (large genomes make shotgun difficult) Well studied, many methods developed

A) TAXONOMIC B) GENE-BASED C) OTHER? What kind of biomarker do we want? A) TAXONOMIC Can use amplicon or shotgun data Strain-level diversity can lead to false positives/negatives More variable across environments (for better or worse) B) GENE-BASED Requires shotgun data (DNA or RNA) Need good sequencing depth to reach specialised genes Domain-based gene architecture can be tricky C) OTHER? Combinations! Diversity metrics, using microbiome analysis to suggest other metabolic markers, etc

OTU1 OTU2 Sample frequency Sample frequency Abundance Abundance

What makes a good biomarker? In this example we want to find biomarkers that separate between the red and blue class labels.

Biomarker selection relies on statistical techniques Range from the very simple (like a t-test) to more complex… You can write your own statistical analysis using R or equivalent OR You can use more complex methods developed by others LEfSE – implemented as a convenient Galaxy workflow (Segata et al 2010 PMID 21702898) MetagenomeSeq – implemented using R (Paulsen et al 2013 PMID: 24076764)

LEfSE - LDA Effect Size for biomarker selection High-dimensional biomarker discovery and explanation. IDs genomic features (genes, pathways, taxa), characterizing the differences between two or more biological conditions/classes. First: IDs statistically different features among biological classes (non-parametric factorial KW sum-rank test). Second: Performs pairwise tests among subclasses using (unpaired) Wilcoxon rank-sum test - to assess whether the differences are consistent with respect to expected biological behavior. Third: Uses Linear Discriminant Analysis to est. effect size of each differentially abundant feature (& dimensional reduction if desired)

https://huttenhower. sph. harvard. edu/galaxy/tool_runner https://huttenhower.sph.harvard.edu/galaxy/tool_runner?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fgeorge-weingart%2Flefse%2FLEfSe_for%2F1.0

Biomarker selection relies on statistical techniques KEY: Understand the statistical methods especially Your assumptions about the data The statistical method’s assumptions about the data The statistical method’s limitations How to interpret your results from the output

The best approach to use depends on your research question

Considerations for your statistical analysis … Discrete/categorical or continuous variables? Do your samples involve known classes or do you not know how many classes there are?

Generally, statistical techniques either try to predict labels or continuous values

And…they also generally belong to one of two categories (Lied a little : Semi-supervised methods also exist)

How do we validate our biomarkers? Once you ID the gene or taxonomic group you want to use as a biomarker, you need to design a test for it (q)PCR is a good option: Identify biomarker specific sequence Use marker-based tool (e.g. MetaPhlAn2) OR Cluster reads or align them to find conserved sequences – verifying that the “representative” sequence is selective 2. Design primers (& probe) PrimerProspector: Designs primers from a sequence alignment PrimerBLAST: Designs primers specific to a clade

Case Study

Case Study Improved pollution/pathogen detection, source tracking: Using metagenomics to identify markers of water quality Will Hsiao Patrick Tang Matt Coxen Thea Van Rossum Mike Peabody Univ. of Sask. Janet Hill NRC-PBI Sean Hemmingsen MSFHR Bev Holmes McGill University Bartha Knoppers Vural Ozdemir Yann Joly www.watersheddiscovery.ca

The current emphasis of water quality testing is at the tap… Why Watershed Metagenomics? An Ecosystem approach to water quality monitoring… The current emphasis of water quality testing is at the tap… …but we should be looking more at the source. The fecal coliform test can’t ID the source and is inaccurate. False negatives: Not all pathogens are coliforms False positives: Not all coliforms are pathogens Goal: Panel of qPCR assays, based on metagenomic surveys Module 7 bioinformatics.ca

Case study Design bioinformatics.ca Control (Protected) Watershed Human- Impacted Watershed Agriculturally-Impacted Watershed S1: Upstream Case study Design S2: Pollution S3: Downstream STUDY DESIGN – SAMPLING YEAR ONE 96 water samples collected over 1 year, plus additional hourly time courses Bioinformatics analysis, with metadata, modeling, biomarker identification Illumina sequencing of microbial DNA (and viral RNA) – 16S, 18S, CPN60, shotgun seq Water filtered for microbes 30 Module 7 bioinformatics.ca 30

Case study: Identifying biomarkers of water quality Microbiome survey (taxon & gene profiles) One sampling site upstream of contamination Differential features Two sampling sites downstream of contamination qPCR panel test Monthly water samples (with positive and negative controls) Module 7 bioinformatics.ca

Case Study – Categorical Biomarker from Bacterial Shotgun Data Example of fast track to marker ID and PCR test development Bacterial shotgun data using Illumina HiSeq Using MetaPhlAn High precision, Low sensitivity Based on select, clade-specific gene sequences 3000 reference genomes Fast: 3 million reads (100-150 bp) in 10 minutes Metaphlan is a fast taxonomic classification software developed by Huttenhower lab at Harvard. It relies on unique, clade-specific gene sequences identified from 3,000 reference genomes. It was used in the Human Microbiome Project main paper for species-level metagenomic profiling. Watershed Metagenomics SAB Meeting January 24, 2014

Case Study: Fast track approach Step 1. Process and validate data Quality trimmed and normalised data across samples MOCK COMMUNITY (POSITIVE CONTROL) VALIDATION Validated with mock community: DNA-free water spiked with DNA from multiple taxa of lab-cultured bacteria 7% of reads were assigned by MetaPhlAn to a species (i.e. the expected low sensitivity for such a fast, precise approach) Of those, 84% were correctly assigned Low quality: sliding window with Q of at least average 20 over a sliding window of 10 bp. Then the adapters were trimmed from the high-quality reads (allowing 20% mismatch/Ns) Trimmed low quality reads Trimmed adapter sequences Filtered short reads Discarded reverse paired reads (redundant data) Assessed amount of data left Positive Bacterial Control 1: 66,815 reads classified out of 901,344 (7.4%) Positive Bacterial Control 2: 419,720 reads classified out of 5,797,824 (7.2%) Strains in mock community: Pseudomonas aeruginosa PAO1 Streptomyces coelicolor A3(2) Rhodobacter capsulatus SB1003 7.3% of reads were assigned Most common incorrect: Genus level: Burkholderia 2.6% Frankia 1.8% Thioalkalivibrio 1.6% Species level: Desulfovibrio desulfuricans 0.75% Module 7 bioinformatics.ca Watershed Metagenomics SAB Meeting January 24, 2014

Case Study: Fast track approach 2. Identify differential taxa Top 50 most “abundant” taxa Taxon 1 Taxon 2 Upstream At Site & Downstream Prioritised high abundance taxa Use White’s non-parametric t-test with false discovery rate multiple test correction to find differentially abundant taxa (alternative: RandomForests) Figure on left are the top 50 most “abundant” microbes. Values are proportions of reads assigned to each species White’s non-parametric t-test: Non-parametric test proposed by White et al. for clinical metagenomic data. This test uses a permutation procedure to remove the normality assumption of a standard t-test. In addition, it uses a heuristic to identify sparse features which are handled with Fisher’s exact test and a pooling strategy when either group consists of less than 8 samples. See White et al., 2009 for details. Watershed Metagenomics SAB Meeting January 24, 2014

Case Study: Fast track approach 3 Case Study: Fast track approach 3. Identify sequences characteristic of taxa 57,016 reads assigned to Taxon 1 2,176 reads assigned to Taxon 2 Prioritised taxon 1 due to our research indicating high abundance taxa are most accurately predicted Extracted taxon-specific sequences from MetaPhlAn database 607 sequences for Taxon 1 Aligned reads against these sequences Chose regions of MetaPhlAn sequences with most hits Watershed Metagenomics SAB Meeting January 24, 2014

Case Study: Fast track approach 3. Identify sequences characteristic of taxa MetaPhlAn marker sequence Our reads aligned with quite a bit of depth to the first part of the sequence, but not the second part. This is because the marker is from Polynucleobacter necessarius subsp. asymbioticus QLW-P1DMWA-1. The second part of the sequence is different in Polynucleobacter necessarius subsp. necessarius STIR1, so this suggests that the strain present in our samples is not asymbioticus This marker is from LSU ribosomal protein L10P Consensus sequence Highest Coverage = candidate marker sequence Module 7 bioinformatics.ca Watershed Metagenomics SAB Meeting January 24, 2014

Case Study: Fast track approach 4 Case Study: Fast track approach 4. Design qPCR primers & probes from marker seq. Used Primer3 for primer and probe design Checked in silico relative occurrence rates of candidate primers and probes Considered matches that are exact or have 1-2 mismatches Chose sequences that minimize non-specific matches Lab: Confirmed we can amplify a product of the right size Forward primer Probe Reverse primer At site & downstream Upstream Module 7 bioinformatics.ca Watershed Metagenomics SAB Meeting January 24, 2014

“Fast track” Case Study Comments ID’d a marker based on differential abundance of a bacterial species  being used to pilot our iterative validation process Benefits of this approach Fast: Sequence data to PCR primers in a couple days Doesn’t require large amounts of processing power Limitations of this approach Depends on differential abundance of known bacteria (if the differential bacteria aren’t highly similar to those in the MetaPhlAn database, this approach won’t work) Based on taxa, which have been shown to be more variable across environments than (gene) functions A “low hanging fruit” approach; a good first step Watershed Metagenomics SAB Meeting January 24, 2014

Case Study Alternative Approach: More complete analysis of bacterial shotgun seq. More complete taxonomic analysis Kraken, Discribinate (See Peabody et al. 2015) Gene function analysis MEGAN4 (using SEED, KEGG databases) Cluster based analysis: Get predicted proteins Cluster Find differential clusters Design PCR Watershed Metagenomics SAB Meeting January 24, 2014

From Biomarker to Lab Test Identify discriminative taxa or functions Identify informative region for primer design (CD-HIT to cluster reads by identity) Design primers (Primer-BLAST or IDT Realtime PCR Tool) Primer 1 Primer 2 Validate primers in silico (Primer Prospector, Primer-BLAST) Validate primers in vitro (qPCR)

Remember other markers… Community diversity as an indicator of ecosystem health Microbiome analysis may suggest other types of screening tests… (metabolites, etc) Markers are only as good as the data they are based on, so design experiments carefully, include +ve and -ve controls

Final Comments Use controls, careful experimental design Note variation in accuracy of different metagenomics analysis methods Appreciate biases, limitations re what’s in sequence databases Carefully examine data, methods and biases when comparing across different datasets (i.e. from different research groups) Careful, considerate analysis can really pay off

Final Comments Use controls, careful experimental design Note variation in accuracy of different metagenomics analysis methods Appreciate biases, limitations re what’s in sequence databases Carefully examine data, methods and biases when comparing across different datasets (i.e. from different research groups) Careful, considerate analysis can really pay off

Questions?