Canadian Bioinformatics Workshops


1 Canadian Bioinformatics Workshops

3 Module 7 Microbiome biomarker discovery
Fiona Brinkman
Analysis of Metagenomic Data, June 22-24, 2016
Thank you! Anamaria Crisan, Mike Peabody, Thea Van Rossum, John Parkinson

4 Learning objectives of the module
Appreciate what biomarkers are and their utility
Know the basics of how one can identify biomarkers from microbiome data
Be aware of examples, including a case study, of biomarkers identified from microbiome data
Appreciate the importance of careful, conservative microbiome/biomarker analysis

5 What are biomarkers?
bi·o·mark·er /ˈbīōˌmärkər/: A measurable biological property that can be indicative of some phenomenon, such as an infection, disease, or environmental disturbance.
Functional biomarkers: Biological functions (genes, proteins, metabolites…) specific to a single organism or shared among multiple organisms.
Taxonomic biomarkers: Can be a specific species or a category of organisms, including an Operational Taxonomic Unit (OTU).

6 Why identify biomarkers?
Detect or diagnose a phenotype more quickly, cheaply and/or accurately than with whole-metagenome sequencing.
“Bugs as Drugs”
Growing success stories…

7 Biomarker success stories - Asthma

8 Biomarker applications are many…
Bowel: Differentiating Inflammatory Bowel Disease from related diseases and detecting increasing disease severity; detecting colorectal cancer
Lung: Progression of chronic obstructive pulmonary disease (COPD)
Breast: Human milk microbiome protecting against mastitis
Environmental biomarkers (pollution, ecosystem health, …)

9 What is biomarker selection?
The process of removing non-informative or redundant OTUs, taxa, or gene sequences in order to identify the sequences that differ in abundance between two conditions.

10 How do we find biomarkers?
DISCOVERY
Bioinformatics software: Takes the raw digitized genomic data and performs QC and quantification of different sequences (based on taxa or genes)
Use math: Applied statistical methods can help us find useful biomarkers
VALIDATION
Design primers: Biological “hooks” that pick out our biomarker sequence of interest from a sample
qPCR: Measures how many times the primers (hooks) manage to snag our biomarker of interest

11 What are biomarkers and how do we find them?
1. Plan analysis: what “bio”? what “mark”?
2. Obtain biological samples
3. Extract & sequence DNA
4. Identify potential biomarkers (taxa, OTUs or functional genes more/less abundant in test versus control condition)
5. Validate potential biomarkers (in silico, then in vitro)
6. Iteratively further optimize biomarkers

12 Microbiome analysis options for biomarker ID
A) Bacteria
  Shotgun or 16S amplicon
  Best studied, most methods developed
B) Viruses
  Shotgun or amplicon (RdRp, g23)
  Can be challenging to get enough DNA
  Host-specificity and population “bursts” hold promise
C) Eukaryotes
  Amplicon (18S, ITS) (large genomes make shotgun difficult)
  Well studied, many methods developed
Combinations!

13 What kind of biomarker do we want?
A) TAXONOMIC
  Can use amplicon or shotgun data
  Strain-level diversity can lead to false positives/negatives
  More variable across environments (for better or worse)
B) GENE-BASED
  Requires shotgun data (DNA or RNA)
  Need good sequencing depth to reach specialised genes
  Domain-based gene architecture can be tricky
C) OTHER?
  Combinations! Diversity metrics, using microbiome analysis to suggest other metabolic markers, etc.

14 [Figure: histograms of sample frequency versus abundance for OTU1 and OTU2]

15 What makes a good biomarker?
In this example we want to find biomarkers that separate between the red and blue class labels.
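One way to put a number on "separates the two classes" is the probability that a randomly chosen sample from one class has a higher abundance than a randomly chosen sample from the other (equivalent to the area under the ROC curve). The sketch below is an illustration only: the function name and the abundance values are hypothetical and do not come from the slides.

```python
def auc_separation(group_a, group_b):
    """Probability that a random value from group_a exceeds a random value
    from group_b, counting ties as half a win (equals the ROC AUC)."""
    wins = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(group_a) * len(group_b))


# Hypothetical relative abundances of two OTUs in "red" and "blue" samples.
red_otu1 = [0.30, 0.35, 0.28, 0.40]
blue_otu1 = [0.05, 0.08, 0.02, 0.10]
red_otu2 = [0.10, 0.22, 0.15, 0.30]
blue_otu2 = [0.12, 0.20, 0.18, 0.28]

score_otu1 = auc_separation(red_otu1, blue_otu1)  # classes do not overlap
score_otu2 = auc_separation(red_otu2, blue_otu2)  # classes overlap heavily

print(f"OTU1 separation: {score_otu1:.2f}")
print(f"OTU2 separation: {score_otu2:.2f}")
```

A score near 1.0 or 0.0 means the feature separates the classes well; a score near 0.5 means its distribution is about the same in both classes, so it is a poor biomarker candidate.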

16 Biomarker selection relies on statistical techniques
Range from the very simple (like a t-test) to more complex…
You can write your own statistical analysis using R or equivalent
OR
You can use more complex methods developed by others:
LEfSe: implemented as a convenient Galaxy workflow (Segata et al. 2011)
metagenomeSeq: implemented using R (Paulson et al. 2013)

17 LEfSe - LDA Effect Size for biomarker selection
High-dimensional biomarker discovery and explanation. Identifies genomic features (genes, pathways, taxa) that characterize the differences between two or more biological conditions/classes.
First: Identifies features that differ statistically among biological classes (non-parametric factorial Kruskal-Wallis (KW) sum-rank test).
Second: Performs pairwise tests among subclasses using the (unpaired) Wilcoxon rank-sum test, to assess whether the differences are consistent with respect to expected biological behavior.
Third: Uses Linear Discriminant Analysis (LDA) to estimate the effect size of each differentially abundant feature (and for dimensional reduction if desired).
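To make the first step concrete, here is a minimal pure-Python sketch of a Kruskal-Wallis test on one feature. It is not LEfSe's own code: the function names are made up for this example, the abundances are hypothetical, and the p-value helper only covers the two-class case (one degree of freedom). The Wilcoxon and LDA steps are not shown.

```python
import math


def midranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of sample groups."""
    pooled = [value for group in groups for value in group]
    n = len(pooled)
    ranks = midranks(pooled)
    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    tie_counts = {}
    for value in pooled:
        tie_counts[value] = tie_counts.get(value, 0) + 1
    correction = 1.0 - sum(c ** 3 - c for c in tie_counts.values()) / (n ** 3 - n)
    return h / correction if correction > 0 else 0.0


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Hypothetical abundances of one feature in two classes.
control = [1, 2, 3, 4, 5]
test_condition = [6, 7, 8, 9, 10]

h_stat = kruskal_wallis_h([control, test_condition])
p_value = chi2_sf_df1(h_stat)  # valid here because there are exactly two classes
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```

In a real analysis this test would be run once per feature, and the features passing the significance threshold would move on to the pairwise and effect-size steps.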

18 https://huttenhower.sph.harvard.edu/galaxy/tool_runner

19 Biomarker selection relies on statistical techniques
KEY: Understand the statistical methods, especially:
Your assumptions about the data
The statistical method’s assumptions about the data
The statistical method’s limitations
How to interpret your results from the output

20 The best approach to use depends on your research question

21 Considerations for your statistical analysis …
Discrete/categorical or continuous variables? Do your samples involve known classes or do you not know how many classes there are?

22 Generally, statistical techniques either
try to predict labels or continuous values

23 And…they also generally belong to one of two categories
(Lied a little: semi-supervised methods also exist)

26 How do we validate our biomarkers?
Once you identify the gene or taxonomic group you want to use as a biomarker, you need to design a test for it. (q)PCR is a good option:
1. Identify a biomarker-specific sequence
  Use a marker-based tool (e.g. MetaPhlAn2)
  OR
  Cluster reads or align them to find conserved sequences, verifying that the “representative” sequence is selective
2. Design primers (& probe)
  PrimerProspector: Designs primers from a sequence alignment
  PrimerBLAST: Designs primers specific to a clade

27 Case Study

28 Case Study
Improved pollution/pathogen detection, source tracking: Using metagenomics to identify markers of water quality
Will Hsiao, Patrick Tang, Matt Coxen, Thea Van Rossum, Mike Peabody
Univ. of Sask.: Janet Hill
NRC-PBI: Sean Hemmingsen
MSFHR: Bev Holmes
McGill University: Bartha Knoppers, Vural Ozdemir, Yann Joly

29 Why Watershed Metagenomics? An ecosystem approach to water quality monitoring…
The current emphasis of water quality testing is at the tap…
…but we should be looking more at the source.
The fecal coliform test can’t identify the source and is inaccurate.
False negatives: Not all pathogens are coliforms
False positives: Not all coliforms are pathogens
Goal: Panel of qPCR assays, based on metagenomic surveys
Module 7 bioinformatics.ca

30 Case study design
Watersheds: Control (protected) watershed, human-impacted watershed, agriculturally-impacted watershed
Sampling sites: S1 upstream, S2 pollution, S3 downstream
STUDY DESIGN, SAMPLING YEAR ONE
96 water samples collected over 1 year, plus additional hourly time courses
Water filtered for microbes
Illumina sequencing of microbial DNA (and viral RNA): 16S, 18S, CPN60, shotgun seq
Bioinformatics analysis, with metadata, modeling, biomarker identification

31 Case study: Identifying biomarkers of water quality
Monthly water samples (with positive and negative controls): one sampling site upstream of contamination, two sampling sites downstream of contamination
Microbiome survey (taxon & gene profiles)
Differential features
qPCR panel test

32 Case Study: Categorical Biomarker from Bacterial Shotgun Data
Example of a fast track to marker ID and PCR test development
Bacterial shotgun data using Illumina HiSeq
Using MetaPhlAn:
  High precision, low sensitivity
  Based on select, clade-specific gene sequences
  3000 reference genomes
  Fast: 3 million reads ( bp) in 10 minutes
MetaPhlAn is fast taxonomic classification software developed by the Huttenhower lab at Harvard. It relies on unique, clade-specific gene sequences identified from 3,000 reference genomes. It was used in the Human Microbiome Project main paper for species-level metagenomic profiling.
Watershed Metagenomics SAB Meeting January 24, 2014

33 Case Study: Fast track approach Step 1. Process and validate data
Quality trimmed and normalised data across samples:
  Trimmed low-quality reads (required an average quality score (Q) of at least 20 over a sliding window of 10 bp)
  Trimmed adapter sequences from the high-quality reads (allowing 20% mismatches/Ns)
  Filtered short reads
  Discarded reverse paired reads (redundant data)
  Assessed amount of data left
MOCK COMMUNITY (POSITIVE CONTROL) VALIDATION
Validated with a mock community: DNA-free water spiked with DNA from multiple taxa of lab-cultured bacteria
Strains in mock community: Pseudomonas aeruginosa PAO1, Streptomyces coelicolor A3(2), Rhodobacter capsulatus SB1003
7.3% of reads were assigned by MetaPhlAn to a species (i.e. the expected low sensitivity for such a fast, precise approach)
  Positive Bacterial Control 1: 66,815 reads classified out of 901,344 (7.4%)
  Positive Bacterial Control 2: 419,720 reads classified out of 5,797,824 (7.2%)
Of those, 84% were correctly assigned
Most common incorrect assignments:
  Genus level: Burkholderia 2.6%, Frankia 1.8%, Thioalkalivibrio 1.6%
  Species level: Desulfovibrio desulfuricans 0.75%
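The sliding-window quality rule above (average Q of at least 20 over 10 bp) can be sketched as follows. The slides do not name the trimming tool or say exactly where the read is cut, so this is an assumption-laden illustration: the function name is hypothetical, and the read is cut at the start of the first window that fails the threshold.

```python
def sliding_window_trim(seq, quals, window=10, min_mean_q=20):
    """Cut a read at the start of the first window whose mean quality is
    below min_mean_q. Returns the kept (sequence, qualities)."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have the same length")
    if len(seq) < window:
        # Read is shorter than one window: keep it only if its overall mean passes.
        passes = bool(quals) and sum(quals) / len(quals) >= min_mean_q
        keep = len(seq) if passes else 0
        return seq[:keep], quals[:keep]
    for start in range(len(seq) - window + 1):
        window_mean = sum(quals[start:start + window]) / window
        if window_mean < min_mean_q:
            return seq[:start], quals[:start]
    return seq, quals


# Hypothetical 20 bp read whose quality collapses after base 12.
read_seq = "ACGT" * 5
read_quals = [30] * 12 + [5] * 8

trimmed_seq, trimmed_quals = sliding_window_trim(read_seq, read_quals)
print(len(read_seq), "bp before trimming,", len(trimmed_seq), "bp after")
```

Reads that end up very short after this step would then be dropped by the "filtered short reads" step; the length cutoff used in the study is not given in the slides.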

34 Case Study: Fast track approach 2. Identify differential taxa
[Figure: top 50 most “abundant” taxa, upstream versus at site & downstream; labels: Taxon 1, Taxon 2. Values are proportions of reads assigned to each species.]
Prioritised high-abundance taxa
Used White’s non-parametric t-test with false discovery rate multiple-test correction to find differentially abundant taxa (alternative: RandomForests)
White’s non-parametric t-test: A non-parametric test proposed by White et al. for clinical metagenomic data. It uses a permutation procedure to remove the normality assumption of a standard t-test. In addition, it uses a heuristic to identify sparse features, which are handled with Fisher’s exact test and a pooling strategy when either group consists of fewer than 8 samples. See White et al., 2009 for details.
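The two ideas on this slide, a permutation test that avoids the normality assumption and a false discovery rate correction across many taxa, can be sketched in a few lines. This is a simplified illustration, not the procedure of White et al.: it permutes the difference in group means rather than a t statistic, it omits the Fisher's exact test handling of sparse features, it uses the Benjamini-Hochberg adjustment as one common FDR method, and all names and numbers are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return (extreme + 1) / (n_permutations + 1)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical read proportions for one taxon, upstream vs. downstream.
upstream = [0.30, 0.35, 0.28, 0.40, 0.33]
downstream = [0.05, 0.08, 0.02, 0.10, 0.06]
p_taxon = permutation_p_value(upstream, downstream)

# Hypothetical raw p-values for four taxa, corrected together.
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(p_taxon, q_values)
```

The correction matters because testing dozens of taxa at once would otherwise produce some "significant" differences by chance alone.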

35 Case Study: Fast track approach 3. Identify sequences characteristic of taxa
57,016 reads assigned to Taxon 1
2,176 reads assigned to Taxon 2
Prioritised Taxon 1 because our research indicates that high-abundance taxa are most accurately predicted
Extracted taxon-specific sequences from the MetaPhlAn database: 607 sequences for Taxon 1
Aligned reads against these sequences
Chose regions of MetaPhlAn sequences with the most hits

36 Case Study: Fast track approach 3. Identify sequences characteristic of taxa
[Figure: reads aligned to a MetaPhlAn marker sequence, with the consensus sequence. Highest coverage = candidate marker sequence.]
Our reads aligned with considerable depth to the first part of the marker sequence, but not to the second part. The marker is from Polynucleobacter necessarius subsp. asymbioticus QLW-P1DMWA-1, and the second part of the sequence is different in Polynucleobacter necessarius subsp. necessarius STIR1, which suggests that the strain present in our samples is not asymbioticus.
This marker is from LSU ribosomal protein L10P.
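Picking the region with the highest coverage can be sketched as computing per-base depth from the read alignments and then sliding a window along the marker. The function names, the coordinate convention (0-based start, exclusive end), the window size, and the alignment coordinates below are all assumptions made for this example; the slides do not describe how the region was chosen computationally.

```python
def coverage_depth(marker_length, alignments):
    """Per-base read depth from (start, end) alignments on the marker,
    using 0-based starts and exclusive ends."""
    change = [0] * (marker_length + 1)
    for start, end in alignments:
        start = max(0, start)
        end = min(marker_length, end)
        if start < end:
            change[start] += 1  # depth goes up where a read begins
            change[end] -= 1    # and down just past where it ends
    depth = []
    running = 0
    for position in range(marker_length):
        running += change[position]
        depth.append(running)
    return depth


def best_window(depth, window):
    """Start index and mean depth of the window with the highest mean coverage."""
    if window > len(depth):
        raise ValueError("window is longer than the marker")
    current = sum(depth[:window])
    best_start, best_sum = 0, current
    for start in range(1, len(depth) - window + 1):
        current += depth[start + window - 1] - depth[start - 1]
        if current > best_sum:
            best_start, best_sum = start, current
    return best_start, best_sum / window


# Hypothetical alignments on a 20 bp marker: most reads hit the first half.
alignments = [(0, 8), (2, 10), (3, 9), (12, 15)]
depth = coverage_depth(20, alignments)
region_start, region_mean_depth = best_window(depth, window=5)
print(depth)
print(region_start, region_mean_depth)
```

A depth profile that drops off partway along the marker, as in this toy example, is the same pattern described on the slide for the Polynucleobacter marker.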

37 Case Study: Fast track approach 4. Design qPCR primers & probes from marker seq.
Used Primer3 for primer and probe design
Checked in silico relative occurrence rates of candidate primers and probes (at site & downstream versus upstream)
  Considered matches that are exact or have 1-2 mismatches
  Chose sequences that minimize non-specific matches
Lab: Confirmed we can amplify a product of the right size
[Figure: forward primer, probe and reverse primer positions]
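The in silico check, counting how often a candidate primer matches exactly or with 1-2 mismatches, can be illustrated with a simple Hamming-distance scan. This is a sketch under stated assumptions: the function names and the sequences are hypothetical, only substitutions on the given strand are considered (no indels, no reverse complement), and it is not the pipeline used in the study.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def count_primer_sites(primer, sequence, max_mismatches=2):
    """Count positions where the primer matches the sequence with at most
    max_mismatches substitutions."""
    k = len(primer)
    hits = 0
    for i in range(len(sequence) - k + 1):
        if hamming(primer, sequence[i:i + k]) <= max_mismatches:
            hits += 1
    return hits


# Hypothetical primer, target read and non-target read.
primer = "ACGTTGCA"
target_read = "GGACGTTGCATT"      # contains the primer exactly
non_target_read = "GGACGATGCTTT"  # contains a site with 2 mismatches

for label, read in [("target", target_read), ("non-target", non_target_read)]:
    exact = count_primer_sites(primer, read, max_mismatches=0)
    near = count_primer_sites(primer, read, max_mismatches=2)
    print(f"{label}: exact={exact}, up to 2 mismatches={near}")
```

Running the same count over reads from the upstream and downstream samples gives the relative occurrence rates; a good candidate is frequent where the marker taxon is abundant and rare elsewhere, including when near matches are allowed.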

38 “Fast track” Case Study Comments
Identified a marker based on differential abundance of a bacterial species; it is being used to pilot our iterative validation process
Benefits of this approach:
  Fast: Sequence data to PCR primers in a couple of days
  Doesn’t require large amounts of processing power
Limitations of this approach:
  Depends on differential abundance of known bacteria (if the differential bacteria aren’t highly similar to those in the MetaPhlAn database, this approach won’t work)
  Based on taxa, which have been shown to be more variable across environments than (gene) functions
A “low hanging fruit” approach; a good first step

39 Case Study Alternative Approach: More complete analysis of bacterial shotgun seq.
More complete taxonomic analysis: Kraken, DiScRIBinATE (see Peabody et al. 2015)
Gene function analysis: MEGAN4 (using SEED, KEGG databases)
Cluster-based analysis:
  Get predicted proteins
  Cluster
  Find differential clusters
  Design PCR

40 From Biomarker to Lab Test
Identify discriminative taxa or functions
Identify informative region for primer design (CD-HIT to cluster reads by identity)
Design primers (Primer-BLAST or IDT Realtime PCR Tool)
Validate primers in silico (Primer Prospector, Primer-BLAST)
Validate primers in vitro (qPCR)

41 Remember other markers…
Community diversity as an indicator of ecosystem health
Microbiome analysis may suggest other types of screening tests… (metabolites, etc.)
Markers are only as good as the data they are based on, so design experiments carefully and include +ve and -ve controls
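As one example of a diversity-based marker, the Shannon index summarizes how evenly reads are spread across taxa. The slides do not say which diversity metric was used, so the choice of index, the function name, and the counts below are assumptions for illustration.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    h = 0.0
    for count in counts:
        if count > 0:
            p = count / total
            h -= p * math.log(p)
    return h


# Hypothetical read counts per taxon in two samples.
even_community = [25, 25, 25, 25]   # four equally abundant taxa
dominated_community = [97, 1, 1, 1]  # one taxon dominates

h_even = shannon_index(even_community)
h_dominated = shannon_index(dominated_community)
print(f"even: {h_even:.3f}, dominated: {h_dominated:.3f}")
```

A sample dominated by one taxon scores much lower than an even one, which is why a drop in diversity between sites can itself be worth following up as an indicator.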

42 Final Comments
Use controls and careful experimental design
Note variation in accuracy of different metagenomics analysis methods
Appreciate biases and limitations regarding what’s in sequence databases
Carefully examine data, methods and biases when comparing across different datasets (e.g. from different research groups)
Careful, considerate analysis can really pay off

46 Questions?

