Martin Wiedmann Cornell University

hqSNP, wgMLST and the WGS alphabet soup: what epidemiologists need to know
E-mail: mw16@cornell.edu

Outline
Review of genomes, genes, and evolution
Use of sequence data to assess relatedness of organisms
Data analysis approaches: wgMLST and hqSNP
Trees and how to interpret them

What is a SNP?
Single Nucleotide Polymorphism (SNP)
  ATGTTCCTC  sequence
  ATGTTGCTC  reference
  *phylogenetically informative differences
Insertion or Deletion (Indel)
  ATGTTCCCTC  sequence
  ATGTTC-CTC  reference
  *differences not used in hqSNP analysis
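The SNP/indel distinction above can be sketched in a few lines of Python. This is a minimal illustration that assumes the two sequences are already aligned (gaps written as "-"); the function name `classify_differences` is hypothetical and not part of any pipeline named in this talk.

```python
def classify_differences(sequence: str, reference: str) -> dict:
    """Split the differences in a pairwise alignment into SNPs and indels.

    Both strings must be aligned (same length, gaps written as '-').
    Positions are reported 1-based.
    """
    if len(sequence) != len(reference):
        raise ValueError("aligned sequences must have the same length")
    snps, indels = [], []
    for pos, (s, r) in enumerate(zip(sequence, reference), start=1):
        if s == r:
            continue
        if s == "-" or r == "-":
            indels.append(pos)   # gap in either sequence: insertion or deletion
        else:
            snps.append(pos)     # two different nucleotides: a SNP
    return {"snps": snps, "indels": indels}


# The two examples from the slide
snp_example = classify_differences("ATGTTCCTC", "ATGTTGCTC")
indel_example = classify_differences("ATGTTCCCTC", "ATGTTC-CTC")
```

For the slide's examples, the first comparison yields one SNP (position 6) and the second yields one indel position and no SNPs, which is why it would not contribute to an hqSNP count.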

Microbial evolution 101 – mechanisms of change
Point mutations
  ACCCTCTAGTAGTAGCA
  ACCATCTAGTAGTAGCA
  ACCCTCTAGTAGTAGCA
  1 SNP and one “genetic event”

Insertion or deletion
  ACCCTCTAGTAGTAGCA
  ACCATCTAG TAGCA
  ACCCTCTAGTAGTAGTAGCA
  3 differences (?) and one “genetic event”

Inversion
  ACCCTCTAGTAGTAGCA
  ACCATCTCGTAGTAGCA
  ACCCTCTAGTAGTAGCA
  Alignment:
  ACCATCTCGTAGTAGCA
  ACCCTCTAGTAGTAGCA
  2 SNPs and one “genetic event”

Horizontal gene transfer of homologous gene sequences
  ACCCTCTAGTACTAGCATCC
  TCCCTCTTGTCCTACCATCA
  Transferred fragment: CTTGTCCTACCA
  Alignment:
  ACCCTCTAGTACTAGCATCC
  ACCCTCTTGTCCTACCATCC
  3 SNPs and 1 genetic event
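One way to see the gap between "number of SNPs" and "number of genetic events" is to group SNPs that sit close together. The sketch below is only an illustration: the function names and the `max_gap` value are hypothetical, and real recombination detection tools use statistical models rather than a fixed distance.

```python
def snp_positions(seq_a: str, seq_b: str) -> list:
    """Return 1-based positions where two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b), start=1) if a != b]


def group_into_events(positions, max_gap=5):
    """Group SNPs separated by at most max_gap bases into one candidate event.

    max_gap is an illustrative value, not a validated threshold.
    """
    events = []
    for pos in sorted(positions):
        if events and pos - events[-1][-1] <= max_gap:
            events[-1].append(pos)   # close to the previous SNP: same event
        else:
            events.append([pos])     # far from the previous SNP: new event
    return events


# Horizontal gene transfer example from the slide
positions = snp_positions("ACCCTCTAGTACTAGCATCC", "ACCCTCTTGTCCTACCATCC")
events = group_into_events(positions)
```

Here the three SNPs (positions 8, 11 and 15) fall into a single group, matching the slide's "3 SNPs and 1 genetic event".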

Transformation Transduction

Case study – why does it matter
Human listeriosis outbreak in 2000 with 29 cases
Isolates show 1 SNP difference to the food and human isolates from a single case linked to processing facility X in 1988
Epidemiology supports that this facility was the source of the outbreak
Some analysis approaches that did not account for recombination would have shown that the human isolates differ by approx. 3,000 SNPs from the 1988 food isolate from facility X
Why: a large recombination event introduced a large prophage (a virus inserted into the bacterial genome)

Use of sequence data to assess relatedness of organisms
Data analysis approaches: kSNP, wgMLST, and hqSNP
Trees and how to interpret them

Use of sequence data to assess relatedness of organisms
Differences in sequences can be used to assess relatedness of organisms and the likelihood of a recent common ancestor
  “Do the M. tuberculosis isolates from patient A and patient B share a recent common ancestor?”
Definition of “recent” becomes important: recent in years or in generation times
  Salmonella in a dry processing plant may stay dormant and rarely if ever multiply (or imagine anthrax spores in soil)
  Salmonella in a chicken flock may multiply every 30 min (>7,500 times a year)
Assessing relationships of microbial isolates typically requires more information than just sequence data
  Information on epidemiological relationships and other relevant data is essential

Use of sequence data to assess relatedness of organisms
Data analysis approaches: kSNP, wgMLST, hqSNP, and others
Trees and how to interpret them

Basics of WGS Analyses
Different ways to compare the genomes of 2 different isolates
  Compare the genomes small piece by small piece to find pieces that are different: k-mer based analyses
  Use a high quality (reference) sequence or genome to identify differences: hqSNP analysis
  Compare genomes on a gene-by-gene (locus-by-locus) basis: wgMLST analysis
All of these analyses can provide an output that gives the “number of differences” or can be used to build trees
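To make the "small piece by small piece" idea concrete, here is a minimal k-mer comparison sketch. It uses a Jaccard distance on k-mer sets, which is one common way to compare k-mer content; it is not a description of how kSNP or any specific tool works, and the function names and the default `k=5` are illustrative assumptions (real analyses use much longer k-mers).

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of all overlapping k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard_distance(seq_a: str, seq_b: str, k: int = 5) -> float:
    """1 - (shared k-mers / all k-mers); 0.0 means identical k-mer content."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    if not union:
        raise ValueError("sequences are shorter than k")
    return 1 - len(a & b) / len(union)


# Two sequences from the point-mutation slide: one SNP changes several k-mers
d = kmer_jaccard_distance("ACCCTCTAGTAGTAGCA", "ACCATCTAGTAGTAGCA")
```

Note that a single SNP alters every k-mer that overlaps it, so k-mer distances are not a direct SNP count.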

What makes a SNP high quality (hq)?
Sequence reads
Apply a quality filter that filters out nucleotides in sequence reads for comparison based on sequence coverage and quality
Quality-filtered sequence reads, ready for analysis

The alphabet soup of analysis – Coverage
NGS generates 100,000 or more reads per genome sequenced
Any single location on the genome can be covered by zero to hundreds of sequence reads
[Figure: read pileups at 40x coverage and at 5x coverage]
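Average coverage follows from simple arithmetic: total sequenced bases divided by genome size. The sketch below assumes 150-bp reads and a 3,000,000-bp genome purely for illustration; neither number comes from the slides, and `mean_coverage` is a hypothetical helper name.

```python
def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth = total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


# Illustrative numbers only (150-bp reads, 3 Mb genome)
low = mean_coverage(100_000, 150, 3_000_000)    # 5.0, i.e. 5x
high = mean_coverage(800_000, 150, 3_000_000)   # 40.0, i.e. 40x
```

This is only an average; as the slide notes, depth at any single position can range from zero to hundreds of reads.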

What to call a SNP
  ATGTTACTC
  ATGTTCCTC
  ATGTTTCTC
  ATGTTCCTC
  ATGTTCCTC
  ATGTTGCTC
  ATGTTGCTC  reference
Is it a SNP?
SNPs called based on:
  Quality
  Coverage
  Base frequency
The differences between the reference and the compared genome are extracted and used to determine relatedness
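A simplified version of this decision can be sketched as follows. The thresholds (`min_coverage=10`, `min_fraction=0.9`) and the function name `call_site` are assumptions chosen for illustration, not the settings of any particular hqSNP pipeline, and per-base quality scores are left out to keep the example short.

```python
from collections import Counter


def call_site(read_bases, ref_base, min_coverage=10, min_fraction=0.9):
    """Decide whether one genome position is a high-quality SNP.

    read_bases: the base each overlapping read shows at this position.
    Thresholds are illustrative; base quality is not modeled here.
    """
    depth = len(read_bases)
    if depth < min_coverage:
        return "no call (low coverage)"
    base, count = Counter(read_bases).most_common(1)[0]
    if count / depth < min_fraction:
        return "no call (mixed bases)"
    return "SNP" if base != ref_base else "reference"


# Position 6 of the six reads on the slide: A, C, T, C, C, G (reference G)
slide_site = call_site(list("ACTCCG"), "G")
```

With only six reads and the most common base (C) at 50%, the slide's example would fail both illustrative filters, which is the point of asking "Is it a SNP?".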

Where to call a SNP?
Not all SNP pipelines are equal: where you call SNPs will affect the total SNP count
SNPs relevant for phylogenetic analysis are vertically transmitted, not horizontally, so horizontally transferred genetic elements like phages can be masked
Three options illustrated on a genome with genes and mobile elements:
  Raw reads
  Mask mobile elements (do not consider SNPs in this location)
  Only call SNPs in genes
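Masking can be pictured as a simple position filter. The sketch below is a hedged illustration: `mask_snps` is a hypothetical helper, and the coordinates are made up to mimic a prophage region carrying many SNPs.

```python
def mask_snps(snp_positions, masked_regions):
    """Drop SNPs that fall inside masked regions (e.g., prophages).

    masked_regions: list of (start, end) tuples, inclusive, same coordinates
    as the SNP positions.
    """
    def is_masked(pos):
        return any(start <= pos <= end for start, end in masked_regions)

    return [pos for pos in snp_positions if not is_masked(pos)]


# Made-up example: three SNPs sit in a masked mobile element at 4000-6000
all_snps = [120, 5000, 5100, 5200, 90000]
kept = mask_snps(all_snps, [(4000, 6000)])
```

The SNP count drops from five to two once the mobile element is masked, which is why two pipelines can report different totals for the same pair of isolates.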

How to report SNP data – keep it simple
Hi folks:
New Cluster: Two isolates are 0 SNPs from each other:
  E (SE77B52)
  E (SE77B52)
New Cluster: Two isolates are 2 SNPs from each other:
  E (SE1B1)
  I (SE1B1)

Caveats of hqSNP analyses
Advantages
  Phylogenetically informative (builds a tree consistent with the evolution of the strains)
  May provide the highest amount of resolution for strain comparison
  SNP position can be identified on the genome (the gene affected can be identified)
Disadvantages
  Requires a closely related reference genome; hqSNP analysis is problematic if the reference genome is not closely related
  Takes a while and requires a lot of computing power
  Interpretation of data depends on the genomes added; it is not stable and does not lead to nomenclature
When to use
  Good for situations where a wgMLST database has not been developed and validated

Traditional MLST
~6-12 housekeeping genes; usually a portion of each gene
Developed in the era of Sanger sequencing, providing improved discrimination over sequencing 1 gene
Targets selected to represent population structure; not as useful for outbreak detection
Schemes are available in international, publicly accessible databases such as MLST.NET
The combination of 6-12 genes is used to name a unique sequence type (e.g., MLST profile = ST1)

Whole genome multilocus sequence typing (wgMLST)
Database is built from gene content representing a diverse selection of the genus/species of the organism being compared
Each unique gene is referred to as a “locus”; a locus may include the entire gene or a piece of the gene
Any change (SNP, insertion, deletion) equals a new allele call for a locus
New alleles are named sequentially when encountered, not based on sequence
Locus 1:
  ACTAGAGGGAAA  allele 1
  ACTAGAGGCTAA  allele 2 (2 SNPs)
  ACT-GAGGGAAA  allele 3 (1 indel)
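Sequential allele naming can be sketched with a small lookup table. The class below is a hypothetical illustration of the principle, not the structure of any real wgMLST database: each new sequence seen at a locus simply receives the next free number.

```python
class AlleleDatabase:
    """Toy allele database: new sequences get the next number per locus."""

    def __init__(self):
        self._alleles = {}  # locus name -> {sequence: allele number}

    def call_allele(self, locus: str, sequence: str) -> int:
        # Alignment gaps are removed so the stored key is the actual sequence
        seq = sequence.upper().replace("-", "")
        known = self._alleles.setdefault(locus, {})
        if seq not in known:
            known[seq] = len(known) + 1  # named sequentially, not by content
        return known[seq]


# The three Locus 1 sequences from the slide
db = AlleleDatabase()
calls = [
    db.call_allele("locus_1", "ACTAGAGGGAAA"),
    db.call_allele("locus_1", "ACTAGAGGCTAA"),  # 2 SNPs, still one new allele
    db.call_allele("locus_1", "ACT-GAGGGAAA"),  # 1 indel, also one new allele
]
```

Note that the allele number carries no information about how different two alleles are: the 2-SNP variant and the 1-indel variant each count as a single allele difference.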

Whole genome multilocus sequence typing (wgMLST)
Allows for simpler analysis and clear naming of subtypes
Performs comparison on a gene-by-gene level
[Table: allele numbers for Isolate A, Isolate B and Isolate C at Locus 1 (20 nt), Locus 2 (100 nt), Locus 3 (5,000 nt), etc., through Locus 2,005 (5 nt), with the resulting wgMLST type (A or B) in the last row]

The alphabet soup of analysis - wgMLST
The allele calls at each locus are compared between isolates and differences are used to determine relatedness
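Counting allele differences between two isolates can be sketched as below. The profiles are made-up values, `allele_differences` is a hypothetical helper, and ignoring loci without a call in either isolate is an assumption about how missing data is handled.

```python
def allele_differences(profile_a: dict, profile_b: dict) -> int:
    """Count loci where both isolates have an allele call and the calls differ.

    Profiles map locus name -> allele number (None = no allele call).
    """
    shared_loci = profile_a.keys() & profile_b.keys()
    return sum(
        1
        for locus in shared_loci
        if profile_a[locus] is not None
        and profile_b[locus] is not None
        and profile_a[locus] != profile_b[locus]
    )


# Made-up profiles for two isolates
isolate_a = {"locus_1": 1, "locus_2": 8, "locus_3": 5, "locus_4": None}
isolate_b = {"locus_1": 1, "locus_2": 12, "locus_3": 2, "locus_4": 4}
n_diff = allele_differences(isolate_a, isolate_b)
```

In this toy case the two isolates differ at 2 loci; locus_4 is skipped because one isolate has no call there.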

“Allele Code” Pattern Naming in the Listeria Database
Pilot thresholds
  10% = 300 alleles
  5% = 150 alleles
  2.5% = 75 alleles
  1% = 30 alleles
  0.5% = 15 alleles
  0.25% = 7 alleles
Two isolates are the same:
  Patient 1:
  Patient 2:

The wgMLST “zip code”
Two isolates are the same:
  Patient 1:
  Patient 2:
Three isolates; patient 3 differs by 1 to 7 alleles from patients 1 and 2:
  Patient 3:
Four isolates; patient 4 differs by 8 to 15 alleles from the others:
  Patient 4:
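The link between an allele difference and the pilot thresholds can be sketched as a lookup of the tightest threshold that still contains a given difference. This only illustrates the binning (0 to 7, 8 to 15, and so on); it does not reproduce how the Listeria database actually assigns allele codes, and the function name is hypothetical.

```python
# Pilot thresholds from the slide, tightest first (numbers of alleles)
PILOT_THRESHOLDS = [7, 15, 30, 75, 150, 300]


def tightest_threshold(n_allele_differences: int):
    """Return the smallest pilot threshold that the difference fits within.

    Returns None when the difference exceeds the largest threshold.
    """
    if n_allele_differences < 0:
        raise ValueError("allele differences cannot be negative")
    for threshold in PILOT_THRESHOLDS:
        if n_allele_differences <= threshold:
            return threshold
    return None


# Patient 3 on the slide (1 to 7 alleles) and patient 4 (8 to 15 alleles)
patient_3_bin = tightest_threshold(5)
patient_4_bin = tightest_threshold(12)
```

Under this sketch, a 5-allele difference falls within the 7-allele level and a 12-allele difference within the 15-allele level.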

How to report wgMLST data – keep it simple
Hi folks:
New Cluster: Two isolates are 0 alleles from each other:
  E (SE77B52)
  E (SE77B52)
New Cluster: Two isolates are 2 alleles from each other:
  E (SE1B1)
  I (SE1B1)

How to report wgMLST data – give me the ZIP codes
Looks like we may have a cluster
  Patient 1:
  Patient 2:
  Patient 3:
  Patient 4:
  Patient 4:

MLST Analysis Faster than analyzing SNP differences
For WGS data, allele calls can be performed on short reads (“assembly-free”) and on assembled genomes (“assembly-based”)
If there is a conflict between the allele calls, then no allele call is made
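The "no call on conflict" rule can be sketched as below. The function name is hypothetical, and accepting a call when only one method produced a result is an assumption made for this example; the slides state only what happens when the two calls conflict.

```python
def consensus_allele(assembly_free, assembly_based):
    """Combine the two allele calls for one locus (None = no call).

    Assumption: if only one method produced a call, that call is used.
    From the slide: if the two calls conflict, no allele call is made.
    """
    if assembly_free is None:
        return assembly_based
    if assembly_based is None:
        return assembly_free
    if assembly_free == assembly_based:
        return assembly_free
    return None  # conflict between methods: no allele call


agree = consensus_allele(12, 12)      # both methods agree on allele 12
conflict = consensus_allele(12, 13)   # conflicting calls
```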

Advantages and Caveats of wgMLST analysis
Advantages
  Phylogenetically informative
  All virulence, serotyping, and antibiotic resistance genes can be pulled out as part of the analysis
  Neutralizes the effects of horizontal gene transfer (an event is counted only once rather than many times, as for hqSNPs)
  Allele calling is stable: data are standardizable and directly comparable between laboratories; can lead to nomenclature based on allele calls, which can be used for communication and automated cluster detection; reproducibility does not depend on the choice of reference strain; amenable to automated bioinformatics
Disadvantages
  Initial assignment of alleles is computationally costly (doing assemblies before calling alleles); CDC's system will call alleles directly from raw reads (~2 min); assemblies take about 2 h or perhaps longer; if there is a conflict between the allele calls, then no allele call is made
  Comparing character data (allele numbers) rather than genetic data
  SNPs and indels are treated equally
  Requires curation for allele calls
When to use
  Surveillance, especially for a distributed testing network
  Reference characterization
  Accurate cluster detection
  Need to communicate with partners using stable nomenclature

hqSNP versus MLST Analysis
Both analyses are conducted from the same raw data (typically short-read sequencing data)
For public health purposes, both correlate well, i.e., the outermost branches of the phylogenetic trees are almost identical
The two are not mutually exclusive
For some use cases MLST works better; for others SNP analysis works better

Interpreting analysis data – how to build trees using WGS analysis
Use WGS analysis to infer relatedness of isolates
For wgMLST: translate the number of allele differences between isolates into a measure of similarity and use that to infer branch lengths and relatedness
For hqSNP analysis: translate nucleotide differences between isolates into relatedness
  Can use substitution models to estimate the cost of changing from A>T, C>A, etc.
[Figure: substitutions among thymine, cytosine, adenine and guanine]
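As a hedged illustration of what a substitution model does, the sketch below applies the Jukes-Cantor correction, the simplest such model, in which every change (A>T, C>A, etc.) has the same cost. The slides do not say which model any given pipeline uses, so this is only one example of the idea.

```python
import math


def jukes_cantor_distance(n_differences: int, n_sites: int) -> float:
    """Jukes-Cantor corrected distance: d = -3/4 * ln(1 - 4p/3).

    p is the observed proportion of differing sites. The correction accounts
    for sites that may have changed more than once.
    """
    if n_sites <= 0:
        raise ValueError("n_sites must be positive")
    p = n_differences / n_sites
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# 10% observed differences correspond to slightly more than 0.10 substitutions per site
d = jukes_cantor_distance(10, 100)
```

For closely related outbreak isolates the observed proportion of differences is tiny, so the corrected and uncorrected values are nearly identical.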

How to report SNP data - trees
  1 ATATTCCGCAA
  2 ATATTCCGCAA
  3 ATATTGCGCAA
  4 ACCTTGCGCTA
[Figure: tree relating isolates 1 to 4]

Building the tree
Isolate    Sequence
A          ggagagtta
B          ggatccccc
C          ggattatta
D          actgccggt
ancestor   actgaatta
[Figure: tree rooted at the ancestor (actgaatta), with inferred internal nodes ggagaatta and ggataatta, leaves Isolate A to Isolate D, and branch lengths given as numbers of genetic changes]
Use the differences you identified by hqSNP or wgMLST to infer the relatedness or phylogeny

Reading the trees
Node: most recent common ancestor (e.g., for isolates B and C)
Leaf (taxa, terminal node): the isolates at the tips of the tree
Clade
Ancestral node
Outgroup/Root: a related isolate (same PFGE pattern or 7-gene MLST) that is not part of the outbreak
Branch lengths: genetic change
[Figure: the tree from “Building the tree”, annotated with these terms]

Trees, branches, and leaves – more than one way to draw a tree
Many different ways to display trees
Branches that connect to the terminal nodes are the important branch lengths for indicating relatedness

Trees, branches and leaves – reading the trees
Difference between similarity and relatedness on the tree
  Isolates A and C are more similar to each other than C and B are
  Isolates C and B are more related to each other than C and A are
[Figure: the tree from “Building the tree”, with branch lengths as genetic changes]
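The similarity half of this comparison can be checked directly by counting pairwise differences among the four isolate sequences from “Building the tree”. The sketch below computes only raw similarity; relatedness (who shares the most recent common ancestor) comes from the tree, not from these counts. The function name is hypothetical.

```python
from itertools import combinations

ISOLATES = {
    "A": "ggagagtta",
    "B": "ggatccccc",
    "C": "ggattatta",
    "D": "actgccggt",
}


def pairwise_differences(sequences: dict) -> dict:
    """Count differing positions for every pair of equal-length sequences."""
    result = {}
    for (name_1, seq_1), (name_2, seq_2) in combinations(sequences.items(), 2):
        if len(seq_1) != len(seq_2):
            raise ValueError("sequences must have the same length")
        result[(name_1, name_2)] = sum(a != b for a, b in zip(seq_1, seq_2))
    return result


diffs = pairwise_differences(ISOLATES)
```

A and C differ at 3 positions while B and C differ at 5, so A and C are more similar, even though the tree places B and C under a more recent common ancestor.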

Trees, branches and leaves – what does it mean for my outbreak investigation
Epidemiologic data provide context to the tree; you cannot rely on the phylogenetic tree alone to identify the outbreak source
[Figure: the same tree with leaves relabeled by isolate source: spinach, stool, kale, stool]

wgMLST–based phylogenetic Tree
Minimum spanning tree (MST)
Unrooted
Depicts genomes in a network; branch lengths show relatedness of isolates (number of allele differences)
[Figure: minimum spanning tree with groups labeled Crave Brothers, New subgroup and kale]
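A minimum spanning tree can be built from a matrix of pairwise allele differences. The sketch below uses Prim's algorithm on made-up distances; it illustrates the concept and is not the implementation used by any wgMLST software mentioned in this talk.

```python
def minimum_spanning_tree(names, dist):
    """Prim's algorithm on a symmetric matrix of allele differences.

    Returns a list of (isolate_1, isolate_2, allele_differences) edges.
    """
    if not names:
        return []
    in_tree = {0}          # start from the first isolate
    edges = []
    while len(in_tree) < len(names):
        best = None
        for i in in_tree:
            for j in range(len(names)):
                if j not in in_tree and (best is None or dist[i][j] < best[0]):
                    best = (dist[i][j], i, j)   # shortest link to a new isolate
        d, i, j = best
        in_tree.add(j)
        edges.append((names[i], names[j], d))
    return edges


# Made-up allele differences among four isolates
names = ["A", "B", "C", "D"]
dist = [
    [0, 2, 10, 12],
    [2, 0, 9, 11],
    [10, 9, 0, 1],
    [12, 11, 1, 0],
]
tree = minimum_spanning_tree(names, dist)
```

In this toy example A and B (2 alleles apart) and C and D (1 allele apart) form two tight pairs joined by a long 9-allele branch.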

[Figure: hqSNP tree of MDH isolates (isolate IDs between MDH00209 and MDH00254) from defined outbreaks and sporadic cases. Clusters are annotated with SNP ranges: 0 SNPs, 1 SNP, 0-1 SNPs, 0-2 SNPs, 0-3 SNPs. Other labeled isolates include sporadic cases (2000 to 2013), sporadic cases with the same PFGE, MLVA or time frame as Outbreak 1 or Outbreak 5, in-vivo isolates, OH samples, and an environmental sample from Outbreak 5.]
Defined outbreak samples:
  Outbreak 1 - Sept 2000
  Outbreak 2 - May 2001
  Outbreak 3 - Aug 2001
  Outbreak 4 - Nov 2003
  Outbreak 5 - Aug 2008
  Outbreak 6 - Spring 2014
  Outbreak 7 - Spring 2014
Taylor et al. J Clin Micro Oct 2015.

Take Home Messages
Molecular epidemiology requires collaboration between epidemiologists and the lab
Microbial isolates can accumulate genetic differences through a variety of mechanisms (e.g., horizontal gene transfer)
How a data analysis approach deals, or does not deal, with these different evolutionary mechanisms can play an important role
  hqSNP and wgMLST both address and account for horizontal gene transfer, but in different ways
Different organisms differ in their lifestyles and mechanisms of evolution
  Need to know your epi and your bugs

Acknowledgments
Centers for Disease Control and Prevention: Heather Carleton, Greg Armstrong, Peter Gerner-Smidt, John Besser
Integrated Food Safety Centers of Excellence

Questions
