Martin Wiedmann Cornell University

Presentation transcript:

hqSNP, wgMLST and the WGS alphabet soup: what epidemiologists need to know
Martin Wiedmann, Cornell University
E-mail: mw16@cornell.edu

Outline
Review of genomes, genes, and evolution
Use of sequence data to assess relatedness of organisms
Data analysis approaches: wgMLST and hqSNP
Trees and how to interpret them

What is a SNP?
Single Nucleotide Polymorphism (SNP)
ATGTTCCTC  sequence
ATGTTGCTC  reference
*phylogenetically informative differences
Insertion or Deletion (Indel)
ATGTTCCCTC  sequence
ATGTTC-CTC  reference
*differences not used in hqSNP analysis
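The SNP/indel distinction above can be sketched in a few lines of Python. This is a minimal illustration, not the logic of any particular pipeline: the function name count_snps is hypothetical, and it assumes the two sequences are already aligned to equal length with "-" marking a gap.

```python
def count_snps(sequence: str, reference: str, gap: str = "-") -> int:
    """Count mismatched columns in two aligned sequences, skipping gap columns."""
    if len(sequence) != len(reference):
        raise ValueError("aligned sequences must have equal length")
    return sum(
        1
        for s, r in zip(sequence, reference)
        if s != r and s != gap and r != gap  # indel columns are not counted
    )


# The two examples from the slide.
print(count_snps("ATGTTCCTC", "ATGTTGCTC"))    # one substituted base
print(count_snps("ATGTTCCCTC", "ATGTTC-CTC"))  # only an indel column differs
```

Skipping gap columns mirrors the statement that indels are not used in hqSNP analysis; a real pipeline works from read alignments rather than from two finished sequences.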

Microbial evolution 101 – mechanisms of change
Point mutations
ACCCTCTAGTAGTAGCA
ACCATCTAGTAGTAGCA
ACCCTCTAGTAGTAGCA
1 SNP and one “genetic event”

Microbial evolution 101 – mechanisms of change
Insertion or deletion
ACCCTCTAGTAGTAGCA
ACCATCTAG . . . TAGCA
ACCCTCTAGTAGTAGTAGCA
3 differences (?) and one “genetic event”

Microbial evolution 101 – mechanisms of change
Inversion
ACCCTCTAGTAGTAGCA
ACCATCTCGTAGTAGCA
ACCCTCTAGTAGTAGCA
Alignment:
ACCATCTCGTAGTAGCA
ACCCTCTAGTAGTAGCA
2 SNPs and one “genetic event”

Microbial evolution 101 – mechanisms of change
Horizontal gene transfer of homologous gene sequences
ACCCTCTAGTACTAGCATCC
TCCCTCTTGTCCTACCATCA
Transferred fragment: CTTGTCCTACCA
Alignment:
ACCCTCTAGTACTAGCATCC
ACCCTCTTGTCCTACCATCC
3 SNPs and 1 genetic event

Microbial evolution 101 – mechanisms of change Transformation Transduction

Case study – why does it matter
Human listeriosis outbreak in 2000 with 29 cases
Outbreak isolates show a 1 SNP difference to a food isolate and a human isolate from a single case linked to processing facility X in 1988
Epidemiology supports that this facility was the source of the outbreak
Some analysis approaches that did not account for recombination would have shown that human isolates from 2000 show approx. 3,000 SNP differences to the 1988 food isolate from facility X
Why: a large recombination event that introduced a large prophage (a virus inserted into the bacterial genome)

Outline
Review of genomes, genes, and evolution
Use of sequence data to assess relatedness of organisms
Data analysis approaches: kSNP, wgMLST, and hqSNP
Trees and how to interpret them

Use of sequence data to assess relatedness of organisms
Differences in sequences can be used to assess relatedness of organisms and the likelihood of a recent common ancestor
“Do the M. tuberculosis isolates from patient A and patient B share a recent common ancestor?”
Definition of “recent” becomes important: recent in years or in generation times?
Salmonella in a dry processing plant may stay dormant and rarely if ever multiply (or imagine anthrax spores in soil)
Salmonella in a chicken flock may multiply every 30 min (>7,500 times a year)
Assessing relationships of microbial isolates typically requires more information than just sequence data
Information on epidemiological relationships and other relevant data is essential

Outline
Review of genomes, genes, and evolution
Use of sequence data to assess relatedness of organisms
Data analysis approaches: kSNP, wgMLST, hqSNP, and others
Trees and how to interpret them

Basics of WGS Analyses
Different ways to compare the genomes of 2 different isolates:
Compare the genome small piece-by-small piece to find pieces that are different: k-mer based analyses
Use a high quality (reference) sequence or genome to identify differences: hqSNP analysis
Compare genomes on a gene-by-gene (locus-by-locus) basis: wgMLST analysis
All these analyses can provide an output that gives the “number of differences” or can be used to build trees
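As a rough illustration of the "small piece-by-small piece" idea, the sketch below breaks two sequences into overlapping k-mers and reports the fraction of k-mers they do not share (a Jaccard distance). The function names and the choice of k = 5 are assumptions made for this toy example; tools such as kSNP work differently in detail and use much longer k-mers on whole genomes.

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a: str, b: str, k: int = 5) -> float:
    """Jaccard distance between the k-mer sets of two sequences (0 = identical sets)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


# The point-mutation example sequences from the earlier slide.
print(kmer_distance("ACCCTCTAGTAGTAGCA", "ACCATCTAGTAGTAGCA"))
```

Note that a single base change alters every k-mer that overlaps it, so the distance is not a SNP count.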


What makes a SNP high quality (hq)?
[Figure: sequence reads pass through a quality filter to give quality-filtered sequence reads ready for analysis]
Apply a quality filter that filters out nucleotides in sequence reads for comparison based on sequence coverage and quality

The alphabet soup of analysis – Coverage
[Figure: alphabet soup illustrating coverage at 40x versus coverage at 5x; image source: http://missusrousselee.deviantart.com/art/Alphabet-Soup-134724659]
NGS generates 100,000 or more reads per genome sequenced
Any single location on the genome can be covered by zero to hundreds of sequence reads
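Coverage can be made concrete with a small calculation. The sketch below is illustrative only: the function names are hypothetical, and the read count, read length, and genome size are made-up round numbers, not values from the presentation.

```python
def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering each base: total bases sequenced / genome size."""
    return n_reads * read_length / genome_size


def depth_per_position(read_starts, read_length: int, genome_size: int) -> list:
    """Count how many reads cover each position (0-based start coordinates)."""
    depth = [0] * genome_size
    for start in read_starts:
        for pos in range(start, min(start + read_length, genome_size)):
            depth[pos] += 1
    return depth


# Hypothetical run: 400,000 reads of 300 bases on a 3,000,000-base genome.
print(mean_coverage(400_000, 300, 3_000_000))

# Toy genome of 6 bases with three reads of length 3: depth varies by position.
print(depth_per_position([0, 2, 2], read_length=3, genome_size=6))
```

The second call shows why an average hides detail: some positions are covered several times and others not at all.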

What to call a SNP
Is it a SNP?
ATGTTACTC
ATGTTCCTC
ATGTTTCTC
ATGTTCCTC
ATGTTCCTC
ATGTTGCTC
ATGTTGCTC  reference
SNPs called based on: quality, coverage, base frequency
The differences between the reference and the compared genome are extracted and used to determine relatedness
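The three criteria named above (quality, coverage, base frequency) can be sketched as successive filters on the reads covering one position. Everything specific here is an assumption for illustration: the function names, the (base, Phred quality) input format, and the threshold values are not taken from any particular hqSNP pipeline.

```python
from collections import Counter


def call_base(pileup, min_depth=10, min_qual=20, min_freq=0.9):
    """Return the consensus base at one position, or None if no confident call.

    pileup is a list of (base, phred_quality) pairs, one per read covering
    the position. All three thresholds are illustrative assumptions.
    """
    # Quality: ignore individual base calls below the quality cutoff.
    good = [base for base, qual in pileup if qual >= min_qual]
    # Coverage: require enough remaining reads at this position.
    if len(good) < min_depth:
        return None
    # Base frequency: the most common base must dominate the pileup.
    base, count = Counter(good).most_common(1)[0]
    if count / len(good) < min_freq:
        return None
    return base


def is_snp(pileup, ref_base, **thresholds):
    """A position is a SNP only if a confident call differs from the reference."""
    called = call_base(pileup, **thresholds)
    return called is not None and called != ref_base


clean = [("C", 35)] * 40                      # deep, consistent, high quality
mixed = [("C", 35)] * 20 + [("G", 35)] * 20   # deep but ambiguous
print(is_snp(clean, "G"), is_snp(mixed, "G"))
```

Positions that fail any filter yield no call rather than a reference call, which is one reason different pipelines report different SNP counts for the same data.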

Where to call a SNP?
Not all SNP pipelines are equal: where you call SNPs will affect the total SNP count
SNPs relevant for phylogenetic analysis are vertically transmitted, not horizontally, so horizontally transferred genetic elements like phages can be masked
[Figure: raw reads aligned over genes and mobile elements. Mask mobile elements: do not consider SNPs in this location. Only call SNPs in genes.]
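Masking can be pictured as dropping every SNP position that falls inside a listed region. The sketch below is a simplified illustration: mask_snps is a hypothetical name, the coordinates are invented, and real pipelines identify prophage and other mobile-element regions with dedicated tools.

```python
def mask_snps(snp_positions, masked_regions):
    """Keep only SNP positions outside every masked region.

    masked_regions is a list of (start, end) pairs, 0-based with end exclusive.
    """
    return [
        pos
        for pos in snp_positions
        if not any(start <= pos < end for start, end in masked_regions)
    ]


# Invented coordinates: three of five SNPs sit inside a masked prophage region.
snps = [120, 48_010, 48_500, 51_999, 90_300]
prophage = [(48_000, 52_000)]
print(mask_snps(snps, prophage))
```

This is the kind of step that separates a count of a few SNPs from a count of thousands when a recombination event has introduced a prophage, as in the listeriosis case study.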

How to report SNP data – keep it simple
Hi folks:
New Cluster: 2016039
Two isolates are 0 SNPs from each other:
E2017003216 (SE77B52)
E2017003039 (SE77B52)
New Cluster: 2016040
Two isolates are 2 SNPs from each other:
E2017002910 (SE1B1)
I2017003132 (SE1B1)

MDH00841 MDH00849

Caveats of hqSNP analyses
Advantages:
Phylogenetically informative (builds a tree consistent with evolution of the strains)
May provide highest amount of resolution for strain comparison
SNP position can be identified on the genome (gene affected can be identified)
Disadvantages:
Requires a closely related reference genome; hqSNP analysis is problematic if the reference genome is not closely related
Takes a while and requires a lot of computer power
Interpretation of data depends on genomes added; it is not stable and does not lead to nomenclature
When to use:
Good for situations where a wgMLST database has not been developed and validated

Basics of WGS Analyses
Different ways to compare the genomes of 2 different isolates:
Compare the genome small piece-by-small piece to find pieces that are different: k-mer based analyses
Use a high quality (reference) sequence or genome to identify differences: hqSNP analysis
Compare genomes on a gene-by-gene (locus-by-locus) basis: wgMLST analysis
All these analyses can provide an output that gives the “number of differences” or can be used to build trees

Traditional MLST
MLST.NET
~6-12 housekeeping genes; usually a portion of each gene
Developed in the era of Sanger sequencing, providing improved discrimination over sequencing 1 gene
Targets selected to represent population structure, not as useful for outbreak detection
Schemes are available on international, publicly accessible databases
Combination of 6-12 genes used to name a unique sequence type (e.g., MLST profile 1-1-1-1-1-1-1 = ST1)

Whole genome multilocus sequence typing (wgMLST)
Database is built from gene content representing a diverse selection of the genus/species of the organism being compared
Each unique gene is referred to as a “locus”; a locus may include the entire gene or a piece of the gene
Any change (SNP, insertion, deletion) equals a new allele call for a locus
New alleles are named sequentially when encountered, not based on sequence
Locus 1:
ACTAGAGGGAAA  allele 1
ACTAGAGGCTAA  allele 2 (2 SNPs)
ACT-GAGGGAAA  allele 3 (1 indel)
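Sequential allele naming amounts to a lookup table per locus: a sequence seen before returns its existing number, and a new sequence receives the next number. The class below is a minimal sketch with hypothetical names; a curated wgMLST database also handles quality checks, partial genes, and curation, none of which is modeled here.

```python
class LocusAlleles:
    """Assign allele numbers for one locus in order of first appearance."""

    def __init__(self):
        self._numbers = {}  # sequence -> allele number

    def call(self, sequence: str) -> int:
        # Alignment gaps are removed so the key is the actual gene sequence.
        key = sequence.replace("-", "").upper()
        if key not in self._numbers:
            self._numbers[key] = len(self._numbers) + 1
        return self._numbers[key]


locus1 = LocusAlleles()
for seq in ["ACTAGAGGGAAA", "ACTAGAGGCTAA", "ACT-GAGGGAAA", "ACTAGAGGGAAA"]:
    print(seq, "-> allele", locus1.call(seq))
```

Allele 2 (two SNPs) and allele 3 (one indel) each count as one new allele, so the allele number records that a locus differs, not by how much.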

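Whole genome multilocus sequence typing (wgMLST)
Allows for simpler analysis and clear naming of subtypes
Performs comparison on a gene-by-gene level
[Table: allele numbers for Isolates A, B, and C at Locus 1 (20 nt), Locus 2 (100 nt), Locus 3 (5000 nt), etc., through Locus 2,005 (5 nt), with the resulting wgMLST type for each isolate. The extracted cell values (1; 8, 12; 5, 2; 4; types A, B) can no longer be assigned to columns.]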

The alphabet soup of analysis - wgMLST The allele calls at each locus are compared between isolates and differences are used to determine relatedness
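The comparison described above reduces to counting loci where two isolates carry different allele numbers. The sketch below uses hypothetical locus names and allele values, and it assumes that loci with no allele call in either isolate are skipped; how missing calls are treated varies between systems.

```python
def allele_differences(profile_a: dict, profile_b: dict) -> int:
    """Count loci with different allele numbers between two isolates.

    Profiles map locus name -> allele number, with None meaning "no call".
    Assumption: loci missing or uncalled in either isolate are ignored.
    """
    shared_loci = profile_a.keys() & profile_b.keys()
    return sum(
        1
        for locus in shared_loci
        if profile_a[locus] is not None
        and profile_b[locus] is not None
        and profile_a[locus] != profile_b[locus]
    )


# Hypothetical profiles for two isolates over four loci.
isolate_a = {"locus_1": 1, "locus_2": 8, "locus_3": 5, "locus_4": None}
isolate_b = {"locus_1": 1, "locus_2": 12, "locus_3": 5, "locus_4": 4}
print(allele_differences(isolate_a, isolate_b))
```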

“Allele Code” Pattern Naming in the Listeria Database
Pilot thresholds:
10% = 300 alleles
5% = 150 alleles
2.5% = 75 alleles
1% = 30 alleles
0.5% = 15 alleles
0.25% = 7 alleles
Two isolates are the same:
Patient 1: 4.1.1.5.2
Patient 2: 4.1.1.5.2

The wgMLST “zip code”
Two isolates are the same:
Patient 1: 1.4.1.1.5.2
Patient 2: 1.4.1.1.5.2
Three isolates; patient 3 differs by 1 to 7 alleles from 1 and 2:
Patient 3: 1.4.1.1.5.4
Four isolates; patient 4 differs by 8 to 15 alleles from the others:
Patient 4: 1.4.1.1.7.1
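Reading a "zip code" comes down to finding how many leading fields two codes share: the longer the shared prefix, the closer the isolates. The helper below is a hypothetical sketch that only counts matching fields; it deliberately does not translate positions into allele thresholds, since that mapping is specific to the database.

```python
def shared_levels(code_a: str, code_b: str) -> int:
    """Number of leading dot-separated fields that two allele codes have in common."""
    count = 0
    for a, b in zip(code_a.split("."), code_b.split(".")):
        if a != b:
            break
        count += 1
    return count


patient_1 = "1.4.1.1.5.2"
for label, code in [("2", "1.4.1.1.5.2"), ("3", "1.4.1.1.5.4"), ("4", "1.4.1.1.7.1")]:
    print("Patient 1 vs patient", label, "->", shared_levels(patient_1, code), "shared fields")
```

In the slide's examples, a difference only in the last field corresponds to 1 to 7 alleles, and a difference starting one field earlier corresponds to 8 to 15 alleles.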

How to report wgMLST data – keep it simple
Hi folks:
New Cluster: 2016039
Two isolates are 0 alleles from each other:
E2017003216 (SE77B52)
E2017003039 (SE77B52)
New Cluster: 2016040
Two isolates are 2 alleles from each other:
E2017002910 (SE1B1)
I2017003132 (SE1B1)

How to report wgMLST data – give me the ZIP codes
Looks like we may have a cluster:
Patient 1: 1.4.1.1.5.2
Patient 2: 1.4.1.1.5.2
Patient 3: 1.4.1.1.5.4
Patient 4: 1.4.1.1.7.1
Patient 5: 1.4.3.3.1.1

MLST Analysis Faster than analyzing SNP differences For WGS data, allele calls can be performed on short reads (“assembly free”) and assembled genomes (“assembly-based”) If there is a conflict between the allele calls then no allele call is made
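The conflict rule above can be written as a small function that combines the assembly-free and assembly-based calls for one locus. The function name is hypothetical, and the handling of the case where only one method produces a call is an assumption, since the slide only describes what happens on a conflict.

```python
def consensus_allele(assembly_free, assembly_based):
    """Combine two allele calls for one locus; None means "no call"."""
    if assembly_free is None or assembly_based is None:
        # Assumption: if only one method produced a call, accept that call.
        return assembly_free if assembly_based is None else assembly_based
    if assembly_free != assembly_based:
        # Conflict between the two methods: make no allele call.
        return None
    return assembly_free


print(consensus_allele(12, 12))    # both agree
print(consensus_allele(12, 15))    # conflict
print(consensus_allele(None, 15))  # only one method called the locus
```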

Advantages and Caveats of wgMLST analysis
Advantages:
Phylogenetically informative
All virulence, serotyping, and antibiotic resistance genes can be pulled out as part of analysis
Neutralizes the effects of horizontal gene transfer (event is only counted once rather than many times for hqSNPs)
Allele calling is stable: data standardizable; directly comparable between laboratories; can lead to nomenclature based on allele calls, which can be used for communication and automated cluster detection; reproducibility not dependent on choice of reference strain; amenable to automated bioinformatics
Disadvantages:
Initial assignment of alleles is computationally costly (doing assemblies before calling alleles); CDC's system will call alleles directly from raw reads (~2 min); assemblies take about 2 h or perhaps longer; if there is a conflict between the allele calls then no allele call is made
Comparing character data (allele numbers) rather than genetic data
SNPs and indels treated equally
Requires curation for allele calls
When to use:
Surveillance, especially for a distributed testing network
Reference characterization
Accurate cluster detection
Need to communicate with partners using stable nomenclature

hqSNP versus MLST Analysis
Both analyses are conducted from the same raw data (typically short-read sequencing data)
For public health purposes, both correlate well, i.e., the outermost branches of the phylogenetic trees are almost identical
The two are not mutually exclusive
For some use cases MLST works better; for others SNP analysis works better

Interpreting analysis data – how to build trees using WGS analysis
Use WGS analysis to infer relatedness of isolates
For wgMLST: translate the number of allele differences between isolates to a measure of similarity and use that to infer branch lengths and relatedness
For hqSNP analysis: translate nucleotide differences between isolates to relatedness
Can use substitution models to estimate the cost of changing from A>T, C>A, etc.
[Figure: the four nucleotides adenine, guanine, cytosine, and thymine]
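As one concrete example of a substitution model, the sketch below applies the Jukes-Cantor (JC69) correction, the simplest such model, which treats every base change as equally likely. The slides do not say which model is used in practice, so JC69 is chosen here only for illustration; it converts the observed proportion of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3).

```python
import math


def jc69_distance(p: float) -> float:
    """Jukes-Cantor distance for an observed proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# For closely related isolates the correction is tiny; it grows as p increases.
for p in (0.001, 0.1, 0.3):
    print(p, round(jc69_distance(p), 4))
```

The corrected distance exceeds the raw proportion because some sites may have changed more than once; models that weight A>T differently from C>A generalize the same idea.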

How to report SNP data - trees
1 ATATTCCGCAA
2 ATATTCCGCAA
3 ATATTGCGCAA
4 ACCTTGCGCTA
[Figure: tree of sequences 1 to 4]

Building the tree
Isolate   Sequence
A         ggagagtta
B         ggatccccc
C         ggattatta
D         actgccggt
ancestor  actgaatta
[Figure: tree connecting Isolates A, B, C, and D through intermediate sequences (ggataatta, ggagaatta, actgaatta); numbers on the branches give the genetic change]
Use the differences you identified by hqSNP or wgMLST to infer the relatedness or phylogeny

Reading the trees
[Figure: the tree of Isolates A to D, with numbers on the branches giving genetic change, annotated with the terms below]
Node
Most recent common ancestor (for Isolates B and C)
Leaf
Taxa
Clade
Ancestral node
Terminal node
Outgroup/Root: related isolate (same PFGE pattern or 7-gene MLST) but not part of the outbreak

Trees, branches, and leaves – more than one way to draw a tree Many different ways to display trees Branches that connect to the terminal node are the important branch lengths to indicate relatedness

Trees, branches and leaves – reading the trees
Difference between similarity and relatedness on the tree
Isolates A and C are more similar to each other than C and B are
Isolates C and B are more related to each other than C and A are
[Figure: the tree of Isolates A to D with sequences and numbers on the branches giving genetic change]
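The similarity half of this distinction can be checked directly by counting differing positions between the isolate sequences used in the tree-building example. This is a minimal sketch with hypothetical names; it measures similarity only and says nothing about which isolates share a more recent common ancestor.

```python
from itertools import combinations

# Isolate sequences from the "Building the tree" slide.
isolates = {
    "A": "ggagagtta",
    "B": "ggatccccc",
    "C": "ggattatta",
    "D": "actgccggt",
}


def differences(x: str, y: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


pairwise = {
    (p, q): differences(isolates[p], isolates[q])
    for p, q in combinations(sorted(isolates), 2)
}
for pair, n in pairwise.items():
    print(pair, n)
```

A and C differ at fewer positions than B and C, so A and C are more similar, even though the tree places B and C together under a more recent common ancestor. Relatedness comes from the inferred branching order, not from the raw count.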

Trees, branches and leaves – what does it mean for my outbreak investigation
Epidemiologic data provide context to the tree; you cannot rely on the phylogenetic tree to identify the outbreak source
[Figure: the same tree with tips labeled by sample source (spinach, stool, kale, stool) and numbers on the branches giving genetic change]

wgMLST-based phylogenetic tree: Crave Brothers
Minimum spanning tree (MST)
Unrooted
Depicts genomes in a network; branch lengths show relatedness of isolates (number of allele differences)
[Figure: minimum spanning tree with the labels “New subgroup” and “kale”]
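A minimum spanning tree connects all isolates using the set of links with the smallest total number of allele differences. The sketch below uses Prim's algorithm on made-up allele profiles; the isolate names, profiles, and function names are hypothetical, and production tools add tie-breaking rules and handling of missing alleles.

```python
def minimum_spanning_tree(nodes, distance):
    """Prim's algorithm. Returns a list of (node_in_tree, new_node, weight) edges."""
    nodes = list(nodes)
    if not nodes:
        return []
    in_tree = {nodes[0]}
    edges = []
    while len(in_tree) < len(nodes):
        # Pick the cheapest link from any node already in the tree to a new node.
        best = min(
            ((a, b, distance(a, b)) for a in in_tree for b in nodes if b not in in_tree),
            key=lambda edge: edge[2],
        )
        edges.append(best)
        in_tree.add(best[1])
    return edges


# Hypothetical allele profiles over five loci.
profiles = {
    "iso1": [1, 1, 1, 1, 1],
    "iso2": [1, 1, 1, 1, 2],
    "iso3": [1, 1, 2, 2, 2],
    "iso4": [3, 3, 2, 2, 2],
}


def allele_distance(a, b):
    return sum(1 for x, y in zip(profiles[a], profiles[b]) if x != y)


for a, b, w in minimum_spanning_tree(profiles, allele_distance):
    print(a, "-", b, ":", w, "allele differences")
```

Because the result is unrooted, it shows which isolates sit closest together in allele space without implying a direction of descent.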

[Figure: tree of isolates MDH00202 to MDH00256, including defined outbreak samples, sporadic isolates (some sharing PFGE pattern and time period with Outbreak 1 or Outbreak 5), in-vivo isolates, two OH samples, and an environmental sample from Outbreak 5. Clusters are labeled with SNP ranges of 0 SNPs, 1 SNP, 0-1 SNPs, 0-2 SNPs, and 0-3 SNPs.]
Defined outbreak samples:
Outbreak 1 - Sept 2000
Outbreak 2 - May 2001
Outbreak 3 - Aug 2001
Outbreak 4 - Nov 2003
Outbreak 5 - Aug 2008
Outbreak 6 - Spring 2014
Outbreak 7 - Spring 2014
Taylor et al. J Clin Micro Oct 2015.

Take Home Messages
Molecular epidemiology requires collaboration between epidemiologists and the lab
Microbial isolates can accumulate genetic differences through a variety of mechanisms (e.g., horizontal gene transfer)
How a data analysis approach deals, or does not deal, with these different evolutionary mechanisms can play an important role
hqSNP and wgMLST both account for horizontal gene transfer, but in different ways
Different organisms differ in their lifestyles and mechanisms of evolution
Need to know your epi and your bugs

Acknowledgments
Centers for Disease Control and Prevention
Heather Carleton
Greg Armstrong
Peter Gerner-Smidt
John Besser
Integrated Food Safety Centers of Excellence

Questions