Introduction to the CGE servers

Center for Genomic Epidemiology
Aim: To provide the scientific foundation for future internet-based solutions, where a central database will enable simplification of total genome sequence information and comparison to all other sequenced isolates, including spatial-temporal analysis. To develop algorithms for rapid analyses of whole genome DNA sequences, tools for analyses and extraction of information from the sequence data, and internet/web interfaces for using the tools in the global scientific and medical community.

Tools for species identification
Services are hosted under cge.cbs.dtu.dk/services/.
SpeciesFinder: species identification using 16S rRNA. Status: online. Publication: Feb 2014, PMID: 24574292.
KmerFinder: species identification using overlapping 16mers. Publication: Jan 2014, PMID: 24172157.
TaxonomyFinder: taxonomy identification using functional protein domains. Publication: PMID: 24574292 and Oksana's PhD thesis.
Reads2Type: species identification on the client computer.

Benchmarking of Methods for Bacterial Species Identification PMID: 24574292

Training data: 1,647 completed / almost completed genomes downloaded from NCBI in 2011 (1,009 different species).
Evaluation data:
NCBI draft genomes: 695 isolates from species that overlap with the training set (151 species).
SRA draft genomes: 10,407 sets of short reads from Illumina (168 species), and the 10,407 draft genomes assembled from the Illumina data (168 species).

16S rRNA
16S rRNA sequencing has dominated molecular taxonomy of prokaryotes for more than 30 years (Fox et al, Int. J. Syst. Bacteriol., 1977).
Tremendous amounts of 16S rRNA sequence data are available in databases.
Concerns:
Low resolution.
Some genomes contain several copies of the 16S rRNA gene with inter-gene variation.
The 16S rRNA gene represents only about 0.1% of the coding part of a microbial genome.

CGE implementation of 16S species identification: SpeciesFinder
Reference database: 16S rRNA genes are isolated from the genomes in the training data using RNAmmer (Lagesen, NAR, 2007). RNAmmer is based on a hidden Markov model; BLAST will not work for isolating the 16S rRNA gene.
Method: Input genomes are BLASTed against the 16S rRNA genes in the reference database. The best hit is selected based on a combination of coverage, % identity, bit score, number of mismatches and number of gaps in the alignment.

KmerFinder
Genomes in the training data are chopped into overlapping 16mers, for example from the sequence ATGACGTATGATTGATGACGTAGTAGTCC.
Immune-system-inspired downsampling: only 16mers with a specific prefix are kept. ATGA is the prefix used in this example.
A database is generated in which each 16mer is a key and the value is a list of all the isolates in the training set containing that 16mer.
The analogy with MHC-I and 9mer peptides: The human allele HLA-A:02:01 prefers leucine at position 2 and valine at position 9. If all amino acids were equally frequent, restriction by this motif would make it bind 1 out of 400 peptides. Fixing the 2 anchor positions of a 9mer peptide still leaves the peptide as selective as any other 9mer peptide, but only a fraction of the 9mer peptides have the right anchors. As a result of this binding specificity, the immune system does not recognize the entire proteome of a microbe but only a subset of it. The "microbe database" that is actually remembered by the immune system can be ~200 times smaller than it would otherwise be. This may be important, since there are only ~10^12 cells in the immune system, and without a reduction in the microbe database the number of cells may be insufficient to save information about all the microbes we encounter throughout our lives.
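As a rough check on the numbers in this analogy, the two fractions can be worked out under the assumption that residues (or bases) are equally frequent and independent. The DNA figure is an illustration added here, not a number from the slides, and real genomes with biased base composition will deviate from it.

```latex
\[
P(\text{9mer has both anchors}) = \frac{1}{20}\cdot\frac{1}{20} = \frac{1}{400},
\qquad
P(\text{16mer starts with ATGA}) = \left(\frac{1}{4}\right)^{4} = \frac{1}{256}.
\]
```

Under that assumption, keeping only 16mers with one fixed 4-base prefix would shrink the k-mer database by a factor of roughly 256.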

16mer database (each 16mer is a key; the value is the list of isolates in the training set containing it):
ATGAATGTGTGAGTGA: CP001921 (Acinetobacter baumannii), CP000521 (Acinetobacter baumannii), CP002522 (Acinetobacter baumannii)
ATGACTGTGCCCCTGA: CP001921 (Acinetobacter baumannii), CP002301 (Buchnera aphidicola)

Unknown isolate, unique 16mers: ATGAATGTGTGAGTGA, ATGACTGTGCCCCTGA, ATGAAAAAAAAAAAA

Species                    Match      No. of Kmer hits
Acinetobacter baumannii    CP001921   2
Acinetobacter baumannii    CP000521   1
Acinetobacter baumannii    CP002521
Buchnera aphidicola        CP002301

Very robust method: it needs just one 16mer to make a prediction.
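The lookup described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the KmerFinder implementation: the function names and the toy genome strings are hypothetical, and the only things taken from the slides are the k-mer length (16), the ATGA prefix and the two example 16mers with their accession numbers.

```python
from collections import Counter, defaultdict

K = 16            # k-mer length used on the slides
PREFIX = "ATGA"   # prefix used in the slide's example


def prefixed_kmers(sequence, k=K, prefix=PREFIX):
    """Return the set of k-mers in `sequence` that start with `prefix`."""
    sequence = sequence.upper()
    return {
        sequence[i:i + k]
        for i in range(len(sequence) - k + 1)
        if sequence.startswith(prefix, i)
    }


def build_index(genomes, k=K, prefix=PREFIX):
    """Map each retained k-mer to the set of genome IDs containing it."""
    index = defaultdict(set)
    for genome_id, sequence in genomes.items():
        for kmer in prefixed_kmers(sequence, k, prefix):
            index[kmer].add(genome_id)
    return index


def count_hits(index, query, k=K, prefix=PREFIX):
    """Count, per genome ID, how many unique query k-mers it shares."""
    hits = Counter()
    for kmer in prefixed_kmers(query, k, prefix):
        for genome_id in index.get(kmer, ()):
            hits[genome_id] += 1
    return hits


# Toy "genomes" (hypothetical strings) built around the two example 16mers.
KMER_A = "ATGAATGTGTGAGTGA"
KMER_B = "ATGACTGTGCCCCTGA"
toy_genomes = {
    "CP001921": "CC" + KMER_A + "GG" + KMER_B + "CC",
    "CP000521": "GG" + KMER_A + "CC",
    "CP002522": "TT" + KMER_A + "TT",
    "CP002301": "CC" + KMER_B + "GG",
}
toy_index = build_index(toy_genomes)

# Query with both example 16mers plus one ATGA-prefixed 16mer absent from the index.
toy_query = "AA" + KMER_A + "TT" + KMER_B + "GG" + "ATGA" + "A" * 12
toy_hits = count_hits(toy_index, toy_query)
```

Ranking the training genomes is then a matter of sorting `toy_hits` by count, for example with `toy_hits.most_common()`.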

KmerFinder is very robust: it only needs one 16mer!
Example: Desulfovibrio piger GOR1, SRR097356
>NODE 4 length 92 cov 23.119566
TAGGACGTGGAATATGGCAAGAAAACTGAAAATCATGGAAAATGAGAAACATCCACTTGA
CGACTTGAAAAATGACGAAATCACTAAAAAACGTGAAAAATGAGAAATGC
>NODE 15 length 82 cov 2.792683
AGCGAAAAATGTCATAACAACGATCACGACCGATAACCATCTTTGGTCCAAACTTACTCA
CGCAGCAGGCGTATAACTCGCGCATACCAGCTTTGGGCAT
N50 = 110. Total no. of bp: 210.
Prediction: Flavobacterium psychrophilum, match AM398681, no. of Kmer hits: 1.
For 41 isolates, the method failed to produce an output. The 41 draft genomes typically had an N50 below 200 (average N50 = 155) and a total no. of bp below 2-3000.
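The N50 quoted above can be reproduced with a short calculation. The sketch below uses the common definition of N50 (the contig length at which the length-sorted contigs first cover at least half of the assembly); the function name is illustrative, and the two lengths are those of the contig sequences shown in the example.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    Contigs are sorted from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the total
    assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


# Sequence lengths of the two contigs in the example (110 bp and 100 bp).
example_lengths = [110, 100]
example_total = sum(example_lengths)
example_n50 = n50(example_lengths)
```

With these two contigs the total is 210 bp and the N50 is 110, matching the values on the slide.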

TaxonomyFinder

Reads2Type
Definition: quick and dirty taxonomy identification of single isolates.
Database of 50-mers from marker genes:
16S rRNA: training data genomes, RNAmmer (other)
ITS: training data (Mycobacterium)
GyrB: training data (Enterobacteriaceae)
The resulting database is ~5 MB.
Reads2Type pushes the analysis to the user; the server provides the 50-mer database.
SuffixTree: an efficient data structure for string matching.
Narrow-down approach: Reads2Type compares 50-mers of the combined marker genes against the raw reads.
Shared probes vs. unique probes.

rMLST
Jolley KA, Bliss CM, Bennett JS, Bratcher HB, Brehony C, Colles FM, Wimalarathna H, Harrison OB, Sheppard SK, Cody AJ, Maiden MC. Ribosomal multilocus sequence typing: universal characterization of bacteria from domain to strain. Microbiology. 2012 Apr;158(Pt 4):1005-15.
CGE implementation: For each genome in the training data, the 53 ribosomal genes were extracted. Genomes in the evaluation sets were aligned to each gene collection using blat (only hits with at least 95% identity and 95% coverage were considered as a potential match). The closest match among the training genomes was selected based on a combination of coverage, % identity, bit score, number of mismatches and number of gaps in the alignments across all genes.
Average N50 of 1329 for failed isolates.
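The 95% identity / 95% coverage filter can be sketched as follows. This is an illustration only: the `Hit` record, the function names and the example hits are hypothetical, and the final tally (number of distinct genes matched per training genome) is a simple stand-in, since the slides do not specify how coverage, identity, bit score, mismatches and gaps are combined.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    gene: str        # ribosomal gene the hit belongs to
    genome_id: str   # training genome the matching allele came from
    identity: float  # percent identity of the alignment
    coverage: float  # percent of the gene covered by the alignment


def potential_matches(hits, min_identity=95.0, min_coverage=95.0):
    """Keep only hits that reach both thresholds."""
    return [
        h for h in hits
        if h.identity >= min_identity and h.coverage >= min_coverage
    ]


def genes_matched_per_genome(hits):
    """Count how many distinct genes each training genome matched.

    Stand-in ranking only; not the scoring used by the CGE implementation.
    """
    seen = {(h.genome_id, h.gene) for h in hits}
    return Counter(genome_id for genome_id, _ in seen)


# Hypothetical alignment results for two genes against two training genomes.
example_hits = [
    Hit("rpsA", "genome_1", 99.2, 100.0),
    Hit("rpsB", "genome_1", 96.0, 97.0),
    Hit("rpsA", "genome_2", 94.9, 100.0),  # fails the identity threshold
    Hit("rpsB", "genome_2", 98.0, 90.0),   # fails the coverage threshold
]
kept = potential_matches(example_hits)
tally = genes_matched_per_genome(kept)
```

In this toy input only the two `genome_1` hits survive the filter, so `genome_1` is the only candidate left for selection.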

Results (16S rRNA)
On the SRA drafts set, rMLST is not able to make a prediction for 3.5% of the isolates, TaxonomyFinder for 1.8%, KmerFinder for 0.4% and SpeciesFinder for 0.2%.

Overlap in predictions
One of the six isolates that all methods agree are not correctly annotated has actually been re-annotated since we downloaded the data.

Isolates in the NCBIdrafts set for which all four methods predict the species to be different from the annotated one. * NZAEPO00000000 has been re-annotated as S. oralis since we downloaded the data.

All four methods agree that two of the B. cereus isolates are B. weihenstephanensis.

Bacillus cereus predicted to be B. thuringiensis is problematic. The same holds for other recently diverged species: Y. pestis <> Y. pseudotuberculosis, M. tuberculosis <> M. bovis.

Speed

Method            Estimated speed (mm:ss)
16S               00:13*
KmerFinder        00:09*
TaxonomyFinder    11:33*
rMLST             00:45*
Reads2Type        00:55**

* Estimation based on draft genomes
** Estimation based on short reads

Summary of taxonomy benchmark study
KmerFinder had the highest accuracy and was the fastest method.
SpeciesFinder (16S rRNA-based) had the lowest accuracy.
Methods that only sample genomic loci (16S, Reads2Type, rMLST) had difficulties distinguishing species that only recently diverged, especially when the main difference is a plasmid.
Recently diverged: Y. pestis <> Y. pseudotuberculosis, M. tuberculosis <> M. bovis.

Tools for further typing
Services are hosted under https://cge.cbs.dtu.dk/services/.
MLST: multilocus sequence typing. Published Apr 2012, PMID: 22238442.
PlasmidFinder: identification of plasmids in Enterobacteriaceae. Published Apr 2014, PMID: 24777092.
pMLST: pMLST of plasmids in Enterobacteriaceae.

Multilocus Sequence Typing (MLST)
First developed in 1998 for Neisseria meningitidis (Maiden et al. PNAS 1998. 95:3140-3145).
The nucleotide sequences of internal regions of approximately 7 housekeeping genes are determined by PCR followed by Sanger sequencing.
Each distinct allele is assigned an arbitrary number.
The unique combination of alleles is the sequence type (ST).
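The last two points amount to a table lookup: the allele numbers at the typed loci form a profile, and the profile maps to an ST. A minimal sketch, in which the locus names, allele numbers and ST values are all hypothetical placeholders rather than entries from any real MLST scheme:

```python
# Hypothetical seven-locus scheme; real schemes use named housekeeping genes.
LOCI = ("locus1", "locus2", "locus3", "locus4", "locus5", "locus6", "locus7")

# Hypothetical profile table: allele-number combination -> sequence type.
PROFILES = {
    (1, 1, 1, 1, 1, 1, 1): 1,
    (1, 2, 1, 1, 3, 1, 1): 2,
}


def sequence_type(allele_calls, loci=LOCI, profiles=PROFILES):
    """Return the ST for a set of allele calls, or None if it is unknown.

    `allele_calls` maps locus name -> allele number. A missing locus or an
    allele combination absent from the profile table both give None.
    """
    try:
        profile = tuple(allele_calls[locus] for locus in loci)
    except KeyError:
        return None
    return profiles.get(profile)


known_isolate = dict(zip(LOCI, (1, 2, 1, 1, 3, 1, 1)))
novel_isolate = dict(zip(LOCI, (9, 9, 9, 9, 9, 9, 9)))
```

Here `sequence_type(known_isolate)` returns 2, while `sequence_type(novel_isolate)` returns `None` because that combination is not in the table.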

Using WGS data for MLST

www.cbs.dtu.dk/services/MLST
Available MLST schemes: Acinetobacter baumannii #1, Arcobacter, Borrelia burgdorferi, Bacillus cereus, Brachyspira hyodysenteriae, Bifidobacterium, Brachyspira intermedia, Bordetella, Burkholderia pseudomallei, Brachyspira, Burkholderia cepacia complex, Campylobacter jejuni, Clostridium botulinum, Clostridium difficile #1, Clostridium difficile #2, Campylobacter helveticus, Campylobacter insulaenigrae, Clostridium septicum, C. diphtheriae, Campylobacter fetus, Chlamydiales, Campylobacter lari, Cronobacter, C. upsaliensis, Escherichia coli #1, Escherichia coli #2, Enterococcus faecalis, Enterococcus faecium, F. psychrophilum, Haemophilus influenzae, Haemophilus parasuis, Helicobacter pylori, Klebsiella pneumoniae, Lactobacillus casei, Lactococcus lactis, Leptospira, Listeria, Listeria monocytogenes, Moraxella catarrhalis, Mannheimia haemolytica, Neisseria, P. gingivalis, P. acnes, Pseudomonas aeruginosa, Pasteurella multocida, Staphylococcus aureus, Streptococcus agalactiae, Salmonella enterica, Staphylococcus epidermidis, S. maltophilia, Streptococcus pneumoniae, Streptococcus oralis, S. zooepidemicus, Streptococcus pyogenes, Streptococcus suis, Streptococcus thermophilus, Streptomyces, Streptococcus uberis, Vibrio parahaemolyticus, Vibrio vulnificus, Wolbachia, Xylella fastidiosa, Y. pseudotuberculosis.
Accepted input types: assembled genome; 454 single-end reads; 454 paired-end reads; Illumina single-end reads; Illumina paired-end reads; Ion Torrent; SOLiD single-end reads; SOLiD mate-pair reads.

Extended Output

aro: WARNING, Identity: 100%, HSP/Length: 349/498, Gaps: 0, aro_122 is the best match for aro
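An output line of this shape can be broken into fields with a regular expression, which also makes the HSP/Length ratio explicit. The pattern below is an assumption fitted to this one example line, not a documented output format, and the reading that the warning relates to the partial alignment (349 of 498 bases) is an interpretation rather than something the slide states.

```python
import re

LINE = ("aro: WARNING, Identity: 100%, HSP/Length: 349/498, "
        "Gaps: 0, aro_122 is the best match for aro")

# Pattern fitted to the example line above (assumed format).
PATTERN = re.compile(
    r"(?P<locus>\w+): (?P<status>\w+), "
    r"Identity: (?P<identity>[\d.]+)%, "
    r"HSP/Length: (?P<hsp>\d+)/(?P<length>\d+), "
    r"Gaps: (?P<gaps>\d+), "
    r"(?P<allele>\w+) is the best match for (?P=locus)"
)


def parse_line(line):
    """Parse one extended-output line into a dict, or return None."""
    match = PATTERN.match(line)
    if match is None:
        return None
    hsp = int(match["hsp"])
    length = int(match["length"])
    return {
        "locus": match["locus"],
        "status": match["status"],
        "identity": float(match["identity"]),
        "hsp": hsp,
        "length": length,
        "gaps": int(match["gaps"]),
        "allele": match["allele"],
        # Fraction of the allele covered by the alignment.
        "coverage": hsp / length,
    }


parsed = parse_line(LINE)
```

For the example line the aligned segment covers 349 of 498 bases, about 70% of the allele, even though the identity over that segment is 100%.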

What is the MLST web-service used for?
It is our most used service. In the first 9 months of 2014, it was used more than 1,500 times per month on average. From Sep. 2012 to Oct. 2014, the service was used more than 20,000 times in total.

PlasmidFinder and pMLST
The PlasmidFinder database contains replicons, not entire plasmids.

Tools for phenotyping
Services are hosted under https://cge.cbs.dtu.dk/services/.
ResFinder: identification of acquired antibiotic resistance genes. Published Nov 2012, PMID: 22782487.
VirulenceFinder: identification of virulence genes in E. coli (and S. aureus and Enterococcus). E. coli: published Feb 2014, PMID: 24574290.
MyDbFinder: identification of genes from the user's own database. Will be published in a book chapter.
PathogenFinder: prediction of pathogenic potential. Published Oct 2013, PMID: 24204795.

ResFinder
[Workflow figure: NGS reads (Illumina, Ion Torrent, 454, ...) go through an assembly pipeline to FASTA; Sanger sequences are supplied as FASTA. ResFinder (BLAST) returns a resistance gene profile (list of genes, accession numbers) and a theoretical resistance phenotype.]

From S. aureus

ResFinder, 98 %ID, 60% length coverage
200 isolates from 4 different species (Salmonella Typhimurium, Escherichia coli, Enterococcus faecalis and Enterococcus faecium).
Phenotypic tests: 3,051 in total (482 resistant, 2,569 susceptible).
=> 99.74% of the results were in agreement between ResFinder and the phenotypic tests.
23 discrepancies -> 16, typically in relation to spectinomycin in E. coli.

Alternatives to ResFinder

Unpublished or uncategorized
Services are hosted under https://cge.cbs.dtu.dk/services/.
PanFunPro: groups homologous proteins based on functional domain content. Status: online. Published in F1000Research 2013, 2:265.
SerotypeFinder (SerotypeFinder-1.0): identification of serotypes. Not yet published.
Restriction-ModificationFinder: identification of RM system genes. Will only be published in a book chapter.
HostPhinder: prediction of the host of a bacteriophage. Status: online, but under development.
MetaVirFinder: identification of viruses in metagenomic data.
MGmapper: identifies the content of metagenomic samples.

Tools for phylogeny
Services are hosted under cge.cbs.dtu.dk/services.
snpTree: creation of phylogenetic trees based on SNPs. Status: online. Published Dec 2012, PMID: 23281601.
CSIPhylogeny: planned.
NDtree: creation of phylogenetic trees. Published Feb 2014, PMID: 24505344.

Web-service usage

Type of data uploaded to MLST web-service