A comparison of algorithms for identification of specimens using DNA barcodes: examples from gymnosperms Damon P. Little and Dennis Wm. Stevenson Cullman.

Presentation transcript:

A comparison of algorithms for identification of specimens using DNA barcodes: examples from gymnosperms
Damon P. Little and Dennis Wm. Stevenson
Cullman Program for Molecular Systematic Studies, The New York Botanical Garden, Bronx, New York

Why is DNA barcoding useful?

(1) Non–specialists can identify specimens (e.g., customs inspectors, ethnobotanists).
(2) Morphologically deficient or incomplete specimens can be identified (e.g., powders).

application to conservation:
Cycadopsida: all 305 species are protected by CITES (Convention on International Trade in Endangered Species)
5 genera are appendix I
6 genera are appendix II*
Cycas machonie

nrITS 2:
Encephalartos ferox (CITES appendix I): ((GTGCTCGGGC and TCTCGCACTG) and not CGCCTCCCCT)
Lepidozamia hopei (CITES appendix II): CGCCTCCCCT
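A minimal sketch of how a Boolean motif rule like the one on this slide could be evaluated against a sequence. The function name and the toy sequences are hypothetical; only the three motifs come from the slide, and the rule is read here as a plain substring test.

```python
# Motifs taken from the slide; everything else is illustrative.
REQUIRED = ("GTGCTCGGGC", "TCTCGCACTG")
EXCLUDED = "CGCCTCCCCT"


def matches_rule(seq: str) -> bool:
    """True when both required motifs occur and the excluded motif does not."""
    seq = seq.upper()
    return all(motif in seq for motif in REQUIRED) and EXCLUDED not in seq


# Toy sequences (not real nrITS 2 data).
with_rule = "AA" + "GTGCTCGGGC" + "TTT" + "TCTCGCACTG" + "AA"
without_rule = with_rule + "CGCCTCCCCT"

print(matches_rule(with_rule))     # both required motifs, excluded motif absent
print(matches_rule(without_rule))  # excluded motif present
```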

selection of the barcode locus

loci used for barcoding
nuclear: rDNA: 26S, 18S, ITS 1, ITS 2
mitochondrial: COI
chloroplast: trnH-psbA, rbcL

Consortium for the Barcode of Life (CBOL)
cpDNA: matK, rpoC1, rpoB, YCF5, accD, ndhJ
Edinburgh (UK) => Podocarpus, Araucaria, Asterella, Anastrophyllum
Instituto de Biologia UNAM (Mexico) => Agave
Kew (UK) => Conostylis, Pinus, Equisetum, Dactylorhiza
National Biodiversity Institute (South Africa) => Encephalartos, Mimetes
Natural History Museum (Denmark) => Hordeum, Scalesia, Crocus
Natural History Museum (UK) => Tortella, Ptychomniaceae, Asplenium
New York Botanical Garden (USA) => Elaphoglossum, Cupressus, Labordia
Universidad de los Andes (Colombia) => Lauraceae
University of Cape Town (South Africa) => Anastrophyllum, Bryum
Universidade Estadual de Feira de Santana (Brazil) => Laelia, Cattleya

measuring precision and accuracy

test data sets
gymnosperm nuclear ribosomal internal transcribed spacer 2 (nrITS 2): 1,037 sequences, 413 species, 71 genera
gymnosperm plastid encoded maturase K (matK): 522 sequences, 334 species, 75 genera

pairwise divergence
locus     sequences         median    interquartile range   zero comparisons
nrITS 2   all               30.99%    26.53–34.48%          0.09%
nrITS 2   one per species   29.39%    25.75–33.30%          0.21%
matK      all               20.39%    5.95–23.30%           0.54%
matK      one per species   21.38%    8.13–23.89%           0.42%
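The slides do not say how pairwise divergence was computed. As a hedged illustration only, the sketch below assumes an uncorrected p-distance over pre-aligned, equal-length sequences (gap columns skipped) and summarizes all pairs by median, interquartile range, and the fraction of zero-distance comparisons. All function names and the toy data are hypothetical.

```python
from itertools import combinations
from statistics import median, quantiles


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def divergence_summary(aligned: list[str]) -> dict:
    """Median, IQR bounds, and fraction of zero distances over all sequence pairs."""
    dists = [p_distance(a, b) for a, b in combinations(aligned, 2)]
    q1, _, q3 = quantiles(dists, n=4)  # needs at least two pairwise distances
    return {
        "median": median(dists),
        "iqr": (q1, q3),
        "zero_fraction": sum(d == 0 for d in dists) / len(dists),
    }


toy_alignment = ["ACGT", "ACGT", "ACGA", "TCGA"]  # toy data, not from the study
summary = divergence_summary(toy_alignment)
print(summary)
```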

hierarchical clustering

…alignment
locus     sequences         median unaligned length (IQR)   aligned length
nrITS 2   all               137 (108–250) bp                8,733 bp
nrITS 2   one per species   196 (115–260) bp                6,778 bp
matK      all               1,561 (1,412–1,661) bp          3,975 bp
matK      one per species   1,601 (1,530–1,661) bp          3,906 bp

hierarchical clustering
reference databases: aligned with MUSCLE 3.52
query sequence: aligned to the reference database using MUSCLE (“-profile” option)
parsimony (TNT 1.0):
(1) 200 iteration ratchet holding 1 tree
(2) SPR holding 1 tree
neighbor joining (PHYLIP 3.63): Jukes–Cantor distance (returns 1 tree)
identification scored using “Least Inclusive Clade”
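For reference, the Jukes–Cantor distance used for neighbor joining corrects the observed proportion of differing sites p as d = -(3/4) ln(1 - 4p/3). The sketch below is a plain restatement of that formula for two aligned sequences, not the PHYLIP implementation; the function name and the handling of gaps and ambiguous bases (skipped) are assumptions.

```python
import math


def jukes_cantor(a: str, b: str) -> float:
    """Jukes-Cantor distance between two aligned sequences.

    Only columns where both sequences carry A, C, G or T are compared
    (an assumption; other tools may treat gaps differently).
    """
    compared = [(x, y) for x, y in zip(a.upper(), b.upper())
                if x in "ACGT" and y in "ACGT"]
    if not compared:
        raise ValueError("no comparable sites")
    p = sum(x != y for x, y in compared) / len(compared)
    if p == 0:
        return 0.0
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1 - 4 * p / 3)


print(jukes_cantor("ACGTACGT", "ACGTACGT"))  # identical sequences
print(jukes_cantor("ACGT", "ACGA"))          # one difference in four sites
```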

Will and Rubinoff (2004)... identification ambiguity due to tree shape
Fitch (1971) optimization of group membership variables

Least Inclusive Clade
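The transcript does not define the "Least Inclusive Clade" score. One plausible reading, sketched below under that assumption, is: find the smallest clade that contains the query plus at least one reference sequence, and report the species of the references in it. The nested-tuple tree encoding, the function names, and the toy trees are all hypothetical.

```python
def leaves(node):
    """All leaf labels below a node; a leaf is a string, a clade is a tuple."""
    if isinstance(node, str):
        return [node]
    out = []
    for child in node:
        out.extend(leaves(child))
    return out


def least_inclusive_clade(tree, query):
    """Reference leaves in the smallest clade holding the query and >= 1 reference."""
    if isinstance(tree, str) or query not in leaves(tree):
        return None
    for child in tree:
        found = least_inclusive_clade(child, query)
        if found:
            return found
    others = [leaf for leaf in leaves(tree) if leaf != query]
    return others or None


def identify(tree, query, species_of):
    """Set of species in the least inclusive clade; more than one means ambiguous."""
    members = least_inclusive_clade(tree, query) or []
    return {species_of[m] for m in members}


species_of = {"ari1": "C. arizonica", "ari2": "C. arizonica", "lus1": "C. lusitanica"}
resolved = ((("ari1", "QUERY"), "ari2"), "lus1")
ambiguous = ("QUERY", ("ari1", "lus1"))

print(identify(resolved, "QUERY", species_of))
print(identify(ambiguous, "QUERY", species_of))
```

In the second toy tree the query is sister to a clade that mixes two species, so the tree shape alone leaves the identification ambiguous.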

…clustering with nrITS 2 and matK
locus     method              precision    accuracy to genus   accuracy to species
nrITS 2   parsimony ratchet   58% (13%)    98% (95%)           67% (46%)
nrITS 2   SPR search          60% (11%)    98% (96%)           69% (47%)
nrITS 2   neighbor joining    65% (8%)     97% (91%)           68% (42%)
matK      parsimony ratchet   71% (41%)    100% (99%)          77% (60%)
matK      SPR search          70% (41%)    99% (98%)           78% (58%)
matK      neighbor joining    44% (23%)    99% (97%)           75% (52%)

…clustering time (s)
method              25th percentile   50th percentile   75th percentile
parsimony ratchet
SPR search          11 12
neighbor joining
N = 29; 3.06 GHz Intel Pentium 4; 1 GB of RAM; Ubuntu Linux 5.04 (Hoary Hedgehog)

similarity methods

BLASTn (version )
BLAT (version 32)
megaBLAST (version )
default parameters
best match(es) taken as ID
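"Best match(es) taken as ID" implies that ties are kept rather than broken. A minimal sketch of that selection step, assuming each hit has already been reduced to a (species, score) pair; the slides do not say which score defines "best", so the score field and the function name are hypothetical.

```python
def best_matches(hits):
    """Species of every hit that shares the top score (ties are all kept)."""
    if not hits:
        return set()
    top = max(score for _, score in hits)
    return {species for species, score in hits if score == top}


# Toy hit list, not real search output.
hits = [
    ("Cupressus arizonica", 250.0),
    ("Cupressus lusitanica", 250.0),
    ("Cupressus sempervirens", 231.5),
]
print(best_matches(hits))
```

With tied top hits from two species, the identification is to both, which lowers precision without necessarily lowering accuracy.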

…similarity methods with nrITS 2 and matK
locus     method      precision    accuracy to genus   accuracy to species
nrITS 2   BLAST       94% (81%)    100% (100%)!        67% (63%)
nrITS 2   BLAT        94% (82%)    99% (99%)           66% (62%)
nrITS 2   megaBLAST   94% (80%)    95% (95%)           72% (68%)
matK      BLAST       99% (67%)    100% (100%)!        84% (68%)
matK      BLAT        99% (69%)    99% (99%)           82% (67%)
matK      megaBLAST   99% (61%)    100% (99%)          84% (64%)

…similarity time (s)
method      25th percentile   50th percentile   75th percentile
BLAST       1                 1                 2
BLAT        1                 1                 2
megaBLAST   0                 <1                2
N = 29; 3.06 GHz Intel Pentium 4; 1 GB of RAM; Ubuntu Linux 5.04 (Hoary Hedgehog)

combination methods (cf. BOLD–ID)

combination methods (cf. BOLD–ID):
(1) get the top 100 BLAST hits
(2) align with MUSCLE
(a) 200 iteration ratchet holding 1 tree
(b) SPR holding 1 tree
(c) neighbor joining with Jukes–Cantor distances

…combination methods with nrITS 2 and matK
locus     method                    precision    accuracy to genus   accuracy to species
nrITS 2   BLAST only                94% (81%)    100% (100%)         67% (63%)
nrITS 2   SPR only                  60% (11%)    98% (96%)           69% (47%)
nrITS 2   BLAST/parsimony ratchet   86% (74%)    99% (98%)           78% (67%)
nrITS 2   BLAST/SPR                 87% (73%)    100% (99%)          79% (67%)
nrITS 2   BLAST/neighbor joining    93% (71%)    99% (97%)           80% (64%)
matK      BLAST only                99% (67%)    100% (100%)         84% (68%)
matK      SPR only                  70% (41%)    99% (98%)           78% (58%)
matK      BLAST/parsimony ratchet   77% (55%)    100% (99%)          80% (60%)
matK      BLAST/SPR                 76% (53%)    100% (99%)          78% (61%)
matK      BLAST/neighbor joining    95% (56%)    100% (99%)          86% (56%)

…combination time (s)
method                    25th percentile   50th percentile   75th percentile
BLAST only                1                 1                 2
SPR only                  11 12
BLAST/parsimony ratchet
BLAST/SPR search
BLAST/neighbor joining
N = 29; 3.06 GHz Intel Pentium 4; 1 GB of RAM; Ubuntu Linux 5.04 (Hoary Hedgehog)

diagnostic methods

DNA–BAR (DasGupta et al. 2005):
each sequence and its reverse complement (separated by 50 "N" symbols)
presence/absence matrix of “distinguishers” up to 50 bp long
degenbar

DNA–BAR (DasGupta et al. 2005):
matrix of distinguishers
query + PERL script
ID = the reference sequence(s) with the greatest number of matching presence/absence scores
C. arizonica 1: matches = 582
C. arizonica 2: matches = 582
C. lusitanica 1: matches = 582
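The scoring step described on these two slides can be sketched as follows. This is an illustration in Python rather than the authors' PERL script, and it does not reproduce how degenbar selects distinguishers: the distinguisher list is simply given. Function names and toy data are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def profile(seq: str, distinguishers: list[str]) -> list[bool]:
    """Presence/absence of each distinguisher in the sequence or its reverse
    complement, joined by 50 N symbols as on the slide."""
    both = seq + "N" * 50 + reverse_complement(seq)
    return [d in both for d in distinguishers]


def identify(query: str, references: dict[str, str], distinguishers: list[str]):
    """Reference name(s) whose profile agrees with the query at the most positions."""
    q = profile(query, distinguishers)
    scores = {
        name: sum(a == b for a, b in zip(q, profile(seq, distinguishers)))
        for name, seq in references.items()
    }
    best = max(scores.values())
    return sorted(n for n, s in scores.items() if s == best), best


# Toy data: short strings stand in for real barcodes and distinguishers.
distinguishers = ["ACG", "TTT", "GGA"]
references = {"sp1 a": "ACGTTT", "sp1 b": "ACGTTTC", "sp2": "GGACCC"}
result = identify("ACGTTTA", references, distinguishers)
print(result)
```

As in the slide's example, several reference sequences can tie for the greatest number of matches, and all of them are returned.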

DNA–BAR... distinguisher matrix
columns: locus; sequences; distinguishers (total, unique); unidentifiable sequences
nrITS 2, all: 1,997495%
nrITS 2, one per species: 813275%
matK, all: 808814%
matK, one per species: %

diagnostic methods: DOME ID
reference database (via PERL and MySQL):
(1) all sequence strings of 10 nucleotides offset by 5 nucleotides were extracted from the reference sequences
(2) each string was classified as diagnostic (unique to a particular species) or non–diagnostic
(3) diagnostic strings were inserted into the diagnostic barcode database
[slide graphic: several hundred example 10-nucleotide strings, e.g., GCGTTGATGG, GTTGGGCGTT, CATACGTTGG, …]


diagnostic methods: DOME ID
query + MySQL + PERL script
ID = the reference sequence(s) with the greatest number of matching presence/absence scores
C. arizonica: matches = 43
diagnostic barcode database
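The DOME ID steps can be sketched in memory as below. This is not the authors' PERL and MySQL code: a Python dictionary stands in for the diagnostic barcode database, and scanning the query with every overlapping 10-nucleotide window is an assumption, since the slides only describe how the reference strings were extracted. All names and sequences are hypothetical.

```python
from collections import Counter, defaultdict


def windows(seq: str, size: int = 10, step: int = 5) -> list[str]:
    """Strings of `size` nucleotides taken every `step` nucleotides."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]


def build_database(references: list[tuple[str, str]]) -> dict[str, str]:
    """Map each diagnostic string (found in exactly one species) to that species."""
    owners = defaultdict(set)
    for species, seq in references:
        for w in windows(seq):
            owners[w].add(species)
    return {w: next(iter(sp)) for w, sp in owners.items() if len(sp) == 1}


def identify(query: str, database: dict[str, str]):
    """Species with the most diagnostic strings found in the query (ties kept)."""
    counts = Counter(database[w] for w in windows(query, step=1) if w in database)
    if not counts:
        return [], 0
    best = max(counts.values())
    return sorted(s for s, c in counts.items() if c == best), best


# Toy references: the shared poly-A string is non-diagnostic and is dropped.
references = [
    ("sp A", "AAAAAAAAAACCCCCCCCCC"),
    ("sp B", "AAAAAAAAAAGGGGGGGGGG"),
]
database = build_database(references)
result = identify("TTAAAAACCCCCCCCCCTT", database)
print(len(database), result)
```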

diagnostic methods: ATIM
presence/absence matrix of all possible 10 bp combinations [1,048,576 motifs]
PERL script
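The figure of 1,048,576 motifs is 4^10, the number of possible 10 bp strings over A, C, G and T. A minimal sketch of how one sequence's row of such a presence/absence matrix could be computed, stored sparsely as the set of column indices that are present. The base-4 indexing scheme and the function names are assumptions, not the authors' PERL script.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
K = 10
N_MOTIFS = 4 ** K  # 1,048,576 possible 10 bp motifs


def motif_index(motif: str) -> int:
    """Column index of a motif, reading it as a base-4 number."""
    idx = 0
    for base in motif:
        idx = idx * 4 + CODE[base]
    return idx


def present_motifs(seq: str, k: int = K) -> set[int]:
    """Indices of all k bp motifs that occur in the sequence.

    Windows containing anything other than A, C, G or T are skipped.
    """
    present = set()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if all(base in CODE for base in window):
            present.add(motif_index(window))
    return present


row = present_motifs("ACGTACGTACGT")
print(N_MOTIFS, len(row))
```

A dense row would hold one 0/1 character per motif; the sparse set carries the same information and is easier to inspect.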

diagnostic methods: ATIM
1,048,576 character presence/absence matrix
TNT (parsimony ratchet)
reference tree (strict consensus)

diagnostic methods: ATIM
query + 1,048,576 character presence/absence matrix + reference tree (positive constraint)
TNT (TBR hold 20)
identification scored using “Least Inclusive Clade”

…diagnostic methods with nrITS 2 and matK
locus     method    precision      accuracy to genus   accuracy to species
nrITS 2   DNA–BAR   98% (89%)!     86% (86%)           65% (62%)
nrITS 2   DOME ID   80% (80%)      86% (84%)           67% (66%)
nrITS 2   ATIM      100% (83%)     99% (98%)           83% (71%)!
matK      DNA–BAR   100% (79%)!    96% (96%)           73% (62%)
matK      DOME ID   60% (60%)      53% (53%)           50% (50%)
matK      ATIM      100% (67%)     98% (97%)           87% (53%)

…diagnostic time (s)
method    25th percentile   50th percentile   75th percentile
DNA–BAR   1                 1                 1
DOME ID
ATIM
N = 29; 3.06 GHz Intel Pentium 4; 1 GB of RAM; Ubuntu Linux 5.04 (Hoary Hedgehog)

DAWG I “training” dataset

…the DAWG I “training” dataset
method    data set                precision            accuracy to species
SPR       DAWG I                  71% (41%)            86% (81%)
SPR       gymnosperms (range)     60–70% (11–41%)      69–78% (47–58%)
BLAST     DAWG I                  100% (78%)           83% (83%)
BLAST     gymnosperms (range)     94–99% (67–81%)      67–84% (63–68%)
DNA–BAR   DAWG I                  97% (90%)            43% (42%)
DNA–BAR   gymnosperms (range)     98–100% (79–89%)     65–73% (62%)
ATIM      DAWG I                  100% (72%)           75% (69%)
ATIM      gymnosperms (range)     100% (67–83%)        83–87% (53–71%)

conclusions:
all methods are relatively precise
=> expect accuracy to approximate precision
observed accuracy of species-level identification is lower
=> failure of the algorithms to correspond to species delimitations (shared haplotypes or haplotypes of a species are more similar to those of different species)
=> for accurate identification, the reference database must contain virtually all haplotypes
none of the methods performed particularly well
=> computer time
=> BLAST (BLAT and megaBLAST too)
=> DNA–BAR

acknowledgments
brilliant insights &tc: K. Cameron, C. Chaboo, T. Dikow, C. Martin, R. Meier, M. Mundry
money: Cullman Program for Molecular Systematic Studies, DIMACS/NSF