DNA Barcode sequence identification incorporating taxonomic hierarchy and within taxon variability Damon P. Little Cullman Program for Molecular Systematics.


DNA Barcode sequence identification incorporating taxonomic hierarchy and within taxon variability
Damon P. Little
Cullman Program for Molecular Systematics Studies
The New York Botanical Garden, Bronx, New York

test data sets (Little and Stevenson 2007)

gymnosperm nuclear ribosomal internal transcribed spacer 2 (nrITS 2)
- 1,037 sequences
- 413 species
- 71 genera

gymnosperm plastid encoded maturase K (matK)
- 522 sequences
- 334 species
- 75 genera

…alignment

locus      sequences          median unaligned length (IQR)    aligned length
nrITS 2    all                137 (108–250) bp                 8,733 bp
nrITS 2    one per species    196 (115–260) bp                 6,778 bp
matK       all                1,561 (1,412–1,661) bp           3,975 bp
matK       one per species    1,601 (1,530–1,661) bp           3,906 bp

pairwise divergence

locus      sequences          median    interquartile range    zero comparisons
nrITS 2    all                30.99%    26.53–34.48%           0.09%
nrITS 2    one per species    29.39%    25.75–33.30%           0.21%
matK       all                20.39%    5.95–23.30%            0.54%
matK       one per species    21.38%    8.13–23.89%            0.42%

measuring precision and accuracy

precision

method                      nrITS2        matK
parsimony ratchet           58% (13%)     71% (41%)
SPR search                  60% (11%)     70% (41%)
neighbor joining            65% (8%)      44% (23%)
BLAST                       94% (81%)     99% (67%)
BLAT                        94% (82%)     99% (69%)
megaBLAST                   94% (80%)     99% (61%)
BLAST/parsimony ratchet     86% (74%)     77% (55%)
BLAST/SPR                   87% (73%)     76% (53%)
BLAST/neighbor joining      93% (71%)     95% (56%)
DNA–BAR                     98% (89%)     100% (79%)
DOME ID                     80% (80%)     60% (60%)
ATIM                        100% (83%)    100% (67%)

accuracy to species

method                      nrITS2        matK
parsimony ratchet           67% (46%)     77% (60%)
SPR search                  69% (47%)     78% (58%)
neighbor joining            68% (42%)     75% (52%)
BLAST                       67% (63%)     84% (68%)
BLAT                        66% (62%)     82% (67%)
megaBLAST                   72% (68%)     84% (64%)
BLAST/parsimony ratchet     78% (67%)     80% (60%)
BLAST/SPR                   79% (67%)     78% (61%)
BLAST/neighbor joining      80% (64%)     86% (56%)
DNA–BAR                     65% (62%)     73% (62%)
DOME ID                     67% (66%)     50% (50%)
ATIM                        83% (71%)     87% (53%)

lessons learned

“global” alignments do not work


“fuzzy” matches are not precise


autoapomorphies (unique characters) work... but are not always present

precision

method                      nrITS2         matK
parsimony ratchet           58% (13%)      71% (41%)
SPR search                  60% (11%)      70% (41%)
neighbor joining            65% (8%)       44% (23%)
BLAST                       94% (81%)      99% (67%)
BLAT                        94% (82%)      99% (69%)
megaBLAST                   94% (80%)      99% (61%)
BLAST/parsimony ratchet     86% (74%)      77% (55%)
BLAST/SPR                   87% (73%)      76% (53%)
BLAST/neighbor joining      93% (71%)      95% (56%)
DNA–BAR                     98% (89%)      100% (79%)
DOME ID                     80% (80%)      60% (60%)
DOME ID*                    100% (100%)
ATIM                        100% (83%)     100% (67%)

accuracy to species

method                      nrITS2         matK
parsimony ratchet           67% (46%)      77% (60%)
SPR search                  69% (47%)      78% (58%)
neighbor joining            68% (42%)      75% (52%)
BLAST                       67% (63%)      84% (68%)
BLAT                        66% (62%)      82% (67%)
megaBLAST                   72% (68%)      84% (64%)
BLAST/parsimony ratchet     78% (67%)      80% (60%)
BLAST/SPR                   79% (67%)      78% (61%)
BLAST/neighbor joining      80% (64%)      86% (56%)
DNA–BAR                     65% (62%)      73% (62%)
DOME ID                     67% (66%)      50% (50%)
DOME ID*                    76% (75%)      90% (90%)
ATIM                        83% (71%)      87% (53%)

some sequences are simply unidentifiable

...remaining (insoluble) problems
- identical sequences for multiple terminals
- shared alleles between terminals
- use allele frequency as a predictor?
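The first problem on this slide can be checked directly in a reference set. The following is a minimal sketch, assuming the references are available as (terminal, sequence) pairs; the function name `shared_sequences` and the example labels are hypothetical and do not come from the presentation.

```python
from collections import defaultdict


def shared_sequences(records):
    """Return every sequence that occurs under more than one terminal label."""
    terminals_by_sequence = defaultdict(set)
    for terminal, sequence in records:
        terminals_by_sequence[sequence.upper()].add(terminal)
    # A sequence shared by two or more terminals cannot separate them.
    return {
        sequence: terminals
        for sequence, terminals in terminals_by_sequence.items()
        if len(terminals) > 1
    }


# Hypothetical reference records, for illustration only.
records = [
    ("species_a", "ACGTTGCA"),
    ("species_b", "acgttgca"),  # same bases as species_a, different case
    ("species_c", "ACGTAGCA"),
]
ambiguous = shared_sequences(records)
```

Any query that matches a sequence in `ambiguous` can at best be narrowed to the listed set of terminals, whichever identification method is used.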

desirable methodologies and properties of Sequence IDentification Engines (SIDEs)

Sequence IDentification Engines (SIDEs)
- avoid global alignment by comparing short segments: pseudo–alignment
- use exact matches
- use autoapomorphies where possible...but allow the use of other characters too

context/text DNA recoding
- characters are defined by flanking context => pretext and postext
- permits “alignment–free” comparisons
- the size of, and separation between, pretext and postext must be arbitrarily delimited
- the possible states (text) are limited by the proximity of the contexts, that is, by the length of the text
- terminals can be individual sequences or composites representing taxa
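The recoding described on this slide can be sketched as a sliding window. This is an illustration under stated assumptions, not the BRONX implementation: the fixed `context_len` and `text_len` values, the one-base step, and the names `Character` and `recode` are all hypothetical choices made for the example.

```python
from typing import Iterator, NamedTuple


class Character(NamedTuple):
    pretext: str   # left flanking context
    text: str      # state observed between the two contexts
    postext: str   # right flanking context


def recode(sequence: str, context_len: int = 4, text_len: int = 2) -> Iterator[Character]:
    """Yield pretext/text/postext triples from a window moved one base at a time."""
    window = 2 * context_len + text_len
    seq = sequence.upper()
    for start in range(len(seq) - window + 1):
        pre_end = start + context_len
        text_end = pre_end + text_len
        yield Character(
            seq[start:pre_end],
            seq[pre_end:text_end],
            seq[text_end:start + window],
        )


# A 10-base sequence with 3-base contexts and a 2-base text gives 3 characters.
chars = list(recode("ACGTACGTAC", context_len=3, text_len=2))
```

Because each character is identified by its flanking context rather than by a column position, two sequences can be compared without first aligning them.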

querying text/context database
- find pretext/text/postext in the query sequence and match to references
- score terminals based on the number of matches
- final score can be raw or based on a weighting function
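The query step can be sketched as an exact-match lookup followed by a raw count per terminal. The code below is a minimal sketch, assuming fixed window sizes and an in-memory dictionary as the "database"; the function names and the two reference sequences are hypothetical.

```python
from collections import Counter, defaultdict


def characters(seq, context_len=3, text_len=2):
    """Yield (pretext, text, postext) triples; the window sizes are arbitrary choices."""
    window = 2 * context_len + text_len
    for i in range(len(seq) - window + 1):
        a, b = i + context_len, i + context_len + text_len
        yield seq[i:a], seq[a:b], seq[b:i + window]


def build_index(references):
    """Map each triple to the set of terminals whose reference sequence contains it."""
    index = defaultdict(set)
    for terminal, seq in references.items():
        for triple in characters(seq):
            index[triple].add(terminal)
    return index


def score_query(query, index):
    """Raw score: number of distinct query triples matched exactly by each terminal."""
    scores = Counter()
    for triple in set(characters(query)):
        for terminal in index.get(triple, ()):
            scores[terminal] += 1
    return scores


# Hypothetical references differing at a single base.
references = {
    "species_a": "ACGTACGTTGCA",
    "species_b": "ACGTACCTTGCA",
}
index = build_index(references)
scores = score_query("CGTACGTTGC", index)
best = scores.most_common(1)[0][0]
```

Here the query is a fragment of the `species_a` reference, so all three of its triples match `species_a` and none match `species_b`.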

possible weighting functions
- equal weights (raw score)
- number of distinct texts => upweights more variable characters
- 1/(number of distinct texts) => downweights more variable characters
- (number of texts)/(number of scores)
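The first three weighting functions can be sketched as follows. This assumes that "number of distinct texts" means the number of different texts observed between a given pretext/postext pair across the references; that reading, and every name in the code, is an assumption. The fourth function is left out because the slide does not define its terms.

```python
from collections import defaultdict


def distinct_text_counts(reference_triples):
    """Count the distinct texts seen between each (pretext, postext) pair."""
    texts = defaultdict(set)
    for pretext, text, postext in reference_triples:
        texts[(pretext, postext)].add(text)
    return {context: len(seen) for context, seen in texts.items()}


# n is the number of distinct texts for the matched character's context.
WEIGHTS = {
    "equal": lambda n: 1.0,                 # raw score
    "distinct": lambda n: float(n),         # upweights more variable characters
    "inverse_distinct": lambda n: 1.0 / n,  # downweights more variable characters
}


def weighted_score(matched_triples, counts, scheme="equal"):
    """Sum the chosen weight over every character a terminal matched."""
    weight = WEIGHTS[scheme]
    return sum(weight(counts[(pre, post)]) for pre, _text, post in matched_triples)


# Hypothetical reference characters: one variable context, one invariant context.
reference_triples = [
    ("ACG", "TA", "CGT"), ("ACG", "TT", "CGT"), ("ACG", "TC", "CGT"),
    ("GGA", "CC", "TTG"),
]
counts = distinct_text_counts(reference_triples)
matched = [("ACG", "TA", "CGT"), ("GGA", "CC", "TTG")]
```

With these inputs the variable context carries three distinct texts and the invariant one carries one, so the same two matches are worth 2 under equal weights, 4 when variable characters are upweighted, and 4/3 when they are downweighted.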

precision

method                      nrITS2        matK
parsimony ratchet           58% (13%)     71% (41%)
SPR search                  60% (11%)     70% (41%)
neighbor joining            65% (8%)      44% (23%)
BLAST                       94% (81%)     99% (67%)
BLAT                        94% (82%)     99% (69%)
megaBLAST                   94% (80%)     99% (61%)
BLAST/parsimony ratchet     86% (74%)     77% (55%)
BLAST/SPR                   87% (73%)     76% (53%)
BLAST/neighbor joining      93% (71%)     95% (56%)
DNA–BAR                     98% (89%)     100% (79%)
DOME ID                     80% (80%)     60% (60%)
ATIM                        100% (83%)    100% (67%)
BRONX 0                     91% (90%)     88% (84%)
BRONX 1                     96% (86%)     98% (79%)

accuracy to species

method                      nrITS2        matK
parsimony ratchet           67% (46%)     77% (60%)
SPR search                  69% (47%)     78% (58%)
neighbor joining            68% (42%)     75% (52%)
BLAST                       67% (63%)     84% (68%)
BLAT                        66% (62%)     82% (67%)
megaBLAST                   72% (68%)     84% (64%)
BLAST/parsimony ratchet     78% (67%)     80% (60%)
BLAST/SPR                   79% (67%)     78% (61%)
BLAST/neighbor joining      80% (64%)     86% (56%)
DNA–BAR                     65% (62%)     73% (62%)
DOME ID                     67% (66%)     50% (50%)
ATIM                        83% (71%)     87% (53%)
BRONX 0                     59% (58%)     76% (71%)
BRONX 1                     72% (67%)     92% (75%)

BRONX conclusions
- BRONX is more precise than existing algorithms
- BRONX is sometimes more accurate than existing algorithms
- BRONX is an incremental improvement

future directions
- improve the scoring function in BRONX
- dynamically size context/text
- benchmark additional datasets for all methods
- incorporate context/text recoding into a scalable version of the ATIM algorithm

acknowledgments
- Kenneth Cameron
- Santiago Madriñán
- Christian Schulz
- Dennis Stevenson