A Comparison of Algorithms for Species Identification based on DNA barcodes Bogdan Paşaniuc CSE Department, University of Connecticut Joint work with Alexander.

1 A Comparison of Algorithms for Species Identification based on DNA barcodes Bogdan Paşaniuc CSE Department, University of Connecticut Joint work with Alexander Gusev, Sotirios Kentros, James Lindsay and Ion Măndoiu

2 Introduction
Several methods have been proposed for assigning specimens to species
  - TaxI (Steinke et al. 05), likelihood ratio test (Matz & Nielsen 06), BOLD-IDS (Ratnasingham & Hebert 07), …
No direct comparisons on standardized benchmarks
This work:
  - Direct comparison of methods from three main classes: distance-based, tree-based, and statistical model-based
  - Explores the effect of repository size: #barcodes/species, #species
Species identification problem
  - Given a repository containing barcodes from known species and a new barcode, find the species of the new barcode

3 Datasets
Fishes of Australia Container Part [Ward et al., 05]
  - 754 barcodes, 211 species, 113 genera
Cowries [Meyer and Paulay, 05]
  - 2036 barcodes, 263 species, 46 genera
Birds of North America - Phase II [Kerr K.C.R. et al., 07]
  - 2589 barcodes, 656 species, 289 genera
Bats of Guyana [Clare E.L. et al., 06]
  - 840 barcodes, 96 species, 50 genera
Hesperiidae of the ACG 1 [Hajibabaei M. et al., 05]
  - 4267 barcodes, 561 species, 207 genera
90% of barcodes in training and 10% in testing
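The 90%/10% split above could be reproduced along the following lines. This is only a sketch: the slides do not say how the split was drawn (for example, whether it was stratified by species), so a plain random shuffle is assumed, and `split_repository`, `test_fraction`, and `seed` are hypothetical names.

```python
import random

def split_repository(barcodes, test_fraction=0.10, seed=0):
    """Randomly split (species, sequence) pairs into training and test sets.

    Assumption: a simple unstratified shuffle; the original split procedure
    is not described in the slides.
    """
    rng = random.Random(seed)
    shuffled = list(barcodes)          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)
```

With an unstratified split, a rare species can end up entirely in the test set, which is one reason the number of barcodes per species matters for evaluation.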

4 Distance-based methods
Barcode assigned to the closest species
  - Two variants: minimum/maximum or average
Hamming distance [MIN-HD, AVG-HD]
  - Percent of sequence divergence
Amino acid similarity [MAX-AA-SIM, AVG-AA-SIM]
  - BLOSUM62 matrix to score similarity
Convex score similarity [MAX-CS-SIM]
  - Higher score to longer consecutive runs of matches
Tri-nucleotide frequency distance [MIN-3FREQ]
  - Euclidean distance between vectors of frequencies
Combined method [COMB]
  - Assignment made using majority rule
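As an illustration of the distance-based idea, the sketch below assigns a query barcode to the species with the smallest minimum or average distance, using either the Hamming distance or the tri-nucleotide frequency distance. It assumes pre-aligned, equal-length sequences for the Hamming case; the function names are hypothetical and this is not the authors' implementation.

```python
import math
from collections import defaultdict
from itertools import product

TRIMERS = ["".join(p) for p in product("ACGT", repeat=3)]

def hamming_fraction(a, b):
    """Fraction of differing positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def trimer_freqs(seq):
    """Vector of overlapping tri-nucleotide frequencies (64 entries)."""
    counts = dict.fromkeys(TRIMERS, 0)
    total = 0
    for i in range(len(seq) - 2):
        t = seq[i:i + 3]
        if t in counts:                 # skip trimers containing ambiguous bases
            counts[t] += 1
            total += 1
    return [counts[t] / total if total else 0.0 for t in TRIMERS]

def trimer_distance(a, b):
    """Euclidean distance between tri-nucleotide frequency vectors."""
    return math.dist(trimer_freqs(a), trimer_freqs(b))

def assign_species(query, repository, distance, mode="min"):
    """Return the species whose barcodes are closest to the query.

    repository: iterable of (species, sequence); mode: "min" or "avg".
    """
    per_species = defaultdict(list)
    for species, seq in repository:
        per_species[species].append(distance(query, seq))
    if mode == "min":
        score = {s: min(d) for s, d in per_species.items()}
    else:
        score = {s: sum(d) / len(d) for s, d in per_species.items()}
    return min(score, key=score.get)
```

A call such as `assign_species(q, repo, hamming_fraction, mode="avg")` would correspond roughly to AVG-HD, and `assign_species(q, repo, trimer_distance)` to MIN-3FREQ.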

5 Distance-based methods

6 Tree-based methods
Exemplar NJ [Meyer & Paulay 05]
  - One exemplar per species (random)
  - One neighbor joining tree for exemplar + unknown barcodes
Profile NJ [Muller et al., 04]
  - Distance between profiles
  - Neighbor joining tree for the species profiles
Phylogenetic Traversal
  - Construct NJ tree from training profiles
  - Traverse down the tree (from the root)
  - Choose the least distant branch
Substitution models: UNC, JK, K2P, TN.
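The tree-based methods all start from pairwise distances under a substitution model. As a sketch of two of the listed models, the code below computes the uncorrected distance and the Kimura two-parameter (K2P) distance using the standard formula d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the fractions of transitions and transversions. Aligned input is assumed, and reading UNC as "uncorrected" is an interpretation of the slide's abbreviation; tree construction itself is not shown.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def _compared_pairs(a, b):
    """Aligned positions where both sequences have an unambiguous base."""
    return [(x, y) for x, y in zip(a, b) if x in "ACGT" and y in "ACGT"]

def uncorrected_distance(a, b):
    """Proportion of differing sites (p-distance)."""
    pairs = _compared_pairs(a, b)
    return sum(x != y for x, y in pairs) / len(pairs)

def k2p_distance(a, b):
    """Kimura two-parameter distance.

    Raises ValueError (math domain error) when the sequences are too
    divergent for the correction to be defined.
    """
    pairs = _compared_pairs(a, b)
    n = len(pairs)
    diffs = [(x, y) for x, y in pairs if x != y]
    transitions = sum(
        1 for x, y in diffs if {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    )
    transversions = len(diffs) - transitions
    p, q = transitions / n, transversions / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

A matrix of such distances is what a neighbor joining routine would take as input.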

7 Tree-based methods

8 Statistical model-based
Likelihood ratio test for species membership using MCMC [Matz & Nielsen 06]
  - Impractical runtime even for a moderate #species
Scalable models explored: position weight matrices, Markov chains, hidden Markov models
  - Similar to models used successfully in other sequence analysis problems, such as DNA motif finding and protein families

9 Positional Weight Matrix (PWM)
Assumption: independence of loci
  - $P(x \mid SP) = P(x_1 \mid SP) \cdot P(x_2 \mid SP) \cdots P(x_n \mid SP)$
For each locus, $P(x_i \mid SP)$ is estimated as the probability of seeing each nucleotide at that locus in DB sequences from species SP
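A minimal sketch of the PWM model: per-locus nucleotide frequencies are estimated from a species' aligned barcodes, and a query is assigned to the species with the highest log-likelihood. The pseudocount is an assumption added here to avoid zero probabilities (the slides do not say how unseen nucleotides are handled), and all function names are hypothetical.

```python
import math

ALPHABET = "ACGT"

def train_pwm(seqs, pseudocount=1.0):
    """Estimate P(x_i | SP) for every locus i from aligned sequences of one species."""
    pwm = []
    for i in range(len(seqs[0])):
        counts = {b: pseudocount for b in ALPHABET}   # assumed smoothing
        for s in seqs:
            if s[i] in counts:                        # ignore gaps / ambiguous bases
                counts[s[i]] += 1
        total = sum(counts.values())
        pwm.append({b: c / total for b, c in counts.items()})
    return pwm

def pwm_log_likelihood(x, pwm):
    """log P(x | SP) = sum over loci of log P(x_i | SP)."""
    return sum(math.log(col[b]) for b, col in zip(x, pwm) if b in col)

def classify(x, models):
    """models: dict mapping species name to a trained PWM."""
    return max(models, key=lambda sp: pwm_log_likelihood(x, models[sp]))
```

Working in log space turns the product over loci into a sum and avoids underflow on barcodes that are several hundred bases long.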

10 Inhomogeneous Markov Chain (IMC)
Takes into account dependencies between consecutive loci
[Figure: a start state followed by four states (A, C, T, G) at each of locus 1, locus 2, locus 3, locus 4, …]
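The chain can be sketched as a first-order model with a separate transition table for every pair of consecutive loci, so that P(x | SP) = P(x_1) · P(x_2 | x_1) · … · P(x_n | x_{n-1}). The first-order dependence and the pseudocount smoothing are assumptions of this sketch, as are the function names.

```python
import math

ALPHABET = "ACGT"

def _normalize(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def train_imc(seqs, pseudocount=1.0):
    """Return (start, trans): start[b] = P(x_1 = b) and
    trans[i][a][b] = P(x_{i+2} = b | x_{i+1} = a) in 1-based locus numbering."""
    start = {b: pseudocount for b in ALPHABET}
    for s in seqs:
        if s[0] in start:
            start[s[0]] += 1
    trans = []
    for i in range(1, len(seqs[0])):
        counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
        for s in seqs:
            if s[i - 1] in counts and s[i] in counts:
                counts[s[i - 1]][s[i]] += 1
        trans.append({a: _normalize(row) for a, row in counts.items()})
    return _normalize(start), trans

def imc_log_likelihood(x, model):
    """log P(x | SP) for a query made only of A, C, G, T."""
    start, trans = model
    ll = math.log(start[x[0]])
    for i in range(1, len(x)):
        ll += math.log(trans[i - 1][x[i - 1]][x[i]])
    return ll
```

Compared with the PWM, each locus now stores 16 parameters instead of 4, so more barcodes per species are needed to estimate them reliably.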

11 Hidden Markov Model (HMM)
Same structure as the IMC
  - Each state emits the associated DNA base with high probability, but can also emit the other bases with probability equal to the mutation rate
Barcode x is generated along path p with probability equal to the product of the emission and transition probabilities along p
P(x|HMM) = sum of probabilities over all paths
  - Efficiently computed by the forward algorithm
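The forward computation can be sketched as follows. The model parameters `start` and `trans` are assumed to have the same shape as an inhomogeneous chain (one transition table per pair of consecutive loci). The emission rule, where a state emits its own base with probability 1 - mu and each other base with probability mu/3, is one possible reading of "probability equal to mutation rate" and is an assumption of this sketch. Per-step rescaling is used so that long barcodes do not underflow.

```python
import math

ALPHABET = "ACGT"

def hmm_log_likelihood(x, start, trans, mu=0.01):
    """log P(x | HMM) via the scaled forward algorithm.

    start[s]       : probability of state s at locus 1
    trans[i][s][t] : probability of moving from state s at locus i+1
                     to state t at locus i+2 (1-based loci)
    mu             : assumed total mutation (mis-emission) probability
    """
    def emit(state, base):
        return 1.0 - mu if state == base else mu / 3.0

    forward = {s: start[s] * emit(s, x[0]) for s in ALPHABET}
    log_p = 0.0
    for i in range(len(x)):
        if i > 0:
            forward = {
                t: emit(t, x[i])
                * sum(forward[s] * trans[i - 1][s][t] for s in ALPHABET)
                for t in ALPHABET
            }
        scale = sum(forward.values())
        if scale == 0.0:
            return float("-inf")          # x cannot be generated by this model
        log_p += math.log(scale)          # product of scales equals P(x | HMM)
        forward = {s: v / scale for s, v in forward.items()}
    return log_p
```

Each step costs a constant 16 multiplications per locus, so the sum over all 4^n paths is obtained in time linear in the barcode length.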

12 Probabilistic model-based methods
HMM not scalable: genus-level identification

13 Comparison of representative methods

14 Effect of #barcodes/species
BOLD species with at least 25 barcodes (270 sp, 17197 barcodes)
Randomly picked 5-20 barcodes from all species
All remaining barcodes used in testing
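The sampling scheme for this experiment might look like the sketch below: keep only species with enough barcodes, draw k of them per species for the repository, and test on the rest. The function and parameter names are hypothetical, and the uniform random draw is an assumption.

```python
import random
from collections import defaultdict

def subsample_per_species(barcodes, k, min_per_species=25, seed=0):
    """Split (species, sequence) pairs into k training barcodes per species
    and a test set holding all remaining barcodes of the kept species."""
    rng = random.Random(seed)
    by_species = defaultdict(list)
    for species, seq in barcodes:
        by_species[species].append(seq)

    train, test = [], []
    for species, seqs in by_species.items():
        if len(seqs) < min_per_species:
            continue                       # species has too few barcodes
        rng.shuffle(seqs)
        train += [(species, s) for s in seqs[:k]]
        test += [(species, s) for s in seqs[k:]]
    return train, test
```

Running this for k = 5, 10, 15, 20 and scoring each classifier on the returned test set would give one accuracy curve per method.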

15 Effect of #species
BOLD species with at least 10 barcodes (690 sp, 23558 barcodes)
Randomly picked 100 to 690 species (10 barcodes per species)
All remaining barcodes from picked species used in testing

16 Conclusions & Ongoing work
Presented an initial comparison of a broad range of species assignment methods
Ongoing work explores further effects
  - New species detection
  - Barcode length/quality
  - Runtime scalability (up to millions of species)
  - More datasets

