A Comparison of Algorithms for Species Identification based on DNA barcodes Bogdan Paşaniuc CSE Department, University of Connecticut Joint work with Alexander.


A Comparison of Algorithms for Species Identification based on DNA barcodes Bogdan Paşaniuc CSE Department, University of Connecticut Joint work with Alexander Gusev, Sotirios Kentros, James Lindsay and Ion Măndoiu

Introduction
- Several methods have been proposed for assigning specimens to species
  - TaxI (Steinke et al. 05), likelihood ratio test (Matz & Nielsen 06), BOLD-IDS (Ratnasingham & Hebert 07), …
- No direct comparisons on standardized benchmarks
- This work:
  - Direct comparison of methods from three main classes: distance-based, tree-based, and statistical model-based
  - Explore the effect of repository size: #barcodes/species, #species
- Species identification problem
  - Given a repository containing barcodes from known species and a new barcode, find its species

Datasets
- Fishes of Australia Container Part [Ward et al., 05]: 754 barcodes, 211 species, 113 genera
- Cowries [Meyer and Paulay, 05]: 2036 barcodes, 263 species, 46 genera
- Birds of North America - Phase II [Kerr K.C.R. et al., 07]: 2589 barcodes, 656 species, 289 genera
- Bats of Guyana [Clare E.L. et al., 06]: 840 barcodes, 96 species, 50 genera
- Hesperiidae of the ACG 1 [Hajibabaei M. et al., 05]: 4267 barcodes, 561 species, 207 genera
- Split: 90% of barcodes for training, 10% for testing

Distance-based methods
- Barcode assigned to the closest species
  - Two variants: minimum/maximum or average
- Hamming distance [MIN-HD, AVG-HD]
  - Percent of sequence divergence
- Amino acid similarity [MAX-AA-SIM, AVG-AA-SIM]
  - BLOSUM62 matrix to score similarity
- Convex score similarity [MAX-CS-SIM]
  - Higher score for longer consecutive runs of matches
- Tri-nucleotide frequency distance [MIN-3FREQ]
  - Euclidean distance between vectors of frequencies
- Combined method [COMB]
  - Assignment made using majority rule
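
The two Hamming-distance variants can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes barcodes are already aligned to equal length, and the names `hamming_fraction` and `assign_species` and the toy repository are hypothetical.

```python
from collections import defaultdict


def hamming_fraction(a, b):
    """Fraction of mismatching positions between two aligned barcodes."""
    if len(a) != len(b):
        raise ValueError("barcodes must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def assign_species(query, repository, variant="min"):
    """Assign query to a species from a list of (species, barcode) pairs.

    variant="min": species owning the single closest barcode (MIN-HD style).
    variant="avg": species with the smallest mean distance (AVG-HD style).
    """
    distances = defaultdict(list)
    for species, barcode in repository:
        distances[species].append(hamming_fraction(query, barcode))
    if variant == "min":
        score = {sp: min(d) for sp, d in distances.items()}
    elif variant == "avg":
        score = {sp: sum(d) / len(d) for sp, d in distances.items()}
    else:
        raise ValueError("variant must be 'min' or 'avg'")
    return min(score, key=score.get)


# Toy repository (made-up sequences, for illustration only).
repo = [
    ("sp_A", "ACGTACGT"),
    ("sp_A", "TTTTTTTT"),
    ("sp_B", "ACGTACCC"),
    ("sp_B", "ACGTAGCC"),
]
query = "ACGTACGT"
print(assign_species(query, repo, "min"))
print(assign_species(query, repo, "avg"))
```

On this toy input the two variants disagree: the minimum rule follows the single identical barcode, while the average rule is pulled away by the divergent second barcode of the same species.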

Distance-based methods

Tree-based methods
- Exemplar NJ [Meyer & Paulay 05]
  - One exemplar per species (random)
  - One neighbor-joining tree for exemplar + unknown barcodes
- Profile NJ [Muller et al., 04]
  - Distance between profiles
  - Neighbor-joining tree for the species profiles
- Phylogenetic traversal
  - Construct NJ tree from training profiles
  - Traverse down the tree (from the root)
  - Choose least distant branch
- Substitution models: UNC, JK, K2P, TN
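
The slide only names the substitution models, so the following is a sketch of the standard textbook distance formulas under stated assumptions: UNC is read as the uncorrected p-distance, JK as Jukes-Cantor, and K2P as Kimura two-parameter. TN is omitted. Sequences are assumed aligned and to contain only A, C, G, T; all function names are hypothetical.

```python
from math import log, sqrt

PURINES = {"A", "G"}  # C and T are pyrimidines


def site_differences(a, b):
    """Return (P, Q): fractions of transition and transversion sites."""
    transitions = transversions = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        # A<->G or C<->T stays within one chemical class: a transition.
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    n = len(a)
    return transitions / n, transversions / n


def uncorrected(a, b):
    """p-distance: fraction of differing sites."""
    P, Q = site_differences(a, b)
    return P + Q


def jukes_cantor(a, b):
    """d = -3/4 * ln(1 - 4p/3); defined only for p < 3/4."""
    p = uncorrected(a, b)
    return -0.75 * log(1 - 4 * p / 3)


def kimura_2p(a, b):
    """d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q)); needs positive arguments."""
    P, Q = site_differences(a, b)
    return -0.5 * log((1 - 2 * P - Q) * sqrt(1 - 2 * Q))


a, b = "ACGTACGTAC", "GCGTACGTAA"  # one transition, one transversion
print(uncorrected(a, b), jukes_cantor(a, b), kimura_2p(a, b))
```

Both corrected distances exceed the uncorrected one because they account for multiple substitutions at the same site.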

Tree-based methods

Statistical model-based
- Likelihood ratio test for species membership using MCMC [Matz & Nielsen 06]
  - Impractical runtime even for moderate #species
- Scalable models explored: position weight matrices, Markov chains, hidden Markov models
  - Similar to models used successfully in other sequence analysis problems such as DNA motif finding and protein families

Positional Weight Matrix (PWM)
- Assumption: independence of loci
  - $P(x \mid SP) = P(x_1 \mid SP) \cdot P(x_2 \mid SP) \cdots P(x_n \mid SP)$
- For each locus $i$, $P(x_i \mid SP)$ is estimated as the probability of seeing nucleotide $x_i$ at that locus in DB sequences from species SP
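
A minimal sketch of PWM training and classification under the independence assumption above. The pseudocount smoothing, the use of log-probabilities, and the names `train_pwm`, `log_likelihood` and `classify` are assumptions added for illustration; the slides do not say how zero counts were handled.

```python
from math import log

BASES = "ACGT"


def train_pwm(barcodes, pseudocount=1.0):
    """Estimate per-locus nucleotide probabilities from one species' barcodes."""
    pwm = []
    for i in range(len(barcodes[0])):
        # Pseudocount (an assumption) keeps unseen bases from getting P = 0.
        counts = {b: pseudocount for b in BASES}
        for seq in barcodes:
            if seq[i] in counts:
                counts[seq[i]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm


def log_likelihood(x, pwm):
    """log P(x|SP) as a sum of per-locus log-probabilities."""
    # Characters outside ACGT (e.g. gaps) are skipped, another assumption.
    return sum(log(col[c]) for c, col in zip(x, pwm) if c in col)


def classify(x, models):
    """Return the species whose PWM gives x the highest likelihood."""
    return max(models, key=lambda sp: log_likelihood(x, models[sp]))


# Toy training data (made-up sequences).
training = {
    "sp_A": ["ACGT", "ACGT", "ACGA"],
    "sp_B": ["TTGT", "TTGA", "TTGT"],
}
models = {sp: train_pwm(seqs) for sp, seqs in training.items()}
print(classify("ACGT", models), classify("TTGA", models))
```

Summing logs instead of multiplying probabilities avoids numerical underflow on barcodes that are hundreds of bases long.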

Inhomogeneous Markov Chain (IMC)
- Takes into account dependencies between consecutive loci
[Figure: chain with a start state followed by one state per nucleotide (A, C, T, G) at each locus: locus 1, locus 2, locus 3, locus 4, …]
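
One way to realize the chain is with a separate transition table per locus, so that each base is conditioned on the base at the previous locus. The sketch below assumes aligned barcodes containing only A, C, G, T and uses pseudocounts; both are assumptions, as are the names `train_imc` and `imc_log_likelihood`.

```python
from math import log

BASES = "ACGT"


def train_imc(barcodes, pseudocount=1.0):
    """Return (start, transitions) estimated from one species' barcodes.

    start[b]                 ~ P(x_1 = b)
    transitions[i-1][p][c]   ~ P(base c at position i | base p at position i-1),
                               with positions counted from 0.
    """
    start_counts = {b: pseudocount for b in BASES}
    for seq in barcodes:
        start_counts[seq[0]] += 1
    total = sum(start_counts.values())
    start = {b: start_counts[b] / total for b in BASES}

    transitions = []
    for i in range(1, len(barcodes[0])):
        counts = {p: {c: pseudocount for c in BASES} for p in BASES}
        for seq in barcodes:
            counts[seq[i - 1]][seq[i]] += 1
        table = {}
        for p in BASES:
            row_total = sum(counts[p].values())
            table[p] = {c: counts[p][c] / row_total for c in BASES}
        transitions.append(table)
    return start, transitions


def imc_log_likelihood(x, model):
    """log P(x|SP) under the locus-specific first-order chain."""
    start, transitions = model
    ll = log(start[x[0]])
    for i in range(1, len(x)):
        ll += log(transitions[i - 1][x[i - 1]][x[i]])
    return ll


# Both toy species have identical per-locus base frequencies, so a PWM
# could not separate them; only the pairing of consecutive bases differs.
model_a = train_imc(["AA", "CC"])
model_b = train_imc(["AC", "CA"])
print(imc_log_likelihood("AA", model_a) > imc_log_likelihood("AA", model_b))
```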

Hidden Markov Model (HMM)
- Same structure as the IMC
  - Each state emits the associated DNA base with high probability, but can also emit the other bases with probability equal to the mutation rate
- Barcode x is generated along path p with probability equal to the product of emissions and transitions along p
- P(x|HMM) = sum of probabilities over all paths
  - Efficiently computed by the forward algorithm
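
A sketch of the forward algorithm on this structure, with one hidden state per base at each locus. Two points are assumptions rather than statements from the slides: each state emits its own base with probability 1 - 3*mu and each of the other three bases with probability mu, and the model parameters (`START`, `TRANS`) are hand-picked toy values. The name `forward_likelihood` is hypothetical.

```python
BASES = "ACGT"


def forward_likelihood(x, start, transitions, mu=0.01):
    """P(x|HMM): sum over all state paths, computed in O(n * 16) time.

    start[s]                 : probability of starting in state s
    transitions[i-1][p][c]   : probability of moving from state p at
                               position i-1 to state c at position i
    mu                       : per-base emission error (assumed reading
                               of "mutation rate")
    """
    def emit(state, symbol):
        return 1 - 3 * mu if state == symbol else mu

    # f[s] = probability of emitting x[0..i] and being in state s at i.
    f = {s: start[s] * emit(s, x[0]) for s in BASES}
    for i in range(1, len(x)):
        table = transitions[i - 1]
        f = {
            cur: emit(cur, x[i]) * sum(f[prev] * table[prev][cur] for prev in BASES)
            for cur in BASES
        }
    # Plain probabilities are fine for short toy inputs; full-length
    # barcodes would need log-space or scaled forward variables.
    return sum(f.values())


# Toy two-locus model: states tend to start at A and to repeat themselves.
START = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
TRANS = [{p: {c: 0.7 if c == p else 0.1 for c in BASES} for p in BASES}]

print(forward_likelihood("AA", START, TRANS, mu=0.01))
```

With mu = 0 every state emits only its own base, so exactly one path survives and the result reduces to the Markov chain probability of the barcode.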

Probabilistic model-based methods
- HMM not scalable
  - Genus-level identification

Comparison of representative methods

Effect of #barcodes/species
- BOLD species with at least 25 barcodes (270 sp, barcodes)
- Randomly picked 5-20 barcodes from all species
- All remaining barcodes used in testing

Effect of #species
- BOLD species with at least 10 barcodes (690 sp, barcodes)
- Randomly picked 100 to 690 species (10 barcodes per species)
- All remaining barcodes from picked species used in testing

Conclusions & Ongoing work
- Presented an initial comparison of a broad range of species assignment methods
- Ongoing work explores further effects
  - New species detection
  - Barcode length/quality
  - Runtime scalability (up to millions of species)
  - More datasets