Model-based species identification using DNA barcodes Bogdan Paşaniuc CSE Department, University of Connecticut Joint work with Ion Măndoiu and Sotirios Kentros

Presentation transcript:

Model-based species identification using DNA barcodes Bogdan Paşaniuc CSE Department, University of Connecticut Joint work with Ion Măndoiu and Sotirios Kentros

Outline
- Existing approaches to species identification
- Proposed statistical model based methods
- Experimental Results
- Ongoing Work and Conclusions

Background on DNA barcoding
- Recently proposed tool for species identification
- Uses a short DNA region as a "fingerprint" for the species
- Region of choice: cytochrome c oxidase subunit 1 mitochondrial gene ("COI", 648 base pairs long)
- Key assumption: inter-species variability is higher than intra-species variability

Species identification problem
Given:
- Database DB containing barcodes from known species
- New barcode x
Find:
- A high-confidence assignment to a species in the DB
- UNKNOWN, if the confidence is not high enough
Use additional evidence/methods to resolve UNKNOWN assignments and possibly discover new species

Existing approaches and limitations
- Neighbor Joining tree for new + known barcodes [Meyer&Paulay05]
  - One barcode per species
  - Runtime does not scale well with #species (quadratic or worse)
- Likelihood ratio test for species membership using MCMC [Matz&Nielsen06]
  - Impractical runtime even for moderate #species
- Distance-based [BOLD-IDS, TaxI (Steinke et al. 05)]
  - Unclear statistical significance

BOLD
- BOLD: The Barcode of Life Data Systems [Ratnasingham&Hebert07]
  - Currently: 28,129 species, 251,429 barcodes
- Identification System: BOLD-IDS
  - Distance-based (NJ tree for visualization)
  - Employs a threshold (less than 1% divergence) to get a tight match to a barcode in the DB

BOLD-IDS [Ekrem et al.07]: “ … identifications by the BOLD facility must be cautiously evaluated as the system at present may return high probabilities of placements that obviously are erroneous”

Outline
- Existing approaches to species identification
- Proposed statistical model based methods
- Experimental Results
- Ongoing Work and Conclusions

Bayesian approach to species identification
- Assign barcode x = x_1 x_2 x_3 … x_n to the species SP_i that maximizes P(SP_i|x) over all species SP_i
- P(SP_i|x) is computed using Bayes' theorem: P(SP|x) = P(x|SP)*P(SP)/P(x)
  - Uniform prior P(SP)
  - P(x) constant for fixed x
  - Need model for P(x|SP)
- We explored three scalable models: position weight matrices, Markov chains, hidden Markov models
  - Similar to models used successfully in other sequence analysis problems such as DNA motif finding and protein families
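With a uniform prior and P(x) fixed, the posterior is proportional to the likelihood, so the assignment reduces to picking the species with the largest P(x|SP). The following is a minimal Python sketch of that step, not the authors' code; the function names and the toy likelihood values are hypothetical.

```python
import math

def posterior_from_log_likelihoods(log_lik):
    """Turn log P(x|SP) per species into P(SP|x), assuming a uniform prior."""
    m = max(log_lik.values())  # subtract the max for numerical stability
    weights = {sp: math.exp(ll - m) for sp, ll in log_lik.items()}
    total = sum(weights.values())
    return {sp: w / total for sp, w in weights.items()}

def assign_species(log_lik):
    """Return the species with the highest posterior and that posterior."""
    post = posterior_from_log_likelihoods(log_lik)
    best = max(post, key=post.get)
    return best, post[best]

# Toy likelihoods (made up for illustration only).
example = {
    "sp_A": math.log(0.02),
    "sp_B": math.log(0.06),
    "sp_C": math.log(0.02),
}
best, p = assign_species(example)
```

Working with log-likelihoods avoids underflow for barcodes several hundred bases long.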

Positional weight matrix (PWM)
- Assumption: independence of loci
  - P(x|SP) = P(x_1|SP)*P(x_2|SP)*…*P(x_n|SP)
- For each locus, P(x_i|SP) is estimated as the probability of seeing each nucleotide at that locus in DB sequences from species SP
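A PWM of this kind can be estimated by counting nucleotides per column. The sketch below assumes the barcodes of one species are already aligned, have equal length, and contain only A, C, G, T. The pseudocount is an added assumption (the slide only says the probabilities are estimated from the DB sequences), and all names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def train_pwm(aligned_seqs, pseudocount=1.0):
    """Estimate per-locus nucleotide probabilities for one species."""
    length = len(aligned_seqs[0])
    total = len(aligned_seqs) + pseudocount * len(ALPHABET)
    pwm = []
    for i in range(length):
        counts = Counter(seq[i] for seq in aligned_seqs)
        pwm.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return pwm

def pwm_log_likelihood(pwm, x):
    """log P(x|SP) under the locus-independence assumption."""
    return sum(math.log(col[b]) for col, b in zip(pwm, x))

pwm = train_pwm(["ACGT", "ACGA", "ACGT"])
ll_match = pwm_log_likelihood(pwm, "ACGT")
ll_mismatch = pwm_log_likelihood(pwm, "TGCA")
```

The pseudocount keeps unseen nucleotides from producing a zero probability, which would otherwise make the log-likelihood undefined.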

Inhomogeneous Markov Chain (IMC)
- Takes into account dependencies between consecutive loci
[Figure: a start state followed by states A, C, T, G at each of locus 1, locus 2, locus 3, locus 4, …]
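One way to realize such a chain is to keep a separate transition table for every pair of consecutive loci. The sketch below makes that concrete under stated assumptions: aligned, equal-length barcodes over A, C, G, T, and add-one smoothing, which the slides do not mention. Function names are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_imc(aligned_seqs, pseudocount=1.0):
    """Estimate a start distribution and one transition table per locus."""
    n = len(aligned_seqs[0])
    k = len(ALPHABET)
    start = defaultdict(float)
    trans = [defaultdict(float) for _ in range(n - 1)]
    for seq in aligned_seqs:
        start[seq[0]] += 1
        for i in range(1, n):
            trans[i - 1][(seq[i - 1], seq[i])] += 1
    total = len(aligned_seqs) + pseudocount * k
    start_p = {a: (start[a] + pseudocount) / total for a in ALPHABET}
    trans_p = []
    for t in trans:
        table = {}
        for a in ALPHABET:
            row_total = sum(t[(a, b)] for b in ALPHABET) + pseudocount * k
            for b in ALPHABET:
                table[(a, b)] = (t[(a, b)] + pseudocount) / row_total
        trans_p.append(table)
    return start_p, trans_p

def imc_log_likelihood(model, x):
    """log P(x|SP): start probability times locus-specific transitions."""
    start_p, trans_p = model
    ll = math.log(start_p[x[0]])
    for i in range(1, len(x)):
        ll += math.log(trans_p[i - 1][(x[i - 1], x[i])])
    return ll

model = train_imc(["ACGT", "ACGT", "ACTT"])
ll_common = imc_log_likelihood(model, "ACGT")
ll_rare = imc_log_likelihood(model, "ACTT")
```

Unlike the PWM, the probability of a base here depends on the base observed at the previous locus.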

Hidden Markov Model (HMM)
- Same structure as the IMC
  - Each state emits the associated DNA base with high probability, but can also emit the other bases with probability equal to the mutation rate
- Barcode x is generated along path p with probability equal to the product of emissions & transitions along p
- P(x|HMM) = sum of probabilities over all paths
  - Efficiently computed by the forward algorithm
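The forward algorithm sums over all paths one locus at a time. The sketch below is illustrative only: it assumes one state per base per locus, and it assumes the mutation rate `mu` is split evenly over the three non-matching bases, a choice the slide leaves open. Names and the uniform toy model are hypothetical.

```python
ALPHABET = "ACGT"

def emission(state_base, observed, mu):
    """Emit the state's own base with prob 1 - mu, any other with mu / 3."""
    return 1.0 - mu if state_base == observed else mu / 3.0

def forward_probability(start_p, trans_p, x, mu=0.01):
    """P(x|HMM) summed over all state paths (forward algorithm)."""
    # f[a] = P(x_1..x_i and the state at locus i is base a)
    f = {a: start_p[a] * emission(a, x[0], mu) for a in ALPHABET}
    for i in range(1, len(x)):
        f = {
            b: emission(b, x[i], mu)
            * sum(f[a] * trans_p[i - 1][(a, b)] for a in ALPHABET)
            for b in ALPHABET
        }
    return sum(f.values())

# Toy model: uniform start and transitions over four loci.
uniform_start = {a: 0.25 for a in ALPHABET}
uniform_trans = [
    {(a, b): 0.25 for a in ALPHABET for b in ALPHABET} for _ in range(3)
]
p = forward_probability(uniform_start, uniform_trans, "ACGT")
```

For full-length barcodes the raw product underflows in floating point, so a practical version would rescale `f` at each locus or work in log space.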

Accuracy on BOLD dataset
- 37 species with at least 100 barcodes from BOLD
  - 10-50% of barcodes removed and used for testing
- IMC yields better accuracy in all cases

% removed   10%      20%      30%      40%      50%
PWM         90.08%   90.01%   90.02%   89.68%   89.69%
IMC         99.97%   99.93%   99.90%   99.91%   99.89%
HMM         99.57%   …        99.66%   99.70%   99.76%

Score normalization
- DB barcodes have non-uniform lengths and cover different regions of the COI gene
  - Membership probabilities not always comparable
- Normalization scheme:
  - Species models constructed only over positions covered in DB
  - Scores normalized using background IMC constructed from all sequences in DB
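One plausible reading of this scheme is a log-odds score: the species model's log-probability minus the background model's, summed only over covered positions. The slides do not give the exact formula, so the sketch below is an assumption throughout, with hypothetical names and toy numbers.

```python
import math

def normalized_score(species_terms, background_terms, covered):
    """Sum of per-position log-odds over positions covered in the DB.

    species_terms[i] and background_terms[i] are the per-position log
    probabilities contributed by the species and background models.
    """
    return sum(
        s - b
        for s, b, c in zip(species_terms, background_terms, covered)
        if c
    )

# Toy per-position values (made up for illustration).
species_terms = [math.log(0.9), math.log(0.8), math.log(0.7)]
background_terms = [math.log(0.25)] * 3
covered = [True, False, True]  # the middle position is not covered
score = normalized_score(species_terms, background_terms, covered)
```

Skipping uncovered positions keeps scores comparable between species whose reference barcodes span different regions.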

Computing the confidence of assignment
- x assigned to species SP with score s
- p-value: probability that a barcode generated under background model Ḿ has a score s' ≥ s
- Methods for p-value estimation:
  - Random sampling
    - Generate random sequences and count how many exceed the score
  - Exact computation (for PWMs):
    - Dynamic programming [Rahmann03]
    - Branch and bound [Zhang et al. 07]
    - Shifted FFTs [Nagarajan et al. 05]
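The random-sampling estimate can be sketched as follows. The scoring function and the background sampler are passed in as callables, and both are stand-ins here: the toy score simply counts A's. The add-one correction, which avoids reporting a p-value of exactly zero, is an added assumption and not something the slides state.

```python
import random

def sampled_p_value(score_fn, sample_fn, observed_score,
                    n_samples=10000, seed=0):
    """Estimate P(score >= observed_score) under the background model."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(n_samples)
        if score_fn(sample_fn(rng)) >= observed_score
    )
    return (hits + 1) / (n_samples + 1)

def random_barcode(rng, length=4):
    """Hypothetical background: independent uniform bases."""
    return "".join(rng.choice("ACGT") for _ in range(length))

def count_a(seq):
    """Hypothetical score: number of A's in the sequence."""
    return seq.count("A")

p_all = sampled_p_value(count_a, random_barcode, observed_score=0,
                        n_samples=1000)
p_none = sampled_p_value(count_a, random_barcode, observed_score=5,
                         n_samples=1000)
```

Sampling cannot resolve p-values much smaller than 1/n_samples, which is one reason an exact computation is useful.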

Exact computation for PWMs [Rahmann03]
- Computes the entire score distribution
- Scores are rounded by a granularity factor
- The score is a sum of n independent variables (the score contribution of each position)
  - The probability of a random sequence of length i having a given score is computed from the contribution of the first i-1 positions and the current position
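The dynamic program described here amounts to convolving per-position score distributions. Below is a minimal sketch of that idea, not a reproduction of the cited method's details. It assumes a position-independent background distribution, and the names and toy scores are hypothetical.

```python
from collections import defaultdict

def pwm_score_distribution(score_cols, background, granularity=0.01):
    """Exact distribution of the rounded PWM score of a random sequence.

    score_cols[i][b] is the score of base b at position i and
    background[b] is the background probability of base b.
    """
    dist = {0: 1.0}  # empty prefix has score 0 with probability 1
    for col in score_cols:
        nxt = defaultdict(float)
        for s, p in dist.items():
            for b, pb in background.items():
                nxt[s + round(col[b] / granularity)] += p * pb
        dist = dict(nxt)
    return dist

def p_value(dist, observed_score, granularity=0.01):
    """P(score >= observed_score) read off the exact distribution."""
    t = round(observed_score / granularity)
    return sum(p for s, p in dist.items() if s >= t)

# Toy scores: 1 for A and 0 otherwise, at two positions.
toy_cols = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}] * 2
uniform_bg = {b: 0.25 for b in "ACGT"}
dist = pwm_score_distribution(toy_cols, uniform_bg, granularity=1.0)
pv = p_value(dist, 1.0, granularity=1.0)
```

A coarser granularity shrinks the table of distinct scores at the cost of rounding error in the p-value.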

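Exact computation for IMCs
- Define, for each length i, score, and last letter, the probability that a random sequence of length i has that score and ends in that letter
- Basic recurrence: [formula not preserved in the transcript]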

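IMC exact p-value computation
- Initial values and the probability of a random barcode having a given score: [formulas not preserved in the transcript]
- Runtime: [bound not preserved in the transcript], where R is the difference between the max and min score for any i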
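Since the formulas on these slides did not survive, the following is only a guess at the kind of recurrence involved, built from the stated definition (probability of a given score and last letter at length i). It assumes scores are already rounded to integers and that the background is itself a Markov chain. Every name and the toy model are hypothetical.

```python
from collections import defaultdict

def imc_score_distribution(start_scores, trans_scores, bg_start, bg_trans):
    """Exact score distribution of a random sequence under a background chain.

    start_scores[a]: integer score of starting with letter a.
    trans_scores[i][(a, b)]: integer score of stepping from a to b.
    bg_start[a], bg_trans[i][(a, b)]: background probabilities.
    """
    # q[(s, a)] = P(random prefix has score s and ends in letter a)
    q = defaultdict(float)
    for a, pa in bg_start.items():
        q[(start_scores[a], a)] += pa
    for w, bg in zip(trans_scores, bg_trans):
        nxt = defaultdict(float)
        for (s, a), p in q.items():
            for b in bg_start:
                nxt[(s + w[(a, b)], b)] += p * bg[(a, b)]
        q = nxt
    dist = defaultdict(float)
    for (s, _), p in q.items():
        dist[s] += p  # marginalize over the last letter
    return dict(dist)

# Toy model of length 2: score 1 for starting with A, 1 for ending in C.
letters = "ACGT"
toy_start_scores = {a: int(a == "A") for a in letters}
toy_trans_scores = [{(a, b): int(b == "C") for a in letters for b in letters}]
toy_bg_start = {a: 0.25 for a in letters}
toy_bg_trans = [{(a, b): 0.25 for a in letters for b in letters}]
imc_dist = imc_score_distribution(
    toy_start_scores, toy_trans_scores, toy_bg_start, toy_bg_trans
)
```

Tracking the last letter is what distinguishes this from the PWM case, because each step's score and probability depend on the previous base.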

Outline
- Existing approaches to species identification
- Proposed statistical model based methods
- Experimental Results
- Ongoing Work and Conclusions

Experimental setup (1)
Compared methods
- IMC
  - Species with highest score
  - If score < species-specific threshold → UNKNOWN
- Distance-based (BOLD-IDS like)
  - Species containing the barcode showing the least divergence
  - If divergence > threshold (default 1%) → UNKNOWN
Basic questions
- What is the effect of training set size (#barcodes per species) on accuracy?
- What is the effect of the #species on accuracy?
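The distance-based baseline can be sketched as a nearest-barcode search with a divergence cutoff. The version below uses the plain fraction of mismatching sites as the divergence, which is a simplification chosen for illustration and may differ from the distance BOLD-IDS actually uses. It assumes aligned, equal-length barcodes, and all names are hypothetical.

```python
def p_distance(a, b):
    """Fraction of mismatching sites between two aligned barcodes."""
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs / len(a)

def nearest_species(db, query, threshold=0.01):
    """Species of the closest DB barcode, or UNKNOWN above the threshold."""
    best_sp, best_d = None, float("inf")
    for species, barcodes in db.items():
        for bc in barcodes:
            d = p_distance(bc, query)
            if d < best_d:
                best_sp, best_d = species, d
    return best_sp if best_d <= threshold else "UNKNOWN"

# Toy database (made-up sequences).
toy_db = {"sp_A": ["ACGTACGTAC"], "sp_B": ["TTTTACGTAC"]}
exact_hit = nearest_species(toy_db, "ACGTACGTAC")
too_far = nearest_species(toy_db, "ACGTACGTAA")
relaxed = nearest_species(toy_db, "ACGTACGTAA", threshold=0.1)
```

Every query is compared against every DB barcode, so runtime grows with the total number of barcodes in the DB.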

Experimental setup (2)
Two scenarios:
- Complete DB: all new barcodes belong to species in DB
- Incomplete DB: some new barcodes belong to species not in DB

Accuracy measures
True positive rate = TP/(TP+FP)
- Barcodes belonging to species present in the DB
  - TP = #barcodes assigned to correct species
  - FP = #barcodes assigned to incorrect species
- Barcodes belonging to species not present in DB
  - TP = #barcodes assigned to UNKNOWN
  - FP = #barcodes assigned to species in the DB
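These definitions can be turned into a small counting routine. In the sketch below, a barcode from a DB species that is assigned UNKNOWN counts as neither TP nor FP; that is one reading of the definitions above, not something the slide states explicitly. Names and the toy records are hypothetical.

```python
def tp_rate(records, db_species, unknown="UNKNOWN"):
    """TP / (TP + FP) over (true_species, assigned_label) pairs."""
    tp = fp = 0
    for true_sp, assigned in records:
        if true_sp in db_species:
            if assigned == unknown:
                continue  # assumed: counted in neither TP nor FP
            if assigned == true_sp:
                tp += 1
            else:
                fp += 1
        else:
            # Species absent from the DB: UNKNOWN is the correct answer.
            if assigned == unknown:
                tp += 1
            else:
                fp += 1
    return tp / (tp + fp) if tp + fp else float("nan")

toy_records = [
    ("a", "a"),        # correct species
    ("a", "b"),        # incorrect species
    ("a", "UNKNOWN"),  # skipped under the assumption above
    ("z", "UNKNOWN"),  # species not in DB, correctly left unknown
    ("z", "a"),        # species not in DB, wrongly assigned
]
rate = tp_rate(toy_records, db_species={"a", "b"})
```

Passing only the in-DB or only the out-of-DB records gives the rate for each group separately.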

Effect of #barcodes/species
- Datasets containing all BOLD species with at least 5/25 barcodes
  - BOLD5: 1508 species, … barcodes
  - BOLD25: 270 species, … barcodes
- DB composed of randomly picked 5-20 barcodes from all species in BOLD25
- Test barcodes
  - Complete database scenario: all remaining barcodes from BOLD25
  - Incomplete database scenario: all barcodes from BOLD5 not in DB

Effect of #barcodes/species, complete DB

Effect of #barcodes/species, incomplete DB

Effect of #species
- Datasets containing all BOLD species with at least 5/10 barcodes
  - BOLD5: 1508 species, … barcodes
  - BOLD10: 690 species, … barcodes
- DB composed of randomly picked 100 to 690 species from BOLD10
  - 10 barcodes per species
- Test barcodes
  - Complete database scenario: all remaining barcodes from picked species
  - Incomplete database scenario: all barcodes from BOLD5 not in DB

Effect of #species, complete DB

Effect of #species, incomplete DB

Outline
- Existing approaches to species identification
- Proposed statistical model based methods
- Experimental Results
- Ongoing Work and Conclusions

Conclusions & Ongoing work
- IMC provides a scalable method for species identification
  - High accuracy, with useful tradeoff between TP rate and unknown rate
  - Efficiently computable p-values
- Comprehensive comparison of identification algorithms to be submitted to the 2nd International Barcode Conference
  - Broad coverage of methods: tree-based, distance-based, character-based, model-based
  - Assessment of further effects besides #species and #barcodes/species: barcode length, barcode quality, number of regions
  - Runtime scalability (up to millions of species)
  - Diverse datasets (BOLD, cowries, flu viruses, simulated data, etc.)