DNA Barcode Data Analysis: Boosting Assignment Accuracy by Combining Distance- and Character-Based Classifiers Bogdan Paşaniuc, Sotirios Kentros and Ion.

Presentation transcript:

DNA Barcode Data Analysis: Boosting Assignment Accuracy by Combining Distance- and Character-Based Classifiers Bogdan Paşaniuc, Sotirios Kentros and Ion Mândoiu Computer Science & Engineering Department, University of Connecticut

2 Outline
- Motivation & Problem Definition
- Methods used
  - Hamming distance (MIN-HD and AVG-HD)
  - Amino acid similarity (MAX-AA-SIM and AVG-AA-SIM)
  - Convex-score similarity (MAX-CS-SIM)
  - Trinucleotide frequency (MIN-3FREQ)
  - Positional weight matrix (MAX-PWM)
  - Character-based pairwise species discrimination (k-BEST)
- Combining the Methods
- Results
  - Species Classification
  - New Species Recognition
- Future Work & Conclusions

3 Motivation
- “DNA barcoding” was proposed as a tool for differentiating species.
- Goal: to make a “fingerprint” for each species, using a short sequence of DNA.
- Assumption: mitochondrial DNA evolves at a higher rate than nuclear DNA.
- Mitochondrial DNA: high interspecies variability while retaining low intraspecies sequence variability.
- The chosen region is the mitochondrial cytochrome c oxidase subunit 1 gene ("COI", 648 base pairs long).

4 Problem definition
- The goal of our project was to explore whether combining simple classification methods can increase classification accuracy.
- We address two problems:
  - Classification of barcodes given a training set of species.
  - Identification of barcodes that belong to new species.
- Assumption: all barcode DNA sequences are aligned.

5 Problem definition (1)
Species Differentiation:
- INPUT: a set S of barcodes for which the species is known, and a new barcode x
- OUTPUT: the species of x, given that S contains barcodes of the same species as x

6 Problem definition (2)
Species Differentiation & New Species Detection:
- INPUT: a set S of barcodes for which the species is known, and a new barcode x
- OUTPUT: the species of x if S contains at least one barcode of the same species; otherwise, report that x belongs to a new species

7 Methods
Find a “distance” between barcodes that is “able to distinguish between species”:
1. Low intraspecies variability
2. High interspecies variability
Candidate measures:
- Hamming distance
- Amino acid similarity
- Convex-score similarity
- Trinucleotide frequency: closer barcodes tend to have similar trinucleotide frequencies
- Positional weight matrix: compute the probability that barcode x belongs to a given species
- Character-based pairwise species discrimination: find the k most informative characters that are able to distinguish between two species

8 Methods
The new barcode x is compared against each species S1, S2, …, Sn, giving distances d(x,S1), d(x,S2), …, d(x,Sn). Two ways of defining the distance to a species:
1. Minimum “Method” Classifier: d(x,Si) = Minimum{ d(x,y) | sequence y belongs to species Si }
2. Average “Method” Classifier: d(x,Si) = Average{ d(x,y) | sequence y belongs to species Si }

9 Hamming Distance
Percentage of base pairs at which two barcodes differ.
- Average: given barcode x, find the species S that minimizes the average Hamming distance from x to y over all y in S; then species(x) = S.
- Minimum: given barcode x, find the barcode y that minimizes the Hamming distance from x to y; then species(x) = species(y).
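The two Hamming-distance classifiers can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `hamming_fraction` and `classify`, the `(barcode, species)` training format, and the toy sequences are all assumptions made for the example.

```python
from collections import defaultdict

def hamming_fraction(x, y):
    """Fraction of aligned positions at which barcodes x and y differ."""
    if len(x) != len(y):
        raise ValueError("barcodes must be aligned to the same length")
    return sum(a != b for a, b in zip(x, y)) / len(x)

def mean(values):
    return sum(values) / len(values)

def classify(x, training, aggregate):
    """Assign x to the species with the smallest aggregated distance.

    training  -- list of (barcode, species) pairs (hypothetical format)
    aggregate -- min for MIN-HD, mean for AVG-HD
    """
    distances = defaultdict(list)
    for y, species in training:
        distances[species].append(hamming_fraction(x, y))
    return min(distances, key=lambda sp: aggregate(distances[sp]))

# Toy data: sp1 contains an exact match plus an outlier, sp2 is uniformly close.
training = [
    ("ACGTACGT", "sp1"), ("TTTTTTTT", "sp1"),
    ("ACGAACGA", "sp2"), ("ACGAACGT", "sp2"),
]
x = "ACGTACGT"
print(classify(x, training, min))   # MIN-HD decision
print(classify(x, training, mean))  # AVG-HD decision
```

On this toy input the two rules disagree: the minimum rule follows the single closest barcode, while the average rule is pulled away by the outlier in sp1.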

10 Amino Acid Similarity
- Genetic code: rules that map DNA sequences to proteins
  - Codon: trinucleotide unit that encodes one amino acid
  - Divide the DNA sequence into codons and substitute each one by its corresponding amino acid
- BLOSUM62 (BLOcks SUbstitution Matrix)
  - 20x20 matrix that gives a score for each pair of amino acids, derived from substitutions observed in conserved protein blocks
  - The higher the score, the more likely the substitution causes no functional change in the protein

11 Amino Acid Similarity
Measures how similar the two amino acid sequences encoded by the barcodes are.
- Similarity(x,y):
  - Barcodes x, y -> amino acid sequences x’, y’ (using the genetic code)
  - Score of the amino acid alignment of x’ and y’ using BLOSUM62
- Average: find the species with maximum average similarity
- Maximum: find the barcode with maximum similarity

12 Convex-score Similarity
- “Long runs of consecutive base-pair matches” indicate that the encoded amino acid sequence plays an important role -> the two barcodes are “close” in evolutionary distance
- The longer the run of base-pair matches, the higher the score
- The contribution of a run increases convexly with its length
- The new sequence is assigned to the species containing the highest-scoring sequence
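A small sketch may help make the convex-score idea concrete. The slide does not state which convex function is used, so the power function with `exponent=2` below is purely an assumption, as are the names `match_runs` and `convex_score`.

```python
def match_runs(x, y):
    """Lengths of the maximal runs of consecutive matching positions."""
    runs, current = [], 0
    for a, b in zip(x, y):
        if a == b:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

def convex_score(x, y, exponent=2):
    """Sum of run_length ** exponent over all match runs.

    Any exponent > 1 makes the contribution of a run convex in its length;
    the value 2 is an illustrative choice, not taken from the slides.
    """
    return sum(r ** exponent for r in match_runs(x, y))

x = "AAAAAAAA"
print(convex_score(x, "AAAATAAA"))  # one mismatch splitting the matches into runs of 4 and 3
print(convex_score(x, "AAAAAAAT"))  # one mismatch at the end, a single run of 7
```

Both comparison sequences are at Hamming distance 1 from `x`, yet the one that preserves a single long run scores higher, which is the behavior the slide describes.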

13 Trinucleotide Distance
- For each species, compute the vector of trinucleotide frequencies
- For the new sequence x, compute the vector of trinucleotide frequencies
- Assign x to the closest species; the distance between two frequency vectors is the mean square distance, and the species at minimum distance is chosen
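The trinucleotide-frequency classifier might look roughly as follows. The slide does not say whether trinucleotides are counted in overlapping windows or codon by codon; this sketch assumes overlapping windows and skips windows containing characters other than A, C, G, T. All function names are hypothetical.

```python
from itertools import product

# All 64 trinucleotides in a fixed order, so vectors are comparable.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]

def trinucleotide_freqs(seqs):
    """Frequency vector over the 64 trinucleotides (overlapping windows, assumed)."""
    counts = dict.fromkeys(TRIPLETS, 0)
    total = 0
    for s in seqs:
        for i in range(len(s) - 2):
            t = s[i:i + 3]
            if t in counts:  # ignore gaps and ambiguity codes
                counts[t] += 1
                total += 1
    return [counts[t] / total if total else 0.0 for t in TRIPLETS]

def mean_square_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def classify_3freq(x, species_to_seqs):
    """Assign x to the species whose frequency vector is closest to that of x."""
    fx = trinucleotide_freqs([x])
    return min(
        species_to_seqs,
        key=lambda sp: mean_square_distance(fx, trinucleotide_freqs(species_to_seqs[sp])),
    )

species_to_seqs = {"sp1": ["ACGACGACG"], "sp2": ["TTTTTTTTT"]}
print(classify_3freq("ACGACGAAA", species_to_seqs))
```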

14 Positional weight matrix
- For each species we compute a positional weight matrix (PWM)
- For each locus, the PWM gives the probability of seeing each nucleotide at that locus in that species
- We assume independence of loci
- For a barcode x, the probability that x belongs to species S is the product, over all loci, of the probability of observing the nucleotide that x has at that locus
- Assign x to the species that gives the highest probability
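As a rough sketch of the PWM classifier: the code below estimates per-locus nucleotide probabilities and multiplies them under the independence assumption. Two details are assumptions added for the example and are not stated on the slide: a pseudocount of 1 (so that an unseen nucleotide does not force the probability to zero) and the use of log-probabilities (to avoid underflow on barcodes hundreds of base pairs long).

```python
import math

def build_pwm(seqs, pseudocount=1.0, alphabet="ACGT"):
    """One dict per locus mapping nucleotide -> estimated probability."""
    pwm = []
    for i in range(len(seqs[0])):
        counts = {n: pseudocount for n in alphabet}  # pseudocount is an assumption
        for s in seqs:
            if s[i] in counts:
                counts[s[i]] += 1
        total = sum(counts.values())
        pwm.append({n: c / total for n, c in counts.items()})
    return pwm

def log_prob(x, pwm):
    """Log of the product of per-locus probabilities (loci assumed independent)."""
    return sum(math.log(column[n]) for n, column in zip(x, pwm) if n in column)

def classify_pwm(x, species_to_seqs):
    """Assign x to the species whose PWM gives x the highest probability."""
    return max(
        species_to_seqs,
        key=lambda sp: log_prob(x, build_pwm(species_to_seqs[sp])),
    )

species_to_seqs = {"sp1": ["ACGT", "ACGT"], "sp2": ["TTTT", "TTGT"]}
print(classify_pwm("ACGA", species_to_seqs))
```

In practice the matrices would be built once per species rather than on every query; they are rebuilt here only to keep the sketch short.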

15 Character-based pairwise species discrimination
- Given species S1, S2 and a new barcode x, we find the k most discriminating characters
  - A locus -> a character
  - Nucleotides -> the possible values of the character
- Idea: if at a given locus there is a nucleotide that appears in S1 and not in S2, and x contains that nucleotide at that locus, then x is more likely to belong to S1 than to S2

16 Character-based pairwise species discrimination
Finding the k most discriminative characters: the discriminative power w(i) of character i is defined in terms of
- Cnt(i,X,S1): the number of times nucleotide X is seen at position i in species S1
- Size(S1): the number of barcodes in species S1

17 Character-based pairwise species discrimination
Example (figure): alignment column i with nucleotides A, C, T, G across the barcodes of two species (red, blue); w(i) = 1
- The two species are discriminated by character i with 100% accuracy
- The nucleotide present at position i in the new barcode x safely tells us to which species x is more likely to belong
- i is a “pure” character

18 Character-based pairwise species discrimination
Example (figure): alignment column i with nucleotides A, C, A, T, G across the barcodes of two species (red, blue); w(i) = 0.9
- The two species are discriminated by character i with 90% accuracy
- If the new barcode x has a C, T or G at i, we guess the species of x correctly
- If the new barcode x has an A at i, we choose the species containing the highest number of A’s at i (the red species)

19 Character-based pairwise species discrimination
1. Given species S1, S2 and a new barcode x, we find the k most discriminating characters
2. We count how many of these characters favor S1 over S2 and output the more favored species
3. We repeat steps 1 and 2 for all pairs of species and the new barcode x
4. The species S that is favored most often across all these pairwise discriminations is assigned to barcode x
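The four steps above can be sketched as follows. The transcript does not preserve the formula for the discriminative power w(i), so the `weight` function below is a stand-in chosen for illustration only: half the sum of absolute differences between the per-species nucleotide frequencies Cnt(i,X,S)/Size(S), which equals 1 for a “pure” character. The vote rule inside `pair_winner` (the species with the higher frequency of x's nucleotide wins the character) is likewise an assumption, and all names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def freq(seqs, i, n):
    """Cnt(i, n, S) / Size(S): fraction of barcodes in S with nucleotide n at locus i."""
    return sum(s[i] == n for s in seqs) / len(seqs)

def weight(i, s1, s2):
    """ASSUMED discriminative power of locus i (the original formula is not available)."""
    return 0.5 * sum(abs(freq(s1, i, n) - freq(s2, i, n)) for n in "ACGT")

def pair_winner(x, name1, s1, name2, s2, k):
    """Steps 1-2: pick the k best loci, let each vote, return the favored species."""
    best = sorted(range(len(x)), key=lambda i: weight(i, s1, s2), reverse=True)[:k]
    votes = Counter()
    for i in best:
        f1, f2 = freq(s1, i, x[i]), freq(s2, i, x[i])
        if f1 > f2:
            votes[name1] += 1
        elif f2 > f1:
            votes[name2] += 1
    if votes[name1] == votes[name2]:
        return None  # no preference between the two species
    return name1 if votes[name1] > votes[name2] else name2

def classify_kbest(x, species, k=2):
    """Steps 3-4: run all pairwise contests and return the most favored species."""
    wins = Counter()
    for a, b in combinations(species, 2):
        winner = pair_winner(x, a, species[a], b, species[b], k)
        if winner is not None:
            wins[winner] += 1
    return wins.most_common(1)[0][0] if wins else None

species = {
    "sp1": ["AAAA", "AAAC"],
    "sp2": ["CCAA", "CCAC"],
    "sp3": ["GGAA", "GGAT"],
}
print(classify_kbest("AAAT", species))
```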

20 Combining the Methods
Every classifier outputs the species to which the new barcode is most likely to belong.
Simple Voting:
- The species returned by each classifier gets a vote of weight 1
- Output the species with the most votes
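Simple voting is short enough to show directly. The slide does not say how ties are broken; in this sketch a tie goes to the species that was predicted first in the input list, which is an arbitrary choice made for the example.

```python
from collections import Counter

def simple_vote(predictions):
    """Return the species named by the most classifiers (each vote has weight 1).

    predictions -- list of species labels, one per classifier
    Ties go to the label that appears first (an assumption, not from the slides).
    """
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of five classifiers for one barcode.
print(simple_vote(["sp1", "sp2", "sp1", "sp3", "sp1"]))
```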

21 Datasets (1)
- We used the DAWG train dataset, consisting of 1623 aligned sequences classified into 150 species, with sequences averaging 590 nucleotides in length.
- We randomly deleted from each species 10 to 50 percent of the sequences
  - Deleted sequences -> test set
  - Remaining sequences -> training set
- We made sure that every species keeps at least one sequence in the training set
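The per-species split could be implemented along these lines. This is a sketch under assumptions: the function name `split_species`, the dictionary layout, and rounding the test-set size down are all choices made for the example, not details given on the slide.

```python
import random

def split_species(species_to_seqs, fraction, rng):
    """Move about `fraction` of each species' barcodes into the test set.

    At least one barcode per species always stays in the training set.
    Returns (train, test): train maps species -> barcodes, test is a list
    of (barcode, species) pairs.
    """
    train, test = {}, []
    for species, seqs in species_to_seqs.items():
        shuffled = list(seqs)
        rng.shuffle(shuffled)
        n_test = min(int(fraction * len(seqs)), len(seqs) - 1)
        test.extend((s, species) for s in shuffled[:n_test])
        train[species] = shuffled[n_test:]
    return train, test

# Toy data: one species with ten barcodes, one with a single barcode.
data = {"sp1": [f"seq{i}" for i in range(10)], "sp2": ["only"]}
train, test = split_species(data, 0.5, random.Random(0))
print(len(train["sp1"]), len(train["sp2"]), len(test))
```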

22 Species Recovering Accuracy (in %) (no new species, DAWG train dataset)
Columns: percentage of barcodes removed from each species and used for testing (10%, 20%, 30%, 40%, 50%)
Rows (classifiers): MIN-HD AVG-HD MAX-AA-SIM AVG-AA-SIM MAX-CS-SIM MIN-3FREQ MAX-PWM BEST COMBINED

23 Datasets (2)
- We used the cowries dataset provided at xxx
- We removed the species containing fewer than 4 barcodes
- We randomly deleted from each species 10 to 50 percent of the sequences
  - Deleted sequences -> test set
  - Remaining sequences -> training set
- We made sure that every species keeps at least one sequence in the training set

24 Species Recovering Accuracy (in %) (no new species)
Columns: percentage of barcodes removed from each species and used for testing (10%, 20%, 30%, 40%, 50%)
Rows (classifiers): MIN-HD AVG-HD MAX-AA-SIM AVG-AA-SIM MAX-CS-SIM MIN-3FREQ MAX-PWM BEST COMBINED

25 Datasets (3)
In order to test the accuracy of new species detection and classification, we devised a regular leave-one-out procedure:
- Delete a whole species
- Randomly delete from each remaining species 0 to 50 percent of the sequences
  - Deleted sequences -> test set
  - Remaining sequences -> training set
The following tables give average accuracy results over 150x6 different test cases.

26 Leave-one-out Accuracy (in %), DAWG train dataset
Columns: percentage of additional barcodes removed from each species and used for testing (0%, 10%, 20%, 30%, 40%, 50%)
Rows (classifiers): MIN-HD AVG-HD MAX-AA-SIM AVG-AA-SIM MAX-CS-SIM MIN-3FREQ MAX-PWM BEST COMBINED

27 Leave-one-out Accuracy (in %), Cowries dataset
Columns: percentage of additional barcodes removed from each species and used for testing (0%, 10%, 20%, 30%, 40%, 50%)
Rows (classifiers): MIN-HD AVG-HD MAX-AA-SIM AVG-AA-SIM MAX-CS-SIM MIN-3FREQ MAX-PWM BEST COMBINED

28 Conclusions (1)
- Every method shows a tradeoff between new species detection and classification accuracy
- Hamming distance performs very well when no new species are present, but its accuracy is low for new species detection
- The combined method yields better accuracy on both new species detection and sequence classification
- The runtimes of all methods are within the same order of magnitude

29 Future Work
- New species clustering: determining the different new species present
- Further investigate threshold selection and weighting schemes
- Ignoring parts of the given sequences could possibly improve accuracy. Are there redundant or noisy regions?
- Use independent weighting schemes for new species detection and for classification into known species