DNA Barcode Data Analysis Boosting Accuracy by Combining Simple Classification Methods CSE 377 – Bioinformatics - Spring 2006 Sotirios Kentros Univ. of.

Presentation transcript:

DNA Barcode Data Analysis: Boosting Accuracy by Combining Simple Classification Methods. CSE 377 – Bioinformatics, Spring 2006. Sotirios Kentros, Univ. of Connecticut; Bogdan Paşaniuc

2 Outline
- Motivation
- Problem Definition
- The Methods
  - Hamming Distance and Minimum Hamming Distance
  - Amino Acid Similarity and Minimum Amino Acid Similarity
  - Dinucleotide Distance
  - Trinucleotide Distance
  - Nucleotide Frequency Similarity
- Combining the Methods
- Results
  - Species Classification
  - New Species Recognition
- Conclusion
- Future Work

3 Motivation
- "DNA barcoding" was proposed as a tool for differentiating biological species.
- Goal: to make a "fingerprint" for species, using a short sequence of DNA.
- Assumption: mitochondrial DNA evolves at a higher rate than nuclear DNA.
- Mitochondrial DNA: high interspecies variability while retaining low intraspecies sequence variability.
- The chosen region is the mitochondrial cytochrome c oxidase subunit 1 gene ("COI", 648 base pairs long).

4 Problem definition
- The scope of our project was to explore whether combining simple classification methods can increase classification accuracy.
- We address two problems:
  - Classification of individuals, given a training set of species.
  - Identification of individuals that belong to new species.
- All the sequences are aligned.

5 Problem definition
Species differentiation:
- INPUT: a set S of aligned DNA sequences whose species are known, and a new sequence x.
- OUTPUT: the species of x, given that S contains sequences of the same species as x.

6 Problem definition
Species differentiation and new species determination:
- INPUT: a set S of aligned DNA sequences whose species are known, and a new sequence x.
- OUTPUT: the species of x if S contains at least one sequence of the same species; otherwise, report that x belongs to a new species.

7 Methods Used
- Hamming Distance and Minimum Hamming Distance
- Amino Acid Similarity and Minimum Amino Acid Similarity
- Dinucleotide Distance
- Trinucleotide Distance
- Nucleotide Frequency Similarity

8 Methods
A new sequence x is compared with every species S1, S2, …, Sn, giving the distances d(x,S1), d(x,S2), …, d(x,Sn). The distance to a species is defined in one of two ways:
1. d(x,Si) = Minimum{ d(x,y) | sequence y belongs to species Si }. Notation: Minimum "Method" Classifier
2. d(x,Si) = Average{ d(x,y) | sequence y belongs to species Si }. Notation: "Method" Classifier
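The two aggregation rules can be sketched generically as follows. This is only an illustration: the function name `classify`, the toy distance `mismatches`, and the example sequences are assumptions, not part of the slides.

```python
from statistics import mean
from typing import Callable, Dict, List


def classify(x: str,
             training: Dict[str, List[str]],
             dist: Callable[[str, str], float],
             aggregate: Callable[[List[float]], float]) -> str:
    """Return the species Si with the smallest d(x, Si).

    d(x, Si) is aggregate{ d(x, y) | y in Si }: pass `min` for the
    Minimum "Method" classifier and `mean` for the "Method" classifier.
    """
    scores = {species: aggregate([dist(x, y) for y in seqs])
              for species, seqs in training.items()}
    return min(scores, key=scores.get)


def mismatches(a: str, b: str) -> int:
    """Toy pairwise distance (hypothetical): number of differing positions."""
    return sum(p != q for p, q in zip(a, b))


# Made-up training data on which the two rules disagree.
training = {"S1": ["AAAA", "TTTT"], "S2": ["AAAT", "AATT"]}
by_minimum = classify("AAAA", training, mismatches, min)   # nearest single sequence
by_average = classify("AAAA", training, mismatches, mean)  # nearest species on average
```

For similarity-based methods the same scheme applies with the largest aggregated score chosen instead of the smallest.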

9 Hamming Distance
Average:
- Given a new sequence x, find the species S that minimizes the average Hamming distance from x to the sequences y in S.
- Assign S to x.
Minimum:
- Given a new sequence x, find the sequence y that minimizes the Hamming distance from x to y.
- Assign species(y) to x.
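A minimal sketch of the Hamming distance on aligned sequences, with both the minimum and the average distance to each species computed on made-up data. Treating a gap character like any other mismatching symbol is an assumption; the slides do not say how gaps are counted here.

```python
def hamming(x: str, y: str) -> int:
    """Number of aligned positions at which x and y differ."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(x, y))


# Made-up aligned sequences, for illustration only.
species = {"S1": ["ACGTAC", "ACGTAA"], "S2": ["TCGAAC", "TCGATC"]}
x = "ACGTTC"

minimum = {s: min(hamming(x, y) for y in seqs) for s, seqs in species.items()}
average = {s: sum(hamming(x, y) for y in seqs) / len(seqs)
           for s, seqs in species.items()}

# Both rules pick the species with the smallest value.
assigned_by_minimum = min(minimum, key=minimum.get)
assigned_by_average = min(average, key=average.get)
```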

10 Amino Acid Similarity
Genetic code:
- Rules that map DNA sequences to proteins.
- Codon: trinucleotide unit that encodes one amino acid.
- Divide the DNA sequence into codons and substitute each one with its corresponding amino acid.
BLOSUM62 (BLOcks SUbstitution Matrix):
- 20x20 matrix that gives a score for each pair of amino acids, derived from substitution frequencies observed in conserved blocks of aligned proteins.
- The higher the score, the more likely the substitution causes no functional change in the protein.

11 Amino Acid Similarity
Distance(x,y):
- DNA sequences x, y -> amino acid sequences x', y' (using the codon-to-amino-acid translation).
- Score the alignment of x' and y' with the BLOSUM amino acid substitution matrix.
Average:
- Find the species with the maximum average similarity.
Minimum:
- Find the sequence with the maximum similarity.
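The translate-then-score step might look like the sketch below. Several things here are assumptions: the standard genetic code is used (a mitochondrial code may be more appropriate for COI), the reading frame is taken to start at the first position, and `toy_score` is a hypothetical stand-in for a BLOSUM62 lookup, not the real matrix.

```python
from itertools import product
from typing import Callable

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate codon by codon; gapped or unknown codons become 'X'."""
    usable = len(dna) - len(dna) % 3  # drop an incomplete trailing codon
    return "".join(CODON_TABLE.get(dna[i:i + 3].upper(), "X")
                   for i in range(0, usable, 3))


def toy_score(a: str, b: str) -> int:
    """Hypothetical scoring: NOT BLOSUM62, just a placeholder."""
    if "X" in (a, b):
        return 0
    return 4 if a == b else -1


def similarity(x: str, y: str, score: Callable[[str, str], int]) -> int:
    """Sum of substitution scores over the aligned amino acid sequences."""
    return sum(score(a, b) for a, b in zip(translate(x), translate(y)))
```

In practice `score` would look up the pair in a real BLOSUM62 table; a synonymous change such as GCC to GCA leaves the score unchanged because both codons translate to alanine.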

12 Dinucleotide Distance
- For each species, find the frequency with which each dinucleotide appears.
- Compute the frequency of each dinucleotide in the unclassified sequence.
- Find the species with the minimum mean square distance to the unclassified sequence.
- For new species: after classifying the individual, find the average intraspecies mean square distance for the candidate species. If the individual is close enough, assign it to that species; otherwise it belongs to a new species.
- Insertions/deletions (in/dels) are ignored.
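The frequency-profile classification above can be sketched as follows, with k = 2 for dinucleotides. Removing gap characters before counting is one possible reading of "in/dels are ignored" and is an assumption, as are all function names and the example data.

```python
from collections import Counter
from itertools import product
from typing import Dict, List


def kmer_frequencies(seqs: List[str], k: int) -> Dict[str, float]:
    """Relative frequency of every k-mer over A, C, G, T in the sequences."""
    counts: Counter = Counter()
    for s in seqs:
        s = s.upper().replace("-", "")  # assumption: gaps are simply dropped
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers with ambiguous symbols
                counts[kmer] += 1
    total = sum(counts.values())
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def mean_square_distance(f: Dict[str, float], g: Dict[str, float]) -> float:
    """Mean of squared frequency differences over all k-mers."""
    return sum((f[m] - g[m]) ** 2 for m in f) / len(f)


def classify_by_kmers(x: str, training: Dict[str, List[str]], k: int = 2) -> str:
    """Assign x to the species whose k-mer profile is closest to x's."""
    fx = kmer_frequencies([x], k)
    profiles = {s: kmer_frequencies(seqs, k) for s, seqs in training.items()}
    return min(profiles, key=lambda s: mean_square_distance(fx, profiles[s]))
```

Calling `classify_by_kmers(x, training, k=3)` gives the trinucleotide variant of the same procedure.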

13 Trinucleotide Distance
- For each species, find the frequency with which each trinucleotide appears.
- Compute the frequency of each trinucleotide in the unclassified sequence.
- Find the species with the minimum mean square distance to the unclassified sequence.
- For new species: after classifying the individual, find the average intraspecies mean square distance for the candidate species. If the individual is close enough, assign it to that species; otherwise it belongs to a new species.
- Insertions/deletions (in/dels) are ignored.
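The slides do not state what "close enough" means, so the sketch below only illustrates one possible rule: accept the candidate species when the distance of the new individual is at most `slack` times the average intraspecies distance. The `slack` factor, the function name, and the `"NEW_SPECIES"` label are all hypothetical.

```python
from statistics import mean
from typing import List


def assign_or_flag_new(candidate: str,
                       d_x: float,
                       intraspecies_distances: List[float],
                       slack: float = 2.0) -> str:
    """Keep the candidate species or report a new species.

    candidate: species chosen by the frequency-distance classifier.
    d_x: mean square distance from the new individual to that species.
    intraspecies_distances: the same distance computed for each training
        member of the candidate species.
    slack: hypothetical tolerance factor; it would need tuning.
    """
    threshold = slack * mean(intraspecies_distances)
    return candidate if d_x <= threshold else "NEW_SPECIES"
```

Note that a species represented by a single training sequence has an intraspecies distance of zero under this rule, so any non-identical individual would be flagged as new; a real implementation would need a fallback threshold for that case.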

14 Nucleotide Frequency Similarity
- For each position in the alignment, find the frequency with which each nucleotide appears among the individuals of the species. The frequency of in/dels at that position is included.
- For an unclassified individual, compute the log of the probability that it belongs to each species, and assign it to the species for which this probability is maximum.
- For new species, compute the minimum such probability over the individuals already in the candidate species, and compare it with the probability of the new individual to decide whether it belongs to that species.
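A sketch of the position-specific frequency model described above. The pseudocount (added so that unseen symbols do not give log 0), the five-symbol alphabet, and the `"NEW_SPECIES"` label are assumptions not taken from the slides.

```python
import math
from collections import Counter
from typing import Dict, List

ALPHABET = "ACGT-"  # nucleotides plus the in/del (gap) symbol


def position_profile(seqs: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Per-position symbol frequencies for one species (with pseudocounts)."""
    profile = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        total = sum(counts[c] for c in ALPHABET) + pseudocount * len(ALPHABET)
        profile.append({c: (counts[c] + pseudocount) / total for c in ALPHABET})
    return profile


def log_probability(x: str, profile: List[Dict[str, float]]) -> float:
    """Log of the product of per-position frequencies of x's symbols."""
    return sum(math.log(col[c]) for c, col in zip(x, profile) if c in col)


def classify_or_new(x: str, training: Dict[str, List[str]]) -> str:
    """Pick the most probable species, then apply the minimum-member test."""
    profiles = {s: position_profile(seqs) for s, seqs in training.items()}
    best = max(profiles, key=lambda s: log_probability(x, profiles[s]))
    weakest_member = min(log_probability(y, profiles[best]) for y in training[best])
    if log_probability(x, profiles[best]) < weakest_member:
        return "NEW_SPECIES"
    return best
```

On made-up data such as `{"S1": ["ACGT", "ACGT", "ACGA"], "S2": ["TTTT", "TTTA", "TTTT"]}`, the sequence `"ACGT"` is assigned to `S1`, while `"GGGG"` scores below every training member of its best species and is reported as new.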

15 Combining the Methods
The species on which most classifiers agree is returned.
Simple voting:
- Every classifier's returned species has a weight of 1.
- Output the species with the most votes.
Weighted voting:
- Every classifier has a different weight, based on the accuracy of that method on its own.
- Output the species with the largest total weight.
As expected, weighted voting yields higher accuracy, so in our results the combined method uses weighted voting.
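Both voting schemes fit in one small function, sketched below. The classifier names and the weights in the example are invented for illustration and are not results from the slides; ties are broken in favour of the species that received a vote first, which is also an assumption.

```python
from collections import defaultdict
from typing import Dict, Optional


def vote(predictions: Dict[str, str],
         weights: Optional[Dict[str, float]] = None) -> str:
    """Combine per-classifier predictions into one species.

    predictions: classifier name -> predicted species.
    weights: classifier name -> weight (for example its standalone
        accuracy); None means simple voting with weight 1 each.
    """
    totals: Dict[str, float] = defaultdict(float)
    for name, species in predictions.items():
        totals[species] += 1.0 if weights is None else weights[name]
    return max(totals, key=totals.get)


# Hypothetical predictions and weights.
predictions = {"hamming": "S1", "dinucleotide": "S2", "trinucleotide": "S2"}
weights = {"hamming": 0.97, "dinucleotide": 0.40, "trinucleotide": 0.45}

simple_result = vote(predictions)             # S2 wins two votes to one
weighted_result = vote(predictions, weights)  # S1 wins 0.97 to 0.85
```

The example shows how a single accurate classifier can outvote two weaker ones once weights are applied.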

16 Datasets (1)
- We used the provided dataset, consisting of 1623 aligned sequences classified into 150 species, with each sequence consisting of 590 nucleotides on average.
- We randomly deleted 10 to 50 percent of the sequences from each species:
  - Deleted sequences -> test set
  - Remaining sequences -> training set
- We made sure that every species keeps at least one sequence.
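The per-species split could be implemented along these lines. The function name, the rounding down of the deleted fraction, and the explicit random generator are assumptions; the slides only state the percentages and the at-least-one-sequence guarantee.

```python
import random
from typing import Dict, List, Tuple

Dataset = Dict[str, List[str]]


def split_dataset(data: Dataset, fraction: float,
                  rng: random.Random) -> Tuple[Dataset, Dataset]:
    """Move `fraction` of each species' sequences to the test set.

    At least one sequence of every species always stays in training.
    """
    train: Dataset = {}
    test: Dataset = {}
    for species, seqs in data.items():
        shuffled = list(seqs)
        rng.shuffle(shuffled)
        n_test = max(0, min(int(len(shuffled) * fraction), len(shuffled) - 1))
        test[species] = shuffled[:n_test]
        train[species] = shuffled[n_test:]
    return train, test
```

For example, `split_dataset(data, 0.5, random.Random(0))` deletes half of each species for testing, and a species with a single sequence contributes nothing to the test set.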

17 Species recovery accuracy (in %), no new species
Table: accuracy of each method by percent missing from each species (%).
Methods: Amino Acid Similarity; Min. Amino Acid Similarity; Hamming Dist.; Min. Hamming Dist.; Nucleotide Freq. Sim.; Dinucleotide Freq. Dist.; Trinucleotide Freq. Dist.; Combination

18 Datasets (2)
To test the accuracy of new species detection and classification, we devised a regular leave-one-out procedure:
- Delete a whole species.
- Randomly delete 0 to 50 percent of the sequences from each remaining species:
  - Deleted sequences -> test set
  - Remaining sequences -> training set
The following table gives accuracy results averaged over 150x6 different test cases.
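One way to enumerate these test cases is sketched below. Reading "150x6" as every species held out once, combined with six deletion levels (0, 10, 20, 30, 40 and 50 percent), is an interpretation, and the names and the `None` label for held-out sequences are hypothetical.

```python
import random
from typing import Dict, Iterator, List, Optional, Tuple

Dataset = Dict[str, List[str]]
TestItem = Tuple[str, Optional[str]]  # (sequence, true species or None if new)
FRACTIONS = (0.0, 0.1, 0.2, 0.3, 0.4, 0.5)


def leave_one_species_out(data: Dataset, rng: random.Random,
                          fractions: Tuple[float, ...] = FRACTIONS
                          ) -> Iterator[Tuple[str, float, Dataset, List[TestItem]]]:
    """Yield (held_out, fraction, train, test) for every test case."""
    for held_out in data:
        for fraction in fractions:
            train: Dataset = {}
            # Every sequence of the held-out species must be detected as new.
            test: List[TestItem] = [(seq, None) for seq in data[held_out]]
            for species, seqs in data.items():
                if species == held_out:
                    continue
                shuffled = rng.sample(seqs, len(seqs))
                n_test = max(0, min(int(len(seqs) * fraction), len(seqs) - 1))
                test += [(seq, species) for seq in shuffled[:n_test]]
                train[species] = shuffled[n_test:]
            yield held_out, fraction, train, test
```

With 150 species this generator produces 900 cases, matching the 150x6 count on the slide.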

19 Leave-one-out accuracy (in %)
Table: accuracy of each method by percent missing from each remaining species (%).
Methods: Amino Acid Similarity; Min. Amino Acid Similarity; Hamming Dist.; Min. Hamming Dist.; Dinucleotide Freq. Dist.; Trinucleotide Freq. Dist.; Nucleotide Freq. Sim.; Combination

20 Conclusions (1)
- Every method shows a tradeoff between new species detection and classification accuracy.
- Hamming distance performs very well when no new species are present, but its accuracy is low for new species detection.
- The combined method yields better accuracy on both new species detection and sequence classification.
- The runtimes of all methods are within the same order of magnitude.

21 Conclusions (2)
- By combining simple classification methods, we managed to boost the accuracy both for classifying individuals into known species and for detecting new species.
- As expected, the results imply a tradeoff between classification and new species detection: the higher the classification accuracy, the lower the detection accuracy.
- Hamming distance is a very good metric for the training dataset provided.

22 Future Work
- New species clustering: determining the different new species present.
- Further investigate threshold selection and weighting schemes.
- Ignoring parts of the given sequences could possibly improve accuracy. Are there redundant or noisy regions?
- Use independent weighting schemes for new species detection and for classification into known species.

23 Questions Thank you.