On Evaluating the Performance of Compression Based Techniques for Sequence Comparison
Ramez Mina†, Dhundy Bastola†,* and Hesham Ali†,*

†College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE
*Department of Pathology and Microbiology, University of Nebraska Medical Center, Omaha, NE

ABSTRACT
Comparing biological sequences remains one of the most important problems in bioinformatics. Sequence alignment has been the method of choice, especially when comparing DNA and protein sequences. This approach to comparative analysis has been used to detect functional and structural similarities and for classification. While alignment has proven to be a reliable approach to sequence comparison, it fails to produce accurate results in many cases, particularly when the input sequences are incomplete, are highly dissimilar, or contain a large number of repeats or mobile subsequences. We conducted a study to evaluate the performance of compression based algorithms in detecting similarity among biological sequences. These algorithms use data compression techniques to generate a dictionary of non-redundant, non-overlapping words. Dissimilarity measures (complexities) are then obtained by comparing the dictionaries. We implemented different compression algorithms, including LZW and Huffman [1], and evaluated the distance between input sequences using Lempel-Ziv and Kolmogorov complexities [1, 2]. Using different datasets [1], we compared the trees obtained with Kolmogorov and Lempel-Ziv complexities against the gold standard trees. Additionally, the trees obtained through alignment and compression based approaches were compared with each other. Our preliminary results show that compression based algorithms outperform alignment techniques on several datasets that contain highly dissimilar sequences.

MOTIVATION
Classification of organisms and construction of phylogenies are major activities in bioinformatics.
Mutational events that change nucleotide composition are the basis of molecular evolutionary studies, and molecular phylogenetic studies rely on genetic sequences and their comparison. The most commonly used computational approach to classification and phylogenetic studies based on genetic sequence similarity involves pairwise local or multiple sequence alignment. Although sequence alignment has been the method of choice in comparative sequence analysis and has also been used to detect functional and structural similarities, it is not suitable for comparing whole genomes, very long sequences, or sequences with multiple targets. A review of the literature shows that many compression based methods are being evaluated to overcome the limitations of alignment based techniques. In this study, we implement alignment and compression based algorithms, including the dictionary based technique, on mitochondrial genomes from 15 mammals and on proteins.

CONCLUSIONS
- A set of closely related organisms can be compared using DNA or protein sequences.
- Compression based techniques for sequence comparison are a competitive alternative to the widely used alignment based methods.
- The dictionary based comparison methods outperform other compression or alignment based techniques.

REFERENCES
[1] Paolo Ferragina, Raffaele Giancarlo, Valentina Greco, Giovanni Manzini and Gabriel Valiente, "Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment", BMC Bioinformatics, 2007, 8:252.
[2] Hasan H. Otu and Khalid Sayood, "A new sequence distance measure for phylogenetic tree construction", Bioinformatics, 2003, 19:2122.
[3] J. Handl, J. Knowles and D. B. Kell, "Computational cluster validation in post-genomic data analysis", Bioinformatics, 2005, 21(15).

TABLE 1. Evaluation of compression based (Kolmogorov, LZ) and sequence alignment approaches using the Chew-Kedem data set of 36 proteins.
The values reflect how closely each tree topology matches the gold standard, using the F-measure for comparing trees.

Fig. 1. A flow diagram showing the major steps in data collection, analysis and evaluation. Two compression based methods and the classical alignment method were used in sequence analysis. The relatedness between sequences was captured in distance matrices, which were used to produce phylogenetic trees.

METHODOLOGY
Data compression is the process of minimizing the size of data using a specific encoding scheme. It is a widely used technique for reducing the use of expensive resources such as hard disk space or transmission bandwidth. Many compression techniques are based on finding similar, repeated structures. Compression techniques have therefore found application in the analysis of genetic sequences, where the data consist of long strings of deoxyribonucleotides (A, T, G and C) with repeated structure. Additionally, when two or more sequences are compared, the degree of compression can reflect the relatedness between them. This concept is known as compression complexity. Two known examples are the Lempel-Ziv (LZ) [2] and Kolmogorov [1] complexities: the former produces a dictionary of words, and the latter is more flexible in the choice of compression technique.

LZ complexity
The LZ complexity of a sequence S, written c(S), is the number of words in its exhaustive history H_E(S). Each step copies the longest word that can be reproduced from the sequence generated so far and then adds one new character:

Step  Copy     Add      Generated sequence (Z)
1     Nothing  A        A
2     Nothing  T        A.T
3     Nothing  G        A.T.G
4     TG       A        A.T.G.TGA
5     ATG      C        A.T.G.TGA.ATGC
6     AT       Nothing  A.T.G.TGA.ATGC.AT

The exhaustive histories of the example sequences S, R and Q and of the concatenations SQ and RQ are:

H_E(S) = A.T.G.TGA.ATGC.AT
H_E(R) = C.T.A.G.GGA.CTT.AT
H_E(Q) = A.C.G.GT.CA.CC.AA
H_E(SQ) = A.T.G.TGA.ATGC.ATA.CG.GTC.ACC.AA
H_E(RQ) = C.T.A.G.GGA.CTT.AT.ACG.GT.CA.CC.AA

so c(S) = 6, c(R) = 7, c(Q) = 7, c(SQ) = 10 and c(RQ) = 12.

Distances are calculated as follows:

Distance measure 1: d(S, Q) = max{c(SQ) - c(S), c(QS) - c(Q)}
Distance measure 2: d*(S, Q) = (expression not preserved in the transcript)
Distance measure 3: d1(S, Q) = c(SQ) - c(S) + c(QS) - c(Q)
Distance measure 4: d1*(S, Q) = (expression not preserved in the transcript)

where c(S) is the LZ complexity of sequence S and c(SQ) is the LZ complexity of S with Q appended. These distances fill the distance matrices used to build the phylogenetic tree.

Kolmogorov complexity
Given two sequences x and y, the conditional Kolmogorov complexity K(x|y) is the length of the shortest binary program that computes x given y. The information distance ID between two sequences x and y is defined as:

ID(x, y) = max{K(x|y), K(y|x)}

Kolmogorov complexity is based not on the number of words but on the length of the compressed sequences. It underlies the Universal Similarity Metric (USM), which can be approximated by the Universal Compression Distance (UCD), the Normalized Compression Distance (NCD) and the Compression Distance (CD), with

NCD(x, y) = min{NCD_1(x, y), NCD_1(y, x)}

where C(x) is the length of the compressed sequence x and C(xy) is the length of the compressed sequence xy (sequence y appended to the end of sequence x). The values produced by these formulas fill the distance matrices used to construct the phylogenetic trees.

EVALUATION and RESULTS
To assess the feasibility of compression techniques for sequence comparison, phylogenetic trees were obtained (Fig. 3) from mitochondrial genome sequences of 15 mammals. The trees obtained from the original sequences provided a baseline for comparison with the trees obtained from a data set that also included simulated nucleotide mutations of the original sequences.
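The poster does not say how the nucleotide mutations were simulated. As a minimal sketch, assuming uniformly random point substitutions at a fixed fraction of positions (the function name `mutate` and its parameters are illustrative and not taken from the study), mutated copies of a sequence could be produced like this:

```python
import random


def mutate(seq, rate, rng=None, alphabet="ACGT"):
    """Return a copy of seq with round(rate * len(seq)) substituted positions.

    Assumption: only point substitutions are simulated (no insertions or
    deletions), and every position is equally likely to be chosen.
    """
    rng = rng or random.Random()
    bases = list(seq)
    n_mut = round(rate * len(bases))
    # Sample distinct positions so that exactly n_mut sites change.
    for pos in rng.sample(range(len(bases)), n_mut):
        # Pick a replacement that differs from the current base.
        choices = [b for b in alphabet if b != bases[pos]]
        bases[pos] = rng.choice(choices)
    return "".join(bases)


parent = "ATGTGAATGCAT" * 10
for rate in (0.01, 0.05, 0.10):
    child = mutate(parent, rate, random.Random(42))
    changed = sum(a != b for a, b in zip(parent, child))
    print(rate, changed)
```

Each mutated copy would then be added to the data set next to its parent before the distance matrix is computed, so that the resulting tree can be checked for parent and child clustering together.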
Clustering of each mutated sequence with its parent indicated a dependable comparison method.

Fig. 3. Phylogenetic trees showing consistency of tree topology between (A) sequences with no mutation and (B) sequences with 1, 5 and 10% random (simulated) mutation of the mitochondrial genome sequences. The distance matrix obtained with distance measure 2 was used to construct the Neighbor-Joining tree.

To compare the performance of the compression methods and the classical alignment based approach, the Chew-Kedem data set was analyzed (Fig. 4). This data set consists of 36 protein domains drawn from PDB entries of three classes (alpha-beta, mainly-alpha, mainly-beta). Using the F-measure [3] for cluster validation allowed the trees to be compared against the gold standard from NCBI (Table 1). A tree highly similar to the gold standard tree is expected to have a value close to 1.

Fig. 4. Phylogenetic trees obtained with (A) Kolmogorov, (B) Lempel-Ziv and (C) sequence alignment techniques using the Chew-Kedem data set of 36 proteins, as in Table 1.

Table 1 layout (values not preserved in the transcript): columns Neighbor-joining and Hierarchical; rows Lempel-Ziv (four distance measures), Kolmogorov (CD, NCD, UCD) and sequence alignment.

Compression distances UCD(x, y), CD(x, y) and NCD_1(x, y): expressions not preserved in the transcript.

LZ complexity: the LZ complexity of a sequence is defined as the number of words in its least, exhaustive history and is written c(sequence). The following sequences serve as an example:

S = ATGTGAATGCAT
R = CTAGGGACTTAT
Q = ACGGTCACCAA

Fig. 2. Steps involved in generating the exhaustive history of the sequence ATGTGAATGCAT. The exhaustive history of a sequence is a collection of unique subsequences (words).
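The exhaustive-history parsing and the two LZ distances whose formulas survive in the Methodology (distance measures 1 and 3) can be sketched in a few lines. This is an illustrative implementation written for this transcript, not the authors' code; the function names `exhaustive_history`, `c`, `d` and `d1` are assumptions chosen to mirror the poster's notation.

```python
def exhaustive_history(seq):
    """Split seq into the words of its LZ exhaustive history."""
    words, i, n = [], 0, len(seq)
    while i < n:
        j = i + 1
        # Grow the word while seq[i:j] can still be copied from the text
        # that precedes its last character (self-overlap is allowed).
        while j <= n and seq[i:j] in seq[:j - 1]:
            j += 1
        j = min(j, n)  # the final word may end without a new character
        words.append(seq[i:j])
        i = j
    return words


def c(seq):
    """LZ complexity: number of words in the exhaustive history."""
    return len(exhaustive_history(seq))


def d(s, q):
    """Distance measure 1: max{c(SQ) - c(S), c(QS) - c(Q)}."""
    return max(c(s + q) - c(s), c(q + s) - c(q))


def d1(s, q):
    """Distance measure 3: c(SQ) - c(S) + c(QS) - c(Q)."""
    return c(s + q) - c(s) + c(q + s) - c(q)


S, R, Q = "ATGTGAATGCAT", "CTAGGGACTTAT", "ACGGTCACCAA"
print(".".join(exhaustive_history(S)))       # A.T.G.TGA.ATGC.AT
print(c(S), c(R), c(Q), c(S + Q), c(R + Q))  # 6 7 7 10 12
print(d(S, Q), d1(S, Q))                     # 4 8
```

Running the sketch on the example sequences reproduces the complexities quoted on the poster. Evaluating `d` or `d1` for every pair of sequences in a data set would give the kind of distance matrix that is passed to a tree-building method such as Neighbor-Joining.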