Suffix Tree of Alignment: An Efficient Index for Similar Data JOONG CHAE NA1, HEEJIN PARK2, MAXIME CROCHEMORE3, JAN HOLUB4, COSTAS S. ILIOPOULOS3, LAURENT.

Presentation transcript:

Suffix Tree of Alignment: An Efficient Index for Similar Data
JOONG CHAE NA1, HEEJIN PARK2, MAXIME CROCHEMORE3, JAN HOLUB4, COSTAS S. ILIOPOULOS3, LAURENT MOUCHARD5, AND KUNSOO PARK6
Presented by Ramin Fallahzadeh

Problem definition
Indexing multiple data sets that are very similar:
◦ Modified versions of existing data (e.g., a new version of source code)
◦ Today's backup vs. yesterday's backup
◦ An individual's genome vs. the human reference genome (99% identical)

Storing vs. indexing data
Storing data:
◦ Using an alignment to store only the differences
◦ Data compression schemes
Indexing data:
◦ Example: search engines
◦ Suffix tree: linear time and space complexity
◦ One solution: constructing a generalized suffix tree

Generalized suffix tree
GST(A,B):
◦ |A|+|B| leaves
◦ O(|A|+|B|) construction time
◦ Drawbacks:
◦ Common suffixes are stored twice: for A = aaatcaaa and B = aaatgaaa, the suffixes {aaa, aa, a} are each stored twice in the GST
◦ The two suffixes aaatcaaa and aaatgaaa are stored in distinct leaves even though they are very similar
◦ Therefore, for similar data, most of the leaves are redundant!
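A minimal sketch of the first drawback, using the slide's strings A and B. It does not build a suffix tree; it only enumerates suffixes as plain Python sets to show which ones occur in both strings. The helper names `suffixes` and `shared_suffixes` are hypothetical and used for illustration only.

```python
def suffixes(s):
    """Return the set of all non-empty suffixes of s."""
    return {s[i:] for i in range(len(s))}


def shared_suffixes(a, b):
    """Return the suffixes that occur in both a and b, longest first."""
    return sorted(suffixes(a) & suffixes(b), key=len, reverse=True)


A = "aaatcaaa"
B = "aaatgaaa"

common = shared_suffixes(A, B)
total_leaves = len(A) + len(B)  # one leaf per suffix of A and of B, as on the slide

print(common)        # ['aaa', 'aa', 'a']
print(total_leaves)  # 16
```

Of the 16 leaves, the three shared suffixes appear once for A and once for B, and every longer suffix of A differs from its counterpart in B by a single character.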

Contribution
Neither the suffix tree nor any variant of the suffix tree uses this similarity or alignment to index similar data efficiently!

Alignment

◦ The given alignment is not required to be optimal: a near-optimal alignment can be used instead when the time to compute the alignment matters.
◦ Since the given strings are assumed to be highly similar, a near-optimal alignment can be computed quickly by exact string matching instead of time-consuming dynamic programming.
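As an illustration of computing an alignment by exact matching only, the sketch below strips the longest common prefix and the longest common suffix of two strings and treats what remains as the differing region. This is an assumption made for illustration, not necessarily the procedure the authors use, and it handles only a single differing region. The function name `simple_alignment` and the returned names (alpha, beta, delta, gamma) are hypothetical labels for the parts of the notation α(β/δ)γ.

```python
def simple_alignment(a, b):
    """Split a and b as a = alpha+beta+gamma and b = alpha+delta+gamma.

    alpha is the longest common prefix; gamma is the longest common
    suffix that does not overlap alpha in the shorter string.
    """
    n = min(len(a), len(b))

    # Longest common prefix by direct character comparison.
    p = 0
    while p < n and a[p] == b[p]:
        p += 1

    # Longest common suffix, limited so it cannot overlap the prefix.
    s = 0
    while s < n - p and a[len(a) - 1 - s] == b[len(b) - 1 - s]:
        s += 1

    alpha = a[:p]
    gamma = a[len(a) - s:]
    beta = a[p:len(a) - s]
    delta = b[p:len(b) - s]
    return alpha, beta, delta, gamma


A = "aaabaaabbaaba#"
B = "aaabaabaabbaba#"
alpha, beta, delta, gamma = simple_alignment(A, B)
print(f"{alpha}({beta}/{delta}){gamma}")  # aaabaa(abba/baabb)aba#
```

Both loops run in time linear in the string lengths, in contrast to the quadratic table of a dynamic-programming alignment.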

Naïve approach
◦ Construct the generalized suffix tree and delete the unnecessary leaves: not time- or space-efficient!
◦ The proposed algorithm is incremental, i.e., we construct the suffix tree of A and then transform it into the suffix tree of the alignment.
◦ This algorithm uses constant-size extra working space apart from the suffix tree itself, so it is more space-efficient than the naïve method.

Simple alignment α

General Alignment

Definitions

Example
Generalized suffix tree:
A = aaabaaabbaaba#
B = aaabaabaabbaba#

Example
Suffix tree of alignment
A = aaabaaabbaaba#
B = aaabaabaabbaba#
Alignment: aaabaa(abba/baabb)aba#
Leaf types labeled in the figure: Type-1, Type-2, Type-3, Type-4

Construction

Example
ST(A)
A = aaabaaabbaaba#

Example
ST’(A) when step A is applied:
A = aaabaaabbaaba#
B = aaabaabaabbaba#

Example
Suffix tree of alignment
A = aaabaaabbaaba#
B = aaabaabaabbaba#
Alignment: aaabaa(abba/baabb)aba#

Suffix Tree of General Alignments

Construction

Space Complexity

Time complexity

Thank you for your attention. Any questions?