Efficient Computation of Substring Equivalence Classes with Suffix Arrays

This study is joint work with Shunsuke Inenaga, Hideo Bannai and Masayuki Takeda.


Kazuyuki Narisawa, Shunsuke Inenaga, Hideo Bannai and Masayuki Takeda
Kyushu University, Japan

Contents
・Introduction
・Problem definition
・Suffix tree based algorithm
・Simulation by suffix array
・Computational experiment
・Application
・Summary

These are the contents of this presentation.

Main contribution
Time and space efficient computation of substring equivalence classes [Blumer et al. 1987] with suffix arrays.

Our main contribution is an efficient algorithm that uses suffix arrays to compute the substring equivalence classes defined by Blumer et al. The algorithm runs in linear time and space. In practice it is also faster and requires less memory than the suffix tree and CDAWG based methods.

Equivalence relation and classes
Given a string w, the maximal extension of a substring x is $\overleftrightarrow{x} = \alpha x \beta$, where
・Every time x occurs in w, it is preceded by α and followed by β.
・Strings α and β are the longest possible.
Equivalence relation: $x \equiv y \iff \overleftrightarrow{x} = \overleftrightarrow{y}$
Equivalence class: $[x] = \{\, y \mid y \equiv x \,\}$ (substrings with essentially identical occurrences in w)
Example: Betty-bought-a-bit-of-better-butter-and-made-a-better-batter-after-breakfast.
$\overleftrightarrow{\mathrm{bet}}$ = -better-b, so bet ∈ [-better-b].

First, I will explain the equivalence relation and the equivalence classes. Given a string w, the maximal extension of a substring x is defined as above. Based on the maximal extension, we can define the equivalence relation, and hence the equivalence classes, on the substrings of w. An equivalence class consists of substrings which appear in essentially the same places. For example, consider as w the sentence "Betty bought a bit of better butter and made a better batter after breakfast." (spaces are shown as hyphens on the slide). In this sentence, the substring "bet" is always preceded by a hyphen "-" and followed by "ter-b", and we cannot extend it further without changing its occurrence frequency.
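The definition above can be checked by brute force. The following is a minimal Python sketch, not the algorithm of this talk: the function names `occurrences`, `maximal_extension` and `equivalence_classes` are illustrative, and the running time is far from linear. It extends a substring to the left and right for as long as all of its occurrences agree on the neighbouring character.

```python
def occurrences(w, x):
    """All 0-based start positions of x in w (overlaps included)."""
    return [i for i in range(len(w) - len(x) + 1) if w.startswith(x, i)]


def maximal_extension(w, x):
    """Brute-force maximal extension of a substring x of w."""
    occ = occurrences(w, x)
    if not occ:
        raise ValueError("x must occur in w")
    n = len(x)
    left = 0
    # Extend left while every occurrence has the same preceding character.
    while (all(i - left - 1 >= 0 for i in occ)
           and len({w[i - left - 1] for i in occ}) == 1):
        left += 1
    right = 0
    # Extend right while every occurrence has the same following character.
    while (all(i + n + right < len(w) for i in occ)
           and len({w[i + n + right] for i in occ}) == 1):
        right += 1
    i = occ[0]
    return w[i - left:i + n + right]


def equivalence_classes(w):
    """Group all substrings of w by their maximal extension."""
    classes = {}
    for i in range(len(w)):
        for j in range(i + 1, len(w) + 1):
            x = w[i:j]
            classes.setdefault(maximal_extension(w, x), set()).add(x)
    return classes


sentence = ("Betty-bought-a-bit-of-better-butter-and-"
            "made-a-better-batter-after-breakfast.")
print(maximal_extension(sentence, "bet"))  # -better-b
```

The dictionary returned by `equivalence_classes` is keyed by the representative of each class, which makes it convenient for cross-checking a faster implementation on short strings.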

Problem
Input: a string w of length n
Output: the equivalence classes on w
Difficulty: the total number of elements in the equivalence classes (ECs for short) is O(n^2).
Solution: the number of ECs is O(n), and each EC can be succinctly represented in O(1) space.

This is the problem that we will consider. The input is a string w of length n. The output is the equivalence classes on w. A difficulty of this problem is that the total number of elements in the equivalence classes is not linear in n. To overcome this difficulty, we consider a succinct representation that can describe each equivalence class in constant space.

Succinct representation of the ECs
・representative: the longest element (the maximal extension)
・minimal strings: the elements which belong to another EC when the leftmost or rightmost character is deleted
$[x] = \mathrm{Substring}(\overleftrightarrow{x}) \cap \bigl( \bigcup_{y \text{ is minimal}} \mathrm{Superstring}(y) \bigr)$
The elements of [x] can be enumerated with the representative and the minimal strings.
[Figure: the example sentence Betty-bought-a-bit-of-better-butter-and-made-a-better-batter-after-breakfast. with the representative and the minimal strings of one EC marked.]

So we consider a succinct representation of the equivalence classes. The succinct representation consists of the representative and the minimal strings. An equivalence class is formed by the substrings of the representative that are also superstrings of some minimal string. This example shows the representative and the minimal strings of the substring "-better-b".
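To illustrate how a class can be recovered from its representative and minimal strings, here is a small brute-force Python sketch. The helper names `minimal_strings` and `expand` are hypothetical, and the class {abb, babb, bab, ba} is the one shown later for the string ababbbabbc$.

```python
def minimal_strings(ec):
    """Members of the class that leave it when the first or last character is deleted."""
    return {y for y in ec if y[1:] not in ec and y[:-1] not in ec}


def expand(representative, minimals):
    """Substrings of the representative that contain at least one minimal string."""
    n = len(representative)
    substrings = {representative[i:j]
                  for i in range(n) for j in range(i + 1, n + 1)}
    return {s for s in substrings if any(y in s for y in minimals)}


ec = {"abb", "babb", "bab", "ba"}
representative = max(ec, key=len)   # the longest element, "babb"
minimals = minimal_strings(ec)      # {"ba", "abb"}
print(expand(representative, minimals) == ec)  # True
```

Only the representative and the minimal strings need to be stored; the remaining members are regenerated on demand.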

Problem
Input: a string w of length n
Output: succinct representations of the equivalence classes on w
Additionally, for each EC we output
・size (the number of elements in the EC)
・frequency (the number of occurrences of the elements in the EC)

Thus, the output of our problem becomes the succinct representations of the equivalence classes on w. In addition, this study outputs the size and the frequency of each equivalence class.

Possible solutions
・Suffix tree [Weiner 1973]
・Compact Directed Acyclic Word Graph (CDAWG) [Blumer et al. 1985]
ECs can be computed with either of these data structures in linear time and space.

To solve the problem, we consider some data structures. Using a suffix tree or a compact directed acyclic word graph, we can solve the problem in linear time and space.

Suffix tree (with suffix links)
String: ababbbabbc$
[Figure: suffix tree of ababbbabbc$ with leaves labelled 1 to 11 and suffix links drawn as red dotted arrows.]
Ignore leaves here because they form a trivial EC.

Thus we consider an algorithm that uses the suffix tree. This is the suffix tree for the example string "ababbbabbc$". We assume that strings end with a distinct character $ that does not appear anywhere else in the string. The red dotted arrows are suffix links. A suffix link points to the node representing the string with the leftmost character deleted. For example, the suffix link of "babb" points to "abb". In this presentation, we will not consider leaves because they form a trivial EC.

Equivalence classes on suffix tree
String: ababbbabbc$
[Figure: suffix tree of ababbbabbc$ with suffix links; the nodes "babb" and "abb" are grouped as one EC containing abb, babb, bab and ba.]
・Equivalence relation: two nodes are connected by a suffix link and their subtrees have the same number of leaves.
・EC definition: substrings with essentially the same occurrences.

Now we consider the equivalence classes on the suffix tree. For example, consider the substrings in the same equivalence class as the node "babb". We can see that "bab" and "ba" are in the same equivalence class because the subtree below them has the same number of leaves, which is the number of occurrences. In addition, these two nodes are equivalent. We can judge whether two nodes are equivalent by checking whether they are connected by a suffix link and whether their subtrees have the same number of leaves. Thus, these two nodes represent substrings in the same equivalence class, which consists of the substrings "abb", "babb", "bab" and "ba". So, an algorithm for computing equivalence classes can be as follows.

Suffix tree algorithm

foreach node v in suffix tree {
    if (node v is representative of EC [v]≡) {
        follow suffix link;
        while (node is in EC [v]≡) {
            compute size and minimal strings;
            follow suffix link;
        }
        output succinct representation of EC [v]≡;
    }
}

This is an algorithm for computing the equivalence classes with a suffix tree. The algorithm is based on a traversal of the suffix tree.

Algorithm with suffix tree
[Figure: animated walk over the suffix tree of ababbbabbc$, with the following callouts.]
・representative? (is the number of incoming suffix links 1 or not?)
・not representative: continue suffix tree traversal
・representative: follow suffix link
・in same EC? (same number of leaves?)
・in other EC: back to representative
・output: representative, minimal strings, size, frequency
・Suffix tree requires large memory space.

Here is an example. First the algorithm traverses the suffix tree. We use a post-order traversal here. When visiting an interior node, we check whether the node is a representative or not. In this case, the node is not a representative because the number of incoming suffix links is 1, so the algorithm continues the suffix tree traversal. Next, we check whether this node is a representative or not. In this case, the node is a representative because the number of incoming suffix links is not 1. So, in order to find the other nodes in the same equivalence class, we follow the suffix link and check whether the node we reach is in the same equivalence class. In this case, it is not in the same equivalence class. Thus we go back to the representative and output the succinct representation. If the node is in the same equivalence class, we continue to follow suffix links for as long as the nodes stay in the same equivalence class. After this, the algorithm proceeds in the same way. Hence, we can compute the equivalence classes using this algorithm. However, a problem with the suffix tree is that it requires a large amount of memory. So we consider the suffix array instead of the suffix tree.

Suffix array [Manber and Myers 1993]
・Can simulate traversal on the suffix tree using lcp and rank arrays [Kasai et al. 2003]
・Can simulate traversal on suffix links using an additional data structure: the suffix link table [Abouelhoda et al. 2004]
・Our algorithm: simulates traversal on suffix links without a suffix link table

Kasai et al. showed that traversal of suffix trees can be simulated on suffix arrays by using the lcp and rank arrays. However, for our algorithm we also need to follow suffix links. Abouelhoda et al. showed that traversal of suffix links can also be simulated on suffix arrays by creating a suffix link table. In our work, we show that our algorithm does not require a suffix link table and can be done with just the lcp and rank arrays.
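As an illustration of how a suffix tree traversal can be simulated from the lcp array alone, here is a Python sketch of the well-known stack-based enumeration of lcp intervals. It is not taken from the slides: the function name `lcp_intervals`, the 0-based indexing and the convention `lcp[0] = -1` are assumptions of this sketch. Each reported interval corresponds to one internal node, and the nodes come out in post-order.

```python
def lcp_intervals(lcp):
    """Return (depth, left, right) for every internal suffix tree node, bottom-up.

    lcp[i] is the longest common prefix length of the suffixes at suffix
    array positions i - 1 and i; lcp[0] is never read.
    """
    n = len(lcp)
    stack = [(0, 0)]            # (string depth, left boundary); root first
    nodes = []
    for i in range(1, n + 1):
        cur = lcp[i] if i < n else -1   # -1 flushes the stack at the end
        left = i - 1
        while stack and stack[-1][0] > cur:
            depth, node_left = stack.pop()
            nodes.append((depth, node_left, i - 1))
            left = node_left    # the next open node inherits this boundary
        if not stack or stack[-1][0] < cur:
            stack.append((cur, left))
    return nodes


# lcp array of "ababbbabbc$" with '$' sorted as the largest character
lcp = [-1, 2, 3, 0, 4, 1, 2, 2, 1, 0, 0]
for depth, left, right in lcp_intervals(lcp):
    print(depth, left, right, "leaves:", right - left + 1)
```

For this string the intervals correspond to the nodes abb, ab, babb, bb, b and the root, and `right - left + 1` is the number of leaves below each node, that is, the frequency of its label.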

Suffix array
Lexicographically sort the suffixes of ababbbabbc$.

index         1  2  3  4  5  6  7  8  9 10 11
Suffix Array  1  3  7  2  6  5  4  8  9 10 11

[Figure: the suffix tree of ababbbabbc$ shown above the suffix array; the array lists the leaves of the tree from left to right.]

From now on we consider the algorithm that uses the suffix array. First, I will introduce some data structures. A suffix array is an array of suffix indices sorted in the lexicographic order of the suffixes. This array corresponds to the leaves of the suffix tree.
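A naive construction is enough to reproduce the array on this slide. The Python sketch below simply sorts the suffix start positions; it is quadratic or worse in the worst case and is not a linear-time construction. One assumption is needed to match the slide: the end marker $ is treated as larger than every other character, which is the order used in the figure (a, b, c, $). The function name `suffix_array` is illustrative.

```python
def suffix_array(w, sentinel="$"):
    """0-based suffix array of w, with the sentinel sorted last."""
    def key(i):
        # Map the sentinel to +infinity so that it compares as the largest symbol.
        return [float("inf") if c == sentinel else ord(c) for c in w[i:]]
    return sorted(range(len(w)), key=key)


sa = suffix_array("ababbbabbc$")
print([i + 1 for i in sa])  # [1, 3, 7, 2, 6, 5, 4, 8, 9, 10, 11]
```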

Lcp array
String: ababbbabbc$
lcp[i]: the length of the longest common prefix of the i-th and (i - 1)-th suffixes in suffix array order
[Figure: the suffix tree of ababbbabbc$ above the lcp array and the suffix array, both indexed 1 to 11; the first lcp entry is -1.]

The lcp array contains the lengths of the longest common prefixes of adjacent suffixes in the suffix array.
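The lcp array can be filled in linear time with the method commonly attributed to Kasai et al. The Python sketch below is one way to write it, under assumptions that are not from the slides: 0-based indices, a suffix array passed in by the caller, and -1 stored in the first entry as in the figure.

```python
def lcp_array(w, sa):
    """lcp[i] = longest common prefix of suffixes sa[i - 1] and sa[i]; lcp[0] = -1."""
    n = len(w)
    rank = [0] * n
    for i, start in enumerate(sa):
        rank[start] = i
    lcp = [-1] * n
    h = 0
    for start in range(n):          # visit suffixes in text order
        r = rank[start]
        if r == 0:                  # no predecessor in the suffix array
            h = 0
            continue
        prev = sa[r - 1]
        while start + h < n and prev + h < n and w[start + h] == w[prev + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1                  # the next suffix shares at least h - 1 characters
    return lcp


sa = [0, 2, 6, 1, 5, 4, 3, 7, 8, 9, 10]   # suffix array of the example, 0-based
print(lcp_array("ababbbabbc$", sa))        # [-1, 2, 3, 0, 4, 1, 2, 2, 1, 0, 0]
```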

Rank array
String: ababbbabbc$
rank[SA[i]] = i
[Figure: the rank array shown above the suffix array, both indexed 1 to 11.]

The rank array is the inverse of the suffix array.
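Because the rank array is just the inverse permutation, it can be built in one pass. A short Python sketch with 0-based indices (the function name `rank_array` is illustrative):

```python
def rank_array(sa):
    """Inverse of the suffix array: rank[sa[i]] == i for every i."""
    rank = [0] * len(sa)
    for i, start in enumerate(sa):
        rank[start] = i
    return rank


sa = [0, 2, 6, 1, 5, 4, 3, 7, 8, 9, 10]   # suffix array of "ababbbabbc$", 0-based
rank = rank_array(sa)
print(rank)  # [0, 3, 1, 6, 5, 4, 2, 7, 8, 9, 10]
```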

Suffix array has less information
Information available during traversal for each data structure, when visiting node v:

Suffix Tree
1. label from root to each node
2. label from parent to each node
3. number of leaves in each subtree
4. parent of each node
5. children of each node
6. suffix link of each node

Suffix Array
・length of the label from the root to v
・length of the label from the root to the parent of v
・leftmost leaf ID in the subtree rooted at v
・rightmost leaf ID in the subtree rooted at v

When we use suffix arrays to simulate suffix tree traversal, the main issue is that suffix arrays generally carry less information than the suffix tree.

Suffix array has less information
[Figure: the suffix tree of ababbbabbc$ with one node highlighted in pink. Label length from root: 4. Length of parent label from root: 1.]

For example, consider the pink node. On the suffix tree, we can get all the information, such as suffix links, the parent node, the root and the leaves. On the suffix array, however, we only have the label length from the root, the label length of the parent from the root, and the lcp and rank arrays. Thus the suffix array algorithm seems to be more difficult than the suffix tree algorithm.

Suffix array algorithm

foreach v in suffix tree (simulated by suffix array) {
    if (node v is representative of EC [v]≡) {        (difficulty 1)
        follow suffix link;
        while (node is in EC [v]≡) {                  (difficulty 2)
            compute size and minimal strings;         (difficulty 3)
            follow suffix link;
        }
        output succinct representation of EC [v]≡;
    }
}

Here is the suffix array algorithm for computing the equivalence classes. It is the same as the suffix tree algorithm, except for the first line. The problem is how to compute these three parts when using a suffix array. They are difficult because the suffix array has less information. Next, we will show how each difficulty can be solved.

Solving difficulty 1 (representative judge)
Let l and r be the suffixes at the left and right ends of the suffix array interval of node x, and let
L = rank(l), R = rank(r), L' = rank(l - 1), R' = rank(r - 1).
Conditions:
1. R - L ≠ R' - L' (different number of leaves?)
2. w[l - 1] ≠ w[r - 1] (different first character?)
3. l - 1 = 0 or r - 1 = 0 (first character in the string?)
Lemma: $x = \overleftrightarrow{x} \iff$ 1, 2, or 3.

First, this is for judging whether a given node v is the representative of an equivalence class or not. In the paper, we show that these conditions correspond to whether a node is a representative or not.
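The three conditions can be written directly against the suffix and rank arrays. The following Python sketch is one reading of the lemma, not the authors' code: it uses 0-based positions (so condition 3 becomes l = 0 or r = 0), the function name `is_representative` is hypothetical, and the arrays are those of the example string with $ sorted last.

```python
def is_representative(w, sa, rank, left, right):
    """Check conditions 1-3 for the node whose suffix array interval is [left, right]."""
    l, r = sa[left], sa[right]
    if l == 0 or r == 0:            # condition 3: an occurrence starts the string
        return True
    if w[l - 1] != w[r - 1]:        # condition 2: different preceding characters
        return True
    # condition 1: the left-extended suffixes span a different number of leaves
    return (right - left) != (rank[r - 1] - rank[l - 1])


w = "ababbbabbc$"
sa = [0, 2, 6, 1, 5, 4, 3, 7, 8, 9, 10]
rank = [0, 3, 1, 6, 5, 4, 2, 7, 8, 9, 10]

print(is_representative(w, sa, rank, 3, 4))  # node "babb": True
print(is_representative(w, sa, rank, 1, 2))  # node "abb": False
```

The result for "abb" agrees with the suffix tree walk-through: "abb" is always preceded by "b", so it is not the longest member of its class.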

Solving difficulty 2 (equivalence relation judge)
Let v be the node whose label from the root is ax, with suffixes l and r at the left and right ends of its suffix array interval. The node whose label from the root is x contains the suffixes l + 1 and r + 1. Let
L = rank(l), R = rank(r), L' = rank(l + 1), R' = rank(r + 1).
Conditions:
1. R - L = R' - L' (same number of leaves?)
2. lcp(L') < |ax| - 1 (leftmost?)
3. lcp(R' + 1) < |ax| - 1 (rightmost?)
Lemma: $ax \equiv x \iff$ 1, 2, and 3.

Second, this is for judging whether following the suffix link leads to a node in the same equivalence class or not. We showed that these conditions can be used to check this.
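Here is a Python sketch of this check under the same reading of the slide. The assumptions are mine: 0-based indices, `lcp[0] = -1`, an lcp value of -1 past the end of the array, and the hypothetical function name `same_class_after_suffix_link`. The parameter `depth` is |ax|, the label length of the current node.

```python
def same_class_after_suffix_link(sa, lcp, rank, left, right, depth):
    """True if the node reached by the suffix link is equivalent to the
    node with suffix array interval [left, right] and label length depth."""
    def lcp_at(i):
        return lcp[i] if i < len(lcp) else -1

    l, r = sa[left], sa[right]
    new_left, new_right = rank[l + 1], rank[r + 1]
    return (right - left == new_right - new_left      # condition 1
            and lcp_at(new_left) < depth - 1          # condition 2
            and lcp_at(new_right + 1) < depth - 1)    # condition 3


sa = [0, 2, 6, 1, 5, 4, 3, 7, 8, 9, 10]       # "ababbbabbc$", '$' sorted last
lcp = [-1, 2, 3, 0, 4, 1, 2, 2, 1, 0, 0]
rank = [0, 3, 1, 6, 5, 4, 2, 7, 8, 9, 10]

print(same_class_after_suffix_link(sa, lcp, rank, 3, 4, 4))  # "babb" -> "abb": True
print(same_class_after_suffix_link(sa, lcp, rank, 1, 2, 3))  # "abb" -> "bb": False
```

No suffix link table is consulted; the target interval is located through the rank array alone.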

Solving difficulty 3 (size computation)
[Figure: three cases (case 1, case 2, case 3) showing a suffix array interval with indices L, R and R + 1; the label length of the parent is lcp(R + 1) in two of the cases and lcp(L) in the third. Caption: size = sum of this.]
Lemma: size = { lcp(R) - max{ lcp(L), lcp(R + 1) } }

Lastly, this is for computing the size of the equivalence classes. Using this expression, we can compute the size of the equivalence classes. Hence the suffix array algorithm can compute the equivalence classes using only the suffix, lcp and rank arrays.
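As a hedged illustration of the size computation, the Python sketch below takes the label length of the parent to be max{ lcp(L), lcp(R + 1) }, as in the cases above, and sums the edge label lengths over the nodes of one class, as a later slide describes. Unlike the slide's expression, the sketch is given the label length of each node explicitly (the `depth` argument); that parameter, the 0-based indexing and the function name `edge_label_length` are assumptions of the sketch.

```python
def edge_label_length(lcp, left, right, depth):
    """Length of the edge label from the parent to the node with interval [left, right]."""
    def lcp_at(i):
        return lcp[i] if 0 <= i < len(lcp) else -1

    parent_depth = max(lcp_at(left), lcp_at(right + 1))
    return depth - parent_depth


lcp = [-1, 2, 3, 0, 4, 1, 2, 2, 1, 0, 0]   # "ababbbabbc$", '$' sorted last

# The class {abb, babb, bab, ba} is made of the nodes "babb" and "abb".
size = (edge_label_length(lcp, 3, 4, 4)     # "babb": parent "b", edge length 3
        + edge_label_length(lcp, 1, 2, 3))  # "abb": parent "ab", edge length 1
print(size)  # 4
```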

Computational experiment
Comparison of algorithms:
・suffix tree
・CDAWG
・suffix array
Data: two English and two genome corpora (Canterbury corpus, Protein corpus)
Machine spec.: Red Hat Linux, 2.8 GHz CPU, 1 GB memory

We compared the algorithms using suffix trees, suffix arrays, and also a naïve algorithm using CDAWGs. We used four corpora.

Experimental result

                                                    time (sec)
data name          size (MB)  data structure  construction  enumeration  total   memory (MB)
cantrby/plrabn12   0.47       suffix tree     0.95          0.21         1.16    21.446
                              CDAWG           0.97          0.18         1.15    9.278
                              suffix array    0.43          0.14         0.57    5.392
ProteinCorpus/sc   2.8        suffix tree     12.08         1.43         13.51   121.887
                              CDAWG           12.76         1.12         13.88   69.648
                              suffix array    3.08          0.63         3.71    33.192
large/bible.txt    3.9        suffix tree     7.33          2.23         9.56    191.869
                              CDAWG           6.68          1.62         8.30    56.255
                              suffix array    4.71          1.50         6.21    46.319
large/E.coli       4.5        suffix tree     8.17          2.91         11.08   232.467
                              CDAWG           8.58          2.31         10.89   139.802
                              suffix array    5.95          1.46         7.41    53.086

These are the results. On each data set, we can see that the suffix array algorithm is the best in terms of both computation time and memory usage.

Application: spam detection
・Many copies of the same spam message are sent.
・The sizes of the equivalence classes formed by spam are larger than those formed by non-spam.
[Figure: plot of the number of equivalence classes against the size of the equivalence class; the points inside the red circle are spam.]

Using this algorithm, we developed a spam detection method. As you know, spam is a serious problem on the web. One characteristic of spam is that many copies of the same message are sent. Thus the equivalence classes formed by spam are larger than those formed by non-spam. On this graph, the points in the red circle are spam. This picture shows Japanese sushi made with Spam, but that kind of spam is not related to this study.

Application: spam detection
"Unsupervised Spam Detection based on String Alienness Measures" by Kazuyuki Narisawa, Hideo Bannai, Kohei Hatano and Masayuki Takeda. The Tenth International Conference on Discovery Science (DS '07), Sendai, Japan, 1-4 October 2007. Accepted.

This study has been accepted by the conference DS '07. If you are interested in our study and want to come to the conference, you should search not for "DS 07" but for "Discovery Science 2007".

Summary
・Presented an algorithm for computing the equivalence classes with a suffix array
  ・simulating traversal on the suffix tree and suffix links using only the lcp and rank arrays
  ・running in linear time and space
・Compared with other data structures
  ・less memory
  ・faster computation
・Can be applied to spam detection [DS '07]

To summarize, we proposed an algorithm for computing the equivalence classes with a suffix array, and showed experimentally that this algorithm is time and space efficient.

Thank you

That is all. Thank you.

Compute size of the EC
The size is the sum of the lengths of the labels from the parent to each node in the EC.
[Figure: suffix tree of ababbbabbc$; for the highlighted EC the sum is 1 + 3 = 4.]

Compute minimal strings of the EC
[Figure: a chain of nodes x, y, z joined by suffix links, with edge labels x1 ... xk, y1 ... yk and z1 ... zm.]
・case 1: if the node is the representative, the label "xx1" is one of the minimal strings.
・case 2: if the two label lengths satisfy k > m, the label "zz1" is one of the minimal strings.

Suffix tree
Each node has:
・parent
・leftmost child
・right sibling
・suffix link
・label of the incoming edge