Efficient Computation of Substring Equivalence Classes with Suffix Arrays
Kazuyuki Narisawa, Shunsuke Inenaga, Hideo Bannai and Masayuki Takeda
Kyushu University, Japan
This study is joint work with Shunsuke Inenaga, Hideo Bannai and Masayuki Takeda.

Contents
・Introduction
・Problem definition
・Suffix tree based algorithm
・Simulation by suffix array
・Computational experiment
・Application
・Summary
This is the outline of the presentation.

Main contribution
Time and space efficient computation of substring equivalence classes [Blumer et al. 1987] with suffix arrays
Our main contribution is an efficient algorithm, based on suffix arrays, for computing the substring equivalence classes defined by Blumer et al. The algorithm runs in linear time and space. Suffix tree and CDAWG based methods are also linear, but in our experiments the suffix array algorithm is faster and requires less memory.

Equivalence relation and classes
Given a string w, the maximal extension of a substring x is $\overleftrightarrow{x} = \alpha x \beta$, where
・Every time x occurs in w, it is preceded by α and followed by β.
・Strings α and β are longest possible.
equivalence relation: $x \equiv y \iff \overleftrightarrow{x} = \overleftrightarrow{y}$
equivalence class: $[x] = \{\, y \mid y \equiv x \,\}$ (substrings with essentially identical occurrences in w)
example: w = Betty–bought–a–bit–of–better–butter–and–made–a–better–batter–after–breakfast.
$\overleftrightarrow{\mathrm{bet}}$ = –better–b, so bet ∈ [–better–b]
First, I will explain the equivalence relation and the equivalence classes. Given a string w, the maximal extension of a substring x is defined as above. Based on the maximal extension, we can define the equivalence relation, and hence the equivalence classes, on the substrings of w. An equivalence class consists of substrings that appear essentially in the same places. For example, take as w the sentence “Betty bought a bit of better butter and made a better batter after breakfast.” (written with “–” in place of spaces on the slide). In this sentence, the substring “bet” is always preceded by “–” and followed by “ter–b”, and we cannot extend it any further without changing its occurrence frequency.
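The definition can be checked by brute force on small inputs. The Python sketch below is only an illustration of the definition, not the linear-time method of this talk; the function names `occurrences` and `maximal_extension` are hypothetical, and the sentence is written with ASCII hyphens in place of spaces.

```python
def occurrences(w, x):
    """Start positions (0-based) of every occurrence of x in w."""
    return [i for i in range(len(w) - len(x) + 1) if w.startswith(x, i)]


def maximal_extension(w, x):
    """Return alpha + x + beta, with alpha and beta as long as possible."""
    occ = occurrences(w, x)
    if not occ:
        raise ValueError("x must occur in w")
    left = 0  # length of alpha
    while min(occ) - left > 0 and len({w[i - left - 1] for i in occ}) == 1:
        left += 1
    right = 0  # length of beta
    end = max(occ) + len(x)
    while end + right < len(w) and len({w[i + len(x) + right] for i in occ}) == 1:
        right += 1
    return w[occ[0] - left:occ[0] + len(x) + right]


sentence = ("Betty-bought-a-bit-of-better-butter-and-"
            "made-a-better-batter-after-breakfast.")
print(maximal_extension(sentence, "bet"))  # -better-b
```

Two substrings are then equivalent exactly when `maximal_extension` returns the same string for both.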

Problem
Input: string w of length n
Output: the equivalence classes on w
Difficulty: The total number of elements in the equivalence classes (ECs for short) is O(n^2).
Solution: The number of ECs is O(n). Each EC can be succinctly represented in O(1) space.
This is the problem that we consider. The input is a string w of length n. The output is the equivalence classes on w. A difficulty of this problem is that the total number of elements in the equivalence classes is not linear in n. To overcome this difficulty, we consider a succinct representation that describes each equivalence class in constant space.

Succinct representation of the ECs
representative: the longest element (the maximal extension)
minimal strings: the elements that belong to another EC when the leftmost or rightmost character is deleted
$[x] = \mathrm{Substring}(\overleftrightarrow{x}) \cap \bigl( \bigcup_{y \text{ is minimal}} \mathrm{Superstring}(y) \bigr)$
The elements of [x] can be enumerated with the representative and the minimal strings.
example: w = Betty-bought-a-bit-of-better-butter-and-made-a-better-batter-after-breakfast.
[Figure: the representative “-better-b” and its minimal strings marked in w]
So we consider the succinct representation of the equivalence classes. The succinct representation consists of the representative and the minimal strings. An equivalence class is formed by the substrings of the representative that are superstrings of some minimal string. This example shows the representative and the minimal strings of the equivalence class of “-better-b”.
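As a rough illustration of how the succinct representation expands back into a class, the sketch below enumerates the substrings of a representative that contain at least one minimal string. The function name `enumerate_class` is hypothetical. The input used here, representative “babb” with minimal strings “ba” and “abb”, is an assumption derived by hand from the definition for the string ababbbabbc$ that appears later in the talk.

```python
def enumerate_class(representative, minimal_strings):
    """Substrings of the representative that contain some minimal string."""
    n = len(representative)
    substrings = {representative[i:j]
                  for i in range(n) for j in range(i + 1, n + 1)}
    return {s for s in substrings if any(y in s for y in minimal_strings)}


# Assumed example: EC of "babb" in ababbbabbc$ (minimal strings derived by hand).
print(sorted(enumerate_class("babb", ["ba", "abb"])))
# ['abb', 'ba', 'bab', 'babb']
```

This enumeration is quadratic in the length of the representative, which is why the talk outputs only the representation together with the size instead of the elements themselves.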

Problem
Input: string w of length n
Output: succinct representations of the equivalence classes on w
Additionally, for each EC we output
・size (the number of elements in the EC)
・frequency (the number of occurrences of the elements in the EC)
Thus, the output of our problem changes to succinct representations of the equivalence classes on w. In addition, this study outputs the size and the frequency of each equivalence class.
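To make the required output concrete, here is a brute-force reference sketch in Python. It is far from linear time and is not the algorithm of this talk. All function names are hypothetical, and minimal strings are taken to be the elements of a class that contain no other element of the same class.

```python
from collections import defaultdict


def occurrences(w, x):
    """Start positions (0-based) of every occurrence of x in w."""
    return [i for i in range(len(w) - len(x) + 1) if w.startswith(x, i)]


def maximal_extension(w, x):
    """Longest alpha + x + beta that accompanies every occurrence of x."""
    occ = occurrences(w, x)
    lo, hi = min(occ), max(occ) + len(x)
    left = right = 0
    while lo - left > 0 and len({w[i - left - 1] for i in occ}) == 1:
        left += 1
    while hi + right < len(w) and len({w[i + len(x) + right] for i in occ}) == 1:
        right += 1
    return w[occ[0] - left:occ[0] + len(x) + right]


def succinct_classes(w):
    """Map representative -> (minimal strings, size, frequency)."""
    members = defaultdict(set)
    for i in range(len(w)):
        for j in range(i + 1, len(w) + 1):
            members[maximal_extension(w, w[i:j])].add(w[i:j])
    out = {}
    for rep, elems in members.items():
        minimal = sorted(y for y in elems
                         if not any(z != y and z in y for z in elems))
        out[rep] = (minimal, len(elems), len(occurrences(w, rep)))
    return out


print(succinct_classes("ababbbabbc$")["babb"])  # (['abb', 'ba'], 4, 2)
```

A reference implementation of this kind is mainly useful for testing a faster implementation on short strings.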

Possible solutions
・Suffix Tree [Weiner 1973]
・Compact Directed Acyclic Word Graph (CDAWG) [Blumer et al. 1985]
ECs can be computed with either of these data structures in linear time and space.
To solve this problem, we consider several data structures. Using a suffix tree or a compact directed acyclic word graph, we can solve the problem in linear time and space.

Suffix tree (with suffix link)
w = ababbbabbc$
[Figure: suffix tree of ababbbabbc$ with leaves numbered by suffix start position (1 to 11) and suffix links drawn as red dotted arrows]
We therefore consider an algorithm based on the suffix tree. This is the suffix tree for the example string “ababbbabbc$”. We assume that the string ends with a distinct character $ that does not appear anywhere else in the string. The red dotted arrows are suffix links. A suffix link points to the node representing the string with the leftmost character deleted. For example, the suffix link of “babb” points to “abb”. In this presentation we ignore leaves, because each leaf forms a trivial EC.

Equivalence classes on suffix tree
w = ababbbabbc$
equivalence relation on the suffix tree: two nodes are connected by a suffix link and their subtrees have the same number of leaves
EC (by definition, substrings with essentially the same occurrences): abb, babb, bab, ba
[Figure: suffix tree of ababbbabbc$ with the nodes “babb” and “abb” and the strings of their EC highlighted]
Now we consider the equivalence classes on the suffix tree. For example, consider the substrings in the same equivalence class as the node “babb”. We can see that “bab” and “ba” are in the same equivalence class as “babb”, because they lie on the edge into that node, so the subtree below them has the same number of leaves, and the number of leaves is the number of occurrences. In addition, the nodes “babb” and “abb” are equivalent. We can judge whether two nodes are equivalent by checking whether they are connected by a suffix link and whether their subtrees have the same number of leaves. Thus these two nodes represent substrings in the same equivalence class, which consists of the substrings “abb”, “babb”, “bab” and “ba”. So an algorithm for computing the equivalence classes can be as follows.

Suffix tree algorithm
foreach node v in suffix tree {
    if (node v is representative of EC [v]≡) {
        follow suffix link;
        while (node is in EC [v]≡) {
            compute size and minimal strings;
            follow suffix link;
        }
        output succinct representation of EC [v]≡;
    }
}
This is an algorithm for computing the equivalence classes with a suffix tree. The algorithm is based on a traversal of the suffix tree.

Algorithm with suffix tree
・representative? (number of incoming suffix links = 1 or not?)
・not representative: continue suffix tree traversal
・representative: follow suffix link
・in same EC? (same number of leaves?)
・in other EC: go back to the representative
・output: representative, minimal strings, size, frequency
Suffix tree requires large memory space.
[Figure: step-by-step traversal of the suffix tree of ababbbabbc$]
Here is an example. First the algorithm traverses the suffix tree. We use a post-order traversal here. When visiting an interior node, we check whether the node is a representative or not. In this case, the node is not a representative because the number of incoming suffix links is 1, so the algorithm continues the suffix tree traversal. Next, we check whether this node is a representative or not. In this case, the node is a representative because the number of incoming suffix links is not 1. So, to find the other nodes in the same equivalence class, we follow the suffix link and check whether the node we reach is in the same equivalence class. In this case, it is not. Thus we go back to the representative and output the succinct representation. If the node is in the same equivalence class, we keep following suffix links as long as the nodes stay in the same equivalence class. After this, the algorithm proceeds in the same way. Hence, we can compute the equivalence classes using this algorithm. However, a problem of the suffix tree is that it requires a large amount of memory. So we consider the suffix array instead of the suffix tree.

Suffix array [Manber and Myers 1993]
・Can simulate traversal on suffix tree using lcp and rank arrays [Kasai et al. 2003]
・Can simulate traversal on suffix links using an additional data structure: suffix link table [Abouelhoda et al. 2004]
・Our algorithm: simulates traversal on suffix links without a suffix link table
Kasai et al. showed that traversal of suffix trees can be simulated on suffix arrays by using lcp and rank arrays. However, our algorithm also needs to follow suffix links. Abouelhoda et al. showed that traversal of suffix links can also be simulated on suffix arrays by creating a suffix link table. In our work, we show that our algorithm does not require a suffix link table and can be done with just the lcp and rank arrays.
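For readers who have not seen the simulation before, the sketch below shows one common way to enumerate the internal nodes of the suffix tree bottom-up from the lcp array alone, using a stack. It is a generic illustration and not necessarily the exact procedure of the cited papers. The function name `lcp_intervals` is hypothetical, and the hardcoded array assumes 1-based indexing, '$' sorted after all other characters, lcp[1] = -1 and an extra sentinel -1 at position n + 1.

```python
def lcp_intervals(lcp, n):
    """Yield (depth, L, R) for every internal node, children before parents."""
    stack = [(0, 1)]  # pairs (label length, left boundary)
    for i in range(2, n + 2):
        cur = lcp[i]
        left = i - 1
        while stack and cur < stack[-1][0]:
            depth, left = stack.pop()
            yield depth, left, i - 1  # interval [left, i - 1] is complete
        if not stack or cur > stack[-1][0]:
            stack.append((cur, left))


# Assumed lcp array of ababbbabbc$ (index 0 unused, sentinel -1 at the end).
LCP = [0, -1, 2, 3, 0, 4, 1, 2, 2, 1, 0, 0, -1]
for node in lcp_intervals(LCP, 11):
    print(node)
# (3, 2, 3)   abb
# (2, 1, 3)   ab
# (4, 4, 5)   babb
# (2, 6, 8)   bb
# (1, 4, 9)   b
# (0, 1, 11)  root
```

Each triple gives the label length of a node and the range of suffix array indices covered by its subtree, which is the information the later slides assume is available when visiting a node.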

Suffix array
Lexicographically sort the suffixes of w = ababbbabbc$
index i:  1  2  3  4  5  6  7  8  9  10  11
SA[i]:    1  3  7  2  6  5  4  8  9  10  11
[Figure: the suffix array drawn below the leaves of the suffix tree]
From now on we consider the algorithm that uses the suffix array. First, I will introduce some data structures. A suffix array is an array of suffix indices sorted in lexicographic order of the suffixes. This array corresponds to the leaves of the suffix tree.
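A minimal way to reproduce this array is to sort the suffixes directly. The sketch below is quadratic or worse and is only meant to show the definition. The function name `suffix_array` is hypothetical. To match the order on the slide, '$' is assumed to sort after every other character, which is done here by replacing it with '~' in the sort key (this assumes the text contains no '~' and no character above it in ASCII).

```python
def suffix_array(w):
    """1-based suffix array (index 0 unused); '$' sorts last by assumption."""
    def key(i):
        return w[i - 1:].replace("$", "~")
    return [None] + sorted(range(1, len(w) + 1), key=key)


sa = suffix_array("ababbbabbc$")
print(sa[1:])  # [1, 3, 7, 2, 6, 5, 4, 8, 9, 10, 11]
```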

Lcp array
w = ababbbabbc$
lcp[i]: the length of the longest common prefix of the i-th and (i – 1)-th suffixes in the suffix array
[Figure: suffix array and lcp array for indices 1 to 11 shown below the suffix tree; an entry of -1 appears in the lcp array]
The lcp array contains the lengths of the longest common prefixes of adjacent suffixes in the suffix array.
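The lcp array of the example can be recomputed with the naive sketch below, which compares adjacent suffixes character by character (a linear-time construction exists but is not shown). The names are hypothetical. The value lcp[1] = -1 for the first entry, which has no predecessor, and the choice that '$' sorts last are assumptions made to match the slides.

```python
def suffix_array(w):
    """1-based suffix array (index 0 unused); '$' sorts last by assumption."""
    return [None] + sorted(range(1, len(w) + 1),
                           key=lambda i: w[i - 1:].replace("$", "~"))


def common_prefix_length(a, b):
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return k


def lcp_array(w, sa):
    """lcp[i] = LCP of the suffixes at SA[i-1] and SA[i]; lcp[1] = -1."""
    lcp = [None, -1]
    for i in range(2, len(w) + 1):
        lcp.append(common_prefix_length(w[sa[i - 1] - 1:], w[sa[i] - 1:]))
    return lcp


w = "ababbbabbc$"
print(lcp_array(w, suffix_array(w))[1:])
# [-1, 2, 3, 0, 4, 1, 2, 2, 1, 0, 0]
```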

Rank array
w = ababbbabbc$
rank[SA[i]] = i
[Figure: rank array shown above the suffix array for indices 1 to 11]
The rank array is the inverse of the suffix array.
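Inverting the suffix array takes a single pass. In this sketch the function name `rank_array` is hypothetical, and the suffix array of ababbbabbc$ is written out by hand (1-based, index 0 unused).

```python
def rank_array(sa):
    """Inverse permutation of a 1-based suffix array: rank[sa[i]] = i."""
    rank = [None] * len(sa)
    for i in range(1, len(sa)):
        rank[sa[i]] = i
    return rank


sa = [None, 1, 3, 7, 2, 6, 5, 4, 8, 9, 10, 11]  # suffix array of ababbbabbc$
rank = rank_array(sa)
print(rank[1:])  # [1, 4, 2, 7, 6, 5, 3, 8, 9, 10, 11]
```

With `rank`, the suffix array index of the suffix starting at text position p is available in constant time, which is what the later checks on positions l ± 1 and r ± 1 rely on.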

Suffix array has less information
Information available during traversal for each data structure, when visiting node v
Suffix Tree:
1. label from root to each node
2. label from parent to each node
3. number of leaves in each subtree
4. parent of each node
5. children of each node
6. suffix link of each node
Suffix Array:
・length of label from root to v
・length of label from root to the parent of v
・leftmost leaf ID in the subtree rooted at v
・rightmost leaf ID in the subtree rooted at v
When we use suffix arrays to simulate suffix tree traversal, the main issue is that a suffix array generally provides less information than the suffix tree.

Suffix array has less information
[Figure: suffix tree of ababbbabbc$ with one node highlighted in pink, shown above the suffix array]
label length from root: 4
length of parent label from root: 1
For example, consider the pink node. On the suffix tree, we can get all of its information, such as suffix links, the parent node, the root and the leaves. On the suffix array, however, we only have the label length from the root, the label length of the parent from the root, and the lcp and rank arrays. Thus the suffix array algorithm seems to be more difficult than the suffix tree algorithm.

Suffix array algorithm
foreach v in suffix tree (simulated by suffix array) {
    if (node v is representative of EC [v]≡) {            (difficulty 1)
        follow suffix link;
        while (node is in EC [v]≡) {                       (difficulty 2)
            compute size and minimal strings;             (difficulty 3)
            follow suffix link;
        }
        output succinct representation of EC [v]≡;
    }
}
Here is the suffix array algorithm for computing the equivalence classes. It is the same as the suffix tree algorithm, except for the first line. The problem is how to compute these three parts when using a suffix array. They are difficult because the suffix array has less information. Next, we will show how each difficulty can be solved.

Solving difficulty 1 (representative judge)
Suffix array indices: L = rank(l), R = rank(r), L' = rank(l – 1), R' = rank(r – 1)
Lemma: $x = \overleftrightarrow{x}$ ⇔ condition 1, 2, or 3 holds:
1. R – L ≠ R' – L' (different number of leaves?)
2. w[l – 1] ≠ w[r – 1] (different first character?)
3. l – 1 = 0 or r – 1 = 0 (first character in the string?)
First, this is for judging whether a given node v is a representative of an equivalence class or not. In the paper, we show that these conditions correspond to whether a node is a representative or not.
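The three conditions translate almost literally into code. The sketch below is one possible reading of the lemma, assuming that [L, R] is the suffix array interval of node v, that l = SA[L] and r = SA[R], and that all arrays are 1-based. The name `is_representative` is hypothetical, and the arrays for ababbbabbc$ are written out by hand.

```python
W = "ababbbabbc$"
SA = [0, 1, 3, 7, 2, 6, 5, 4, 8, 9, 10, 11]    # 1-based, index 0 unused
RANK = [0, 1, 4, 2, 7, 6, 5, 3, 8, 9, 10, 11]  # RANK[SA[i]] == i


def is_representative(w, sa, rank, L, R):
    """True if the node with suffix array interval [L, R] is a representative."""
    l, r = sa[L], sa[R]
    if l - 1 == 0 or r - 1 == 0:        # condition 3: occurrence at the start
        return True
    if w[l - 2] != w[r - 2]:            # condition 2: w[l-1] != w[r-1] (1-based)
        return True
    return (R - L) != (rank[r - 1] - rank[l - 1])  # condition 1


print(is_representative(W, SA, RANK, 4, 5))  # node "babb" -> True
print(is_representative(W, SA, RANK, 2, 3))  # node "abb"  -> False
```

In the example, “abb” is not a representative because both of its occurrences are preceded by “b”, so it extends to “babb”.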

Solving difficulty 2 (equivalence relation judge)
v: node whose label from the root is ax; its suffix link leads to the node whose label from the root is x
Suffix array indices: L = rank(l), R = rank(r), L' = rank(l + 1), R' = rank(r + 1)
Lemma: ax ≡ x ⇔ conditions 1, 2 and 3 all hold:
1. R – L = R' – L' (same number of leaves?)
2. lcp(L') < |ax| – 1 (leftmost?)
3. lcp(R' + 1) < |ax| – 1 (rightmost?)
Second, this is for judging whether following the suffix link leads to a node in the same equivalence class or not. We showed that these conditions can be used to check this.
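A possible rendering of this check in Python is sketched below. It assumes that [L, R] is the interval of the node labelled ax, that `depth` is |ax|, that all arrays are 1-based, and that the lcp array carries a sentinel -1 at position n + 1 so that lcp(R' + 1) is always defined. The name `link_stays_in_class` is hypothetical, and the arrays for ababbbabbc$ are written out by hand.

```python
SA = [0, 1, 3, 7, 2, 6, 5, 4, 8, 9, 10, 11]      # 1-based, index 0 unused
RANK = [0, 1, 4, 2, 7, 6, 5, 3, 8, 9, 10, 11]    # RANK[SA[i]] == i
LCP = [0, -1, 2, 3, 0, 4, 1, 2, 2, 1, 0, 0, -1]  # sentinel -1 at index n + 1


def link_stays_in_class(sa, rank, lcp, L, R, depth):
    """True if the suffix link target x of node ax = [L, R] satisfies ax ≡ x."""
    l, r = sa[L], sa[R]
    Lp, Rp = rank[l + 1], rank[r + 1]   # L' and R' on the slide
    return (R - L == Rp - Lp            # condition 1: same number of leaves
            and lcp[Lp] < depth - 1     # condition 2: L' is the leftmost leaf of x
            and lcp[Rp + 1] < depth - 1)  # condition 3: R' is the rightmost leaf of x


print(link_stays_in_class(SA, RANK, LCP, 4, 5, 4))  # babb -> abb: True
print(link_stays_in_class(SA, RANK, LCP, 2, 3, 3))  # abb -> bb:  False
```

When the check succeeds, [L', R'] is the interval of x, so the walk along suffix links can continue from there without a suffix link table.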

Solving difficulty 3 (size computation)
[Figure: three cases for a node with suffix array interval [L, R]; the label length of the parent is lcp(R + 1) in cases 1 and 2 and lcp(L) in case 3]
Lemma: $\text{size} = \sum \{\, lcp(R) - \max\{\, lcp(L),\ lcp(R+1) \,\} \,\}$, where the sum is taken over the nodes of the EC
Lastly, this is for computing the size of the equivalence classes. Using this expression, we can compute the size of each equivalence class. Hence the suffix array algorithm can compute the equivalence classes using only the suffix, lcp and rank arrays.
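The size formula can be tried on the running example. The sketch below sums, over the nodes of one EC, the label length of the node minus the label length of its parent. As an assumption of this sketch, the node's label length is computed as the minimum of lcp[L+1..R], which is the general lcp-interval value and is not read off the slide. The name `class_size` is hypothetical, and the lcp array of ababbbabbc$ is written out by hand.

```python
LCP = [0, -1, 2, 3, 0, 4, 1, 2, 2, 1, 0, 0, -1]  # 1-based, sentinel at n + 1


def class_size(lcp, intervals):
    """Number of substrings in the EC formed by the given node intervals."""
    total = 0
    for L, R in intervals:
        depth = min(lcp[L + 1:R + 1])      # label length of the node
        parent = max(lcp[L], lcp[R + 1])   # label length of its parent
        total += depth - parent
    return total


# EC of "babb": node "babb" = [4, 5] and node "abb" = [2, 3].
print(class_size(LCP, [(4, 5), (2, 3)]))  # 3 + 1 = 4
```

The result agrees with the four elements abb, babb, bab and ba listed for this class earlier in the talk.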

Computational experiment
Comparison of algorithms
・suffix tree
・CDAWG
・suffix array
Data: two English and two Genome corpora (Canterbury corpus, Protein corpus)
Machine spec.: Red Hat Linux, CPU 2.8 GHz, 1 GB memory
We compared the algorithms using suffix trees, suffix arrays, and also a naïve algorithm using CDAWGs. We used four corpora.

Experimental result
Time in seconds; memory in MB.
data (size)                  data structure  construction  enumeration  total   memory
cantrby/plrabn12 (0.47 MB)   suffix tree     0.95          0.21         1.16    21.446
                             CDAWG           0.97          0.18         1.15    9.278
                             suffix array    0.43          0.14         0.57    5.392
ProteinCorpus/sc (2.8 MB)    suffix tree     12.08         1.43         13.51
                             CDAWG           12.76         1.12         13.88
                             suffix array    3.08          0.63         3.71
large/bible.txt (3.9 MB)     suffix tree     7.33          2.23         9.56
                             CDAWG           6.68          1.62         8.30
                             suffix array    4.71          1.50         6.21
large/E.coli (4.5 MB)        suffix tree     8.17          2.91         11.08
                             CDAWG           8.58          2.31         10.89
                             suffix array    5.95          1.46         7.41
Memory values for the remaining data survive only in part: ProteinCorpus/sc 69.648, 33.192; large/bible.txt 56.255, 46.319; large/E.coli 53.086.
These are the results. On each data set, we can see that the suffix array algorithm is the best in terms of computation time and memory usage.

Application : spam detection
SPAM: many copies of the same message are sent.
The sizes of the equivalence classes formed by spam are larger than those formed by non-spam.
[Figure: scatter plot with axes “the number of the equivalence class” and “the size of the equivalence class”; the points for spam are circled in red]
Using this algorithm, we developed a spam detection method. As you know, spam is a serious problem on the web. A characteristic of spam is that many copies of the same message are sent. Thus the sizes of the equivalence classes formed by spam are larger than those formed by non-spam. On this graph, the points inside the red circle are spam. This picture shows Japanese sushi made with Spam, but that kind of spam is not related to this study.

Application : spam detection
“Unsupervised Spam Detection based on String Alienness Measures” by Kazuyuki Narisawa, Hideo Bannai, Kohei Hatano and Masayuki Takeda
The Tenth International Conference on Discovery Science (DS ’07), Sendai, Japan, 1-4 October 2007 (accepted)
This study has been accepted by the conference DS ’07. If you are interested in our study and want to come to the conference, search for “Discovery Science 2007” rather than “DS 07”.

Summary
・Presented an algorithm for computing the equivalence classes with suffix arrays
　・simulating traversal on suffix tree + suffix links using only lcp and rank arrays
　・running in linear time and space
・Compared with other data structures
　・less memory
　・faster computation
・Can be applied to spam detection [DS ’07]
To summarize, we proposed an algorithm for computing the equivalence classes with suffix arrays, and showed experimentally that this algorithm is time and space efficient.

Thank you
That is all. Thank you.

Compute size of the EC
size = sum of the lengths of the labels from the parent to each node of the EC
[Figure: suffix tree of ababbbabbc$]
example: 1 + 3 = 4

Compute minimal strings of the EC
[Figure: paths in the suffix tree with edge labels x, x1, x2, ..., xk; y, y1, ..., yk; and z, z1, z2, ..., zm]
case 1: if the node is the representative, the label “xx1” is one of the minimal strings
case 2: if the two label lengths satisfy k > m, the label “zz1” is one of the minimal strings

suffix tree
each node has:
・parent
・leftmost child
・right sibling
・suffix link
・label of the incoming edge
