Reachability on Suffix Tree Graphs


Reachability on Suffix Tree Graphs Yasuto Higa, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda Department of Informatics, Kyushu University

Suffix Tree (P. Weiner, 1973)
A suffix trie is a kind of index structure. The suffix trie of a text T is a trie representing the set of suffixes of T.
A suffix tree is a compacted suffix trie in which the nodes of out-degree 1 are removed. As a result, a suffix tree needs only O(n) space.
We assume that the last character of T is "$", which occurs nowhere else in T.
Space (n: length of T)
  Suffix trie: O(n^2)
  Suffix tree: O(n)
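The space gap between the two structures can be checked on small inputs. The following is a minimal sketch, not a practical construction: it counts nodes by brute force over all substrings, and the function names `suffix_trie_size` and `suffix_tree_size` are hypothetical helpers introduced only for this illustration. It assumes the text ends with a unique "$" and contains at least two distinct characters.

```python
def suffix_trie_size(text):
    """Nodes in the suffix trie: one per distinct non-empty substring, plus the root."""
    substrings = {
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
    }
    return len(substrings) + 1  # +1 for the root (empty string)


def suffix_tree_size(text):
    """Nodes in the suffix tree of a text that ends with a unique '$'."""
    # For every substring occurrence, record which character follows it.
    followers = {}
    for i in range(len(text)):
        for j in range(i, len(text)):
            followers.setdefault(text[i:j], set()).add(text[j])
    # Internal nodes (including the root) are the substrings that branch,
    # i.e. that are followed by at least two different characters.
    internal = sum(1 for chars in followers.values() if len(chars) >= 2)
    # Because '$' is unique, every suffix ends in its own leaf.
    return internal + len(text)


if __name__ == "__main__":
    t = "ababbabbba$"
    print(suffix_trie_size(t), suffix_tree_size(t))
```

For T = ababbabbba$ the suffix tree count comes out as 20, which matches the 20 numbered nodes in the figures of this talk. The suffix trie count grows with the number of distinct substrings, which is where the quadratic bound comes from.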

Reachable on suffix tree
T = ababbabbba$
[Figure: the suffix tree of T, with two nodes u and v marked.]
A leaf node of the suffix tree represents a suffix of T. An internal node of the suffix tree represents a substring of T.
Important property: u is reachable from v on the suffix tree exactly when the string of u is a prefix of the string of v.

Suffix Links
T = ababbabbba$
[Figure: the suffix tree of T with suffix links drawn as dotted red lines; the nodes for abb and babb are highlighted.]
The suffix link of a node points to the node that represents the suffix obtained by removing the first character of its string. For example, the suffix link of the node for babb points to the node for abb.

Suffix Link Tree
T = ababbabbba$
[Figure: the suffix links of the suffix tree of T.]
The suffix links also form a tree. We call it the "Suffix Link Tree".

Reachable on suffix link tree
T = ababbabbba$
[Figure: the suffix link tree of T, with its root, its leaves, and two nodes u and v marked.]
Important property: u is reachable from v on the suffix link tree exactly when the string of u is a suffix of the string of v.

Suffix Tree Graph = Suffix Tree + Suffix Link Tree
[Figure: the suffix tree of T drawn together with its suffix links.]

Substring inclusion problem
T = ababbabbba$
Input: two nodes u, v of a suffix tree.
Output: whether or not the string of u is a substring of the string of v.
[Figure: two choices of u and v on the suffix tree of T (node strings shown include ab, bba and babb), one where the answer is "yes" and one where the answer is "no".]
Two nodes are given, and the answer is whether or not the string of u is a substring of that of v.

Reachability on Suffix Tree Graph
T = ababbabbba$
The substring inclusion problem reduces to the reachability problem on the Suffix Tree Graph.
[Figure: the two examples from the previous slide, drawn on the suffix tree graph of T.]
In the first example, u is reachable from v, so the string of u is a substring of that of v. In the second example, u is not reachable from v, so the string of u is not a substring of that of v.
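The equivalence between substring inclusion and reachability can be checked directly on a small text. The sketch below is an illustration under stated assumptions, not the construction used in the talk: nodes are identified with the strings they represent and found by brute force, and the names `suffix_tree_nodes`, `parent` and `reachable` are hypothetical. From a node, the walk may move to its suffix tree parent (a prefix) or follow its suffix link (drop the first character).

```python
from collections import deque


def suffix_tree_nodes(text):
    """Strings of all suffix tree nodes: root, branching substrings, and suffixes."""
    followers = {}
    for i in range(len(text)):
        for j in range(i, len(text)):
            followers.setdefault(text[i:j], set()).add(text[j])
    nodes = {s for s, chars in followers.items() if len(chars) >= 2}
    nodes.update(text[i:] for i in range(len(text)))  # leaves
    nodes.add("")  # root
    return nodes


def parent(s, nodes):
    """Suffix tree parent: the longest proper prefix of s that is a node."""
    for k in range(len(s) - 1, -1, -1):
        if s[:k] in nodes:
            return s[:k]


def reachable(u, v, nodes):
    """True if u is reached from v via parent edges and suffix links."""
    seen, queue = {v}, deque([v])
    while queue:
        s = queue.popleft()
        if s == u:
            return True
        if s == "":
            continue  # the root has no outgoing edge
        for nxt in (parent(s, nodes), s[1:]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


if __name__ == "__main__":
    nodes = suffix_tree_nodes("ababbabbba$")
    print(reachable("abb", "babb", nodes))  # abb is a substring of babb
    print(reachable("bba", "babb", nodes))  # bba is not a substring of babb
```

The pairs (abb, babb) and (bba, babb) are chosen here for illustration from node strings that appear in the figures; they are not necessarily the u and v of the slide. This breadth-first search is the "no preprocessing" graph solution, so each query may visit every node.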

Naïve solutions to substring inclusion problem

                                        preprocessing   query processing
String matching
  no preprocessing                      -               O(n)
  precomputing all possible queries     O(n^3)          O(1)
Graph reachability
  no preprocessing                      -               O(n)
  precomputing transitive closure       O(n^2)          O(1)

We consider the case where the text T is fixed and a lot of substring inclusion queries are performed. The problem can be solved; the question is how efficiently.
Goal: an efficient algorithm for the case where T is fixed and many substring inclusion queries are performed.

Interval labeling (R. Agrawal et al., 1989)
An algorithm for reachability on DAGs.
[Figure: nodes u and v of a DAG with interval labels such as (1,6), (1,5), (15,16), (15,15) and (18,18); the special label of each node is drawn in red.]
The algorithm assigns interval labels to each node of the DAG. One of these labels is the special label, which represents the node itself.
To check whether or not u is reachable from v, we only have to check whether the special label of v is subsumed by some interval label of u.
The number of labels of each node depends on the structure of the suffix tree.
Query processing time is proportional to the number of interval labels of the node u.
The total number of interval labels is O(n^2) in the worst case for general DAGs.
Let's use this algorithm on the suffix tree graph.

Agrawal labeling algorithm on Suffix Tree Graph
Input: Suffix Tree Graph
Output: labeled Suffix Tree Graph

foreach node v in post order on suffix link tree do
    v.special := [minimum post order number in the subtree of v, post order number of v];
    v.labels := { v.special };
endfch
foreach node v in post order on suffix tree do
    v.labels := merge v.labels and { labels of children of v };
    remove every interval of v.labels that is subsumed by another interval of v.labels;
endfch
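The two passes of the pseudocode can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: nodes are identified with their strings and found by brute force (fine only for short texts), children are visited in sorted order (the slide does not fix an order, so the numbers may differ from the figures), and the names `label_suffix_tree_graph` and `is_substring_node` are hypothetical.

```python
def suffix_tree_nodes(text):
    """Strings of all suffix tree nodes: root, branching substrings, and suffixes."""
    followers = {}
    for i in range(len(text)):
        for j in range(i, len(text)):
            followers.setdefault(text[i:j], set()).add(text[j])
    nodes = {s for s, chars in followers.items() if len(chars) >= 2}
    nodes.update(text[i:] for i in range(len(text)))
    nodes.add("")
    return nodes


def label_suffix_tree_graph(text):
    nodes = suffix_tree_nodes(text)
    st_children = {s: [] for s in nodes}  # suffix tree
    sl_children = {s: [] for s in nodes}  # suffix link tree
    for s in nodes:
        if s:
            k = len(s) - 1
            while s[:k] not in nodes:  # longest proper prefix that is a node
                k -= 1
            st_children[s[:k]].append(s)
            sl_children[s[1:]].append(s)  # suffix link: drop the first character

    # Pass 1: post order on the suffix link tree gives the special labels.
    special, counter = {}, 0

    def number(s):
        nonlocal counter
        low = None
        for c in sorted(sl_children[s]):
            number(c)
            if low is None:
                low = special[c][0]  # first child holds the subtree minimum
        counter += 1
        special[s] = (counter if low is None else low, counter)

    number("")

    # Pass 2: post order on the suffix tree merges the labels of the children
    # and drops every interval subsumed by another one.
    labels = {}

    def merge(s):
        intervals = {special[s]}
        for c in st_children[s]:
            merge(c)
            intervals.update(labels[c])
        kept = []
        for lo, hi in sorted(intervals, key=lambda iv: (iv[0], -iv[1])):
            if not kept or hi > kept[-1][1]:
                kept.append((lo, hi))
        labels[s] = kept  # sorted and pairwise disjoint

    merge("")
    return special, labels


def is_substring_node(u, v, special, labels):
    """u's string is a substring of v's iff some label of u subsumes v's special label."""
    lo_v, hi_v = special[v]
    return any(lo <= lo_v and hi_v <= hi for lo, hi in labels[u])
```

Intervals taken from one post order numbering are either nested or disjoint, which is why a single sorted sweep is enough to remove the subsumed ones. The recursion depth equals the tree height, so this sketch is meant for short texts only.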

Answering reachability query by interval labels
[Figure: the suffix tree graph of T = ababbabbba$ with post order numbers 1 to 20 and the interval labels of every node, for example (1,20) next to node 20.]
Only the result of the labeling algorithm is shown here. With these labels we can answer reachability queries efficiently.

Time complexity: query time
Each node has at most n labels, because the number of leaves of the suffix link tree is at most n.
The labels of each node can be sorted during preprocessing without increasing the time complexity.
Therefore, query time is O(log n) by using binary search.
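The binary search step can be sketched with Python's standard `bisect` module. This is a minimal illustration, assuming the labels of a node are stored as a sorted list of pairwise disjoint (lo, hi) intervals; the function name `covered` and the sample labels are hypothetical.

```python
from bisect import bisect_right


def covered(sorted_labels, number):
    """True if `number` lies inside one of the sorted, disjoint (lo, hi) intervals."""
    # Find the last interval whose lower end is <= number.
    idx = bisect_right(sorted_labels, (number, float("inf"))) - 1
    if idx < 0:
        return False
    lo, hi = sorted_labels[idx]
    return lo <= number <= hi


if __name__ == "__main__":
    # Illustrative labels of some node u; v is looked up by its post order number.
    labels_of_u = [(1, 9), (12, 13), (15, 16), (18, 18)]
    print(covered(labels_of_u, 12))
    print(covered(labels_of_u, 10))
```

Because the intervals of one node are disjoint after subsumed ones are removed, checking the post order number of v against the single candidate interval is enough.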

Time complexity: preprocessing time
Preprocessing time is proportional to the total number of interval labels (Lemma 2).
Therefore, we only have to count the total number of interval labels.

The expected total number of interval labels
Lemma: the total number of interval labels is at most O(n · (height of the tree)).
The expected height of the suffix tree of a random string is O(log n) (A. Apostolico et al., 1992).
Theorem 1: the expected total number of interval labels is O(n log n) for random strings.

Worst case
Lower bound: the following sequence of strings X_i gives a lower bound on the total number of labels.
X_1 = ab^1 ab^2 ab^1 ab^1 a$ (length 11)
X_2 = ab^2 ab^3 ab^2 ab^1 ab^2 ab^2 a$ (length 20)
X_3 = ab^3 ab^4 ab^3 ab^1 ab^3 ab^2 ab^3 ab^3 a$ (length 32)
…
Next we consider the lower bound for the algorithm.

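Structure of X_i
X_1 = ab^1 ab^2 ab^1 ab^1 a$
X_2 = ab^2 ab^3 ab^2 ab^1 ab^2 ab^2 a$
X_3 = ab^3 ab^4 ab^3 ab^1 ab^3 ab^2 ab^3 ab^3 a$
X_4 = ab^4 ab^5 ab^4 ab^1 ab^4 ab^3 ab^4 ab^2 ab^4 ab^4 a$
X_5 = ab^5 ab^6 ab^5 ab^1 ab^5 ab^4 ab^5 ab^2 ab^5 ab^3 ab^5 ab^5 a$
X_6 = ab^6 ab^7 ab^6 ab^1 ab^6 ab^5 ab^6 ab^2 ab^6 ab^4 ab^6 ab^3 ab^6 ab^6 a$
[Figure: each box stands for ab^i a. The exponents of the runs of b between the boxes follow two sequences, 1, 2, ..., k, ..., i/2 and i-1, i-2, ..., i-k, ..., i/2, where each pair k and i-k sums to i.]
Let's look at X_i in a little more detail. X_i is made of two interleaved sequences. Because of this regular order, the suffix tree of X_i has a very beautiful structure.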
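The construction can be written as a short generator. The rule used below is read off the six listed strings rather than stated in the talk, so treat it as an assumption: X_i starts with ab^i ab^(i+1), then alternates ab^i with ab^e where e runs through 1, i-1, 2, i-2, ... and finally i, and ends with a$. The names `exponent_order` and `build_x` are hypothetical.

```python
def exponent_order(i):
    """Exponents 1, i-1, 2, i-2, ... covering 1..i-1, followed by i."""
    lo, hi = 1, i - 1
    order = []
    while lo <= hi:
        order.append(lo)
        if lo != hi:
            order.append(hi)
        lo += 1
        hi -= 1
    order.append(i)
    return order


def build_x(i):
    """Build X_i from blocks of the form a b^k."""
    def block(k):
        return "a" + "b" * k

    parts = [block(i), block(i + 1)]
    for e in exponent_order(i):
        parts.append(block(i))
        parts.append(block(e))
    parts.append("a$")
    return "".join(parts)


if __name__ == "__main__":
    for i in range(1, 4):
        x = build_x(i)
        print(i, len(x), x)
```

With this rule the lengths for i = 1, 2, 3 come out as 11, 20 and 32, matching the lengths given for X_1 to X_3. Each X_i has about i blocks of length about i, so its length grows quadratically in i, which is consistent with i being on the order of √n.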

The number of interval labels of each node on the suffix tree of X_i
[Figure: the suffix tree of X_i, with edges labelled by runs such as b^i and b^k. The root is drawn in orange, leaf nodes as green squares, and internal nodes with a separate marker. The suffix links of the leaves are omitted.]
Since this suffix tree graph is so well structured, it is easy to count the interval labels on each node.

The number of interval labels of each node on the suffix tree of X_i
[Figure: the same suffix tree with only the nodes left in. The nodes are grouped into an increasing zone, a constant zone and a decreasing zone; the values shown on the nodes include 1, 2, 3, 4, ..., k-1, k, k+1, ..., i+1 and i+2.]
Only the nodes are left. This zone is the increasing zone: the numbers on the nodes increase as shown.
The root and each leaf have one interval label.

#X_i: the total number of labels for X_i
#X_i is the sum of the contributions of the increasing zone, the constant zone, the decreasing zone, and the root and leaves.
Here i is on the order of √n. Therefore the total number of labels for X_i is on the order of n√n.

Theorem 2
The total number of interval labels is Ω(n√n) in the worst case.
It is not difficult to prove that the upper bound is O(n^2), but is this bound tight?

Upper bound
A trivial upper bound is O(n^2), but is this bound tight?

Computational Experiments
n: the length of strings
U_n: the maximum total number of labels for n
We exhaustively enumerated all strings of length n consisting of a and b and ending with $. For each n, the number of labels in the worst case was recorded.

n      7    8    9   10   11   12   13   14   15   16
U_n   18   22   26   30   34   39   44   49   54   59

n     17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32   33
U_n   64   69   74   79   85   91   97  103  109  115  121  127  133  139  145  151  158

Can you see a pattern? F(n) is equal to U_n for the tested values of n up to 33. F(n) also has an amazing property: it is equal to the total number of labels of the strings X_i.
F(n) = U_n
F(n) = #X_i (for all i)

Summary of worst case bounds
U_n: the exact upper bound on the total number of interval labels, computed for n up to 33.
F(n): the inductively determined function for U_n. U_n = F(n) for n between 7 and 33.
#X_i: a lower bound on the total number of interval labels. F(n) = #X_i for all i.
Conjecture: the upper bound is also on the order of n√n.

Conclusion
We considered the substring inclusion problem and showed that it can be reduced to the reachability problem on Suffix Tree Graphs.
We showed bounds for Agrawal's interval labeling algorithm when applied to Suffix Tree Graphs.

                                    preprocessing                    query processing
no preprocessing                    -                                O(n)
precomputing transitive closure     O(n^2)                           O(1)
interval labeling (our results)     O(n log n) expected;             O(log n)
                                    on the order of n√n for X_i

Future work
Prove that the upper bound for this algorithm is on the order of n√n.

Optimal pattern from a pattern set
Set A is the set of strings that include pattern A, and set B is the set of strings that include pattern B.
If pattern A is a substring of pattern B, then set B is a subset of set A.
[Figure: a text containing pattern B, which in turn contains pattern A; set B drawn inside set A.]

Suffix Link Tree
T = ababbabbba$ (positions 1 to 11)
1. Assign a number to each node in post order.
[Figures: the suffix link tree of T, drawn with and without the suffix tree edge labels, before and after the nodes are numbered 1 to 20.]

Suffix Tree Graph (suffix tree with suffix links)
T = ababbabbba$ (positions 1 to 11)
1. Assign a number to each node in post order.
[Figures: the suffix tree graph of T, before and after the nodes are numbered 1 to 20.]

Agrawal
[Figure: the suffix tree graph of T with interval labels on its nodes, such as (1,11), (1,10), (1,9) and (10,11).]

Motivation

FAQ
What is the Suffix Tree Graph?
I'm going to answer these questions.

Outline
Background
  Distinguish the sets of strings
  More skillful pattern
  Substring and Suffix Tree
Main discussion
  Problem establishment
  Suffix Tree Graph (ST-Graph)
  How to use ST-Graph
  The complexity of the algorithm

Outline
  Technical term
  Pattern discovery problem (background)
  Motivation
  Problem establishment
  Suffix Tree Graph
  Labeling algorithm
  Complexity of the algorithm
  Future works

Technical term (1/2): Substring
For any string s ∈ Σ* with s = uvw:
  u is a prefix of s
  v is a substring of s
  w is a suffix of s

Technical term (2/2): Suffix Tree
Example: nonno$
[Figure: the suffix tree of nonno$, with edge labels such as n, o, no$, nno$ and $, and leaves numbered 1 to 6.]

Pattern Discovery problem
Find a pattern string that occurs in all strings of A and in no strings of B.

A:
  AKEBONO
  MUSASHIMARU
  CONTRIBUTIONS OF AI
  BEYOND MESSY LEARNING
  BASED ON LOCAL SEARCH ALGORITHMS
  BOOLEAN CLASSIFICATION
  SYMBOLIC TRANSFORMATION
  BACON SANDWICH
  PUBLICATION OF DISSERTATION

B:
  WAKANOHANA
  TAKANOHANA
  CONTRIBUTIONS OF UN
  TRADITIONAL APPROACHES
  GENETIC ALGORITHMS
  PROBABILISTIC RULE
  NUMERIC TRANSFORMATION
  PLAIN OMELETTE
  TOY EXAMPLES

Pattern Discovery problem More skillful pattern

Motivation
To build a pruning algorithm for pattern discovery algorithms.

Problem establishment Input : Output :

Theory of graphs
Graph theory offers no algorithm designed for reachability on the Suffix Tree Graph, so we have to build a new algorithm.
Our strategy is to put labels on the nodes.

Labeling algorithm

Suffix Tree Graph
Example: nonno$
[Figure: the suffix tree graph of nonno$, with leaves numbered 1 to 6 and interval labels such as (1,6), (1,5), (1,4), (7,8), (7,7) and (9,9) on its nodes.]
Naive algorithm: preprocessing time is proportional to the number of labels, so the number of labels determines the time complexity.


Summary of the background
○ (positive examples) × (negative examples)

Problem establishment

Suffix Tree Graph (ST-Graph)

How to use ST-Graph

Labeling algorithm

The complexity of the algorithm

Suffix Tree with Suffix Links (Suffix Tree Graph)
T = ababbabbba$ (positions 1 to 11)
[Figures: the suffix tree of T together with its suffix links.]

Suffix Tree
T = ababbabbba$ (positions 1 to 11)
Every node has a substring. For example, nodes of the suffix tree of T represent ba, abb, babb and babbba$.
[Figures: the suffix tree of T with the strings of several nodes written next to them.]

Suffix Link Tree
T = ababbabbba$ (positions 1 to 11)
[Figures: the suffix link tree of T, with the nodes for b, abb and babb marked, and the suffix tree of X_i with edges labelled b^i and b^k.]

Suffix Link Tree
T = ababbabbba$ (positions 1 to 11)
1. Assign a number to each node in post order.
Assign an interval label to each node. One endpoint is the number of the node, the other is the minimum number in its subtree.
[Figures: the suffix link tree of T with post order numbers 1 to 20 and interval labels such as (1,20), (1,11), (12,14) and (15,19), shown with and without the labels merged along the suffix tree.]

Suffix Tree of X_i
[Figure: the suffix tree of X_i, with edges labelled by runs such as b^i and b^k.]

General case
T = ababbabbba$ (positions 1 to 11)
[Figure: the suffix tree of T together with its suffix links.]
This graph is a DAG. We call it the "Suffix Tree Graph".