Everything is String. Closed Factorization Golnaz Badkobeh 1, Hideo Bannai 2, Keisuke Goto 2, Tomohiro I 2, Costas S. Iliopoulos 3, Shunsuke Inenaga 2,

Presentation transcript:

Closed Factorization
Golnaz Badkobeh 1, Hideo Bannai 2, Keisuke Goto 2, Tomohiro I 2, Costas S. Iliopoulos 3, Shunsuke Inenaga 2, Simon J. Puglisi 4, and Shiho Sugimoto 2
1. University of Sheffield, United Kingdom
2. Kyushu University, Japan
3. King’s College London, United Kingdom
4. University of Helsinki, Finland

Closed Strings
A closed string is a string with a proper substring that occurs as a prefix and a suffix but does not have internal occurrences [Fici, 2011]. This substring is the closing border.
–A string of length 1 is closed, where the closing border is the empty string ε.
A closed string has a unique closing border.
[Figure: example closed strings with their closing borders marked]
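The definition above can be checked directly on small strings. The following is a minimal sketch, not taken from the slides: the function name `closing_border` is hypothetical, and the check is a naive quadratic scan over all proper borders.

```python
def closing_border(s: str):
    """Return the closing border of s if s is closed, otherwise None.

    Naive check straight from the definition; illustrative only.
    """
    n = len(s)
    if n == 0:
        return None  # the slides do not discuss the empty string
    if n == 1:
        return ""  # a string of length 1 is closed; its border is the empty string
    # Try every non-empty proper border, longest first.
    for k in range(n - 1, 0, -1):
        b = s[:k]
        if s.endswith(b):
            # All starting positions at which b occurs in s.
            starts = [p for p in range(n - k + 1) if s.startswith(b, p)]
            if starts == [0, n - k]:  # prefix and suffix only, nothing internal
                return b
    return None


if __name__ == "__main__":
    for s in ["a", "ababa", "aaaa", "abca", "ab", "ababaa"]:
        print(repr(s), "->", repr(closing_border(s)))
```

A return value of `None` means the string is not closed; an empty string means the string has length 1.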

Our Contribution
We introduce the Longest Closed Factor Array of a string w and an algorithm which computes it in O(n log n / loglog n) time and O(n) space.
We introduce the Closed Factorization of a string w and an algorithm which computes it in O(n) time and space.
–n is the length of w.

Definition of Longest Closed Factor Array
The longest closed factor array of a string w of length n is an array A[1..n] of integers such that for any 1 ≤ i ≤ n, A[i] = l if and only if l is the length of the longest closed prefix of w[i..n].
Example string: w = ababaacbbbcbcc$
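To make the definition concrete, here is a brute-force sketch that evaluates it literally on the example string. It is not the algorithm of the talk: the names `is_closed` and `lcf_array_bruteforce` are hypothetical, positions are 0-based in the code, and the running time is far worse than the bounds stated in the slides.

```python
def is_closed(s: str) -> bool:
    """True if s is closed: length 1, or some non-empty proper border
    occurs only as a prefix and a suffix of s."""
    n = len(s)
    if n == 1:
        return True
    for k in range(1, n):
        if s[:k] == s[n - k:]:
            # Internal occurrences start strictly between 0 and n - k.
            internal = any(s.startswith(s[:k], p) for p in range(1, n - k))
            if not internal:
                return True
    return False


def lcf_array_bruteforce(w: str) -> list:
    """A[i] = length of the longest closed prefix of w[i:] (0-based i)."""
    n = len(w)
    return [
        max(l for l in range(1, n - i + 1) if is_closed(w[i:i + l]))
        for i in range(n)
    ]


if __name__ == "__main__":
    print(lcf_array_bruteforce("ababaacbbbcbcc$"))
```

Every entry is at least 1, because a single character is always a closed prefix.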

Computing Longest Closed Factor Array
Theorem 1: Given a string w of length n over an integer alphabet, the longest closed factor array of w can be computed in O(n log n / loglog n) time and O(n) space.

Computing Longest Closed Factor Array
Lemma 1: The longest prefix of w[i..n] which has another occurrence to the right of i is the closing border of the longest closed factor starting at i.
Example string: w = ababaacbbbcbcc$
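Lemma 1 already yields a simple (if slow) way to fill the array: find the longest prefix of the suffix that reoccurs further right, take its leftmost such occurrence j, and the factor ends where that occurrence ends. The sketch below assumes 0-based positions and uses naive character comparison in place of the suffix tree; `lcp` and `lcf_array_by_lemma` are hypothetical names.

```python
def lcp(w: str, i: int, j: int) -> int:
    """Length of the longest common prefix of w[i:] and w[j:], for i < j."""
    k = 0
    while j + k < len(w) and w[i + k] == w[j + k]:
        k += 1
    return k


def lcf_array_by_lemma(w: str) -> list:
    """Longest closed factor array via Lemma 1, in quadratic time or worse."""
    n = len(w)
    A = []
    for i in range(n):
        # Empty border by default: the factor is the single character w[i].
        best_len, best_j = 0, i + 1
        for j in range(i + 1, n):
            l = lcp(w, i, j)
            if l > best_len:  # strict '>' keeps the leftmost occurrence
                best_len, best_j = l, j
        # The factor spans w[i .. best_j + best_len - 1].
        A.append(best_j + best_len - i)
    return A


if __name__ == "__main__":
    print(lcf_array_by_lemma("ababaacbbbcbcc$"))
```

The length formula is the same j + |b_i| – i that appears in the outline of the algorithm; only the way b_i and j are found differs.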

Outline of Our Algorithm
1. Construct and preprocess the suffix tree of w.
2. i ← 1.
3. Compute the closing border b_i starting at position i.
 –with the suffix tree constructed in Step 1
4. Find the leftmost occurrence j of b_i in w[i+1..n].
 –with a range successor query
5. A[i] ← j + |b_i| – i.
6. i ← i + 1.
7. Repeat Steps 3~6 until i = n.

Step 1
Construct the suffix tree of a given string w.
Each leaf of the suffix tree stores the beginning position of the suffix corresponding to the leaf.
Any internal node v of the suffix tree is labeled by the maximum leaf value in the subtree rooted at v.
[Figure: suffix tree and suffix array (SA) of w = abaaba$]

Step 3
Compute the closing border b_i starting at position i.
–Find the highest node x labeled i.
–The path from the root to the parent of x is the closing border of the longest closed factor starting at position i.
[Figure: suffix tree of abaaba$]

Step 3
[Figure: suffix tree of w, showing the root, node x and its parent u, with pathlabel(u) and pathlabel(x) marked on w from position i]
x: the highest node labeled i
u: the parent of x
How do we find node x?

Step 3
Compute the closing border b_i starting at position i.
–Find the highest node x labeled i.
Option 1: traverse the suffix tree from the root.
–O(|x|) time for a constant alphabet.
–O(|x| log n) time for an integer alphabet.
Option 2: an array P[1..n] enables us to find node x in O(1) time.
–P[i] contains a pointer to the highest node x in the tree for which i is the maximum leaf value.
–P can be computed in O(n) time with a pre-order traversal.
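Steps 1 and 3 can be illustrated on a small input with an uncompressed suffix trie standing in for the suffix tree. This is a sketch under that substitution, so it uses quadratic space rather than the linear space of the talk. `Node`, `build_labeled_trie` and `closing_borders` are hypothetical names, and the root is left out of the table P so that the parent of x always exists.

```python
class Node:
    def __init__(self, parent, depth):
        self.children = {}
        self.parent = parent
        self.depth = depth      # length of the path label from the root
        self.max_leaf = 0       # largest suffix start (1-based) below this node


def build_labeled_trie(w: str):
    """Naive suffix trie of w with max-leaf labels, plus the table P."""
    n = len(w)
    root = Node(None, 0)
    # Inserting suffixes by increasing start means the last write is the maximum.
    for i in range(1, n + 1):
        node = root
        node.max_leaf = i
        for ch in w[i - 1:]:
            child = node.children.get(ch)
            if child is None:
                child = Node(node, node.depth + 1)
                node.children[ch] = child
            node = child
            node.max_leaf = i
    # P[i] = highest non-root node labeled i, found by a pre-order traversal.
    P = [None] * (n + 1)
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node.children.values():
            if node is root or child.max_leaf != node.max_leaf:
                P[child.max_leaf] = child
            stack.append(child)
    return root, P


def closing_borders(w: str) -> list:
    """b_i for i = 1..n: the path label of the parent of P[i]."""
    _, P = build_labeled_trie(w)
    return [w[i - 1:i - 1 + P[i].parent.depth] for i in range(1, len(w) + 1)]


if __name__ == "__main__":
    print(closing_borders("abaaba$"))
```

Once P is built, each border is read off in constant time plus the cost of slicing the string, which mirrors the role of P in the slide.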

Step 4
[Figure: suffix tree of w, showing the root, node x and its parent u, leaves i and h, and pathlabel(u) and pathlabel(x) marked on w at positions i and h]
x: the highest node labeled i
u: the parent of x
h is the successor of i in the set of the leaf values.

Step 4
Compute the longest closed factor starting at position i.
–Use a range successor query data structure for the suffix array [Yu et al., 2011].
Each internal node v stores the beginning and ending positions of the corresponding range in the suffix array.
A range successor query needs O(log n / loglog n) time for each position i.
[Figure: suffix tree of abaaba$]
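The role of the range successor query can be sketched on a plain suffix array. The code below is illustrative only: it builds the suffix array by sorting, finds the range of a border by binary search over truncated suffixes, and answers the successor query with a linear scan rather than the structure of [Yu et al., 2011]. All function names are hypothetical and positions are 0-based.

```python
from bisect import bisect_left, bisect_right


def suffix_array(w: str) -> list:
    """Suffix array by direct sorting (fine for short examples)."""
    return sorted(range(len(w)), key=lambda i: w[i:])


def border_length(w: str, i: int) -> int:
    """|b_i|: longest prefix of w[i:] that also starts at some j > i (naive)."""
    best = 0
    for j in range(i + 1, len(w)):
        k = 0
        while j + k < len(w) and w[i + k] == w[j + k]:
            k += 1
        best = max(best, k)
    return best


def sa_range(w: str, sa: list, pat: str):
    """Half-open range [lo, hi) of suffix array entries whose suffix starts with pat."""
    m = len(pat)
    keys = [w[p:p + m] for p in sa]  # truncation keeps the sorted order
    return bisect_left(keys, pat), bisect_right(keys, pat)


def range_successor(sa: list, lo: int, hi: int, i: int):
    """Smallest value in sa[lo:hi] greater than i, or None (naive linear scan)."""
    later = [p for p in sa[lo:hi] if p > i]
    return min(later) if later else None


def lcf_array(w: str) -> list:
    sa = suffix_array(w)
    A = []
    for i in range(len(w)):
        b = border_length(w, i)
        if b == 0:
            A.append(1)  # empty border: the factor is a single character
            continue
        lo, hi = sa_range(w, sa, w[i:i + b])
        h = range_successor(sa, lo, hi, i)  # leftmost later occurrence of b_i
        A.append(h + b - i)
    return A


if __name__ == "__main__":
    print(suffix_array("abaaba$"))
    print(lcf_array("ababaacbbbcbcc$"))
```

Replacing `range_successor` with a proper range successor structure, and `border_length` with the suffix tree lookup of Step 3, is what brings the cost per position down to the bound stated in the slide.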

Our Result 1
Given a string w of length n over an integer alphabet, the longest closed factor array of w can be computed in O(n log n / loglog n) time and O(n) space.

Definition of Closed Factorization
The closed factorization of a string w of length n is a sequence (G_0, G_1, …, G_k) of strings such that G_0 = ε, w = G_1…G_k and, for each 1 ≤ j ≤ k, G_j is the longest closed prefix of w[|G_1…G_{j-1}|+1..n].
Example string: w = ababaacbbbcbcc$

Computing Closed Factorization
Theorem 2: Given a string w of length n over an integer alphabet, the closed factorization of w can be computed in O(n) time and space.

Outline of Our Algorithm
1. Construct and preprocess the suffix tree of w.
2. i ← 1.
3. Compute the closing border b_i starting at position i.
 –with the suffix tree constructed in Step 1
4. Find the leftmost occurrence j of b_i in w[i+1..n].
 –with the KMP algorithm
 –Stop the KMP algorithm as soon as j is found.
5. i ← j + |b_i|.
6. Repeat Steps 3~5 until i = n.

Algorithm of Closed Factorization
We can compute each factor G_j in O(|G_j|) time with the KMP algorithm.
Because the sum of the lengths of all factors is n, the total time to compute the closed factorization is O(n).
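The factorization loop can be sketched as follows. The KMP search stops at the first match, so each search reads only the characters of the factor being produced. The closing border, however, is found here by naive comparison instead of the suffix tree, so this sketch as a whole does not run in O(n) time. `failure`, `kmp_first` and `closed_factorization` are hypothetical names and positions are 0-based.

```python
def failure(p: str) -> list:
    """KMP failure function: f[q] = length of the longest proper border of p[:q+1]."""
    f = [0] * len(p)
    k = 0
    for q in range(1, len(p)):
        while k and p[q] != p[k]:
            k = f[k - 1]
        if p[q] == p[k]:
            k += 1
        f[q] = k
    return f


def kmp_first(text: str, start: int, pat: str) -> int:
    """Leftmost occurrence of non-empty pat in text at index >= start, or -1."""
    f = failure(pat)
    k = 0
    for q in range(start, len(text)):
        while k and text[q] != pat[k]:
            k = f[k - 1]
        if text[q] == pat[k]:
            k += 1
        if k == len(pat):
            return q - len(pat) + 1  # stop as soon as the first match is found
    return -1


def closed_factorization(w: str) -> list:
    n = len(w)
    factors = []
    i = 0
    while i < n:
        # Naive stand-in for the suffix tree lookup of the closing border length.
        b = 0
        for j in range(i + 1, n):
            k = 0
            while j + k < n and w[i + k] == w[j + k]:
                k += 1
            b = max(b, k)
        if b == 0:
            end = i + 1  # empty border: the factor is one character
        else:
            end = kmp_first(w, i + 1, w[i:i + b]) + b
        factors.append(w[i:end])
        i = end
    return factors


if __name__ == "__main__":
    print(closed_factorization("ababaacbbbcbcc$"))
```

Concatenating the returned factors gives back w, which is a quick sanity check on any input.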

Our Result 2
Given a string w of length n over an integer alphabet, the closed factorization of w can be computed in O(n) time and space.

Conclusion
We introduced the Longest Closed Factor Array of a string and proposed an algorithm which computes it in O(n log n / loglog n) time and O(n) space.
We introduced the Closed Factorization of a string and proposed an algorithm which computes it in O(n) time and space.

Open Problems
Can we efficiently compute the longest closed factor array without range successor queries?
Can we find the longest closed factor containing each position without the longest closed factor array?