Computing smallest and largest repetition factorization in O(n log n) time Hiroe Inoue, Yoshiaki Matsuoka, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai,

Slides:



Advertisements
Similar presentations
Introduction to Algorithms Graph Algorithms
Advertisements

Routing and Congestion Problems in General Networks Presented by Jun Zou CAS 744.
Introduction to Computer Science 2 Lecture 7: Extended binary trees
Longest Common Subsequence
Fast Algorithms For Hierarchical Range Histogram Constructions
22C:19 Discrete Structures Trees Spring 2014 Sukumar Ghosh.
HABATAKITAI Laboratory Everything is String. Computing palindromic factorization and palindromic covers on-line Tomohiro I, Shiho Sugimoto, Shunsuke Inenaga,
Approximation Algorithms Chapter 5: k-center. Overview n Main issue: Parametric pruning –Technique for approximation algorithms n 2-approx. algorithm.
Parallel Scheduling of Complex DAGs under Uncertainty Grzegorz Malewicz.
Refining Edits and Alignments Υλικό βασισμένο στο κεφάλαιο 12 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University.
Lectures on Network Flows
Chapter 3 The Greedy Method 3.
Complexity 15-1 Complexity Andrei Bulatov Hierarchy Theorem.
3 -1 Chapter 3 The Greedy Method 3 -2 The greedy method Suppose that a problem can be solved by a sequence of decisions. The greedy method has that each.
Rectangle Visibility Graphs: Characterization, Construction, Compaction Ileana Streinu (Smith) Sue Whitesides (McGill U.)
Greedy Algorithms Reading Material: Chapter 8 (Except Section 8.5)
CSE 421 Algorithms Richard Anderson Lecture 4. What does it mean for an algorithm to be efficient?
CS5371 Theory of Computation Lecture 1: Mathematics Review I (Basic Terminology)
GRAPH Learning Outcomes Students should be able to:
Minimal Spanning Trees What is a minimal spanning tree (MST) and how to find one.
© The McGraw-Hill Companies, Inc., Chapter 3 The Greedy Method.
Operations Research Assistant Professor Dr. Sana’a Wafa Al-Sayegh 2 nd Semester ITGD4207 University of Palestine.
Computing Left-Right Maximal Generic Words Takaaki Nishimoto, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda Kyushu University, Japan.
1 Chapter Elementary Notions and Notations.
Finding Characteristic Substrings from Compressed Texts Shunsuke Inenaga Kyushu University, Japan Hideo Bannai Kyushu University, Japan.
Kazunori Hirashima 1, Hideo Bannai 1, Wataru Matsubara 2, Kazuhiko Kusano 2, Akira Ishino 2, Ayumi Shinohara 2 1 Kyushu University, Japan 2 Tohoku University,
1 Combinatorial Algorithms Parametric Pruning. 2 Metric k-center Given a complete undirected graph G = (V, E) with nonnegative edge costs satisfying the.
An asymptotic lower bound for the maximal-number-of- runs function F. Franek & Q. Yang Dept. of Computing & Software McMaster University Hamilton, Ontario.
Computing longest common substring and all palindromes from compressed strings Wataru Matsubara 1, Shunsuke Inenaga 2, Akira Ishino 1, Ayumi Shinohara.
Ravello, /09C.E. On some researches... Chiara Epifanio.
Lecture 11 Algorithm Analysis Arne Kutzner Hanyang University / Seoul Korea.
ICS 253: Discrete Structures I Induction and Recursion King Fahd University of Petroleum & Minerals Information & Computer Science Department.
Semi-dynamic compact index for short patterns and succinct van Emde Boas tree 1 Yoshiaki Matsuoka 1, Tomohiro I 2, Shunsuke Inenaga 1, Hideo Bannai 1,
Keisuke Goto, Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda
Comp. Genomics Recitation 10 Clustering and analysis of microarrays.
Everything is String. Closed Factorization Golnaz Badkobeh 1, Hideo Bannai 2, Keisuke Goto 2, Tomohiro I 2, Costas S. Iliopoulos 3, Shunsuke Inenaga 2,
Comp. Genomics Recitation 7 Clustering and analysis of microarrays.
Dipankar Ranjan Baisya, Mir Md. Faysal & M. Sohel Rahman CSE, BUET Dhaka 1000 Degenerate String Reconstruction from Cover Arrays (Extended Abstract) 1.
Average Value of Sum of Exponents of Runs in Strings Kazuhiko Kusano, Wataru Matsubara, Akira Ishino, Ayumi Shinohara Graduate School of Information Sciences.
Chapter 11. Chapter Summary  Introduction to trees (11.1)  Application of trees (11.2)  Tree traversal (11.3)  Spanning trees (11.4)
Efficient Computation of Substring Equivalence Classes with Suffix Arrays this study is a joint work with Shunsuke Inenaga, Hideo Bannai and Masayuki Takeda.
Representing Relations Using Digraphs
Introduction to Algorithms
More NP-Complete and NP-hard Problems
Applied Discrete Mathematics Week 2: Functions and Sequences
Chapter 5 Limits and Continuity.
Probabilistic Algorithms
Complexity & the O-Notation
The Greedy Method and Text Compression
Minimum Spanning Tree 8/7/2018 4:26 AM
Bipartite Graphs What is a bipartite graph?
Lectures on Network Flows
Maximum Flow 9/13/2018 6:12 PM Presentation for use with the textbook, Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015.
Chapter 5. Optimal Matchings
The Taxi Scheduling Problem
Lecture 10 Algorithm Analysis
Intractable Problems Time-Bounded Turing Machines Classes P and NP
Intractable Problems Time-Bounded Turing Machines Classes P and NP
Chapter 11 Data Compression
Comparative RNA Structural Analysis
Reachability on Suffix Tree Graphs
Linear Programming Duality, Reductions, and Bipartite Matching
Splay trees (Sleator, Tarjan 1983)
Analysis of Algorithms
Advanced Algorithms Analysis and Design
All pairs shortest path problem
Algorithm Design Techniques Greedy Approach vs Dynamic Programming
Dynamic Programming II DP over Intervals
CPSC 411 Design and Analysis of Algorithms
Analysis of Algorithms
Presentation transcript:

Computing smallest and largest repetition factorization in O(n log n) time Hiroe Inoue, Yoshiaki Matsuoka, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda (Kyushu University) 1 PSC 2016

2 Factorization of string  A factorization of a string w is a sequence f 1, …, f m of non-empty substrings of w such that w = f 1 … f m. Each substring in a factorization is called a factor.  There exist several factorizations with specific properties. – Lempel-Ziv factorizations [Ziv & Lempel, 1977] – LZW factorizations [Welch, 1984] – Lyndon factorizations [Chen et al., 1958]

3 Factorization of string In this work, we consider factorizations with repetitions.  A factorization of a string w is a sequence f 1, …, f m of non-empty substrings of w such that w = f 1 … f m. Each substring in a factorization is called a factor.  There exist several factorizations with specific properties. – Lempel-Ziv factorizations [Ziv & Lempel, 1977] – LZW factorizations [Welch, 1984] – Lyndon factorizations [Chen et al., 1958]

4 Repetitive structures a b a a b a a b c d a b c d a b c period  Integer p ≥ 1 is said to be a period of string w if w[i] = w[i+p] (1 ≤ i ≤ |w|−p). 11

5 Repetitive structures a b a a b a a b c d a b c d a b c square  Integer p ≥ 1 is said to be a period of string w if w[i] = w[i+p] (1 ≤ i ≤ |w|−p).  w is a square if |w|/2 is a period of w. period

 Integer p ≥ 1 is said to be a period of string w if w[i] = w[i+p] (1 ≤ i ≤ |w|−p).  w is a square if |w|/2 is a period of w.  w is a repetition if the smallest period of w is at most |w|/2. 6 Repetitive structures a b a a b a a b c d a b c d a b c repetition period

 Square factorization : each factor of the factorization is a square.  Size of factorization : the number of factors in the factorization. 7 Square factorization No square factorization exists a b a a b a b b a b c d e f g a b a a b a b a a b a a a a a

 Repetition factorization : each factor of the factorization is a repetition.  Size of factorization : the number of factors in the factorization. 8 Repetition factorization a b a a b a b b a b c d e f g No repetition factorization exists a b a a b a b a a b a a a a a

 Repetition factorization : each factor of the factorization is a repetition.  Size of factorization : the number of factors in the factorization.  There can be multiple factorizations of the same string. 9 Repetition factorization Size : 3 size : 2 a b a a b a b a a b a a a a a

Related work 10 SquareRepetition Any factorization O(n) time O(n) space O(n) time O(n) space Smallest factorization O(n log n) time O(n) space Largest factorization O(n log n) time O(n) space n is the length of the input string. [2][1] [2] [1] Dumitran et al., 2015 [2] Matsuoka et al., 2016

11 Related work and our contribution SquareRepetition Any factorization O(n) time O(n) space O(n) time O(n) space Smallest factorization O(n log n) time O(n) space O(n log n) time O(n) space Largest factorization O(n log n) time O(n) space O(n log n) time O(n) space n is the length of the input string. Our contribution! [1] Dumitran et al., 2015 [2] Matsuoka et al., 2016 [2][1] [2]

A run in a string is a maximal repetition of the string, i.e., its periodicity does not extend to the left nor the right. 12 Runs ( = maximal repetitions) a a b a b a a b a b a b b b a b a b a a b a b b b a b a b a a b a b a b Runs of w w =

13 Runs ( = maximal repetitions) a b a b a a b a b a b b b a b a A run in a string is a maximal repetition of the string, i.e., its periodicity does not extend to the left nor the right. b ≠ w =

14 Runs ( = maximal repetitions) a b a b a a b a b a b b b a b a A run in a string is a maximal repetition of the string, i.e., its periodicity does not extend to the left nor the right. b b ≠ w =

We denote each run r by a triple (beg, end, p). ( beg : beginning position, end : ending position, p : smallest period) 15 Runs ( = maximal repetitions) a a b a b a a b a b a b b b a b a b a a b a b b b a b a b a a b a b a b (1,10,5) Runs of w w =

16 Runs ( = maximal repetitions) Let Runs(w) denote the set of all runs of string w. Runs(w) can be computed in O(n) time for integer alphabet. Lemma 2 [Crochemore & Ilie, 2008] For any string w of length n, |Runs(w)|= O(n). Lemma 1 [Kolpakov & Kucherov, 1999]

 Any repetition is a substring of a run, and has length at least twice the period.  A repetition factorization exists if and only if we can “traverse” such substrings from the beginning of the string to the end. 17 Main idea a a b a b a a b a b a b b b a b a b a a b a b b b a b a b a a b a b a b

We can formalize the idea and define a graph which we call the repetition graph. Then the problem reduces to a (weighted) path problem on the repetition graph. 18 Main idea a a b a b a a b a b a b b b a b a b a a b a b b b a b a b a a b a b a b

19 Repetition Graph a b a b a a b a b a b b b Below is an example of a repetition graph. The graph consists of two types of nodes (white and black), and several types of edges (black, red, and blue).

20 Repetition Graph a b a b a a b a b a b b b We consider positions between characters. One white node is defined for each position.

21 Repetition Graph a b a b a a b a b a b b b run square For each run r : (beg, end, p), add path for each square of period p within r  and connect black nodes of the same run with blue edges.

22 Repetition Graph For each run r : (beg, end, p), add path for each square of period p within r and connect black nodes of the same run with blue edges a b a b a a b a b a b b b run square

23 Repetition Graph a b a b a a b a b a b b b square By definition, there is a one to one correspondence between: non-empty paths that start and end at white nodes in the run, and subrepetitions of the run that start and end at those positions

24 Repetition Graph a b a b a a b a b a b b b square By construction, there is a one to one correspondence between: non-empty paths that start and end at white nodes in the run, and subrepetitions of the run that start and end at those positions

25 Repetition Graph a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 The following graph is the complete repetition graph for the example string.

26 Size of Repetition Graph There are n + 1 white nodes. For each square in run with same period ( “primitively rooted” squares), there is 1 black node and ≤ 3 edges (black, red, blue)  # of primitively rooted squares is O(n log n) [Crochemore and Rytter, 1995]  size of graph is O(n log n) a b a b a a b a b a b b b

a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 Each path corresponding to a repetition has exactly 1 black node. Reduction to Weighted Path Problem

a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 There is a one to one correspondence between: paths from first white node to last white node, and repetition factorizations of w Reduction to Weighted Path Problem

a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 There is a one to one correspondence between: paths from first white node to last white node, and repetition factorizations of w Reduction to Weighted Path Problem

30 Reduction to Weighted Path Problem a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 There is a one to one correspondence between: paths from first white node to last white node, and repetition factorizations of w  size of the factorization = total number of black edges in path

31 Reduction to Weighted Path Problem a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 If we can solve the weighted path problem on the repetition graph s.t. a weight of each black edge is 1 and a weight of each other edge is 0, then we can get a smallest/largest repetition factorization weight

32 Reduction to Weighted Path Problem a b a b a a b a b a b b b r1r1 r4r4 r5r5 r6r6 There is a one to one correspondence between: paths from first white node to last white node, and repetition factorizations of w  size of the factorization = total number of black edges in path Size : 4 Weight : 4

Since repetition graphs are DAGs, the smallest/largest weighted path problem can be solved in linear time w.r.t. the size of the graph by using dynamic programming. 33 Reduction to the path problem 33

For each node, the value of the node can be computed from the smallest/largest value of incoming nodes. 34 Reduction to the path problem weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r

Suppose the calculation up to position 9 has been completed, we want to find the size of the factorization up to position Reduction to the path problem weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r

We first compute all values of black nodes at position 10, when we compute the value of white node at position Reduction to the path problem weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r

The value of the lower black node at position 10 is 3, = (the value of white node at position 6 ) Reduction to the path problem weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r

The value of the upper black node at position 10 is 1, = (the value of white node at position 0 ) Reduction to the path problem weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r

The value of white node is largest value of incoming nodes. 39 Reduction to the path problem weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r

Thus, we can get the following lemma. 40 Reduction to the path problem Given the repetition graph of a string of length n, a smallest / largest repetition factorization of the string can be computed in O(n log n) time. Lemma 4

41 Reducing space requirement  By definition, we can clearly construct the repetition graph in O(n log n) time.  O(n log n) time and O(n log n) space solution  We can simulate our algorithm without constructing the repetition graph explicitly.  O(n log n) time and O(n) space algorithm

42 Solution without Repetition Graph weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r When the calculation up to position 9 has been completed, we want to find the size of the factorization up to position 10. 2

43 Solution without Repetition Graph weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r We have runs r 2 and r 5 that contains squares that end at position 10, starting at positions 0 and 6, respectively.

44 Solution without Repetition Graph weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r We also know from r 5, that there is a previous black node at position 9.

45 Solution without Repetition Graph weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r The maximum path weight of the new black node for r 5 is the maximum of path weights between the previous black node for r 5, and the white node at position 6 plus 1.

46 Solution without Repetition Graph weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r The idea is that we do not need the value of the black node at position 9 after computing the adjacent black node. Not needed anymore

47 Solution without Repetition Graph weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r Similarly, the maximum path weight of the new black node for r 2 is the path weight of the white node at position 0 plus 1.

48 Solution without Repetition Graph weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r Then we can compute the value of the white node at position 10. 3

49 Solution without Repetition Graph weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r Since we only need to remember all white nodes and 1 black node per run, and since edges can be computed on the fly from the information of runs, it is possible to determine the size of the factorization in linear space.

50 Our main result We can compute a smallest/largest repetition factorization of a given string w in O(n log n) time and O(n) space. Theorem

Compute all runs in w, and sort them by the beginning positions. 51 A complete example a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6

Consider n+1 white nodes for each position. 52 A complete example a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6

For each position, if there is a square that ends at the position, then we compute the values of black node and white node a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 A complete example

At position 0 : Since we have no square that ends at position 0, we do not add any node and edge a b a b a a b a b a b b b 0 r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 A complete example

At position 0 : Since there is no black node at position 0, the value of white node at position 0 is A complete example a b a b a a b a b a b b b 0 r1r1 r2r2 r3r3 r4r4 r5r5 r6r6

At position 1-3 : Similar to position a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 00 A complete example

a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 000 At position 1-3 : Similar to position 0. A complete example

a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r At position 1-3 : Similar to position 0. A complete example

At position 4 : Since there is a square that ends at position 4, we know there is a black node a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r A complete example

At position 4 : The value is the weight of the white node at position 0 plus a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r A complete example

At position 4 : Then the value of the white node at position 4 is a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r A complete example

a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r At position 5 : There is a square that ends at position 5. We can compute the values of black node and white node. A complete example

a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r At position 5 : We do not need the value corresponding to the black node at position 4. 1 A complete example

a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r At position 6 : There is a square that ends at position 6. We can compute the values of black node and white node. A complete example

a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r At position 7 : Since we have no square that ends at position 0, the value of white node at position 7 is 0. A complete example

a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r At position 8 : There is a square that ends at position 8. We can compute the values of black node and white node. A complete example

a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 At position 9 : There is a square that ends at position 9. We can compute the values of black node and white node. A complete example

a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 At position 10 : There are squares that end at position 10. We can compute the values of black nodes and white node. A complete example

a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r At position 11 : There is a square that ends at position 11. We can compute the values of black node and white node. A complete example

a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r At position 12 : There is a square that ends at position 12. We can compute the values of black node and white node. A complete example

a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r At position 13 : There is a square that ends at position 13. We can compute the values of black node and white node. A complete example

72 Lower bound of repetition graph There is a string that the size of graph is the Θ(n log n). (Fibonacci string) For any Fibonacci string of length n, the number of primitively rooted square is Θ(n log n). Lemma 6 [Fraenkel and Simpson, 1999]

 We showed two algorithms to compute the smallest/largest repetition factorization for a given string of length n. With graph : O(n log n) time and space Without graph : O(n log n) time and O(n) space  Open question is whether there exists an efficient algorithm which computes repetition factorizations of smallest/largest size without relying on the graph. 73 Conclusions and open question