Computing smallest and largest repetition factorization in O(n log n) time Hiroe Inoue, Yoshiaki Matsuoka, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda (Kyushu University) 1 PSC 2016
2 Factorization of string A factorization of a string w is a sequence f 1, …, f m of non-empty substrings of w such that w = f 1 … f m. Each substring in a factorization is called a factor. There exist several factorizations with specific properties. – Lempel-Ziv factorizations [Ziv & Lempel, 1977] – LZW factorizations [Welch, 1984] – Lyndon factorizations [Chen et al., 1958]
3 Factorization of string In this work, we consider factorizations with repetitions. A factorization of a string w is a sequence f 1, …, f m of non-empty substrings of w such that w = f 1 … f m. Each substring in a factorization is called a factor. There exist several factorizations with specific properties. – Lempel-Ziv factorizations [Ziv & Lempel, 1977] – LZW factorizations [Welch, 1984] – Lyndon factorizations [Chen et al., 1958]
4 Repetitive structures a b a a b a a b c d a b c d a b c period Integer p ≥ 1 is said to be a period of string w if w[i] = w[i+p] (1 ≤ i ≤ |w|−p). 11
5 Repetitive structures a b a a b a a b c d a b c d a b c square Integer p ≥ 1 is said to be a period of string w if w[i] = w[i+p] (1 ≤ i ≤ |w|−p). w is a square if |w|/2 is a period of w. period
Integer p ≥ 1 is said to be a period of string w if w[i] = w[i+p] (1 ≤ i ≤ |w|−p). w is a square if |w|/2 is a period of w. w is a repetition if the smallest period of w is at most |w|/2. 6 Repetitive structures a b a a b a a b c d a b c d a b c repetition period
Square factorization : each factor of the factorization is a square. Size of factorization : the number of factors in the factorization. 7 Square factorization No square factorization exists a b a a b a b b a b c d e f g a b a a b a b a a b a a a a a
Repetition factorization : each factor of the factorization is a repetition. Size of factorization : the number of factors in the factorization. 8 Repetition factorization a b a a b a b b a b c d e f g No repetition factorization exists a b a a b a b a a b a a a a a
Repetition factorization : each factor of the factorization is a repetition. Size of factorization : the number of factors in the factorization. There can be multiple factorizations of the same string. 9 Repetition factorization Size : 3 size : 2 a b a a b a b a a b a a a a a
Related work 10 SquareRepetition Any factorization O(n) time O(n) space O(n) time O(n) space Smallest factorization O(n log n) time O(n) space Largest factorization O(n log n) time O(n) space n is the length of the input string. [2][1] [2] [1] Dumitran et al., 2015 [2] Matsuoka et al., 2016
11 Related work and our contribution SquareRepetition Any factorization O(n) time O(n) space O(n) time O(n) space Smallest factorization O(n log n) time O(n) space O(n log n) time O(n) space Largest factorization O(n log n) time O(n) space O(n log n) time O(n) space n is the length of the input string. Our contribution! [1] Dumitran et al., 2015 [2] Matsuoka et al., 2016 [2][1] [2]
A run in a string is a maximal repetition of the string, i.e., its periodicity does not extend to the left nor the right. 12 Runs ( = maximal repetitions) a a b a b a a b a b a b b b a b a b a a b a b b b a b a b a a b a b a b Runs of w w =
13 Runs ( = maximal repetitions) a b a b a a b a b a b b b a b a A run in a string is a maximal repetition of the string, i.e., its periodicity does not extend to the left nor the right. b ≠ w =
14 Runs ( = maximal repetitions) a b a b a a b a b a b b b a b a A run in a string is a maximal repetition of the string, i.e., its periodicity does not extend to the left nor the right. b b ≠ w =
We denote each run r by a triple (beg, end, p). ( beg : beginning position, end : ending position, p : smallest period) 15 Runs ( = maximal repetitions) a a b a b a a b a b a b b b a b a b a a b a b b b a b a b a a b a b a b (1,10,5) Runs of w w =
16 Runs ( = maximal repetitions) Let Runs(w) denote the set of all runs of string w. Runs(w) can be computed in O(n) time for integer alphabet. Lemma 2 [Crochemore & Ilie, 2008] For any string w of length n, |Runs(w)|= O(n). Lemma 1 [Kolpakov & Kucherov, 1999]
Any repetition is a substring of a run, and has length at least twice the period. A repetition factorization exists if and only if we can “traverse” such substrings from the beginning of the string to the end. 17 Main idea a a b a b a a b a b a b b b a b a b a a b a b b b a b a b a a b a b a b
We can formalize the idea and define a graph which we call the repetition graph. Then the problem reduces to a (weighted) path problem on the repetition graph. 18 Main idea a a b a b a a b a b a b b b a b a b a a b a b b b a b a b a a b a b a b
19 Repetition Graph a b a b a a b a b a b b b Below is an example of a repetition graph. The graph consists of two types of nodes (white and black), and several types of edges (black, red, and blue).
20 Repetition Graph a b a b a a b a b a b b b We consider positions between characters. One white node is defined for each position.
21 Repetition Graph a b a b a a b a b a b b b run square For each run r : (beg, end, p), add path for each square of period p within r and connect black nodes of the same run with blue edges.
22 Repetition Graph For each run r : (beg, end, p), add path for each square of period p within r and connect black nodes of the same run with blue edges a b a b a a b a b a b b b run square
23 Repetition Graph a b a b a a b a b a b b b square By definition, there is a one to one correspondence between: non-empty paths that start and end at white nodes in the run, and subrepetitions of the run that start and end at those positions
24 Repetition Graph a b a b a a b a b a b b b square By construction, there is a one to one correspondence between: non-empty paths that start and end at white nodes in the run, and subrepetitions of the run that start and end at those positions
25 Repetition Graph a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 The following graph is the complete repetition graph for the example string.
26 Size of Repetition Graph There are n + 1 white nodes. For each square in run with same period ( “primitively rooted” squares), there is 1 black node and ≤ 3 edges (black, red, blue) # of primitively rooted squares is O(n log n) [Crochemore and Rytter, 1995] size of graph is O(n log n) a b a b a a b a b a b b b
a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 Each path corresponding to a repetition has exactly 1 black node. Reduction to Weighted Path Problem
a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 There is a one to one correspondence between: paths from first white node to last white node, and repetition factorizations of w Reduction to Weighted Path Problem
a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 There is a one to one correspondence between: paths from first white node to last white node, and repetition factorizations of w Reduction to Weighted Path Problem
30 Reduction to Weighted Path Problem a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 There is a one to one correspondence between: paths from first white node to last white node, and repetition factorizations of w size of the factorization = total number of black edges in path
31 Reduction to Weighted Path Problem a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 If we can solve the weighted path problem on the repetition graph s.t. a weight of each black edge is 1 and a weight of each other edge is 0, then we can get a smallest/largest repetition factorization weight
32 Reduction to Weighted Path Problem a b a b a a b a b a b b b r1r1 r4r4 r5r5 r6r6 There is a one to one correspondence between: paths from first white node to last white node, and repetition factorizations of w size of the factorization = total number of black edges in path Size : 4 Weight : 4
Since repetition graphs are DAGs, the smallest/largest weighted path problem can be solved in linear time w.r.t. the size of the graph by using dynamic programming. 33 Reduction to the path problem 33
For each node, the value of the node can be computed from the smallest/largest value of incoming nodes. 34 Reduction to the path problem weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r
Suppose the calculation up to position 9 has been completed, we want to find the size of the factorization up to position Reduction to the path problem weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r
We first compute all values of black nodes at position 10, when we compute the value of white node at position Reduction to the path problem weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r
The value of the lower black node at position 10 is 3, = (the value of white node at position 6 ) Reduction to the path problem weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r
The value of the upper black node at position 10 is 1, = (the value of white node at position 0 ) Reduction to the path problem weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r
The value of white node is largest value of incoming nodes. 39 Reduction to the path problem weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r
Thus, we can get the following lemma. 40 Reduction to the path problem Given the repetition graph of a string of length n, a smallest / largest repetition factorization of the string can be computed in O(n log n) time. Lemma 4
41 Reducing space requirement By definition, we can clearly construct the repetition graph in O(n log n) time. O(n log n) time and O(n log n) space solution We can simulate our algorithm without constructing the repetition graph explicitly. O(n log n) time and O(n) space algorithm
42 Solution without Repetition Graph weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r When the calculation up to position 9 has been completed, we want to find the size of the factorization up to position 10. 2
43 Solution without Repetition Graph weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r We have runs r 2 and r 5 that contains squares that end at position 10, starting at positions 0 and 6, respectively.
44 Solution without Repetition Graph weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r We also know from r 5, that there is a previous black node at position 9.
45 Solution without Repetition Graph weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r The maximum path weight of the new black node for r 5 is the maximum of path weights between the previous black node for r 5, and the white node at position 6 plus 1.
46 Solution without Repetition Graph weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r The idea is that we do not need the value of the black node at position 9 after computing the adjacent black node. Not needed anymore
47 Solution without Repetition Graph weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r Similarly, the maximum path weight of the new black node for r 2 is the path weight of the white node at position 0 plus 1.
48 Solution without Repetition Graph weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r Then we can compute the value of the white node at position 10. 3
49 Solution without Repetition Graph weight a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r Since we only need to remember all white nodes and 1 black node per run, and since edges can be computed on the fly from the information of runs, it is possible to determine the size of the factorization in linear space.
50 Our main result We can compute a smallest/largest repetition factorization of a given string w in O(n log n) time and O(n) space. Theorem
Compute all runs in w, and sort them by the beginning positions. 51 A complete example a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6
Consider n+1 white nodes for each position. 52 A complete example a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6
For each position, if there is a square that ends at the position, then we compute the values of black node and white node a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 A complete example
At position 0 : Since we have no square that ends at position 0, we do not add any node and edge a b a b a a b a b a b b b 0 r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 A complete example
At position 0 : Since there is no black node at position 0, the value of white node at position 0 is A complete example a b a b a a b a b a b b b 0 r1r1 r2r2 r3r3 r4r4 r5r5 r6r6
At position 1-3 : Similar to position a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 00 A complete example
a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 000 At position 1-3 : Similar to position 0. A complete example
a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r At position 1-3 : Similar to position 0. A complete example
At position 4 : Since there is a square that ends at position 4, we know there is a black node a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r A complete example
At position 4 : The value is the weight of the white node at position 0 plus a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r A complete example
At position 4 : Then the value of the white node at position 4 is a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r A complete example
a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r At position 5 : There is a square that ends at position 5. We can compute the values of black node and white node. A complete example
a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r At position 5 : We do not need the value corresponding to the black node at position 4. 1 A complete example
a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r At position 6 : There is a square that ends at position 6. We can compute the values of black node and white node. A complete example
a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r At position 7 : Since we have no square that ends at position 0, the value of white node at position 7 is 0. A complete example
a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r At position 8 : There is a square that ends at position 8. We can compute the values of black node and white node. A complete example
a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 At position 9 : There is a square that ends at position 9. We can compute the values of black node and white node. A complete example
a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r6 At position 10 : There are squares that end at position 10. We can compute the values of black nodes and white node. A complete example
a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r At position 11 : There is a square that ends at position 11. We can compute the values of black node and white node. A complete example
a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r At position 12 : There is a square that ends at position 12. We can compute the values of black node and white node. A complete example
a b a b a a b a b a b b b r1r1 r2r2 r3r3 r4r4 r5r5 r6r At position 13 : There is a square that ends at position 13. We can compute the values of black node and white node. A complete example
72 Lower bound of repetition graph There is a string that the size of graph is the Θ(n log n). (Fibonacci string) For any Fibonacci string of length n, the number of primitively rooted square is Θ(n log n). Lemma 6 [Fraenkel and Simpson, 1999]
We showed two algorithms to compute the smallest/largest repetition factorization for a given string of length n. With graph : O(n log n) time and space Without graph : O(n log n) time and O(n) space Open question is whether there exists an efficient algorithm which computes repetition factorizations of smallest/largest size without relying on the graph. 73 Conclusions and open question