Efficient algorithms for the scaled indexing problem
Biing-Feng Wang, Jyh-Jye Lin, and Shan-Chyun Ku
Journal of Algorithms 52 (2004) 82–100
Presenter: Yung-Hsing Peng
Date:
Example for the problem
Let T = a^5 c^6 a^1 c^1 a^4 b^3 (run-length coding), P1 = a^2 c^1, and P2 = c^1 a^1 b^1.
δ_k is the scaling function with parameter k; for an integer k it multiplies every run length by k.
If k = 2, we have δ_2(P1) = a^4 c^2 and δ_2(P2) = c^2 a^2 b^2.
δ_2(P1) occurs in T, so P1 is a valid pattern.
In this example, P2 is not a valid pattern, since δ_k(P2) does not occur in T for any k.
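The validity check in this example can be replayed by brute force. The sketch below is only an illustration, not the paper's method: it expands the run-length coding into plain strings and tries every integer k up to the length of T. The names `expand`, `scale`, and `is_valid` are hypothetical helpers introduced here.

```python
def expand(runs):
    """Turn a run-length coding [(char, count), ...] into a plain string."""
    return "".join(ch * n for ch, n in runs)


def scale(runs, k):
    """Discrete scaling: multiply every run length by the integer k."""
    return [(ch, n * k) for ch, n in runs]


def is_valid(pattern, text):
    """Brute force: does some integer scale of `pattern` occur in `text`?"""
    plain = expand(text)
    # A scaled pattern longer than the text cannot occur, so k <= len(plain).
    return any(expand(scale(pattern, k)) in plain
               for k in range(1, len(plain) + 1))


T = [("a", 5), ("c", 6), ("a", 1), ("c", 1), ("a", 4), ("b", 3)]
P1 = [("a", 2), ("c", 1)]
P2 = [("c", 1), ("a", 1), ("b", 1)]

print(is_valid(P1, T))  # P1 is reported as valid
print(is_valid(P2, T))  # P2 is reported as not valid
```

This costs far more than the indexing approach of the later slides; it is meant only as a reference for small inputs.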
Algorithm for Discrete Scaling
For every positive integer k, construct a new string T_k from T.
Take a run x^y: if y is divisible by k, replace it by x^(y/k); otherwise replace it by x^⌊y/k⌋ $ x^⌊y/k⌋. A run with ⌊y/k⌋ = 0 leaves only the separator $.
Ex: T = a^5 c^6 a^1 c^1 a^4 b^3 (with max repeat m = 6)
  T_1 = a^5 c^6 a^1 c^1 a^4 b^3
  T_2 = a^2 $ a^2 c^3 $ $ a^2 b^1 $ b^1
  T_3 = a^1 $ a^1 c^2 $ $ a^1 $ a^1 b^1
  T_4 = a^1 $ a^1 c^1 $ c^1 $ $ a^1 $
  T_5 = a^1 c^1 $ c^1 $ $ $ $
  T_6 = $ c^1 $ $ $ $
Theorem: If P is a valid pattern, then P must be found in T_1 $ T_2 $ T_3 ... $ T_m.
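The replacement rule can be sketched in a few lines. This is a minimal illustration that assumes T is given as a list of (character, run length) pairs and prints each run as the character followed by its length; `build_tk` is a hypothetical name, and the direct per-run loop is not the incremental construction discussed on the next slide.

```python
def build_tk(runs, k):
    """Apply the slide's rule to every run x^y of the run-length coded text."""
    parts = []
    for ch, y in runs:
        q, r = divmod(y, k)
        # A run that shrinks to length 0 contributes no characters at all.
        piece = f"{ch}{q}" if q > 0 else ""
        # Divisible: x^(y/k). Otherwise: x^floor(y/k) $ x^floor(y/k).
        parts.append(piece if r == 0 else f"{piece}${piece}")
    return "".join(parts)


T = [("a", 5), ("c", 6), ("a", 1), ("c", 1), ("a", 4), ("b", 3)]
m = max(y for _, y in T)

for k in range(1, m + 1):
    print(k, build_tk(T, k))

# The string to be indexed: T_1 $ T_2 $ ... $ T_m
combined = "$".join(build_tk(T, k) for k in range(1, m + 1))
```

For k = 2 this yields `a2$a2c3$$a2b1$b1`, which corresponds to T_2 in the example.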
An Efficient Method to Build T_k
Use T_(k-1) to compute T_k with the help of the index I_k (in the example, I_k lists the positions of the runs of T whose length is at least k).
Ex: T = a^5 c^6 a^1 c^1 a^4 b^3 (with max repeat m = 6)
  k   I_k              T_k
  1   {1,2,3,4,5,6}    a^5 c^6 a^1 c^1 a^4 b^3
  2   {1,2,5,6}        a^2 $ a^2 c^3 $ $ a^2 b^1 $ b^1
  3   {1,2,5,6}        a^1 $ a^1 c^2 $ $ a^1 $ a^1 b^1
  4   {1,2,5}          a^1 $ a^1 c^1 $ c^1 $ $ a^1 $
  5   {1,2}            a^1 c^1 $ c^1 $ $ $ $
  6   {2}              $ c^1 $ $ $ $
  7   {}
For any I_k, there are at most n/k elements.
|I_1| + |I_2| + |I_3| + ... + |I_m| ≤ n/1 + n/2 + ... + n/m = O(n log m)
T_1 $ T_2 $ T_3 $ ... $ T_m can be built in O(n log m) time.
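The index sets in the example can be computed incrementally, each one by filtering the previous one. The sketch below assumes, as the example suggests, that I_k holds the 1-based positions of the runs of length at least k; `index_sets` is a hypothetical name.

```python
def index_sets(run_lengths):
    """Return {k: I_k} for k = 1 .. m + 1, each built from I_(k-1)."""
    m = max(run_lengths)
    current = list(range(1, len(run_lengths) + 1))  # I_1: every run
    sets = {1: current}
    for k in range(2, m + 2):
        # Only runs that survived step k - 1 can survive step k.
        current = [i for i in current if run_lengths[i - 1] >= k]
        sets[k] = current
    return sets


lengths = [5, 6, 1, 1, 4, 3]  # run lengths of T = a5 c6 a1 c1 a4 b3
I = index_sets(lengths)
for k, members in I.items():
    print(k, members)

total = sum(len(I[k]) for k in range(1, max(lengths) + 1))
print(total)  # total size of I_1 .. I_m
```

Because step k scans only I_(k-1), the total work is proportional to the sum of the index set sizes rather than to m times the number of runs.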
Time Complexity of Discrete Scaling
Lemma: T_1 $ T_2 $ T_3 ... $ T_m can be built in O(n log m) time.
Lemma: For each T_k, its length is O(n/k).
The length of T_1 $ T_2 $ T_3 ... $ T_m is O(n/1 + n/2 + n/3 + ... + n/m) = O(n log m).
The suffix tree of T_1 $ T_2 ... $ T_m can be built in O(n log m) time,
where n is the length of T and m is the maximum run length of a character in T.
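The step from the sum of the lengths to O(n log m) rests on the standard bound for the harmonic number H_m, which can be written out as follows (H_m is notation introduced here, not used on the slide):

```latex
\sum_{k=1}^{m} \frac{n}{k} \;=\; n \sum_{k=1}^{m} \frac{1}{k} \;=\; n\,H_m \;\le\; n\,(1 + \ln m) \;=\; O(n \log m)
```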
Algorithm for the Decision Version of the Real Scaling (1/2)
For every critical real number k, construct a new string T_k from T.
Since the run lengths of the input pattern P are integers, all critical k can be found by division.
Ex: T = a^5 c^6 a^1 c^1 a^4 b^3 (distinct run lengths only; quotients below 1 are dropped)
  (1) divided by 1: {5, 6, 1, 4, 3}
  (2) divided by 2: {2.5, 3, 2, 1.5}
  (3) divided by 3: {1.66, 2, 1.33, 1}
  (4) divided by 4: {1.25, 1.5, 1}
  (5) divided by 5: {1, 1.2}
  (6) divided by 6: {1}
If m is the maximum run length in T, then the set Γ(T) of critical k is the union of (1) to (m).
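The division step can be sketched with exact fractions, which avoids rounding artifacts such as 1.66 versus 1.67. The function name `critical_scales` is hypothetical, and the sketch assumes, as in the example, that only quotients of at least 1 are kept.

```python
from fractions import Fraction


def critical_scales(run_lengths):
    """Union of y / j over all run lengths y and divisors j = 1 .. m, keeping y / j >= 1."""
    m = max(run_lengths)
    return sorted({Fraction(y, j)
                   for y in run_lengths
                   for j in range(1, m + 1)
                   if j <= y})


lengths = [5, 6, 1, 1, 4, 3]  # run lengths of T = a5 c6 a1 c1 a4 b3
gamma = critical_scales(lengths)
print([str(k) for k in gamma])
print(len(gamma))
```

On the example this produces 12 distinct values, from 1 up to 6, matching the union of the six sets listed on the slide.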
Algorithm for the Decision Version of the Real Scaling (2/2)
For every critical k in Γ(T), construct a new string T_k from T.
Take a run x^y: if y is k-invertible, replace it by x^Φ(y, k); otherwise replace it by x^Φ(y, k) $ x^Φ(y, k), where Φ(y, k) is the largest integer r such that ⌊k·r⌋ ≤ y.
Ex: T = a^5 c^6 a^1 c^1 a^4 b^3 (with max repeat m = 6). If k = 1.5, then T_k = a^3 $ a^3 c^4 $ $ a^3 b^2.
Theorem: If P is a valid pattern, then P must be found in T_k1 $ T_k2 $ T_k3 ... $ T_kz, where z is the number of critical k.
In the above example, if k = 1.7 then T_k would be a^3 c^4 $ $ a^2 $ a^2 b^2.
The position of δ_1.7(a^3 c^4) in T is different from that of δ_1.5(a^3 c^4) in T.
Consequently, this algorithm can only solve the decision version of real scaling.
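The function Φ can be sketched directly from its definition on this slide. The code below covers only Φ itself and assumes k ≥ 1 is passed as an exact `Fraction`; the k-invertibility test and the treatment of runs of length 1 are left out because the slide does not define them. The name `phi` is hypothetical.

```python
from fractions import Fraction
from math import floor


def phi(y, k):
    """Largest integer r with floor(k * r) <= y, found by a simple upward scan."""
    r = 0
    while floor(k * (r + 1)) <= y:
        r += 1
    return r


k = Fraction(3, 2)  # k = 1.5
for y in (5, 6, 4, 3):
    print(y, phi(y, k))
```

For k = 1.5 the runs of length 5, 6, 4, and 3 map to 3, 4, 3, and 2, which are the exponents that appear in the slide's T_k for those runs.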
Time Complexity of Decision Version of Real Scaling
Lemma: In the worst case, the total number of critical k is O(n).
Lemma: Each T_ki can be computed in O(n) time.
Lemma: T_k1 $ T_k2 $ T_k3 ... $ T_kz can be built in O(n^2) time.
Algorithm for the Real Scaling (1/4)
Core: Generate all valid patterns and use them to build a Real Scale Indexing Tree (RSIT) to speed up searching.
Algorithm for the Real Scaling (2/4)
The upper bound for the number of all valid patterns:
Since there are O(n^3) patterns, a straightforward implementation would take O(n^4) time to insert all patterns into the RSIT.
This paper gives an O(n^3) algorithm for doing so.
Algorithm for the Real Scaling (3/4)
P*(g, l) is obtained by shrinking, by g, the longest substring of T starting at l that can be shrunk by g.
Ex: T = a a a a b b b c c c a a a a, P = b c, l = 4
P*(3, 4) = b c a (the red region shrinks by 3)
P is a prefix of P*(3, 4).
Algorithm for the Real Scaling (4/4)
Conclusion of Real Scaled Indexing Problem