Published by Audrey Martin. Modified over 8 years ago.
1
Computing longest common substring and all palindromes from compressed strings
Wataru Matsubara (1), Shunsuke Inenaga (2), Akira Ishino (1), Ayumi Shinohara (1), Tomoyuki Nakamura (1), Kazuo Hashimoto (1)
(1) Graduate School of Information Sciences, Tohoku University, Japan
(2) Department of Computer Science and Communication Engineering, Kyushu University, Japan
2
Background and motivations
3
What is a compressed string algorithm?
Input text: A palindrome is a symmetric string. It is interesting on their own as word puzzles. For example, "I prefer pi", "Borrow or rob?", and "Was it a bar or a bat I saw?" and so on.
4
What is a compressed string algorithm?
Input text: A palindrome is a symmetric string. It is interesting on their own as word puzzles. For example, "I prefer pi", "Borrow or rob?", and "Was it a bar or a bat I saw?" and so on.
Find palindromes.
Output: mm, isi, zz, ipreferpi, borroworrob, wasitabarorabatisaw, oo, ...
5
What is a compressed string algorithm?
Compressed text: e)%eARY)(ReJD)OIHOIFEnkkdi we02kfo)J”LPEPJ9wEOW*# eO …
Decompress.
Decompressed text: A palindrome is a symmetric string. It is interesting on their own as word puzzles. For example, "I prefer pi", "Borrow or rob?", and "Was it a bar or a bat I saw?" and so on.
Find palindromes.
Output: mm, isi, zz, ipreferpi, borroworrob, wasitabarorabatisaw, oo, ...
One solution would be to decompress the compressed text. However, the decompressed size can be exponentially large with respect to the compressed size.
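The "find palindromes" step on ordinary (decompressed) text can be sketched as below. This is only a naive illustration of the problem and not the algorithm of the talk: the function name `maximal_palindromes`, the `min_len` parameter, and the normalization (lowercasing and dropping spaces and punctuation, which the slide's output suggests) are all assumptions made for this example.

```python
def maximal_palindromes(text, min_len=2):
    """Return the set of maximal palindromes of length >= min_len.

    Naive expand-around-center scan, quadratic in the worst case.
    Letters are lowercased and non-alphanumeric characters are dropped
    (an assumed normalization, inferred from the slide's output).
    """
    s = "".join(ch.lower() for ch in text if ch.isalnum())
    n = len(s)
    found = set()
    # There are 2n - 1 centers: n on a character, n - 1 between characters.
    for center in range(2 * n - 1):
        lo = center // 2
        hi = lo + center % 2
        # Grow outwards while the two ends agree.
        while lo >= 0 and hi < n and s[lo] == s[hi]:
            lo -= 1
            hi += 1
        # s[lo + 1:hi] is the maximal palindrome around this center.
        if hi - lo - 1 >= min_len:
            found.add(s[lo + 1:hi])
    return found


if __name__ == "__main__":
    print(sorted(maximal_palindromes("Borrow or rob?")))
```

Running this on decompressed text takes time that depends on N, which is exactly what a compressed string algorithm tries to avoid.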
6
Goal of algorithms for compressed strings
Process the compressed text without decompression.
Processing time should be polynomial in $n$.
– Decompressed size can be exponentially large with respect to $n$.
$n$: the size of the compressed text
7
Compression schemes
run-length encoding
Lempel-Ziv
grammar-based compression
: Straight Line Program (SLP)
[Rytter2003] The resulting archive of most practical compression methods can be transformed into an SLP generating the same original text.
8
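Definition of Straight Line Program (SLP)
An SLP $\mathcal{T}$ is a sequence of assignments $X_1 = expr_1;\ X_2 = expr_2;\ \ldots;\ X_n = expr_n$, where each $X_k$ is a variable and each $expr_k$ is either
– a single character $a$ ($a \in \Sigma$), or
– a concatenation $X_i X_j$ ($i, j < k$).
An SLP $\mathcal{T}$ for a string $w$ is a CFG in Chomsky normal form such that $L(\mathcal{T}) = \{w\}$.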
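The definition can be made concrete with a small sketch. The encoding here is an assumption chosen for illustration: a rule is either a one-character string or a pair `(i, j)` of 1-based indices of earlier variables. The helper `expand` is hypothetical and deliberately does the thing the talk wants to avoid, namely full decompression.

```python
def expand(rules):
    """Expand every variable of an SLP into the string it derives.

    rules[k - 1] describes X_k and is either a single character
    or a pair (i, j) with i, j < k, meaning X_k = X_i X_j.
    Returns the list of derived strings, one per variable.
    """
    texts = []
    for k, rule in enumerate(rules, start=1):
        if isinstance(rule, str):
            if len(rule) != 1:
                raise ValueError(f"X_{k}: terminal rule must be one character")
            texts.append(rule)
        else:
            i, j = rule
            if not (1 <= i < k and 1 <= j < k):
                raise ValueError(f"X_{k}: rule may only use earlier variables")
            texts.append(texts[i - 1] + texts[j - 1])
    return texts


if __name__ == "__main__":
    # X_1 = a, X_2 = b, X_3 = X_1 X_2, X_4 = X_3 X_1
    print(expand(["a", "b", (1, 2), (3, 1)]))
```

Because every rule refers only to earlier variables, one left-to-right pass suffices, and the last variable derives the single string $w$ of the language.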
9
Straight Line Program (SLP) Example
SLP $\mathcal{T}$ (size $n$):
$X_1 = a$
$X_2 = b$
$X_3 = X_1 X_2$
$X_4 = X_3 X_1$
$X_5 = X_3 X_4$
$X_6 = X_5 X_5$
$X_7 = X_4 X_6$
$X_8 = X_7 X_5$
Text $T$ (length $N$): $N = O(2^n)$
10
Straight Line Program (SLP) Example
SLP $\mathcal{T}$ (size $n$):
$X_1 = a$
$X_2 = b$
$X_3 = X_1 X_2$
$X_4 = X_3 X_1$
$X_5 = X_3 X_4$
$X_6 = X_5 X_5$
$X_7 = X_4 X_6$
$X_8 = X_7 X_5$
Text $T$ (length $N$): $N = O(2^n)$
[Figure: derivation tree rooted at $X_8 = X_7 X_5$.]
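For the eight-rule example above, the length of each variable and any single character of the text can be obtained without building the text. The sketch below assumes the same illustrative encoding as before (a character, or a pair of 1-based indices); `lengths` and `char_at` are hypothetical helper names, not functions from the talk.

```python
# The SLP from the slide: X_1 = a, X_2 = b, X_3 = X_1 X_2, ..., X_8 = X_7 X_5.
SLIDE_SLP = ["a", "b", (1, 2), (3, 1), (3, 4), (5, 5), (4, 6), (7, 5)]


def lengths(rules):
    """Length of the string derived by each variable, in O(n) time."""
    out = []
    for rule in rules:
        if isinstance(rule, str):
            out.append(1)
        else:
            i, j = rule
            out.append(out[i - 1] + out[j - 1])
    return out


def char_at(rules, lens, k, pos):
    """Character at 0-based position pos of X_k, walking down the rules."""
    if not 0 <= pos < lens[k - 1]:
        raise IndexError(pos)
    while not isinstance(rules[k - 1], str):
        i, j = rules[k - 1]
        if pos < lens[i - 1]:
            k = i                    # position falls in the left child
        else:
            pos -= lens[i - 1]       # position falls in the right child
            k = j
    return rules[k - 1]


if __name__ == "__main__":
    lens = lengths(SLIDE_SLP)
    print(lens)
    print("".join(char_at(SLIDE_SLP, lens, 8, p) for p in range(lens[-1])))
```

The same length computation shows the exponential gap: with $X_1 = a$ and $X_k = X_{k-1} X_{k-1}$, $n$ rules derive a text of length $2^{n-1}$.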
11
Efficient algorithms for compressed strings
substring matching
– Karpinski et al (1996): $O(n^4 \log n)$ time
– Miyazaki et al (1997): $O(n^4)$ time
– Lifshits (2006): $O(n^3)$ time
minimum period
– Karpinski et al (1996): $O(n^4 \log n)$ time
– Lifshits (2006): $O(n^3 \log N)$ time
all squares
– Gasieniec et al (1994): $O(n^6 \log^5 N)$ time
12
Hardness results
Subsequence pattern matching
– Lifshits and Lohrey (2006): NP-hard
Longest common subsequence
– Lifshits and Lohrey (2006): NP-hard
Hamming distance
– Lifshits (2007): #P-complete
Is there any reasonable comparison measure for compressed strings?
13
String comparison measures

Measure                      Uncompressed text    Compressed text
Hamming distance             $O(N)$               #P-complete [Lifshits 07]
Longest common subsequence   $O(N^2 / \log N)$    NP-hard [Lifshits and Lohrey 06]
Longest common substring     $O(N)$               ?? (we solve this problem)
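To make the difference between the three measures concrete, here is a minimal sketch on uncompressed strings. The function names are illustrative, and the two dynamic programs are the plain quadratic textbook versions, not the faster algorithms behind the bounds quoted for uncompressed text.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def longest_common_subsequence(a, b):
    """Length of the longest common subsequence (characters may be skipped)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def longest_common_substring(a, b):
    """Length of the longest common substring (characters must be contiguous)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            run = prev[j - 1] + 1 if x == y else 0  # common suffix length
            cur.append(run)
            best = max(best, run)
        prev = cur
    return best


if __name__ == "__main__":
    s, t = "abcde", "abfde"
    print(hamming_distance(s, t),
          longest_common_subsequence(s, t),
          longest_common_substring(s, t))
```

For "abcde" and "abfde" the subsequence "abde" skips over the differing character, while a common substring cannot, so the two measures differ.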
14
Our results
15
Our Result 1: Longest Common Substring
Problem: Given two SLPs $\mathcal{T}$ and $\mathcal{S}$ that are descriptions of texts $T$ and $S$ respectively, compute $LCStr(T, S)$.
$LCStr(T, S)$: the length of the longest common substring of $T$ and $S$
$n$: the total size of the input SLPs
Theorem: $LCStr(T, S)$ can be computed in $O(n^4 \log n)$ time using $O(n^3)$ space.
16
Our Result 2: Palindromes
Problem: Given an SLP $\mathcal{T}$, compute (a compressed representation of) the set of all palindromes of $T$.
$n$: the size of SLP $\mathcal{T}$
$N$: the length of the original text $T$ (note that $N = O(2^n)$)
Previous best result: $O(n^5 \log^4 N)$ time [Gasieniec et al 1996]
Theorem: The problem can be solved in $O(n^4)$ time using $O(n^2)$ space.
17
Details of our algorithm
– Computing longest common substring
– Computing palindromes (omitted in this talk)
18
Property of common substrings (1/3)
For each common substring $Z$ of strings $S$ and $T$, there always exist variables $X_i = X_l X_r$ and $Y_j = Y_L Y_R$ such that:
– $Z$ is a common substring of $X_i$ and $Y_j$
– $Z$ contains an overlap between $X_l$ and $Y_R$
[Figure: $X_i = X_l X_r$ and $Y_j = Y_L Y_R$ aligned so that the common substring $Z$ covers the overlap $w$.]
19
Property of common substrings (2/3)
For each common substring $Z$ of strings $S$ and $T$, there always exists a string $w$ such that:
– $w$ is a substring of $Z$
– $w$ is an overlap of variables of $S$ and $T$
[Figure: the overlap $w$ between $X_l$ and $Y_R$, inside $X_i = X_l X_r$ and $Y_j = Y_L Y_R$.]
20
Property of common substrings (3/3)
For each common substring $Z$ of strings $S$ and $T$, there always exists a string $w$ such that:
– $Z$ can be calculated by extending $w$
[Figure: the overlap $w$ between $X_l$ and $Y_R$ is extended in both directions to obtain the common substring $Z$ of $X_i$ and $Y_j$.]
21
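Overlaps (OL)
For any strings $X$ and $Y$, $OL(X, Y)$ is the set of the lengths of overlaps of $X$ and $Y$, that is, the lengths $k$ for which the suffix of $X$ of length $k$ equals the prefix of $Y$ of length $k$.
[Figure: a suffix of $X$ overlapping a prefix of $Y$.]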
22
Overlaps Example
OL("aabaaba", "abaababb") = {1, 3, 6}
[Figure: $Y_R$ = abaababb placed under $X_l$ = aabaaba at each of the three overlapping alignments.]
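The example can be checked with a direct, uncompressed computation. This is a naive sketch for illustration only (the name `overlaps` is an assumption); it compares every suffix of the first string with the equally long prefix of the second, which is quadratic and not the compressed algorithm of [Karpinski et al 1996].

```python
def overlaps(x, y):
    """Set of lengths k >= 1 such that the last k characters of x
    equal the first k characters of y."""
    limit = min(len(x), len(y))
    return {k for k in range(1, limit + 1) if x[-k:] == y[:k]}


if __name__ == "__main__":
    print(sorted(overlaps("aabaaba", "abaababb")))
```

The three lengths correspond to the overlaps "a", "aba", and "abaaba".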
23
Computing Overlaps [Karpinski et al 1996]
Lemma: For any variables $X_i$ and $X_j$ of SLP $\mathcal{T}$, $OL(X_i, X_j)$ can be represented by $O(n)$ arithmetic progressions.
Theorem: For any SLP $\mathcal{T}$, $OL(X_i, X_j)$ can be computed for all pairs of variables in a total of $O(n^4 \log n)$ time and $O(n^3)$ space.
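To show what "represented by arithmetic progressions" means as a data format, the sketch below packs a set of overlap lengths into triples (first, step, count). The greedy grouping and the name `to_progressions` are assumptions made for this illustration; the $O(n)$ bound of the lemma comes from the periodic structure of overlaps, not from this procedure.

```python
def to_progressions(values):
    """Greedily pack integers into arithmetic progressions.

    Returns a list of (first, step, count) triples; a lone value is
    stored as (value, 0, 1). Illustrative only: no claim is made that
    the number of triples is minimal.
    """
    vals = sorted(values)
    out = []
    i = 0
    while i < len(vals):
        first = vals[i]
        if i + 1 == len(vals):
            out.append((first, 0, 1))
            break
        step = vals[i + 1] - first
        j = i + 2
        # Extend the run while consecutive differences stay equal to step.
        while j < len(vals) and vals[j] - vals[j - 1] == step:
            j += 1
        out.append((first, step, j - i))
        i = j
    return out


if __name__ == "__main__":
    print(to_progressions({1, 3, 6}))
    print(to_progressions(range(1, 1001)))
```

A highly periodic pair of strings can have a very large overlap set that still collapses into a single triple, which is why the representation stays small even when $N$ is exponential in $n$.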
24
How to extend overlaps
$\text{aba} \in OL(X_l, Y_R)$
[Figure, animated over slides 24 to 30: $X_i = X_l X_r$ is aligned with $Y_j = Y_L Y_R$ at the overlap aba, and the characters adjacent to the overlap are compared pair by pair; the comparison continues while they match and stops at a mismatch.]
We are not allowed to process character by character.
31
First-mismatch function [Karpinski et al 1996]
input: SLP variables $X_i$ and $Y_j$, integer $k$
output: position of the first mismatch
[Figure: $Y_j$ aligned against $X_i$ at offset $k$, with the first mismatching position $p$ marked.]
32
First-mismatch function [Karpinski et al 1996]
Lemma: Provided that the sets of overlaps are already computed, $FM(X_i, Y_j, k)$ can be computed in $O(n \log n)$ time.
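The interface of the FM function can be illustrated on plain strings. The exact definition did not survive in the slide text, so the convention below is an assumption: `first_mismatch(x, y, k)` lays `y` over `x` starting at 0-based position `k` of `x` and returns how many characters agree before the first disagreement, or before either string ends. It scans character by character, which the compressed algorithm must avoid.

```python
def first_mismatch(x, y, k):
    """Offset of the first mismatch when y is compared with x[k:].

    Assumed convention: the return value p means x[k + q] == y[q] for
    all q < p, and either a mismatch occurs at offset p or one of the
    two strings ends there.
    """
    if not 0 <= k <= len(x):
        raise ValueError("k must be a position in x")
    p = 0
    while k + p < len(x) and p < len(y) and x[k + p] == y[p]:
        p += 1
    return p


if __name__ == "__main__":
    print(first_mismatch("ababaabab", "abaab", 0))
    print(first_mismatch("ababaabab", "abaab", 2))
```

In the first call the strings agree on "aba" and then differ; in the second, all of `y` matches.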
33
Extending overlaps using FM function
Lemma: Extending overlaps can be done by $O(n)$ calls of the FM function.
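On uncompressed strings, extending one overlap looks like the sketch below. It assumes the setting of the earlier slides: $X_i = X_l X_r$, $Y_j = Y_L Y_R$, and an overlap of length `k` between a suffix of $X_l$ and a prefix of $Y_R$. The function name and the character-by-character loops are illustrative; in the compressed setting these scans are what the FM calls replace.

```python
def extend_overlap(xl, xr, yl, yr, k):
    """Length of the common substring of xl + xr and yl + yr obtained by
    extending the overlap formed by the last k characters of xl and the
    first k characters of yr."""
    if k < 1 or k > min(len(xl), len(yr)) or xl[-k:] != yr[:k]:
        raise ValueError("k is not an overlap length of xl and yr")
    # Extend to the right: xr continues after xl, yr continues after the overlap.
    right = 0
    while (right < len(xr) and k + right < len(yr)
           and xr[right] == yr[k + right]):
        right += 1
    # Extend to the left: the rest of xl against the end of yl.
    xl_rest = xl[:len(xl) - k]
    left = 0
    while (left < len(xl_rest) and left < len(yl)
           and xl_rest[-1 - left] == yl[-1 - left]):
        left += 1
    return left + k + right


if __name__ == "__main__":
    # Overlap "ab" between xl = "xxab" and yr = "abcdw".
    print(extend_overlap("xxab", "cdz", "qx", "abcdw", 2))
```

In the usage example the overlap "ab" grows by one character to the left and two to the right, giving the common substring "xabcd" of "xxabcdz" and "qxabcdw".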
34
Computing longest common substring
Annotations on the pseudo-code:
– $O(n^2)$ items
– $O(n)$ calls of the FM function per item
– $O(n \log n)$ time per call
In total, $LCStr(S, T)$ can be computed in $O(n^2 \times n \times n \log n) = O(n^4 \log n)$ time.
35
Conclusions
Computing longest common substring from compressed strings
– $O(n^4 \log n)$ time and $O(n^3)$ space
Computing all palindromes from compressed strings
– $O(n^4)$ time and $O(n^2)$ space
36
Thank you for your attention.