Published by Godfrey Garrison. Modified over 6 years ago.
1
Approximate Matching of Run-Length Compressed Strings
Veli Mäkinen*, Gonzalo Navarro**, Esko Ukkonen*
* Department of Computer Science, University of Helsinki, Finland
** Department of Computer Science, University of Chile, Santiago, Chile
2
The Problem
Run-length coding of a string is RL(aaaabbbbbbccaaabb) = a4b6c2a3b2. Problem: Given two run-length encoded strings, calculate their edit distance. The trivial solution is to decompress and then calculate, but can it be done faster? July 1, 2001 Approximate Matching of Run-Length Compressed Strings, CPM'2001
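A minimal sketch of the run-length coding used in the example above. The function names (`rl_encode`, `rl_decode`, `rl_format`) are illustrative assumptions, not names from the talk.

```python
from itertools import groupby


def rl_encode(s):
    """Return the run-length encoding of s as (letter, run_length) pairs."""
    return [(ch, sum(1 for _ in run)) for ch, run in groupby(s)]


def rl_decode(runs):
    """Expand (letter, run_length) pairs back into the plain string."""
    return "".join(ch * length for ch, length in runs)


def rl_format(runs):
    """Write the runs in the slide's notation, e.g. a4b6c2a3b2."""
    return "".join(f"{ch}{length}" for ch, length in runs)


runs = rl_encode("aaaabbbbbbccaaabb")
print(rl_format(runs))  # a4b6c2a3b2
```

Here the uncompressed length is n = 17 and the compressed length is n’ = 5 runs.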
3
Variations of the scheme
Different variations of edit distance. Levenshtein distance: the minimum number of character insertions, deletions, and substitutions needed to convert one string into another. Longest common subsequence (LCS): the distance D_ID, the minimum number of insertions and deletions, is dual to the LCS: 2·|LCS(A,B)| = m+n-D_ID(A,B).
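The two distances and the duality with the LCS can be checked on uncompressed strings with the classical quadratic dynamic programming. This is only a reference sketch of "the trivial solution"; the function names are assumptions made for the example.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (ca != cb), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def indel_distance(a, b):
    """D_ID: minimum number of insertions and deletions (no substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1])
            else:
                cur.append(1 + min(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_length(a, b):
    """Length of a longest common subsequence."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


a, b = "aaaaabbbbccccaa", "aaabbbbaaaa"
# Duality: 2 * |LCS(A,B)| = m + n - D_ID(A,B)
assert 2 * lcs_length(a, b) == len(a) + len(b) - indel_distance(a, b)
```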
4
Variations of the scheme (2)
Search problem: Find all approximate occurrences of a short pattern P inside a large text T, where both the pattern and the text are run-length encoded. By an approximate occurrence we mean that the (edit) distance between the pattern and a substring of the text is at most some threshold value k.
5
Previous Results
Bunke & Csirik, 1995: O(m’n+mn’) for calculating the LCS between two strings of lengths m and n, run-length encoded to lengths m’ and n’. Apostolico, Landau & Skiena, 1997: O(m’n’ log(m’n’)) for the LCS. Mitchell, 1997: O((m’+n’+p) log(m’+n’+p)) for the LCS, where p is the number of matches between compressed characters.
6
Our Results
The first algorithm for the Levenshtein distance in this context: O(m’n+mn’), obtained by generalizing the result of Bunke & Csirik for the LCS.* Search algorithm for the Levenshtein and LCS distances: O(mm’n’). * Arbell, Landau & Mitchell independently found a similar algorithm.
7
Our Results (2)
O(min(d² min(m’,n’), m’n’ max(m’,n’))) for the LCS, where d = m+n-2·|LCS(A,B)|. Conjecture: O(m’n’) average case for the LCS. Experimental results support the conjecture.
8
Bunke & Csirik algorithm for the LCS
[Figure: the dynamic programming matrix for aaaaabbbbccccaa and aaabbbbaaaa, divided into boxes by the run boundaries.] Equal letter box: values are copied from the diagonal. Different letter box: minimum of two values: (value from the left border + distance, value from the upper border + distance). 2·|LCS(A,B)| = m+n-D_ID(A,B) ==> |LCS(aaaaabbbbccccaa, aaabbbbaaaa)| = (15+11-8)/2 = 9
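The two box rules can be sketched as follows. This is an illustrative reading of the slide, not the published code of Bunke & Csirik: it keeps only the bottom and right border of each box, so each box costs time proportional to its height plus its width, which sums to O(m’n+mn’). The names `rl_encode` and `indel_distance_rle` are assumptions.

```python
from itertools import groupby


def rl_encode(s):
    return [(ch, sum(1 for _ in run)) for ch, run in groupby(s)]


def indel_distance_rle(runs_a, runs_b):
    """D_ID of two run-length encoded strings, computing box borders only."""
    if not runs_a or not runs_b:
        return sum(h for _, h in runs_a) + sum(w for _, w in runs_b)
    # Top borders of the first row of boxes: row 0 of the DP matrix.
    prev_bottom, col = [], 0
    for _, w in runs_b:
        prev_bottom.append(list(range(col, col + w + 1)))
        col += w
    row = 0
    for a, h in runs_a:
        left = list(range(row, row + h + 1))  # column 0 of the DP matrix
        new_bottom = []
        for (b, w), top in zip(runs_b, prev_bottom):
            if a == b:
                # Equal letter box: follow the diagonal back to a border cell.
                bottom = [top[x - h] if x >= h else left[h - x] for x in range(w + 1)]
                right = [top[w - y] if y <= w else left[y - w] for y in range(h + 1)]
            else:
                # Different letter box: left border + distance or top border + distance.
                bottom = [min(left[h] + x, top[x] + h) for x in range(w + 1)]
                right = [min(left[y] + w, top[w] + y) for y in range(h + 1)]
            new_bottom.append(bottom)
            left = right  # the right border is the next box's left border
        prev_bottom = new_bottom
        row += h
    return prev_bottom[-1][-1]


a, b = "aaaaabbbbccccaa", "aaabbbbaaaa"
d = indel_distance_rle(rl_encode(a), rl_encode(b))
print(d, (len(a) + len(b) - d) // 2)  # 8 9
```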
9
O(m’n+mn’) for the Levenshtein distance
[Figure: the dynamic programming matrix for aaaaabbbbccccaa and aaabbbbaaaa, divided into boxes by the run boundaries.] Equal letter boxes are calculated as in the LCS. Different letter boxes are more difficult...
10
O(m’n+mn’) for the Levenshtein distance (2)
[Figure: a different letter box showing a border cell X with the labels mintop, minleft, t, and l.] X = min(mintop+t, minleft+l)
11
O(m’n+mn’) for the Levenshtein distance (3)
[Figure, animation step: min=1 (0: 0, 1: 1, 2: 1, 3: 0), mintop=1, t=1, minleft=3, l=5.] X = min(mintop+t, minleft+l) = 2
12
O(m’n+mn’) for the Levenshtein distance (4)
[Figure, animation step: min=0 (0: 1, 1: 1, 2: 1, 3: 0), mintop=0, t=2, minleft=3, l=5.] X = min(mintop+t, minleft+l) = 2
13
O(m’n+mn’) for the Levenshtein distance (5)
[Figure, animation step: min=0 (0: 1, 1: 2, 2: 1, 3: 0), mintop=0, t=3, minleft=3, l=5.] X = min(mintop+t, minleft+l) = 3
14
O(m’n+mn’) for the Levenshtein distance (6)
[Figure, animation step: min=0 (0: 1, 1: 2, 2: 2, 3: 0), mintop=0, t=4, minleft=3, l=5.] X = min(mintop+t, minleft+l) = 4
15
O(m’n+mn’) for the Levenshtein distance (7)
[Figure, animation step: min=0 (0: 1, 1: 2, 2: 1, 3: 1), mintop=0, t=4, minleft=3, l=4.] X = min(mintop+t, minleft+l) = 4
16
O(m’n+mn’) for the Levenshtein distance (8)
[Figure, animation step: min=0 (0: 1, 1: 1, 2: 1, 3: 1), mintop=0, t=4, minleft=4, l=3.] X = min(mintop+t, minleft+l) = 4
17
O(m’n+mn’) for the Levenshtein distance (9)
[Figure, animation step: min=1 (0: 0, 1: 1, 2: 1, 3: 1), mintop=1, t=4, minleft=4, l=2.] X = min(mintop+t, minleft+l) = 5
18
O(m’n+mn’) for the Levenshtein distance (10)
[Figure, animation step: min=2 (0: 0, 1: 0, 2: 1, 3: 1), mintop=2, t=4, minleft=4, l=1.] X = min(mintop+t, minleft+l) = 5
19
O(m’n+mn’) for the Levenshtein distance (11)
Different letter boxes can be calculated as fast as equal letter boxes: m’ rows with n cells + n’ columns with m cells ==> O(m’n+mn’).
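The rule X = min(mintop+t, minleft+l) from the preceding slides can be checked against the full Levenshtein matrix. The sketch below rests on an assumed reading of the labels: t and l are the distances of X from the top and left borders of its box, and mintop and minleft are minima over the border cells within those distances of X. It recomputes each minimum by scanning, so it only verifies the rule and does not reach the O(m’n+mn’) bound. All names are illustrative.

```python
from itertools import groupby


def levenshtein_matrix(a, b):
    """Full (m+1) x (n+1) Levenshtein matrix, used here as the reference."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d


def run_bounds(s):
    """(start, end) index pairs of the maximal runs of s."""
    bounds, pos = [], 0
    for _, run in groupby(s):
        length = sum(1 for _ in run)
        bounds.append((pos, pos + length))
        pos += length
    return bounds


def different_box_cell(d, i0, j0, i, j):
    """Value of cell (i, j) from the borders (row i0, column j0) of its box."""
    t, l = i - i0, j - j0
    mintop = min(d[i0][c] for c in range(max(j0, j - t), j + 1))
    minleft = min(d[r][j0] for r in range(max(i0, i - l), i + 1))
    return min(mintop + t, minleft + l)


def rule_holds(a, b):
    """Check the rule on every cell of every different letter box."""
    d = levenshtein_matrix(a, b)
    for i0, i1 in run_bounds(a):
        for j0, j1 in run_bounds(b):
            if a[i0] == b[j0]:
                continue  # equal letter box, handled by the diagonal rule
            for i in range(i0 + 1, i1 + 1):
                for j in range(j0 + 1, j1 + 1):
                    if different_box_cell(d, i0, j0, i, j) != d[i][j]:
                        return False
    return True


print(rule_holds("aaaaabbbbccccaa", "aaabbbbaaaa"))  # True
```

Because neighbouring border values differ by at most one, a window minimum can be maintained with counters as the window slides, which is presumably what the counters in the animation track.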
20
Approximate searching
Can be done by “setting the first row to zero”, but the time complexity is then O(m’n+mn’). If all runs in the text are shorter than 2m, then n < 2mn’, and the time complexity is O(m’mn’). If a run is longer than 2m-1, only its first 2m columns need to be calculated; the rest are equal to the last of them. ==> O(m’mn’) search algorithm.
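Setting the first row to zero lets an occurrence start at any text position. A minimal uncompressed sketch of that idea for the Levenshtein distance follows; it processes the text column by column and does not exploit the runs. The function name and the convention of reporting end positions (number of text characters consumed) are assumptions for the example.

```python
def approximate_search(pattern, text, k):
    """End positions j such that some substring of text ending at j
    is within Levenshtein distance k of pattern."""
    m = len(pattern)
    col = list(range(m + 1))  # column for the empty text prefix
    ends = []
    for j, ct in enumerate(text, 1):
        new = [0]  # first row is zero: a match may start anywhere
        for i, cp in enumerate(pattern, 1):
            new.append(min(col[i - 1] + (cp != ct), col[i] + 1, new[i - 1] + 1))
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends


print(approximate_search("abc", "xxabcxx", 0))  # [5]
print(approximate_search("abc", "xxabcxx", 1))  # [4, 5, 6]
```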
21
“Greedy” algorithm for the LCS
Idea: Calculate only corners. [Figure: the box grid for aaaaabbbbccccaa and aaabbbbaaaa with values shown at box corners only.] Different letter boxes are easy; a corner value can be calculated in constant time from the corners above and on the left. Equal letter boxes: corner values can be traced. Time complexity is O(min(m’n’(m’+n’), m’n+mn’)).
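The corner idea can be illustrated with a memoized recursion that evaluates only the cells it needs: in a different letter box a value comes from the left and top borders, and in an equal letter box the diagonal is followed back to the box border. This is a simplified sketch, not the authors' tracing procedure, and it makes no claim about matching the stated time bound. The recursion depth grows with the string lengths, so it is meant for short inputs. All names are assumptions.

```python
from bisect import bisect_left
from functools import lru_cache
from itertools import accumulate, groupby


def indel_distance_by_tracing(a, b):
    """D_ID(a, b), evaluating only cells reachable by the two box rules."""
    if not a or not b:
        return len(a) + len(b)

    def run_ends(s):
        return list(accumulate(sum(1 for _ in run) for _, run in groupby(s)))

    ends_a, ends_b = run_ends(a), run_ends(b)

    @lru_cache(maxsize=None)
    def value(i, j):
        if i == 0:
            return j
        if j == 0:
            return i
        p = bisect_left(ends_a, i)  # run of a containing row i
        q = bisect_left(ends_b, j)  # run of b containing column j
        i0 = ends_a[p - 1] if p else 0  # row of the box's top border
        j0 = ends_b[q - 1] if q else 0  # column of the box's left border
        if a[i0] == b[j0]:
            # Equal letter box: trace the diagonal back to the box border.
            k = min(i - i0, j - j0)
            return value(i - k, j - k)
        # Different letter box: left border + distance or top border + distance.
        return min(value(i, j0) + (j - j0), value(i0, j) + (i - i0))

    return value(len(a), len(b))


print(indel_distance_by_tracing("aaaaabbbbccccaa", "aaabbbbaaaa"))  # 8
```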
22
Diagonal algorithm for the LCS
Calculate first inside the diagonal band 0,...,(n-m). [Figure: the box grid for aaaaabbbbccccaa and aaabbbbaaaa with corner values inside the band.] The shortest path that goes outside this band has cost > d’ = (n-m)+1. If d_mn ≤ d’, then the band is wide enough. If d_mn > d’, then double d’ and widen the band so that the shortest path that goes outside the band has cost > d’.
23
Diagonal algorithm for the LCS (2)
At the beginning, d’ = 4+1 = 5. As d_mn = 6 > 5, we have to double d’. [Figure: the box grid for aaaaabbbbccccaa and aaabbbbaaaa with corner values inside the widened band.] After the first doubling, d’ = 10 and the diagonal band is -3,...,+7. As d_mn = 8 < d’, the band is wide enough. ==> We can stop the doubling.
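The doubling scheme can be sketched on the plain (uncompressed) matrix for the distance D_ID. The sketch computes whole rows inside the band rather than box corners, so it shows only the band and doubling logic, and its intermediate values need not coincide with those on the slides. The band for a given d’ is widened by x = (d’-(n-m)) div 2 diagonals on each side, which gives -3,...,+7 for d’ = 10 and n-m = 4. Function names are assumptions.

```python
def banded_indel(a, b, x):
    """D_ID restricted to diagonals -x .. (n-m)+x; requires len(a) <= len(b)."""
    m, n = len(a), len(b)
    inf = float("inf")
    lo, hi = -x, (n - m) + x
    prev = [j if j <= hi else inf for j in range(n + 1)]
    for i in range(1, m + 1):
        cur = [inf] * (n + 1)
        for j in range(max(0, i + lo), min(n, i + hi) + 1):
            if j == 0:
                cur[j] = i
                continue
            best = 1 + min(prev[j], cur[j - 1])  # deletion or insertion
            if a[i - 1] == b[j - 1]:
                best = min(best, prev[j - 1])  # free diagonal step on a match
            cur[j] = best
        prev = cur
    return prev[n]


def indel_distance_doubling(a, b):
    """D_ID by band doubling: widen the band until the result fits under d'."""
    if len(a) > len(b):
        a, b = b, a  # D_ID is symmetric
    m, n = len(a), len(b)
    limit = (n - m) + 1  # d' on the slides
    while True:
        # Any path leaving this band costs at least (n-m) + 2(x+1) > limit.
        x = (limit - (n - m)) // 2
        d = banded_indel(a, b, x)
        if d <= limit:
            return d
        limit *= 2


print(indel_distance_doubling("aaabbbbaaaa", "aaaaabbbbccccaa"))  # 8
```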
24
Diagonal algorithm for the LCS (3)
Time complexity is O(min(d² min(m’,n’), m’n’ max(m’,n’))): [Figure: the box grid for aaaaabbbbccccaa and aaabbbbaaaa with corner values inside the final band.] The number of diagonals after the last doubling is d+1. The sum of diagonals before the last doubling is < d+1. Each diagonal has at most O(min(m’,n’)) corners. The length of each tracing path can be limited by 2d.
25
Improving the “greedy” algorithm (1): Skipping different letter boxes
Observation. Runs of different letter boxes can be skipped in tracing paths. ==> Average case O(m’n’ max(m’,n’)/|Σ|²). [Figure: the box grid for aaaaabbbbccccaa and aaabbbbaaaa with corner values.]
26
Improving the “greedy” algorithm (2): Bridges
Observation. In different letter boxes, all values are known on the bottom or on the right border. ==> Tracing a corner value can be stopped at a bridge. [Figure: the box grid for aaaaabbbbccccaa and aaabbbbaaaa with the known border values (bridges) filled in.]
27
O(m’n’) average case?
Conjecture: Assuming that run-lengths are equally distributed in both strings with the same mean, the expected running time of the “greedy” algorithm using the bridge property is O(m’n’). Experimental results support the conjecture: for randomly generated data the average tracing path length was < 2.
28
Experimental results (1): Lengths of the runs
m’=n’=2000, |Σ|=2
29
Experimental results (2): m’ n’
|Σ|=2, runs in [1,1000]
30
Experimental results (3): |Σ|
m’=n’=2000, runs in [1,1000]
31
Experimental results (4): B = “k random insertions/deletions on A”
m’=n’=2000, runs in [1,1000]
32
Experimental results (5): Real data
All pairs of lines in three black/white images.
33
References
A. Apostolico, G. Landau, and S. Skiena. Matching for run-length encoded strings. J. of Complexity, 15:4--16, 1999. (Also at Sequences '97, Positano, Italy, June 1997.)
O. Arbell, G. Landau, and J. Mitchell. Edit distance of run-length encoded strings. Submitted for publication, August 2000.
H. Bunke and J. Csirik. An algorithm for matching run-length coded strings. Computing, 50, 1993.
H. Bunke and J. Csirik. An improved algorithm for computing the edit distance of run-length coded strings. Information Processing Letters, 54(2):93--96, 1995.
V. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 6, 1966.
J. Mitchell. A geometric shortest path problem, with application to computing a longest common subsequence in run-length encoded strings. Technical Report, Dept. of Applied Mathematics, SUNY Stony Brook, 1997.
34
References...
P. Sellers. The theory and computation of evolutionary distances: Pattern recognition. J. of Algorithms, 1(4), 1980.
E. Ukkonen. Algorithms for approximate string matching. Information and Control 64(1--3), 1985.
E. Ukkonen. Finding approximate patterns in strings. J. of Algorithms 6(1--3), 1985.
R. Wagner and M. Fischer. The string-to-string correction problem. J. of the ACM 21(1), 1974.