By Makinen, Navarro and Ukkonen. Abstract Let A and B be two run-length encoded strings of encoded lengths m’ and n’, respectively. we will show an O(m’n+n’m)

By Makinen, Navarro and Ukkonen

Abstract: Let A and B be two run-length encoded strings of encoded lengths m’ and n’, respectively (m and n are their uncompressed lengths). We will show an O(m’n + n’m) time algorithm that computes their edit distance. Then, given a short pattern A, a long text B and a threshold parameter K, we will show an algorithm that reports all the approximate occurrences of A in B, that is, the substrings of B that are at edit distance K or less from the pattern.

Example of simple edit distance: A = aaabbc, B = abbdb. The matrix starts with its first row and first column filled in:

         a  a  a  b  b  c
      0  1  2  3  4  5  6
   a  1
   b  2
   b  3
   d  4
   b  5

We distinguish between three kinds of edit distance:
Levenshtein distance - D_L(A,B): Insertion = 1, Deletion = 1, Substitution = 1.
Indel distance - D_ID(A,B): Insertion = 1, Deletion = 1, Substitution = ∞ (no substitution).
Global distance - D_G(A,B): Arbitrary cost for Insertion, Deletion, Substitution.
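The three distances differ only in their cost parameters, so a single dynamic-programming routine can illustrate all of them. The following is a minimal sketch of the classical uncompressed O(mn) computation, not the run-length algorithm of the talk; the function name `edit_distance` and its parameter names are chosen here for illustration.

```python
import math


def edit_distance(a, b, c_ins=1, c_del=1, c_sub=1):
    """Cost of transforming a into b, using the full (uncompressed) DP."""
    # Row 0: turning the empty prefix of a into b[:j] takes j insertions.
    prev = [j * c_ins for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: turning a[:i] into the empty string takes i deletions.
        cur = [i * c_del]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else c_sub)
            cur.append(min(diag, prev[j] + c_del, cur[j - 1] + c_ins))
        prev = cur
    return prev[-1]


# Levenshtein distance D_L: all three operations cost 1.
d_l = edit_distance("aaabbc", "abbdb")
# Indel distance D_ID: substitutions are forbidden (infinite cost).
d_id = edit_distance("aaabbc", "abbdb", c_sub=math.inf)
print(d_l, d_id)
```

Passing arbitrary values for the three cost parameters gives the global distance D_G.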

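Keywords
Run-length compressed: aaaabbcccaab = (a,4),(b,2),(c,3),(a,2),(b,1).
White box: the part of the matrix where a run of A meets a run of B of the same letter (for example, aaaa against aaa).
Black box: the part of the matrix where a run of A meets a run of B of a different letter (for example, aaaa against bbb).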
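As a small illustration of the run-length encoding used throughout, here is a sketch that produces the (letter, length) pairs shown above. The function name `run_length_encode` is an assumption made for this example.

```python
from itertools import groupby


def run_length_encode(s):
    """Return the runs of s as a list of (letter, run length) pairs."""
    return [(letter, len(list(run))) for letter, run in groupby(s)]


print(run_length_encode("aaaabbcccaab"))
```

The encoded lengths m’ and n’ are simply the lengths of the lists returned for A and B.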

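Dividing the Edit Distance matrix into boxes
Columns: aaaabbbbbbcccccbb = (a,4),(b,6),(c,5),(b,2).
Rows: aaabbbbbaaaaaabb = (a,3),(b,5),(a,6),(b,2).
[Figure: the edit distance matrix for these two strings, divided into one box for each pair of runs.]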

An O(mn’ + m’n) Algorithm for D_L
Equal Letter Box (White):
– “Copying” the values from the upper and left borders to the bottom and right borders, using as many diagonal moves as possible.
Example:

   7 8 8 8 9
   7       8
   8       8
   9 8 7 7 8

In an equal letter box, D_ID = D_L because no substitutions are needed.
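The copying step can be sketched as follows. This is an illustrative reading of the slide, assuming the box borders are given as two lists that share the corner cell (`top[0] == left[0]`); the function name and argument layout are assumptions, not taken from the talk.

```python
def white_box_borders(top, left):
    """Fill the bottom and right borders of an equal-letter box.

    top  : values of the upper border, corner first
    left : values of the left border, corner first (left[0] == top[0])
    Each output cell copies the border value found by walking diagonally
    up and to the left until the upper or left border is reached.
    """
    rows, cols = len(left) - 1, len(top) - 1

    def cell(t, s):
        # t rows below the top border, s columns right of the left border.
        return top[s - t] if s >= t else left[t - s]

    bottom = [cell(rows, s) for s in range(1, cols + 1)]
    right = [cell(t, cols) for t in range(1, rows)]
    return bottom, right


bottom, right = white_box_borders([7, 8, 8, 8, 9], [7, 7, 8, 9])
print(bottom, right)
```

Each border cell is computed in constant time, so a box costs time proportional to the length of its borders rather than to its area.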

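An O(mn’ + m’n) Algorithm for D_L
Different Letter Box (Black):
– Filling only the borders. A border cell that lies t rows below the upper border and s columns to the right of the left border gets the value
– 1 + min(t-1 + min relevant upper border, s-1 + min relevant left border).

Three cases arise, illustrated by the cells (s,t) = (3,5) where t > s, (5,3) where t < s, and (5,5) where t = s.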

Example:

   7 8 8 8  9
   7        9
   8       10
   9 9 9 10 11

Bottom border, left to right:
1 + min(3-1 + min(7,8), 1-1 + min(8,9)) = 9
1 + min(3-1 + min(7,8,8), 2-1 + min(7,8,9)) = 9
1 + min(3-1 + min(7,8,8,8), 3-1 + min(7,7,8,9)) = 10
1 + min(3-1 + min(8,8,8,9), 4-1 + min(7,7,8,9)) = 11
Right border, bottom to top:
1 + min(2-1 + min(8,8,9), 4-1 + min(7,7,8)) = 10
1 + min(1-1 + min(8,9), 4-1 + min(7,7)) = 9
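The border formula can be checked against the worked example with a short sketch. The index ranges used below for the "relevant" border cells are inferred from the numbers in the example, not stated explicitly in the talk, and the function names are made up for this illustration. The minimum over each range is taken naively here; the stated running time relies on obtaining it faster.

```python
def black_box_cell(top, left, t, s):
    """Value of the cell t rows down and s columns across in a black box.

    top and left are the upper and left borders, both starting at the
    shared corner cell. The relevant border cells are assumed to be those
    from which the cell can be reached with diagonal moves followed by
    straight moves.
    """
    upper = min(top[max(0, s - t): s + 1])
    side = min(left[max(0, t - s): t + 1])
    return 1 + min(t - 1 + upper, s - 1 + side)


def black_box_borders(top, left):
    rows, cols = len(left) - 1, len(top) - 1
    bottom = [black_box_cell(top, left, rows, s) for s in range(1, cols + 1)]
    right = [black_box_cell(top, left, t, cols) for t in range(1, rows)]
    return bottom, right


bottom, right = black_box_borders([7, 8, 8, 8, 9], [7, 7, 8, 9])
print(bottom, right)
```

On this example the sketch agrees with filling the whole box cell by cell using the usual recurrence with all letters mismatching.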

Three different points along our computation: t > s, t = s and t < s.

Extending the Algorithm to Global Edit Distance
Inside a given box there are only three different costs involved:
C_I - insertion cost.
C_D - deletion cost.
C_S - substitution cost.
Since the triangle inequality C_S < C_I + C_D holds, we will not differentiate between white boxes and black boxes. We assume without loss of generality that the costs C_I and C_D are the same in all boxes.

An O(mn’ + m’n) Algorithm for D_G
Filling the borders: for each cell (s,t) in the border:
cell(s,t) = min(upper triangle values, left triangle values)
triangle values = min over the relevant border cells of (border cell value + cost of its path to (s,t))
path cost = C_S * number of diagonal moves + C_I * number of insertions + C_D * number of deletions.

Example: C_I = 7, C_D = 4, C_S = 10

   7  8  8  8  9
   7          13
   8          17
   9 16 20 20 21

In each line the first inner min ranges over the left border and the second over the upper border.
Bottom border, left to right:
min(min(9+1*7, 8+1*10), min(7+1*10+2*4, 8+3*4)) = 16
min(min(9+2*7, 8+1*10+1*7, 7+2*10), min(7+2*10+1*4, 8+2*4+1*10, 8+3*4)) = 20
min(min(9+3*7, 8+1*10+2*7, 7+2*10+1*7, 7+3*10), min(7+3*10, 8+2*10+1*4, 8+1*10+2*4, 8+3*4)) = 20
min(min(9+4*7, 8+1*10+3*7, 7+2*10+2*7, 7+3*10+1*7), min(8+3*10, 8+2*10+1*4, 8+1*10+2*4, 9+3*4)) = 21
Right border, bottom to top:
min(min(8+4*7, 7+1*10+3*7, 7+2*10+2*7), min(8+2*10, 8+1*10+1*4, 9+2*4)) = 17
min(min(7+4*7, 7+1*10+3*7), min(8+1*10, 9+1*4)) = 13
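The triangle computation of the example can be reproduced with the sketch below. It is the naive version, which scans every relevant border cell for each output cell and does not yet use any precomputation. The function names are illustrative, and the assumption (read off the example) is that horizontal moves cost C_I and vertical moves cost C_D.

```python
import math


def global_box_cell(top, left, t, s, c_i, c_d, c_s):
    """Cheapest way to reach cell (t rows down, s columns across)."""
    best = math.inf
    # Upper triangle: start at top[j], go diagonally, then straight down.
    for j in range(max(0, s - t), s + 1):
        diagonal = s - j
        best = min(best, top[j] + diagonal * c_s + (t - diagonal) * c_d)
    # Left triangle: start at left[i], go diagonally, then straight right.
    for i in range(max(0, t - s), t + 1):
        diagonal = t - i
        best = min(best, left[i] + diagonal * c_s + (s - diagonal) * c_i)
    return best


def global_box_borders(top, left, c_i, c_d, c_s):
    rows, cols = len(left) - 1, len(top) - 1
    bottom = [global_box_cell(top, left, rows, s, c_i, c_d, c_s)
              for s in range(1, cols + 1)]
    right = [global_box_cell(top, left, t, cols, c_i, c_d, c_s)
             for t in range(1, rows)]
    return bottom, right


bottom, right = global_box_borders([7, 8, 8, 8, 9], [7, 7, 8, 9],
                                   c_i=7, c_d=4, c_s=10)
print(bottom, right)
```

Here `right` is listed from top to bottom, so it is the reverse of the order used in the example above.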

Complexity: stays O(m’n + n’m), the same as for the Levenshtein algorithm, because we only add constant-time calculations (multiplications).

Example: columns aaaabbbbbbcccccbb, rows aaabbbbbaaaaaabb. The first row is initialized with 0, 1, 2, ..., 10, ... and the first column with 0, 1, 2, ..., 8, .... After the first box (a white box of 3 rows by 4 columns) has been processed, its borders are:

        a  a  a  a
     0  1  2  3  4
  a  1           3
  a  2           2
  a  3  2  1  0  1

Notice
There are two cases when computing the borders:
1. cell(s,t) gets its minimum value from the left border.
2. cell(s,t) gets its minimum value from the top border.
In case 1, the minimum path cost to cell(s,t) can be written as
CellValue = BorderCellValue + PathCost(BorderCell, Cell)
          = BorderCellValue + Diagonal * C_S + (Position – Diagonal) * C_I
          = BorderCellValue + Diagonal * (C_S – C_I) + Position * C_I.
The term BorderCellValue + Diagonal * (C_S – C_I) does not depend on Position! Hence it can be calculated in advance and kept in an array for all border cells, so that each CellValue calculation takes only constant time. The same applies to case 2, with C_I replaced by C_D.

Approximate Searching
Given a string A (short pattern), a string B (long text) and a threshold parameter K, we are interested in reporting all the “approximate occurrences” of A in B with K errors or less (the positions of the substrings of B that are at distance K or less from the pattern A).
A = aabca
B = aaeeeeeebbbbbbbaaaaaabbbbbbcccaaaccccabcdcdcdeeeaaab
K = 3

Classical algorithm to find the “matches”
Computes a matrix exactly like the previous algorithm, with the only difference that the first row of the matrix is initialized with zeros. The last row of the matrix is examined, and every text position whose value is at most K is reported as a match.

      a a b b b b a a a c b b b b c c a
      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
   a  0 0 1 1 1 1 0 0 0 1 1 1 1 1 1 1 0
   a  1 0 1 2 2 2 1 0 0 1 2 2 2 2 2 2 1
   b  2 1 0 1 2 2 2 1 1 1 1 2 2 2 3 3 2
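A minimal sketch of this classical, uncompressed search is given below, using unit (Levenshtein) costs. The function names are assumptions made for the example, and positions are reported as 1-based end positions in the text.

```python
def last_row(pattern, text):
    """Last row of the search matrix (without its column 0)."""
    # First row is all zeros: an occurrence may start anywhere in the text.
    prev = [0] * (len(text) + 1)
    for i, p in enumerate(pattern, start=1):
        cur = [i]  # column 0: matching pattern[:i] against nothing costs i
        for j, c in enumerate(text, start=1):
            cur.append(min(prev[j - 1] + (p != c),  # match or substitution
                           prev[j] + 1,             # skip a pattern letter
                           cur[j - 1] + 1))         # skip a text letter
        prev = cur
    return prev[1:]


def approximate_matches(pattern, text, k):
    """1-based text positions where an occurrence with <= k errors ends."""
    return [j for j, d in enumerate(last_row(pattern, text), start=1) if d <= k]


print(last_row("aab", "aabbbbaaacbbbbcca"))
print(approximate_matches("aab", "aabbbbaaacbbbbcca", 1))
```

The first call reproduces the last row of the matrix above; the running time is O(mn) on the uncompressed strings.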

More efficient algorithm – pattern and text are run-length compressed:
Fill the matrix only at the beginning of text runs. Complete only the first m columns of each run.
Complexity = O(m^2 n’ + R): each run costs m*m, there are n’ runs, and R is the size of the output.

Improving the trivial algorithm
Problem: We would have wanted to apply the D_L algorithm, but if the text is very large, m’n may be a lot bigger than m^2 n’.
Solution: Combine the two. Evaluating only the borders of the runs, and only the first m cells of each run, takes O(m’m + m + m) time per run of the text; multiplying by n’ gives the final O(m’mn’ + R) complexity.
