Published by Melina Carroll. Modified over 8 years ago.
1
Dynamic Programming (Edit Distance)
2
Edit Distance
Input:
– Two input strings S1 (of size n) and S2 (of size m)
E.g., S1 = ATTTCTAGTGGGTAAA
S2 = ATCTAGTTTAGGGATA
Target:
– Find the smallest distance between S1 and S2
– In other words, the smallest number of edit operations needed to convert S1 into S2
Edit Operations:
– Insert (i), Delete (d), Align (a)
3
Example
S1: TCGACGTCA
S2: TGACGTGC
Three operations convert S1 into S2:
S1: TCGACGT-CA
S2: T-GACGTGC-
– Delete C (position 2) and A (position 10)
– Insert G (position 8)
4
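Edit Distance
[Grid: rows are labeled by the characters of S1 = TGACGTGC, columns by the characters of S2 = TCGACGTCA. Row 0 stands for "S1 is empty" and column 0 for "S2 is empty".]
Each cell holds the cost of the edit operations on S1 that convert it into S2.
– Cost of insert: i
– Cost of delete: d
– Cost of align: a
M[0][0] = 0
M[0][1] = i (cost of inserting T into S1 to match S2)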
5
Edit Distance
M[0][2] = 2i (cost of inserting TC into S1 to match S2)
6
Edit Distance
Row 0 complete (S1 is empty): 0, i, 2i, 3i, 4i, 5i, 6i, 7i, 8i, 9i
7
Edit Distance
M[1][0] = d (cost of deleting T from S1 to match S2)
8
Edit Distance
M[2][0] = 2d (cost of deleting TG from S1 to match S2)
9
Edit Distance
Column 0 complete (S2 is empty): 0, d, 2d, 3d, 4d, 5d, 6d, 7d, 8d
10
Edit Distance
What we did so far is called the Initialization Phase:
M[0][j] = j * cost of insert (for all j)
M[k][0] = k * cost of delete (for all k)
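The initialization phase can be sketched in C as follows. This is a minimal illustration, not code from the slides: the function name `init_borders`, the parameters `ins_cost` and `del_cost`, and the choice to store the (n+1) x (m+1) table row by row in a flat array are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Fill row 0 and column 0 of an (n+1) x (m+1) table stored row by row.
 * ins_cost and del_cost are hypothetical names for the slide's i and d. */
void init_borders(int *table, size_t n, size_t m, int ins_cost, int del_cost)
{
    assert(table != NULL);
    size_t cols = m + 1;

    /* Row 0: S1 is empty, so matching S2[1..j] takes j insertions. */
    for (size_t j = 0; j <= m; j++)
        table[j] = (int)j * ins_cost;

    /* Column 0: S2 is empty, so S1[1..k] is removed with k deletions. */
    for (size_t k = 0; k <= n; k++)
        table[k * cols] = (int)k * del_cost;
}
```

With n = 8, m = 9 and both costs set to 1, row 0 runs from 0 to 9 and column 0 from 0 to 8.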
11
Edit Distance
For simplicity, let's assume the following costs:
– Cost of insert (i) = 1
– Cost of delete (d) = 1
– Cost of align (a) = 0 if the aligned characters are the same, 1 if they are different
12
Edit Distance
With insert and delete each costing 1, the initialized borders become:
Row 0 (S1 is empty): 0 1 2 3 4 5 6 7 8 9
Column 0 (S2 is empty): 0 1 2 3 4 5 6 7 8
13
Edit Distance
M[i][j] = smallest cost of converting S1[1..i] to match S2[1..j]
Our goal is M[n][m]: the smallest cost of converting S1[1..n] to match S2[1..m]
14
Edit Distance
M[i][j] = min of:
– M[i-1][j-1] + cost of aligning S1[i] with S2[j]
– M[i-1][j] + cost of deleting S1[i]
– M[i][j-1] + cost of inserting S2[j] into S1
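The recurrence for a single cell can be sketched in C as below. This assumes unit costs for insert and delete, and an align cost of 0 for equal characters and 1 otherwise. The names `min3`, `align_cost` and `cell_value` are hypothetical helpers introduced for illustration.

```c
#include <assert.h>

/* Smallest of three integers. */
static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Cost of aligning two characters: 0 if equal, 1 if different. */
static int align_cost(char a, char b)
{
    return a == b ? 0 : 1;
}

/* Value of M[i][j] from its three neighbours:
 *   diag = M[i-1][j-1], up = M[i-1][j], left = M[i][j-1],
 * where c1 = S1[i] and c2 = S2[j]. */
int cell_value(int diag, int up, int left, char c1, char c2)
{
    assert(diag >= 0 && up >= 0 && left >= 0);
    return min3(diag + align_cost(c1, c2),  /* align  */
                up + 1,                     /* delete */
                left + 1);                  /* insert */
}
```

For example, `cell_value(0, 1, 1, 'T', 'T')` evaluates the three candidates 0, 2 and 2 and returns 0.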
15
Edit Distance: Case 1
M[i-1][j-1] + cost of aligning S1[i] with S2[j]
Example (i = 4, j = 5): optimal cost of matching TGA from S1 with TCGA from S2, plus aligning C with C
16
Edit Distance: Case 2
M[i-1][j] + cost of deleting S1[i]
Example (i = 4, j = 5): optimal cost of matching TGA from S1 with TCGAC from S2, plus deleting C from S1
17
Edit Distance: Case 3
M[i][j-1] + cost of inserting S2[j] into S1
Example (i = 4, j = 5): optimal cost of matching TGAC from S1 with TCGA from S2, plus inserting C into S1
18
Edit Distance: Complete Example
Cell M[1][1] (T aligned with T):
Case 1: 0 + 0 = 0
Case 2: 1 + 1 = 2
Case 3: 1 + 1 = 2
M[1][1] = 0
19
Edit Distance: Complete Example
Cell M[1][2] (T aligned with C):
Case 1: 1 + 1 = 2
Case 2: 2 + 1 = 3
Case 3: 0 + 1 = 1
M[1][2] = 1
20
Edit Distance: Complete Example
Row 1 complete (M[1][1] to M[1][9]): 0 1 2 3 4 5 6 7 8
21
Edit Distance: Complete Example
M[2][1] = 1
22
Edit Distance: Complete Example
Row 2 complete (M[2][1] to M[2][9]): 1 1 1 2 3 4 5 6 7
23
Edit Distance: Complete Example
Cell M[2][6] (G aligned with G):
Case 1: 4 + 0 = 4
Case 2: 5 + 1 = 6
Case 3: 3 + 1 = 4
M[2][6] = 4
Two equivalent options to reach this cell (Case 1 and Case 3)
24
Edit Distance: Complete Example
Row 3 complete (M[3][1] to M[3][9]): 2 2 2 1 2 3 4 5 6
25
Edit Distance: Complete Example
Completed table (rows: S1 = TGACGTGC, columns: S2 = TCGACGTCA):
         T  C  G  A  C  G  T  C  A
      0  1  2  3  4  5  6  7  8  9
   T  1  0  1  2  3  4  5  6  7  8
   G  2  1  1  1  2  3  4  5  6  7
   A  3  2  2  2  1  2  3  4  5  6
   C  4  3  2  3  2  1  2  3  4  5
   G  5  4  3  2  3  2  1  2  3  4
   T  6  5  4  3  3  3  2  1  2  3
   G  7  6  5  4  4  4  3  2  2  3
   C  8  7  6  5  5  4  4  3  2  3
Final answer: M[8][9] = 3 (to convert S1 into S2 we need 3 edit operations)
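The table for this example can be reproduced with a short program. The following is a minimal C sketch, assuming unit insert and delete costs and an align cost of 0 for equal characters and 1 otherwise. The names `edit_table` and `min3` and the flat row-by-row storage are choices made for this illustration, not part of the slides.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Smallest of three integers. */
static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Build the full (n+1) x (m+1) table, stored row by row, so that entry
 * [i * (m + 1) + j] is M[i][j]. The caller must free() the result.
 * Returns NULL if the allocation fails. */
int *edit_table(const char *s1, const char *s2)
{
    assert(s1 != NULL && s2 != NULL);
    size_t n = strlen(s1), m = strlen(s2), cols = m + 1;
    int *t = malloc((n + 1) * cols * sizeof *t);
    if (t == NULL)
        return NULL;

    /* Initialization phase: row 0 and column 0. */
    for (size_t j = 0; j <= m; j++)
        t[j] = (int)j;
    for (size_t i = 0; i <= n; i++)
        t[i * cols] = (int)i;

    /* Fill the remaining cells row by row. */
    for (size_t i = 1; i <= n; i++) {
        for (size_t j = 1; j <= m; j++) {
            int align = s1[i - 1] == s2[j - 1] ? 0 : 1;
            t[i * cols + j] = min3(t[(i - 1) * cols + (j - 1)] + align,
                                   t[(i - 1) * cols + j] + 1,   /* delete */
                                   t[i * cols + (j - 1)] + 1);  /* insert */
        }
    }
    return t;
}
```

Called with S1 = "TGACGTGC" and S2 = "TCGACGTCA", the entry for row 8, column 9 holds the final answer of 3.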
26
Summary of Steps
>> We consider all combinations (all possible alignments), i.e., we navigate the solution space
>> We start with small sub-problems and solve them optimally (optimal sub-structure)
>> At each step, for a problem of size K, we use the results of the possible K-1 sub-problems to find the best answer (we need to keep these results, not compute them again)
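The last point, keeping results instead of recomputing them, can also be illustrated top-down. The sketch below is one possible memoized recursion in C, assuming the same unit costs as the example. It is not the method shown on the slides, which fill the table bottom-up, and the names `solve` and `edit_distance_memo` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Smallest of three integers. */
static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* memo[i * cols + j] caches the distance between the first i characters
 * of s1 and the first j characters of s2; -1 means "not computed yet". */
static int solve(const char *s1, const char *s2, size_t i, size_t j,
                 int *memo, size_t cols)
{
    if (i == 0)
        return (int)j;              /* only insertions remain */
    if (j == 0)
        return (int)i;              /* only deletions remain */

    int *slot = &memo[i * cols + j];
    if (*slot >= 0)
        return *slot;               /* reuse the stored result */

    int align = s1[i - 1] == s2[j - 1] ? 0 : 1;
    *slot = min3(solve(s1, s2, i - 1, j - 1, memo, cols) + align,
                 solve(s1, s2, i - 1, j, memo, cols) + 1,
                 solve(s1, s2, i, j - 1, memo, cols) + 1);
    return *slot;
}

/* Returns the edit distance, or -1 if memory cannot be allocated. */
int edit_distance_memo(const char *s1, const char *s2)
{
    assert(s1 != NULL && s2 != NULL);
    size_t n = strlen(s1), m = strlen(s2), cols = m + 1;
    size_t cells = (n + 1) * cols;
    int *memo = malloc(cells * sizeof *memo);
    if (memo == NULL)
        return -1;
    for (size_t k = 0; k < cells; k++)
        memo[k] = -1;

    int d = solve(s1, s2, n, m, memo, cols);
    free(memo);
    return d;
}
```

Each sub-problem is solved once and then looked up, so the work stays proportional to the number of cells rather than growing exponentially with the recursion.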
27
Edit Distance: Algorithm
/* S1 of size n, S2 of size m (0-based C strings).
   min(a, b) is assumed to return the smaller of two ints. */
int matrix[n+1][m+1];

/* Initialization step */
for (x = 0; x <= n; x++) matrix[x][0] = x;
for (y = 1; y <= m; y++) matrix[0][y] = y;

for (x = 1; x <= n; x++)
    for (y = 1; y <= m; y++)
        if (S1[x-1] == S2[y-1])
            /* If matching, then go diagonal with 0 additional cost */
            matrix[x][y] = matrix[x-1][y-1];
        else
            /* Otherwise consider all three options (align, insert, delete)
               at cost 1 each and take the least */
            matrix[x][y] = 1 + min(matrix[x-1][y-1],
                                   min(matrix[x][y-1], matrix[x-1][y]));
return matrix[n][m];
28
Edit Distance: Algorithm Analysis
>> We compute n × m cells
>> For each cell we compare at most 3 surrounding cells
Time complexity: O(nm)
Space complexity is also O(nm)
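The O(nm) space figure applies when the whole table is kept. If only the final cost is needed, each row depends only on the row above it, so two rows are enough. The following C sketch shows this common variant under the same unit-cost assumption. It is an addition for illustration, not something the slides present, and `edit_distance_two_rows` is a hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Edit distance using two rows of m + 1 cells instead of the full table.
 * Returns -1 if memory cannot be allocated. */
int edit_distance_two_rows(const char *s1, const char *s2)
{
    assert(s1 != NULL && s2 != NULL);
    size_t n = strlen(s1), m = strlen(s2);
    int *prev = malloc((m + 1) * sizeof *prev);
    int *curr = malloc((m + 1) * sizeof *curr);
    if (prev == NULL || curr == NULL) {
        free(prev);
        free(curr);
        return -1;
    }

    /* Row 0: S1 is empty. */
    for (size_t j = 0; j <= m; j++)
        prev[j] = (int)j;

    for (size_t i = 1; i <= n; i++) {
        curr[0] = (int)i;                       /* column 0: S2 is empty */
        for (size_t j = 1; j <= m; j++) {
            int best = prev[j - 1] + (s1[i - 1] == s2[j - 1] ? 0 : 1);
            if (prev[j] + 1 < best)
                best = prev[j] + 1;             /* delete */
            if (curr[j - 1] + 1 < best)
                best = curr[j - 1] + 1;         /* insert */
            curr[j] = best;
        }
        /* The row just computed becomes the "previous" row. */
        int *tmp = prev;
        prev = curr;
        curr = tmp;
    }

    int d = prev[m];
    free(prev);
    free(curr);
    return d;
}
```

The trade-off is that the discarded rows are no longer available, so the sequence of operations cannot be recovered by backtracking through this version.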
29
How to Backtrack
We now know that the cost is 3. What are the operations, and in what order?
Keep extra information with each cell c:
– From where did you arrive at c (diagonal, left, or top)?
In Dynamic Programming, to backtrack you generally need to record which optimal sub-problem you used at each step.
30
Backtrack
Follow the stored arrows from M[8][9] back to M[0][0]:
– A diagonal arrow means align
– A left arrow means insert
– An up arrow means delete
Operations on S1:
TGACGTGC   Original S1
TCGACGTGC  Insert C (position 2)
TCGACGTC   Delete G (position 7 of the original S1)
TCGACGTCA  Insert A (position 9)
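Backtracking can be sketched in C without storing arrows, by re-deriving at each cell which neighbour produced its value. The code below is one possible illustration under the unit-cost assumption. The name `edit_script`, the letters 'a', 'i', 'd' for the operations, and the tie-breaking order (matching diagonal first, then delete, then insert, then mismatching diagonal) are all assumptions. When several optimal paths exist, a different order yields a different but equally cheap script.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Writes one letter per step into ops ('a' align, 'i' insert, 'd' delete),
 * in left-to-right order. ops needs room for strlen(s1) + strlen(s2) + 1
 * characters. Returns the edit distance, or -1 on allocation failure. */
int edit_script(const char *s1, const char *s2, char *ops)
{
    assert(s1 != NULL && s2 != NULL && ops != NULL);
    size_t n = strlen(s1), m = strlen(s2), cols = m + 1;
    int *t = malloc((n + 1) * cols * sizeof *t);
    if (t == NULL)
        return -1;

    /* Fill the table exactly as in the forward pass. */
    for (size_t j = 0; j <= m; j++) t[j] = (int)j;
    for (size_t i = 0; i <= n; i++) t[i * cols] = (int)i;
    for (size_t i = 1; i <= n; i++)
        for (size_t j = 1; j <= m; j++)
            t[i * cols + j] = min3(
                t[(i - 1) * cols + (j - 1)] + (s1[i - 1] == s2[j - 1] ? 0 : 1),
                t[(i - 1) * cols + j] + 1,
                t[i * cols + (j - 1)] + 1);

    /* Walk back from M[n][m] to M[0][0]; steps are recorded in reverse. */
    size_t i = n, j = m, len = 0;
    while (i > 0 || j > 0) {
        int here = t[i * cols + j];
        if (i > 0 && j > 0 && s1[i - 1] == s2[j - 1] &&
            here == t[(i - 1) * cols + (j - 1)]) {
            ops[len++] = 'a'; i--; j--;      /* diagonal, characters match */
        } else if (i > 0 && here == t[(i - 1) * cols + j] + 1) {
            ops[len++] = 'd'; i--;           /* came from the top */
        } else if (j > 0 && here == t[i * cols + (j - 1)] + 1) {
            ops[len++] = 'i'; j--;           /* came from the left */
        } else {
            ops[len++] = 'a'; i--; j--;      /* diagonal, characters differ */
        }
    }
    ops[len] = '\0';

    /* Reverse so that the script reads from the start of the strings. */
    for (size_t lo = 0, hi = len; lo + 1 < hi; lo++, hi--) {
        char tmp = ops[lo];
        ops[lo] = ops[hi - 1];
        ops[hi - 1] = tmp;
    }

    int d = t[n * cols + m];
    free(t);
    return d;
}
```

For S1 = "TGACGTGC" and S2 = "TCGACGTCA" this returns 3 and the script "aiaaaaadai": an insert after the first T, a delete of the seventh character of S1, and an insert at the end, with aligned characters everywhere else.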