Dynamic Programming (Edit Distance). Edit Distance Input: – Two input strings S1 (of size n) and S2 (of size m) E.g., S1 = ATTTCTAGTGGGTAAA S2 = ATCTAGTTTAGGGATA.

1 Dynamic Programming (Edit Distance)

2 Edit Distance Input: – Two input strings S1 (of size n) and S2 (of size m) E.g., S1 = ATTTCTAGTGGGTAAA S2 = ATCTAGTTTAGGGATA Target: – Find the smallest distance between S1 and S2 – In other words, the smallest number of edit operations needed to convert S1 into S2 Edit Operations: – Insert (i), Delete (d), Align (a)

3 Example S1: TCGACGTCA S2: TGACGTGC Three operations convert S1 into S2: – Delete C (position 2) – Insert G (position 8) – Delete A (position 10) Positions refer to columns of the alignment below, where "-" marks a gap:
    S1: TCGACGT-CA
    S2: T-GACGTGC-

4 Edit Distance The matrix has one row per prefix of S1 = TGACGTGC and one column per prefix of S2 = TCGACGTCA (the two strings of the previous example with their roles swapped). Row 0 means S1 is empty; column 0 means S2 is empty. Each cell holds the cost of the edit operations on S1 that convert it into S2. Cost of insert = i Cost of delete = d Cost of align = a Row 0 so far: 0, i. M[0][1] = i is the cost of inserting T into S1 to match S2

5 Edit Distance Row 0 so far: 0, i, 2i. M[0][2] = 2i is the cost of inserting TC into S1 to match S2

6 Edit Distance Row 0 complete (S1 is empty): 0, i, 2i, 3i, 4i, 5i, 6i, 7i, 8i, 9i

7 Edit Distance Column 0 so far (S2 is empty): 0, 1d. M[1][0] = 1d is the cost of deleting T from S1 to match S2

8 Edit Distance Column 0 so far: 0, 1d, 2d. M[2][0] = 2d is the cost of deleting TG from S1 to match S2

9 Edit Distance Column 0 complete (S2 is empty): 0, 1d, 2d, 3d, 4d, 5d, 6d, 7d, 8d

10 Edit Distance Row 0: 0, i, 2i, ..., 9i Column 0: 0, 1d, 2d, ..., 8d What we did so far is called the Initialization Phase: M[0][j] = j * cost of insert (for all j) M[k][0] = k * cost of delete (for all k)
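The initialization phase can be sketched in C as follows. This is a minimal illustration, not code from the slides: the function name `init_edit_matrix`, the row-major flat array `M`, and the parameter names `ins_cost` and `del_cost` (standing in for i and d) are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Fill row 0 and column 0 of an (n+1) x (m+1) cost matrix stored row-major
 * in M. ins_cost and del_cost play the roles of i and d on the slide.
 * All names here are hypothetical. */
void init_edit_matrix(int *M, size_t n, size_t m, int ins_cost, int del_cost) {
    assert(M != NULL);
    size_t width = m + 1;
    for (size_t j = 0; j <= m; j++)
        M[j] = (int)j * ins_cost;          /* M[0][j] = j * cost of insert */
    for (size_t k = 0; k <= n; k++)
        M[k * width] = (int)k * del_cost;  /* M[k][0] = k * cost of delete */
}
```

For the example matrix (8 rows and 9 columns beyond the empty prefixes) the caller would pass n = 8, m = 9 and a buffer of 9 * 10 integers.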

11 Edit Distance Row 0: 0, i, 2i, ..., 9i Column 0: 0, 1d, 2d, ..., 8d For simplicity, let's assume the following costs: Cost of insert (i) = 1 Cost of delete (d) = 1 Cost of align (a) = 0 if the aligned characters are the same, 1 if the aligned characters are different

12 Edit Distance With these costs the initialized matrix becomes: Row 0 (S1 is empty): 0 1 2 3 4 5 6 7 8 9 Column 0 (S2 is empty): 0 1 2 3 4 5 6 7 8

13 Edit Distance Cell (i, j) holds the smallest cost of converting S1[1..i] to match S2[1..j]. Our goal is cell (n, m): convert S1[1..n] to match S2[1..m]

14 Edit Distance For cell (i, j), M[i, j] = Min of:
    M[i-1, j-1] + cost of align S1[i] and S2[j]
    M[i-1, j] + cost of delete S1[i]
    M[i, j-1] + cost of insert S2[j] into S1
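Written out as a formula, the initialization and the three-way minimum read as follows. The symbols c_ins, c_del and c_align are notation introduced here for the insert, delete and align costs (the slides call them i, d and a); they are renamed only to avoid clashing with the index i.

```latex
\[
M[0,0] = 0, \qquad M[0,j] = j\,c_{\mathrm{ins}}, \qquad M[k,0] = k\,c_{\mathrm{del}},
\]
\[
M[i,j] = \min
\begin{cases}
M[i-1,j-1] + c_{\mathrm{align}}\bigl(S_1[i], S_2[j]\bigr) \\
M[i-1,j] + c_{\mathrm{del}} \\
M[i,j-1] + c_{\mathrm{ins}}
\end{cases}
\qquad 1 \le i \le n,\ 1 \le j \le m .
\]
```

With the simplified costs of the slides, c_ins = c_del = 1 and c_align is 0 for equal characters and 1 otherwise.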

15 Edit Distance: Case 1 M[i, j] = M[i-1, j-1] + cost of align S1[i] and S2[j]. Example at cell (4, 5): optimal cost of matching TGA from S1 with TCGA from S2 + align C with C

16 Edit Distance: Case 2 M[i, j] = M[i-1, j] + cost of delete S1[i]. Example at cell (4, 5): optimal cost of matching TGA from S1 with TCGAC from S2 + delete C from S1

17 Edit Distance: Case 3 M[i, j] = M[i, j-1] + cost of insert S2[j] into S1. Example at cell (4, 5): optimal cost of matching TGAC from S1 with TCGA from S2 + insert C into S1
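The three cases can be compared for a single cell with a small helper. This is only a sketch under the unit costs assumed on the slides; the names `cell_cost` and `edit_case` and the tie-breaking order (align, then delete, then insert) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Which of the three cases produced the minimum (hypothetical names). */
enum edit_case { CASE_ALIGN = 1, CASE_DELETE = 2, CASE_INSERT = 3 };

/* Cost of one cell from its three neighbours, with unit costs:
 * diag = M[i-1][j-1], up = M[i-1][j], left = M[i][j-1];
 * a and b are the characters S1[i] and S2[j].
 * Ties are broken in the order align, delete, insert. */
int cell_cost(int diag, int up, int left, char a, char b,
              enum edit_case *which) {
    assert(which != NULL);
    int best = diag + (a == b ? 0 : 1);                             /* Case 1 */
    *which = CASE_ALIGN;
    if (up + 1 < best)   { best = up + 1;   *which = CASE_DELETE; } /* Case 2 */
    if (left + 1 < best) { best = left + 1; *which = CASE_INSERT; } /* Case 3 */
    return best;
}
```

Called with the neighbours of cell (1, 1), that is `cell_cost(0, 1, 1, 'T', 'T', &w)`, it returns 0 through Case 1.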

18 Edit Distance: Complete Example Cell (1, 1), T against T: Case 1: 0 + 0 = 0 Case 2: 1 + 1 = 2 Case 3: 1 + 1 = 2 So M[1, 1] = 0

19 Edit Distance: Complete Example Cell (1, 2), T against C: Case 1: 1 + 1 = 2 Case 2: 2 + 1 = 3 Case 3: 0 + 1 = 1 So M[1, 2] = 1

20 Edit Distance: Complete Example Row 1 complete: M[1, 1..9] = 0 1 2 3 4 5 6 7 8

21 Edit Distance: Complete Example Row 2 started: M[2, 1] = 1

22 Edit Distance: Complete Example Row 2 complete: M[2, 1..9] = 1 1 1 2 3 4 5 6 7

23 Edit Distance: Complete Example Cell (2, 6), G against G: Case 1: 4 + 0 = 4 Case 2: 5 + 1 = 6 Case 3: 3 + 1 = 4 So M[2, 6] = 4. Two equivalent options (Case 1 and Case 3) reach this cell

24 Edit Distance: Complete Example Row 3 complete: M[3, 1..9] = 2 2 2 1 2 3 4 5 6

25 Edit Distance: Complete Example Completed matrix (rows: prefixes of S1 = TGACGTGC, columns: prefixes of S2 = TCGACGTCA), with every cell computed as the minimum of the three cases:

          T  C  G  A  C  G  T  C  A
       0  1  2  3  4  5  6  7  8  9
    T  1  0  1  2  3  4  5  6  7  8
    G  2  1  1  1  2  3  4  5  6  7
    A  3  2  2  2  1  2  3  4  5  6
    C  4  3  2  3  2  1  2  3  4  5
    G  5  4  3  2  3  2  1  2  3  4
    T  6  5  4  3  3  3  2  1  2  3
    G  7  6  5  4  4  4  3  2  2  3
    C  8  7  6  5  5  4  4  3  2  3

Final answer: M[8, 9] = 3 (to convert S1 into S2 we need 3 edit operations)
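The complete example can be reproduced with a short C routine that fills the whole matrix. It is a sketch under the slides' unit costs; `fill_edit_matrix`, `print_edit_matrix` and the row-major flat array are names and layout choices assumed here, not part of the presentation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fill the (n+1) x (m+1) matrix M (row-major, caller-allocated) with unit
 * insert/delete costs and align cost 0 (same) or 1 (different).
 * Returns M[n][m], the edit distance. Names are hypothetical. */
int fill_edit_matrix(const char *s1, const char *s2, int *M) {
    assert(s1 != NULL && s2 != NULL && M != NULL);
    int n = (int)strlen(s1), m = (int)strlen(s2), w = m + 1;
    for (int j = 0; j <= m; j++) M[j] = j;      /* row 0: S1 is empty */
    for (int i = 0; i <= n; i++) M[i * w] = i;  /* column 0: S2 is empty */
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int best = M[(i - 1) * w + (j - 1)] + (s1[i - 1] != s2[j - 1]);
            int del  = M[(i - 1) * w + j] + 1;
            int ins  = M[i * w + (j - 1)] + 1;
            if (del < best) best = del;
            if (ins < best) best = ins;
            M[i * w + j] = best;
        }
    }
    return M[n * w + m];
}

/* Print the matrix one row per line. */
void print_edit_matrix(const int *M, int n, int m) {
    for (int i = 0; i <= n; i++) {
        for (int j = 0; j <= m; j++) printf("%2d ", M[i * (m + 1) + j]);
        printf("\n");
    }
}
```

For S1 = "TGACGTGC" and S2 = "TCGACGTCA" with a buffer of 9 * 10 integers, the function returns 3, the final answer of the example.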

26 Summary of Steps >> We consider all combinations (all possible alignments) (Navigate the solution space) >> We start with small sub-problems and solve them optimally (Optimal sub-structure) >> At each step, for a problem of size K, we use the results of the possible K-1 sub-problems to find the best answer (Need to keep these results, not compute them again)
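The last point, keeping sub-problem results instead of recomputing them, can also be illustrated top-down. The following is a memoized recursive sketch rather than the bottom-up table of the slides; `solve`, `edit_distance_memo` and the use of -1 as an "unsolved" marker are assumptions made for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Cost of converting s1[0..i) into s2[0..j) with unit costs.
 * memo holds -1 for sub-problems that have not been solved yet. */
static int solve(const char *s1, const char *s2, int i, int j,
                 int *memo, int w) {
    if (i == 0) return j;              /* only inserts remain */
    if (j == 0) return i;              /* only deletes remain */
    int *slot = &memo[i * w + j];
    if (*slot >= 0) return *slot;      /* reuse the stored result */
    int best = solve(s1, s2, i - 1, j - 1, memo, w) + (s1[i - 1] != s2[j - 1]);
    int del  = solve(s1, s2, i - 1, j, memo, w) + 1;
    int ins  = solve(s1, s2, i, j - 1, memo, w) + 1;
    if (del < best) best = del;
    if (ins < best) best = ins;
    return *slot = best;
}

/* Returns the edit distance, or -1 if memory allocation fails. */
int edit_distance_memo(const char *s1, const char *s2) {
    assert(s1 != NULL && s2 != NULL);
    int n = (int)strlen(s1), m = (int)strlen(s2), w = m + 1;
    size_t cells = (size_t)(n + 1) * (size_t)w;
    int *memo = malloc(cells * sizeof *memo);
    if (memo == NULL) return -1;
    for (size_t k = 0; k < cells; k++) memo[k] = -1;
    int result = solve(s1, s2, n, m, memo, w);
    free(memo);
    return result;
}
```

Without the `memo` lookup the same recursion would solve each sub-problem many times over; with it, each (i, j) pair is computed once.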

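27 Edit Distance: Algorithm (S1 of size n, S2 of size m; C strings are 0-indexed, so the x-th character of S1 is S1[x-1])

    int edit_distance(const char *S1, int n, const char *S2, int m) {
        int matrix[n+1][m+1];
        /* Initialization step */
        for (int x = 0; x <= n; x++) matrix[x][0] = x;
        for (int y = 1; y <= m; y++) matrix[0][y] = y;
        for (int x = 1; x <= n; x++)
            for (int y = 1; y <= m; y++) {
                /* Align: go diagonal, with 0 additional cost if matching, 1 otherwise */
                int best = matrix[x-1][y-1] + (S1[x-1] != S2[y-1]);
                /* Consider the other two options and take the least */
                if (matrix[x][y-1] + 1 < best) best = matrix[x][y-1] + 1;  /* insert */
                if (matrix[x-1][y] + 1 < best) best = matrix[x-1][y] + 1;  /* delete */
                matrix[x][y] = best;
            }
        return matrix[n][m];
    }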

28 Edit Distance: Algorithm Analysis >> We compute n × m cells >> For each cell we look at no more than 3 neighbouring cells Time complexity = O(nm) Space complexity is also O(nm)
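The O(nm) space bound applies to storing the full matrix. As a side note that goes beyond the slides, each row depends only on the row above it, so when only the distance is needed (and not the list of operations) two rows are enough. The sketch below assumes unit costs; `edit_distance_two_rows` is a hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Edit distance with unit costs, keeping only two rows of the matrix.
 * Returns -1 if memory allocation fails. */
int edit_distance_two_rows(const char *s1, const char *s2) {
    assert(s1 != NULL && s2 != NULL);
    size_t n = strlen(s1), m = strlen(s2);
    int *prev = malloc((m + 1) * sizeof *prev);
    int *curr = malloc((m + 1) * sizeof *curr);
    if (prev == NULL || curr == NULL) { free(prev); free(curr); return -1; }
    for (size_t j = 0; j <= m; j++) prev[j] = (int)j;           /* row 0 */
    for (size_t i = 1; i <= n; i++) {
        curr[0] = (int)i;                                        /* column 0 */
        for (size_t j = 1; j <= m; j++) {
            int best = prev[j - 1] + (s1[i - 1] != s2[j - 1]);   /* align */
            if (prev[j] + 1 < best) best = prev[j] + 1;          /* delete */
            if (curr[j - 1] + 1 < best) best = curr[j - 1] + 1;  /* insert */
            curr[j] = best;
        }
        int *tmp = prev; prev = curr; curr = tmp;   /* reuse the two buffers */
    }
    int result = prev[m];   /* after the swap, prev is the last filled row */
    free(prev);
    free(curr);
    return result;
}
```

The time complexity stays O(nm); the extra space drops to O(m), at the price of losing the information needed to backtrack.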

29 How to Backtrack We now know that the cost is 3. What are the operations, and in what order? Keep extra information with each cell c: – From where did you arrive at c (diagonal, left, or top)? In Dynamic Programming, to backtrack you generally need to record which optimal sub-problem you used at each step

30 Backtrack Follow the recorded arrows in the completed matrix from M[8, 9] = 3 back to M[0, 0]. A diagonal arrow means align, a horizontal arrow (arrived from the left) means insert, a vertical arrow (arrived from the top) means delete. Operations on S1:
    Original S1: TGACGTGC
    Insert C (position 2): TCGACGTGC
    Delete G (position 7 of the original S1): TCGACGTC
    Insert A (position 9): TCGACGTCA
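Backtracking can be sketched in C by re-deriving, at each cell, which neighbour could have produced its value, instead of storing an explicit arrow. Everything named here is an assumption for illustration: the function `edit_backtrack`, the operation letters ('a' align equal characters, 's' align different characters, 'i' insert, 'd' delete), and the tie-breaking order, which prefers a free match, then insert, then delete.

```c
#include <assert.h>
#include <string.h>

/* Fill the matrix with unit costs, then walk back from M[n][m] to M[0][0].
 * ops receives one letter per step in left-to-right order and must hold
 * n + m + 1 chars. The matrix is a stack-allocated VLA, so this sketch is
 * meant for short strings only. Returns the edit distance. */
int edit_backtrack(const char *s1, const char *s2, char *ops) {
    assert(s1 != NULL && s2 != NULL && ops != NULL);
    int n = (int)strlen(s1), m = (int)strlen(s2);
    int M[n + 1][m + 1];
    for (int j = 0; j <= m; j++) M[0][j] = j;
    for (int i = 0; i <= n; i++) M[i][0] = i;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            int best = M[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]);
            if (M[i - 1][j] + 1 < best) best = M[i - 1][j] + 1;
            if (M[i][j - 1] + 1 < best) best = M[i][j - 1] + 1;
            M[i][j] = best;
        }
    int i = n, j = m, len = 0;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 && s1[i - 1] == s2[j - 1] &&
            M[i][j] == M[i - 1][j - 1]) {
            ops[len++] = 'a'; i--; j--;   /* diagonal, characters match */
        } else if (j > 0 && M[i][j] == M[i][j - 1] + 1) {
            ops[len++] = 'i'; j--;        /* arrived from the left */
        } else if (i > 0 && M[i][j] == M[i - 1][j] + 1) {
            ops[len++] = 'd'; i--;        /* arrived from the top */
        } else {
            ops[len++] = 's'; i--; j--;   /* diagonal, characters differ */
        }
    }
    for (int k = 0; k < len / 2; k++) {   /* reverse into reading order */
        char t = ops[k]; ops[k] = ops[len - 1 - k]; ops[len - 1 - k] = t;
    }
    ops[len] = '\0';
    return M[n][m];
}
```

For S1 = "TGACGTGC" and S2 = "TCGACGTCA" this tie-breaking yields the operation string "aiaaaaadai": align T, insert C, align GACGT, delete G, align C, insert A. Other tie-breaking orders can produce different operation lists of the same cost 3.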

