Dynamic Programming (Edit Distance)
Edit Distance
Input:
– Two input strings S1 (of size n) and S2 (of size m)
  E.g., S1 = ATTTCTAGTGGGTAAA
        S2 = ATCTAGTTTAGGGATA
Target:
– Find the smallest distance between S1 and S2
– In other words, the smallest number of edit operations needed to convert S1 into S2
Edit Operations
– Insert (i), Delete (d), Align (a)
Example
S1: TCGACGTCA
S2: TGACGTGC
Three operations convert S1 to S2 (positions are columns of the alignment, "-" marks a gap):
S1: TCGACGT-CA
S2: T-GACGTGC-
– Delete C (position 2) and A (position 10)
– Insert G (position 8)
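To check the example by hand, the three operations can be applied to a copy of S1. The sketch below is only an illustration: the helper names `delete_at`, `insert_at` and `apply_example_edits` are made up here, and the indices are 0-based positions in the current string, not the alignment columns used on the slide.

```c
#include <string.h>

/* Delete the character at 0-based index pos from the NUL-terminated string s. */
void delete_at(char *s, size_t pos) {
    memmove(s + pos, s + pos + 1, strlen(s + pos + 1) + 1);
}

/* Insert character c at 0-based index pos; the buffer needs one spare byte. */
void insert_at(char *s, size_t pos, char c) {
    memmove(s + pos + 1, s + pos, strlen(s + pos) + 1);
    s[pos] = c;
}

/* Apply the three edits of the example to buf, which initially holds S1. */
void apply_example_edits(char *buf) {
    delete_at(buf, 1);       /* delete C: TCGACGTCA  -> TGACGTCA  */
    insert_at(buf, 6, 'G');  /* insert G: TGACGTCA   -> TGACGTGCA */
    delete_at(buf, 8);       /* delete A: TGACGTGCA  -> TGACGTGC  */
}
```

Calling `apply_example_edits` on a buffer that holds TCGACGTCA (with room for at least one more character) leaves TGACGTGC in the buffer, which is S2.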
Edit Distance
**Edit operations on S1 that convert it into S2
Cost of insert: i, cost of delete: d, cost of align: a
Rows are prefixes of S1 = TGACGTGC and columns are prefixes of S2 = TCGACGTCA (the strings of the previous example with the roles of S1 and S2 swapped).
First row (S1 is empty): the cost of inserting T into S1 to match S2 is i, the cost of inserting TC is 2i, and so on up to 9i.
First column (S2 is empty): the cost of deleting T from S1 to match S2 is d, the cost of deleting TG is 2d, and so on up to 8d.

              (empty)  T    C    G    A    C    G    T    C    A      <- S2
    (empty)   0        i    2i   3i   4i   5i   6i   7i   8i   9i
    T         d
    G         2d
    A         3d
    C         4d
    G         5d
    T         6d
    G         7d
    C         8d
    ^ S1
Edit Distance
What we did so far is called the Initialization Phase
M[0][j] = j * cost of insert (for all j)
M[k][0] = k * cost of delete (for all k)
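The initialization phase can be sketched as a small C function. The name `init_matrix` and its parameters are assumptions for illustration; the matrix is passed as a C99 variable-length array with n+1 rows and m+1 columns.

```c
/* Fill row 0 and column 0 of an (n+1) x (m+1) matrix.
   Row 0: S1 is empty, so reaching S2[1..j] takes j insertions.
   Column 0: S2 is empty, so reaching it from S1[1..k] takes k deletions. */
void init_matrix(int n, int m, int M[n + 1][m + 1],
                 int cost_insert, int cost_delete) {
    for (int j = 0; j <= m; j++)
        M[0][j] = j * cost_insert;
    for (int k = 0; k <= n; k++)
        M[k][0] = k * cost_delete;
}
```

With the strings of the slides (n = 8, m = 9) and both costs set to 1, the first row runs from 0 to 9 and the first column from 0 to 8.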
Edit Distance
For simplicity, let's assume the following costs:
Cost of insert (i) = 1
Cost of delete (d) = 1
Cost of align (a) = 0 if the aligned characters are the same
                    1 if the aligned characters are different
Edit Distance
**Edit operations on S1 that convert it into S2
Cell (i, j): smallest cost of converting S1[1..i] to match S2[1..j]
Cell (n, m): our goal is to convert S1[1..n] to match S2[1..m]
Edit Distance
**Edit operations on S1 that convert it into S2
M[i, j] = Min of
    M[i-1, j-1] + cost of aligning S1[i] and S2[j]
    M[i-1, j]   + cost of deleting S1[i]
    M[i, j-1]   + cost of inserting S2[j] into S1
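Written out in full, together with the initialization phase, the recurrence reads as follows. The symbols c_ins, c_del and c_align are notation introduced here for the costs of insert, delete and align; they avoid a clash between the cost i and the row index i.

```latex
\begin{aligned}
M[0, j] &= j \cdot c_{\mathrm{ins}}, \qquad M[k, 0] = k \cdot c_{\mathrm{del}}, \\
M[i, j] &= \min
\begin{cases}
M[i-1, j-1] + c_{\mathrm{align}}(S_1[i], S_2[j]) \\
M[i-1, j] + c_{\mathrm{del}} \\
M[i, j-1] + c_{\mathrm{ins}}
\end{cases}
\qquad 1 \le i \le n,\ 1 \le j \le m.
\end{aligned}
```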
Edit Distance: Case 1
M[i, j] = M[i-1, j-1] + cost of aligning S1[i] and S2[j]
Example cell: optimal cost of matching TGA from S1 with TCGA from S2 + aligning C with C
Edit Distance: Case 2
M[i, j] = M[i-1, j] + cost of deleting S1[i]
Example cell: optimal cost of matching TGA from S1 with TCGAC from S2 + deleting C from S1
Edit Distance: Case 3
M[i, j] = M[i, j-1] + cost of inserting S2[j] into S1
Example cell: optimal cost of matching TGAC from S1 with TCGA from S2 + inserting C into S1
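The three cases can be sketched as one candidate cost per neighbouring cell, with the cell taking the smallest. The struct `cases` and the functions `candidates` and `best` are illustrative names, and the unit costs (insert = delete = 1, align = 0 or 1) are assumed.

```c
/* Candidate costs for one cell, one per case of the recurrence. */
struct cases {
    int align;  /* Case 1: M[i-1][j-1] + cost of align  */
    int del;    /* Case 2: M[i-1][j]   + cost of delete */
    int ins;    /* Case 3: M[i][j-1]   + cost of insert */
};

/* Build the three candidates from the diagonal, top and left neighbours. */
struct cases candidates(int diag, int top, int left, char s1_i, char s2_j) {
    struct cases c;
    c.align = diag + (s1_i == s2_j ? 0 : 1);
    c.del = top + 1;
    c.ins = left + 1;
    return c;
}

/* The value stored in the cell is the smallest candidate. */
int best(struct cases c) {
    int m = c.align;
    if (c.del < m) m = c.del;
    if (c.ins < m) m = c.ins;
    return m;
}
```

For a cell whose diagonal neighbour is 0, whose top and left neighbours are both 1, and whose characters are both T, `candidates(0, 1, 1, 'T', 'T')` gives 0, 2 and 2, so the cell gets 0.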
Edit Distance: Complete Example
Cell M[1, 1] (S1[1] = T, S2[1] = T):
Case 1: M[0, 0] + 0 = 0
Case 2: M[0, 1] + 1 = 2
Case 3: M[1, 0] + 1 = 2
M[1, 1] = 0
Edit Distance: Complete Example
Cell M[1, 2] (S1[1] = T, S2[2] = C):
Case 1: M[0, 1] + 1 = 2
Case 2: M[0, 2] + 1 = 3
Case 3: M[1, 1] + 1 = 1
M[1, 2] = 1
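The cells worked out so far can be reproduced by filling the whole matrix in row order. This is a minimal sketch with unit costs; `fill_matrix` and `min3` are names chosen here, and the strings are indexed from 0 in C while the slides index them from 1.

```c
/* Smallest of three integers. */
static int min3(int a, int b, int c) {
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Fill the (n+1) x (m+1) matrix for s1 (rows) and s2 (columns), unit costs. */
void fill_matrix(const char *s1, const char *s2, int n, int m,
                 int M[n + 1][m + 1]) {
    for (int j = 0; j <= m; j++) M[0][j] = j;  /* S1 is empty */
    for (int i = 0; i <= n; i++) M[i][0] = i;  /* S2 is empty */
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int align = M[i - 1][j - 1] + (s1[i - 1] == s2[j - 1] ? 0 : 1);
            int del = M[i - 1][j] + 1;
            int ins = M[i][j - 1] + 1;
            M[i][j] = min3(align, del, ins);
        }
    }
}
```

With s1 = TGACGTGC and s2 = TCGACGTCA (n = 8, m = 9), the filled matrix has M[1][1] = 0 and M[1][2] = 1, matching the two cells computed by hand.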
Edit Distance: Complete Example
[Figure: the remaining cells of the matrix are filled one at a time with the same recurrence.]
Edit Distance: Complete Example
Case 1: = 4
Case 2: = 6
Case 3: = 4
Two equivalent options to reach this cell (Case 1 and Case 3 both give the minimum, 4)
Edit Distance: Complete Example
M[n, m] = 3 is the final answer (to convert S1 into S2 we need 3 edit operations)
Summary of Steps
>> We consider all combinations (all possible alignments) (navigate the solution space)
>> We start with small sub-problems and solve them optimally (optimal sub-structure)
>> At each step, for a problem of size K, we use the results of the possible K-1 sub-problems to find the best answer (we need to keep these results, not compute them again)
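The last point, keeping results instead of computing them again, can also be illustrated top-down. The sketch below is a memoized recursion, a different formulation from the table filling used in the slides; `MAX_LEN`, `memo`, `solve` and `edit_distance_memo` are names assumed for this illustration, and unit costs are used.

```c
#include <string.h>

#define MAX_LEN 64  /* assumed upper bound on string length for this sketch */

static int memo[MAX_LEN + 1][MAX_LEN + 1];

/* Cost of converting the first i characters of s1 into the first j of s2. */
static int solve(const char *s1, const char *s2, int i, int j) {
    if (i == 0) return j;                    /* only insertions remain */
    if (j == 0) return i;                    /* only deletions remain */
    if (memo[i][j] >= 0) return memo[i][j];  /* reuse a stored result */
    int best = solve(s1, s2, i - 1, j - 1) + (s1[i - 1] == s2[j - 1] ? 0 : 1);
    int del = solve(s1, s2, i - 1, j) + 1;
    int ins = solve(s1, s2, i, j - 1) + 1;
    if (del < best) best = del;
    if (ins < best) best = ins;
    memo[i][j] = best;                       /* keep it for later calls */
    return best;
}

/* Returns the edit distance, or -1 if a string exceeds MAX_LEN. */
int edit_distance_memo(const char *s1, const char *s2) {
    int n = (int)strlen(s1), m = (int)strlen(s2);
    if (n > MAX_LEN || m > MAX_LEN) return -1;
    memset(memo, -1, sizeof memo);           /* every entry starts as "not computed" */
    return solve(s1, s2, n, m);
}
```

Without the `memo` table the same sub-problems would be solved repeatedly; with it, each pair (i, j) is computed at most once.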
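Edit Distance: Algorithm
S1 of size n, S2 of size m, with the unit costs assumed earlier.

    int edit_distance(const char *S1, int n, const char *S2, int m) {
        int matrix[n + 1][m + 1];
        /* Initialization step */
        for (int x = 0; x <= n; x++) matrix[x][0] = x;
        for (int y = 1; y <= m; y++) matrix[0][y] = y;
        for (int x = 1; x <= n; x++)
            for (int y = 1; y <= m; y++) {
                if (S1[x - 1] == S2[y - 1]) {
                    /* If matching, go diagonal with 0 additional cost */
                    matrix[x][y] = matrix[x - 1][y - 1];
                    continue;
                }
                /* Otherwise consider all three options, each costing 1, and take the least */
                int best = matrix[x - 1][y - 1];                       /* align (mismatch) */
                if (matrix[x][y - 1] < best) best = matrix[x][y - 1];  /* insert */
                if (matrix[x - 1][y] < best) best = matrix[x - 1][y];  /* delete */
                matrix[x][y] = best + 1;
            }
        return matrix[n][m];
    }

The mismatch branch includes the diagonal cell so that the code agrees with the three-case recurrence and with the align cost of 1 for different characters. C strings are indexed from 0, hence S1[x - 1] and S2[y - 1].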
Edit Distance: Algorithm Analysis
>> We compute (n * m) cells
>> For each cell we compare with at most 3 surrounding cells
Time Complexity is O(nm)
Space Complexity is also O(nm)
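The O(nm) space bound applies when the full matrix is kept, which backtracking requires. If only the cost is needed, each row depends only on the row above it, so two rows are enough. The following is a sketch of that variant, not part of the algorithm in the slides; the function name `edit_distance_two_rows` is an assumption, and unit costs are used.

```c
#include <stdlib.h>
#include <string.h>

/* Edit distance with unit costs using two rows of m+1 cells.
   Returns -1 if memory allocation fails. */
int edit_distance_two_rows(const char *s1, const char *s2) {
    size_t n = strlen(s1), m = strlen(s2);
    int *prev = malloc((m + 1) * sizeof *prev);
    int *curr = malloc((m + 1) * sizeof *curr);
    if (!prev || !curr) {
        free(prev);
        free(curr);
        return -1;
    }
    for (size_t j = 0; j <= m; j++) prev[j] = (int)j;   /* row 0 */
    for (size_t i = 1; i <= n; i++) {
        curr[0] = (int)i;                               /* column 0 */
        for (size_t j = 1; j <= m; j++) {
            int best = prev[j - 1] + (s1[i - 1] == s2[j - 1] ? 0 : 1); /* align */
            if (prev[j] + 1 < best) best = prev[j] + 1;                /* delete */
            if (curr[j - 1] + 1 < best) best = curr[j - 1] + 1;        /* insert */
            curr[j] = best;
        }
        int *tmp = prev;  /* the row just computed becomes the previous row */
        prev = curr;
        curr = tmp;
    }
    int result = prev[m];
    free(prev);
    free(curr);
    return result;
}
```

The running time is still O(nm); only the memory drops to O(m). The trade-off is that the path through the matrix can no longer be recovered.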
How to Backtrack
We now know that the cost is 3. What are the operations, and in what order?
Keep extra information with each cell c
– From where did you arrive at c (diagonal, left, or top)
In Dynamic Programming, backtracking generally requires recording which optimal sub-problem was used at each step
Backtrack
Arrows in the matrix: diagonal means align, left means insert, top means delete
Operations on S1:
TGACGTGC     Original S1
TCGACGTGC    Insert C (position 2)
TCGACGTC     Delete G (position 7 of the original S1)
TCGACGTCA    Insert A (position 9)
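A backtracking pass can be sketched as follows. Instead of storing an arrow in every cell, this version re-derives the arrow by checking which neighbour produced the cell's value, which is equivalent for unit costs; when several neighbours tie, it prefers diagonal, then top, then left, so it may report a different but equally cheap sequence than the one shown above. The names `MAXN` and `backtrack` and the letters 'a', 'd', 'i' are assumptions of this sketch.

```c
#include <string.h>

#define MAXN 64  /* assumed upper bound on string length for this sketch */

/* Fills the matrix for s1 (rows) and s2 (columns), then walks back from
   M[n][m] to M[0][0]. Writes one letter per alignment column into ops:
   'a' align (match or mismatch), 'i' insert, 'd' delete.
   ops needs room for n + m + 1 characters.
   Returns the edit distance, or -1 if a string exceeds MAXN. */
int backtrack(const char *s1, const char *s2, char *ops) {
    int n = (int)strlen(s1), m = (int)strlen(s2);
    if (n > MAXN || m > MAXN) return -1;
    static int M[MAXN + 1][MAXN + 1];
    for (int j = 0; j <= m; j++) M[0][j] = j;
    for (int i = 0; i <= n; i++) M[i][0] = i;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            int best = M[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]);
            if (M[i - 1][j] + 1 < best) best = M[i - 1][j] + 1;
            if (M[i][j - 1] + 1 < best) best = M[i][j - 1] + 1;
            M[i][j] = best;
        }
    int i = n, j = m, len = 0;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 &&
            M[i][j] == M[i - 1][j - 1] + (s1[i - 1] != s2[j - 1])) {
            ops[len++] = 'a'; i--; j--;   /* arrived diagonally */
        } else if (i > 0 && M[i][j] == M[i - 1][j] + 1) {
            ops[len++] = 'd'; i--;        /* arrived from the top */
        } else {
            ops[len++] = 'i'; j--;        /* arrived from the left */
        }
    }
    /* The walk collects operations from last to first, so reverse them. */
    for (int a = 0, b = len - 1; a < b; a++, b--) {
        char t = ops[a]; ops[a] = ops[b]; ops[b] = t;
    }
    ops[len] = '\0';
    return M[n][m];
}
```

For S1 = TGACGTGC and S2 = TCGACGTCA the function returns 3, and the letters in `ops` describe one optimal alignment read from left to right.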