Published by Joanna Woods. Modified over 9 years ago.
1
Inexact Matching of Strings
General Problem
  – Input: strings S and T
  – Questions
    How distant is S from T?
    How similar is S to T?
Solution Technique
  – Dynamic programming with a cost/similarity/scoring matrix
2
Measuring Distance of S and T
Consider strings S and T
We can transform S into T using the following four operations
  – insertion of a character into S
  – deletion of a character from S
  – substitution (replacement) of a character in S by another character (typically one from T)
  – matching (no operation)
3
Example
S = vintner
T = writers
  vintner
  wintner (Replace v with w)
  wrintner (Insert r)
  writner (Delete first n)
  writer (Delete second n)
  writers (Insert s)
4
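Example
Edit Transcript (or just transcript):
  – a string that describes the transformation of one string into the other
  – R = replace, I = insert, D = delete, M = match
Example (transcript on top, S and T below, one column per operation)
  RIMDMDMMI
  v intner
  wri t ers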
5
Edit Distance
Edit distance of strings S and T
  – The minimum number of edit operations (insertion, deletion, replacement) needed to transform string S into string T
  – Also called Levenshtein distance; Levenshtein appears to have been the first to define this concept
Optimal transcript
  – An edit transcript of S and T that has the minimum number of edit operations
  – Cooptimal transcripts: there may be more than one optimal transcript
6
Alignment
A global alignment of strings S and T is obtained
  – by inserting spaces (dashes) into S and T
    so that both have the same number of characters (including dashes) at the end
  – then placing the two strings one over the other,
    matching each character (or dash) in S with a unique character (or dash) in T
  – Note: ALL positions in both S and T are involved
7
Alignments and Edit Transcripts
Example Alignment
  v-intner-
  wri-t-ers
Alignments and edit transcripts are interrelated
  – edit transcript: emphasizes process
    the specific mutational events
  – alignment: emphasizes product
    the relationship between the two strings
  – Alignments are often easier to work with and visualize
    they also generalize better to more than 2 strings
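The link between a transcript and an alignment can be sketched in a few lines of Python. This is an illustrative sketch only; the function name `transcript_to_alignment` is an assumption and does not come from the slides. It walks the transcript once and emits one alignment column per operation.

```python
def transcript_to_alignment(transcript, S, T):
    """Turn an edit transcript over {M, R, I, D} into the two rows of an alignment."""
    top, bottom = [], []
    i = j = 0  # next unread positions in S and T (0-indexed)
    for op in transcript:
        if op in "MR":            # match or replace: consume one character from each string
            top.append(S[i])
            bottom.append(T[j])
            i += 1
            j += 1
        elif op == "I":           # insert T[j] into S: dash on top
            top.append("-")
            bottom.append(T[j])
            j += 1
        elif op == "D":           # delete S[i] from S: dash on the bottom
            top.append(S[i])
            bottom.append("-")
            i += 1
        else:
            raise ValueError(f"unknown operation {op!r}")
    return "".join(top), "".join(bottom)


top, bottom = transcript_to_alignment("RIMDMDMMI", "vintner", "writers")
print(top)     # v-intner-
print(bottom)  # wri-t-ers
```

Reading the columns back from left to right recovers the transcript, which is why the two views carry the same information.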
8
Edit Distance Problem
Input
  – 2 strings S and T
Task
  – Output edit distance of S and T
  – Output optimal edit transcript
  – Output optimal alignment
Solution method
  – Dynamic Programming
9
Definition of D(i,j)
Let D(i,j) be the edit distance of S[1..i] and T[1..j]
  – The edit distance of the first i characters of S with the first j characters of T
  – Let |S| = n, |T| = m
    D(n,m) = edit distance of S and T
We will compute D(i,j) for all i and j such that 0 <= i <= n, 0 <= j <= m
10
Recurrence Relation
Base Case:
  – For 0 <= i <= n, D(i,0) = i
  – For 0 <= j <= m, D(0,j) = j
Recursive Case:
  – For 0 < i <= n, 0 < j <= m
  – D(i,j) = min of
    D(i-1,j) + 1 (what does this mean?)
    D(i,j-1) + 1 (what does this mean?)
    D(i-1,j-1) + d(i,j) (what does this mean?)
  – d(i,j) = 0 if S(i) = T(j), and 1 otherwise
11
What the various cases mean
D(i,j) = min of
  – D(i-1,j) + 1
    Align S[1..i-1] with T[1..j] optimally
    Match S(i) with a dash in T
  – D(i,j-1) + 1
    Align S[1..i] with T[1..j-1] optimally
    Match a dash in S with T(j)
  – D(i-1,j-1) + d(i,j)
    Align S[1..i-1] with T[1..j-1] optimally
    Match S(i) with T(j)
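The recurrence and its three cases can be sketched in Python as follows. This is a minimal illustration under unit costs; the function name `edit_distance_table` is an assumption, and Python's 0-based indexing means S(i) is written `S[i - 1]`.

```python
def edit_distance_table(S, T):
    """Fill D[i][j] = edit distance of S[1..i] and T[1..j] with unit costs."""
    n, m = len(S), len(T)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                      # base case: delete the first i characters of S
    for j in range(m + 1):
        D[0][j] = j                      # base case: insert the first j characters of T
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if S[i - 1] == T[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,         # S(i) against a dash in T
                D[i][j - 1] + 1,         # a dash in S against T(j)
                D[i - 1][j - 1] + d,     # S(i) against T(j)
            )
    return D


D = edit_distance_table("vintner", "writers")
print(D[1])      # row i = 1: [1, 1, 2, 3, 4, 5, 6, 7]
print(D[7][7])   # edit distance of the two strings: 5
```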
12
Computing D(i,j) values
D(i,j)      w  r  i  t  e  r  s
   j:    0  1  2  3  4  5  6  7
   i=0
v  i=1
i  i=2
n  i=3
t  i=4
n  i=5
e  i=6
r  i=7
13
Initialization: Base Case
D(i,j)      w  r  i  t  e  r  s
   j:    0  1  2  3  4  5  6  7
   i=0   0  1  2  3  4  5  6  7
v  i=1   1
i  i=2   2
n  i=3   3
t  i=4   4
n  i=5   5
e  i=6   6
r  i=7   7
14
Row i=1
D(i,j)      w  r  i  t  e  r  s
   j:    0  1  2  3  4  5  6  7
   i=0   0  1  2  3  4  5  6  7
v  i=1   1  1  2  3  4  5  6  7
i  i=2   2
n  i=3   3
t  i=4   4
n  i=5   5
e  i=6   6
r  i=7   7
15
Entry i=2, j=2
D(i,j)      w  r  i  t  e  r  s
   j:    0  1  2  3  4  5  6  7
   i=0   0  1  2  3  4  5  6  7
v  i=1   1  1  2  3  4  5  6  7
i  i=2   2  2  ?
n  i=3   3
t  i=4   4
n  i=5   5
e  i=6   6
r  i=7   7
16
Entry i=2, j=3
D(i,j)      w  r  i  t  e  r  s
   j:    0  1  2  3  4  5  6  7
   i=0   0  1  2  3  4  5  6  7
v  i=1   1  1  2  3  4  5  6  7
i  i=2   2  2  2  ?
n  i=3   3
t  i=4   4
n  i=5   5
e  i=6   6
r  i=7   7
17
Calculation methodologies
Location of edit distance
  – D(n,m)
Example was to calculate row by row
Can also calculate column by column
Can also use antidiagonals
Key is to build from upper left corner
18
Traceback
Using table to construct optimal transcript
Pointers in cell D(i,j)
  – Set a pointer from cell (i,j) to
    cell (i, j-1) if D(i,j) = D(i, j-1) + 1
    cell (i-1,j) if D(i,j) = D(i-1,j) + 1
    cell (i-1,j-1) if D(i,j) = D(i-1,j-1) + d(i,j)
  – Follow path of pointers from (n,m) back to (0,0)
19
What the pointers mean
horizontal pointer: cell (i,j) to cell (i, j-1)
  – Align T(j) with a space in S
  – Insert T(j) into S
vertical pointer: cell (i,j) to cell (i-1, j)
  – Align S(i) with a space in T
  – Delete S(i) from S
diagonal pointer: cell (i,j) to cell (i-1, j-1)
  – Align S(i) with T(j)
  – Match if S(i) = T(j); otherwise replace S(i) with T(j)
20
Table and transcripts
The pointers represent all optimal transcripts
Theorem:
  – Any path from (n,m) to (0,0) following the pointers specifies an optimal transcript.
  – Conversely, any optimal transcript is specified by such a path.
  – The correspondence between paths and transcripts is one to one.
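A traceback along these pointers might look like the Python sketch below. It is illustrative only: the name `optimal_transcript` is an assumption, and instead of storing pointers it rechecks the three conditions at each cell. When several pointers leave a cell it prefers diagonal, then vertical, then horizontal, so it returns just one of the possibly many optimal transcripts.

```python
def optimal_transcript(S, T):
    """Return (one optimal edit transcript, edit distance) under unit costs."""
    n, m = len(S), len(T)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if S[i - 1] == T[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + d)

    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and S[i - 1] == T[j - 1]
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (0 if same else 1):
            ops.append("M" if same else "R")   # diagonal pointer
            i -= 1
            j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append("D")                    # vertical pointer: delete S(i)
            i -= 1
        else:
            ops.append("I")                    # horizontal pointer: insert T(j)
            j -= 1
    return "".join(reversed(ops)), D[n][m]


transcript, distance = optimal_transcript("vintner", "writers")
print(transcript, distance)
```

The number of non-M symbols in the returned transcript equals D(n,m), which gives a quick sanity check on any traceback.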
21
Running Time
Initialization of table
  – O(n+m)
Calculating table and pointers
  – O(nm)
Traceback for one optimal transcript or optimal alignment
  – O(n+m)
22
Operation-Weight Edit Distance
Consider strings S and T
We can assign weights to the various operations
  – insertion/deletion of a character: cost d
  – substitution (replacement) of a character: cost r
  – matching: cost e
  – Previous case: d = r = 1, e = 0
23
Modified Recurrence Relation
Base Case:
  – For 0 <= i <= n, D(i,0) = i * d
  – For 0 <= j <= m, D(0,j) = j * d
Recursive Case:
  – For 0 < i <= n, 0 < j <= m
  – D(i,j) = min of
    D(i-1,j) + d
    D(i,j-1) + d
    D(i-1,j-1) + d(i,j)
  – d(i,j) = e if S(i) = T(j), and r otherwise
  – Note: d alone is the insertion/deletion cost; d(i,j) is the cost of aligning S(i) with T(j)
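The operation-weight recurrence could be coded as in this Python sketch. The function name `operation_weight_distance` is an assumption; the parameters `d`, `r` and `e` follow the slide's cost names, and the local variable `pair_cost` plays the role of d(i,j).

```python
def operation_weight_distance(S, T, d=1, r=1, e=0):
    """Edit distance with indel cost d, replacement cost r and match cost e."""
    n, m = len(S), len(T)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i * d
    for j in range(m + 1):
        D[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair_cost = e if S[i - 1] == T[j - 1] else r   # this is d(i,j)
            D[i][j] = min(D[i - 1][j] + d,
                          D[i][j - 1] + d,
                          D[i - 1][j - 1] + pair_cost)
    return D[n][m]


print(operation_weight_distance("vintner", "writers"))            # unit costs: 5
print(operation_weight_distance("vintner", "writers", d=1, r=3))  # costly replacements: 6
```

With r larger than 2d a replacement is never cheaper than a deletion plus an insertion, so the optimal transcript in the second call uses indels only.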
24
Alphabet-Weight Edit Distance
Define weight of each possible substitution
  – r(a,b) where a is being replaced by b, for all a,b in the alphabet
  – For example, with DNA, maybe r(A,T) > r(A,G)
  – Likewise, the insertion/deletion weight I(a) may vary by character
Operation-weight edit distance is a special case of this variation
Weighted edit distance refers to this alphabet-weight setting
25
Modified Recurrence Relation
Base Case:
  – For 0 <= i <= n, $D(i,0) = \sum_{1 \le k \le i} I(S(k))$
  – For 0 <= j <= m, $D(0,j) = \sum_{1 \le k \le j} I(T(k))$
Recursive Case:
  – For 0 < i <= n, 0 < j <= m
  – D(i,j) = min of
    D(i-1,j) + I(S(i))
    D(i,j-1) + I(T(j))
    D(i-1,j-1) + d(i,j)
  – d(i,j) = r(S(i), T(j))
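The alphabet-weight recurrence can be sketched in Python by passing the weights in as functions. Everything concrete below is an assumption made for illustration: the function name, the rule that r(a,a) = 0, and the DNA weights (1 for a change within purines or within pyrimidines, 2 otherwise, and 2 for any insertion or deletion). The slides do not give numeric weights.

```python
def alphabet_weight_distance(S, T, r, I):
    """Edit distance where r(a, b) weights substitutions and I(a) weights indels."""
    n, m = len(S), len(T)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + I(S[i - 1])     # running sum of I(S(k))
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + I(T[j - 1])     # running sum of I(T(k))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + I(S[i - 1]),
                          D[i][j - 1] + I(T[j - 1]),
                          D[i - 1][j - 1] + r(S[i - 1], T[j - 1]))
    return D[n][m]


PURINES = {"A", "G"}  # hypothetical grouping used only for this example


def dna_r(a, b):
    """Hypothetical substitution weights: same class is cheaper than cross class."""
    if a == b:
        return 0
    return 1 if (a in PURINES) == (b in PURINES) else 2


def dna_I(a):
    """Hypothetical indel weight, the same for every character here."""
    return 2


print(alphabet_weight_distance("A", "G", dna_r, dna_I))  # 1
print(alphabet_weight_distance("A", "T", dna_r, dna_I))  # 2
```

Choosing r(a,b) = 1 for a != b and I(a) = 1 recovers the unit-cost edit distance, which shows how the simpler settings are special cases.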
26
Measuring Similarity of S and T
Definitions
  – Let $\Sigma$ be the alphabet for strings S and T
  – Let $\Sigma'$ be the alphabet with the character - added
  – For any two characters x,y in $\Sigma'$, s(x,y) denotes the value (or score) obtained by aligning x with y
  – For a given alignment A of S and T, let S' and T' denote the strings after the chosen insertion of spaces, and l their new length
  – The value of alignment A is $\sum_{1 \le i \le l} s(S'(i), T'(i))$
27
Example
Alignment
  a  b  a  a  -  b  a  b
  a  a  a  a  a  b  -  b
Value: 1 - 2 + 1 + 1 + 0 + 2 + 0 + 2 = 5
Scoring matrix s
  s    a    b    -
  a    1   -2    0
  b         2
  -              0
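The value of this example alignment can be recomputed with a short Python sketch. The names `SCORES`, `s` and `alignment_value` are assumptions. Only the scores that the example actually uses are listed, and the lookup assumes s is symmetric, which the slide does not state explicitly.

```python
# Scores used by the example alignment; symmetry s(x, y) = s(y, x) is an assumption.
SCORES = {("a", "a"): 1, ("a", "b"): -2, ("a", "-"): 0, ("b", "b"): 2}


def s(x, y):
    """Look up the score of aligning x with y, trying both orders of the pair."""
    if (x, y) in SCORES:
        return SCORES[(x, y)]
    return SCORES[(y, x)]


def alignment_value(S_aligned, T_aligned):
    """Sum s over the columns of an alignment given as two equal-length rows."""
    if len(S_aligned) != len(T_aligned):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(s(x, y) for x, y in zip(S_aligned, T_aligned))


print(alignment_value("abaa-bab", "aaaaab-b"))  # 1 - 2 + 1 + 1 + 0 + 2 + 0 + 2 = 5
```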
28
String Similarity Problem
Input
  – 2 strings S and T
  – Scoring matrix s for alphabet $\Sigma'$
Task
  – Output optimal alignment value of S and T
    The alignment of S and T with maximal, not minimal, value
  – Output this alignment
29
Modified Recurrence Relation
Base Case:
  – For 0 <= i <= n, $V(i,0) = \sum_{1 \le k \le i} s(S(k), -)$
  – For 0 <= j <= m, $V(0,j) = \sum_{1 \le k \le j} s(-, T(k))$
Recursive Case:
  – For 0 < i <= n, 0 < j <= m
  – V(i,j) = max of
    V(i-1,j) + s(S(i),-)
    V(i,j-1) + s(-,T(j))
    V(i-1,j-1) + s(S(i), T(j))
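The similarity recurrence differs from the distance recurrence only in taking a maximum and in reading all scores from s. A Python sketch follows, with the function name `similarity` and both example scoring functions chosen purely for illustration (match +2, everything else -1 is a hypothetical scheme, not one from the slides).

```python
def similarity(S, T, s):
    """Return V(n, m), the maximum alignment value of S and T under score s(x, y)."""
    n, m = len(S), len(T)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + s(S[i - 1], "-")
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + s("-", T[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j] + s(S[i - 1], "-"),
                          V[i][j - 1] + s("-", T[j - 1]),
                          V[i - 1][j - 1] + s(S[i - 1], T[j - 1]))
    return V[n][m]


def simple_score(x, y):
    """Hypothetical scores: +2 for a match, -1 for a mismatch or a space."""
    return 2 if x == y else -1


def negated_unit_cost(x, y):
    """0 for a match, -1 otherwise, so V(n, m) is minus the unit-cost edit distance."""
    return 0 if x == y else -1


print(similarity("ab", "b", simple_score))                   # 1
print(similarity("vintner", "writers", negated_unit_cost))   # -5
```

The second call illustrates that maximizing similarity with negated costs gives the same answer as minimizing edit distance.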
30
Longest Common Subsequence Problem
Given 2 strings S and T, a common subsequence is a subsequence that appears in both S and T.
The longest common subsequence problem is to find a longest common subsequence (lcs) of S and T
  – subsequence: characters need not be contiguous
  – different from a substring
Can you use dynamic programming to solve the longest common subsequence problem?
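One possible answer to the question is sketched below in Python. It is not taken from the slides: the function name `lcs` and the table `L` are assumptions, and the recurrence is the standard one in which L[i][j] is the lcs length of S[1..i] and T[1..j]. It has the same table shape and traceback idea as the edit distance computation.

```python
def lcs(S, T):
    """Return one longest common subsequence of S and T."""
    n, m = len(S), len(T)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if S[i - 1] == T[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1       # extend a common subsequence
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])

    # Trace back from (n, m), collecting matched characters in reverse order.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if S[i - 1] == T[j - 1]:
            out.append(S[i - 1])
            i -= 1
            j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("vintner", "writers"))  # iter
```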
31
Computing alignments using linear space: Hirschberg [1977]
Suppose we only need the maximum similarity/distance value of S and T, without an alignment or transcript
How can we conserve space?
  – Only save row i-1 when computing row i in the table
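The row-saving idea can be sketched for unit-cost edit distance as follows. The function name `edit_distance_two_rows` is an assumption. Only the previous row and the row being filled are ever held in memory, so the working space is O(m) instead of O(nm).

```python
def edit_distance_two_rows(S, T):
    """Unit-cost edit distance of S and T, keeping only rows i-1 and i."""
    prev = list(range(len(T) + 1))               # row 0: D(0,j) = j
    for i, a in enumerate(S, start=1):
        cur = [i]                                # D(i,0) = i
        for j, b in enumerate(T, start=1):
            d = 0 if a == b else 1
            cur.append(min(prev[j] + 1,          # D(i-1,j) + 1
                           cur[j - 1] + 1,       # D(i,j-1) + 1
                           prev[j - 1] + d))     # D(i-1,j-1) + d(i,j)
        prev = cur                               # row i becomes row i-1
    return prev[-1]


print(edit_distance_two_rows("vintner", "writers"))  # 5
```

Because earlier rows are discarded, this version returns the value only; no traceback is possible from it.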
32
Illustration [figure: dynamic programming table with one axis indexed 0..n and the other indexed 0..m]
33
Linear space and an alignment
Assume S has length 2n
(S^r and T^r denote the reversals of S and T)
Divide and conquer approach
  – Compute value of optimal alignment of S[1..n] with all prefixes of T
    Store row n only at end, along with pointer values of row n
  – Compute value of optimal alignment of S^r[1..n] with all prefixes of T^r
    Store only values in row n
Find k such that
  – V(S[1..n],T[1..k]) + V(S^r[1..n],T^r[1..m-k])
  – is maximized over 0 <= k <= m
34
Illustration [figure: forward table for S[1..6] beside the reverse table for S^r[1..6], rows 0..18 for T and T^r]: V(S[1..6], T[1..0]) and V(S^r[1..6], T^r[1..18]), k = 0, m-k = 18
35
Illustration [figure: forward table for S[1..6] beside the reverse table for S^r[1..6], rows 0..18 for T and T^r]: V(S[1..6], T[1..1]) and V(S^r[1..6], T^r[1..17]), k = 1, m-k = 17
36
Illustration [figure: forward table for S[1..6] beside the reverse table for S^r[1..6], rows 0..18 for T and T^r]: V(S[1..6], T[1..2]) and V(S^r[1..6], T^r[1..16]), k = 2, m-k = 16
37
Illustration [figure: forward table for S[1..6] beside the reverse table for S^r[1..6], rows 0..18 for T and T^r]: V(S[1..6], T[1..9]) and V(S^r[1..6], T^r[1..9]), k = 9, m-k = 9
38
Illustration [figure: forward table for S[1..6] beside the reverse table for S^r[1..6], rows 0..18 for T and T^r]: V(S[1..6], T[1..18]) and V(S^r[1..6], T^r[1..0]), k = 18, m-k = 0
39
Illustration [figure: forward table for S[1..6] beside the reverse table for S^r[1..6], rows 0..18 for T and T^r]
40
Recursive Step
Let k* be the k that maximizes
  – V(S[1..n],T[1..k]) + V(S^r[1..n],T^r[1..m-k])
Record all steps on row n, including the one from n-1 and the one to n+1
Recurse on the two subproblems
  – S[1..n-1] with T[1..j] where j <= k*
  – S^r[1..n] with T^r[1..q] where q <= m-k*
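The divide-and-conquer scheme can be sketched in Python as below. This is a simplified variant rather than a line-by-line rendering of the slide: it splits S at its middle row and recurses on S[1..mid] with T[1..k*] and on the remaining halves, instead of recording the steps taken on row n. The names `last_row`, `hirschberg` and `unit` are assumptions, and string slicing makes copies, so the code illustrates the structure more than a tuned memory footprint.

```python
def last_row(S, T, s):
    """Values V(|S|, j) for j = 0..|T|, computed while keeping a single previous row."""
    prev = [0] * (len(T) + 1)
    for j in range(1, len(T) + 1):
        prev[j] = prev[j - 1] + s("-", T[j - 1])
    for a in S:
        cur = [prev[0] + s(a, "-")]
        for j in range(1, len(T) + 1):
            cur.append(max(prev[j] + s(a, "-"),
                           cur[j - 1] + s("-", T[j - 1]),
                           prev[j - 1] + s(a, T[j - 1])))
        prev = cur
    return prev


def hirschberg(S, T, s):
    """Return one maximum-value alignment of S and T as a pair of rows."""
    if not S:
        return "-" * len(T), T
    if not T:
        return S, "-" * len(S)
    if len(S) == 1:
        # Base case: put the single character against a dash or against some T[k].
        gaps = sum(s("-", b) for b in T)
        best_k, best = None, s(S, "-") + gaps
        for k, b in enumerate(T):
            value = gaps - s("-", b) + s(S, b)
            if value > best:
                best_k, best = k, value
        if best_k is None:
            return S + "-" * len(T), "-" + T
        return "-" * best_k + S + "-" * (len(T) - best_k - 1), T

    mid, m = len(S) // 2, len(T)
    forward = last_row(S[:mid], T, s)                 # first half against prefixes of T
    backward = last_row(S[mid:][::-1], T[::-1], s)    # reversed second half against T^r
    k_star = max(range(m + 1), key=lambda k: forward[k] + backward[m - k])
    top1, bottom1 = hirschberg(S[:mid], T[:k_star], s)
    top2, bottom2 = hirschberg(S[mid:], T[k_star:], s)
    return top1 + top2, bottom1 + bottom2


def unit(x, y):
    """Example score: 0 for a match, -1 otherwise (negated unit-cost edit distance)."""
    return 0 if x == y else -1


top, bottom = hirschberg("vintner", "writers", unit)
print(top)
print(bottom)
```

With the `unit` score the columns of the returned alignment sum to minus the edit distance, so the result can be checked against the quadratic-space table.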
41
Illustration [figure: forward table for S[1..6] beside the reverse table for S^r[1..6], rows 0..18 for T and T^r]
42
Time Required
cmn time to get this answer so far
Two subproblems have at most half the total size of this problem
  – At most the same cmn time to get the rest of the solution
    cmn/2 + cmn/4 + cmn/8 + cmn/16 + … <= cmn
Final result
  – Linear space with only twice as much time
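The halving claim can be written out as a short worked bound. This is a sketch under one assumption: filling a table for strings of lengths n and m costs at most cmn, and the split leaves two subproblems with n/2 rows each and k* and m-k* columns. The `align*` environment assumes the amsmath package.

```latex
% Assumption: a table for lengths n and m costs at most c*m*n to fill.
\begin{align*}
c\,\tfrac{n}{2}\,k^{*} + c\,\tfrac{n}{2}\,(m - k^{*}) &= \tfrac{cmn}{2}, \\
\text{total time} &\le cmn + \tfrac{cmn}{2} + \tfrac{cmn}{4} + \tfrac{cmn}{8} + \cdots
  = cmn \sum_{t \ge 0} 2^{-t} = 2\,cmn .
\end{align*}
```

The first line shows why the two subproblems together cost half of the original; applying the same argument at every level of the recursion gives the geometric series in the second line.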