Download presentation
Presentation is loading. Please wait.
Published byJoleen Sutton Modified over 9 years ago
1
String Algorithms David Kauchak cs302 Spring 2012
2
Where did “dynamic programming” come from? Richard Bellman On the Birth of Dynamic Programming Stuart Dreyfus http://www.eng.tau.ac.il/~ami/cd/o r50/1526-5463-2002-50-01- 0048.pdf
3
Strings Let Σ be an alphabet, e.g. Σ = (, a, b, c, …, z) A string is any member of Σ*, i.e. any sequence of 0 or more members of Σ ‘this is a string’ Σ* ‘this is also a string’ Σ* ‘1234’ Σ*
4
String operations Given strings s 1 of length n and s 2 of length m Equality: is s 1 = s 2 ? (case sensitive or insensitive) Running time O(n) where n is length of shortest string ‘this is a string’ = ‘this is a string’ ‘this is a string’ ≠ ‘this is another string’ ‘this is a string’ =? ‘THIS IS A STRING’
5
String operations Concatenate (append): create string s 1 s 2 Running time (assuming we generate a new string) Θ(n+m) ‘this is a’. ‘ string’ → ‘this is a string’
6
String operations Substitute: Exchange all occurrences of a particular character with another character Running time Θ(n) Substitute(‘this is a string’, ‘i’, ‘x’) → ‘thxs xs a strxng’ Substitute(‘banana’, ‘a’, ‘o’) → ‘bonono’
7
String operations Length: return the number of characters/symbols in the string Running time O(1) or Θ(n) depending on implementation Length(‘this is a string’) → 16 Length(‘this is another string’) → 24
8
String operations Prefix: Get the first j characters in the string Running time Θ(j) Suffix: Get the last j characters in the string Running time Θ(j) Prefix(‘this is a string’, 4) → ‘this’ Suffix(‘this is a string’, 6) → ‘string’
9
String operations Substring – Get the characters between i and j inclusive Running time Θ(j – i + 1) Prefix: Prefix(S, i) = Substring(S, 1, i) Suffix: Suffix(S, i) = Substring(S, i+1, length(n)) Substring(‘this is a string’, 4, 8) → ‘s is ’
10
Edit distance (aka Levenshtein distance) Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s 1 into string s 2 Insertion: ABACEDABACCEDDABACCED Insert ‘C’Insert ‘D’
11
Edit distance (aka Levenshtein distance) Deletion: ABACED Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s 1 into string s 2
12
Edit distance (aka Levenshtein distance) Deletion: ABACEDBACED Delete ‘A’ Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s 1 into string s 2
13
Edit distance (aka Levenshtein distance) Deletion: ABACEDBACEDBACE Delete ‘A’Delete ‘D’ Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s 1 into string s 2
14
Edit distance (aka Levenshtein distance) Substitution: ABACEDABADEDABADES Sub ‘D’ for ‘C’Sub ‘S’ for ‘D’ Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s 1 into string s 2
15
Edit distance examples Edit(Kitten, Mitten) =1 Operations: Sub ‘M’ for ‘K’Mitten
16
Edit distance examples Edit(Happy, Hilly) =3 Operations: Sub ‘a’ for ‘i’Hippy Sub ‘l’ for ‘p’Hilpy Sub ‘l’ for ‘p’Hilly
17
Edit distance examples Edit(Banana, Car) =5 Operations: Delete ‘B’anana Delete ‘a’nana Delete ‘n’naa Sub ‘C’ for ‘n’Caa Sub ‘a’ for ‘r’Car
18
Edit distance examples Edit(Simple, Apple) =3 Operations: Delete ‘S’imple Sub ‘A’ for ‘i’Ample Sub ‘m’ for ‘p’Apple
19
Edit distance Why might this be useful?
20
Is edit distance symmetric? that is, is Edit(s 1, s 2 ) = Edit(s 2, s 1 )? Why? sub ‘i’ for ‘j’ → sub ‘j’ for ‘i’ delete ‘i’ → insert ‘i’ insert ‘i’ → delete ‘i’ Edit(Simple, Apple) =? Edit(Apple, Simple)
21
Calculating edit distance X = A B C B D A B Y = B D C A B A Ideas?
22
Calculating edit distance X = A B C B D A ? Y = B D C A B ? After all of the operations, X needs to equal Y
23
Calculating edit distance X = A B C B D A ? Y = B D C A B ? Operations:Insert Delete Substitute
24
Insert X = A B C B D A ? Y = B D C A B ?
25
Insert X = A B C B D A ? Y = B D C A B ? Edit
26
Delete X = A B C B D A ? Y = B D C A B ?
27
Delete X = A B C B D A ? Y = B D C A B ? Edit
28
Substition X = A B C B D A ? Y = B D C A B ?
29
Substition X = A B C B D A ? Y = B D C A B ? Edit
30
Anything else? X = A B C B D A ? Y = B D C A B ?
31
Equal X = A B C B D A ? Y = B D C A B ?
32
Equal X = A B C B D A ? Y = B D C A B ? Edit
33
Combining results Insert: Delete: Substitute: Equal:
34
Combining results
35
Running time Θ(nm)
36
Variants Only include insertions and deletions What does this do to substitutions? Include swaps, i.e. swapping two adjacent characters counts as one edit Weight insertion, deletion and substitution differently Weight specific character insertion, deletion and substitutions differently Length normalize the edit distance
37
String matching Given a pattern string P of length m and a string S of length n, find all locations where P occurs in S P = ABA S = DCABABBABABA
38
String matching P = ABA S = DCABABBABABA Given a pattern string P of length m and a string S of length n, find all locations where P occurs in S
39
Uses grep/egrep search find java.lang.String.contains()
40
Naive implementation
41
Is it correct?
42
Running time? What is the cost of the equality check? Best case: O(1) Worst case: O(m)
43
Running time? Best case Θ(n) – when the first character of the pattern does not occur in the string Worst case O((n-m+1)m)
44
Worst case P = AAAA S = AAAAAAAAAAAAA
45
Worst case P = AAAA S = AAAAAAAAAAAAA
46
Worst case P = AAAA S = AAAAAAAAAAAAA
47
Worst case P = AAAA S = AAAAAAAAAAAAA repeated work!
48
Worst case P = AAAA S = AAAAAAAAAAAAA Ideally, after the first match, we’d know to just check the next character to see if it is an ‘A’
49
Patterns Which of these patterns will have that problem? P = ABAB P = ABDC P = BAA P = ABBCDDCAABB
50
Patterns Which of these patterns will have that problem? P = ABAB P = ABDC P = BAA P = ABBCDDCAABB If the pattern has a suffix that is also a prefix then we will have this problem
51
Finite State Automata (FSA) An FSA is defined by 5 components Q is the set of states q0q0 q1q1 q2q2 qnqn …
52
Finite State Automata (FSA) An FSA is defined by 5 components Q is the set of states q 0 is the start state A Q, is the set of accepting states where |A| > 0 Σ is the alphabet (e.g. {A, B} is the transition function from Q x Σ to Q q0q0 q1q1 q2q2 qnqn … Q Σ Q q 0 A q 1 q 0 B q 2 q 1 A q 1 … q0q0 q1q1 q2q2 A B A … q7q7
53
FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A An FSA starts at state q 0 and reads the characters of the input string one at a time. If the automaton is in state q and reads character a, then it transitions to state (q,a). If the FSA reaches an accepting state (q A), then the FSA has found a match.
54
FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A What pattern does this represent? P = ABA
55
FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA
56
FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA
57
FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA
58
FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA
59
FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA
60
FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA
61
FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA
62
FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA
63
FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA
64
FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA
65
FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA
66
FSA operation q0q0 q3q3 q1q1 AB q2q2 A B A B B A P = ABA S = BABABBABABA
67
Suffix function The suffix function σ(x,y) is the length of the longest suffix of x that is a prefix of y σ(abcdab, ababcd) = ?
68
Suffix function The suffix function σ(x,y) is the length of the longest suffix of x that is a prefix of y σ(abcdab, ababcd) = 2
69
Suffix function The suffix function σ(x,y) is the index of the longest suffix of x that is a prefix of y σ(daabac, abacac) = ?
70
Suffix function σ(daabac, abacac) = 4 The suffix function σ(x,y) is the length of the longest suffix of x that is a prefix of y
71
Suffix function σ(dabb, abacd) = ? The suffix function σ(x,y) is the length of the longest suffix of x that is a prefix of y
72
Suffix function σ(dabb, abacd) = 0 The suffix function σ(x,y) is the length of the longest suffix of x that is a prefix of y
73
Building a string matching automata Given a pattern P = p 1, p 2, …, p m, we’d like to build an FSA that recognizes P in strings P = ababaca Ideas?
74
Building a string matching automata Q = q 1, q 2, …, q m corresponding to each symbol, plus a q 0 starting state the set of accepting states, A = {q m } vocab Σ all symbols in P, plus one more representing all symbols not in P The transition function for q Q and a Σ is defined as: (q, a) = σ(p 1…q a, P) P = ababaca
75
Transition function (q, a) = σ(p 1…q a, P) P = ababaca stateabcP q0q0 ?a q1q1 b q2q2 a q3q3 b q4q4 a q5q5 c q6q6 a q7q7 σ(a, ababaca)
76
Transition function (q, a) = σ(p 1…q a, P) P = ababaca stateabcP q0q0 1?a q1q1 b q2q2 a q3q3 b q4q4 a q5q5 c q6q6 a q7q7 σ(b, ababaca)
77
Transition function (q, a) = σ(p 1…q a, P) P = ababaca stateabcP q0q0 10?a q1q1 b q2q2 a q3q3 b q4q4 a q5q5 c q6q6 a q7q7 σ(b, ababaca)
78
Transition function (q, a) = σ(p 1…q a, P) P = ababaca stateabcP q0q0 100a q1q1 b q2q2 a q3q3 b q4q4 a q5q5 c q6q6 a q7q7 σ(b, ababaca)
79
Transition function (q, a) = σ(p 1…q a, P) P = ababaca stateabcP q0q0 100a q1q1 b q2q2 a q3q3 b q4q4 a q5q5 c q6q6 a q7q7 q0q0 q1q1 A B,C
80
Transition function (q, a) = σ(p 1…q a, P) P = ababaca stateabcP q0q0 100a q1q1 120b q2q2 300a q3q3 ?b q4q4 a q5q5 c q6q6 a q7q7 We’ve seen ‘aba’ so far σ(abaa, ababaca)
81
Transition function (q, a) = σ(p 1…q a, P) P = ababaca stateabcP q0q0 100a q1q1 120b q2q2 300a q3q3 1b q4q4 a q5q5 c q6q6 a q7q7 We’ve seen ‘aba’ so far σ(abaa, ababaca)
82
Transition function (q, a) = σ(p 1…q a, P) P = ababaca stateabcP q0q0 100a q1q1 120b q2q2 300a q3q3 140b q4q4 500a q5q5 1?c q6q6 a q7q7 We’ve seen ‘ababa’ so far
83
Transition function (q, a) = σ(p 1…q a, P) P = ababaca stateabcP q0q0 100a q1q1 120b q2q2 300a q3q3 140b q4q4 500a q5q5 1?c q6q6 a q7q7 We’ve seen ‘ababa’ so far σ(ababab, ababaca)
84
Transition function (q, a) = σ(p 1…q a, P) P = ababaca stateabcP q0q0 100a q1q1 120b q2q2 300a q3q3 140b q4q4 500a q5q5 14c q6q6 a q7q7 We’ve seen ‘ababa’ so far σ(ababab, ababaca)
85
Transition function (q, a) = σ(p 1…q a, P) P = ababaca stateabcP q0q0 100a q1q1 120b q2q2 300a q3q3 140b q4q4 500a q5q5 146c q6q6 700a q7q7 120
86
Matching runtime Once we’ve built the FSA, what is the runtime? Θ(n) - Each symbol causes a state transition and we only visit each character once What is the cost to build the FSA? How many entries in the table? Ω(m|Σ|) How long does it take to calculate the suffix function at each entry? Naïve: O(m) Overall naïve: O(m 2 |Σ|) Overall fast implementation O(m|Σ|)
87
Rabin-Karp algorithm P = ABA S = BABABBABABA - Use a function T to that computes a numerical representation of P - Calculate T for all m symbol sequences of S and compare
88
P = ABA S = BABABBABABA Hash P T(P) Rabin-Karp algorithm - Use a function T to that computes a numerical representation of P - Calculate T for all m symbol sequences of S and compare
89
P = ABA S = BABABBABABA Hash m symbol sequences and compare T(P) Rabin-Karp algorithm - Use a function T to that computes a numerical representation of P - Calculate T for all m symbol sequences of S and compare T(BAB) =
90
P = ABA S = BABABBABABA Hash m symbol sequences and compare T(P) match Rabin-Karp algorithm - Use a function T to that computes a numerical representation of P - Calculate T for all m symbol sequences of S and compare T(ABA) =
91
P = ABA S = BABABBABABA Hash m symbol sequences and compare T(P) Rabin-Karp algorithm - Use a function T to that computes a numerical representation of P - Calculate T for all m symbol sequences of S and compare T(BAB) =
92
P = ABA S = BABABBABABA Hash m symbol sequences and compare T(P) … Rabin-Karp algorithm - Use a function T to that computes a numerical representation of P - Calculate T for all m symbol sequences of S and compare T(BAB) =
93
Rabin-Karp algorithm P = ABA S = BABABBABABA For this to be useful/efficient, what needs to be true about T? T(P) … T(BAB) =
94
Rabin-Karp algorithm P = ABA S = BABABBABABA For this to be useful/efficient, what needs to be true about T? T(P) … Given T(s i…i+m-1 ) we must be able to efficiently calculate T(s i+1…i+m ) T(BAB) =
95
Calculating the hash function For simplicity, assume Σ = (0, 1, 2, …, 9). (in general we can use a base larger than 10). A string can then be viewed as a decimal number How do we efficiently calculate the numerical representation of a string? T(‘9847261’) = ?
96
Horner’s rule 9847261 9 * 10 = 90 (90 + 8)*10 = 980 (980 + 4)*10 = 9840 (9840 + 7)*10 = 98470 … = 9847621
97
Horner’s rule 9847261 9 * 10 = 90 (90 + 8)*10 = 980 (980 + 4)*10 = 9840 (9840 + 7)*10 = 98470 … = 9847621 Running time? Θ(m)
98
Calculating the hash on the string 963801572348267 m = 4 Given T(s i…i+m-1 ) how can we efficiently calculate T(s i+1…i+m )? T(s i…i+m-1 )
99
Calculating the hash on the string 963801572348267 m = 4 Given T(s i…i+m-1 ) how can we efficiently calculate T(s i+1…i+m )? T(s i…i+m-1 ) subtract highest order digit 801
100
Calculating the hash on the string 963801572348267 m = 4 Given T(s i…i+m-1 ) how can we efficiently calculate T(s i+1…i+m )? T(s i…i+m-1 ) shift digits up 8010
101
Calculating the hash on the string 963801572348267 m = 4 T(s i…i+m-1 ) add in the lowest digit 8015 Given T(s i…i+m-1 ) how can we efficiently calculate T(s i+1…i+m )?
102
Calculating the hash on the string 963801572348267 m = 4 T(s i…i+m-1 ) Running time? - Θ(m) for s 1…m - O(1) for the rest Given T(s i…i+m-1 ) how can we efficiently calculate T(s i+1…i+m )?
103
Algorithm so far… Is it correct? Each string has a unique numerical value and we compare that with each value in the string Running time Preprocessing: Θ(m) Matching Θ(n-m+1) Is there any problem with this analysis?
104
Algorithm so far… Is it correct? Each string has a unique numerical value and we compare that with each value in the string Running time Preprocessing: Θ(m) Matching Θ(n-m+1) How long does the check T(P) = T(s i…i+m-1 ) take?
105
Modular arithmetics The run time assumptions we made were assuming arithmetic operations were constant time, which is not true for large numbers To keep the numbers small, we’ll use modular arithmetics, i.e. all operations are performed mod q a+b = (a+b) mod q a*b = (a*b) mod q …
106
Modular arithmetics If T(A) = T(B), then T(A) mod q = T(B) mod q In general, we can apply mods as many times as we want and we will not effect the result What is the downside to this modular approach? Spurious hits: if T(A) mod q = T(B) mod q that does not necessarily mean that T(A) = T(B) If we find a hit, we must check that the actual string matches the pattern
107
Runtime Preprocessing Θ(m) Running time Best case: Θ(n-m+1) – No matches and no spurious hits Worst case Θ((n-m+1)m)
108
Average case running time Assume v valid matches in the string What is the probability of a spurious hit? As with hashing, assume a uniform mapping onto values of q: What is the probability under this assumption? 1…q Σ*Σ*
109
Average case running time Assume v valid matches in the string What is the probability of a spurious hit? As with hashing, assume a uniform mapping onto values of q: What is the probability under this assumption? 1/q 1…q Σ*Σ*
110
Average case running time How many spurious hits? n/q Average case running time: O(n-m+1) + O(m(v+n/q)) iterate over the positions checking matches and spurious hits
111
Matching running times AlgorithmPreprocessing timeMatching time Naïve0O((n-m+1)m) FSAΘ(m|Σ|)Θ(n) Rabin-KarpΘ(m)O(n)+O(m(v+n/q)) Knuth-Morris-PrattΘ(m)Θ(n)
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.