String Algorithms David Kauchak cs302 Spring 2012.

1 String Algorithms David Kauchak cs302 Spring 2012

2 Where did “dynamic programming” come from? “Richard Bellman on the Birth of Dynamic Programming”, Stuart Dreyfus http://www.eng.tau.ac.il/~ami/cd/or50/1526-5463-2002-50-01-0048.pdf

3 Strings Let Σ be an alphabet, e.g. Σ = {‘ ’, a, b, c, …, z} (the space character plus the lowercase letters). A string is any member of Σ*, i.e. any sequence of 0 or more members of Σ. ‘this is a string’ ∈ Σ*, ‘this is also a string’ ∈ Σ*, ‘1234’ ∉ Σ*

4 String operations Given strings s1 of length n and s2 of length m. Equality: is s1 = s2? (case sensitive or insensitive) Running time O(n), where n is the length of the shorter string. ‘this is a string’ = ‘this is a string’ ‘this is a string’ ≠ ‘this is another string’ ‘this is a string’ =? ‘THIS IS A STRING’

5 String operations Concatenate (append): create string s1s2. Running time (assuming we generate a new string) Θ(n+m). ‘this is a’ . ‘ string’ → ‘this is a string’

6 String operations Substitute: Exchange all occurrences of a particular character with another character Running time Θ(n) Substitute(‘this is a string’, ‘i’, ‘x’) → ‘thxs xs a strxng’ Substitute(‘banana’, ‘a’, ‘o’) → ‘bonono’

7 String operations Length: return the number of characters/symbols in the string. Running time O(1) or Θ(n) depending on implementation. Length(‘this is a string’) → 16 Length(‘this is another string’) → 22

8 String operations Prefix: Get the first j characters in the string Running time Θ(j) Suffix: Get the last j characters in the string Running time Θ(j) Prefix(‘this is a string’, 4) → ‘this’ Suffix(‘this is a string’, 6) → ‘string’

9 String operations Substring: get the characters between i and j inclusive. Running time Θ(j – i + 1). Prefix: Prefix(S, i) = Substring(S, 1, i). Suffix: Suffix(S, i) = Substring(S, n – i + 1, n), where n = Length(S). Substring(‘this is a string’, 4, 8) → ‘s is ’
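The basic operations above can be sketched in Python. The function names are illustrative rather than taken from the slides, and the sketch assumes the slides' 1-based, inclusive indexing for Substring and "the last j characters" for Suffix.

```python
def prefix(s: str, j: int) -> str:
    """Return the first j characters (copying them costs Theta(j))."""
    return s[:j]


def suffix(s: str, j: int) -> str:
    """Return the last j characters (copying them costs Theta(j))."""
    return s[len(s) - j:]


def substring(s: str, i: int, j: int) -> str:
    """Return characters i..j inclusive, counting from 1 as the slides do."""
    return s[i - 1:j]


def substitute(s: str, old: str, new: str) -> str:
    """Exchange every occurrence of one character for another, Theta(n)."""
    return s.replace(old, new)


if __name__ == "__main__":
    text = "this is a string"
    print(prefix(text, 4))        # this
    print(suffix(text, 6))        # string
    print(substring(text, 4, 8))  # 's is ' (with a trailing space)
    print(substitute("banana", "a", "o"))  # bonono
```

Python slices are 0-based and half-open, which is why `substring` shifts only the left index.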

10 Edit distance (aka Levenshtein distance) Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s1 into string s2. Insertion: ABACED → ABACCED (insert ‘C’) → DABACCED (insert ‘D’)

11 Edit distance (aka Levenshtein distance) Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s1 into string s2. Deletion: ABACED

12 Edit distance (aka Levenshtein distance) Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s1 into string s2. Deletion: ABACED → BACED (delete ‘A’)

13 Edit distance (aka Levenshtein distance) Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s1 into string s2. Deletion: ABACED → BACED (delete ‘A’) → BACE (delete ‘D’)

14 Edit distance (aka Levenshtein distance) Edit distance between two strings is the minimum number of insertions, deletions and substitutions required to transform string s1 into string s2. Substitution: ABACED → ABADED (sub ‘D’ for ‘C’) → ABADES (sub ‘S’ for ‘D’)

15 Edit distance examples Edit(Kitten, Mitten) = 1 Operations: Sub ‘M’ for ‘K’ → Mitten

16 Edit distance examples Edit(Happy, Hilly) = 3 Operations: Sub ‘i’ for ‘a’ → Hippy; Sub ‘l’ for ‘p’ → Hilpy; Sub ‘l’ for ‘p’ → Hilly

17 Edit distance examples Edit(Banana, Car) = 5 Operations: Delete ‘B’ → anana; Delete ‘a’ → nana; Delete ‘n’ → naa; Sub ‘C’ for ‘n’ → Caa; Sub ‘r’ for ‘a’ → Car

18 Edit distance examples Edit(Simple, Apple) = 3 Operations: Delete ‘S’ → imple; Sub ‘A’ for ‘i’ → Ample; Sub ‘p’ for ‘m’ → Apple

19 Edit distance Why might this be useful?

20 Is edit distance symmetric? That is, is Edit(s1, s2) = Edit(s2, s1)? Why? sub ‘i’ for ‘j’ → sub ‘j’ for ‘i’; delete ‘i’ → insert ‘i’; insert ‘i’ → delete ‘i’. Edit(Simple, Apple) =? Edit(Apple, Simple)

21 Calculating edit distance X = A B C B D A B Y = B D C A B A Ideas?

22 Calculating edit distance X = A B C B D A ? Y = B D C A B ? After all of the operations, X needs to equal Y

23 Calculating edit distance X = A B C B D A ? Y = B D C A B ? Operations:Insert Delete Substitute

24 Insert X = A B C B D A ? Y = B D C A B ?

25 Insert X = A B C B D A ? Y = B D C A B ? Edit

26 Delete X = A B C B D A ? Y = B D C A B ?

27 Delete X = A B C B D A ? Y = B D C A B ? Edit

28 Substitution X = A B C B D A ? Y = B D C A B ?

29 Substitution X = A B C B D A ? Y = B D C A B ? Edit

30 Anything else? X = A B C B D A ? Y = B D C A B ?

31 Equal X = A B C B D A ? Y = B D C A B ?

32 Equal X = A B C B D A ? Y = B D C A B ? Edit

33 Combining results Insert: Delete: Substitute: Equal:

34 Combining results

35 Running time Θ(nm)
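The formulas on the "Combining results" slides did not survive in this transcript. As a hedged reconstruction, the sketch below uses the standard dynamic-programming formulation of Levenshtein distance, which combines the same four cases (insert, delete, substitute, equal) and fills an (n+1) × (m+1) table, matching the Θ(nm) bound. The function name `edit_distance` and the table name `dist` are assumptions, not names from the slides.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    n, m = len(x), len(y)
    # dist[i][j] = edit distance between the first i symbols of x
    # and the first j symbols of y.
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i  # delete all i symbols of x
    for j in range(m + 1):
        dist[0][j] = j  # insert all j symbols of y

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                # Equal: the last symbols match, so no edit is needed.
                dist[i][j] = dist[i - 1][j - 1]
            else:
                dist[i][j] = 1 + min(
                    dist[i][j - 1],      # insert y[j-1] into x
                    dist[i - 1][j],      # delete x[i-1]
                    dist[i - 1][j - 1],  # substitute y[j-1] for x[i-1]
                )
    return dist[n][m]


if __name__ == "__main__":
    for a, b in [("Kitten", "Mitten"), ("Happy", "Hilly"),
                 ("Banana", "Car"), ("Simple", "Apple")]:
        print(a, b, edit_distance(a, b))
```

Each of the nm table cells is filled with a constant amount of work, which is where the Θ(nm) running time comes from. The comparison is case sensitive.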

36 Variants Only include insertions and deletions What does this do to substitutions? Include swaps, i.e. swapping two adjacent characters counts as one edit Weight insertion, deletion and substitution differently Weight specific character insertion, deletion and substitutions differently Length normalize the edit distance
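Two of these variants can be sketched with one hypothetical helper, `weighted_edit_distance`, that takes a cost per operation. The parameter names are assumptions. Setting the substitution cost to 2 makes a substitution no cheaper than a deletion followed by an insertion, which gives the same distance as allowing only insertions and deletions.

```python
def weighted_edit_distance(x: str, y: str,
                           ins_cost: float = 1,
                           del_cost: float = 1,
                           sub_cost: float = 1) -> float:
    """Edit distance where each operation type has its own cost."""
    n, m = len(x), len(y)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * del_cost
    for j in range(1, m + 1):
        dist[0][j] = j * ins_cost

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Diagonal move: free if the symbols are equal, else a substitution.
            diagonal = dist[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub_cost)
            dist[i][j] = min(
                dist[i][j - 1] + ins_cost,
                dist[i - 1][j] + del_cost,
                diagonal,
            )
    return dist[n][m]


if __name__ == "__main__":
    print(weighted_edit_distance("Kitten", "Mitten"))              # 1.0
    print(weighted_edit_distance("Kitten", "Mitten", sub_cost=2))  # 2.0
```

Length normalization could be layered on top, for example by dividing the result by the length of the longer string. That divisor is one possible choice, not something the slides specify.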

37 String matching Given a pattern string P of length m and a string S of length n, find all locations where P occurs in S P = ABA S = DCABABBABABA

38 String matching P = ABA S = DCABABBABABA Given a pattern string P of length m and a string S of length n, find all locations where P occurs in S

39 Uses grep/egrep search find java.lang.String.contains()

40 Naive implementation

41 Is it correct?

42 Running time? What is the cost of the equality check? Best case: O(1) Worst case: O(m)

43 Running time? Best case Θ(n) – when the first character of the pattern does not occur in the string Worst case O((n-m+1)m)
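The code on the "Naive implementation" slide is missing from this transcript. A minimal sketch of the usual naive matcher, under the assumption that it tries every alignment and compares character by character, could look like this (`naive_match` is an illustrative name, and positions are 0-based):

```python
def naive_match(pattern: str, text: str) -> list[int]:
    """Return every 0-based position where pattern occurs in text."""
    m, n = len(pattern), len(text)
    matches = []
    for i in range(n - m + 1):          # n - m + 1 candidate alignments
        j = 0
        # The equality check stops at the first mismatch: O(1) at best,
        # O(m) at worst.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            matches.append(i)
    return matches


if __name__ == "__main__":
    print(naive_match("ABA", "DCABABBABABA"))  # [2, 7, 9]
```

When the first character of the pattern never occurs in the text, every alignment fails after one comparison, giving the Θ(n) best case. A pattern and text made of a single repeated character force m comparisons at every alignment.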

44 Worst case P = AAAA S = AAAAAAAAAAAAA

45 Worst case P = AAAA S = AAAAAAAAAAAAA

46 Worst case P = AAAA S = AAAAAAAAAAAAA

47 Worst case P = AAAA S = AAAAAAAAAAAAA repeated work!

48 Worst case P = AAAA S = AAAAAAAAAAAAA Ideally, after the first match, we’d know to just check the next character to see if it is an ‘A’

49 Patterns Which of these patterns will have that problem? P = ABAB P = ABDC P = BAA P = ABBCDDCAABB

50 Patterns Which of these patterns will have that problem? P = ABAB P = ABDC P = BAA P = ABBCDDCAABB If the pattern has a suffix that is also a prefix then we will have this problem

51 Finite State Automata (FSA) An FSA is defined by 5 components. Q is the set of states [diagram: states q0, q1, q2, …, qn]

52 Finite State Automata (FSA) An FSA is defined by 5 components: Q is the set of states; q0 is the start state; A ⊆ Q is the set of accepting states, where |A| > 0; Σ is the alphabet (e.g. {A, B}); δ is the transition function from Q × Σ to Q. Example transition table (state, symbol → next state): (q0, A) → q1; (q0, B) → q2; (q1, A) → q1; … [diagram: states q0, q1, q2, …, q7 with transitions labeled A and B]

53 FSA operation [state diagram: states q0, q1, q2, q3 with transitions labeled A and B] An FSA starts at state q0 and reads the characters of the input string one at a time. If the automaton is in state q and reads character a, then it transitions to state δ(q, a). If the FSA reaches an accepting state (q ∈ A), then the FSA has found a match.

54 FSA operation [same state diagram] What pattern does this represent? P = ABA

55-66 FSA operation [same state diagram, animated over twelve slides: the automaton reads S one character at a time] P = ABA S = BABABBABABA
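The state diagram itself is lost in this transcript, so the transition table below is an assumption: it is the four-state automaton for P = ABA over {A, B} that the suffix-function construction would produce. The sketch steps through S one character at a time and records a match whenever the accepting state q3 is reached.

```python
# Assumed transition table for P = ABA; keys are (state, symbol).
DELTA = {
    (0, "A"): 1, (0, "B"): 0,
    (1, "A"): 1, (1, "B"): 2,
    (2, "A"): 3, (2, "B"): 0,
    (3, "A"): 1, (3, "B"): 2,
}
ACCEPTING = {3}
PATTERN_LENGTH = 3


def run_fsa(text: str) -> list[int]:
    """Return the 0-based start positions of every match found by the FSA."""
    state = 0                       # start in q0
    starts = []
    for pos, symbol in enumerate(text):
        state = DELTA[(state, symbol)]
        if state in ACCEPTING:
            # pos is the last character of the match.
            starts.append(pos - PATTERN_LENGTH + 1)
    return starts


if __name__ == "__main__":
    print(run_fsa("BABABBABABA"))  # [1, 6, 8]
```

Every input character triggers exactly one table lookup, so the scan never backs up in the text.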

67 Suffix function The suffix function σ(x,y) is the length of the longest suffix of x that is a prefix of y σ(abcdab, ababcd) = ?

68 Suffix function The suffix function σ(x,y) is the length of the longest suffix of x that is a prefix of y σ(abcdab, ababcd) = 2

69 Suffix function The suffix function σ(x,y) is the length of the longest suffix of x that is a prefix of y σ(daabac, abacac) = ?

70 Suffix function σ(daabac, abacac) = 4 The suffix function σ(x,y) is the length of the longest suffix of x that is a prefix of y

71 Suffix function σ(dabb, abacd) = ? The suffix function σ(x,y) is the length of the longest suffix of x that is a prefix of y

72 Suffix function σ(dabb, abacd) = 0 The suffix function σ(x,y) is the length of the longest suffix of x that is a prefix of y
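A direct, unoptimized sketch of the suffix function tries the longest candidate length first and stops at the first one that works. The name `sigma` mirrors the slides' σ; the implementation is an illustration only.

```python
def sigma(x: str, y: str) -> int:
    """Length of the longest suffix of x that is also a prefix of y."""
    for k in range(min(len(x), len(y)), 0, -1):
        if x.endswith(y[:k]):
            return k
    return 0  # only the empty suffix matches


if __name__ == "__main__":
    print(sigma("abcdab", "ababcd"))  # 2
    print(sigma("daabac", "abacac"))  # 4
    print(sigma("dabb", "abacd"))     # 0
```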

73 Building a string matching automaton Given a pattern P = p1, p2, …, pm, we’d like to build an FSA that recognizes P in strings. P = ababaca Ideas?

74 Building a string matching automaton Q = q1, q2, …, qm corresponding to each symbol, plus a q0 starting state. The set of accepting states: A = {qm}. Vocab Σ: all symbols in P, plus one more representing all symbols not in P. The transition function for q ∈ Q and a ∈ Σ is defined as: δ(q, a) = σ(p1…pq a, P), i.e. σ applied to the first q symbols of P followed by a. P = ababaca

75 Transition function δ(q, a) = σ(p1…pq a, P) P = ababaca [transition table with columns state, a, b, c, P; the P column lists the pattern symbol that advances each state: q0 a, q1 b, q2 a, q3 b, q4 a, q5 c, q6 a; unfilled entries shown as _] q0: ? _ _ Computing σ(a, ababaca)

76 Transition function δ(q, a) = σ(p1…pq a, P) P = ababaca q0: 1 ? _ Computing σ(b, ababaca)

77 Transition function δ(q, a) = σ(p1…pq a, P) P = ababaca q0: 1 0 ? Computing σ(c, ababaca)

78 Transition function δ(q, a) = σ(p1…pq a, P) P = ababaca q0: 1 0 0 σ(c, ababaca) = 0

79 Transition function δ(q, a) = σ(p1…pq a, P) P = ababaca q0: 1 0 0 [diagram: q0 → q1 on A; q0 → q0 on B, C]

80 Transition function δ(q, a) = σ(p1…pq a, P) P = ababaca q0: 1 0 0; q1: 1 2 0; q2: 3 0 0; q3: ? _ _ We’ve seen ‘aba’ so far. Computing σ(abaa, ababaca)

81 Transition function δ(q, a) = σ(p1…pq a, P) P = ababaca q0: 1 0 0; q1: 1 2 0; q2: 3 0 0; q3: 1 _ _ We’ve seen ‘aba’ so far. σ(abaa, ababaca) = 1

82 Transition function δ(q, a) = σ(p1…pq a, P) P = ababaca q0: 1 0 0; q1: 1 2 0; q2: 3 0 0; q3: 1 4 0; q4: 5 0 0; q5: 1 ? _ We’ve seen ‘ababa’ so far

83 Transition function δ(q, a) = σ(p1…pq a, P) P = ababaca q0: 1 0 0; q1: 1 2 0; q2: 3 0 0; q3: 1 4 0; q4: 5 0 0; q5: 1 ? _ We’ve seen ‘ababa’ so far. Computing σ(ababab, ababaca)

84 Transition function δ(q, a) = σ(p1…pq a, P) P = ababaca q0: 1 0 0; q1: 1 2 0; q2: 3 0 0; q3: 1 4 0; q4: 5 0 0; q5: 1 4 _ We’ve seen ‘ababa’ so far. σ(ababab, ababaca) = 4

85 Transition function δ(q, a) = σ(p1…pq a, P) P = ababaca
state   a   b   c   P
q0      1   0   0   a
q1      1   2   0   b
q2      3   0   0   a
q3      1   4   0   b
q4      5   0   0   a
q5      1   4   6   c
q6      7   0   0   a
q7      1   2   0

86 Matching runtime Once we’ve built the FSA, what is the runtime? Θ(n): each symbol causes a state transition and we only visit each character once. What is the cost to build the FSA? How many entries in the table? Ω(m|Σ|) How long does it take to calculate the suffix function at each entry? Naïve: O(m) Overall naïve: O(m²|Σ|) Overall fast implementation: O(m|Σ|)
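The construction δ(q, a) = σ(p1…pq a, P) can be sketched directly. The names `build_transitions` and `fsa_match` are assumptions, and `sigma` is a simple brute-force suffix function, so this version makes no attempt to reach the fast O(m|Σ|) construction bound. It only illustrates that the table has (m+1)·|Σ| entries and that matching afterwards is a single pass over the text.

```python
def sigma(x: str, y: str) -> int:
    """Length of the longest suffix of x that is also a prefix of y."""
    for k in range(min(len(x), len(y)), 0, -1):
        if x.endswith(y[:k]):
            return k
    return 0


def build_transitions(pattern: str, alphabet: str) -> dict[tuple[int, str], int]:
    """Fill one table entry per (state, symbol) pair."""
    delta = {}
    for q in range(len(pattern) + 1):
        for a in alphabet:
            delta[(q, a)] = sigma(pattern[:q] + a, pattern)
    return delta


def fsa_match(pattern: str, text: str, alphabet: str) -> list[int]:
    """Return 0-based start positions of pattern in text (pattern non-empty)."""
    delta = build_transitions(pattern, alphabet)
    m = len(pattern)
    state, starts = 0, []
    for pos, symbol in enumerate(text):
        # A symbol outside the alphabet cannot extend any match: back to q0.
        state = delta.get((state, symbol), 0)
        if state == m:
            starts.append(pos - m + 1)
    return starts


if __name__ == "__main__":
    delta = build_transitions("ababaca", "abc")
    for q in range(8):
        print(f"q{q}", [delta[(q, a)] for a in "abc"])
```

Printing the table for P = ababaca reproduces the rows of the completed transition table, ending with 1 2 0 for q7.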

87 Rabin-Karp algorithm P = ABA S = BABABBABABA - Use a function T that computes a numerical representation of P - Calculate T for all m-symbol sequences of S and compare

88 Rabin-Karp algorithm P = ABA S = BABABBABABA - Use a function T that computes a numerical representation of P - Calculate T for all m-symbol sequences of S and compare. Hash P: T(P)

89 Rabin-Karp algorithm P = ABA S = BABABBABABA - Use a function T that computes a numerical representation of P - Calculate T for all m-symbol sequences of S and compare. Hash m-symbol sequences and compare with T(P): T(BAB) =

90 Rabin-Karp algorithm P = ABA S = BABABBABABA - Use a function T that computes a numerical representation of P - Calculate T for all m-symbol sequences of S and compare. Hash m-symbol sequences and compare with T(P): T(ABA) = (match)

91 Rabin-Karp algorithm P = ABA S = BABABBABABA - Use a function T that computes a numerical representation of P - Calculate T for all m-symbol sequences of S and compare. Hash m-symbol sequences and compare with T(P): T(BAB) =

92 Rabin-Karp algorithm P = ABA S = BABABBABABA - Use a function T that computes a numerical representation of P - Calculate T for all m-symbol sequences of S and compare. Hash m-symbol sequences and compare with T(P): … T(BAB) =

93 Rabin-Karp algorithm P = ABA S = BABABBABABA For this to be useful/efficient, what needs to be true about T? T(P) … T(BAB) =

94 Rabin-Karp algorithm P = ABA S = BABABBABABA For this to be useful/efficient, what needs to be true about T? T(P) … T(BAB) = Given T(s[i…i+m-1]) we must be able to efficiently calculate T(s[i+1…i+m])

95 Calculating the hash function For simplicity, assume Σ = {0, 1, 2, …, 9} (in general we can use a base larger than 10). A string can then be viewed as a decimal number. How do we efficiently calculate the numerical representation of a string? T(‘9847261’) = ?

96 Horner’s rule 9847261: 9 * 10 = 90; (90 + 8) * 10 = 980; (980 + 4) * 10 = 9840; (9840 + 7) * 10 = 98470; … = 9847261

97 Horner’s rule 9847261: 9 * 10 = 90; (90 + 8) * 10 = 980; (980 + 4) * 10 = 9840; (9840 + 7) * 10 = 98470; … = 9847261 Running time? Θ(m)
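Horner's rule as walked through above amounts to one multiply and one add per digit. A minimal sketch, assuming the digit alphabet from the slides (`horner` is an illustrative name):

```python
def horner(digits: str, base: int = 10) -> int:
    """Numerical value T of a digit string, computed left to right in Theta(m)."""
    value = 0
    for d in digits:
        # Shift everything seen so far up one place, then add the new digit.
        value = value * base + int(d)
    return value


if __name__ == "__main__":
    print(horner("9847261"))  # 9847261
```

For a general alphabet, `int(d)` would be replaced by some mapping from symbols to numbers below the base. That mapping is a design choice the slides leave open.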

98 Calculating the hash on the string 963801572348267 m = 4 Given T(s[i…i+m-1]), how can we efficiently calculate T(s[i+1…i+m])?

99 Calculating the hash on the string 963801572348267 m = 4 Given T(s[i…i+m-1]), how can we efficiently calculate T(s[i+1…i+m])? Subtract highest order digit: 801

100 Calculating the hash on the string 963801572348267 m = 4 Given T(s[i…i+m-1]), how can we efficiently calculate T(s[i+1…i+m])? Shift digits up: 8010

101 Calculating the hash on the string 963801572348267 m = 4 Given T(s[i…i+m-1]), how can we efficiently calculate T(s[i+1…i+m])? Add in the lowest digit: 8015

102 Calculating the hash on the string 963801572348267 m = 4 Given T(s[i…i+m-1]), how can we efficiently calculate T(s[i+1…i+m])? Running time? - Θ(m) for s[1…m] - O(1) for the rest
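The three steps (subtract the highest-order digit, shift, add the lowest digit) can be sketched as a sliding window over the digit string. The function name `window_values` is hypothetical; the first window costs Θ(m) and every later window is a constant number of arithmetic operations.

```python
def window_values(digits: str, m: int, base: int = 10) -> list[int]:
    """Values T of every length-m window of digits, using a rolling update."""
    n = len(digits)
    if m <= 0 or m > n:
        return []
    high = base ** (m - 1)           # place value of the highest-order digit

    value = 0
    for d in digits[:m]:             # first window: Horner's rule, Theta(m)
        value = value * base + int(d)
    values = [value]

    for i in range(m, n):
        value -= int(digits[i - m]) * high  # subtract highest-order digit
        value *= base                        # shift digits up
        value += int(digits[i])              # add in the lowest digit
        values.append(value)
    return values


if __name__ == "__main__":
    vals = window_values("963801572348267", 4)
    print(vals[2], vals[3])  # 3801 8015
```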

103 Algorithm so far… Is it correct? Each string has a unique numerical value and we compare that with each value in the string Running time Preprocessing: Θ(m) Matching Θ(n-m+1) Is there any problem with this analysis?

104 Algorithm so far… Is it correct? Each string has a unique numerical value and we compare that with each value in the string Running time Preprocessing: Θ(m) Matching Θ(n-m+1) How long does the check T(P) = T(s[i…i+m-1]) take?

105 Modular arithmetic The run time assumptions we made assumed arithmetic operations were constant time, which is not true for large numbers. To keep the numbers small, we’ll use modular arithmetic, i.e. all operations are performed mod q: a+b = (a+b) mod q, a*b = (a*b) mod q, …

106 Modular arithmetic If T(A) = T(B), then T(A) mod q = T(B) mod q. In general, we can apply mods as many times as we want and we will not affect the result. What is the downside to this modular approach? Spurious hits: if T(A) mod q = T(B) mod q, that does not necessarily mean that T(A) = T(B). If we find a hit, we must check that the actual string matches the pattern.

107 Runtime Preprocessing Θ(m) Running time Best case: Θ(n-m+1) – No matches and no spurious hits Worst case Θ((n-m+1)m)
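Putting the pieces together, a minimal Rabin-Karp sketch keeps every hash value reduced mod q and verifies each hit against the actual text, so spurious hits cost time but never produce wrong answers. The base of 256 (character codes via `ord`) and the default modulus are arbitrary assumptions for illustration, not values from the slides.

```python
def rabin_karp(pattern: str, text: str,
               base: int = 256, q: int = 1_000_003) -> list[int]:
    """Return 0-based start positions of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, q)       # place value of the leading symbol, mod q

    p_hash = t_hash = 0
    for k in range(m):               # preprocessing: Theta(m)
        p_hash = (p_hash * base + ord(pattern[k])) % q
        t_hash = (t_hash * base + ord(text[k])) % q

    matches = []
    for i in range(n - m + 1):
        # Equal hashes may be a spurious hit, so confirm with an O(m) check.
        if p_hash == t_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # Roll the window: drop text[i], shift, bring in text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % q
    return matches


if __name__ == "__main__":
    print(rabin_karp("ABA", "BABABBABABA"))        # [1, 6, 8]
    print(rabin_karp("ABA", "BABABBABABA", q=3))   # same answer, more spurious hits
```

A tiny modulus such as q = 3 forces many spurious hits, which pushes the running time toward the Θ((n-m+1)m) worst case while leaving the output unchanged.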

108 Average case running time Assume v valid matches in the string. What is the probability of a spurious hit? As with hashing, assume a uniform mapping from Σ* onto the values 1…q. What is the probability under this assumption?

109 Average case running time Assume v valid matches in the string. What is the probability of a spurious hit? As with hashing, assume a uniform mapping from Σ* onto the values 1…q. What is the probability under this assumption? 1/q

110 Average case running time How many spurious hits? n/q Average case running time: O(n-m+1) + O(m(v+n/q)) iterate over the positions checking matches and spurious hits

111 Matching running times
Algorithm            Preprocessing time   Matching time
Naïve                0                    O((n-m+1)m)
FSA                  Θ(m|Σ|)              Θ(n)
Rabin-Karp           Θ(m)                 O(n) + O(m(v+n/q))
Knuth-Morris-Pratt   Θ(m)                 Θ(n)

