String Processing
Basic String Techniques
  Storing strings
  Reading text input by line
  Concatenating strings
  Checking for a matching string at the beginning
  Finding a substring within a larger string
  Counting occurrences in a string (e.g. how many vowels)
  Tokenizing: splitting a string into substrings by delimiters
  Sorting an array of strings
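As a rough illustration of the techniques listed above, here is a minimal Python sketch. The sample text, the `count_vowels` helper, and the use of `io.StringIO` in place of real standard input are all assumptions made for the example, not part of the original slides.

```python
import io


def count_vowels(text):
    """Count occurrences of the vowels a, e, i, o, u (case-insensitive)."""
    return sum(1 for ch in text.lower() if ch in "aeiou")


# Reading text input line by line; io.StringIO stands in for sys.stdin here.
source = io.StringIO("banana split\napple pie\ncherry tart\n")
lines = [line.rstrip("\n") for line in source]   # storing strings in a list

joined = lines[0] + " and " + lines[1]   # concatenating strings
starts = lines[0].startswith("ban")      # matching string at the beginning
position = lines[0].find("split")        # substring search, -1 if absent
vowels = count_vowels(lines[0])          # counting occurrences
tokens = lines[0].split(" ")             # tokenizing by a delimiter
ordered = sorted(lines)                  # sorting an array of strings
```

Most languages offer equivalents of each call; the built-in substring search is usually adequate for the "reasonably small strings" case.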
String Matching
  Find occurrences of T (length m) inside S (length n)
  Basic matching can use library functions
    Requires reasonably small strings
  Longer matching: naïve approach
    Loop over S (1 to n)
    Check whether T occurs starting at that point (1 to m)
    So, O(nm) total
  Better: Knuth-Morris-Pratt (KMP) Algorithm
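The naïve approach can be sketched as follows. This is one possible reading of the two nested loops, using 0-based indices; the function name `naive_find_all` is a hypothetical choice for illustration.

```python
def naive_find_all(s, t):
    """Return every index in s where t occurs, checking each start in turn."""
    n, m = len(s), len(t)
    found = []
    for start in range(n - m + 1):        # outer loop over S: O(n) starts
        k = 0
        while k < m and s[start + k] == t[k]:   # inner check: up to m steps
            k += 1
        if k == m:                        # all m characters matched
            found.append(start)
    return found


print(naive_find_all("abababa", "aba"))   # [0, 2, 4]
```

The worst case is reached on inputs such as S = "aaaa...a" and T = "aaa...ab", where almost every start position matches nearly all of T before failing.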
Knuth-Morris-Pratt (KMP) Algorithm
  Idea: preprocess T (the string to find), and use matches within T to know where to start the next match
  Preprocess: for character i in T,
    If the string matched up to character i, but not character i+1,
    Then how many of the characters preceding character i+1 match the beginning of T?
    This tells you where to start matching again
  Match like naïve. But when you stop getting a match:
    Go back the given number of spaces (based on the preprocess)
    Start matching there
KMP Algorithm - running
  Example: T is abracadabra

    j:      0  1  2  3  4  5  6  7  8  9 10 11
    T:      a  b  r  a  c  a  d  a  b  r  a  (end)
    back:   0  0  0  0  1  0  1  0  1  2  3  4

  Each entry says: if there is a match up until that character, but NOT at that character, how many of the characters just matched also match the beginning of T.
  E.g. if the “c” is the first character not to match, the text had “abra” just before it. That means we can restart here, assuming 1 character (the “a” that comes right before the “c”) already matches.
  The last entry (4) is the one used after a full match.
  The table could be stored differently, with each entry holding the number of characters matching the prefix up to and including that character, but then everything else needs to be offset by 1.
  Example we will use: S is abrabracabracadabracadabra
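The preprocessing step can be sketched in Python as below. This is one common way to compute such a table, not necessarily the construction the slides had in mind; the name `build_back_table`, the extra final entry (used after a full match), and the choice of 0 for the first entry are assumptions of this sketch.

```python
def build_back_table(t):
    """back[j] = how many characters of t[0:j] (the part already matched
    when a mismatch happens at index j) also form a prefix of t."""
    m = len(t)
    back = [0] * (m + 1)          # one extra slot for "after a full match"
    k = 0                         # length of the prefix currently matched
    for j in range(1, m):
        while k > 0 and t[j] != t[k]:
            k = back[k]           # fall back to a shorter matching prefix
        if t[j] == t[k]:
            k += 1
        back[j + 1] = k
    return back


print(build_back_table("abracadabra"))
# [0, 0, 0, 0, 1, 0, 1, 0, 1, 2, 3, 4]
```

The entry at index 4 (the "c") is 1 and the entry at index 6 (the "d") is 1, which are the values the walkthrough relies on.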
    i: 01234
    S: abrabracabracadabracadabra
    T: abracadabra
    j: 01234

  Mismatch at slot 4 (i=4, j=4). The back table has value 1 there. So next we'll continue with i=4, but j will go back to 1.
    i: 0123456789
    S: abrabracabracadabracadabra
    T:    abracadabra
    j:    0123456

  Mismatch at slot j=6 (and i=9). The back table has value 1 there. So next we'll continue with i=9, but j will go back to slot 1:
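    i: 0123456789012345678
    S: abrabracabracadabracadabra
    T:         abracadabra
    j:         01234567890

  Full match here. Mark as found (at slot 8). The back table value after a full match is 4, so the next alignment reuses the last 4 matched characters: T starts at slot 15, and comparison resumes at i=19 with j=4.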
    i: 01234567890123456789012345
    S: abrabracabracadabracadabra
    T:                abracadabra
    j:                01234567890

  Full match here again. Mark as found (at slot 15). The next alignment would again reuse the last 4 matched characters (T starting at slot 22, with j=4), but S ends at i=25, so the search stops.
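The walkthrough above can be reproduced with a short Python sketch. It is self-contained, so it includes its own table construction; the names `back_table` and `kmp_find_all` are hypothetical, and storing 0 in the first table entry (with an explicit `j > 0` guard) is an assumption of this version rather than something the slides specify.

```python
def back_table(t):
    """back[j] = length of the longest prefix of t that also ends t[0:j]."""
    back = [0] * (len(t) + 1)
    k = 0
    for j in range(1, len(t)):
        while k > 0 and t[j] != t[k]:
            k = back[k]
        if t[j] == t[k]:
            k += 1
        back[j + 1] = k
    return back


def kmp_find_all(s, t):
    """Return the start index of every occurrence of t in s."""
    if not t:
        return []
    back = back_table(t)
    found = []
    j = 0                                 # characters of t matched so far
    for i, ch in enumerate(s):            # i never moves backwards
        while j > 0 and ch != t[j]:
            j = back[j]                   # mismatch: go back per the table
        if ch == t[j]:
            j += 1
        if j == len(t):                   # full match ending at i
            found.append(i - len(t) + 1)
            j = back[j]                   # keep the overlapping prefix
    return found


print(kmp_find_all("abrabracabracadabracadabra", "abracadabra"))   # [8, 15]
```

Because `i` only advances and each fallback shortens `j`, the search takes O(n) steps after the O(m) preprocessing, compared with O(nm) for the naïve loop.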
Dynamic Programming on Strings
  Edit Distance: given two strings, how many edits (insert a space, delete a character, or have a mismatch) are needed between them?
  Use DP: strings A[1..n], B[1..m]
    For A[1..i], B[1..j], we have V(i,j) = best alignment score for those prefixes. We want V(n,m).
    V(0,0) = 0
    V(i,0) = penalty to delete all i elements from A
    V(0,j) = penalty to delete all j elements from B
    V(i,j) = max of:
      V(i-1,j-1) + score(A[i],B[j])
      V(i-1,j) + score(A[i],-)
      V(i,j-1) + score(-,B[j])
  Where score(A[i],B[j]) = 2 if matching, -1 if nonmatching, and score(x,-) = score(-,x) = -1 (penalty to delete = penalty to add a space)
  Matches earn points and edits cost points, so the recurrence takes the max: a higher score means fewer edits.
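The recurrence above can be written as a bottom-up table fill. The following is a minimal Python sketch using the slide's scores (2, -1, -1); the constant and function names are hypothetical, and Python strings are 0-based, so A[i] in the recurrence corresponds to `a[i - 1]` in the code.

```python
MATCH, MISMATCH, GAP = 2, -1, -1   # scores taken from the slide


def alignment_score(a, b):
    """Best alignment score V(n, m) for strings a and b."""
    n, m = len(a), len(b)
    v = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        v[i][0] = v[i - 1][0] + GAP          # delete all i elements of a
    for j in range(1, m + 1):
        v[0][j] = v[0][j - 1] + GAP          # delete all j elements of b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            v[i][j] = max(
                v[i - 1][j - 1] + pair,      # align a[i] with b[j]
                v[i - 1][j] + GAP,           # align a[i] with a space
                v[i][j - 1] + GAP,           # align b[j] with a space
            )
    return v[n][m]


print(alignment_score("abc", "abd"))   # 3: two matches and one mismatch
```

The table has (n+1)(m+1) cells and each is filled in constant time, so the running time and memory are both O(nm).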
More DP on Strings
  For Longest Common Subsequence
    Same as String Alignment
    Penalty for mismatch = infinity (a score of negative infinity, so mismatched characters are never aligned)
    Penalty for add/delete = 0
    Points for match = 1
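To illustrate the reduction, here is a small Python sketch of the same alignment table with the LCS scores plugged in. The function name `lcs_length` is hypothetical, and using `float("-inf")` to represent the infinite mismatch penalty is an implementation choice of this sketch.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, via the alignment DP."""
    neg_inf = float("-inf")                  # mismatch is never allowed
    n, m = len(a), len(b)
    # Row 0 and column 0 stay 0 because add/delete costs nothing.
    v = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = 1 if a[i - 1] == b[j - 1] else neg_inf
            v[i][j] = max(
                v[i - 1][j - 1] + pair,      # match: one more common char
                v[i - 1][j],                 # skip a[i] for free
                v[i][j - 1],                 # skip b[j] for free
            )
    return v[n][m]


print(lcs_length("ABCBDAB", "BDCABA"))   # 4
```

Since skipping a character is free, the two skip options always keep each cell at 0 or above, and the infinite penalty simply removes the diagonal option whenever the characters differ.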