String matching
Exact String Matching
Input: Two strings T[1…n] and P[1…m], containing symbols from alphabet Σ.
Example: Σ = {A,C,G,T}, T[1…12] = “CAGTACATCGAT”, P[1…3] = “AGT”
Goal: find all “shifts” 0 ≤ s ≤ n-m such that T[s+1…s+m] = P
In the example, the only valid shift is s = 1, since T[2…4] = “AGT”.
Simple Algorithm
for s ← 0 to n-m
    Match ← 1
    for j ← 1 to m
        if T[s+j] ≠ P[j] then
            Match ← 0
            exit loop
    if Match = 1 then output s
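The simple algorithm above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `naive_match` is an assumption, and Python strings are 0-indexed, so T[s+j] for j = 1…m on the slide becomes `text[s + j]` for j = 0…m-1 here. The reported shifts s keep the same values as on the slide.

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return every shift s such that text[s:s+m] equals pattern.

    Hypothetical helper name; mirrors the slide's pseudocode loop by loop.
    """
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):            # s = 0 .. n-m
        match = True
        for j in range(m):                # compare pattern against the window
            if text[s + j] != pattern[j]:
                match = False
                break                     # stop at the first mismatch
        if match:
            shifts.append(s)
    return shifts


print(naive_match("CAGTACATCGAT", "AGT"))  # [1]
```

Breaking out at the first mismatch is what makes the expected cost per shift small on random text, even though the worst case remains O(nm).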
Analysis
Running time of the simple algorithm:
Worst-case: O(nm)
Average-case (random text): O(n) in expectation
T_s = time spent on checking shift s (the number of comparisons until the first mismatch)
E[T_s] < 2 (why?)
E[Σ_s T_s] = Σ_s E[T_s] = O(n) (by linearity of expectation)
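One way to see why E[T_s] < 2 is sketched below. It rests on assumptions the slide does not state: each text character is drawn independently and uniformly from Σ, the pattern is fixed, and T_s counts comparisons up to and including the first mismatch. The symbol σ = |Σ| ≥ 2 is introduced here only for the derivation. Under these assumptions, the k-th comparison at shift s happens only if the first k-1 comparisons all matched, which has probability σ^{-(k-1)}.

```latex
\begin{align*}
\mathbb{E}[T_s]
  &= \sum_{k=1}^{m} \Pr[T_s \ge k]
   = \sum_{k=1}^{m} \sigma^{-(k-1)}
   = \sum_{k=0}^{m-1} \sigma^{-k} \\
  &< \sum_{k=0}^{\infty} \sigma^{-k}
   = \frac{1}{1 - 1/\sigma}
   \le 2
  \qquad (\sigma \ge 2).
\end{align*}
```

Summing over the n-m+1 shifts then gives an expected total below 2(n-m+1), which is O(n).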
Approximate String Matching
Input: Two strings T[1…n] and P[1…m], containing symbols from alphabet Σ.
Goal: find all “shifts” 0 ≤ s ≤ n-m such that T[s+1…s+m] is “highly similar” to P
Two common metrics for comparing strings
Given two strings T[1…n] and P[1…m]:
Hamming distance: the number of positions at which the two strings differ, i.e. the number of substitutions needed to turn one into the other. Defined only when n = m.
Edit distance: the minimum number of edit operations (substitutions, insertions, and deletions) needed to transform one string into the other.
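The two metrics can be illustrated with a short Python sketch. The function names are assumptions. The edit distance uses the standard dynamic-programming recurrence with unit cost per operation, which the slides do not spell out, so treat it as one possible way to compute the quantity defined above.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions where a and b differ (requires equal lengths)."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions
    turning a into b (unit cost per operation, assumed here)."""
    # prev[j] = edit distance between the prefix of a seen so far and b[:j]
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]                                  # a[:i] -> "" needs i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                        # delete x
                curr[j - 1] + 1,                    # insert y
                prev[j - 1] + (x != y),             # substitute, or free match
            ))
        prev = curr
    return prev[-1]


print(hamming_distance("CAGT", "CTGA"))  # 2
print(edit_distance("AGT", "AT"))        # 1
```

Only two rows of the table are kept at a time, so the sketch uses O(m) extra space while still taking O(nm) time.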
Simple Algorithm for Hamming Distance
for s ← 0 to n-m
    Mismatch ← 0
    for j ← 1 to m
        if T[s+j] ≠ P[j] then
            Mismatch ← Mismatch+1
            if Mismatch > threshold then exit loop
    if Mismatch ≤ threshold then output s
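The Hamming-distance variant can be sketched in Python in the same way. The name `hamming_match` and its parameter names are assumptions, and indices are 0-based as is usual in Python. The example reuses the text and pattern from the exact-matching slide.

```python
def hamming_match(text: str, pattern: str, threshold: int) -> list[int]:
    """Return every shift s where text[s:s+m] differs from pattern
    in at most `threshold` positions (hypothetical helper name)."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):                # s = 0 .. n-m
        mismatches = 0
        for j in range(m):
            if text[s + j] != pattern[j]:
                mismatches += 1
                if mismatches > threshold:    # too many mismatches: give up early
                    break
        if mismatches <= threshold:
            shifts.append(s)
    return shifts


print(hamming_match("CAGTACATCGAT", "AGT", 0))  # [1]
print(hamming_match("CAGTACATCGAT", "AGT", 2))  # [1, 4, 5, 6, 8, 9]
```

With threshold 0 this reduces to exact matching. Larger thresholds admit more shifts, and the early exit keeps the work per shift bounded by the point at which the threshold is exceeded.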