String Searching Algorithms Problem Description Given two strings P and T over the same alphabet Σ, determine whether P occurs as a substring in T (or.

1 String Searching Algorithms Problem Description Given two strings P and T over the same alphabet Σ, determine whether P occurs as a substring in T (or find in which position(s) P occurs as a substring in T). The strings P and T are called pattern and target respectively. [Adapted from G.Plaxton]

2 String Searching Algorithms Some applications [Adapted from K.Wayne]

3 String Searching Algorithms Trivial Approach - Algorithm
SimpleMatcher(string P, string T)
    n ← length[T]
    m ← length[P]
    for s ← 0 to n - m do
        if P[1...m] = T[s+1...s+m] then
            print s
$T(n,m) = (n - m + 1) \cdot m \cdot \Theta(1) = \Theta(nm)$ (for $m \le n/2$; in general $\Theta((n - m + 1)\,m)$)
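
The SimpleMatcher pseudocode can be sketched in Python as follows. This is only an illustration: the function name `simple_matcher` and the choice to return a list of 0-based shifts (instead of printing them) are assumptions made here, not part of the slides.

```python
def simple_matcher(pattern: str, text: str) -> list[int]:
    """Return every shift s (0-based) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):
        # text[s:s + m] is the window T[s+1..s+m] of the pseudocode;
        # comparing it with the pattern costs up to m character comparisons.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(simple_matcher("234", "289462340372392345"))  # [5, 14]
```

When the pattern is longer than the text the range is empty, so the function returns an empty list.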

4 String Searching Algorithms Rabin-Karp Algorithm - Idea
Pattern - P[1...m], Target - T[1...n]
$p = P[m] + 10\,P[m-1] + 100\,P[m-2] + \dots + 10^{m-1}\,P[1]$
for s = 0 to n - m:
$t_s = T[s+m] + 10\,T[s+m-1] + 100\,T[s+m-2] + \dots + 10^{m-1}\,T[s+1]$
P matches T at shift s if and only if $p = t_s$

5 String Searching Algorithms Rabin-Karp Algorithm - Idea
$p = P[m] + 10\,P[m-1] + 100\,P[m-2] + \dots + 10^{m-1}\,P[1]$
$p = P[m] + 10\,(P[m-1] + 10\,(P[m-2] + \dots + 10\,(P[2] + 10\,P[1]) \dots))$
$t_0 = T[m] + 10\,T[m-1] + 100\,T[m-2] + \dots + 10^{m-1}\,T[1]$
$t_0 = T[m] + 10\,(T[m-1] + 10\,(T[m-2] + \dots + 10\,(T[2] + 10\,T[1]) \dots))$
$t_{s+1} = 10\,(t_s - 10^{m-1}\,T[s+1]) + T[s+m+1]$

6 String Searching Algorithms Rabin-Karp Algorithm - Example 1
T = 289462340372392345, P = 234
p = 234
t = 289, 894, 946, 462, 623, 234, 340, 403, 37, 372, 723, 239, 392, 923, 234, 345
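
The window values in this example can be reproduced with the rolling update from the previous slide. The sketch below assumes a text made of decimal digits and a window length m no larger than the text; the name `window_values` is introduced here only for illustration.

```python
def window_values(text: str, m: int) -> list[int]:
    """Decimal value t_s of every length-m window of a digit string."""
    digits = [int(c) for c in text]
    n = len(digits)
    high = 10 ** (m - 1)              # weight of the leading digit of a window
    t = 0
    for d in digits[:m]:              # Horner's rule gives t_0
        t = 10 * t + d
    values = [t]
    for s in range(n - m):
        # t_{s+1} = 10 (t_s - 10^(m-1) T[s+1]) + T[s+m+1], written with 0-based indices
        t = 10 * (t - high * digits[s]) + digits[s + m]
        values.append(t)
    return values


values = window_values("289462340372392345", 3)
print(values)
print([s for s, t in enumerate(values) if t == 234])  # shifts where t_s = p
```

Each update uses a constant number of arithmetic operations, which is the point of the recurrence.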

7 String Searching Algorithms Rabin-Karp Algorithm - Problem
What to do if p is too large to be stored in an integer data type?
(The simplest) solution: use $p \bmod q$ and $t_i \bmod q$ instead of $p$ and $t_i$.
If $p \bmod q \ne t_i \bmod q$, no match is possible at position i.
If $p \bmod q = t_i \bmod q$, we have to check for a match explicitly.

8 String Searching Algorithms Rabin-Karp Algorithm - Algorithm
RabinKarpMatcher(string P, string T, integer d, integer q)
    n ← length[T]; m ← length[P]
    h ← d^(m-1) mod q
    p ← 0; t_0 ← 0
    for i ← 1 to m do
        p ← (d p + P[i]) mod q
        t_0 ← (d t_0 + T[i]) mod q
    for s ← 0 to n - m do
        if p = t_s then
            if P[1...m] = T[s+1...s+m] then
                print s
        if s < n - m then
            t_(s+1) ← (d (t_s - T[s+1] h) + T[s+m+1]) mod q
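
A minimal Python sketch of RabinKarpMatcher follows. Several details are assumptions made for this illustration rather than anything stated in the slides: characters are mapped to numbers with `ord`, the default radix `d = 256` and modulus `q = 1_000_003` are arbitrary choices, shifts are returned as a list, and an empty pattern is treated as having no occurrences.

```python
def rabin_karp_matcher(pattern: str, text: str, d: int = 256, q: int = 1_000_003) -> list[int]:
    """Return all 0-based shifts where pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                      # d^(m-1) mod q
    p = t = 0
    for i in range(m):                        # fingerprints of P and of the first window
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # Equal fingerprints only suggest a match; verify it explicitly.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Drop the leading character, append the next one.
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


print(rabin_karp_matcher("234", "289462340372392345"))  # [5, 14]
```

Python's `%` always returns a non-negative result for a positive modulus, so the subtraction in the update needs no extra correction here; in languages where the remainder can be negative, adding q before taking the remainder is a common fix.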

9 String Searching Algorithms Rabin-Karp Algorithm - Example 2
T = 289462340372392345, P = 234, q = 5
p = 234, p mod q = 4
t = 289, 894, 946, 462, 623, 234, 340, 403, 37, 372, 723, 239, 392, 923, 234, 345
t mod q = 4, 4, 1, 2, 3, 4, 0, 3, 2, 2, 3, 4, 2, 3, 4, 0
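
To see which shifts in this example need an explicit check, the following sketch separates true matches from spurious hits (windows whose residue equals p mod q although the window differs from P). It assumes digit strings and, for brevity, recomputes each window value instead of rolling it; `classify_hits` is a name made up for this illustration.

```python
def classify_hits(pattern: str, text: str, q: int) -> tuple[list[int], list[int]]:
    """Return (true matches, spurious hits) as lists of 0-based shifts."""
    m = len(pattern)
    p_mod = int(pattern) % q
    matches, spurious = [], []
    for s in range(len(text) - m + 1):
        window = text[s:s + m]
        if int(window) % q == p_mod:
            # Residues agree, so the explicit comparison decides.
            if window == pattern:
                matches.append(s)
            else:
                spurious.append(s)
    return matches, spurious


print(classify_hits("234", "289462340372392345", 5))  # ([5, 14], [0, 1, 11])
```

With q = 5 three of the five candidate shifts are spurious, which illustrates why a small modulus leads to many wasted explicit comparisons.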

10 String Searching Algorithms Rabin-Karp Algorithm - generalization Instead of calculating numbers mod q, we can use an arbitrary hash function

11 String Searching Algorithms Rabin-Karp Algorithm - Complexity [Adapted from T.Ralphs]

12 String Searching Algorithms Rabin-Karp Algorithm - Complexity
Worst case: $T(n,m) = (n - m + 1) \cdot m \cdot \Theta(1) = \Theta(nm)$
Average case: number of correct matches - $v$, number of incorrect matches - $\approx n/q$
$T(n,m) = \Theta(n + m) + \Theta(m\,(v + n/q))$
If $v$ is small and $m \le q$, then $T(n,m) = \Theta(n + m)$

13 String Searching Algorithms Two dimensional pattern matching [Adapted from M.Crochemore,T.Lecroq]

14 String Searching Algorithms Two dimensional pattern matching [Adapted from M.Crochemore,T.Lecroq]

15 String Searching Algorithms Knuth-Morris-Pratt Algorithm - Idea [Adapted from A.Cawsey]

16 String Searching Algorithms Knuth-Morris-Pratt Algorithm - Some history [Adapted from K.Wayne]

17 String Searching Algorithms Knuth-Morris-Pratt Algorithm - Idea
T = gadji beri bimba glandridi, P = gadjama
gadjiberibimbaglandridi
gadjama
gadjiberibimbaglandridi
    gadjama

18 String Searching Algorithms Knuth-Morris-Pratt Algorithm - Idea
T = gadjama gramma berida, P = gaga
gagjamagrammaberida
gaga
gagjamagrammaberida
  gaga

19 String Searching Algorithms Knuth-Morris-Pratt Algorithm - Idea For each position q = 1, ..., m in P, compute the number of positions by which the pattern can be advanced if a mismatch has been detected at the q-th position.

20 String Searching Algorithms KMP - Algorithm
KnuthMorrisPrattMatcher(string P, string T)
    n ← length[T]
    m ← length[P]
    π ← PrefixFunction(P)
    q ← 0
    for i ← 1 to n do
        while q > 0 & P[q+1] ≠ T[i] do
            q ← π[q]
        if P[q+1] = T[i] then
            q ← q + 1
        if q = m then
            print i - m + 1
            q ← π[q]
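
As an illustration, KnuthMorrisPrattMatcher can be written in Python roughly as below. The helper `prefix_function` follows the PrefixFunction pseudocode of the deck; keeping an unused entry `pi[0]` so that indices match the 1-based slides, returning a list instead of printing, and returning nothing for an empty pattern are choices made for this sketch.

```python
def prefix_function(pattern: str) -> list[int]:
    """pi[q] for q = 1..m; pi[0] is an unused placeholder."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        # pattern[k] is P[k+1] and pattern[q - 1] is P[q] in 1-based notation.
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(pattern: str, text: str) -> list[int]:
    """Return the 1-based start positions i - m + 1 of all occurrences."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    pi = prefix_function(pattern)
    positions = []
    q = 0                                   # number of pattern characters matched so far
    for i in range(1, n + 1):
        while q > 0 and pattern[q] != text[i - 1]:
            q = pi[q]                       # fall back to a shorter matched prefix
        if pattern[q] == text[i - 1]:
            q += 1
        if q == m:
            positions.append(i - m + 1)
            q = pi[q]                       # continue, allowing overlapping occurrences
    return positions


print(kmp_matcher("234", "289462340372392345"))  # [6, 15]
```

The positions are 1-based, as in the pseudocode's `print i - m + 1`, so they are one larger than the shifts reported by the earlier matchers.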

21 String Searching Algorithms KMP - Prefix Function
$A \ll B$ - A is a prefix of B, e.g. $ab \ll abacae$
$A \gg B$ - A is a suffix of B, e.g. $ae \gg abacae$
$P_s$ - initial substring of P of length s
${}_sP$ - terminal substring of P of length s
Prefix function: $\pi : \{1, 2, \dots, m\} \to \{0, 1, 2, \dots, m-1\}$
$\pi[q] = \max\{k : k < q \text{ and } P_k \gg P_q\}$

22 String Searching Algorithms KMP - Prefix Function
$\pi[q] = \max\{k : k < q \text{ and } P_k \gg P_q\}$
$\pi[q]$ is the length of the longest prefix of P that is a proper suffix of $P_q$.
If the first q characters of the pattern have matched and a mismatch is detected at position q + 1, then the pattern can be advanced by $q - \pi[q]$ positions.

23 String Searching Algorithms KMP - Prefix Function - Example
$\pi[q] = \max\{k : k < q \text{ and } P_k \gg P_q\}$
$\pi[q]$ is the length of the longest prefix of P that is a proper suffix of $P_q$.
P = abracadabra
π = 0, 0, 0, 1, 0, 1, 0, 1, 2, 3, 4
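
The example values can be checked directly against the definition, without the linear-time algorithm. The sketch below is a deliberately naive computation (cubic time in the worst case), and `prefix_function_by_definition` is a name chosen only for this illustration.

```python
def prefix_function_by_definition(pattern: str) -> list[int]:
    """Return [pi[1], ..., pi[m]], each computed as max{k < q : P_k is a suffix of P_q}."""
    m = len(pattern)
    pi = []
    for q in range(1, m + 1):
        p_q = pattern[:q]
        # k = 0 always qualifies: the empty prefix is a suffix of every string.
        pi.append(max(k for k in range(q) if p_q.endswith(pattern[:k])))
    return pi


print(prefix_function_by_definition("abracadabra"))
# [0, 0, 0, 1, 0, 1, 0, 1, 2, 3, 4]
```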

24 String Searching Algorithms KMP - Prefix Function - Algorithm
PrefixFunction(string P)
    m ← length[P]
    π[1] ← 0
    k ← 0
    for q ← 2 to m do
        while k > 0 & P[k+1] ≠ P[q] do
            k ← π[k]
        if P[k+1] = P[q] then
            k ← k + 1
        π[q] ← k
    return π

25 String Searching Algorithms KMP - Complexity
In KnuthMorrisPrattMatcher(string P, string T), the for loop (for i ← 1 to n) executes Θ(n) times, and in the worst case the inner while loop (while q > 0 & P[q+1] ≠ T[i]) executes Θ(m) times.
Thus T(n,m) = O(nm)...

26 String Searching Algorithms KMP - Complexity
$T(n,m) = T_P(m) + n \cdot T_{While}(m) = O(nm)$?
The value of q is increased at most n times (at most once per iteration of the for loop), and always $q \ge 0$.
Thus q cannot be decreased more than n times, i.e. the body of the while loop (q ← π[q]) can be executed no more than n times in total.
$T(n,m) = T_P(m) + n \cdot T_{While}(m) = T_P(m) + \Theta(n)$

27 String Searching Algorithms KMP - Prefix Function - Correctness
$\pi[q] = \max\{k : k < q \text{ and } P_k \gg P_q\}$
$\pi[q]$ is the length of the longest prefix of P that is a proper suffix of $P_q$.
We define: $\pi^0[q] = q$, $\pi^{i+1}[q] = \pi[\pi^i[q]]$
$\pi^*[q] = \{q, \pi[q], \pi^2[q], \dots, \pi^t[q] = 0\}$

28 String Searching Algorithms KMP - Prefix Function - Correctness
$\pi^0[q] = q$, $\pi^{i+1}[q] = \pi[\pi^i[q]]$
$\pi^*[q] = \{q, \pi[q], \pi^2[q], \dots, \pi^t[q] = 0\}$
Lemma Let P be a pattern of length m with prefix function $\pi$. Then, for q = 1, 2, ..., m we have $\pi^*[q] = \{k : P_k \gg P_q\}$
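
The lemma can be spot-checked on the abracadabra example. In the sketch below the prefix-function values are copied from the example slide (with an unused entry at index 0 so that indices stay 1-based); the helper names `pi_star` and `border_lengths` are assumptions of this illustration.

```python
def pi_star(pi: list[int], q: int) -> set[int]:
    """{q, pi[q], pi[pi[q]], ..., 0}, obtained by iterating the prefix function."""
    chain = {q}
    while q > 0:
        q = pi[q]
        chain.add(q)
    return chain


def border_lengths(pattern: str, q: int) -> set[int]:
    """{k : P_k is a suffix of P_q}, computed directly from the definition."""
    p_q = pattern[:q]
    return {k for k in range(q + 1) if p_q.endswith(pattern[:k])}


PATTERN = "abracadabra"
# pi[1..11] for "abracadabra"; PI[0] is an unused placeholder.
PI = [0, 0, 0, 0, 1, 0, 1, 0, 1, 2, 3, 4]

lemma_holds = all(
    pi_star(PI, q) == border_lengths(PATTERN, q)
    for q in range(1, len(PATTERN) + 1)
)
print(lemma_holds)  # True
```

For q = 11, for instance, iterating the prefix function gives 11, 4, 1, 0, which are exactly the lengths of the prefixes abracadabra, abra, a and the empty string.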

29 String Searching Algorithms KMP - Prefix Function - Correctness
Lemma Let P be a pattern of length m with prefix function $\pi$. For q = 1, 2, ..., m, if $\pi[q] > 0$, then $\pi[q] - 1 \in \pi^*[q-1]$.

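30 String Searching Algorithms KMP - Prefix Function - Correctness
Corollary Let P be a pattern of length m with prefix function $\pi$. For q = 2, ..., m we define $E_{q-1} \subseteq \pi^*[q-1]$ by
$E_{q-1} = \{k : k \in \pi^*[q-1],\ k < q - 1 \text{ and } P[k+1] = P[q]\}$
(the condition $k < q - 1$ is needed because $\pi^*[q-1]$, as defined here, contains $q - 1$ itself). Then
$\pi[q] = \begin{cases} 0, & \text{if } E_{q-1} = \emptyset \\ 1 + \max\{k \in E_{q-1}\}, & \text{if } E_{q-1} \ne \emptyset \end{cases}$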

31 String Searching Algorithms KMP - Prefix Function - Correctness
We consecutively compute $\pi[1], \pi[2], \dots, \pi[m]$.
$\pi[1] = 0$
For k > 1:
if $P[k] = P[\pi[k-1] + 1]$, then $\pi[k] = \pi[k-1] + 1$,
else, if $P[k] = P[\pi[\pi[k-1]] + 1]$, then $\pi[k] = \pi[\pi[k-1]] + 1$,
else, if $P[k] = P[\pi[\pi[\pi[k-1]]] + 1]$, then $\pi[k] = \pi[\pi[\pi[k-1]]] + 1$,
...

32 If $P[k+1] = P[\pi[k]+1]$, then we can obtain $\pi[k+1] \ge \pi[k] + 1$. For a given value $\pi[k+1]$ it is easy to see that $\pi[k] \ge \pi[k+1] - 1$. Thus:
$\pi[k+1] = \pi[k] + 1$ iff $P[k+1] = P[\pi[k]+1]$.
$\pi[k+1] < \pi[k] + 1$ iff $P[k+1] \ne P[\pi[k]+1]$.
[Figure: prefixes of lengths $\pi[\pi[k]]$ and $\pi[k]$, with the compared characters $P[k+1]$ and $P[\pi[k]+1]$]

33 If $P[k+1] \ne P[\pi[k]+1]$ and $P[k+1] = P[\pi[\pi[k]]+1]$, then we can obtain $\pi[k+1] \ge \pi[\pi[k]] + 1$. For a given value $\pi[k+1]$ it is easy to see that $\pi[\pi[k]] \ge \pi[k+1] - 1$. Thus:
$\pi[k+1] = \pi[\pi[k]] + 1$ iff $P[k+1] \ne P[\pi[k]+1]$ & $P[k+1] = P[\pi[\pi[k]]+1]$
$\pi[k+1] < \pi[\pi[k]] + 1$ iff $P[k+1] \ne P[\pi[k]+1]$ & $P[k+1] \ne P[\pi[\pi[k]]+1]$
...
[Figure: prefixes of lengths $\pi[\pi[k]]$ and $\pi[k]$, with the compared characters $P[k+1]$ and $P[\pi[\pi[k]]+1]$]

34 String Searching Algorithms KMP - Prefix Function - Complexity
$T_P(m) = \text{const} + m \cdot T_{While}(m) = O(m^2)$?
The value of k is increased at most m times (at most once per iteration of the for loop in PrefixFunction), and always $k \ge 0$.
Thus k cannot be decreased more than m times, i.e. the body of the while loop (k ← π[k]) can be executed no more than m times in total.
$T_P(m) = \text{const} + m \cdot T_{While}(m) = \Theta(m)$
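
The amortized argument can be observed empirically by counting how often the body of the while loop runs. The following sketch instruments a Python version of PrefixFunction; the counter `while_steps` and the function name are additions made for this illustration only.

```python
def prefix_function_with_count(pattern: str) -> tuple[list[int], int]:
    """Return (pi, while_steps); pi[0] is an unused placeholder."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0
    while_steps = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]            # every execution strictly decreases k
            while_steps += 1
        if pattern[k] == pattern[q - 1]:
            k += 1               # k grows by at most 1 per value of q
        pi[q] = k
    return pi, while_steps


for p in ["abracadabra", "aaaaab", "abababac"]:
    _, steps = prefix_function_with_count(p)
    print(p, len(p), steps)      # steps never exceeds len(p)
```

Because k increases by at most one per iteration of the for loop and never drops below zero, the total number of decrements is bounded by the total number of increments.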

35 String Searching Algorithms KMP - Complexity
$T(n,m) = T_P(m) + n \cdot T_{While}(m) = T_P(m) + \Theta(n) = \Theta(m) + \Theta(n) = \Theta(m + n)$

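36 String Searching Algorithms Boyer-Moore Algorithm - Idea 1
T = gadji beri bimba glandridi, P = lonni
gadjiberibimbaglandridi
lonni
gadjiberibimbaglandridi
    lonni
Bad character heuristic (comparing right to left, i matches, then j in T is compared with n in P; j does not occur in P, so P is moved past it)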

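37 String Searching Algorithms Boyer-Moore Algorithm - Idea 2
T = gadji beri bimba glandridi, P = ajiji
gadjiberibimbaglandridi
ajiji
gadjiberibimbaglandridi
  ajiji
Good suffix heuristic (the suffix ji has matched; P is moved so that its earlier occurrence of ji lines up with the matched text)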

38 String Searching Algorithms Boyer-Moore - Bad Character Function
Bad character function: $\lambda : \Sigma \to \{0, 1, 2, \dots, m\}$
$\lambda[s] = \max\{k : P[k] = s\}$ (if such k exists)
$\lambda[s] = 0$ (otherwise)
[Adapted from M.Goodrich, R.Tamassia]

39 String Searching Algorithms Boyer-Moore - Bad Character Function
BadCharacterFunction(string P, set Σ)
    m ← length[P]
    for a ∈ Σ do
        λ[a] ← 0
    for j ← 1 to m do
        λ[P[j]] ← j
    return λ
$T_B(m, |\Sigma|) = \Theta(m + |\Sigma|)$
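
In Python the bad character table is naturally a dictionary. The sketch below mirrors BadCharacterFunction; passing the alphabet as a string and naming the table `lam` are assumptions made for this illustration.

```python
def bad_character_function(pattern: str, alphabet: str) -> dict[str, int]:
    """lam[a] = largest 1-based index j with P[j] = a, or 0 if a does not occur in P."""
    lam = {a: 0 for a in alphabet}
    for j, ch in enumerate(pattern, start=1):
        lam[ch] = j              # later occurrences overwrite earlier ones
    return lam


print(bad_character_function("ajiji", "adgij"))
# {'a': 1, 'd': 0, 'g': 0, 'i': 5, 'j': 4}
```

Initialising every symbol costs time proportional to the alphabet size and the second loop costs time proportional to m, matching the stated bound.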

40 String Searching Algorithms Boyer-Moore - Suffix Function
Suffix function: $\gamma : \{1, 2, \dots, m\} \to \{1, 2, \dots, m\}$
$\gamma[j] = m - \max\{k : k < m \text{ and } ({}_jP \gg P_k \text{ or } P_k \gg {}_jP)\}$
[Adapted from R.Lee, C.Lu]

41 String Searching Algorithms Boyer-Moore - Suffix Function [Adapted from R.Lee, C.Lu]

42 String Searching Algorithms Boyer-Moore - Suffix Function [Adapted from R.Lee, C.Lu]

43 String Searching Algorithms Boyer-Moore - Suffix Function [Adapted from R.Lee, C.Lu]

44 String Searching Algorithms Boyer-Moore - Suffix Function
SuffixFunction(string P)
    m ← length[P]
    π ← PrefixFunction(P)
    P' ← Reverse(P); π' ← PrefixFunction(P')
    for j ← 0 to m do
        γ[j] ← m - π[m]
    for l ← 1 to m do
        j ← m - π'[l]
        if γ[j] > l - π'[l] then
            γ[j] ← l - π'[l]
    return γ
$T_S(m) = \Theta(m)$

45 String Searching Algorithms Boyer-Moore Algorithm - Algorithm
BoyerMooreMatcher(string P, string T, set Σ)
    n ← length[T]
    m ← length[P]
    λ ← LastOccurenceFunction(P, m, Σ)
    γ ← GoodSuffixFunction(P, m)
    s ← 0
    while s ≤ n - m do
        j ← m
        while j > 0 & P[j] = T[s + j] do
            j ← j - 1
        if j = 0 then
            print s
            s ← s + γ[0]
        else
            s ← s + max(γ[j], j - λ[T[s + j]])
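
Putting the pieces together, a Python sketch of BoyerMooreMatcher might look as follows. It is an illustration under several assumptions: the last-occurrence table is a dictionary built from the pattern alone (absent characters default to 0, so no explicit alphabet is passed), the good suffix table follows the SuffixFunction pseudocode of the deck, shifts are returned as a list, and an empty pattern yields no matches.

```python
def prefix_function(pattern: str) -> list[int]:
    """pi[q] for q = 1..m; pi[0] is an unused placeholder."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    return pi


def good_suffix_function(pattern: str) -> list[int]:
    """gamma[j] for j = 0..m, computed from the prefix functions of P and reversed P."""
    m = len(pattern)
    pi = prefix_function(pattern)
    pi_rev = prefix_function(pattern[::-1])
    gamma = [m - pi[m]] * (m + 1)
    for l in range(1, m + 1):
        j = m - pi_rev[l]
        gamma[j] = min(gamma[j], l - pi_rev[l])
    return gamma


def boyer_moore_matcher(pattern: str, text: str) -> list[int]:
    """Return all 0-based shifts s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    lam = {ch: j for j, ch in enumerate(pattern, start=1)}   # last occurrence, 1-based
    gamma = good_suffix_function(pattern)
    shifts = []
    s = 0
    while s <= n - m:
        j = m
        # Compare right to left; pattern[j - 1] is P[j], text[s + j - 1] is T[s + j].
        while j > 0 and pattern[j - 1] == text[s + j - 1]:
            j -= 1
        if j == 0:
            shifts.append(s)
            s += gamma[0]
        else:
            s += max(gamma[j], j - lam.get(text[s + j - 1], 0))
    return shifts


print(boyer_moore_matcher("234", "289462340372392345"))  # [5, 14]
```

Every entry of `gamma` is at least 1, so the shift always advances even when the bad character term `j - lam[...]` is zero or negative.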

46 String Searching Algorithms Boyer-Moore Algorithm - Complexity
$T_B(m, |\Sigma|) = \Theta(m + |\Sigma|)$
$T_S(m) = \Theta(m)$
$T(n, m, |\Sigma|) = T_B(m, |\Sigma|) + T_S(m) + n \cdot T_{While}(m) = \Theta(m + |\Sigma|) + \Theta(m) + O(nm) = O(|\Sigma| + nm)$?
It can be shown that $T(n, m, |\Sigma|) = \Theta(|\Sigma| + n + m)$

47 String Searching Algorithms Boyer-Moore Algorithm - Complexity
It can be shown that:
$T(n, m, |\Sigma|) = \Theta(|\Sigma| + nm)$ using only the bad character rule
$T(n, m, |\Sigma|) = \Theta(|\Sigma| + n + m)$ using only the good suffix rule, if the pattern does not occur in the text
$T(n, m, |\Sigma|) = \Theta(|\Sigma| + nm)$ using only the good suffix rule, if the pattern does occur in the text

48 String Searching Algorithms Boyer-Moore Algorithm - Complexity
With Galil's modification: $T(n, m, |\Sigma|) = \Theta(|\Sigma| + n + m)$ using only the good suffix rule
There is also a similar Apostolico-Giancarlo algorithm that achieves the $\Theta(|\Sigma| + n + m)$ time bound (which is much easier to prove)
On average the number of character comparisons is n/m (for large $|\Sigma|$)

49 String Searching Algorithms Algorithms - Complexity comparison [Adapted from H.Løvengreen]

50 String Searching Algorithms Algorithms - Efficiency comparison n=5000 [Adapted from I.Spence]

51 String Searching Algorithms Complexity - Lower Bound Theorem (Rivest) Any string searching algorithm has worst-case time complexity $T(n,m) = \Omega(m + n)$

52 String Searching Algorithms Suffix Trees - The problem Theorem (Rivest) Any string searching algorithm has worst-case time complexity $T(n,m) = \Omega(m + n)$ Despite this, we probably can do better! (Well, for a slightly different problem...) [Adapted from P.Kilpeläinen]

53 String Searching Algorithms Suffix Trees [Adapted from P.Kilpeläinen]

54 String Searching Algorithms Suffix Trees [Adapted from P.Kilpeläinen]

55 String Searching Algorithms Suffix Trees - Example [Adapted from P.Kilpeläinen]

56 String Searching Algorithms Suffix Trees - Do they always exist? [Adapted from P.Kilpeläinen]

57 String Searching Algorithms Suffix Trees - Application to string matching [Adapted from P.Kilpeläinen]

58 String Searching Algorithms Suffix Trees - Construction [Adapted from P.Kilpeläinen]

59 String Searching Algorithms Suffix Trees - Construction [Adapted from P.Kilpeläinen]

60 String Searching Algorithms Suffix Trees - Construction - Example [Adapted from P.Kilpeläinen]

61 String Searching Algorithms Suffix Trees - Construction - Example [Adapted from P.Kilpeläinen]

62 String Searching Algorithms Suffix Trees - Construction - Example [Adapted from P.Kilpeläinen]

63 String Searching Algorithms Suffix Trees - Construction - Complexity [Adapted from P.Kilpeläinen]

64 String Searching Algorithms Suffix Trees - Compact representation [Adapted from P.Kilpeläinen]

65 String Searching Algorithms Suffix Trees - Compact representation - Example [Adapted from P.Kilpeläinen]

66 String Searching Algorithms Suffix Trees - Some history [Adapted from P.Kilpeläinen]

67 String Searching Algorithms Suffix Trees - Ukkonen's algorithm [Adapted from P.Kilpeläinen]

68 String Searching Algorithms Suffix Trees - Implicit trees [Adapted from P.Kilpeläinen]

69 String Searching Algorithms Suffix Trees - Implicit trees [Adapted from P.Kilpeläinen]

70 String Searching Algorithms Suffix Trees - Implicit trees [Adapted from P.Kilpeläinen]

71 String Searching Algorithms Suffix Trees - String paths [Adapted from P.Kilpeläinen]

72 String Searching Algorithms Suffix Trees - Ukkonen's algorithm [Adapted from P.Kilpeläinen]

73 String Searching Algorithms Suffix Trees - Extensions [Adapted from P.Kilpeläinen]

74 String Searching Algorithms Suffix Trees - Extensions [Adapted from P.Kilpeläinen]

75 String Searching Algorithms Suffix Trees - Extensions - Example [Adapted from P.Kilpeläinen]

76 String Searching Algorithms Suffix Trees - Ukkonen's algorithm - Complexity [Adapted from P.Kilpeläinen]

77 String Searching Algorithms Suffix Trees - Ukkonen's algorithm - Complexity [Adapted from P.Kilpeläinen]

78 String Searching Algorithms Suffix Trees - Ukkonen's algorithm - Complexity [Adapted from P.Kilpeläinen]

79 String Searching Algorithms Suffix Trees - Suffix links [Adapted from P.Kilpeläinen]

80 String Searching Algorithms Suffix Trees - Suffix links [Adapted from P.Kilpeläinen]

81 String Searching Algorithms Suffix Trees - Suffix links [Adapted from P.Kilpeläinen]

82 String Searching Algorithms Suffix Trees - Speeding up [Adapted from P.Kilpeläinen]

83 String Searching Algorithms Suffix Trees - Speeding up [Adapted from P.Kilpeläinen]

84 String Searching Algorithms Suffix Trees - Speeding up [Adapted from P.Kilpeläinen]

85 String Searching Algorithms Suffix Trees - Speeding up [Adapted from P.Kilpeläinen]

86 String Searching Algorithms Suffix Trees - Speeding up [Adapted from P.Kilpeläinen]

87 String Searching Algorithms Suffix Trees - Speeding up [Adapted from P.Kilpeläinen]

88 String Searching Algorithms Suffix Trees - Speeding up [Adapted from P.Kilpeläinen]

89 String Searching Algorithms Suffix Trees - Speeding up [Adapted from P.Kilpeläinen]

90 String Searching Algorithms Suffix Trees - Speeding up [Adapted from P.Kilpeläinen]

91 String Searching Algorithms Suffix Trees - Eliminating extensions [Adapted from P.Kilpeläinen]

92 String Searching Algorithms Suffix Trees - Single phase algorithm [Adapted from P.Kilpeläinen]

93 String Searching Algorithms Suffix Trees - Ukkonen's algorithm - Complexity [Adapted from P.Kilpeläinen]

94 String Searching Algorithms Suffix Trees - Ukkonen's algorithm - Complexity [Adapted from P.Kilpeläinen]

95 String Searching Algorithms Suffix Trees - Ukkonen's algorithm - Complexity [Adapted from P.Kilpeläinen]

