Algorithms and Data Structures. /course/eleg67701-f/Topic-1b. Outline: Data Structures; Space Complexity; Case Study: string matching (array implementation).


1 Algorithms and Data Structures

2 Outline
• Data Structures
• Space Complexity
• Case Study: string matching: array implementation (e.g. KMP algorithm); tree implementation (e.g. suffix tree)

3 Algorithm in action: data structure transformation. [Diagram: input data structure → algorithm (with an intermediate data structure) → output data structure]

4 Basic Data Structures
• Scalar or “atomic” data structures: building blocks for other data structures; cannot be divided into sub-elements; integer, floating-point, character, and access (pointer) types
• Composite data structures: arrays, records
• Data abstraction: an abstract data type is a collection of data values together with a set of well-specified operations on that data, e.g. list, stack, queue, tree

5 Scalar Data Structure. [Figure: conceptual view of a variable (variable name var1 and its value) next to its physical layout in computer memory, memory addresses 0238 to 0245.] Assignment operation: var1 ← value; var2 ← var1; var1 ← var3;

6 Composite Data Structure: Array. [Figure: conceptual view of array A[0..4] holding values v1 to v5 at indices 0 1 2 3 4, next to its physical layout in computer memory, memory addresses 0238 to 0245, with unused cells nil.] Accessing array elements: A[0] ← 5; k ← 1; A[k] ← 11; A[k+1] ← A[k] + 3

7 Data Abstraction: Tree. [Figure: conceptual view of a tree T with node values v1 to v4, next to its physical layout in computer memory, memory addresses 0238 to 0245, where nodes are linked through stored addresses (e.g. 0241, 0244) or nil.] Accessing the elements: T.value ← 12; T.left ← new(T); T.right ← new(T)
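
The access operations on this slide can be sketched in Python as follows. The `Node` class is an assumption made for illustration; only the field names value, left and right come from the slide. Python references play the role of the stored memory addresses, and None plays the role of nil.

```python
class Node:
    """One tree node: a value plus references to two children (None stands for nil)."""

    def __init__(self, value=None):
        self.value = value
        self.left = None
        self.right = None


# The three operations from the slide, written against the hypothetical Node class.
T = Node()
T.value = 12        # T.value <- 12
T.left = Node()     # T.left  <- new(T)
T.right = Node()    # T.right <- new(T)
```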

8 Space Analysis
• Storage space, like time, is another limited resource that is important to programmers
• Space requirements are also expressed as a function of the input size
• Space functions are classified in the same manner as running times

9 Complexity Analysis: Sorting. Time complexity: Insertionsort O(n^2); Quicksort O(n log n) on average. Space complexity: O(n)

10 Space-Time Tradeoff
• Reductions in running time are often possible if we increase storage requirements
• Decreasing the amount of storage used by an algorithm usually results in longer running times
Using an array to look up previously computed values can drastically increase the speed of a function
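
A minimal Python sketch of the lookup-array idea. The slide does not name a specific function, so Fibonacci numbers are used here purely as an assumed example: the first version stores nothing and recomputes subproblems, the second spends O(n) extra space on an array of previously computed values and runs in O(n) time.

```python
def fib_slow(n):
    """No lookup table: the same subproblems are recomputed, so time grows exponentially."""
    return n if n < 2 else fib_slow(n - 1) + fib_slow(n - 2)


def fib_table(n):
    """Lookup array of previously computed values: O(n) extra space, O(n) time."""
    table = [0, 1] + [0] * max(0, n - 1)
    for k in range(2, n + 1):
        table[k] = table[k - 1] + table[k - 2]
    return table[n]
```

Both functions return the same values; only the balance between storage and running time differs.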

11 Case Study: Searching for Patterns. Problem: find the first occurrence of pattern P of length m inside the text S of length n.
• String matching problem

12 String Matching - Applications
• Text editing
• Term rewriting
• Lexical analysis
• Information retrieval
• And bioinformatics

13 Model for Pattern-Matching Problem. [Diagram: pattern P → pattern matcher generator → pattern matcher; input string S → pattern matcher → Yes / No]

14 Array Implementation. Text S represented as an array of characters: S[1..n]. Pattern P represented as an array of characters: P[1..m]. [Figure: text S = agcagaagagta with pattern P = gaggagagag tried at successive alignments.] Time complexity = O(m·n). Space complexity = O(m + n)
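
The alignment-by-alignment search pictured on this slide can be sketched as below. This is a minimal illustration, not code from the slides: the function name `naive_find` is an assumption, and Python strings are 0-indexed whereas the slide uses S[1..n] and P[1..m]. Each of the up to n - m + 1 alignments may compare up to m characters, which gives the O(m·n) bound.

```python
def naive_find(s, p):
    """Return the 0-based index of the first occurrence of p in s, or -1 if there is none."""
    n, m = len(s), len(p)
    for start in range(n - m + 1):          # try every alignment of p against s
        k = 0
        while k < m and s[start + k] == p[k]:
            k += 1                          # extend the match character by character
        if k == m:                          # all m characters matched
            return start
    return -1


# With the strings shown on the slide, the pattern does not occur in the text.
print(naive_find("agcagaagagta", "gaggagagag"))   # -1
print(naive_find("agcagaagagta", "agag"))         # 6
```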

15 Can we be more clever?
• When a mismatch is detected, say at position k in the pattern string, we have already successfully matched k-1 characters.
• We try to take advantage of this to decide where to restart matching
[Figure: text S = agcagaagagta with pattern P = gaggagagag restarted at selected alignments only]

16 Problem of Matching Keyword. PROBLEM. Given a pattern p consisting of a single keyword and an input string s, answer “yes” if p occurs as a substring of s, that is, if s = xpy for some x and y; “no” otherwise. For convenience, we will assume p = p_1 p_2 … p_m and s = s_1 s_2 … s_n, where p_i represents the ith character of the pattern and s_j the jth character of the input string.

17 The Knuth-Morris-Pratt Algorithm. Observation: when a mismatch occurs, we may not need to restart the comparison all the way back (from the next input position). What to do: construct a table h, called the next function, that determines how many characters to slide the pattern to the right in case of a mismatch during the pattern-matching process. Knuth, D. E., Morris, J. H. and Pratt, V. R., Fast Pattern Matching in Strings, SIAM J. Comput., 6(2), 1977, 323-350

18 The key idea is that if we have successfully matched the prefix p_1 p_2 … p_{i-1} of the keyword with the substring s_{j-i+1} s_{j-i+2} … s_{j-1} of the input string and p_i ≠ s_j, then we do not need to reprocess any of s_{j-i+1} s_{j-i+2} … s_{j-1}, since we know this portion of the text string is the prefix of the keyword that we have just matched.

19 Note that the inner while loop iterates as long as p_i and s_j do not match each other. Once they match, the inner while loop terminates, both i and j advance by one, and the inner loop is entered again at the new positions.
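
The code this note refers to is not part of the transcript. The sketch below is one common way to write the matching loop it describes, with 1-based indices i and j as in the slides; the function name `kmp_match` and the extra `i > 0` guard on the inner loop are assumptions of this sketch. The example pattern and text are also assumed: they were chosen because the pattern is consistent with the next-function values 0 1 0 2 1 0 4 0 2 1 0 7 1 listed in the example slides.

```python
def kmp_match(p, s, h):
    """Return True if p occurs in s. h is the next function, 1-indexed (h[0] is unused)."""
    m, n = len(p), len(s)
    i = j = 1
    while i <= m and j <= n:
        # Inner loop: runs while p_i and s_j differ; only the pattern index i moves.
        while i > 0 and p[i - 1] != s[j - 1]:
            i = h[i]
        # Here either p_i == s_j or i == 0: advance both indices by one.
        i += 1
        j += 1
    return i > m


# Assumed example data (not recoverable from the transcript itself).
P_EXAMPLE = "abaababaabaab"
H_EXAMPLE = [0, 0, 1, 0, 2, 1, 0, 4, 0, 2, 1, 0, 7, 1]   # H_EXAMPLE[0] is a placeholder
print(kmp_match(P_EXAMPLE, "abaababaabacabaababaabaab", H_EXAMPLE))   # True
```

Note that j is never decreased, which is the sense in which the text is never backtracked.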

20 An Important Property of the Next Function in KMP Algorithm. h_i is the largest k less than i such that p_1 p_2 … p_{k-1} is a suffix of p_1 p_2 … p_{i-1} (i.e., p_1 … p_{k-1} = p_{i-k+1} … p_{i-1}) and p_i ≠ p_k. If there is no such k, then h_i = 0
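
This property can be checked directly, if slowly, by trying every candidate k from largest to smallest. The sketch below is a brute-force reading of the definition, not the efficient construction; the function name `next_by_definition` and the example pattern are assumptions for illustration.

```python
def next_by_definition(p):
    """h[i] = largest k < i with p_1..p_{k-1} a suffix of p_1..p_{i-1} and p_i != p_k, else 0."""
    m = len(p)
    h = [0] * (m + 1)                      # 1-indexed; h[0] is unused
    for i in range(1, m + 1):
        for k in range(i - 1, 0, -1):      # largest candidate first
            prefix = p[:k - 1]             # p_1 .. p_{k-1}
            suffix = p[i - k:i - 1]        # p_{i-k+1} .. p_{i-1}
            if prefix == suffix and p[i - 1] != p[k - 1]:
                h[i] = k
                break
    return h


print(next_by_definition("abaababaa")[1:])   # [0, 1, 0, 2, 1, 0, 4, 0, 2]
```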

21 Backtrack or Not Backtrack? Assume that for some i and j, P(i) ≠ S(j); what should we do?
• The KMP algorithm chooses not to backtrack on the text S (i.e. j), for a good reason
• The choice is how to shift the pattern P (i.e. i), that is, by how much
• If for each j the shift of P takes a small constant amount of work, then the total time complexity is clearly linear in n

22 An Example. Given: [pattern and input string shown as a figure]. Next function: 0 1 0 2 1 0 4 0 2 1 0 7 1. Scenario 1: i = 12, j = 12. What is h_i = h_12? h_12 = 7. Scenario 2: h_12 = 7, so i = 7.

23 An Example (cont'd). Scenario 3: h_7 = 4, so i = 4. Subsequently i = 2, 1, 0. Finally, a match is found.
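
The sequence of pattern positions in these scenarios can be reproduced by following the next function repeatedly from i = 12. The sketch assumes, as in the example, that the text character at position j mismatches at every step, so i falls all the way to 0; the helper name `fallback_chain` is hypothetical.

```python
# Next function from the example slide, stored 1-indexed (h[0] is a placeholder).
h = [0, 0, 1, 0, 2, 1, 0, 4, 0, 2, 1, 0, 7, 1]


def fallback_chain(h, i):
    """Follow i := h[i] until i reaches 0 and return the values visited."""
    chain = []
    while i > 0:
        i = h[i]
        chain.append(i)
    return chain


print(fallback_chain(h, 12))   # [7, 4, 2, 1, 0]
```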

24 Question: when P(i) ≠ S(j), how much should we shift? Observations:
• We should shift P to the right
• But by how much?
• One answer is: do not backtrack on S(j)
[Figure: pattern P and input S, both starting at i = 1, j = 1, with P_i aligned against S_j]

25 Observation: Never backtrack on the input string S.

26 How to Compute the Next Function? [Code figure; recoverable assignments: h_i := h_j; h_i := j; j := h_j]

27 How to Compute the Next Function? [Code figure; recoverable assignments: h_i := h_j; h_i := j; j := h_j] Note: once p_i does not match p_j, we know that j is the index being sought: the prefix p_1 … p_{j-1} matches the suffix that ends just before position i, so h_i := j.
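
The full code for this slide is not in the transcript; only the three assignments survive. The sketch below is one standard formulation that uses exactly those assignments, with 1-based i and j. The function name `compute_next` and the loop structure around the assignments are assumptions of this sketch.

```python
def compute_next(p):
    """Compute the next function h for pattern p, 1-indexed (h[0] is unused, h[1] = 0)."""
    m = len(p)
    h = [0] * (m + 1)
    i, j = 1, 0
    while i < m:
        # Fall back while p_i and p_j differ:  j := h_j
        while j > 0 and p[i - 1] != p[j - 1]:
            j = h[j]
        i += 1
        j += 1
        if p[i - 1] == p[j - 1]:
            h[i] = h[j]        # h_i := h_j  (a mismatch at i would also mismatch at j)
        else:
            h[i] = j           # h_i := j
    return h


print(compute_next("abaababaa")[1:])   # [0, 1, 0, 2, 1, 0, 4, 0, 2]
```

For the nine-character pattern abaababaa this reproduces the values 0 1 0 2 1 0 4 0 2 shown on the interpretation slides.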

28 Interpretation of the Next Function
• Interpretation
• Question: how to compute the next function?
[Figure rows as extracted: pattern aababaaba (shown twice); positions 9 8 7 6 5 4 3 2 1; next function 0 1 0 2 1 0 4 0 2] Note: P_2 = P_5, P_4 = P_9

29 Interpretation of the Next Function
• Interpretation
• Question: how to compute the next function?
[Figure rows: positions 1 2 3 4 5 6 7 8 9; pattern a b a a b a b a a (shown twice); next function 0 1 0 2 1 0 4 0 2] Note: P_1 ≠ P_5, P_4 = P_9

30 Interpretation of the Next Function
• Interpretation
• Question: how to compute the next function?
[Figure rows: positions 1 2 3 4 5 6 7 8 9; pattern a b a a b a b a a (shown twice); next function 0 1 0 2 1 0 4 0 2] Note: P_1 ≠ P_5, P_4 = P_9

31 KMP - Analysis
• The KMP algorithm never needs to backtrack on the text string.
Time complexity = O(m + n): O(m) preprocessing plus O(n) searching. Space complexity = O(m + n)

32 KMP Algorithm Complexity Analysis Hints
• What is the cost of building the next function? (Hint: in the code for the next function, the operation j := h_j in the inner loop is never executed more often than the statement i := i+1 in the outer loop.)
• What is the cost of the matching itself? (Hint: similar to the above.)
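
One way to see the first hint is to count both operations while the next function is built. In each outer iteration j grows by exactly one, each inner step strictly decreases j, and j never drops below 0, so the inner operation cannot run more often than the outer one. The sketch below instruments an assumed formulation of the construction (the transcript does not contain the original code); `count_next_steps` is a hypothetical name.

```python
def count_next_steps(p):
    """Build the next function and count inner fallbacks (j := h_j) and outer steps (i := i+1)."""
    m = len(p)
    h = [0] * (m + 1)          # 1-indexed; h[0] is unused
    i, j = 1, 0
    fallbacks = increments = 0
    while i < m:
        while j > 0 and p[i - 1] != p[j - 1]:
            j = h[j]
            fallbacks += 1     # one execution of j := h_j
        i += 1
        j += 1
        increments += 1        # one execution of i := i+1
        h[i] = h[j] if p[i - 1] == p[j - 1] else j
    return fallbacks, increments


for pattern in ["abaababaa", "aaaaaa", "abcabcabd"]:
    fallbacks, increments = count_next_steps(pattern)
    print(pattern, fallbacks, increments, fallbacks <= increments)
```

Since the outer statement runs m - 1 times, the whole construction takes O(m) steps; the same counting argument applied to the matching loop gives O(n).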

33 Other String Matching Algorithms
• The Boyer-Moore Algorithm [Boyer, R. S. and Moore, J S., A Fast String Searching Algorithm, CACM, 20(10), 1977, 762-772]
• The Karp-Rabin Algorithm [Karp, R. M. and Rabin, M. O., Efficient Randomized Pattern-Matching Algorithms, IBM J. of Res. and Develop., 31(2), 1987, 249-260]

34 Matching of a Set of Keywords?
• Given a pattern consisting of a set of keywords and an input string S, answer “yes” if some keyword occurs as a substring of S, and “no” otherwise.
• How to solve this?

35 How about repeatedly applying KMP? What time complexity will the KMP algorithm have when matching k patterns?
• Preprocessing each of the k patterns: assuming each pattern has length O(m), this takes O(km) time
• Searching takes O(n) time per pattern, so O(kn) time in total
• So, total time = O(k(m + n))

36 Question: Can we improve the time complexity when k is large? Answer: Yes, by preprocessing the input string (tree implementation).

37 Model for Pattern-Matching Problem. [Diagram: pattern P → pattern matcher generator → pattern matcher; input string S → pattern matcher → Yes / No; with an added “Preprocessing” box]

38 Tree Implementation: suffix tree
• Instead of preprocessing the pattern (P), preprocess the text T!
• Use a tree structure in which all suffixes of the text are represented
• Search for the pattern by looking for substrings of the text
• You can easily test whether P is a substring of T because any substring of T is the prefix of some suffix

39 Suffix Tree. A suffix tree T for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m. Each internal node, other than the root, has at least two children, and each edge is labeled with a nonempty substring of S. No two edges out of a node can have edge-labels beginning with the same character. The key feature of the suffix tree is that for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i. That is, it spells out S[i…m]. [Figure: suffix tree for the string xabxac. The node labels u and w on the two interior nodes will be used.]

40 Note on Suffix Tree
• Not every string is guaranteed to have a corresponding suffix tree
• For example, xabxa does not have a suffix tree, because xa is both a prefix and a suffix of the string (i.e. the path for the suffix xa does not end at a leaf)
• How to fix the problem: add $, a special “termination” character, to the alphabet and append it to the string
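
The problem and its fix can be checked by listing the suffixes that are also a prefix of some other suffix, since those are the suffixes whose paths cannot end at a leaf. This is a small illustrative check rather than part of any suffix tree construction; the function name is hypothetical.

```python
def suffixes_hidden_in_other_suffixes(s):
    """Return the suffixes of s that are a prefix of a different suffix of s."""
    suffixes = [s[i:] for i in range(len(s))]
    return [x for x in suffixes
            if any(y != x and y.startswith(x) for y in suffixes)]


print(suffixes_hidden_in_other_suffixes("xabxa"))    # ['xa', 'a']
print(suffixes_hidden_in_other_suffixes("xabxa$"))   # []
```

Once $ is appended, no suffix is a prefix of another, so every suffix can end at its own leaf.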

41 Algorithm for Constructing a Suffix Tree
• A suffix tree can be constructed in linear time [Weiner73, McCreight76, Ukkonen95]

42 Suffix Tree. Time complexity = O(n + m): O(n) preprocessing plus O(m) searching. Space complexity = O(m + n)

43 Question
• How can a suffix tree be used to help solve the string matching problem?
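
One possible answer, sketched under simplifying assumptions: store every suffix of the text in a tree, then walk down from the root along the pattern. The code below builds a naive, uncompressed suffix trie with nested dictionaries, which takes O(n^2) time and space; it is not one of the linear-time suffix tree constructions cited in the slides, and the function names are hypothetical. The query itself takes O(m) steps, which is the point of preprocessing the text.

```python
def build_suffix_trie(text):
    """Naive suffix trie as nested dicts: insert every suffix character by character."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})   # follow or create the child for ch
    return root


def occurs(trie, pattern):
    """Return True if pattern is a substring of the indexed text (O(m) dictionary steps)."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("xabxac$")
print(occurs(trie, "bxa"))   # True
print(occurs(trie, "abc"))   # False
```

Because the text is preprocessed once, each of k patterns can then be tested in time proportional to its own length.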

44 Other Tree-based Methods
• The suffix tree is not the only one.

