1
Algorithms and Data Structures
2
/course/eleg67701-f/Topic-1b
Outline
Data Structures
Space Complexity
Case Study: string matching
  Array implementation (e.g. KMP algorithm)
  Tree implementation (e.g. suffix tree)
3
Algorithm in action: data structure transformation
[Diagram: Input data structure -> Algorithm (using an intermediate data structure) -> Output data structure]
4
Basic Data Structures
Scalar or “atomic” data structures
  Building blocks for other data structures
  Cannot be divided into sub-elements
  Integer, floating-point, character, access (pointer) types
Composite data structures
  Arrays, records
Data abstraction
  Abstract Data Types: a collection of data values together with a set of well-specified operations on that data, e.g. list, stack, queue, tree.
5
Scalar Data Structure
Conceptual View: a variable name (var1) bound to a value
Physical Layout in the Computer Memory: [Figure: memory addresses 0238 to 0245, with the value stored at the address of var1]
Assignment operation: var1 := value; var2 := var1; var1 := var3;
6
Composite Data Structure: Array
Conceptual View: variable name A, array A[1..5] with elements v_1, v_2, v_3, v_4, v_5 (positions drawn as 0 1 2 3 4)
Accessing array elements: A[0] := 5; k := 1; A[k] := 11; A[k+1] := A[k] + 3
Physical Layout in the Computer Memory: [Figure: memory addresses 0238 to 0245 holding the elements v_1 to v_5, followed by nil]
7
Data Abstraction: Tree
Conceptual View: tree T with nodes v_1, v_2, v_3, v_4
Accessing the elements: T.value := 12; T.left := new(T); T.right := new(T)
Physical Layout in the Computer Memory: [Figure: memory addresses 0238 to 0245; T starts at 0238, and the nodes v_1, v_2, v_3 are records linked by pointer fields (0241, 0244, 0247, nil)]
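The field accesses on this slide can be sketched in Python as follows. This is only an illustration: the class name `Node` and the placeholder value 0 for freshly allocated children are assumptions, with `Node(...)` standing in for the slide's `new(T)`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """A binary tree node: one value plus two pointer fields."""
    value: int
    left: Optional["Node"] = None   # None plays the role of nil
    right: Optional["Node"] = None


# Mirrors: T.value := 12; T.left := new(T); T.right := new(T)
t = Node(12)
t.left = Node(0)    # placeholder value, an assumption
t.right = Node(0)
```

Each node is a separate record, so the children may sit anywhere in memory; only the pointer fields connect them.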
8
Space Analysis
Storage space, like time, is a limited resource that is important to programmers.
Space requirements are also expressed as a function of the input size.
Space functions are classified in the same manner as running times.
9
Complexity Analysis: Sorting
Algorithm        Time complexity
Insertion sort   O(n^2)
Quicksort        O(n log n)
Space complexity: O(n)
10
Space-Time Tradeoff
Reductions in running time are often possible if we increase storage requirements.
Decreasing the amount of storage used by an algorithm usually results in longer running times.
Using an array to look up previously computed values can drastically increase the speed of a function.
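A minimal sketch of the last point, using Fibonacci numbers as an assumed example (the slides do not name a specific function). `fib_slow` uses no extra storage beyond the call stack but recomputes values exponentially often; `fib_table` spends O(n) extra space on an array and runs in O(n) time.

```python
def fib_slow(n):
    """Recomputes the same subproblems over and over: exponential time."""
    return n if n < 2 else fib_slow(n - 1) + fib_slow(n - 2)


def fib_table(n):
    """Stores every previously computed value in an array: O(n) time, O(n) space."""
    table = [0, 1] + [0] * max(0, n - 1)
    for k in range(2, n + 1):
        table[k] = table[k - 1] + table[k - 2]
    return table[n]
```

Both functions return the same values; only the time and space costs differ.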
11
Case Study: Searching for Patterns
String matching problem: find the first occurrence of a pattern P of length m inside a text S of length n.
12
String Matching - Applications
Text editing
Term rewriting
Lexical analysis
Information retrieval
And bioinformatics
13
Model for Pattern-Matching Problem
[Diagram: Pattern P -> Pattern Matcher generator -> Pattern Matcher; Input string S -> Pattern Matcher -> Yes / No]
14
Array Implementation
Text S represented as an array of characters: S[1..n]
Pattern P represented as an array of characters: P[1..m]
[Figure: text S = agcagaagagta, with pattern P = gaggagagag slid to the right one position at a time]
Time complexity = O(m.n)
Space complexity = O(m + n)
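The sliding comparison behind the O(m.n) bound can be sketched as below. The function name `naive_find` and the 0-based return value are assumptions made for Python; the slides index both arrays from 1.

```python
def naive_find(p, s):
    """Return the 0-based index of the first occurrence of p in s, or -1.

    Tries every alignment of p against s and compares character by
    character, so the worst case is O(m * n) comparisons.
    """
    m, n = len(p), len(s)
    for start in range(n - m + 1):
        k = 0
        while k < m and s[start + k] == p[k]:
            k += 1
        if k == m:          # all m characters matched at this alignment
            return start
    return -1
```

After a mismatch the next alignment starts from scratch, which is exactly the repeated work the later slides set out to avoid.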
15
Can we be more clever?
When a mismatch is detected, say at position k in the pattern string, we have already successfully matched k-1 characters.
We try to take advantage of this to decide where to restart matching.
[Figure: text S = agcagaagagta, with pattern P = gaggagagag shifted by more than one position after a mismatch]
16
Problem of Matching Keyword
PROBLEM. Given a pattern p consisting of a single keyword and an input string s, answer “yes” if p occurs as a substring of s, that is, if s = xpy for some x and y; “no” otherwise.
For convenience, we will assume p = p_1 p_2 … p_m and s = s_1 s_2 … s_n, where p_i represents the ith character of the pattern and s_j the jth character of the input string.
17
The Knuth-Morris-Pratt Algorithm
Observation: when a mismatch occurs, we may not need to restart the comparison all the way back (from the next input position).
What to do: construct a table h, called the next function, that determines how many characters to slide the pattern to the right in case of a mismatch during the pattern-matching process.
Knuth, D. E., Morris, J. H. and Pratt, V. R., Fast Pattern Matching in Strings, SIAM J. Comput., 6(2), 1977, 323-350
18
The key idea is that if we have successfully matched the prefix p_1 p_2 … p_{i-1} of the keyword with the substring s_{j-i+1} s_{j-i+2} … s_{j-1} of the input string and p_i ≠ s_j, then we do not need to reprocess any of the suffix s_{j-i+1} s_{j-i+2} … s_{j-1}, since we know this portion of the text string is the prefix of the keyword that we have just matched.
19
Note that the inner while loop iterates as long as p_i and s_j do not match each other (and i has not dropped to 0). Once they match, the inner while loop terminates, both i and j advance by one, and the inner loop is entered again for the next pair of characters.
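The loop structure described in this note can be sketched as follows, assuming the next function h has already been computed and is passed in as a list with an unused entry at index 0, so that h[i] lines up with the 1-based h_i of the slides. The function name `kmp_find` and the 0-based return value are assumptions; the code on the original slide is not preserved in this transcript.

```python
def kmp_find(p, s, h):
    """Return the 0-based index of the first occurrence of p in s, or -1.

    i and j are the 1-based positions in pattern and text, as on the slides.
    h[i] is the next function; h[0] is unused padding.
    """
    m, n = len(p), len(s)
    i, j = 1, 1
    while i <= m and j <= n:
        # Inner loop: slide the pattern while p_i and s_j disagree.
        while i > 0 and p[i - 1] != s[j - 1]:
            i = h[i]
        # Either p_i matches s_j, or i == 0 and the pattern restarts after s_j.
        i += 1
        j += 1
    return j - m - 1 if i > m else -1


# Usage with the 9-character pattern and next function 0 1 0 2 1 0 4 0 2
# that appear later in these slides.
pattern = "abaababaa"
next_fn = [None, 0, 1, 0, 2, 1, 0, 4, 0, 2]
position = kmp_find(pattern, "abaabaababaa", next_fn)
```

Note that j only ever increases, so the text is never re-read.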
20
An Important Property of the Next Function in KMP Algorithm
h_i is the largest k less than i such that p_1 p_2 … p_{k-1} is a suffix of p_1 p_2 … p_{i-1} (i.e., p_1 … p_{k-1} = p_{i-k+1} … p_{i-1}) and p_i ≠ p_k. If there is no such k, then h_i = 0.
21
Backtrack or Not Backtrack?
Assume that for some i and j, P(i) ≠ S(j). What should we do?
The KMP algorithm chooses not to backtrack on the text S (i.e. on j), for a good reason.
The remaining choice is how to shift the pattern P (i.e. how to change i), and by how much.
If for each j the shift of P costs a small constant, then the total time complexity is clearly linear in n.
22
An Example
Given: [Figure: pattern and input string, not preserved in this transcript]
Next function: 0 1 0 2 1 0 4 0 2 1 0 7 1
Scenario 1: i = 12, j = 12. What is h_i = h_12? h_i = 7.
Scenario 2: h_12 = 7, so i = 7.
23
An Example (cont'd)
Scenario 3: h_7 = 4, so i = 4.
Subsequently i = 2, 1, 0.
Finally, a match is found.
24
Question: when P(i) ≠ S(j), how much should we shift?
Observations:
We should shift P to the right.
But by how much?
One answer is: do not backtrack on S(j).
[Figure: pattern P, scanned by index i starting at i = 1 (current character P_i), aligned against input S, scanned by index j starting at j = 1 (current character S_j)]
25
Observation: Never backtrack on the input string S.
26
How to Compute the Next Function?
[Flowchart fragments: j := h_j; h_i := h_j; h_i := j]
27
How to Compute the Next Function?
[Flowchart fragments: j := h_j; h_i := h_j; h_i := j]
Note: once p_i does not match p_j, we know that j is the index we are looking for: the prefix that ends just before position j matches the suffix that ends just before position i.
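The three assignments in the flowchart fit the following sketch of the next-function computation. The surrounding loop structure is an assumption reconstructed from those fragments and from the complexity hint later in the deck (an inner loop doing j := h_j, an outer loop doing i := i+1); the function name `compute_next` is also hypothetical.

```python
def compute_next(p):
    """Return the next function [h_1, ..., h_m] for pattern p.

    Internally h is 1-based with h[0] unused, matching the slides.
    """
    m = len(p)
    h = [0] * (m + 1)          # h_1 = 0
    i, j = 1, 0
    while i < m:
        # Fall back until p_i matches p_j or no prefix is left.
        while j > 0 and p[i - 1] != p[j - 1]:
            j = h[j]
        i += 1
        j += 1
        if p[i - 1] == p[j - 1]:
            h[i] = h[j]        # a shift to j would mismatch again, so skip ahead
        else:
            h[i] = j
    return h[1:]
```

For the pattern abaababaa used on the interpretation slides, this yields the table 0 1 0 2 1 0 4 0 2 shown there.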
28
Interpretation of the Next Function
Question: how to compute the next function?
Position:      9 8 7 6 5 4 3 2 1
Pattern:       a a b a b a a b a
Next function: 0 1 0 2 1 0 4 0 2
Note: P 2 = P 5, P 4 = P 9
29
Interpretation of the Next Function
Question: how to compute the next function?
Position:      1 2 3 4 5 6 7 8 9
Pattern:       a b a a b a b a a
Next function: 0 1 0 2 1 0 4 0 2
Note: P 1 = P 5, P 4 = P 9
30
Interpretation of the Next Function
Question: how to compute the next function?
Position:      1 2 3 4 5 6 7 8 9
Pattern:       a b a a b a b a a
Next function: 0 1 0 2 1 0 4 0 2
Note: P 1 = P 5, P 4 = P 9
31
KMP - Analysis
The KMP algorithm never needs to backtrack on the text string.
Time complexity = O(m + n): O(m) for preprocessing, O(n) for searching
Space complexity = O(m + n)
32
KMP Algorithm Complexity Analysis Hints
What is the cost of building the next function? (Hint: in the code for the next function, the operation j := h_j in the inner loop is never executed more often than the statement i := i+1 in the outer loop.)
What is the cost of the matching itself? (Hint: similar to the above.)
33
Other String Matching Algorithms
The Boyer-Moore Algorithm [Boyer, R. S. and Moore, J. S., A Fast String Searching Algorithm, CACM, 20(10), 1977, 762-772]
The Karp-Rabin Algorithm [Karp, R. M. and Rabin, M. O., Efficient Randomized Pattern-Matching Algorithms, IBM J. of Res. and Develop., 31(2), 1987, 249-260]
34
Matching of a Set of Keywords?
Given a pattern consisting of a set of keywords and an input string S, answer “yes” if some keyword occurs as a substring of S, and “no” otherwise.
How can we solve this?
35
How about repeatedly applying KMP?
What time complexity will the KMP algorithm have when matching k patterns?
- Preprocessing each of the k patterns: assuming each pattern has length O(m), this takes O(km) time.
- Searching takes O(n) time per pattern, so O(kn) time in total.
So, total time = O(k(m + n)).
36
Question: Can we improve the time complexity when k is large?
Answer: Yes, by preprocessing the input string instead of the patterns (tree implementation).
37
Model for Pattern-Matching Problem
[Diagram: Pattern P -> Pattern Matcher generator -> Pattern Matcher; Input string S -> Preprocessing -> Pattern Matcher -> Yes / No]
38
Tree Implementation: suffix tree
Instead of preprocessing the pattern P, preprocess the text T!
Use a tree structure in which all suffixes of the text are represented.
Search for the pattern by looking for substrings of the text.
You can easily test whether P is a substring of T because any substring of T is the prefix of some suffix.
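The "prefix of some suffix" observation can be tried out with a deliberately naive sketch. It builds an uncompressed suffix trie out of nested dictionaries, which takes quadratic time and space in the length of the text; it is not a suffix tree with substring-labeled edges and not one of the linear-time constructions. The function names are hypothetical.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts (O(n^2) space)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Walk down from the root; succeeds iff pattern is a prefix of some suffix."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("xabxac")
found = contains(trie, "bxa")      # True: "bxa" is a prefix of the suffix "bxac"
missing = contains(trie, "abc")    # False: no suffix starts with "abc"
```

Once the text has been preprocessed, each query costs time proportional to the pattern length only, which is what makes this approach attractive when there are many patterns.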
39
Suffix Tree
[Figure: suffix tree for the string xabxac, with leaves numbered by suffix start position. The node labels u and w on the two interior nodes will be used later.]
A suffix tree T for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m.
Each internal node, other than the root, has at least two children, and each edge is labeled with a nonempty substring of S.
No two edges out of a node can have edge-labels beginning with the same character.
The key feature of the suffix tree is that for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i. That is, it spells out S[i…m].
40
Note on Suffix Tree
Not all strings are guaranteed to have a corresponding suffix tree.
For example, xabxa does not have a suffix tree: here xa is both a prefix and a suffix of the string, so the path that spells the suffix xa does not necessarily end at a leaf.
How to fix the problem: add $, a special “termination” character, to the alphabet and append it to the string.
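The effect of the terminator can be checked directly on the slide's example. The helper below (a hypothetical name, written only for this illustration) lists the suffixes that are proper prefixes of another suffix; these are the suffixes whose paths would stop in the middle of the tree instead of at a leaf.

```python
def suffixes_hidden_inside_others(s):
    """Return the suffixes of s that are a proper prefix of another suffix of s."""
    suffixes = [s[i:] for i in range(len(s))]
    return [u for u in suffixes
            if any(v != u and v.startswith(u) for v in suffixes)]


without_terminator = suffixes_hidden_inside_others("xabxa")
with_terminator = suffixes_hidden_inside_others("xabxa$")
```

Without the terminator, the suffixes xa and a are hidden inside longer suffixes; with $ appended, every suffix ends in a character that occurs nowhere else, so none can be a prefix of another.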
41
Algorithm for Constructing a Suffix Tree
A suffix tree can be constructed in linear time [Weiner73, McCreight76, Ukkonen95]
42
Suffix Tree
Time complexity = O(n + m): O(n) for preprocessing, O(m) for searching
Space complexity = O(m + n)
43
Question
How can a suffix tree be used to help solve the string matching problem?
44
Other Tree-Based Methods
The suffix tree is not the only one.