
Presentation transcript:

Algorithms and Data Structures

/course/eleg67701-f/Topic-1b2 Outline
- Data Structures
- Space Complexity
- Case Study: string matching
  - Array implementation (e.g. KMP algorithm)
  - Tree implementation (e.g. suffix tree)

/course/eleg67701-f/Topic-1b3 Algorithm in action: data structure transformation
[Figure: an algorithm takes an input data structure and produces an output data structure, using an intermediate data structure along the way.]

/course/eleg67701-f/Topic-1b4 Basic Data Structures
- Scalar or "atomic" data structures
  - Building blocks for other data structures
  - Cannot be divided into sub-elements
  - Integer, floating-point, character, access (pointer) types
- Composite data structures
  - Arrays, records
- Data abstraction
  - Abstract data types: a collection of data values together with a set of well-specified operations on that data, e.g. list, stack, queue, tree.

/course/eleg67701-f/Topic-1b5 Scalar Data Structure
Conceptual view: a variable name (var1) bound to a value.
Physical layout in the computer memory: the value is stored at a memory address.
Assignment operation: var1 ← value; var2 ← var1; var1 ← var3;

/course/eleg67701-f/Topic-1b6 Composite Data Structure: Array
Conceptual view: variable name A, array A[1..5] holding the values v1, v2, v3, v4, v5.
Accessing array elements: A[0] ← 5; k ← 1; A[k] ← 11; A[k+1] ← A[k]
Physical layout in the computer memory: [figure: the values stored at consecutive memory addresses, followed by nil]

/course/eleg67701-f/Topic-1b7 Data Abstraction: Tree
Conceptual view: a tree T whose nodes hold the values v1, v2, v3, v4.
Accessing the elements: T.value ← 12; T.left ← new(T); T.right ← new(T)
Physical layout in the computer memory: [figure: nodes stored at memory addresses, with nil marking an absent child]
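The three access operations on the tree slide can be tried out with a small Python sketch. The class name `Node` and the use of `None` for the slide's nil are assumptions made for illustration; the slide itself only gives the pseudocode assignments.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One tree node: a value plus references to two children."""
    value: Optional[int] = None
    left: Optional["Node"] = None    # None plays the role of nil
    right: Optional["Node"] = None


# Mirrors: T.value <- 12; T.left <- new(T); T.right <- new(T)
T = Node()
T.value = 12
T.left = Node()
T.right = Node()
```

Each `Node()` call allocates a fresh object, so the references stored in `left` and `right` correspond to the memory addresses shown in the physical layout.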

/course/eleg67701-f/Topic-1b8 Space Analysis
- Storage space, like time, is another limited resource that is important to programmers
- Space requirements are also expressed as a function of the input size
- Space functions are classified in the same manner as running times

/course/eleg67701-f/Topic-1b9 Complexity Analysis: Sorting
Time complexity:
- Insertionsort: O(n^2)
- Quicksort: O(n log n) (average case)
Space complexity: O(n)

/course/eleg67701-f/Topic-1b10 Space-Time Tradeoff
- Reductions in running time are often possible if we increase storage requirements
- Decreasing the amount of storage used by an algorithm usually results in longer running times
Using an array to look up previously computed values can drastically increase the speed of a function.
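A minimal Python sketch of the array-lookup idea, using Fibonacci numbers as an example that is not taken from the slides. The function names `fib_slow` and `fib_table` are hypothetical.

```python
def fib_slow(n):
    """No extra storage, but recomputes the same values again and again."""
    return n if n < 2 else fib_slow(n - 1) + fib_slow(n - 2)


def fib_table(n):
    """Spends O(n) extra space on a table so each value is computed once."""
    table = [0, 1] + [0] * max(0, n - 1)
    for k in range(2, n + 1):
        table[k] = table[k - 1] + table[k - 2]   # look up earlier results
    return table[n]


print(fib_table(50))   # prints 12586269025
```

`fib_slow` takes exponential time in n, while `fib_table` takes linear time at the cost of the table.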

/course/eleg67701-f/Topic-1b11 Case Study: Searching for Patterns
Problem: find the first occurrence of a pattern P of length m inside a text S of length n. This is the string matching problem.

/course/eleg67701-f/Topic-1b12 String Matching - Applications
- Text editing
- Term rewriting
- Lexical analysis
- Information retrieval
- And, bioinformatics

/course/eleg67701-f/Topic-1b13 Model for Pattern-Matching Problem
[Figure: the pattern P is fed to a pattern-matcher generator, which produces a pattern matcher; the matcher reads the input string S and answers yes or no.]

/course/eleg67701-f/Topic-1b14 Array Implementation
Text S represented as an array of characters: S[1..n]
Pattern P represented as an array of characters: P[1..m]
[Figure: the pattern P = gaggagagag is slid along the text S = agcagaagagta one position at a time.]
Time complexity = O(mn)
Space complexity = O(m + n)
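The sliding comparison on this slide can be sketched in Python as follows. The function name `naive_find` and the 0-based return value (with -1 for "not found", like `str.find`) are assumptions for illustration; the slides use 1-based arrays.

```python
def naive_find(p, s):
    """Return the 0-based index of the first occurrence of p in s, or -1."""
    m, n = len(p), len(s)
    for start in range(n - m + 1):       # every alignment of P against S
        k = 0
        while k < m and s[start + k] == p[k]:
            k += 1                       # extend the match one character
        if k == m:                       # all m characters matched
            return start
    return -1
```

There are at most n - m + 1 alignments and each costs up to m comparisons, which gives the O(mn) time bound.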

/course/eleg67701-f/Topic-1b15 Can we be more clever?
- When a mismatch is detected, say at position k in the pattern string, we have already successfully matched k-1 characters.
- We try to take advantage of this to decide where to restart matching.
[Figure: the pattern P = gaggagagag aligned against the text S = agcagaagagta at successive restart positions.]

/course/eleg67701-f/Topic-1b16 Problem of Matching Keyword
PROBLEM. Given a pattern p consisting of a single keyword and an input string s, answer "yes" if p occurs as a substring of s, that is, if s = xpy for some x and y; "no" otherwise.
For convenience, we will assume p = p_1 p_2 … p_m and s = s_1 s_2 … s_n, where p_i represents the ith character of the pattern and s_j the jth character of the input string.

/course/eleg67701-f/Topic-1b17 The Knuth-Morris-Pratt Algorithm
Observation: when a mismatch occurs, we may not need to restart the comparison all the way back (from the next input position).
What to do: construct a table h, called the next function, that determines how many characters to slide the pattern to the right in case of a mismatch during the pattern-matching process.
Knuth, D. E., Morris, J. H. and Pratt, V. R., Fast Pattern Matching Algorithm for Strings, SIAM J. Comput Sci., 43, 1977,

/course/eleg67701-f/Topic-1b18 The key idea is that if we have successfully matched the prefix p_1 p_2 … p_{i-1} of the keyword with the substring s_{j-i+1} s_{j-i+2} … s_{j-1} of the input string and p_i ≠ s_j, then we do not need to reprocess any of s_{j-i+1} s_{j-i+2} … s_{j-1}, since we know this portion of the text string is the prefix of the keyword that we have just matched.

/course/eleg67701-f/Topic-1b19 Note that the inner while loop iterates as long as p_i and s_j do not match each other. Once they match, the inner while loop terminates, both i and j advance by one, and the inner loop is entered again.
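The loop structure described in this note can be sketched in Python as follows. This is a reconstruction from the description, not the slide's own listing. The function name `kmp_find`, the 0-based return value, and the example pattern `aabbaab` with its hand-computed next table are assumptions for illustration.

```python
def kmp_find(p, h, s):
    """Return the 0-based index of the first occurrence of p in s, or -1.

    h is the next function of p as a list with h[0] unused, so that
    h[i] corresponds to h_i in the 1-based notation of the slides.
    """
    m, n = len(p), len(s)
    P, S = " " + p, " " + s          # pad so that P[1..m] and S[1..n] are 1-based
    i = j = 1
    while i <= m and j <= n:
        # Inner loop: while p_i and s_j disagree, slide the pattern using h.
        while i > 0 and P[i] != S[j]:
            i = h[i]
        # Here either p_i == s_j or i == 0; advance both indices by one.
        i += 1
        j += 1
    return j - m - 1 if i > m else -1


PATTERN = "aabbaab"
NEXT = [0, 0, 0, 2, 1, 0, 0, 2]      # NEXT[0] unused; hand-computed for PATTERN
print(kmp_find(PATTERN, NEXT, "abaabaabbaab"))   # prints 5
```

The index j into the text only ever increases, which is the "never backtrack on S" behaviour the slides emphasise.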

/course/eleg67701-f/Topic-1b20 An Important Property of the Next Function in KMP Algorithm
h_i is the largest k less than i such that p_1 p_2 … p_{k-1} is a suffix of p_1 p_2 … p_{i-1} (i.e., p_1 … p_{k-1} = p_{i-k+1} … p_{i-1}) and p_i ≠ p_k. If there is no such k, then h_i = 0.
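To make the property concrete, here is a deliberately slow Python sketch that evaluates it literally, reading the condition on p_i and p_k as a mismatch (p_i differs from p_k). The function name `next_by_definition` and the example pattern `aabbaab` are assumptions for illustration; this is not the efficient construction.

```python
def next_by_definition(p):
    """Return h with h[0] unused and h[i] = h_i for i = 1..m, by brute force."""
    m = len(p)
    P = " " + p                                # pad so that P[1..m] is 1-based
    h = [0] * (m + 1)
    for i in range(1, m + 1):
        for k in range(i - 1, 0, -1):          # try the largest k first
            prefix = P[1:k]                    # p_1 ... p_{k-1}
            suffix = P[i - k + 1:i]            # p_{i-k+1} ... p_{i-1}
            if prefix == suffix and P[i] != P[k]:
                h[i] = k
                break                          # h[i] stays 0 if no k qualifies
    return h


print(next_by_definition("aabbaab")[1:])   # prints [0, 0, 2, 1, 0, 0, 2]
```

For example, h_7 = 2 because p_1 = a is a suffix of p_1 … p_6 = aabbaa and p_7 = b differs from p_2 = a, whereas k = 3 is rejected since p_7 = p_3 = b.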

/course/eleg67701-f/Topic-1b21 Backtrack or Not Backtrack?
Assume that for some i and j, P(i) ≠ S(j). What should we do?
- The KMP algorithm chooses not to backtrack on the text S (i.e. on j), for a good reason.
- The choice is how to shift the pattern P (i.e. how to change i), that is, by how much.
- If for each j the shift of P is a small constant, then the total time complexity is clearly linear in n.

/course/eleg67701-f/Topic-1b22 An Example
Given: [pattern not preserved in the transcript]
Next function: [table not preserved]
Input string: [not preserved]
Scenario 1: i = 12, j = 12. What is h_i = h_12? Answer: h_i = 7.
Scenario 2: h_12 = 7, so i = 7.

/course/eleg67701-f/Topic-1b23 An Example (Cont'd)
Scenario 3: h_7 = 4, so i = 4. Subsequently i = 2, 1, 0.
Finally, a match is found.

/course/eleg67701-f/Topic-1b24 Question: when P(i) ≠ S(j), how much should we shift?
Observations:
- We should shift P to the right.
- But by how much?
- One answer is: do not backtrack on S(j).
[Figure: the pattern P with index i pointing at P_i and the input S with index j pointing at S_j; both indices start at i = 1, j = 1.]

/course/eleg67701-f/Topic-1b25 Observation: Never backtrack on the input string S.

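/course/eleg67701-f/Topic-1b27 How to Compute the Next Function?
[Algorithm listing not preserved; the surviving assignments are: j := h_j; h_i := h_j; h_i := j]
Note: once p_i does not match p_j, we know that j is the index we are looking for: the prefix of the pattern that ends just before position j matches the suffix that ends just before position i.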
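One way the three surviving assignments fit together is the standard linear-time construction sketched below in Python. This is a reconstruction under that assumption rather than the slide's own listing; the function name `compute_next` and the example pattern `aabbaab` are hypothetical.

```python
def compute_next(p):
    """Return h with h[0] unused and h[i] = h_i for i = 1..m."""
    m = len(p)
    P = " " + p                  # pad so that P[1..m] is 1-based
    h = [0] * (m + 1)            # h_1 = 0
    i, j = 1, 0
    while i < m:
        # Shrink j until p_1..p_{j-1} can be extended by p_i, or j reaches 0.
        while j > 0 and P[i] != P[j]:
            j = h[j]             # j := h_j
        i += 1
        j += 1
        if P[i] == P[j]:
            h[i] = h[j]          # h_i := h_j (falling back to j would fail again)
        else:
            h[i] = j             # h_i := j
    return h


print(compute_next("aabbaab")[1:])   # prints [0, 0, 2, 1, 0, 0, 2]
```

The outer loop runs m - 1 times, and `j = h[j]` can only undo increments of j made by earlier outer iterations, so the total work is O(m).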

/course/eleg67701-f/Topic-1b28 Interpretation of the Next Function
- Interpretation: [figure: the pattern aababaaba aligned against a shifted copy of itself]
- Question: how to compute the next function?
Note: P_2 = P_5, P_4 = P

/course/eleg67701-f/Topic-1b30 Interpretation of the Next Function
- Interpretation: [figure: the pattern abaababaa aligned against a shifted copy of itself]
- Question: how to compute the next function?
Note: P_1 = P_5, P_4 = P_9

/course/eleg67701-f/Topic-1b31 KMP - Analysis
- The KMP algorithm never needs to backtrack on the text string.
Time complexity = O(m + n): O(m) for preprocessing plus O(n) for searching
Space complexity = O(m + n)

/course/eleg67701-f/Topic-1b32 KMP Algorithm Complexity Analysis Hints
- What is the cost of building the next function? (Hint: in the code for the next function, the operation j := h_j in the inner loop is never executed more often than the statement i := i+1 in the outer loop.)
- What is the cost of the matching itself? (Hint: similar to the above.)

/course/eleg67701-f/Topic-1b33 Other String Matching Algorithms
- The Boyer-Moore Algorithm [Boyer, R. S. and Moore, J. E., A Fast String Searching Algorithm, CACM, 20(10), 1977, 62-72]
- The Karp-Rabin Algorithm [Karp, R. M. and Rabin, M. O., Efficient Randomized Pattern-Matching Algorithm, IBM J. of Res. and Develop., 32(2), 1987]

/course/eleg67701-f/Topic-1b34 Matching of a Set of Keywords?
- Given a pattern consisting of a set of keywords and an input string S, answer "yes" if some keyword occurs as a substring of S, and "no" otherwise.
- How to solve this?

/course/eleg67701-f/Topic-1b35 How about repeatedly applying KMP?
What time complexity will the KMP algorithm have when matching k patterns?
- Preprocessing each of the k patterns: assuming each pattern has length O(m), this takes O(km) time.
- Searching takes O(n) time per pattern, so O(kn) in total.
So, total time = O(k(m + n)).

/course/eleg67701-f/Topic-1b36 Question: Can we improve the time complexity when k is large?
Answer: Yes, by preprocessing the input string (tree implementation).

/course/eleg67701-f/Topic-1b37 Model for Pattern-Matching Problem
[Figure: the same model as before (pattern P, pattern-matcher generator, pattern matcher, input string S, yes/no answer), now with a preprocessing step added.]

/course/eleg67701-f/Topic-1b38 Tree Implementation: suffix tree
- Instead of preprocessing the pattern P, preprocess the text T.
- Use a tree structure in which all suffixes of the text are represented.
- Search for the pattern by looking for substrings of the text.
- You can easily test whether P is a substring of T, because any substring of T is a prefix of some suffix.
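The "prefix of some suffix" idea can be tried out with an uncompressed suffix trie, sketched below in Python. This is a simplification, not a suffix tree: it takes quadratic time and space to build, whereas a real suffix tree compresses chains of edges and can be built in linear time. The function names `build_suffix_trie` and `occurs` are hypothetical, and the sketch assumes the character `$` does not occur in the text or the pattern.

```python
def build_suffix_trie(text):
    """Insert every suffix of text + '$' into a trie of nested dicts."""
    root = {}
    t = text + "$"                     # '$' is an assumed terminator character
    for start in range(len(t)):
        node = root
        for ch in t[start:]:           # walk/extend the path for this suffix
            node = node.setdefault(ch, {})
    return root


def occurs(trie, pattern):
    """True if pattern is a substring of the text, i.e. a prefix of a suffix."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("xabxac")
print(occurs(trie, "bxa"), occurs(trie, "xb"))   # prints True False
```

Once the text has been preprocessed, each query costs time proportional to the pattern length only, which is what makes this approach attractive when many patterns are matched against one text.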

/course/eleg67701-f/Topic-1b39 Suffix Tree
[Figure: suffix tree for the string xabxac, with leaves numbered 1 to 6. The node labels u and w on the two interior nodes will be used later.]
A suffix tree T for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m. Each internal node, other than the root, has at least two children, and each edge is labeled with a nonempty substring of S. No two edges out of a node can have edge-labels beginning with the same character. The key feature of the suffix tree is that for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i. That is, it spells out S[i…m].

/course/eleg67701-f/Topic-1b40 Note on Suffix Tree
- Not all strings are guaranteed to have a corresponding suffix tree.
- For example, xabxa does not have a suffix tree: xa is both a prefix and a suffix of the string, so the path for the suffix xa does not end at a leaf.
- How to fix the problem: add $, a special "termination" character, to the alphabet and append it to the string.
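A short Python check makes the problem with xabxa visible: it lists the suffixes that are proper prefixes of another suffix, since those are exactly the suffixes whose path cannot end at a leaf. The function name `hidden_suffixes` is hypothetical.

```python
def hidden_suffixes(s):
    """Suffixes of s that are a proper prefix of some other suffix of s."""
    suffixes = [s[i:] for i in range(len(s))]
    return [u for u in suffixes
            if any(v != u and v.startswith(u) for v in suffixes)]


print(hidden_suffixes("xabxa"))    # prints ['xa', 'a']
print(hidden_suffixes("xabxa$"))   # prints []
```

With the terminator appended, every suffix ends in a character that occurs nowhere else, so no suffix can be a prefix of another.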

/course/eleg67701-f/Topic-1b41 Algorithm for Constructing a Suffix Tree
- A suffix tree can be constructed in linear time [Weiner73, McCreight76, Ukkonen95]

/course/eleg67701-f/Topic-1b42 Suffix Tree
Time complexity = O(n + m): O(n) for preprocessing plus O(m) for searching
Space complexity = O(m + n)

/course/eleg67701-f/Topic-1b43 Question
- How can a suffix tree be used to help solve the string matching problem?

/course/eleg67701-f/Topic-1b44 Other Tree-Based Methods
- The suffix tree is not the only one.