1
Exact String Matching Algorithms
2
Copyright notice Many of the images in this PowerPoint presentation are from other people. The copyright belongs to the original authors. Thanks!
3
Overview
Sequence alignment has two sub-problems:
–How to score an alignment with errors
–How to find an alignment with the best score
Today: exact string matching
–Does not allow any errors
–Efficiency (time and space) becomes the sole consideration
4
Why exact string matching?
The most fundamental string comparison problem
Often the core of more complex string comparison algorithms
–E.g., BLAST
Often repeatedly called by other methods
–Usually the most time-consuming part
–A small improvement can improve overall efficiency considerably
5
Definitions
Text: a longer string T (length m)
Pattern: a shorter string P (length n)
Exact matching: find all occurrences of P in T
6
The naïve algorithm
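A minimal sketch of the naïve method in C++ (the function name is illustrative): slide P along T one position at a time and compare character by character, which gives the O(mn) worst case discussed on the next slide.

#include <string>
#include <vector>

// Naive exact matching: try every alignment of P against T and compare
// character by character until a mismatch or a full match.
std::vector<int> naiveMatch(const std::string& T, const std::string& P) {
    std::vector<int> occ;
    if (P.empty() || P.size() > T.size()) return occ;
    for (size_t i = 0; i + P.size() <= T.size(); ++i) {
        size_t j = 0;
        while (j < P.size() && T[i + j] == P[j]) ++j;
        if (j == P.size()) occ.push_back((int)i);   // occurrence starting at i (0-based)
    }
    return occ;
}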
7
Time complexity
Worst case: O(mn). How to speed up?
–Pre-process T or P
–Why can pre-processing save us time? It uncovers the structure of T or P, determines when we can skip ahead without missing anything, and determines when we can infer the result of character comparisons without doing them.
8
Cost for exact string matching
Total cost = cost(preprocessing) + cost(comparison) + cost(output)
The output cost is constant (fixed for a given answer), the comparison cost is what we try to minimize, and the preprocessing cost is overhead. Hope: gain > overhead.
9
String matching scenarios
One T and one P
–Search a word in a document
One T and many P all at once
–Search a set of words in a document
–Spell checking (fixed P)
One fixed T, many P
–Search a completed genome for short sequences
Two (or many) T's for common patterns
Q: Which one to pre-process? A: Always pre-process the shorter sequence, or the one that is repeatedly used.
10
Pre-processing algorithms
Pattern preprocessing
–Knuth-Morris-Pratt algorithm (KMP)
–Aho-Corasick algorithm (multiple patterns)
–Boyer-Moore algorithm (discussed if time permits) – the choice in most cases, typically sub-linear time
Text preprocessing
–Suffix tree – very useful for many purposes
–Suffix array
–Burrows-Wheeler Transformation
11
Algorithm KMP: Intuitive example 1
Observation: by reasoning on the pattern alone (P = abcxabcde), we can determine that if a mismatch happens when comparing P[8] with T[i], we can shift P by four chars and compare P[4] with T[i], without missing any possible matches. Number of comparisons saved over the naïve approach: 6.
[Figure: P = abcxabcde aligned against T with the mismatch at P[8], and below it the naïve shift-by-one alignment for comparison.]
12
Intuitive example 2
Observation: by reasoning on the pattern alone, we can determine that if a mismatch happens between P[7] and T[j], we can shift P by six chars and compare T[j] with P[1], without missing any possible matches (shifting by only four would line up P[3] = 'c' with T[j], which we already know is not a 'c'). Number of comparisons saved over the naïve approach: 7.
[Figure: the mismatch alignment and the shifted alignment of P = abcxabcde against T.]
13
KMP algorithm: pre-processing
Key: the reasoning is done without even knowing what string T is. Only the location of the mismatch in P must be known.
Pre-processing: for any position i in P, find the longest proper suffix t = P[j..i] of P[1..i] that matches a prefix t' of P, such that the character following t in P (y = P[i+1]) differs from the character following t' (z); i.e., y ≠ z.
For each i, let sp(i) = length(t).
[Figure: t aligned as a suffix of the matched region of P against T, with y the character after t and z the character after the prefix t'.]
14
KMP algorithm: shift rule
Shift rule: when a mismatch occurs between P[i+1] and T[k], shift P to the right by i − sp(i) chars and compare x = T[k] with z = P[sp(i)+1].
This shift rule can be represented implicitly by creating a failure link from y = P[i+1] to z = P[sp(i)+1]. Meaning: when a mismatch occurs between a character x in T and P[i+1], resume the comparison between x and P[sp(i)+1].
15
Failure Link Example
P: aataac, sp(i) = 0 1 0 0 2 0
If a char in T fails to match at position 6 of P, re-compare it with the char at position 3 (= sp(5) + 1 = 2 + 1).
16
Another example
P: abababc, sp(i) = 0 0 0 0 0 4 0
If a char in T fails to match at position 7 of P, re-compare it with the char at position 5 (= sp(6) + 1 = 4 + 1).
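A sketch of the sp() preprocessing in C++ (0-based indices; the function name strongBorders is mine): first compute the classical "weak" failure values, then strengthen them by requiring y ≠ z as defined two slides back. It reproduces the tables above: aataac → 0 1 0 0 2 0 and abababc → 0 0 0 0 0 4 0.

#include <string>
#include <vector>

// sp[i]: length of the longest proper suffix of P[0..i] that is also a
// prefix of P AND whose following character differs from P[i+1] (y != z).
std::vector<int> strongBorders(const std::string& P) {
    int n = (int)P.size();
    std::vector<int> border(n, 0), sp(n, 0);
    for (int i = 1; i < n; ++i) {                 // weak (classical) borders
        int k = border[i - 1];
        while (k > 0 && P[i] != P[k]) k = border[k - 1];
        if (P[i] == P[k]) ++k;
        border[i] = k;
    }
    for (int i = 0; i < n; ++i) {                 // strengthen: require y != z
        int k = border[i];
        while (k > 0 && i + 1 < n && P[k] == P[i + 1]) k = border[k - 1];
        sp[i] = k;
    }
    return sp;
}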
17
KMP Example using Failure Link
T: aacaataaaaataaccttacta
[Figure: trace of P = aataac being compared against T, with markers for successful comparisons, mismatches, and comparisons skipped via failure links ("implicit comparisons").]
Time complexity analysis: each char in T may be compared up to n times, so a lousy analysis gives O(mn) time. A more careful analysis breaks the comparisons into two kinds:
–Comparison phase: the first time a char in T is compared to P. Total is exactly m.
–Shift phase: the first comparison made after each shift. Total is at most m.
Time complexity: O(2m) = O(m).
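And a sketch of the search loop that uses these failure values (it assumes the strongBorders() sketch above). The O(2m) bound is visible here: the matched length j increases at most once per text character, and every failure-link step decreases it, so there are at most 2m character comparisons in total.

#include <string>
#include <vector>

// KMP search using the sp[] table from the preprocessing sketch above.
// Returns the 0-based starting positions of all occurrences of P in T.
std::vector<int> kmpSearch(const std::string& T, const std::string& P) {
    std::vector<int> occ;
    if (P.empty()) return occ;
    std::vector<int> sp = strongBorders(P);
    int j = 0;                                    // characters of P matched so far
    for (int i = 0; i < (int)T.size(); ++i) {
        while (j > 0 && T[i] != P[j]) j = sp[j - 1];   // follow failure links
        if (T[i] == P[j]) ++j;
        if (j == (int)P.size()) {                 // full match ending at i
            occ.push_back(i - j + 1);
            j = sp[j - 1];                        // keep going to find overlapping matches
        }
    }
    return occ;
}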
18
Suffix Tree
A tree data structure representing a set of strings, e.g., { aeef, ad, bbfe, bbfg, c }.
–Allows the storage of all substrings of a given string in linear space.
[Figure: the tree for the set above, with edges labeled by characters.]
19
Better than hash tables? Hash tables are certainly easier to understand. One can produce a hash table of all length-k substrings of T in O(m) time and look up a k-length string x in O(k) time, finding all p places where x occurs in O(p) time. This is the same as the bound for suffix trees. But what if you don't know in advance how long the query string x will be? Hashing does not handle that, and most other string matching tricks don't work for it either.
20
Suffix Tree: definition
A suffix tree ST for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m. Each internal node, other than the root, has at least two children, and each edge is labeled with a nonempty substring of S. No two edges out of a node can have edge-labels beginning with the same character.
The key feature of the suffix tree is that for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i.
22
Suffix Trees: Example
Suffixes of ‘papua’: ‘papua’, ‘apua’, ‘pua’, ‘ua’, ‘a’, ‘’
NOTE: Assume the string terminates with some character found nowhere else in the string (e.g., ‘\0’).
24
Suffix Trees: Example Suffix tree for ‘papua’
25
Suffix tree for ‘caracas’ [figure]
26
Suffix tree for ‘mississippi’ [figure]
27
Suffix Trees
–Exact matching in linear time
–Many other applications
"We know of no other single data structure that allows efficient solutions to such a wide range of complex string problems." – Dan Gusfield
28
Exact string matching problem
Given a pattern P of length n, find all occurrences of P in a text T of length m – an O(n+m) algorithm.
Solution: Build a suffix tree ST for the text T in O(m) time. Then match the characters of P along the unique path in ST until either P is exhausted or no more matches are possible.
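A sketch of this search on a toy suffix-tree representation (the Node type below is my own minimal representation: each outgoing edge stores its label as an explicit string, keyed by the label's first character; real implementations store index pairs instead, as in the edge-label-compression slide later).

#include <map>
#include <string>
#include <vector>

struct Node {
    std::map<char, std::pair<std::string, Node*>> edges;  // first char -> (label, child)
    int leaf = -1;                  // suffix start position at a leaf, -1 for internal nodes
};

// Collect the suffix start positions stored at all leaves below v.
void collectLeaves(const Node* v, std::vector<int>& out) {
    if (v->leaf >= 0) out.push_back(v->leaf);
    for (const auto& e : v->edges) collectLeaves(e.second.second, out);
}

// Walk P down the tree from the root; if the walk succeeds, every leaf below
// the stopping point is an occurrence. O(n) to walk P, O(k) to report k hits.
std::vector<int> findAll(const Node* root, const std::string& P) {
    const Node* cur = root;
    size_t i = 0;                                  // characters of P matched so far
    while (i < P.size()) {
        auto it = cur->edges.find(P[i]);
        if (it == cur->edges.end()) return {};     // no path: P does not occur in T
        const std::string& label = it->second.first;
        for (size_t k = 0; k < label.size() && i < P.size(); ++k, ++i)
            if (P[i] != label[k]) return {};       // mismatch in the middle of an edge
        cur = it->second.second;                   // P may end mid-edge; that is fine
    }
    std::vector<int> occ;
    collectLeaves(cur, occ);
    return occ;
}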
30
Exact string matching problem
Find ‘ssi’ in ‘mississippi’
Every leaf below this point in the tree marks a starting location of ‘ssi’ in ‘mississippi’ (i.e., ‘ssissippi’ and ‘ssippi’).
31
Comparing to the other algorithms
Knuth-Morris-Pratt and Boyer-Moore both achieve this worst-case bound:
–O(m+n) when the text and pattern are presented together.
Suffix trees are much faster when the text is fixed and known first while the patterns vary:
–O(m) for a one-time processing of the text, then only O(n) for each new pattern.
A suffix-tree-based method is also fast for searching a number of patterns at one time against a single text (the exact set matching problem); the alternative is the Aho-Corasick algorithm, which preprocesses the patterns P instead of T.
32
Building the Suffix Tree How do we build a suffix tree? while suffixes remain: add next shortest suffix to the tree
33
Building the Suffix Tree: papua
[Figure sequence: the suffixes of ‘papua’ are inserted into the tree one at a time, one slide per step.]
39
Building the Suffix Tree How do we build a suffix tree? while suffixes remain: add next shortest suffix to the tree Naïve method: O(m²) (m = text size)
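A sketch of this naïve O(m²) construction, reusing the minimal Node type from the matching sketch a few slides back; it assumes the text ends with a unique terminator such as '$', and it allocates with new and never frees, purely for illustration.

// Insert every suffix of s into the tree, one at a time (naive O(m^2) method).
Node* buildNaive(const std::string& s) {           // s must end with a unique '$'
    Node* root = new Node();
    for (int i = 0; i < (int)s.size(); ++i) {
        Node* cur = root;
        int j = i;                                 // next character of suffix i to place
        while (true) {
            auto it = cur->edges.find(s[j]);
            if (it == cur->edges.end()) {          // no edge starts with s[j]:
                Node* leaf = new Node();           // hang the rest of the suffix here
                leaf->leaf = i;
                cur->edges[s[j]] = {s.substr(j), leaf};
                break;
            }
            std::string& label = it->second.first;
            Node* child = it->second.second;
            size_t k = 0;                          // walk down the edge label
            while (k < label.size() && s[j + k] == label[k]) ++k;
            if (k == label.size()) { cur = child; j += (int)k; continue; }
            Node* mid = new Node();                // mismatch inside the edge: split it
            mid->edges[label[k]] = {label.substr(k), child};
            Node* leaf = new Node();
            leaf->leaf = i;
            mid->edges[s[j + k]] = {s.substr(j + k), leaf};
            it->second = {label.substr(0, k), mid};
            break;
        }
    }
    return root;
}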
40
Building the Suffix Tree in O(m) time
In the previous example, we assumed that the tree can be built in O(m) time.
Weiner gave the original O(m) algorithm in 1973 (Knuth is claimed to have called it "the algorithm of 1973").
A more space-efficient algorithm by McCreight in 1976.
A simpler 'on-line' algorithm by Ukkonen in 1995.
41
Ukkonen's Algorithm
Build the suffix tree T for string S[1..m]:
–Build the tree in m phases, one for each character (online construction). At the end of phase i, we will have tree T_i, the tree representing the prefix S[1..i].
–In each phase i, we have i extensions, one for each character in the current prefix. At the end of extension j, we will have ensured that S[j..i] is in the tree T_i.
42
Ukkonen’s Algorithm
43
An Example
44
Ukkonen's Algorithm This is an O(m³) time, O(m²) space algorithm. We need a few implementation speed-ups to achieve the O(m) time and O(m) space bounds.
45
Suffix extension rules
Three possible ways to extend S[j..i] with character S[i+1]:
1. S[j..i] ends at a leaf. Add the character S[i+1] to the end of the leaf edge.
2. No path from the end of S[j..i] starts with the character S[i+1]. Split the edge and create a new node if necessary, then add a new leaf edge labeled with S[i+1]. (This is the only extension that increases the number of leaves! The new leaf represents the suffix starting at position j.)
3. There is already a path from the end of S[j..i] that starts with the character S[i+1], i.e., S[j..i+1] already corresponds to a path in the tree. Do nothing.
46
Ukkonen's Algorithm
47
Ukkonen’s Algorithm: Speed-up 1 Suffix Links – speed up navigation to the next extension point in the tree
48
Ukkonen's Algorithm: Speed-up 2
Skip/Count Trick – instead of stepping through each character along an edge, we can jump straight ahead, as long as we keep count of how many characters we are still entitled to skip.
51
Ukkonen's Algorithm: Speed-up 3
Edge-Label Compression
–Since we have a copy of the string, we don't need to store copies of the substrings for each edge.
–O(m²) space becomes O(m) space.
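A sketch of what this means concretely (the type and field names are mine): an edge no longer owns a string, only two indices into the one shared copy of S.

// Edge-label compression: the label of an edge is S[start..end], stored as
// two integers rather than as a copy of the substring. O(1) space per edge,
// so O(m) space for the whole tree.
struct CompressedEdge {
    int start;     // index in S where this edge's label begins
    int end;       // index in S where this edge's label ends (inclusive)
};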
52
Ukkonen's Algorithm: Speed-up 4
A match is a show stopper.
–If we find a match to our next character (rule 3 applies), we're done with this phase: every remaining extension in the phase would also be a rule-3 match.
53
Ukkonen's Algorithm: Speed-up 5
Once a leaf, always a leaf (implements rule 1 implicitly).
–We don't need to update each leaf individually, since a leaf edge always runs to the end of the current string. We can get these updates for free:
–Either (1) maintain a global end-of-string index shared by all leaf edges, or (2) label every leaf edge as running to the end of the whole string from the moment it is created.
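A sketch of option (1), building on the CompressedEdge sketch above (LEAF_END and globalEnd are my names): every leaf edge shares one end marker, so advancing it once per phase extends all leaves at once.

// "Once a leaf, always a leaf": leaf edges do not store their own end index.
const int LEAF_END = -1;     // sentinel meaning "runs to the current global end"
int globalEnd = 0;           // advanced by one at the start of every phase

int effectiveEnd(const CompressedEdge& e) {
    return e.end == LEAF_END ? globalEnd : e.end;   // rule 1 happens for free
}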
54
Ukkonen’s Algorithm: Speed-ups Because of speed-ups 4 and 5, we can pick up the next phase right where we ended the last one!
55
Ukkonen's Algorithm – mississippi with Speed-ups
void SuffixTree::update(char* s, int len) {
    int i;
    int j;
    // ... (details elided on the slide)
    for (i = 0, j = 0; i < len; i++) {      // one phase per character of s
        while (j <= i /* ... */) {          // ... this is where all the work happens
            // ...
        }
    }
}
56
Ukkonen's Algorithm – The Punch Line By combining all of the speed-ups, we can now construct a suffix tree T_m representing the string S[1..m] in O(m) time and in O(m) space!
57
Exact string matching
Both P (|P| = n) and T (|T| = m) are known:
–The suffix tree method achieves the same worst-case bound, O(n+m), as KMP.
T is fixed: build its suffix tree once, then each P is input (k = number of occurrences of P):
–Using the suffix tree: O(n+k) per pattern.
–In contrast, KMP (which preprocesses P): O(n+m) for each single P.
P is fixed, then T is input:
–Select KMP rather than a suffix tree, or the Aho-Corasick algorithm for a set of patterns (the exact set matching problem).
58
Applications Problems –linear-time longest common substring –constant-time least common ancestor –maximally repetitive structures –all-pairs suffix-prefix matching –compression –inexact matching –conversion to suffix arrays
59
Drawbacks
Suffix trees consume a lot of space.
–The space requirements can become prohibitive: |Tree(T)| is about 20|T| in practice. It is O(n), but the constant is quite big.
–Notice that if we indeed want to choose the outgoing edge to follow in O(1) time, we need an array of pointers of size |Σ| in each node.
60
Suffix array
We lose some of the functionality, but we save space.
Let s = abab. Sort the suffixes lexicographically: ab, abab, b, bab.
The suffix array gives the starting indices of the suffixes in sorted order: 3, 1, 4, 2.
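A direct (not O(n)) way to see what the structure is: sort the suffix start positions by the suffixes they name. For 'abab' this yields {2, 0, 3, 1} in 0-based indexing, i.e., the slide's 3, 1, 4, 2. The function name is illustrative.

#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Suffix array by plain sorting of suffixes; simple but O(n^2 log n) in the
// worst case, unlike the O(n) suffix-tree route on the next slide.
std::vector<int> suffixArrayBySorting(const std::string& s) {
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);            // 0, 1, ..., n-1
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return s.compare(a, s.size() - a, s, b, s.size() - b) < 0;  // suffix a < suffix b?
    });
    return sa;
}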
61
How do we build it?
Build a suffix tree. Traverse the tree in DFS, taking the outgoing edges of each node in lexicographic order, and write the leaf numbers into the suffix array. O(n) time.
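With the toy representation from the earlier matching sketch this is almost free: std::map keeps a node's outgoing edges sorted by their first character, so the same leaf-collecting DFS already emits the leaves in lexicographic order (assuming the terminator, e.g. '$', sorts before the other characters).

// Suffix array from a suffix tree: one DFS over the tree, O(n) time.
std::vector<int> suffixArrayFromTree(const Node* root) {
    std::vector<int> sa;
    collectLeaves(root, sa);    // edges are visited in lexicographic order
    return sa;
}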
62
How do we search for a pattern?
If P occurs in T, then all its occurrences are consecutive in the suffix array.
Do a binary search in Pos to find P:
–Compare P with the suffix at Pos(n/2).
–If P is lexicographically less, recurse on the first half of Pos.
–If P is lexicographically greater, recurse on the second half of Pos.
–Iterate!
Takes O(m log n) time.
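A sketch of this plain binary search (no acceleration yet), 0-based, with Pos passed in as pos: find the first suffix that is ≥ P, then report the consecutive run of suffixes that start with P.

#include <string>
#include <vector>

// O(m log n) character comparisons: log n probes, up to m characters each.
std::vector<int> saSearch(const std::string& T, const std::vector<int>& pos,
                          const std::string& P) {
    int lo = 0, hi = (int)pos.size();              // invariant: answer lies in [lo, hi)
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (T.compare(pos[mid], P.size(), P) < 0) lo = mid + 1;   // suffix at mid < P
        else hi = mid;
    }
    std::vector<int> occ;                          // occurrences are consecutive in pos
    for (int i = lo; i < (int)pos.size() && T.compare(pos[i], P.size(), P) == 0; ++i)
        occ.push_back(pos[i]);
    return occ;
}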
63
Example
Let S = mississippi. The sorted suffixes and their starting indices (the suffix array Pos):
i (11), ippi (8), issippi (5), ississippi (2), mississippi (1), pi (10), ppi (9), sippi (7), sissippi (4), ssippi (6), ssissippi (3)
Let P = issa. The binary search maintains the boundaries L and R and probes the middle suffix M.
64
How do we accelerate the search?
Maintain l = LCP(P, suffix at L) and r = LCP(P, suffix at R).
If l = r, then start comparing M to P at position l + 1.
65
How do we accelerate the search?
Suppose l > r and we know LCP(L, M):
–If LCP(L, M) < l, we go left.
–If LCP(L, M) > l, we go right.
–If LCP(L, M) = l, we start comparing P to M at position l + 1.
(The case l < r is symmetric, using LCP(M, R).)
66
Analysis of the acceleration
If we do more than a single comparison in an iteration, then max(l, r) grows by 1 for each extra comparison.
O(log n + m) time.
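A sketch of a simpler variant of this idea, which keeps l and r but starts each comparison at min(l, r) instead of using precomputed LCP(L, M) values; it is correct, though only the full version attains the O(log n + m) bound stated above. The function name is mine.

#include <algorithm>
#include <string>
#include <vector>

// Returns the index in pos of the first suffix that is >= P.
int acceleratedLowerBound(const std::string& T, const std::vector<int>& pos,
                          const std::string& P) {
    int L = -1, R = (int)pos.size();              // invariant: suffix(L) < P <= suffix(R)
    size_t l = 0, r = 0;                          // l = LCP(P, suffix L), r = LCP(P, suffix R)
    while (R - L > 1) {
        int M = (L + R) / 2;
        size_t k = std::min(l, r);                // suffix(M) is known to agree with P on k chars
        while (k < P.size() && pos[M] + k < T.size() && P[k] == T[pos[M] + k]) ++k;
        bool Mless = (k < P.size()) &&
                     (pos[M] + k == T.size() || T[pos[M] + k] < P[k]);
        if (Mless) { L = M; l = k; } else { R = M; r = k; }
    }
    return R;
}

All occurrences of P then start at pos[acceleratedLowerBound(T, pos, P)] and continue while P remains a prefix, exactly as in the plain search sketched earlier.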
67
Suffix Tree → Suffix Array in O(n) Time
Traverse the suffix tree and write out the leaves in order.
T[1..7] = "banana$"
[Figure: the suffix tree of banana$.]
Suffix (starting index), in sorted order: $ (7), A$ (6), ANA$ (4), ANANA$ (2), BANANA$ (1), NA$ (5), NANA$ (3)
68
Suffix Array → Suffix Tree
Longest common prefixes (LCPs) of 0 between adjacent suffixes tell us the branches emanating from the root.
Suffix (starting index), with the LCP to the following suffix:
$ (7) – 0
A$ (6) – 1
ANA$ (4) – 3
ANANA$ (2) – 0
BANANA$ (1) – 0
NA$ (5) – 2
NANA$ (3)
The zero LCPs split the sorted suffixes into one group per branch out of the root: { $ }, { A$, ANA$, ANANA$ }, { BANANA$ }, { NA$, NANA$ }.
69
Suffix Array → Suffix Tree
Then continue subdividing: within the group { A$, ANA$, ANANA$ }, the LCP of 1 splits A$ (6) off from { ANA$ (4), ANANA$ (2) }, which still share an LCP of 3.
70
Suffix Array → Suffix Tree
Then continue subdividing: ANA$ (4) and ANANA$ (2) are separated below an internal node of letter depth 3, and NA$ (5) and NANA$ (3) below an internal node of letter depth 2.
71
Suffix Array → Suffix Tree
Voilà: a suffix tree (augmented with letter depths)!
[Figure: the resulting suffix tree of banana$, with leaves 7, 6, 4, 2, 1, 5, 3 and internal nodes at letter depths 1, 3, and 2.]