CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.


Boyer-Moore algorithm
Three ideas:
– Right-to-left comparison
– Bad character rule
– Good suffix rule

Boyer-Moore algorithm: right-to-left comparison
Skip some characters without missing any occurrence.
[Figure: P aligned under T, with mismatching characters x and y.]

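Extended bad character rule
Position table for P = tpabxab (1-based, rightmost first):
  char  positions in P
  a     6, 3
  b     7, 4
  p     2
  t     1
  x     5
T: xpbctbxabpqqaabpqz
P: tpabxab
Comparing right to left, two characters match (^^) and the third mismatches (*). Let T(k) be the mismatched text character and i the mismatch position in P. Find the occurrence of T(k) in P that is immediately to the left of i, and shift P to align that occurrence with T(k). Here i = 5 and the nearest a to the left of i is at position 3, so shift by 5 – 3 = 2 and restart the comparison at the new alignment.
Preprocessing: O(n).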
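The position table and the shift lookup above can be sketched in Python. This is a minimal illustration rather than the lecture's own code: the names bad_char_positions and bad_char_shift are made up here, positions are 1-based to match the slide, and shifting by i when T(k) does not occur to the left of i is the usual convention, assumed here because the slide does not state it.

```python
from collections import defaultdict

def bad_char_positions(pattern):
    """Map each character to its 1-based positions in pattern, rightmost first."""
    positions = defaultdict(list)
    for i in range(len(pattern), 0, -1):
        positions[pattern[i - 1]].append(i)
    return dict(positions)

def bad_char_shift(positions, i, text_char):
    """Shift for a mismatch at 1-based pattern position i against text_char."""
    for pos in positions.get(text_char, []):
        if pos < i:          # closest occurrence strictly left of i
            return i - pos
    return i                 # assumed convention: move P past the text character

table = bad_char_positions("tpabxab")
print(table["a"], table["b"])          # [6, 3] [7, 4]
print(bad_char_shift(table, 5, "a"))   # 2
```

The lookup scans the position list for the mismatched character, so its cost is bounded by the number of occurrences of that character in P.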

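(Strong) good suffix rule
In preprocessing: for any suffix t of P, find the rightmost copy t' of t such that the character to the left of t' differs from the character to the left of t.
[Figure: T with matched suffix t and mismatched character x; P before the shift, with y to the left of t; P after the shift, with t' aligned under t and z ≠ y to the left of t'.]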

Example preprocessing: P = qcabdabdab
Bad character rule (where to shift depends on T):
  char  positions in P
  a     9, 6, 3
  b     10, 7, 4
  c     2
  d     8, 5
  q     1
Good suffix rule (does not depend on T):
  position  1 2 3 4 5 6 7 8 9 10
  P         q c a b d a b d a b
  value     0 0 0 0 2 0 0 2 0 0
[Figure labels: dab / cab and dabdab / cabdab mark copies of a suffix preceded by different characters.]

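Tricky case: P = abcab
  position  1 2 3 4 5
  P         a b c a b
  value     0 0 0 1 0
T: x y a a b c a b
Comparing right to left, ab matches (^ ^) and c mismatches (*): shift = 4 – 1 = 3.
Alternative table: N N 0 N N, with shift = i – L. Here 0 is a real value (the copy of ab starts at the very beginning of P), so "no copy" needs its own marker N.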

Example preprocessing: P = qcabdabdab (alternative good suffix table)
Bad character rule (where to shift depends on T): a: 9, 6, 3; b: 10, 7, 4; c: 2; d: 8, 5; q: 1
Good suffix rule (does not depend on T):
  position  1 2 3 4 5 6 7 8 9 10
  P         q c a b d a b d a b
  value     0 0 0 0 0 3 0 0 3 0
[Figure labels: dab / cab and dabdab / cabdab.]

Example preprocessing: P = qcabdabdab (table with N for "no copy")
Bad character rule (where to shift depends on T): a: 9, 6, 3; b: 10, 7, 4; c: 2; d: 8, 5; q: 1
Good suffix rule (does not depend on T):
  position  1 2 3 4 5 6 7 8 9 10
  P         q c a b d a b d a b
  value     N N N N 2 N N 2 N N
[Figure labels: dab / cab and dabdab / cabdab.]

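Algorithm KMP: basic idea
In preprocessing: for any position i in P, find the longest suffix t of P[1..i] that matches a prefix t' of P, such that the characters following t and t' differ (y ≠ z). For each i, let sp'(i) = length(t).
[Figure: T with matched part t followed by mismatching character x; P before the shift, with t followed by y; P after the shift, with t' aligned under t and followed by z.]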

Failure link
  position  1 2 3 4 5 6
  P         a a t a a c
  sp'(i)    0 1 0 0 2 0
If a character in T fails to match at position 6, re-compare it with the character at position 3.
[Figure labels: aaataaat, aat, aac.]

FSA
P: aataac, states 0 to 6, sp'(i) = 0 1 0 0 2 0
All other input goes to state 0.
If the next character in T is t, we go to state 3.
[Figure: automaton for aataac with extra transitions on a and t; labels aaataaat, aat, aac.]

Tricky case: P = abcab, sp'(i) = 0 0 0 0 2
[Figure: the failure-link version and the FSA version of the automaton, with a dummy state.]

How to actually do preprocessing?
Similar preprocessing for KMP and B-M:
– Find matches between a suffix and a prefix
– Both can be done in linear time
– P is usually short, so even a more expensive preprocessing may result in an overall gain
For each i, find a j. Similar to DP. Start from i = 2.
[Figure: for KMP, prefix t' and its copy t ending at i, followed by different characters x and y; for B-M, suffix t and its copy t', preceded by different characters.]

Fundamental preprocessing
Z_i: length of the longest substring starting at i that matches a prefix of P
– i.e. t = t', x ≠ y, Z_i = |t|
– With the Z-values computed, we can get the preprocessing for both KMP and B-M in linear time.
Example: P = aabcaabxaaz, Z = 0 1 0 0 3 1 0 0 2 1 0
How to compute Z-values in linear time?
[Figure: prefix t' = P[1..Z_i] followed by x; copy t = P[i..i+Z_i-1] followed by y.]

Computing Z in linear time
We have already computed all Z-values up to k-1 and need to compute Z_k. We also know the starting and ending points, l and r, of the previous match.
We know that t = t', therefore the Z-value at k-l+1 may be helpful to us.
[Figure: prefix t' = P[1..r-l+1] and its copy t = P[l..r]; position k inside t corresponds to position k-l+1 inside t'.]

Computing Z in linear time
Case 1: the previous r is smaller than k, i.e., no previous match extends beyond k. Do explicit comparison.
Case 2: Z_{k-l+1} < r-k+1. Then Z_k = Z_{k-l+1}. No comparison is needed.
Case 3: Z_{k-l+1} >= r-k+1. Then Z_k >= r-k+1, and comparison starts from r+1.
No character inside the box is compared twice, and there is at most one mismatch per iteration. Therefore, O(n).
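The three cases can be folded into a short Python sketch. It is an illustration under stated assumptions, not the lecture's code: indices are 0-based, the box is kept as the half-open interval [l, r), Z at index 0 is left at 0 as in the slide's example, and the function name z_values is made up here.

```python
def z_values(p):
    """Return Z-values of p (0-based; z[0] is left at 0 by convention)."""
    n = len(p)
    z = [0] * n
    l = r = 0                      # [l, r) is the match reaching furthest right
    for k in range(1, n):
        if k < r:
            # Inside the box: reuse the mirrored value, capped at the box end
            # (cases 2 and 3; case 1 simply starts from 0).
            z[k] = min(r - k, z[k - l])
        # Explicit comparison; it only runs past r, so no character
        # inside the box is compared again.
        while k + z[k] < n and p[z[k]] == p[k + z[k]]:
            z[k] += 1
        if k + z[k] > r:
            l, r = k, k + z[k]
    return z

print(z_values("aabcaabxaaz"))   # [0, 1, 0, 0, 3, 1, 0, 0, 2, 1, 0]
```

In case 2 the while loop stops on its first comparison, so the cost of that single mismatch is charged to the iteration, which is where the linear bound comes from.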

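Z-preprocessing for B-M and KMP
Both KMP and B-M preprocessing can be done in O(n).
KMP: for each j, let i = j + Z_j - 1 and set sp'(i) = Z_j. Scan j from n down to 2, so that for each i the smallest j (the longest match) is written last.
B-M: use Z backwards, i.e., compute the Z-values on the reversed pattern.
[Figure: prefix t' and its copy t = P[j..i], with i = j + Z_j - 1 and different following characters x and y.]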
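A hedged Python sketch of the KMP half of this slide: it derives sp' from the Z-values and then uses sp' as the failure function in a search. The function names z_values, strong_failure and kmp_search are illustrative, indices are 0-based (so sp[i] here corresponds to sp'(i+1) on the slides), and a compact Z-value helper is included only so the block runs on its own.

```python
def z_values(p):
    """Z-values of p, 0-based, with z[0] = 0."""
    n, l, r = len(p), 0, 0
    z = [0] * n
    for k in range(1, n):
        if k < r:
            z[k] = min(r - k, z[k - l])
        while k + z[k] < n and p[z[k]] == p[k + z[k]]:
            z[k] += 1
        if k + z[k] > r:
            l, r = k, k + z[k]
    return z

def strong_failure(p):
    """sp'[i]: longest proper suffix of p[:i+1] that is a prefix of p
    and is followed by a different character than p[i+1]."""
    z = z_values(p)
    sp = [0] * len(p)
    for j in range(len(p) - 1, 0, -1):   # backwards: smallest j is written last
        if z[j] > 0:
            sp[j + z[j] - 1] = z[j]
    return sp

def kmp_search(text, p):
    """Return 0-based start positions of all occurrences of non-empty p."""
    sp = strong_failure(p)
    hits, q = [], 0                      # q = number of pattern chars matched
    for i, c in enumerate(text):
        while q > 0 and p[q] != c:
            q = sp[q - 1]                # follow the failure link
        if p[q] == c:
            q += 1
        if q == len(p):
            hits.append(i - len(p) + 1)
            q = sp[q - 1]
    return hits

print(strong_failure("aataac"))            # [0, 1, 0, 0, 2, 0]
print(kmp_search("aataataac", "aataac"))   # [3]
```

The first printed list agrees with the sp'(i) row 0 1 0 0 2 0 shown for aataac on the failure-link slide.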

Keyword tree for spell checking
– O(n) time to construct, where n is the total length of the patterns
– Search time: O(m), where m is the length of the word
– A common prefix only needs to be compared once
[Figure: keyword tree for the words potato, poetry, pottery, science, school, with leaves numbered 1 to 5.]
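A keyword tree can be sketched with nested Python dictionaries. This is only an illustration: the names build_keyword_tree and lookup and the "$" end marker are assumptions made here, and the word list is the one read off the figure.

```python
def build_keyword_tree(words):
    """Build a keyword tree as nested dicts; '$' stores the word number."""
    root = {}
    for number, word in enumerate(words, start=1):
        node = root
        for ch in word:
            node = node.setdefault(ch, {})   # shared prefixes reuse nodes
        node["$"] = number
    return root

def lookup(root, word):
    """Return the word's number if it is in the tree, else None. O(len(word))."""
    node = root
    for ch in word:
        if ch not in node:
            return None
        node = node[ch]
    return node.get("$")

tree = build_keyword_tree(["potato", "poetry", "pottery", "science", "school"])
print(lookup(tree, "pottery"))   # 3
print(lookup(tree, "pot"))       # None
```

Construction touches each pattern character once and lookup touches each character of the query once, matching the O(n) and O(m) bounds on the slide.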

Aho-Corasick algorithm
– Generalizing KMP
– Create failure links
– Basis of the fgrep algorithm
Given the following patterns:
– potato
– tattoo
– theater
– other

Failure link
[Figure: keyword tree for potato, tattoo, theater, other (leaves 1 to 4, root 0) with failure links.]
Text: potterisapersonwhomakespottery

Failure link
[Figure: keyword tree for potato, tattoo, theater, other (leaves 1 to 4, root 0) with failure links.]
O(n) preprocessing and O(m+k) searching, where k is the number of occurrences.
An FSA can be created similarly. It requires more space, and its preprocessing time depends on the alphabet size.

A problem with failure links
Patterns: {potato, other, pot}
[Figure: keyword tree with root 0 and leaves 1 to 3; pot ends at an internal node on the path to potato.]

A problem with failure links for multiple patterns
Patterns: {potato, other, pot, the, he, era}
Text: potherarac
[Figure: keyword tree with root 0, numbered pattern ends 1 to 5, and failure links.]

Output link
Patterns: {potato, other, pot, the}
Text: potherarac
– Failure link: taken when a mismatch occurs.
– Output link: always taken (but will return).
[Figure: keyword tree with root 0, numbered pattern ends 1 to 5, and branches for he and era.]
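Failure and output links can be sketched together in Python. This is a simplified illustration with made-up names (build_automaton, search): states are integers, and instead of following output links at search time, each state's output list is merged with that of its failure state during preprocessing, which reports the same matches.

```python
from collections import deque

def build_automaton(patterns):
    """Keyword tree with failure links; out[s] lists patterns ending at s."""
    goto, fail, out = [{}], [0], [[]]
    for idx, pat in enumerate(patterns):
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(idx)
    queue = deque(goto[0].values())          # depth-1 states fail to the root
    while queue:                             # breadth-first over the tree
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]   # merged "output links"
    return goto, fail, out

def search(text, patterns):
    """Return (0-based start, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]              # taken only on a mismatch
        state = goto[state].get(ch, 0)
        for idx in out[state]:               # always checked
            hits.append((i - len(patterns[idx]) + 1, patterns[idx]))
    return hits

pats = ["potato", "other", "pot", "the", "he", "era"]
print(sorted(search("potherarac", pats)))
# [(0, 'pot'), (1, 'other'), (2, 'the'), (3, 'he'), (4, 'era')]
```

Without the merged output lists, the occurrences of the and he inside other would be missed, since the automaton never takes a failure link while other keeps matching.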

Suffix tree
All algorithms we talked about so far preprocess the pattern(s):
– Karp-Rabin: small pattern, small alphabet
– Boyer-Moore: fastest in practice. O(m) worst case.
– KMP: O(m)
– Aho-Corasick: O(m)
In some cases we may prefer to preprocess T:
– Fixed T, varying P
Suffix tree: basically a keyword tree of all suffixes

Suffix tree
T: xabxac
Suffixes: 1. xabxac 2. abxac 3. bxac 4. xac 5. ac 6. c
– Naïve construction: O(m^2) using Aho-Corasick.
– Smarter: O(m). Very technical, with a big constant factor.
– Create an internal node only when there is a branch.
[Figure: suffix tree of xabxac with leaves 1 to 6.]

Suffix tree implementation: explicitly labeling the sequence end
T: xabxa becomes T: xabxa$
[Figure: without $, only suffixes 1 to 3 end at leaves; with $ appended, all five suffixes end at leaves 1 to 5.]

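Suffix tree implementation: implicitly labeling edges
T: xabxa$
Each edge stores a pair of positions in T instead of the substring itself, e.g. 1:2, 2:2 and 3:$.
[Figure: the tree for xabxa$ with explicit edge labels, and the same tree with position-pair labels.]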

Suffix links
– Similar to failure links in a keyword tree
– Only link internal nodes having branches
[Figure: a path labeled xabcdefghij and a path labeled abcdefghij; the node for xabc (branching on f) links to the node for abc.]

Suffix tree construction
[Figure sequence: step-by-step construction of the suffix tree for T = acatgacatt... (positions 1234567890...), adding suffixes 1 to 10 in turn; edges carry position labels such as 1:$, 2:$, 4:$ and 5:$.]
With this suffix link, when we later need to add another suffix, say acaty, we can use the link to avoid going back to the root and re-comparing “cat”.

ST application: pattern matching
Find all occurrences of P = xa in T:
– Find the node v in the ST that matches P
– Traverse the subtree rooted at v to get the locations
T: xabxac
– O(m) to construct the ST (large constant factor)
– O(n) to find v: linear in the length of P instead of T!
– O(k) to get all leaves, where k is the number of occurrences
[Figure: suffix tree of xabxac with leaves 1 to 6.]
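The two search steps can be illustrated in Python with an uncompressed suffix trie. This is a deliberate simplification and an assumption of this sketch: it takes O(m^2) time and space to build, unlike the linear-time suffix tree the slide refers to. The names build_suffix_trie and occurrences and the "leaf" key are made up here.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie.
    The key 'leaf' stores the 1-based start of the suffix ending there."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1
    return root

def occurrences(root, pattern):
    """Walk down along pattern, then collect every leaf below that node."""
    node = root
    for ch in pattern:                 # step 1: find node v, O(len(pattern))
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:                       # step 2: traverse the subtree under v
        cur = stack.pop()
        for key, child in cur.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)

trie = build_suffix_trie("xabxac")
print(occurrences(trie, "xa"))   # [1, 4]
```

In the trie the subtree walk can visit more than k nodes; the O(k) bound on the slide relies on the compressed tree, where every internal node branches.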

ST application: repeats finding
– The genome contains many repeated DNA sequences
– Repeat sequence length varies from 1 nucleotide to a whole gene
– Highly repetitive DNA in some non-coding regions: 6 to 10 bp, repeated 100,000 to 1,000,000 times
– Genes may have multiple copies (50 to 10,000)

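Find longest repeated substring
Do a tree traversal and compute the length L of the path label at each node; the internal node with the largest L gives the longest repeated substring. O(m).
[Figure: path with edges 2:5, 6:10 and 15:18, and nodes labeled L = 4, L = 9, L = 8.]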
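A hedged Python sketch of this traversal, again on an uncompressed suffix trie rather than a true suffix tree (so it is quadratic, not O(m)). The function name is illustrative, and "$" is assumed not to occur in the text. A branching node in the trie corresponds to an internal node of the suffix tree.

```python
def longest_repeated_substring(text):
    """Return a longest substring occurring at least twice ('' if none)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:] + "$":     # '$' makes every suffix end at a leaf
            node = node.setdefault(ch, {})
    best = ""
    stack = [(root, "")]                  # (node, path label from the root)
    while stack:
        node, label = stack.pop()
        # Two or more children: the label is followed by two different
        # characters, so it occurs at least twice in text.
        if len(node) >= 2 and len(label) > len(best):
            best = label
        for ch, child in node.items():
            stack.append((child, label + ch))
    return best

print(longest_repeated_substring("acatgacatt"))   # acat
```

When several repeats share the maximum length, which one is returned depends on traversal order.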

Repeats finding
Find all repeats that are at least k residues long and appear at least p times in the sequence:
– Phase 1: top-down, compute the label length L at each node
– Phase 2: bottom-up, count the number of leaves N descended from each internal node
For each node with L >= k and N >= p, print all leaves. O(m) to traverse the tree.

Repeats finding
Find repeats with at least 3 bases and 2 occurrences in acatgacatt:
– cat
– acat
– aca
[Figure: suffix tree of acatgacatt (positions 1234567890) with leaves 1 to 10.]

Repeats finding
1. Left-maximal repeat
– S[i+1..i+k] = S[j+1..j+k]
– S[i] != S[j]
2. Right-maximal repeat
– S[i+1..i+k] = S[j+1..j+k]
– S[i+k+1] != S[j+k+1]
3. Maximal repeat
– S[i+1..i+k] = S[j+1..j+k]
– S[i] != S[j], and S[i+k+1] != S[j+k+1]
Examples in acatgacatt: 1. aca 2. cat 3. acat

Repeats finding
How to find maximal repeats?
– A maximal repeat is a right-maximal repeat whose occurrences have different left characters
[Figure: suffix tree of acatgacatt (positions 1234567890) with the left characters of the leaves collected at each node.]

ST application: word enumeration
Find all k-mers that occur at least p times:
– Compute (L, N) for each node
– Find nodes v with L >= k, L(parent) < k, and N >= p
– Traverse the subtree rooted at v to get the locations
This can be used in many applications, for example to find words that appear frequently in a genome or a document.
[Figure: nodes above the cut have L < k; the selected nodes have L >= k and N >= p.]

Joint suffix tree
Build an ST for two or more strings.
Two strings S1 and S2: S* = S1 & S2
Build a suffix tree for S* in time O(|S1| + |S2|).
The separator will only appear in edges ending in a leaf.

S1 = abcd, S2 = abca, S* = abcd&abca$
[Figure: suffix tree of S*; each leaf is labeled (string, position): 1,1 2,1 1,2 1,3 1,4 2,2 2,3 2,4.]

To simplify
We don’t really need to do anything, since all edge labels are implicit. The right-hand side is more convenient to look at.
[Figure: left, the suffix tree of abcd&abca$, where the part of each leaf edge after & is useless; right, the same tree with leaf edges cut at the separator. Leaves: 1,1 2,1 1,2 1,3 1,4 2,2 2,3 2,4.]

Application of JST: longest common substring
– For each internal node v, keep a bit vector B[2]
– B[i] = 1 if some leaf below v is a suffix of S_i
– Find all internal nodes with B[1] = B[2] = 1
– Report the one with the longest label
– Can be extended to k sequences: just use a longer bit vector
O(m), where m is the total sequence length.
[Figure: simplified joint suffix tree for S1 = abcd and S2 = abca.]
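The bit-vector idea can be sketched in Python on a joint suffix trie. Several things here are assumptions of the sketch rather than the lecture's method: the trie is uncompressed (quadratic, not O(m)), the bit vector is an integer mask stored under the key "mask", the bits are set while inserting suffixes instead of in a bottom-up pass, and the function name is made up.

```python
def longest_common_substring(s1, s2):
    """Return a longest substring shared by s1 and s2 ('' if none)."""
    root = {"mask": 0}
    for bit, s in ((1, s1), (2, s2)):        # bit 1 for S1, bit 2 for S2
        for start in range(len(s)):
            node = root
            node["mask"] |= bit
            for ch in s[start:]:
                node = node.setdefault(ch, {"mask": 0})
                node["mask"] |= bit          # this node lies on a suffix of s
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        if node["mask"] == 3 and len(label) > len(best):
            best = label                     # label occurs in both strings
        for key, child in node.items():
            if key != "mask" and child["mask"] == 3:
                stack.append((child, label + key))
    return best

print(longest_common_substring("abcd", "abca"))   # abc
```

Extending this to k strings only needs one bit per string and a test against the full mask, mirroring the longer bit vector mentioned on the slide.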

Application of JST: exact motif finding problem
Given K strings, find all substrings with L >= l that appear in at least d strings.
Build a joint suffix tree with all strings: S* = S1 & S2 % S3 * S4 @ S5 ! S6 + S7
– Use a unique end character for each string
– Not really necessary if caution is taken in construction

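[Figure: a node with L >= k below the cut L < k; its children carry B = 1010 (leaves 1,x and 3,x) and B = 0011 (leaves 3,x and 4,x).]
B = 1010 | 0011 = 1011, |B| = 3
O(mK), where m is the total sequence length and the factor K is the cost of a bitwise OR of two bit vectors.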

Many other applications
– Reproduce the behavior of Aho-Corasick
– DNA fingerprinting: given a database of people’s DNA sequences and a short DNA sequence, which person is it from?
– Recognizing DNA contamination
– Indexing sequence databases
– …
Catch:
– Large constant factor for space requirement (15-40 bytes per base for DNA)
– Large constant factor for construction
– Suffix array: trade off time for space

Summary
One T, one P:
– Boyer-Moore is the choice
– KMP works but is not the best
One T, many P:
– Aho-Corasick
– Suffix tree
One fixed T, many varying P:
– Suffix tree
Two or more T’s:
– Suffix tree, joint suffix tree, suffix array
[Side labels on the slide: alphabet independent / alphabet dependent.]
