Lecture 9-10 Exact String Matching Algorithms

1 CS5263 Bioinformatics
Lecture 9-10: Exact String Matching Algorithms

2 Overview
So far: pair-wise alignment and multiple alignment
Commonality: allowing errors when comparing strings
Two sub-problems:
- How to score an alignment with errors
- How to find the alignment with the best score
Today: exact string matching
- Do not allow any errors
- Efficiency becomes the sole consideration

3 Why exact string matching?
The most fundamental string comparison problem
Applications: word processors, information retrieval, DNA sequence retrieval, and many more
Is it still an interesting research problem? Yes, if the database is large
Exact string matching is often the core of more complex string comparison algorithms, e.g., BLAST
Often repeatedly called by other methods, and usually the most time-consuming part
A small improvement can improve overall efficiency considerably

4 Definitions
Text: a longer string T (length m)
Pattern: a shorter string P (length n)
Exact matching: find all occurrences of P in T

5 The naïve algorithm
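A minimal Python sketch of the naïve algorithm (not from the slides): slide P across T one position at a time, comparing left to right until a mismatch.

```python
def naive_match(T, P):
    """Slide the pattern over the text one position at a time.

    Returns all (0-based) starting positions of P in T.
    """
    m, n = len(T), len(P)
    occurrences = []
    for i in range(m - n + 1):
        # Compare P against T[i:i+n] left to right, stop at first mismatch.
        j = 0
        while j < n and T[i + j] == P[j]:
            j += 1
        if j == n:
            occurrences.append(i)
    return occurrences

print(naive_match("abcxabcabc", "abc"))   # → [0, 4, 7]
```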

6 Time complexity
Worst case: O(mn)
Best case: O(m), e.g. T = aaaaaaaaaaaaaa vs P = baaaaaaa (every alignment fails at the first comparison)
Average case? Alphabet {A, C, G, T}; assume both P and T are random, each char with equal probability
On average, how many chars do you need to compare before giving up at a position?

7 Average case time complexity
P(mismatch at 1st position): 3/4
P(mismatch at 2nd position): (1/4) · 3/4
P(mismatch at 3rd position): (1/4)^2 · 3/4
P(mismatch at kth position): (1/4)^(k-1) · 3/4
Expected number of comparisons per position, with p = 1/4:
E = Σ_k k · (1-p) · p^(k-1) = (1-p) · Σ_k k · p^(k-1) = (1-p) / (1-p)^2 = 1/(1-p) = 4/3
Average complexity: 4m/3
Not as bad as you thought it might be
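The 4/3 figure can be sanity-checked with a small Monte Carlo simulation (a sketch I added; the function name and parameters are my own): generate random pattern/text positions over {A,C,G,T} and count comparisons until the first mismatch.

```python
import random

def avg_comparisons(trials=20000, n=8, seed=1):
    """Estimate the expected number of char comparisons made at one
    alignment position before the naive scan gives up (or matches)."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    total = 0
    for _ in range(trials):
        P = [rng.choice(alphabet) for _ in range(n)]
        T = [rng.choice(alphabet) for _ in range(n)]
        k = 0
        for a, b in zip(T, P):
            k += 1                 # one comparison made
            if a != b:
                break              # mismatch: give up at this position
        total += k
    return total / trials

print(avg_comparisons())   # close to 4/3 ≈ 1.333
```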

8 Biological sequences are not random
e.g. T: aaaaaaaaaaaaaaaaaaaaaaaaa, P: aaaab
Plus: the 4m/3 average case is still bad for long genomic sequences, especially if this has to be done again and again
Smarter algorithms: O(m + n) in the worst case; sub-linear in practice

9 How to speed up? Pre-processing T or P
Why can pre-processing save us time?
- It uncovers the structure of T or P
- It determines when we can skip ahead without missing anything
- It determines when we can infer the result of character comparisons without doing them
Example: T = ACGTAXACXTAXACGXAX, P = ACGTACA

10 Cost for exact string matching
Total cost = cost(preprocessing) + cost(comparison) + cost(output)
(overhead)                (minimize)                (constant)
Hope: gain > overhead

11 String matching scenarios
One T and one P: search a word in a document
One T and many P, all at once: search a set of words in a document; spell checking (fixed P)
One fixed T, many P: search a completed genome for short sequences
Two (or many) T's: search for common patterns
Q: Which one to pre-process?
A: Always pre-process the shorter seq, or the one that is repeatedly used

12 Pre-processing algorithms
Pattern preprocessing:
- Karp-Rabin algorithm: small alphabet and short patterns
- Knuth-Morris-Pratt algorithm (KMP)
- Aho-Corasick algorithm: multiple patterns
- Boyer-Moore algorithm: the choice in most cases; typically sub-linear time
Text preprocessing:
- Suffix tree: very useful for many purposes

13 Karp-Rabin Algorithm
Let's say we are dealing with binary numbers
Text: 10111011001010101001
Pattern: 101100
Convert the pattern to an integer: 101100 = 2^5 + 2^3 + 2^2 = 44

14 Karp-Rabin algorithm
Text: 10111011001010101001
Pattern: 101100 = 44
Slide a 6-bit window along the text and update its value in O(1) per step:
101110 = 2^5 + 2^3 + 2^2 + 2^1 = 46
011101 = 46 * 2 − 1·2^6 + 1 = 29
111011 = 29 * 2 + 1 = 59
110110 = 59 * 2 − 1·2^6 + 0 = 54
101100 = 54 * 2 − 1·2^6 + 0 = 44  → match!

15 Karp-Rabin algorithm
What if the pattern is too long to fit into a single integer?
Pattern: 101100 = 44, but our machine only has 5 bits
Basic idea: hashing — keep every value modulo a prime, e.g. 13
Pattern: 44 % 13 = 5
101110 = 46 (% 13 = 7)
011101 = 29 (% 13 = 3)
111011 = 59 (% 13 = 7)
110110 = 54 (% 13 = 2)
101100 = 44 (% 13 = 5)  → fingerprint hit; verify explicitly, since different windows can collide
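A sketch of the rolling-hash search over binary strings, using the slide's modulus q = 13 (function name and structure are my own):

```python
def karp_rabin_binary(T, P, q=13):
    """Karp-Rabin over '0'/'1' strings: keep fingerprints mod q so they
    fit in a machine word; verify every fingerprint hit explicitly
    because different windows can collide mod q."""
    m, n = len(T), len(P)
    if n > m:
        return []
    high = pow(2, n - 1, q)             # weight of the window's leading bit, mod q
    hp = ht = 0
    for i in range(n):
        hp = (hp * 2 + int(P[i])) % q   # pattern fingerprint
        ht = (ht * 2 + int(T[i])) % q   # first window fingerprint
    hits = []
    for i in range(m - n + 1):
        if ht == hp and T[i:i + n] == P:   # verify to rule out collisions
            hits.append(i)
        if i + n < m:
            # slide: drop the leading bit, shift left, append the next bit
            ht = ((ht - int(T[i]) * high) * 2 + int(T[i + n])) % q
    return hits

print(karp_rabin_binary("10111011001010101001", "101100"))   # → [4]
```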

16 Algorithm KMP
Not the fastest, but the best known
Good for "real-time matching", i.e. the text comes one char at a time, with no memory of previous chars
Idea: left-to-right comparison; shift P by more than one char whenever possible

17 Intuitive example 1
T: abcxabc? ...   (mismatch at ?)
P: abcxabcde
The naive approach would shift P by one char at a time and re-compare from scratch.
Observation: by reasoning on the pattern alone, we can determine that if a mismatch happened when comparing P[8] with T[i], we can shift P by four chars and compare P[4] with T[i], without missing any possible matches.
Number of comparisons saved: 6

18 Intuitive example 2
T: abcxabc? ...   (mismatch at ?, and ? is known not to be a c)
P: abcxabcde
Observation: by reasoning on the pattern alone, we can determine that if a mismatch happened between P[7] and T[j], we can shift P by six chars and compare T[j] with P[1], without missing any possible matches.
Number of comparisons saved: 7

19 KMP algorithm: pre-processing
Key: the reasoning is done without even knowing what string T is. Only the location of the mismatch in P must be known.
[diagram: matched region of T against P; t is a suffix of the matched part, t' a matching prefix of P, with next chars y and z]
Pre-processing: for any position i in P, find P[1..i]'s longest proper suffix t = P[j..i] such that t matches a prefix t' of P, and the next char after t is different from the next char after t' (i.e., y ≠ z)
For each i, let sp(i) = length(t)

20 KMP algorithm: shift rule
Shift rule: when a mismatch occurs between P[i+1] and T[k], shift P to the right by i − sp(i) chars and compare T[k] with P[sp(i)+1].
This shift rule can be implicitly represented by a failure link from position i+1 to position sp(i)+1.
Meaning: when a mismatch occurs between a char x of T and P[i+1], resume comparison between x and P[sp(i)+1].

21 Failure Link Example
P: aataac
If a char in T fails to match at pos 6, re-compare it with the char at pos 3 (= sp(5) + 1 = 2 + 1)
(the matched prefix aataa has longest proper suffix aa that is also a prefix, so sp(5) = 2)

22 Another example
P: abababc
If a char in T fails to match at pos 7, re-compare it with the char at pos 5 (= sp(6) + 1 = 4 + 1)
(the matched prefix ababab has longest proper suffix abab that is also a prefix, so sp(6) = 4)

23 KMP Example using Failure Link
T: aacaataaaaataaccttacta
Match P = aataac against T, following failure links on mismatches; chars already known to match via a failure link need not be re-compared (implicit comparisons).
Time complexity analysis: each char in T may be compared up to n times, so a lousy analysis gives O(mn) time.
More careful analysis: the comparisons break into two phases:
- Comparison phase: the first time a char in T is compared to P. Total is exactly m.
- Shift phase: first comparisons made after a shift. Total is at most m.
Time complexity: O(2m)
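The two phases above can be sketched in Python (my own implementation of the standard KMP with the weak failure function, 0-based rather than the slides' 1-based sp values):

```python
def kmp_search(T, P):
    """KMP: fail[i] = length of the longest proper suffix of P[:i]
    that is also a prefix of P (the sp values, 0-based)."""
    n = len(P)
    fail = [0] * (n + 1)
    k = 0
    for i in range(1, n):
        while k > 0 and P[i] != P[k]:
            k = fail[k]
        if P[i] == P[k]:
            k += 1
        fail[i + 1] = k
    hits = []
    k = 0
    for j, c in enumerate(T):          # scan T, never moving backwards in T
        while k > 0 and c != P[k]:
            k = fail[k]                # follow failure links instead of re-comparing
        if c == P[k]:
            k += 1
        if k == n:
            hits.append(j - n + 1)
            k = fail[k]
    return hits

print(kmp_search("aacaataaaaataaccttacta", "aataac"))   # → [9]
```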

24 KMP algorithm using a DFA (Deterministic Finite Automaton)
P: aataac
Failure link: if a char in T fails to match at pos 6, re-compare it with the char at pos 3
DFA: states 0 to 6; e.g., if the next char in T is t after matching 5 chars, go directly to state 3
All other inputs go to state 0.

25 DFA Example
T:      aacaataataataaccttacta
states: 1201234534534560001001
Each char in T is examined exactly once, so exactly m comparisons are made.
But the pre-processing takes longer, and more space is needed to store the DFA.

26 Difference between Failure Link and DFA
Failure link:
- Preprocessing time and space are O(n), regardless of alphabet size
- Comparison time is at most 2m (at least m)
DFA:
- Preprocessing time and space are O(n·|Σ|)
- May be a problem for very large alphabets, e.g. when each "char" is a big integer, or Chinese characters
- Comparison time is always m

27 Boyer-Moore algorithm
Often the algorithm of choice when there is one T and one P
We will talk about it later if time permits
In its original version it does not guarantee linear time; some modifications do
In practice it is sub-linear

28 The set matching problem
Find all occurrences of a set of patterns in T
First idea: run KMP or BM once for each P
- O(km + n); k: number of patterns, m: length of text, n: total length of patterns
Better idea: combine all patterns together and search in one run

29 A simpler problem: spell-checking
A dictionary contains five words: potato, poetry, pottery, science, school
Given a document, check if any word is (not) in the dictionary
Words in the document are separated by special chars, which makes this relatively easy.

30 Keyword tree for spell checking
Example document: "This version of the potato gun was inspired by the Weird Science team out of Illinois"
[diagram: keyword tree of the five dictionary words; shared prefixes (e.g. pot-) share a path]
O(n) time to construct; n: total length of patterns.
Search time: O(m); m: length of text.
A common prefix only needs to be compared once.
What if there is no space between words?
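A minimal sketch of the keyword tree as a nested-dict trie (my own code; the `"$"` end marker is an implementation choice, not from the slides):

```python
def build_trie(words):
    """Keyword tree as nested dicts; '$' marks the end of a word.
    Common prefixes share a single path, so construction is O(n)
    in the total length of the dictionary words."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def spell_check(trie, document):
    """Return the words of the document that are NOT in the dictionary."""
    unknown = []
    for word in document.lower().split():
        node = trie
        for ch in word:
            if ch not in node:
                node = None
                break
            node = node[ch]
        if node is None or "$" not in node:
            unknown.append(word)
    return unknown

trie = build_trie(["potato", "poetry", "pottery", "science", "school"])
print(spell_check(trie, "the potato gun was inspired by the science team"))
```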

31 Aho-Corasick algorithm
Basis of the fgrep algorithm
Generalizes KMP: uses failure links
Example: given the following 4 patterns: potato, tattoo, theater, other

32 Keyword tree
[diagram: keyword tree of potato, tattoo, theater, other; pattern ends numbered 1-4]

33 Keyword tree
Matching T = potherotathxythopotattooattoo against the keyword tree, restarting from the root after a failure [diagram]

34 Keyword tree
Restarting from the root after every failure gives O(mn); m: length of text, n: length of longest pattern [diagram]

35-37 Keyword Tree with failure links
[diagrams: failure links added to the keyword tree one at a time, then with all failure links shown, while matching T = potherotathxythopotattooattoo]

38-42 Example
Step-by-step matching of T = potherotathxythopotattooattoo using the keyword tree with failure links [diagrams]

43 Aho-Corasick algorithm
O(n) preprocessing and O(m + k) searching; n: total length of patterns, m: length of text, k: number of occurrences.
Can create a DFA as in KMP: requires more space and alphabet-dependent preprocessing time, but search time per char is constant.
Q: Where can this algorithm be used in previous topics?
A: BLAST. Given a query sequence, we generate many seed sequences (k-mers), search for exact matches to these seeds, then extend the exact matches into longer inexact matches.
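A compact sketch of Aho-Corasick (my own implementation: keyword tree as a table of dicts, failure links built by BFS, matches inherited along failure links):

```python
from collections import deque

def aho_corasick(patterns, text):
    """Return (start_position, pattern) for every occurrence in text."""
    # --- build the keyword tree ---
    goto, out, fail = [{}], [[]], [0]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({}); out.append([]); fail.append(0)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    # --- failure links, breadth first (KMP generalized to a tree) ---
    q = deque(goto[0].values())
    while q:
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f][ch] if ch in goto[f] and goto[f][ch] != t else 0
            out[t] += out[fail[t]]       # inherit patterns ending here
    # --- scan the text once ---
    hits, s = [], 0
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits

print(aho_corasick(["potato", "tattoo", "theater", "other"],
                   "potherotathxythopotattooattoo"))
```

On the slides' example text this reports the occurrences of "other" and "tattoo".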

44 Suffix Tree
All algorithms we talked about so far preprocess the pattern(s):
- Karp-Rabin: small pattern, small alphabet
- Boyer-Moore: fastest in practice; O(m) worst case
- KMP: O(m)
- Aho-Corasick: O(m)
In some cases we may prefer to pre-process T: fixed T, varying P
Suffix tree: basically a keyword tree of all suffixes

45 Suffix tree
T: xabxac
Suffixes: xabxac, abxac, bxac, xac, ac, c
[diagram: suffix tree with leaves numbered 1-6]
Naive construction: O(m^2) using Aho-Corasick. Smarter: O(m) — very technical, with a big constant factor.
Difference from a keyword tree: create an internal node only when there is a branch

46 Suffix tree implementation
Explicitly labeling the sequence end: T: xabxa$
(without the terminator $, a suffix that is also a prefix of another suffix, such as xa in xabxa, would end in the middle of an edge) [diagrams]

47 Suffix tree implementation
Implicitly labeling edges: store each edge label as a pair of indices into T (e.g. 1:2, 3:$), so the tree takes O(m) space regardless of label lengths [diagrams]

48 Suffix links
Similar to failure links in a keyword tree
Only internal (branching) nodes are linked [diagram: P = xabcf]

49-58 Suffix tree construction
Step-by-step construction for T = acatgacatt: insert the suffixes one at a time (leaf i for the suffix starting at position i), creating an internal node only where a branch is needed, and labeling edges with index pairs such as 5:$ [diagrams]

59 ST Application 1: pattern matching
Find all occurrences of P = xa in T = xabxac:
- Find the node v in the ST whose path matches P
- Traverse the subtree rooted at v to collect the leaf labels (the locations)
O(m) to construct the ST (large constant factor)
O(n) to find v — linear in the length of P instead of T!
O(k) to get all leaves; k is the number of occurrences.
Asymptotic time is the same as KMP. ST wins if T is fixed; KMP wins otherwise.

60 ST Application 2: set matching
Find all occurrences of a set of patterns in T:
- Build a ST from T
- Match each P against the ST (e.g. T: xabxac, P: xab)
O(m) to construct the ST (large constant factor)
O(n) to find v — linear in the total length of the P's
O(k) to get all leaves; k is the number of occurrences.
Asymptotic time is the same as Aho-Corasick. ST wins if T is fixed; AC wins if the P's are fixed; otherwise it depends on their relative sizes.

61 ST application 3: repeats finding
A genome contains many repeated DNA sequences
Repeat length varies from 1 nucleotide to millions
- Genes may have multiple copies (50 to 10,000)
- Highly repetitive DNA in some non-coding regions: 6 to 10 bp repeated 100,000 to 1,000,000 times
Problem: find all repeats that are at least k residues long and appear at least p times in the genome

62 Repeats finding
Find repeats at least k residues long appearing at least p times in the sequence:
Phase 1: top-down, compute the label length L from the root to each node
Phase 2: bottom-up, count the number N of leaves descended from each internal node
For each node with L >= k and N >= p, print all of its leaves
O(m) to traverse the tree [diagram: nodes annotated with (L, N)]
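For fixed-length repeats, the output of the suffix-tree traversal can be illustrated without a suffix tree at all, using a dictionary of k-mer positions (my own sketch; the suffix tree achieves this in O(m) without enumerating every window and handles all lengths at once):

```python
from collections import defaultdict

def repeats(s, k, p):
    """All length-k substrings occurring at least p times, with positions."""
    positions = defaultdict(list)
    for i in range(len(s) - k + 1):
        positions[s[i:i + k]].append(i)   # record every window's start
    return {kmer: pos for kmer, pos in positions.items() if len(pos) >= p}

print(repeats("acatgacatt", 3, 2))   # → {'aca': [0, 5], 'cat': [1, 6]}
```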

63 Maximal repeats finding
Right-maximal repeat: S[i+1..i+k] = S[j+1..j+k], but S[i+k+1] != S[j+k+1]
Left-maximal repeat: S[i+1..i+k] = S[j+1..j+k], but S[i] != S[j]
Maximal repeat: both, i.e. S[i] != S[j] and S[i+k+1] != S[j+k+1]
Example: in acatgacatt, cat is right-maximal, aca is left-maximal, acat is maximal

64 Maximal repeats finding
In the suffix tree of acatgacatt, find repeats with at least 3 bases and 2 occurrences:
right-maximal: cat; left-maximal: aca; maximal: acat [diagram]

65 Maximal repeats finding
How to find a maximal repeat? A maximal repeat is a right-maximal repeat whose occurrences have different left chars
[diagram: each leaf annotated with the char to its left in T, e.g. g, c, a]

66 ST application 4: word enumeration
Find all k-mers that occur at least p times:
- Compute (L, N) for each node: L = total label length from root to node, N = number of leaves below it
- Find nodes v with L >= k, L(parent) < k, and N >= p
- Traverse the sub-tree rooted at v to get the locations
This can be used in many applications, e.g. to find words that appear frequently in a genome or a document

67 Joint Suffix Tree (JST)
Build a ST for two or more strings
For two strings S1 and S2: let S* = S1 & S2 (concatenated with a separator &)
Build a suffix tree for S* in time O(|S1| + |S2|)
The separator will only appear in edges ending in a leaf

68 Joint suffix tree example
S1 = abcd, S2 = abca, S* = abcd&abca$
[diagram: suffix tree of S*; each leaf is labeled (seq ID, suffix ID), e.g. 1,2 or 2,3; suffixes spanning the separator are useless]

69 To Simplify
Edges containing the separator can be trimmed after it. We don't really need to do anything, since all edge labels are implicit; the right-hand side is simply more convenient to look at [diagrams: before and after]

70 Application 1 of JST
Longest common substring between two sequences
Using Smith-Waterman with gap = mismatch = −infinity: quadratic time
Using the JST: linear time
- For each internal node v, keep a bit vector B: B[1] = 1 if a leaf below v is a suffix of S1 (similarly B[2] for S2)
- Bottom-up: find all internal nodes with B[1] = B[2] = 1
- Report such a node with the longest label
Can be extended to k sequences — just use a longer bit vector [diagram]
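The quadratic alternative the slide mentions can be written as a short DP (my own sketch, space-optimized to two rows; the JST gets the same answer in linear time):

```python
def longest_common_substring(s1, s2):
    """Quadratic DP: L[i][j] = length of the common substring ending
    at s1[i-1] and s2[j-1]; track the best cell seen."""
    best, end = 0, 0
    prev = [0] * (len(s2) + 1)
    for i in range(1, len(s1) + 1):
        cur = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                cur[j] = prev[j - 1] + 1   # extend the diagonal run
                if cur[j] > best:
                    best, end = cur[j], i
        prev = cur
    return s1[end - best:end]

print(longest_common_substring("abcd", "abca"))   # → "abc"
```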

71 Application 2 of JST
Given K strings, find all k-mers that appear in at least (or at most) d strings — the exact motif finding problem
At each node, the bit vector B records which input strings have a suffix passing through it; combine children bottom-up, e.g. B = BitOR(1010, 0011) = 1011
Report nodes with L >= k and cardinality(B) >= d [diagram]

72 Application 3 of JST
Substring problem for sequence databases
Given: a fixed database of sequences (e.g., individual genomes), and a short pattern (e.g., a DNA signature)
Q: Does this DNA signature belong to any individual in the database? i.e., is the pattern a substring of some sequence in the database?
Aho-Corasick doesn't work here (the patterns vary; the database is fixed)
Method: build a JST for the database seqs, match P against the JST, and read the seq IDs off the descendant leaves
(e.g. Seqs: abcd, abca; P1: cd; P2: ac)
This can also be used to design signatures for individuals.

73 Application 4 of JST
Detect DNA contamination
When we clone and sequence a genome, DNA from other sources may contaminate our sample; it should be detected and removed
Given: a fixed database of sequences (possible contamination sources), and a DNA sequence just obtained
Q: Does this DNA contain a long enough substring from the seqs in the database?
Method: build a JST for the database seqs, then scan T using the JST
(e.g. contamination sources: abcd, abca; sequence: dbcgaabctacgtctagt)

74 Summary
One T, one P: Boyer-Moore is the choice; KMP works but is not the best
One T, many P: Aho-Corasick, or a suffix tree
One fixed T, many varying P: suffix tree
Two or more T's: suffix tree, joint suffix tree
Note: some of these methods are alphabet independent, others alphabet dependent (cf. the failure link vs DFA comparison)

75 Boyer-Moore algorithm
Three ideas:
- Right-to-left comparison
- Bad character rule
- Good suffix rule

76 Boyer-Moore algorithm
Right-to-left comparison: compare P against the current window of T starting from P's right end
This lets us skip some chars without missing any occurrence [diagram]

77 Bad character rule
T: xpbctbxabpqqaabpq
P:   tpabxab
(comparing right to left: four matches ^^^^, then a mismatch *)
What would you do now?

78 Bad character rule
T: xpbctbxabpqqaabpq
P:   tpabxab
(mismatch against the text char t; the rightmost t in P is at position 1, so shift P right by 2)
P:     tpabxab

79 Bad character rule
Successive alignments of P = tpabxab under T = xpbctbxabpqqaabpqz, each shift given by the bad character rule [diagram]

80 Basic bad character rule
P: tpabxab
char | rightmost position in P
  a  | 6
  b  | 7
  p  | 2
  t  | 1
  x  | 5
Pre-processing: O(n)

81 Basic bad character rule
T: xpbctbxabpqqaabpqz, P: tpabxab; mismatch at P position i against text char T(k)
When the rightmost occurrence of T(k) in P is to the left of i, shift P to align T(k) with that rightmost occurrence.
Here i = 3 and T(k) = t has rightmost position 1, so shift 3 − 1 = 2.

82 Basic bad character rule
When T(k) does not occur in P at all, shift the left end of P to align with T(k+1).
Here i = 7 and T(k) = q is not in P, so shift 7 − 0 = 7.

83 Basic bad character rule
When the rightmost occurrence of T(k) in P is to the right of i, the rule gives no help; shift P by 1.
Here i = 5 and T(k) = a has rightmost position 6; 5 − 6 < 0, so shift 1.

84 Extended bad character rule
Find the occurrence of T(k) in P that is immediately to the left of i, and shift P to align T(k) with that position.
Here i = 5 and the occurrence of a immediately left of position 5 is at position 3, so shift 5 − 3 = 2.
char | positions in P (right to left)
  a  | 6, 3
  b  | 7, 4
  p  | 2
  t  | 1
  x  | 5
Preprocessing is still O(n)
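A sketch of Boyer-Moore using only the basic bad character rule (my own code; full Boyer-Moore would also take the maximum with the good suffix shift):

```python
def bm_bad_char(T, P):
    """right[c] = rightmost position of c in P (1-based; 0 if absent).
    On a mismatch at P position i against text char c, shift by
    max(1, i - right[c])."""
    m, n = len(T), len(P)
    right = {}
    for i, ch in enumerate(P, start=1):
        right[ch] = i                  # later positions overwrite earlier ones
    hits = []
    k = 0                              # current alignment: P under T[k:k+n]
    while k <= m - n:
        i = n
        while i > 0 and P[i - 1] == T[k + i - 1]:
            i -= 1                     # compare right to left
        if i == 0:
            hits.append(k)
            k += 1
        else:
            bad = T[k + i - 1]
            k += max(1, i - right.get(bad, 0))
    return hits

print(bm_bad_char("xpbctbxabpqqaabpqz", "tpabxab"))   # → [] (no occurrence)
print(bm_bad_char("abcxabcabc", "abc"))               # → [0, 4, 7]
```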

85 Extended bad character rule
Best possible: m / n comparisons
Works better for large alphabet sizes
In some cases the extended bad character rule alone is sufficiently good
Worst case is still O(mn), but the expected time is sublinear

86 Motivation for the good suffix rule
T: prstabstubabvqxrst
P:    qcabdabdab
(two chars matched ^^, then a mismatch * at P position 8)
According to the extended bad character rule, we can only shift by 1 here.

87 (Weak) good suffix rule
T: prstabstubabvqxrst
P:    qcabdabdab
(matched suffix t = ab; shift P so that the rightmost earlier copy of ab aligns with the matched text)

88 (Weak) good suffix rule
Preprocessing: for any suffix t of P, find the rightmost copy of t (other than the suffix itself), denoted t'.
On a mismatch after matching t, shift so that t' aligns with the matched text.
How to find t' efficiently? [diagram]

89 (Strong) good suffix rule
T: prstabstubabvqxrst
P:    qcabdabdab
(two chars matched ^^, then a mismatch * at P position 8)

90 (Strong) good suffix rule
T: prstabstubabvqxrst
P:    qcabdabdab
P:          qcabdabdab
(shift to the copy of ab whose preceding char differs from the d before the matched suffix)

91 (Strong) good suffix rule
In preprocessing: for any suffix t of P, find the rightmost copy t' of t such that the char to the left of t' differs from the char to the left of t (z ≠ y) [diagram]

92 Example preprocessing
P: qcabdabdab
Bad char rule table (where to shift depends on T):
char | positions in P
  a  | 9, 6, 3
  b  | 10, 7, 4
  c  | 2
  d  | 8, 5
  q  | 1
Good suffix rule (does not depend on T): precompute the shift for every possible matched suffix of P.
The largest shift given by either the (extended) bad char rule or the (strong) good suffix rule is used.

93 Time complexity of the BM algorithm
Pre-processing can be done in linear time
With the strong good suffix rule, the worst case is O(m) if P is not in T
If P is in T, the worst case can be O(mn), e.g. T = a^100, P = a^10, unless a modification (Galil's rule) is used
Proofs are technical; skipped.

94 How to actually do pre-processing?
KMP and B-M need similar pre-processing: finding matches between a suffix and a prefix
Both can be done in linear time
P is usually short, so even a more expensive pre-processing may still give an overall gain
KMP: for each i, find the corresponding j (similar to DP, starting from i = 2)
B-M: the symmetric computation from the other end [diagrams]

95 Fundamental pre-processing
Z_i: length of the longest substring starting at position i that matches a prefix of P
i.e., t = t', x ≠ y, Z_i = |t|
With the Z-values computed, we can get the preprocessing for both KMP and B-M in linear time.
Example: P = aabcaabxaaz, Z = 1 0 0 3 1 0 0 2 1 0 (for i = 2..11)
How to compute Z-values in linear time?

96 Computing Z in linear time
Suppose we have already computed all Z-values up to k−1 and need Z_k.
We also keep the start l and end r of the rightmost previous match (the Z-box).
Since P[l..r] matches the prefix P[1..r−l+1], the already-computed value Z_{k−l+1} may tell us about Z_k. [diagram]

97 Computing Z in linear time
Case 1: the previous r is smaller than k (no previous match extends beyond k). Do explicit comparison.
Case 2: Z_{k−l+1} < r − k + 1. Then Z_k = Z_{k−l+1}; no comparison is needed.
Case 3: Z_{k−l+1} >= r − k + 1. Then Z_k >= r − k + 1, and comparisons resume from position r.
No char inside a Z-box is ever the site of more than one mismatch, and there is at most one mismatch per iteration. Therefore O(n).
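The three cases above can be sketched as follows (my own 0-based implementation of the standard Z-algorithm):

```python
def z_values(P):
    """Linear-time Z computation.

    Z[i] = length of the longest substring starting at i that matches a
    prefix of P (0-based; Z[0] is left as 0 by convention here).
    [l, r) is the rightmost match window (Z-box) found so far.
    """
    n = len(P)
    Z = [0] * n
    l = r = 0
    for k in range(1, n):
        if k < r:                      # cases 2/3: k lies inside a known match
            Z[k] = min(r - k, Z[k - l])
        # extend by explicit comparison (case 1, or the case-3 boundary)
        while k + Z[k] < n and P[Z[k]] == P[k + Z[k]]:
            Z[k] += 1
        if k + Z[k] > r:               # new rightmost Z-box
            l, r = k, k + Z[k]
    return Z

print(z_values("aabcaabxaaz"))   # → [0, 1, 0, 0, 3, 1, 0, 0, 2, 1, 0]
```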

98 Z-preprocessing for B-M and KMP
KMP: for each j, set sp'(j + z_j − 1) = z_j
B-M: use the Z-values backwards
Both KMP and B-M preprocessing can be done in O(n) [diagrams]

