Exact String Matching Algorithms


Exact String Matching Algorithms CS 5263 & CS 4233 Bioinformatics

Overview Sequence alignment has two sub-problems: how to score an alignment with errors, and how to find an alignment with the best score. Today: exact string matching. No errors are allowed, so efficiency (time and space) becomes the sole consideration.

Why exact string matching? The most fundamental string comparison problem Often the core of more complex string comparison algorithms E.g., BLAST Often repeatedly called by other methods Usually the most time consuming part Small improvement could improve overall efficiency considerably

Definitions Text: a longer string T (length m). Pattern: a shorter string P (length n). Exact matching: find all occurrences of P in T.

The naïve algorithm
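The naïve algorithm can be sketched in a few lines of Python (an illustrative sketch; the function name is my own, not from the slides):

```python
def naive_match(T, P):
    """Try every start position in T; compare P character by character.
    Worst case O(mn), with m = len(T) and n = len(P)."""
    occurrences = []
    for i in range(len(T) - len(P) + 1):
        j = 0
        while j < len(P) and T[i + j] == P[j]:
            j += 1
        if j == len(P):          # all n characters matched
            occurrences.append(i)
    return occurrences
```

For example, naive_match("aacaataac", "aa") reports every (possibly overlapping) occurrence of "aa".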

Time complexity Worst case: O(mn). How to speed up? Pre-process T or P. Why can pre-processing save time? It uncovers the structure of T or P: it tells us when we can skip ahead without missing anything, and when we can infer the result of character comparisons without doing them.

Cost for exact string matching Total cost = cost(preprocessing) + cost(comparison) + cost(output). Preprocessing is overhead, comparison is what we minimize, and output is constant. Hope: gain > overhead.

String matching scenarios One T and one P: search a word in a document. One T and many P, all at once: search a set of words in a document; spell checking (fixed P). One fixed T, many P: search a complete genome for short sequences. Two (or many) T's: search for common patterns. Q: Which one to pre-process? A: Always pre-process the shorter sequence, or the one that is used repeatedly.

Pre-processing algorithms Pattern preprocessing: Knuth-Morris-Pratt algorithm (KMP); Aho-Corasick algorithm (multiple patterns); Boyer-Moore algorithm (discussed if we have time; the choice in most cases, typically sub-linear time). Text preprocessing: suffix tree (very useful for many purposes); suffix array; Burrows-Wheeler Transform.

Algorithm KMP: intuitive example 1 [figure: P = abcxabcde aligned against T, with a mismatch at P[8]; the naïve approach shifts P by one and restarts]. Observation: by reasoning on the pattern alone, we can determine that if a mismatch happened when comparing P[8] with T[i], we can shift P by four chars and compare P[4] with T[i], without missing any possible matches. Number of comparisons saved: 6.

Intuitive example 2 [figure: same P, mismatch at P[7]; the text character there cannot be a c]. Observation: by reasoning on the pattern alone, we can determine that if a mismatch happened between P[7] and T[j], we can shift P by six chars and compare T[j] with P[1] without missing any possible matches. Number of comparisons saved: 7.

KMP algorithm: pre-processing Key: the reasoning is done without even knowing what string T is; only the location of the mismatch in P must be known. Pre-processing: for any position i in P, find P[1..i]'s longest proper suffix t = P[j..i] that matches a prefix t' of P, such that the char y following t differs from the char z following t' (i.e., y ≠ z). For each i, let sp(i) = length(t).

KMP algorithm: shift rule When a mismatch occurs between P[i+1] and T[k], shift P to the right by i - sp(i) chars and resume by comparing T[k] with P[sp(i)+1]. This shift rule can be represented implicitly by creating a failure link from position i+1 back to position sp(i)+1: when a mismatch occurs between a char x of T and P[i+1], resume the comparison between x and P[sp(i)+1].

Failure link example P: aataac. If a char in T fails to match at pos 6, re-compare it with the char at pos 3 (= 2 + 1). P: a a t a a c; sp(i): 0 1 0 0 2 0.

Another example P: abababc. If a char in T fails to match at pos 7, re-compare it with the char at pos 5 (= 4 + 1). P: a b a b a b c; sp(i): 0 0 0 0 0 4 0.

KMP example using failure links T: aacaataaaaataaccttacta, P: aataac [figure: successive alignments of P against T, with implicit comparisons marked]. Time complexity analysis: each char in T may be compared up to n times, so a lousy analysis gives O(mn) time. A more careful analysis breaks the comparisons into two kinds: the first time a char of T is compared to P (exactly m in total), and the first comparison made after each shift (at most m in total). Time complexity: O(2m) = O(m).
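A compact KMP sketch in Python (function names are my own; this uses the classic prefix-function failure values rather than the slides' strengthened sp(i), which additionally requires the characters after t and t' to differ, but either variant gives O(m + n) matching):

```python
def prefix_function(P):
    """pi[i] = length of the longest proper suffix of P[:i+1]
    that is also a prefix of P (classic KMP failure values)."""
    pi = [0] * len(P)
    k = 0
    for i in range(1, len(P)):
        while k > 0 and P[i] != P[k]:
            k = pi[k - 1]       # fall back along failure links
        if P[i] == P[k]:
            k += 1
        pi[i] = k
    return pi

def kmp_search(T, P):
    """Find all occurrences of P in T in O(m + n) time."""
    pi = prefix_function(P)
    hits, k = [], 0             # k = number of pattern chars matched so far
    for i, c in enumerate(T):
        while k > 0 and c != P[k]:
            k = pi[k - 1]       # shift P without moving back in T
        if c == P[k]:
            k += 1
        if k == len(P):
            hits.append(i - len(P) + 1)
            k = pi[k - 1]       # keep scanning for overlapping matches
    return hits
```

On the slide's example, kmp_search("aacaataaaaataaccttacta", "aataac") finds the single occurrence starting at 0-based position 9.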

KMP algorithm using a DFA (deterministic finite automaton) P: aataac. The failure link says: if a char in T fails to match at pos 6, re-compare it with the char at pos 3. The DFA precomputes every transition instead: e.g., if the next char in T is t after matching 5 chars, go directly to state 3. All other inputs go to state 0. [figure: 6-state DFA for aataac]

DFA example T: aacaataataataaccttacta; states visited: 1201234534534560001001. Each char in T is examined exactly once, so exactly m comparisons are made. But the pre-processing takes longer, and more space is needed to store the automaton.
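One possible way to build this DFA is the standard one-pass trick of copying the restart state's column of transitions (a sketch; the function names and the first-occurrence-only interface are my own):

```python
def build_kmp_dfa(P, alphabet):
    """dfa[c][j] = next state after reading char c in state j.
    State j means: the last j chars read match P[:j]."""
    m = len(P)
    dfa = {c: [0] * m for c in alphabet}
    dfa[P[0]][0] = 1
    x = 0                        # restart state: where the DFA is after P[1:j]
    for j in range(1, m):
        for c in alphabet:
            dfa[c][j] = dfa[c][x]   # mismatch: behave like the restart state
        dfa[P[j]][j] = j + 1        # match: advance
        x = dfa[P[j]][x]
    return dfa

def dfa_find(T, P, alphabet):
    """Return the start index of the first occurrence of P in T, or -1.
    Each character of T is examined exactly once."""
    dfa = build_kmp_dfa(P, alphabet)
    j = 0
    for i, c in enumerate(T):
        j = dfa[c][j] if c in dfa else 0
        if j == len(P):
            return i - len(P) + 1
    return -1
```

On P = aataac this reproduces the slide's transition "t after matching 5 chars goes to state 3", and finds the match in the example text at 0-based position 9.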

Difference between failure links and DFA Failure links: preprocessing time and space are O(n), regardless of alphabet size; comparison time is at most 2m (at least m). DFA: preprocessing time and space are O(n|Σ|), which may be a problem for a very large alphabet, for example when each "char" is a big integer, or for Chinese characters; comparison time is always m.

The set matching problem Find all occurrences of a set of patterns in T. First idea: run KMP or BM separately for each P: O(km + n), where k is the number of patterns, m the length of the text, and n the total length of the patterns. Better idea: combine all patterns together and search in one run.

A simpler problem: spell-checking A dictionary contains five words: potato, poetry, pottery, science, school. Given a document, check whether each word is (or is not) in the dictionary. Words in the document are separated by special chars, which makes this relatively easy.

Keyword tree for spell checking [figure: trie of the five dictionary words, with numbered leaves, matched against "This version of the potato gun was inspired by the Weird Science team out of Illinois"]. O(n) time to construct (n: total length of patterns). Search time: O(m) (m: length of text). A common prefix only needs to be compared once. What if there is no space between words?
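A keyword tree is just a trie of the dictionary; a minimal nested-dict sketch of the spell check (names and the '$' end-marker convention are my own):

```python
def build_trie(words):
    """Nested-dict trie; the key '$' marks the end of a complete word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

def check_document(doc, trie):
    """Return the words of doc that are NOT in the dictionary.
    Each word costs O(len(word)); common prefixes share trie nodes."""
    missing = []
    for word in doc.split():
        node = trie
        for ch in word:
            node = node.get(ch)
            if node is None:     # fell off the trie: not a dictionary word
                break
        if node is None or '$' not in node:
            missing.append(word)
    return missing
```

Note that "pot" walks three edges of the potato/pottery branch but is still rejected, because no '$' marks it as a complete word.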

Aho-Corasick algorithm Basis of the fgrep algorithm. Generalizes KMP by using failure links. Example: given the following 4 patterns: potato, tattoo, theater, other.

Keyword tree [figure: trie of the four patterns, with numbered leaves]

Keyword tree [figure: matching T = potherotathxythopotattooattoo against the trie]. Without failure links, matching restarts from the root after each attempt: O(mn), where m is the length of the text and n the length of the longest pattern.

Keyword tree with a failure link [figure: a failure link lets the search resume at an equivalent node instead of restarting from the root]

Keyword tree with all failure links [figure]

Example [figure, built up over several slides: searching T = potherotathxythopotattooattoo using the keyword tree with failure links]

Aho-Corasick algorithm O(n) preprocessing and O(m + k) searching, where n is the total length of the patterns, m the length of the text, and k the number of occurrences.
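A dict-based sketch of the automaton (goto/fail/output tables built with a BFS; an illustrative, unoptimized version with names of my own choosing):

```python
from collections import deque

def aho_corasick(patterns):
    """Build goto, fail and output tables. Nodes are ints; node 0 is the root."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:                       # insert patterns into the trie
        node = 0
        for ch in p:
            if ch not in goto[node]:
                goto.append({}); fail.append(0); out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(p)
    q = deque(goto[0].values())              # depth-1 nodes fail to the root
    while q:                                 # BFS sets failure links per level
        u = q.popleft()
        for ch, v in goto[u].items():
            q.append(v)
            f = fail[u]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[v] = goto[f][ch] if ch in goto[f] and goto[f][ch] != v else 0
            out[v] += out[fail[v]]           # inherit matches ending here
    return goto, fail, out

def ac_search(text, patterns):
    """Report (start, pattern) for every occurrence, in O(m + k) scanning."""
    goto, fail, out = aho_corasick(patterns)
    hits, node = [], 0
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]                # follow failure links on mismatch
        node = goto[node].get(ch, 0)
        for p in out[node]:
            hits.append((i - len(p) + 1, p))
    return hits
```

On the slides' running example, the text potherotathxythopotattooattoo contains "other" (at 0-based position 1) and "tattoo" (at position 18).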

Suffix tree All the algorithms we have seen so far preprocess the pattern(s): Boyer-Moore (fastest in practice, O(m) worst case), KMP (O(m)), Aho-Corasick (O(m)). In some cases we may prefer to pre-process T instead: fixed T, varying P. A suffix tree is basically a keyword tree of all suffixes of T.

Suffix tree T: xabxac. Suffixes: xabxac, abxac, bxac, xac, ac, c. [figure: suffix tree with numbered leaves] Naïve construction: O(m^2) using Aho-Corasick. Smarter: O(m), but very technical, with a big constant factor. Difference from a keyword tree: an internal node is created only where there is a branch.

Suffix tree implementation Explicitly label the sequence end: T: xabxa$. [figure] This gives a one-to-one correspondence between leaves and suffixes: |T| leaves, hence fewer than |T| internal nodes.

Suffix tree implementation Implicitly label the edges: store (start, end) positions into T, such as 1:2 or 3:$, instead of substrings. [figure] With explicit labels |Tree(T)| = O(|T| + size(edge labels)); with implicit labels it is O(|T|).

Suffix links Similar to failure links in a keyword tree, but only internal nodes with branches are linked. [figure: P = xabcf]

ST application 1: pattern matching Find all occurrences of P = xa in T = xabxac. Find the node v in the suffix tree that matches P, then traverse the subtree rooted at v to get the locations. O(m) to construct the suffix tree (large constant factor); O(n) to find v, linear in the length of P instead of T! O(k) to collect all leaves, where k is the number of occurrences. The asymptotic time is the same as KMP: the suffix tree wins if T is fixed, KMP wins otherwise.

ST application 2: repeats finding The genome contains many repeated DNA sequences. Repeat length varies from 1 nucleotide to millions. Genes may have multiple copies (50 to 10,000). Highly repetitive DNA occurs in some non-coding regions: 6 to 10 bp repeated 100,000 to 1,000,000 times. Problem: find all repeats that are at least k residues long and appear at least p times in the genome.

Repeats finding: at least k residues long, appearing at least p times. Phase 1 (top-down): for each node, compute the total label length L from the root. Phase 2 (bottom-up): count the number N of leaves descending from each internal node. For each node with L >= k and N >= p, print all its leaves. O(m) to traverse the tree.
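The suffix-tree traversal answers this in O(m). Purely for illustration, the fixed-length version of the question (which length-k substrings occur at least p times, and where) can be answered with a hash table in O(mk); this sketch is my own, not the slides' tree algorithm:

```python
from collections import defaultdict

def repeats(seq, k, p):
    """Return {k-mer: positions} for every length-k substring of seq
    occurring at least p times. Hash-based O(mk) sketch; the suffix-tree
    phases above achieve O(m) and also handle longer repeats directly."""
    pos = defaultdict(list)
    for i in range(len(seq) - k + 1):
        pos[seq[i:i + k]].append(i)          # record every k-mer occurrence
    return {w: ps for w, ps in pos.items() if len(ps) >= p}
```

On the later slides' example string acatgacatt, the 3-base repeats occurring at least twice are aca and cat.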

Maximal repeats finding Right-maximal repeat: S[i+1..i+k] = S[j+1..j+k], but S[i+k+1] != S[j+k+1]. Left-maximal repeat: S[i+1..i+k] = S[j+1..j+k], but S[i] != S[j]. Maximal repeat: both S[i] != S[j] and S[i+k+1] != S[j+k+1]. Running example: acatgacatt, with repeats cat, aca, acat.

Maximal repeats finding Positions 1234567890, S = acatgacatt. [figure: suffix tree of acatgacatt$] Find repeats with at least 3 bases and 2 occurrences: right-maximal: cat; left-maximal: aca; maximal: acat.

Maximal repeats finding [figure: the same tree, with each leaf annotated by its left character]. How to find maximal repeats? A maximal repeat is a right-maximal repeat whose occurrences have different left characters.

Joint Suffix Tree (JST) Build a suffix tree for more than one string. For two strings S1 and S2: let S* = S1 & S2, where & is a separator character, and build a suffix tree for S* in time O(|S1| + |S2|). The separator will only appear on edges ending in a leaf (why?).

Joint suffix tree example S1 = abcd, S2 = abca, S* = abcd&abca$. [figure: each leaf is labeled (sequence ID, suffix ID); the suffix starting at the separator, labeled (2, 0), is useless]

To simplify [figure: the same tree with everything at or after a separator trimmed from the edge labels]. We don't really need to do anything, since all edge labels are implicit; the right-hand version is simply more convenient to look at.

Application 1 of JST Longest common substring between two sequences. Using Smith-Waterman with gap = mismatch = -infinity: quadratic time. Using the JST: linear time. For each internal node v, keep a bit vector B: B[1] = 1 if some leaf below v is a suffix of S1, and likewise B[2] for S2. Bottom-up, find all internal nodes with B[1] = B[2] = 1 (the green nodes in the figure), and report such a node with the longest label. This extends to k sequences: just use a bit vector of size k.
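For comparison, a minimal sketch of the quadratic dynamic-programming baseline the slide contrasts against (the JST method is linear; this DP is the simple alternative):

```python
def longest_common_substring(s1, s2):
    """Quadratic-time DP: dp[i][j] = length of the longest common
    suffix of s1[:i] and s2[:j]. Returns one longest common substring."""
    best, end = 0, 0
    dp = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1   # extend the common suffix
                if dp[i][j] > best:
                    best, end = dp[i][j], i
    return s1[end - best:end]
```

On the slides' example strings abcd and abca, the longest common substring is abc.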

Application 2 of JST The substring problem for sequence databases. Given a fixed database of sequences (e.g., individual genomes) and a short pattern (e.g., a DNA signature): does this DNA signature belong to any individual in the database, i.e., is the pattern a substring of some sequence in the database? Aho-Corasick doesn't work here, since it preprocesses the patterns rather than the fixed text. Instead: build a JST for the database sequences, match P against the JST, and read the sequence IDs from the descendant leaves. This can also be used to design signatures for individuals. Example: seqs abcd, abca; P1 = cd, P2 = bc.

Application 3 of JST Given K strings, find all k-mers that appear in at least (or at most) d strings: the exact motif finding problem. [figure: bit vectors are OR'ed bottom-up, e.g. B = BitOR(1010, 0011) = 1011; a node at string depth L >= k whose B has cardinality >= d yields such a k-mer]

Combinatorial motif finding Idea 1: find all k-mers that appear at least m times; m may be chosen so that the number of occurrences is statistically significant. Problem: most motifs have divergence, and each variant may appear only once. Idea 2: find all k-mers under the IUPAC nucleic acid codes, e.g. ASGTKTKAC with S = C/G, K = G/T. Still inflexible. Idea 3: find k-mers that appear approximately at least m times, i.e., allow some mismatches.

Combinatorial motif finding Given a set of sequences S = {x1, ..., xn}. A motif W is a consensus string w1...wK. Find the motif W* with the "best" match to x1, ..., xn. Definition of "best": d(W, xi) = minimum Hamming distance between W and any K-long word in xi; d(W, S) = Σ_i d(W, xi); W* = argmin_W d(W, S).

Exhaustive searches 1. Pattern-driven algorithm: for W = AA...A to TT...T (4^K possibilities), find d(W, S); report W* = argmin d(W, S). Running time: O(K N 4^K), where N = Σ_i |xi|. Guaranteed to find the optimal solution.
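The pattern-driven search transcribes almost directly into Python (an illustrative sketch with names of my own; at O(K N 4^K), only tiny K is feasible):

```python
from itertools import product

def d(W, x):
    """Minimum Hamming distance between W and any |W|-long word of x."""
    k = len(W)
    return min(sum(a != b for a, b in zip(W, x[i:i + k]))
               for i in range(len(x) - k + 1))

def pattern_driven(seqs, k, alphabet="ACGT"):
    """Enumerate all 4^k candidate motifs; return (W*, d(W*, S))."""
    best = min((''.join(w) for w in product(alphabet, repeat=k)),
               key=lambda W: sum(d(W, x) for x in seqs))
    return best, sum(d(best, x) for x in seqs)
```

For example, with sequences that all contain the 3-mer TAT, the search recovers TAT with total distance 0.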

Exhaustive searches 2. Sample-driven algorithm: for each K-char word W occurring in some xi, find d(W, S); report W* = argmin d(W, S), or report a local improvement of W*. Running time: O(K N^2).

Exhaustive searches Problem with the sample-driven approach: if the true motif does not occur in the data and is "weak", then random strings may score better than any instance of the true motif.

Example E. coli promoter "TATA box", ~10 bp upstream of the transcription start: TACGAT, TAAAAT, TATACT, GATAAT, TATGAT, TATGTT. Consensus: TATAAT. Each instance differs by at most 2 bases from the consensus, yet none of the instances matches the consensus perfectly.

Heuristic methods We cannot afford an exhaustive search over all patterns, and sample-driven approaches may miss real patterns. However, a real pattern should not differ too much from its instances in S: start from the space of all words in S and extend it to a space containing the real patterns.

Extended sample-driven (ESD) approaches A hybrid between pattern-driven and sample-driven. Assume each instance differs from the motif by at most α bases (α usually depends on K). Then the real motif will reside in the α-neighborhood of some word in S, so instead of searching all 4^K patterns, we can search the α-neighborhood of every word in S.

Extended sample-driven (ESD) approaches Naïve cost: about N K^α 3^α patterns to test (the α-neighborhoods of all N words in the sequences), times O(NK) to score each.

Better idea Using a joint suffix tree, find all patterns that have length K and appear in at least m sequences with at most α mismatches, then post-process.

WEEDER: algorithm sketch Current pattern P, |P| < K. Keep a list of all eligible nodes: those with at most α mismatches to P. For each node, remember the number of accumulated mismatches (e <= α) and a bit vector B of sequence occurrences, e.g. [011100010]. Bit-OR all the B's to get the sequence occurrences of P. If #occ >= m, the pattern is still valid: now extend P by one letter (A, C, G, or T).

WEEDER: algorithm sketch Simple extension (no branch): B does not change; e may increase by 1 or stay the same; drop the node if e > α. At branches: replace a node with its child nodes, dropping those with e > α; B may change, so re-do the Bit-OR over all B's. Try a different character if #occ < m. Report P when |P| = K.

WEEDER: complexity All patterns can be found in time O(N n C(K, α) 3^α) ~ O(N n K^α 3^α), where n is the number of sequences (needed for the Bit-OR). This beats O(K N 4^K) and O(N K^α 3^α N K), since usually α << K. Still, K^α 3^α may be expensive for large K, e.g. K = 20, α = 6.

WEEDER: more tricks Tighten the eligible nodes: allow at most min(εL, α) mismatches to P, where L is the current pattern length and ε an error ratio. Require the mismatches to be somewhat evenly distributed among positions. Prune the tree at depth K.

Suffix tree memory footprint The space requirements of suffix trees can become prohibitive: |Tree(T)| is about 20|T| in practice. Suffix arrays provide one solution.

Suffix arrays Very space efficient (m integers). Pattern lookup is nearly O(n) in practice, and O(n + log m) worst case with 2m additional integers; independent of alphabet size! Easiest to describe (and construct) using suffix trees, though other (slower) methods exist. Example: T = xabxa$; the suffixes 1: xabxa, 2: abxa, 3: bxa, 4: xa, 5: a sort as a, abxa, bxa, xa, xabxa, giving Pos = 5 2 3 4 1.

Suffix array construction Build the suffix tree for T$, then perform a "lexical" depth-first search of the suffix tree (visiting children in alphabetical order), outputting the suffix label of each leaf encountered. Therefore a suffix array can be constructed in O(m) time.
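The simplest of the slower alternatives the previous slide alludes to is to sort the suffixes directly (O(m^2 log m) worst case versus the O(m) tree-based DFS); a sketch that returns the slides' 1-based Pos array:

```python
def suffix_array(T):
    """Sort-based construction: sort the 1-based suffix start positions
    by the suffix they begin. Simple but O(m^2 log m) in the worst case;
    the lexical DFS over the suffix tree achieves O(m)."""
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])
```

On T = xabxa this reproduces the slides' Pos = 5 2 3 4 1 (suffixes a, abxa, bxa, xa, xabxa in sorted order).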

Suffix array pattern search If P is in T, then all locations of P are consecutive suffixes in Pos, so we can binary search Pos for P: compare P with the suffix at Pos(m/2); if P is lexicographically less, P lies in the first half of Pos; if lexicographically more, in the second half. Iterate!

Suffix array pattern search T: xabxa$, P: abx. [figure: L, M, R binary-search pointers over the sorted suffixes a$, abxa$, bxa$, xa$, xabxa$]
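The binary search can be sketched as two bound searches over the 1-based Pos array (my own sketch; the sentinel chr(0x10FFFF), the largest Unicode code point, stands in for "just past any suffix starting with P"):

```python
def sa_search(T, P, pos):
    """Find all occurrences of P in T using the sorted suffix array pos
    (1-based start positions). Occurrences of P are consecutive in pos,
    so two binary searches bound the range: O(n log m) char comparisons."""
    def lower(target):
        lo, hi = 0, len(pos)
        while lo < hi:
            mid = (lo + hi) // 2
            # compare target against the suffix, truncated to len(target)
            if T[pos[mid] - 1:][:len(target)] < target:
                lo = mid + 1
            else:
                hi = mid
        return lo

    left = lower(P)
    right = lower(P + chr(0x10FFFF))   # just past any suffix starting with P
    return sorted(pos[left:right])
```

On T = xabxa with Pos = [5, 2, 3, 4, 1], searching for xa returns its two start positions, 1 and 4.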

Suffix array binary search How long does it take to compare P with a suffix of T? O(n) worst case, so the binary search on Pos takes O(n log m) time. The worst case is rare, occurring only if many long prefixes of P appear in T; in random or large-alphabet strings we expect fewer than log m comparisons. Combined with an LCP table, the running time becomes O(n + log m): suffix tree = suffix array + LCP table.

Summary One T, one P: Boyer-Moore is the choice; KMP works but is not the best. One T, many P: Aho-Corasick, or a suffix tree (array). One fixed T, many varying P: suffix tree (array). Two or more T's: suffix tree, joint suffix tree. Some variants are alphabet independent (failure links), others alphabet dependent (DFA).