CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.


Boyer-Moore algorithm
Three ideas:
– Right-to-left comparison
– Bad character rule
– Good suffix rule

Boyer-Moore algorithm: right-to-left comparison
Comparing P against T from right to left lets us skip some chars of T without missing any occurrence.

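Extended bad character rule
char: positions in P
a: 6, 3
b: 7, 4
p: 2
t: 1
x: 5
T: xpbctbxabpqqaabpqz
P: tpabxab (* marks the mismatch, ^ marks the matched chars)
On a mismatch at position i of P against text char T(k), find the occurrence of T(k) in P that is immediately to the left of i, and shift P to align T(k) with that position.
Here i = 5 and the closest a to the left of i is at position 3, so shift by 5 – 3 = 2.
After the shift, restart the comparison at the right end of P.
Preprocessing: O(n)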
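The table and shift above can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names `bad_char_table` and `bad_char_shift` are made up here, positions are 1-based as on the slide, and a shift of i is assumed when the text char does not occur to the left of i.

```python
from collections import defaultdict

def bad_char_table(pattern):
    """Map each char to its 1-based positions in pattern, rightmost first."""
    table = defaultdict(list)
    for pos in range(len(pattern), 0, -1):
        table[pattern[pos - 1]].append(pos)
    return table

def bad_char_shift(table, i, text_char):
    """Shift after a mismatch at pattern position i (1-based) against text_char."""
    for pos in table.get(text_char, []):
        if pos < i:
            return i - pos  # align the closest occurrence left of i
    return i  # assumed fallback: move P completely past the text char

table = bad_char_table("tpabxab")
print(table["a"], table["b"])         # [6, 3] [7, 4]
print(bad_char_shift(table, 5, "a"))  # 2
```

Storing the positions rightmost-first means the first entry smaller than i is the one immediately to the left of i.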

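(Strong) good suffix rule
In preprocessing: for any suffix t of P, find the rightmost copy t’ of t such that the char to the left of t’ differs from the char to the left of t.
In searching: when the suffix t of P matches T but the char y to the left of t mismatches the text char x, shift P so that t’ lines up with the matched t in T. The char z to the left of t’ satisfies z ≠ y.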

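Example preprocessing
P: qcabdabdab
Bad char rule (where to shift depends on T):
char: positions in P
a: 9, 6, 3
b: 10, 7, 4
c: 2
d: 8, 5
q: 1
Good suffix rule (does not depend on T):
– suffix ab preceded by d: the rightmost qualifying copy is the ab preceded by c (dab vs. cab)
– suffix abdab preceded by d: the rightmost qualifying copy is the abdab preceded by c (dabdab vs. cabdab)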

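Tricky case
Pattern: abcab
T: xyaabcab
P: abcab (* marks the mismatch at c, ^ marks the matched ab)
The only other copy of the suffix ab is the prefix of P, which has no char to its left, so its entry is position 0. Good suffix table for abcab: N N 0 N N
shift = 4 – 1 = 3 (equivalently i – L = 3 – 0)
After the shift, P lines up with T[4..8] = abcab.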


Example preprocessing
P: q c a b d a b d a b
Good suffix table: N N N N 2 N N 2 N N (one entry per position of P; N means no entry)
Bad char rule (where to shift depends on T): a: 9, 6, 3; b: 10, 7, 4; c: 2; d: 8, 5; q: 1
Good suffix rule (does not depend on T): dab vs. cab; dabdab vs. cabdab
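One way to read the N/2 table is: for a mismatch at position i, the entry is the position just left of the rightmost qualifying copy t’ (0 if t’ is a prefix of P), and the shift is i minus that entry. The sketch below computes the table directly from the definition. It is a quadratic-or-worse illustration, not the linear-time preprocessing, and both the function name `good_suffix_table` and this reading of the table are assumptions.

```python
def good_suffix_table(p):
    """Entry i (1-based mismatch position): position just left of the rightmost
    copy t' of the suffix t = p[i+1..n] whose preceding char differs from p[i].
    None plays the role of N on the slide."""
    n = len(p)
    table = [None] * (n + 1)      # index 0 unused so indices match positions
    for i in range(1, n):         # i = n would leave an empty matched suffix
        t = p[i:]                 # 0-based slice for positions i+1..n
        for s in range(i, 0, -1):  # candidate 1-based start of t', rightmost first
            is_copy = p[s - 1:s - 1 + len(t)] == t
            left_differs = s == 1 or p[s - 2] != p[i - 1]
            if is_copy and left_differs:
                table[i] = s - 1
                break
    return table[1:]

print(good_suffix_table("qcabdabdab"))
print(good_suffix_table("abcab"))
```

For qcabdabdab this gives an entry of 2 at positions 5 and 8 and no entry elsewhere; for abcab it gives 0 at position 3, so the shift there is 3 – 0 = 3.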

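Algorithm KMP: basic idea
In pre-processing: for any position i in P, find the longest proper suffix t of P[1..i] that matches a prefix t’ of P and whose following chars differ, i.e. t = t’ and y ≠ z, where y is the char after t and z is the char after t’.
For each i, let Sp’(i) = length(t).
In searching: when P[1..i] matches T and y mismatches the text char x, shift P so that t’ lines up with t.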

Failure link
P: aataac
Failure links follow from the Sp’(i) values.
If a char in T fails to match at pos 6, re-compare it with the char at pos 3 (= Sp’(5) + 1).
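To make the failure link concrete, the Sp’ values can be computed straight from the definition. The sketch below is a slow, quadratic illustration with a made-up helper name (`sp_prime`); it is not the linear-time preprocessing discussed later.

```python
def sp_prime(p):
    """Sp'(i) for i = 1..n, by direct search over the definition."""
    n = len(p)
    values = []
    for i in range(1, n + 1):
        best = 0
        for length in range(i - 1, 0, -1):  # proper suffixes of p[1..i], longest first
            is_border = p[:length] == p[i - length:i]
            # the chars after t' and after t must differ (no condition when i = n)
            next_differs = i == n or p[length] != p[i]
            if is_border and next_differs:
                best = length
                break
        values.append(best)
    return values

sp = sp_prime("aataac")
print(sp)         # [0, 1, 0, 0, 2, 0]
print(sp[4] + 1)  # 3: position to re-compare after a mismatch at pos 6
```

A mismatch at position i+1 sends the comparison to position Sp’(i) + 1, which for i = 5 is position 3, as on the slide.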

FSA
P: aataac
States 0 to 6, one per matched prefix of P.
All other input goes to state 0.
If the next char in T is t, we go to state 3 (for example from state 5, after matching aataa).

Tricky case
Pattern: abcab
[Figure: failure links and FSA for abcab, with a dummy state.]

How to actually do pre-processing?
Similar pre-processing for KMP and B-M
– Find matches between a suffix and a prefix
– Both can be done in linear time
– P is usually short, so even a more expensive pre-processing may result in an overall gain
For each i, find a j. Similar to DP. Start from i = 2.
[Figure: positions i and j of t and t’ in P, for KMP and for B-M.]

Fundamental pre-processing
Z_i: length of the longest substring of P starting at i that matches a prefix of P
– i.e. t = t’, x ≠ y, Z_i = |t|, where t starts at i and ends at i+Z_i-1, and t’ is the prefix of length Z_i
– With the Z-values computed, we can get the preprocessing for both KMP and B-M in linear time.
Example: P = aabcaabxaaz, Z_2..Z_11 = 1 0 0 3 1 0 0 2 1 0
How to compute Z-values in linear time?

Computing Z in linear time
We have already computed all Z-values up to k-1 and need to compute Z_k. We also know the starting and ending points of the previous match, l and r.
We know that t = t’, i.e. P[l..r] = P[1..r-l+1], therefore the Z-value at k-l+1 may be helpful to us.

Computing Z in linear time
Case 1: The previous r is smaller than k, i.e., no previous match extends beyond k. Do explicit comparison.
Case 2: Z_{k-l+1} < r-k+1. Then Z_k = Z_{k-l+1}. No comparison is needed.
Case 3: Z_{k-l+1} >= r-k+1. Then Z_k >= r-k+1. Comparison starts from r+1.
No char inside the box is compared twice. At most one mismatch per iteration. Therefore, O(n).
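The three cases can be sketched compactly in Python. The version below is a common formulation and makes some assumptions relative to the slides: indices are 0-based, `r` is exclusive, and cases 2 and 3 are folded into one `min` followed by the same extension loop (in case 2 the loop stops immediately). The name `z_values` is made up for this sketch.

```python
def z_values(s):
    """z[k] = length of the longest substring starting at 0-based index k
    that matches a prefix of s; z[0] is left as 0 by convention."""
    n = len(s)
    z = [0] * n
    l = r = 0  # s[l:r] is the rightmost match found so far (r exclusive)
    for k in range(1, n):
        if k < r:
            # inside the previous match: reuse the value at the mirrored
            # position, capped at the part of the box that is still known
            z[k] = min(z[k - l], r - k)
        # explicit comparison; it only ever looks at chars at or beyond r
        while k + z[k] < n and s[z[k]] == s[k + z[k]]:
            z[k] += 1
        if k + z[k] > r:
            l, r = k, k + z[k]
    return z

print(z_values("aabcaabxaaz"))  # [0, 1, 0, 0, 3, 1, 0, 0, 2, 1, 0]
```

Each successful comparison moves `r` to the right, and each iteration has at most one failed comparison, which is where the O(n) bound comes from.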

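Z-preprocessing for B-M and KMP
Both KMP and B-M preprocessing can be done in O(n).
KMP: for each j, let i = j + Z_j - 1; then sp’(i) = Z_j. Process j from n down to 2, so that the smallest j ending at i, which gives the longest match, sets the final value.
B-M: use Z backwards, i.e. compute the Z-values on the reversed pattern.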
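The KMP half of this mapping is short enough to sketch. To keep the example self-contained it uses a naive (quadratic) Z computation; a linear-time Z routine could be swapped in without changing the mapping step. The names `naive_z` and `sp_prime_from_z` are assumptions, and indices are 0-based inside the code.

```python
def naive_z(p):
    """Quadratic Z-values, kept simple so the example stands alone."""
    n = len(p)
    z = [0] * n
    for k in range(1, n):
        while k + z[k] < n and p[z[k]] == p[k + z[k]]:
            z[k] += 1
    return z

def sp_prime_from_z(p):
    """sp[i-1] holds sp'(i); filled from the Z-values as described above."""
    n = len(p)
    z = naive_z(p)
    sp = [0] * n
    for j in range(n - 1, 0, -1):  # right to left, so smaller j overwrites
        if z[j] > 0:
            i = j + z[j] - 1       # 0-based end of the match starting at j
            sp[i] = z[j]
    return sp

print(sp_prime_from_z("aataac"))  # [0, 1, 0, 0, 2, 0]
```

For aataac, both j = 4 and j = 5 (1-based) end at position 5; processing right to left lets the longer match from j = 4 win, giving sp’(5) = 2.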

Keyword tree for spell checking
O(n) time to construct. n: total length of patterns.
Search time: O(m). m: length of word.
A common prefix only needs to be compared once.
Example words: potato, poetry, pottery, science, school
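A keyword tree can be sketched with nested dictionaries. This is a minimal illustration, assuming a "$" key as the end-of-word marker and the made-up helper names `build_keyword_tree` and `contains`.

```python
def build_keyword_tree(words):
    """Nested-dict keyword tree; shared prefixes share nodes."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # assumed end-of-word marker
    return root

def contains(root, word):
    """Walk one edge per char: O(length of word)."""
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

tree = build_keyword_tree(["potato", "poetry", "pottery", "science", "school"])
print(contains(tree, "pottery"))  # True
print(contains(tree, "pot"))      # False: a prefix, but not a stored word
```

The lookup cost depends only on the length of the query word, not on how many words the tree holds.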

Aho-Corasick algorithm
Generalizes KMP
Creates failure links
Basis of the fgrep algorithm
Given the following patterns:
– potato
– tattoo
– theater
– other

Failure link
[Figure: keyword tree for potato, tattoo, theater, other with failure links; node 0 is the root.]
T: potterisapersonwhomakespottery

Failure link
O(n) preprocessing and O(m+k) searching, where k is the number of occurrences.
Can create an FSA similarly. It requires more space, and its preprocessing time depends on alphabet size.

A problem with failure link
Patterns: {potato, other, pot}
[Figure: keyword tree with failure links for these patterns.]

A problem with failure link for multiple patterns
Patterns: {potato, other, pot, the, he, era}
T: potherarac
[Figure: keyword tree with failure links for these patterns.]

Output link
Patterns: {potato, other, pot, the}
T: potherarac
[Figure: keyword tree with failure links and output links.]
Failure link: taken when a mismatch occurs.
Output link: always taken (but the search then returns to the node it came from).
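A compact sketch of a keyword tree with failure links is given below. It makes one simplification that should be read as an assumption: instead of storing explicit output links, each node's output list is merged with the output list of its failure target during preprocessing, which reports the same matches. The function names are made up for this sketch.

```python
from collections import deque

def build_automaton(patterns):
    """Return (goto, fail, out) arrays indexed by node id; node 0 is the root."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({}); fail.append(0); out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)
    queue = deque(goto[0].values())  # depth-1 nodes fail to the root
    while queue:                     # breadth-first, so shallower nodes are done first
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]  # stands in for output links
            queue.append(child)
    return goto, fail, out

def search(text, patterns):
    """Return (0-based start, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    node, hits = 0, []
    for pos, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]        # failure link: taken on a mismatch
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((pos - len(pat) + 1, pat))
    return hits

print(sorted(search("potherarac", ["potato", "other", "pot", "the", "he", "era"])))
```

On potherarac this reports pot, other, the, he and era. Without the merged outputs, the and he would be missed, because the search is walking along the branch for other when they end.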

Suffix Tree
All algorithms we talked about so far preprocess pattern(s)
– Karp-Rabin: small pattern, small alphabet
– Boyer-Moore: fastest in practice. O(m) worst case.
– KMP: O(m)
– Aho-Corasick: O(m)
In some cases we may prefer to pre-process T
– Fixed T, varying P
Suffix tree: basically a keyword tree of all suffixes

Suffix tree
T: xabxac
Suffixes:
1. xabxac
2. abxac
3. bxac
4. xac
5. ac
6. c
Create an internal node only when there is a branch.
Naïve construction: O(m^2) using Aho-Corasick.
Smarter: O(m). Very technical, with a big constant factor.

Suffix tree implementation
Explicitly labeling the seq end
T: xabxa becomes T: xabxa$
[Figure: suffix trees for xabxa and for xabxa$; with $ appended, every suffix ends at its own leaf.]

Suffix tree implementation
Implicitly labeling edges: each edge stores a start:end pair of positions in T instead of the substring itself.
T: xabxa$
[Figure: the same tree with edge labels such as 1:2, 2:2 and 3:$.]

Suffix links
Similar to failure link in a keyword tree
Only link internal nodes having branches
[Figure: a suffix link between two internal nodes of a suffix tree.]

Suffix tree construction
[Figure sequence: step-by-step construction of the suffix tree for acatgacatt..., adding one suffix at a time, with edges labeled by start:end positions.]
With this suffix link, when we later need to add another suffix, say acaty, we can use the link to avoid going back to the root and re-comparing “cat”.

ST Application: pattern matching
Find all occurrences of P = xa in T
– Find node v in the ST that matches P
– Traverse the subtree rooted at v to get the locations
T: xabxac
O(m) to construct ST (large constant factor)
O(n) to find v: linear in the length of P instead of T!
O(k) to get all leaves, k is the number of occurrences.
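The two search steps can be tried out on an uncompressed suffix trie, which is easy to build naively. This is a stand-in, not a real suffix tree: construction here takes O(m^2) time and space, and single-char edges are not merged. The helper names and the "leaf" key are assumptions of this sketch.

```python
def build_suffix_trie(text):
    """Naive uncompressed suffix trie of text + '$' (quadratic size)."""
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1  # 1-based suffix number, as on the slides
    return root

def occurrences(root, pattern):
    """Step 1: walk down along pattern. Step 2: collect the leaves below."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)

trie = build_suffix_trie("xabxac")
print(occurrences(trie, "xa"))  # [1, 4]
```

In a real suffix tree the traversal in step 2 costs O(k); in this trie it also pays for the uncompressed paths below v.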

ST application: repeats finding
Genome contains many repeated DNA sequences
Repeat sequence length: varies from 1 nucleotide to whole gene
– Highly repetitive DNA in some non-coding regions: 6 to 10 bp x 100,000 to 1,000,000 times
– Genes may have multiple copies (50 to 10,000)

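Find longest repeated substring
Do a tree traversal and compute the length L of the path label at each node: O(m)
The internal node with the largest L gives the longest repeated substring.
Example: below a node with L = 4 (edge 2:5), the edge 6:10 leads to a node with L = 9 and the edge 15:18 leads to a node with L = 8.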

Repeats finding
Find all repeats that are at least k residues long and appear at least p times in the seq
– Phase 1: top-down, count lengths of labels at each node (L)
– Phase 2: bottom-up, count # of leaves descended from each internal node (N)
For each node with L >= k and N >= p, print all leaves
O(m) to traverse tree

Repeats finding
Find repeats with at least 3 bases and 2 occurrences in acatgacatt
– cat
– acat
– aca
[Figure: suffix tree for acatgacatt$.]
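The (L, N) bookkeeping can be checked on this example with a small sketch. It uses a naive uncompressed suffix trie as a stand-in for the suffix tree, so it runs in quadratic time and reports every repeated substring, including ones that end in the middle of a suffix-tree edge (such as aca). The function name `repeats` is made up here.

```python
def repeats(text, k, p):
    """Substrings of length >= k that occur at least p times."""
    text += "$"
    root = {}
    for start in range(len(text)):  # naive suffix trie
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1
    result = []

    def visit(node, label):
        """Return N, the number of leaves below node; len(label) plays the role of L."""
        n_leaves = 0
        for key, child in node.items():
            if key == "leaf":
                n_leaves += 1
            else:
                n_leaves += visit(child, label + key)
        if len(label) >= k and n_leaves >= p:
            result.append(label)
        return n_leaves

    visit(root, "")
    return sorted(result)

print(repeats("acatgacatt", 3, 2))  # ['aca', 'acat', 'cat']
```

The recursion depth equals the text length plus one, so this sketch is only suitable for short strings.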

Repeats finding
1. Left-maximal repeat
– S[i+1..i+k] = S[j+1..j+k]
– S[i] != S[j]
2. Right-maximal repeat
– S[i+1..i+k] = S[j+1..j+k]
– S[i+k+1] != S[j+k+1]
3. Maximal repeat
– S[i+1..i+k] = S[j+1..j+k]
– S[i] != S[j], and S[i+k+1] != S[j+k+1]
Example: acatgacatt
1. aca
2. cat
3. acat

Repeats finding
How to find maximal repeats?
– A right-maximal repeat whose occurrences have different left chars
[Figure: suffix tree for acatgacatt$ with the left char of each leaf.]

ST application: word enumeration
Find all k-mers that occur at least p times
– Compute (L, N) for each node
– Find nodes v with L >= k, L(parent) < k, and N >= p
– Traverse the sub-tree rooted at v to get the locations
This can be used in many applications. For example, to find words that appear frequently in a genome or a document.

Joint Suffix Tree
Build a ST for two or more strings
Two strings S1 and S2: S* = S1 & S2
Build a suffix tree for S* in time O(|S1| + |S2|)
The separator will only appear in edges ending in a leaf

S1 = abcd
S2 = abca
S* = abcd&abca$
[Figure: suffix tree for S*; leaves are labeled (string, position): 1,1 2,1 1,2 1,3 1,4 2,2 2,3 2,4.]

To Simplify
We don’t really need to do anything, since all edge labels were implicit. The right-hand side is more convenient to look at.
[Figure: left, the tree for S* where the text after & on leaf edges is useless; right, the same tree with leaf edges cut at the separator.]

Application of JST
Longest common substring
– For each internal node v, keep a bit vector B[2]
– B[1] = 1 if a leaf below v is a suffix of S1; B[2] = 1 if a leaf below v is a suffix of S2
– Find all internal nodes with B[1] = B[2] = 1
– Report one with the longest label
– Can be extended to k sequences. Just use a longer bit vector.
O(m), m the total seq length
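The bit-vector idea can be sketched on a naive joint suffix trie. This is a quadratic stand-in for the joint suffix tree, and it sets the bits top-down while inserting each suffix rather than bottom-up, which gives the same marks in an uncompressed trie. The function name and the dict layout are assumptions of this sketch.

```python
def longest_common_substring(s1, s2):
    """Longest string whose trie node is marked by suffixes of both inputs."""
    root = {"mask": 0, "next": {}}
    for bit, s in ((0b01, s1), (0b10, s2)):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["next"].setdefault(ch, {"mask": 0, "next": {}})
                node["mask"] |= bit  # this node lies on a suffix of string `bit`
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node["next"].items():
            if child["mask"] == 0b11:  # seen in both strings
                if len(label) + 1 > len(best):
                    best = label + ch
                stack.append((child, label + ch))
    return best

print(longest_common_substring("abcd", "abca"))  # abc
```

When several common substrings share the maximum length, this sketch returns one of them, matching the "report one" wording above. Extending it to k strings only needs more bits in the mask.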

Application of JST
Given K strings, find all substrings with L >= l that appear in at least d strings
Exact motif finding problem
Build a joint suffix tree with all strings: S* = S1 & S2 % S3 * S4 … S5 ! S6 + S7
– Use a unique end char for each string
– Not really necessary if caution is taken in construction

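[Figure: nodes with L < k near the root and nodes with L >= k below; leaves labeled 1,x 3,x under B = 1010 and 3,x 4,x under B = 0011.]
Bit vectors are combined bottom-up with bitwise OR: B = 1010 | 0011 = 1011, so |B| = 3.
O(mK), m the total seq length. The factor K is for the bitwise OR of two bit vectors.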

Many other applications
Reproduce the behavior of Aho-Corasick
DNA fingerprinting
– A database of people’s DNA sequences
– Given a short DNA, which person is it from?
Recognizing DNA contamination
Indexing sequence databases
…
Catch
– Large constant factor for space requirement (15-40 bytes per base for DNA)
– Large constant factor for construction
– Suffix array: trade off time for space

Summary
One T, one P
– Boyer-Moore is the choice
– KMP works but not the best
One T, many P
– Aho-Corasick
– Suffix tree
One fixed T, many varying P
– Suffix tree
Two or more T’s
– Suffix tree, joint suffix tree, suffix array
[Legend: alphabet independent / alphabet dependent]