Advanced Data Structure: Bioinformatics (24/02/15)
First week: algorithms for exact string matching.
Second week: approximate search and alignment of short sequences.
Third week: dealing with long sequences.
Advanced Data Structure: bibliography
David W. Mount, Bioinformatics: Sequence and Genome Analysis.
Gonzalo Navarro and Mathieu Raffinot, Flexible Pattern Matching in Strings (2002).
http://www-igm.univ-mlv.fr/~lecroq/string/index.html
http://www.ncbi.nlm.nih.gov/
First week: algorithms for exact string matching
One pattern: the algorithm depends on |p| (pattern length) and |Σ| (alphabet size).
k patterns: the algorithm depends on k, |p| and |Σ|.
Second week: approximate search and alignment of short sequences.
Third week: dealing with long sequences.
Exact string matching for one pattern
How do string matching algorithms perform the search? For instance, given the sequence CTACTACTACGTCTATACTGATCGTAGCTACTACATGC, search for the pattern ACTGA, and for the pattern TACTACGGTATGACTAA. As you have seen this morning ...
Exact string matching: Brute force algorithm
Example: given the pattern ATGTA, the search is

G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
  A T G T A
    A T G T A
      A T G T A
        A T G T A
          A T G T A
Exact string matching: Brute force algorithm
How is the comparison made? From left to right (prefix search): the pattern is compared with the text window character by character, starting at the leftmost position.
Which is the next position of the window? The window is shifted only one cell.
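The brute force search can be sketched in a few lines of Python. This is only an illustration of the two rules above (left-to-right comparison, shift of one cell); the function name brute_force_search is made up for this example and positions are counted from 0.

```python
def brute_force_search(text, pattern):
    """Return the start positions (0-based) of every occurrence of pattern."""
    n, m = len(text), len(pattern)
    hits = []
    for pos in range(n - m + 1):          # the window is shifted only one cell
        j = 0
        # Compare from left to right until a mismatch or the end of the pattern.
        while j < m and text[pos + j] == pattern[j]:
            j += 1
        if j == m:                        # every character of the pattern matched
            hits.append(pos)
    return hits


text = "GTACTAGAGGACGTATGTACTG"
print(brute_force_search(text, "ATGTA"))  # [14]
```

In the worst case every window is compared almost completely, which is why the other algorithms try to shift the window by more than one cell.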
Exact string matching: one pattern
How do matching algorithms perform the search? A window slides along the text, and the pattern is compared against it. At each step the comparison is made and the window is shifted to the right.
What differentiates the algorithms? How the comparison is made, and the length of the shift.
Exact string matching for one pattern
Experimental efficiency (Navarro & Raffinot). [Figure: map of the most efficient algorithm (Horspool, BOM or BNDM) as a function of the alphabet size |Σ| (vertical axis: 2, 4, 8, 16, 32, 64) and the pattern length (horizontal axis: 2, 4, 8, 16, 32, 64, 128, 256); w is marked on the pattern-length axis.]
BNDM: Backward Nondeterministic Dawg Matching
BOM: Backward Oracle Matching
Horspool algorithm
How is the comparison made? By suffix search: the pattern is compared with the text window from right to left.
Which is the next position of the window? Let "a" be the text character aligned with the last position of the pattern. Shift until the next occurrence of "a" in the pattern.
We need a preprocessing phase to construct the shift table.
Horspool algorithm: example
Given the pattern ATGTA, the shift table is filled in one character at a time:

A  4
C  5
G  2
T  1
Horspool algorithm: example
Given the pattern ATGTA, the shift table is: A 4, C 5, G 2, T 1.
The searching phase:

G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
  A T G T A
          A T G T A
              A T G T A
                        A T G T A
                            A T G T A
                                    A T G T A
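A minimal Python sketch of both phases, assuming the DNA alphabet ACGT. The names horspool_shift_table and horspool_search are illustrative, and the list of visited window positions is returned only so that the trace can be compared with the example above (positions are counted from 0).

```python
def horspool_shift_table(pattern, alphabet="ACGT"):
    """Preprocessing: shift for each character aligned with the window's last cell."""
    m = len(pattern)
    table = {c: m for c in alphabet}      # characters absent from the pattern: shift m
    for i in range(m - 1):                # the last pattern position is excluded
        table[pattern[i]] = m - 1 - i     # distance from the rightmost occurrence to the end
    return table


def horspool_search(text, pattern, alphabet="ACGT"):
    """Return (occurrences, visited window positions)."""
    n, m = len(text), len(pattern)
    table = horspool_shift_table(pattern, alphabet)
    hits, windows = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        j = m - 1
        # Suffix search: compare from right to left.
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            hits.append(pos)
        # The shift depends only on the text character under the last window cell.
        pos += table.get(text[pos + m - 1], m)
    return hits, windows


print(horspool_shift_table("ATGTA"))      # {'A': 4, 'C': 5, 'G': 2, 'T': 1}
hits, windows = horspool_search("GTACTAGAGGACGTATGTACTG", "ATGTA")
print(hits)                               # [14]
print(windows)                            # [0, 1, 5, 7, 12, 14]
```

Six windows are inspected on this text, against eighteen for the brute force search.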
Some questions about Horspool algorithm
Given the pattern ATGTA, the shift table is A 4, C 5, G 2, T 1. Given a random text over an equally likely probability distribution (EPD):
1.- Determine the expected shift of the window. And what if the probability distribution is not equally likely?
2.- Determine the expected number of shifts, assuming a text of length n.
3.- Determine the expected number of comparisons in the suffix search phase.
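One possible way to explore these questions numerically is sketched below. It rests on assumptions that are not stated on the slide: text characters are independent, the shift is decided by the character under the last cell of the window, the number of shifts is approximated by (n - m) divided by the expected shift, and a window comparison stops at the first mismatch. The skewed distribution is invented for illustration.

```python
from fractions import Fraction

# Shift table of ATGTA, as given above.
shift = {"A": 4, "C": 5, "G": 2, "T": 1}


def expected_shift(shift, probs):
    """E[shift] = sum over characters of P(character) * shift(character)."""
    return sum(probs[c] * shift[c] for c in shift)


def approx_number_of_shifts(n, m, e_shift):
    """Rough estimate (assumption): the window travels n - m cells in steps of E[shift]."""
    return (n - m) / e_shift


def expected_comparisons(m, sigma):
    """Expected comparisons per window for a uniform alphabet of size sigma:
    the (j+1)-th comparison happens only if the previous j characters matched."""
    return sum(Fraction(1, sigma) ** j for j in range(m))


uniform = {c: Fraction(1, 4) for c in shift}
print(expected_shift(shift, uniform))     # 3

# Hypothetical non-uniform distribution (A and T more frequent).
skewed = {"A": Fraction(2, 5), "C": Fraction(1, 10),
          "G": Fraction(1, 10), "T": Fraction(2, 5)}
print(expected_shift(shift, skewed))      # 27/10

print(expected_comparisons(5, 4))         # 341/256
```

Under the uniform distribution the expected shift is (4 + 5 + 2 + 1) / 4 = 3, so a text of length n needs roughly n / 3 shifts.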
Exact string matching for one pattern
Experimental efficiency (Navarro & Raffinot): the map of Horspool, BOM and BNDM over the alphabet size |Σ| and the pattern length, shown again.
BNDM: Backward Nondeterministic Dawg Matching
BOM: Backward Oracle Matching
BNDM algorithm
How is the comparison made? Search for the suffixes of the text window T that are factors of the pattern P. The positions where such a factor occurs in P are denoted by a bit vector, for instance D2 = ( 1 0 0 0 1 0 0 ).
Once the next character x is read, D3 = (D2 << 1) & B(x), where B(x) is the mask of x in the pattern P. For instance, if B(x) = ( 0 0 1 1 0 0 0 ), then D3 = ( 0 0 0 1 0 0 0 ) & ( 0 0 1 1 0 0 0 ) = ( 0 0 0 1 0 0 0 ).
Which is the next position of the window? It depends on the value of the leftmost bit of D.
BNDM algorithm: example
Given the pattern ATGTA, the masks of the characters are:
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
The searching phase:

G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
          A T G T A
                    A T G T A
                            A T G T A

First window (G T A C T):
D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
Second window (A G A G G):
D1 = ( 0 0 1 0 0 )
D2 = ( 0 1 0 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 0 0 0 )
Third window (A C G T A):
D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
BNDM algorithm: example of window shift
Given the pattern ATGTA, the masks of the characters are:
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
The searching phase:

G T A C T A G A G G A C G T A T G T A C T G ...
                            A T G T A
                                    A T G T A

D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 1 0 1 0 ) = ( 0 1 0 0 0 )
D5 = ( 1 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 )
D6 = ( 0 0 0 0 0 ) & ( * * * * * ) = ( 0 0 0 0 0 )
Found: the leftmost bit of D5 is set once the whole window has been read.
The leftmost bit of D1 was already set after one character, so the next window is shifted by 5 - 1 = 4 cells.
BNDM algorithm: example
Given the pattern ATGTA, the masks of the characters are:
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
How is the shift determined?
The searching phase:

G T A C T A G A A T A C G T A T G T A C T G ...
A T G T A
          A T G T A
                A T G T A

First window (G T A C T):
D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
Second window (A G A A T):
D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 )
D3 = ( 0 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 0 0 0 0 0 )
In the second window the leftmost bit of D2 is set after two characters: the prefix AT of the pattern is a suffix of the window, so the window is shifted by 5 - 2 = 3 cells. In the first window the leftmost bit is never set, so the window is shifted by 5 cells.
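The bit-parallel search can be sketched in Python as follows. This is an illustrative reading of the examples above, not a reference implementation: the names bndm_search, masks, top and last are chosen here, the leftmost bit of a vector corresponds to the first pattern position, and D starts as all ones so that the first AND yields D1 = B(x). Python integers are unbounded, so the usual restriction of the pattern length to the machine word size w does not show up in this sketch.

```python
def bndm_search(text, pattern):
    """Return (occurrences, visited window positions), both 0-based."""
    n, m = len(text), len(pattern)
    full = (1 << m) - 1                   # m ones
    top = 1 << (m - 1)                    # the leftmost bit of D
    masks = {}                            # B(x): bit i from the left is set iff pattern[i] == x
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - i))

    hits, windows = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        j, last, d = m, m, full
        while d != 0:
            d &= masks.get(text[pos + j - 1], 0)   # read the window from right to left
            j -= 1
            if d & top:                   # the characters read so far form a prefix of the pattern
                if j > 0:
                    last = j              # remember it: the window can be aligned with this prefix
                else:
                    hits.append(pos)      # the whole window was read: occurrence found
            d = (d << 1) & full
        pos += last
    return hits, windows


print(bndm_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))  # ([14], [0, 5, 10, 14])
print(bndm_search("GTACTAGAATACGTATGTACTG", "ATGTA"))  # ([14], [0, 5, 8, 13, 14])
```

The second call reproduces the shift of 3 cells from position 5 to position 8 discussed above.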
Extended string matching
Classes of characters: some DNA files or patterns contain additional characters such as N or R, which stand for N = {A,C,G,T} and R = {G,A}.
Bounded length gaps: patterns such as ATx(2,3)TA, where x(2,3) means any 2 or 3 characters.
Optional characters: patterns such as AC?ACT?T?A, where C? means that C may or may not appear in the text.
Wild cards: patterns such as AT*TA, where * means an arbitrarily long string.
Repeatable characters: patterns such as AT[TA]*AT, where [TA]* means that TA can appear zero or more times.
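To make the five notations concrete, each one can be translated into a regular expression for Python's standard re module. This swaps in a general regex engine in place of the dedicated matching algorithms of the course, and the translations are assumptions about the intended meaning: note that * and [TA] mean something different in regex syntax than on the slide, so they are rewritten rather than copied. The helper classes_to_regex is hypothetical and only knows the two classes named above.

```python
import re

# Only the two character classes mentioned above.
CLASSES = {"N": "[ACGT]", "R": "[GA]"}


def classes_to_regex(pattern):
    """Replace each class symbol by the set of characters it stands for."""
    return "".join(CLASSES.get(c, c) for c in pattern)


extended = {
    "ARN":        re.compile(classes_to_regex("ARN")),  # classes: A[GA][ACGT]
    "ATx(2,3)TA": re.compile("AT[ACGT]{2,3}TA"),        # bounded length gap
    "AC?ACT?T?A": re.compile("AC?ACT?T?A"),             # optional characters
    "AT*TA":      re.compile("AT[ACGT]*TA"),            # wild card: any string
    "AT[TA]*AT":  re.compile("AT(?:TA)*AT"),            # TA repeated zero or more times
}

print(bool(extended["ATx(2,3)TA"].fullmatch("ATCGTA")))   # True  (gap of 2)
print(bool(extended["ATx(2,3)TA"].fullmatch("ATCTA")))    # False (gap of 1)
print(bool(extended["AT[TA]*AT"].fullmatch("ATTATAAT")))  # True  (TA twice)
```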
Exact string matching for one pattern
Most efficient algorithms (Navarro & Raffinot): the map of Horspool, BOM and BNDM over the alphabet size |Σ| and the pattern length.
BNDM: Backward Nondeterministic Dawg Matching
BOM: Backward Oracle Matching
Factor Oracle automaton: properties
Factor Oracle of the word G T A T G T A G A T.
All states are accepting states.
It recognizes all factors ... but also more words. Which ones?
If a word is rejected, it isn't a factor, then ...
BOM algorithm (Backward Oracle Matching)
How is the comparison made? With an automaton, the Factor Oracle, which checks the window from right to left.
How many cells are shifted? It depends on the character "a" being read:
If there is no transition for "a" in the automaton ...
If we reach the last state of the automaton with "a" ...
BOM algorithm: example
How is the comparison made? The automaton of the reversed pattern is built: given the pattern ATGTATG, it is the Factor Oracle of GTATGTA. [Figure: automaton with transitions labelled G, A, T.]
And the search is:

G T A C T A G A A T G T G T A G A C A T G T A T G G G A ...
A T G T A T G
            A T G T A T G
                A T G T A T G
                      A T G T A T G
                                    A T G T A T G
                                      A T G T A T G
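A minimal Python sketch of the search, under the usual formulation of BOM: the window is read from right to left through the Factor Oracle of the reversed pattern, and when no transition exists the window is moved just past the character that failed. The names build_factor_oracle, bom_search, delta and supply are illustrative. The final string comparison is a defensive check added for this sketch; the oracle is known to accept only one word of length m, the reversed pattern itself.

```python
def build_factor_oracle(word):
    """Transitions of the Factor Oracle of word, as a list of dicts (state -> {char: state})."""
    m = len(word)
    delta = [{} for _ in range(m + 1)]
    supply = [-1] * (m + 1)               # supply links; state 0 has none
    for i in range(1, m + 1):
        c = word[i - 1]
        delta[i - 1][c] = i
        k = supply[i - 1]
        while k > -1 and c not in delta[k]:
            delta[k][c] = i
            k = supply[k]
        supply[i] = 0 if k == -1 else delta[k][c]
    return delta


def bom_search(text, pattern):
    """Return (occurrences, visited window positions), both 0-based."""
    n, m = len(text), len(pattern)
    delta = build_factor_oracle(pattern[::-1])
    hits, windows = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        state, j = 0, m
        # Read the window from right to left while the oracle has a transition.
        while j > 0 and state is not None:
            state = delta[state].get(text[pos + j - 1])
            j -= 1
        if state is not None and text[pos:pos + m] == pattern:
            hits.append(pos)              # the whole window was read
        pos += j + 1                      # skip past the character that failed
    return hits, windows


text = "GTACTAGAATGTGTAGACATGTATGGGA"
print(bom_search(text, "ATGTATG"))        # ([18], [0, 6, 8, 11, 18, 19])
```

The six visited positions correspond to the six windows drawn in the example.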
Factor Oracle automaton
Given the pattern GTATA, in which state are the factors accepted? [Figure: automaton for G T A, annotated with the factors G, A, T, GT, GTA, TA.]
When the new T is read, 4 factors should be accepted: GTAT, TAT, AT, T. How can it be reached?
When the new A is read, 5 factors should be accepted: GTATA, TATA, ATA, TA, A. How can it be reached?
Factor Oracle automaton
[Figure: automaton for G T A T A, annotated with the factors G, GT, GTA; GTAT, TAT, AT, T; GTATA, TATA, ATA, TA, A.]
When the new G is read, 6 factors should be accepted: GTATAG, TATAG, ATAG, TAG, AG, G.
Factor Oracle automaton: linear algorithm?
Factor Oracle automaton: algorithm
If there is a T transition ...
Factor Oracle automaton: algorithm
But if there isn't a T transition ... and recursively continue ...
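The two cases above can be sketched as an online construction in Python. This follows the usual description of the Factor Oracle with supply links, which is an assumption about what the figures show; the names build_factor_oracle, delta, supply and accepts are illustrative. Each new character adds one state: while the state reached by following supply links has no transition for the character, one is added to the new state and the walk continues; when a transition already exists, the walk stops.

```python
def build_factor_oracle(word):
    """Build the oracle one character at a time; returns (delta, supply)."""
    m = len(word)
    delta = [{} for _ in range(m + 1)]    # delta[state][char] -> state
    supply = [-1] * (m + 1)               # state 0 has no supply link
    for i in range(1, m + 1):
        c = word[i - 1]
        delta[i - 1][c] = i               # transition along the word itself
        k = supply[i - 1]
        # "If there isn't a transition": add one to the new state, and continue.
        while k > -1 and c not in delta[k]:
            delta[k][c] = i
            k = supply[k]
        # "If there is a transition": stop and link the new state to its target.
        supply[i] = 0 if k == -1 else delta[k][c]
    return delta, supply


def accepts(delta, word):
    """All states are accepting: a word is accepted if it can be read from state 0."""
    state = 0
    for c in word:
        if c not in delta[state]:
            return False
        state = delta[state][c]
    return True


delta, supply = build_factor_oracle("GTATGTAGAT")
print(accepts(delta, "TAGAT"))            # True: a factor
print(accepts(delta, "GTG"))              # True, although GTG is not a factor
print(accepts(delta, "GG"))               # False: so GG cannot be a factor
```

Each state gets exactly one supply link and the walk only follows links already built, which is consistent with a linear-time construction. The GTG case illustrates that the oracle recognizes more than the factors, while a rejected word is never a factor.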