Information Retrieval


Information Retrieval 12/09/2018
Modern Information Retrieval (1999), Ricardo Baeza-Yates and Berthier Ribeiro-Neto
Flexible Pattern Matching in Strings (2002), Gonzalo Navarro and Mathieu Raffinot http://static.ppurl.com/chmview-V1JRYFF-BnMAZgFqD1NVOlZ0VzMMZgdqUDABMwI9BWc=/0001.html
Algorithms on strings (2001), M. Crochemore, C. Hancart and T. Lecroq http://www-igm.univ-mlv.fr/~lecroq/string/index.html

String Matching
String matching: definition of the problem (text, pattern).
Exact matching: depends on what we have, text or patterns.
  The patterns ---> Data structures for the patterns
    1 pattern ---> The algorithm depends on |p| and |Σ|
    k patterns ---> The algorithm depends on k, |p| and |Σ|
    Extensions
    Regular expressions
  The text ---> Data structure for the text (suffix tree, ...)
Approximate matching:
  Dynamic programming
  Sequence alignment (pairwise and multiple)
  Sequence assembly: hash algorithm
  Probabilistic search: Hidden Markov Models

Extended string matching
Classes of characters: some DNA files or patterns contain extra characters such as N or R, where N = {A,C,G,T} and R = {G,A}.
Bounded length gaps: we search for a pattern such as ATx(2,3)TA, where x(2,3) means any 2 or 3 characters.
Optional characters: we search for a pattern such as AC?ACT?T?A, where C? means that C may or may not appear in the text.
Wild cards: we search for a pattern such as AT*TA, where * means an arbitrarily long string.
Repeatable characters: we search for a pattern such as AT[TA]*AT, where [TA]* means that TA can appear zero or more times.
As you have seen this morning ....
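To make the five extensions concrete, here is a minimal sketch that rewrites each slide pattern in the syntax of Python's `re` module. This swaps in a general regular-expression engine purely for illustration; it is not one of the bit-parallel algorithms discussed in these slides, and the dictionary name `EXTENDED` is made up for the example.

```python
import re

# Slide notation -> Python regular expression (illustrative mapping only).
EXTENDED = {
    "class R = {G,A}":        r"AT[GA]TA",     # a class of characters inside a pattern
    "gap ATx(2,3)TA":         r"AT.{2,3}TA",   # bounded length gap
    "optional AC?ACT?T?A":    r"AC?ACT?T?A",   # optional characters
    "wild card AT*TA":        r"AT.*TA",       # '*' on the slide is '.*' in re
    "repeatable AT[TA]*AT":   r"AT(TA)*AT",    # the block TA repeated zero or more times
}

def matches(name, candidate):
    """Return True if the whole candidate string is an instance of the pattern."""
    return re.fullmatch(EXTENDED[name], candidate) is not None
```

For instance, `matches("gap ATx(2,3)TA", "ATCGTA")` holds because the gap CG has length 2, while a gap of length 1 is rejected.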

Classes of characters
There are classes of characters represented by one symbol. For instance, the IUPAC code for the DNA alphabet is:
R = {G,A}    Y = {T,C}    K = {G,T}    M = {A,C}    S = {G,C}    W = {A,T}
B = {G,T,C}  D = {G,A,T}  H = {A,C,T}  V = {G,C,A}  N = {A,G,C,T} (any)
1. Classes of characters in the text. There are characters in the text that represent sets of symbols.
2. Classes of characters in the pattern. There are characters in the pattern that represent sets of symbols.

Extended alphabets
First part: classes in the text

Classes in the text: Brute force algorithm
How is the comparison made? The text is over 2^Σ (each text symbol stands for a subset of Σ) and the pattern is over Σ. From left to right: prefix. We need the operation "belongs to a set".
Which is the next position of the window? The window is shifted only one cell.

Classes in the text: Brute force algorithm
When |Σ| < computer word, every subset of Σ is represented by a string of bits of length |Σ|. For instance, given the DNA alphabet Σ = {A,C,G,T}:
I(A) = (1,0,0,0), I(C) = (0,1,0,0), ..., I(R) = I(G,A) = (1,0,1,0), ..., I(N) = (1,1,1,1)
Then the operation "A belongs to set X" is made with I(A) & I(X) > 0.
text    : G T A R T R N A G G A ...
pattern : A T G T A
In the example each window is checked starting from its last cell:
Window G T A R T: I(A) & I(T) > 0? No.
Window T A R T R: I(A) & I(R) > 0, I(T) & I(T) > 0, I(G) & I(R) > 0, I(T) & I(A) > 0? No.
Window A R T R N: I(A) & I(N) > 0, I(T) & I(R) > 0? No.
...
What is the cost?
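The bit-vector membership test can be sketched as follows. The bit order (A, C, G, T from left to right) follows the slide; the names `IUPAC`, `BIT`, `I` and `brute_force` are chosen for this illustration. The sketch checks each window from left to right, which gives the same occurrences as any other order.

```python
# IUPAC classes from the slides; plain letters stand for themselves.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "ACGT",
}
# One bit per letter of the alphabet, written as (A, C, G, T).
BIT = {"A": 0b1000, "C": 0b0100, "G": 0b0010, "T": 0b0001}

def I(symbol):
    """Bit vector of the set of letters that `symbol` stands for."""
    mask = 0
    for letter in IUPAC[symbol]:
        mask |= BIT[letter]
    return mask

def brute_force(text, pattern):
    """Try every window; two cells agree when their bit vectors intersect."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        if all(I(pattern[j]) & I(text[start + j]) for j in range(m)):
            hits.append(start)   # 0-based start of an occurrence
    return hits
```

On the slide's text, `brute_force("GTARTRNAGGA", "ATGTA")` reports the window R T R N A. Every window may need up to m membership tests, so the worst case is proportional to n times m.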

Classes in the text: experimental efficiency (Navarro & Raffinot)
[Figure: map of the fastest algorithm (Horspool, BOM or BNDM) as a function of the alphabet size |Σ| (2 to 64) and the pattern length (2 to 256); w marks the computer word size.]
BNDM: Backward Nondeterministic Dawg Matching
BOM: Backward Oracle Matching

Classes in the text: Horspool algorithm
How is the comparison made? Suffix search.
Which is the next position of the window? Let a be the text character under the last cell of the window. Shift until the next occurrence of "a" (or "t", "r", ...) in the pattern. We need a shift table with the extended alphabet.

Classes in the text: Horspool example
Given the pattern ATGTA, the shift table is:
A 4
C 5
G 2
T 1
R 2
...
N 1
text : G T A R T R N A A G G A ...
       A T G T A
         A T G T A
             A T G T A
                     A T G T A
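A minimal sketch of the extended shift table, assuming that a class symbol in the text takes the smallest shift among the letters it may stand for (the safe choice, and the one that reproduces the slide's values R = 2 and N = 1). The function names are chosen for this example.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "ACGT",
}

def horspool_shifts(pattern):
    """Shift table over the extended alphabet for a pattern over A, C, G, T."""
    m = len(pattern)
    shift = {c: m for c in "ACGT"}
    for j in range(m - 1):                 # the last pattern cell is excluded
        shift[pattern[j]] = m - 1 - j
    for symbol, members in IUPAC.items():  # a class may be any of its members
        shift[symbol] = min(shift[c] for c in members)
    return shift

def horspool_search(text, pattern):
    """Suffix comparison; the shift is read from the last text cell of the window."""
    m, n = len(pattern), len(text)
    shift = horspool_shifts(pattern)
    hits, pos = [], 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] in IUPAC[text[pos + j]]:
            j -= 1
        if j < 0:
            hits.append(pos)
        pos += shift[text[pos + m - 1]]
    return hits
```

With the slide's text, the windows start at 0-based positions 0, 1, 3 and 7, and the occurrence is found in the third one.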

Classes in the text: BNDM algorithm
How is the comparison made? Search for suffixes of the window text T that are factors of the pattern. Once the next character x is read, D3 = (D2 << 1) & B(x), where B(x) is the mask of x in the pattern P. For instance, if D2 = (1 0 0 0 1 0 0) and B(x) = (0 0 1 1 0 0 0), then D3 = (0 0 0 1 0 0 0) & (0 0 1 1 0 0 0) = (0 0 0 1 0 0 0).
Which is the next position of the window? It depends on the value of the leftmost bit of D.

Classes in the text: BNDM example
Given the pattern ATGTA, the bit masks are:
B(A) = (1 0 0 0 1)    B(R) = (1 0 1 0 1)
B(C) = (0 0 0 0 0)    ...
B(G) = (0 0 1 0 0)    B(N) = (1 1 1 1 1)
B(T) = (0 1 0 1 0)
text : G T A R T R N A G G A C G ...
First window (G T A R T), read from right to left:
D1 = (0 1 0 1 0)
D2 = (1 0 1 0 0) & (1 0 1 0 1) = (1 0 1 0 0)
D3 = (0 1 0 0 0) & (1 0 0 0 1) = (0 0 0 0 0)
Second window (R T R N A):
D1 = (1 0 0 0 1)
D2 = (0 0 0 1 0) & (1 1 1 1 1) = (0 0 0 1 0)
D3 = (0 0 1 0 0) & (1 0 1 0 1) = (0 0 1 0 0)
D4 = (0 1 0 0 0) & (0 1 0 1 0) = (0 1 0 0 0)
D5 = (1 0 0 0 0) & (1 0 1 0 1) = (1 0 0 0 0)
...
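The trace can be reproduced with a short BNDM sketch. It assumes, as the masks on the slide suggest, that the leftmost bit corresponds to the first pattern position, and that the mask of a class symbol in the text is the OR of the masks of its members. Names such as `build_masks` and `bndm_search` are chosen for this example.

```python
CLASSES = {"R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
           "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "ACGT"}

def build_masks(pattern):
    """B[x]: positions of the pattern (over A, C, G, T) that x may match."""
    m = len(pattern)
    B = {c: 0 for c in "ACGT"}
    for j, c in enumerate(pattern):
        B[c] |= 1 << (m - 1 - j)           # leftmost bit = first position
    for symbol, members in CLASSES.items():
        B[symbol] = 0
        for c in members:                  # class in the text: OR of members
            B[symbol] |= B[c]
    return B

def bndm_search(text, pattern):
    m, n = len(pattern), len(text)
    B = build_masks(pattern)
    full, top = (1 << m) - 1, 1 << (m - 1)
    hits, pos = [], 0
    while pos <= n - m:
        j, last, D = m, m, full
        while D != 0 and j > 0:
            D &= B[text[pos + j - 1]]      # read the window right to left
            j -= 1
            if D & top:                    # a prefix of the pattern was read
                if j > 0:
                    last = j               # remember it for the shift
                else:
                    hits.append(pos)       # the whole window matches
            D = (D << 1) & full
        pos += last
    return hits
```

On the slide's text the first window yields a shift of 3 and the second window reports the occurrence.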

BOM algorithm (Backward Oracle Matching)
How is the comparison made? With an automaton, the Factor Oracle: check if the suffix is a factor.
Which is the next position of the window? The position determined by the last character of the text with a transition in the automaton.

Classes in the text: BOM example
We build the AFO (factor oracle automaton) of the reversed pattern ATGTATG [automaton figure] and we try to search:
text : G T A R T R N A A T G ...
       A T G T A T G
No improvement is possible!

Multiple string matching
[Figure: four panels, for 5, 10, 100 and 1000 strings, showing the fastest algorithm (Wu-Manber, SBOM, Ad AC) as a function of the alphabet size |Σ| (2, 4, 8) and of lmin (5 to 45).]

Classes in the text: Set Horspool algorithm
How is the comparison made? By suffixes, using the trie of all reversed patterns.
Which is the next position of the window? ?

Set Horspool algorithm
Search for ATGTATG, TATG, ATAAT, ATGTG
1. Construct the trie of the reversed patterns GTATGTA, GTAT, TAATA and GTGTA. [trie figure]
2. Determine lmin = 4.
3. Determine the shift table:
A 1
C 4 (lmin)
G 2
T 1
4. Find the patterns.
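Steps 1 to 3 can be sketched as below, assuming the usual rule that a character's shift is the smallest distance from the end of any pattern to an earlier occurrence of that character, capped at lmin. The function name `set_horspool_tables` is chosen for this example, and the trie itself is left out: only the reversed strings that would be inserted into it are returned.

```python
def set_horspool_tables(patterns, alphabet="ACGT"):
    """Return (reversed patterns, lmin, shift table) for Set Horspool."""
    reversed_patterns = [p[::-1] for p in patterns]   # strings stored in the trie
    lmin = min(len(p) for p in patterns)
    shift = {c: lmin for c in alphabet}               # default: shift by lmin
    for p in patterns:
        for dist in range(1, len(p)):                 # skip the last character
            c = p[len(p) - 1 - dist]
            shift[c] = min(shift[c], dist)
    return reversed_patterns, lmin, shift
```

Calling it on the four patterns of the slide gives lmin = 4 and the shifts A 1, C 4, G 2, T 1.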

Classes in the text: Set Horspool
Search for the patterns ATGTATG, TATG, ATAAT, ATGTG [trie figure]
text: ARTGNCTATGTGACA…
No improvement is possible!

Classes in the text: SBOM algorithm
How is the comparison made? With an automaton, the Factor Oracle of the reversed patterns of length lmin: check if the suffix is a factor of any pattern.
Which is the next position of the window? The position determined by the last character of the text with a transition in the automaton.

Classes in the text: SBOM example
Search for the patterns ATGTATG, TAATG, TAATAAT and AATGTG [factor oracle figure]
text: ACATN C TAGC TA TA ATAATGTATG
No improvement is possible!

Extended alphabets
Classes in the:   text   pattern
Horspool           ✓
BNDM               ✓
BOM                ✗
Set-Horspool       ✗
SBOM               ✗

Extended search
Second part: classes in the pattern

Classes in the pattern: Brute force algorithm
How is the comparison made? The text is over Σ and the pattern is over 2^Σ (each pattern symbol stands for a subset of Σ). From left to right: prefix. We need the operation "belongs to a set".
Which is the next position of the window? The window is shifted only one cell.

Classes in the pattern: Brute force algorithm
When |Σ| < computer word, every subset is represented by a string of bits of length |Σ|. For instance, given the DNA alphabet Σ = {A,C,G,T}:
I(A) = (1,0,0,0), I(C) = (0,1,0,0), ..., I(R) = (1,0,1,0), ..., I(N) = (1,1,1,1)
Then the operation "A belongs to set X" is made with I(A) & I(X) > 0.
text    : G T A C T A G A G G A C G T A T G T A C T G ...
pattern : A T N T R
In the example each window is checked starting from its last cell:
Window G T A C T: I(T) & I(R) > 0? No.
Window T A C T A: I(A) & I(R) > 0, I(T) & I(T) > 0, I(C) & I(N) > 0, I(A) & I(T) > 0? No.
...

Classes in the pattern: Horspool algorithm
How is the comparison made? Suffix search.
Which is the next position of the window? Let a be the text character under the last cell of the window. Shift until the next occurrence of "a" in the pattern. We need a preprocessing phase to construct the shift table.

Classes in the pattern: Horspool example
Given the pattern ATNTR, the shift table is:
A 2
C 2
G 2
T 1
text : G T A C T A G A T A T G A G ...
       A T N T R
         A T N T R
             A T N T R
                 A T N T R
                     A T N T R
Shorter shifts than with the exact pattern A T G T A! Since N stands for any symbol, no shift can be longer than the distance from N to the end of the pattern.
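A minimal sketch of the preprocessing when the classes are in the pattern: each pattern cell updates the shift of every letter it may stand for. The helper name and the reduced `CLASSES` table are assumptions made for this example.

```python
CLASSES = {"R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
           "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "ACGT"}

def horspool_shifts_class_pattern(pattern):
    """Shift table over A, C, G, T for a pattern that may contain classes."""
    m = len(pattern)
    shift = {c: m for c in "ACGT"}
    for j in range(m - 1):                          # the last cell is excluded
        for c in CLASSES.get(pattern[j], pattern[j]):
            shift[c] = m - 1 - j                    # later cells overwrite earlier ones
    return shift
```

For ATNTR the N in the third cell sets every shift to 2, and the T in the fourth cell lowers the shift of T to 1, which is the table on the slide.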

Classes in the pattern: BNDM example
Given the pattern ATNTR, the bit masks of the symbols are:
B(A) = (1 0 1 0 1)
B(C) = (0 0 1 0 0)
B(G) = (0 0 1 0 1)
B(T) = (0 1 1 1 0)
text : G T A C T A G A G G A C G T A T G T A C T G ...
First window (G T A C T), read from right to left:
D1 = (0 1 1 1 0)
D2 = (1 1 1 0 0) & (0 0 1 0 0) = (0 0 1 0 0)
D3 = (0 1 0 0 0) & (1 0 1 0 1) = (0 0 0 0 0)
Second window (A G A G G):
D1 = (0 0 1 0 1)
D2 = (0 1 0 1 0) & (0 0 1 0 1) = (0 0 0 0 0)
Third window (A C G T A):
D1 = (1 0 1 0 1)
D2 = (0 1 0 1 0) & (0 1 1 1 0) = (0 1 0 1 0)
D3 = (1 0 1 0 0) & (0 0 1 0 1) = (0 0 1 0 0)
D4 = (0 1 0 0 0) & (0 0 1 0 0) = (0 0 0 0 0)
...
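The masks and the D values of the trace can be checked with a small sketch. It assumes the leftmost bit corresponds to the first pattern position, as the masks on the slide suggest; `class_pattern_masks` and `backward_trace` are names made up for this example, and the shift logic of BNDM is deliberately left out.

```python
CLASSES = {"R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
           "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "ACGT"}

def class_pattern_masks(pattern):
    """B[c]: pattern positions whose symbol (letter or class) accepts c."""
    m = len(pattern)
    B = {c: 0 for c in "ACGT"}
    for j, symbol in enumerate(pattern):
        for c in CLASSES.get(symbol, symbol):
            B[c] |= 1 << (m - 1 - j)
    return B

def backward_trace(window, B):
    """D after each character of the window, read from right to left."""
    m = len(window)
    full = (1 << m) - 1
    D, states = full, []
    for c in reversed(window):
        D &= B[c]
        states.append(format(D, "0{}b".format(m)))
        if D == 0:                 # no factor of the pattern left: stop reading
            break
        D = (D << 1) & full
    return states
```

Running `backward_trace("GTACT", class_pattern_masks("ATNTR"))` gives the three vectors of the first window.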

Classes in the pattern: BOM example
Given the pattern ATGTATG, the AFO is [automaton figure]. But what about the pattern ATNTRTG? We should apply the SBOM algorithm!

Set Horspool algorithm
How is the comparison made? By suffixes, using the trie of all reversed patterns.
Which is the next position of the window? Let a be the text character under the last cell of the window. We shift until a is aligned with the first a in the trie, if it is at a distance not longer than lmin; otherwise we shift by lmin.

Set Horspool algorithm
Search for ATNTARG, RTGR, NTTNAR, ATRTG
1. Construct the trie of the 46 possible reversed patterns.
2. Determine lmin = 4.
3. Determine the shift table:
A 1
C 2
G 1
T 1
4. Find the patterns.
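The figure of 46 strings comes from replacing every class symbol by each of its members. A small sketch of that expansion, with `expand` as a name chosen for this example:

```python
from itertools import product

CLASSES = {"R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
           "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "ACGT"}

def expand(pattern):
    """All plain strings over A, C, G, T that the class pattern stands for."""
    choices = [CLASSES.get(symbol, symbol) for symbol in pattern]
    return ["".join(letters) for letters in product(*choices)]

patterns = ["ATNTARG", "RTGR", "NTTNAR", "ATRTG"]
counts = [len(expand(p)) for p in patterns]   # one count per class pattern
```

The four patterns expand to 8, 4, 32 and 2 strings, 46 in total, and the number grows exponentially with the number of class symbols per pattern.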

SBOM algorithm
How is the comparison made? With an automaton, the Factor Oracle of the reversed patterns of length lmin: check if the suffix is a factor of any pattern.
Which is the next position of the window? The position determined by the last character of the text with a transition in the automaton.

Classes in the patterns: SBOM example
Given the patterns ATGNARG, TRATR, TAATAAT and ANTNTGR, the Factor Oracle automaton of all 21 possible patterns is built …

Extended alphabets
Classes in the:   text   pattern
Horspool           ✓       ✓
BNDM               ✓       ✓
BOM                ✗       ≈
Set-Horspool       ✗       ≈
SBOM               ✗       ≈

Bounded length gaps: BNDM example
Given the pattern ATx(2,3)TA, the bit masks are:
B(A) = (1 0 1 1 1 0 1)
B(C) = (0 0 1 1 1 0 0)
B(G) = (0 0 1 1 1 0 0)
B(T) = (0 1 1 1 1 1 0)
text : A T A G T A G A G T ...
D1 = (1 0 1 1 1 0 1)
D2 = (0 1 1 1 0 1 0) & (0 1 1 1 1 1 0) = (0 1 1 1 0 1 0)
D3 = (1 1 1 0 1 0 0) & (0 0 1 1 1 0 0) = (0 0 1 0 1 0 0)
D4 = (0 1 0 1 0 0 0) & (1 0 1 1 1 0 1) = (0 0 0 1 0 0 0) ?
D5 = (0 0 1 0 0 0 0) & (0 1 1 1 1 1 0) = (0 0 1 0 0 0 0) ?
D6 = (0 1 0 0 0 0 0) & (1 0 1 1 1 0 1) = (0 0 0 0 0 0 0)
With these updates alone D6 is empty, although A T A G T A is an occurrence with a gap of length 2: the optional gap character has to be skipped.
Let's see the automaton: [Figure: automaton for A T x(2,3) T A with an ε-transition over the optional gap character; the masks I and F mark where the ε-transitions begin and end.]
The ε-transitions are simulated by adding to D the states [F - (I & D)] & ¬F. For D = (0 0 0 1 0 0 0):
  0 1 0 0 0 0 0   (F)
- 0 0 0 1 0 0 0   (I & D)
= 0 0 1 1 0 0 0
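The subtraction trick can be tried in isolation. This is only a sketch of the ε-closure step under stated assumptions: the values of `F` and `I` are read off the worked subtraction on the slide, and OR-ing the result into D is an assumption about how the new states are combined with the old ones; the function name `gap_closure` is made up for this example.

```python
F = 0b0100000   # assumed from the slide: state where the epsilon-transitions end
I = 0b0001000   # assumed from the slide: state where the epsilon-transitions begin

def gap_closure(D, I=I, F=F):
    """Activate every state between an active I state and F (F excluded).

    Subtracting (I & D) from F turns on all bits from the I bit up to the
    bit just below F; when the I state is not active, F - 0 = F and the
    final '& ~F' leaves nothing to add.
    """
    return D | ((F - (D & I)) & ~F)

D4 = 0b0001000
closed = gap_closure(D4)    # the subtraction shown on the slide
```

Here `closed` is (0 0 1 1 0 0 0), so the state to the left of the optional gap character becomes active as well, while a vector whose I bit is off, such as (0 0 1 0 1 0 0), is returned unchanged.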