Recuperació de la informació Modern Information Retrieval (1999) Ricardo-Baeza Yates and Berthier Ribeiro-Neto Flexible Pattern Matching in Strings (2002)

Slides:



Advertisements
Similar presentations
Tuned Boyer Moore Algorithm
Advertisements

Space-for-Time Tradeoffs
MSc Bioinformatics for H15: Algorithms on strings and sequences
Factor Oracle, Suffix Oracle 1 Factor Oracle Suffix Oracle.
1 String Matching of Bit Parallel Suffix Automata.
Recuperació de la informació Modern Information Retrieval (1999) Ricardo-Baeza Yates and Berthier Ribeiro-Neto Flexible Pattern Matching in Strings (2002)
Advisor: Prof. R. C. T. Lee Reporter: Z. H. Pan
1 Advisor: Prof. R. C. T. Lee Speaker: G. W. Cheng Two exact string matching algorithms using suffix to prefix rule.
The chromosomes contains the set of instructions for alive beings
Boyer-Moore string search algorithm Book by Dan Gusfield: Algorithms on Strings, Trees and Sequences (1997) Original: Robert S. Boyer, J Strother Moore.
Knuth-Morris-Pratt Algorithm left to right scan like the naïve algorithm one main improvement –on a mismatch, calculate maximum possible shift to the right.
Master Course MSc Bioinformatics for Health Sciences H15: Algorithms on strings and sequences Xavier Messeguer Peypoch (
Master Course MSc Bioinformatics for Health Sciences H15: Algorithms on strings and sequences Xavier Messeguer Peypoch (
Smith Algorithm Experiments with a very fast substring search algorithm, SMITH P.D., Software - Practice & Experience 21(10), 1991, pp Adviser:
Recuperació de la informació Modern Information Retrieval (1999) Ricardo-Baeza Yates and Berthier Ribeiro-Neto Flexible Pattern Matching in Strings (2002)
1 KMP algorithm Advisor: Prof. R. C. T. Lee Reporter: C. W. Lu KNUTH D.E., MORRIS (Jr) J.H., PRATT V.R.,, Fast pattern matching in strings, SIAM Journal.
Quick Search Algorithm A very fast substring search algorithm, SUNDAY D.M., Communications of the ACM. 33(8),1990, pp Adviser: R. C. T. Lee Speaker:
Recuperació de la informació Modern Information Retrieval (1999) Ricardo-Baeza Yates and Berthier Ribeiro-Neto Flexible Pattern Matching in Strings (2002)
Backward Nondeterministic DAWG Matching Algorithm
String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:
Raita Algorithm T. RAITA Advisor: Prof. R. C. T. Lee
Indexing and Searching
Modern Information Retrieval Chapter 4 Query Languages.
String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:
On the Use of Regular Expressions for Searching Text Charles L.A. Clarke and Gordon V. Cormack Fast Text Searching.
Recuperació de la informació Modern Information Retrieval (1999) Ricardo-Baeza Yates and Berthier Ribeiro-Neto Flexible Pattern Matching in Strings (2002)
String Matching. Problem is to find if a pattern P[1..m] occurs within text T[1..n] Simple solution: Naïve String Matching –Match each position in the.
String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:
String Matching Chapter 32 Highlights Charles Tappert Seidenberg School of CSIS, Pace University.
Chapter 7 Space and Time Tradeoffs James Gain & Sonia Berman
Advisor: Prof. R. C. T. Lee Speaker: T. H. Ku
MA/CSSE 473 Day 24 Student questions Quadratic probing proof
MCS 101: Algorithms Instructor Neelima Gupta
Exact String Matching Algorithms: A Survey Mehreen Ali, Hina Naz Khan, Shumaila Sayyab, Nadeem Iftikhar Department of Bio-Science Mohammad Ali Jinnah University,
Application: String Matching By Rong Ge COSC3100
Regular Grammars Chapter 7. Regular Grammars A regular grammar G is a quadruple (V, , R, S), where: ● V is the rule alphabet, which contains nonterminals.
Regular Grammars Chapter 7 1. Regular Grammars A regular grammar G is a quadruple (V, , R, S), where: ● V is the rule alphabet, which contains nonterminals.
String Matching of Regular Expression
MCS 101: Algorithms Instructor Neelima Gupta
String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching:
Hannu Peltola Jorma Tarhio Aalto University Finland Variations of Forward-SBNDM.
 Author: Ricardo A. Baeza-Yates, Gaston H. Gonnet  Publisher: 1992 Communications of the ACM  Presenter: Yuen-Shuo Li  Date: 2013/08/14 1.
Fundamental Data Structures and Algorithms
ICS220 – Data Structures and Algorithms Analysis Lecture 14 Dr. Ken Cosh.
Finding approximate occurrences of a pattern that contains gaps Inbok Lee Costas S. Iliopoulos Alberto Apostolico Kunsoo Park.
Recuperació de la informació Modern Information Retrieval (1999) Ricardo-Baeza Yates and Berthier Ribeiro-Neto Flexible Pattern Matching in Strings (2002)
CSG523/ Desain dan Analisis Algoritma
Advanced Data Structure: Bioinformatics
Source : Practical fast searching in strings
Exact string matching: one pattern (text on-line)
Recuperació de la informació
13 Text Processing Hongfei Yan June 1, 2016.
Indexing and Searching (File Structures)
Space-for-time tradeoffs
Chapter 7 Regular Grammars
Adviser: R. C. T. Lee Speaker: C. W. Cheng National Chi Nan University
Chapter 7 Space and Time Tradeoffs
Pattern Matching 1/14/2019 8:30 AM Pattern Matching Pattern Matching.
Space-for-time tradeoffs
Tècniques i Eines Bioinformàtiques
Space-for-time tradeoffs
Recuperació de la informació
String Matching 11/04/2019 String matching: definition of the problem (text,pattern) Exact matching: depends on what we have: text or patterns The patterns.
Tècniques i Eines Bioinformàtiques
Pattern Matching Pattern Matching 5/1/2019 3:53 PM Spring 2007
String Matching Algorithm
Space-for-time tradeoffs
Pattern Matching 4/27/2019 1:16 AM Pattern Matching Pattern Matching
Improved Two-Way Bit-parallel Search
MA/CSSE 473 Day 27 Student questions Leftovers from Boyer-Moore
Presentation transcript:

Recuperació de la informació Modern Information Retrieval (1999) Ricardo-Baeza Yates and Berthier Ribeiro-Neto Flexible Pattern Matching in Strings (2002) Gonzalo Navarro and Mathieu Raffinot Algorithms on strings (2001) M. Crochemore, C. Hancart and T. Lecroq

String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching: 1 pattern ---> The algorithm depends on |p| and |  | k patterns ---> The algorithm depends on k, |p| and |  | The text ----> Data structure for the text (suffix tree,...) The patterns ---> Data structures for the patterns Dynamic programming Sequence alignment (pairwise and multiple) Extensions Regular Expressions Probabilistic search: Sequence assembly: hash algorithm Hidden Markov Models

Extended string matching There are characters in the text that represent sets of simbols 1. Classes of characters in the tetx. There are characters in the pattern that represent sets of simbols 2. Classes of characters in the pattern. There are classes of characters represented by one symbol. For instace the IUPAC code for the DNA alphabet is: R = {G,A} Y = {T,C} K = {G,T} M = {A,C} S = {G,C} W = {A,T} B = {G,T,C } D = {G,A,T} H = {A,C,T} V = {G,C,A} N = {A,G,C,T} (any)

Extended alphabets First part Classes in the text

Classes in the text: Brute force algorithm Text : over 2 |∑| Pattern over  From left to right: prefix Which is the next position of the window? How the comparison is made? Pattern : Text : The window is shifted only one cell We need the operation: belongs to a set ? ?

Classes in the text: Brute force algorithm Every subset of  is represented by a string of bits of length |  |. When |  | < computer word For instance, given the DNA alphabet  ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=(,,, )

Classes in the text: Brute force algorithm Every subset of  is represented by a string of bits of length |  |. When |  | < computer word For instance, given the DNA alphabet  ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=(1,0,1,0)...I(N)=(,,, )

Classes in the text: Brute force algorithm Every subset of  is represented by a string of bits of length |  |. When |  | < computer word For instance, given the DNA alphabet  ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=(1,0,1,0)...I(N)=(1,1,1,1) Then the operation “ A belongs to set X ” is made with...

Classes in the text: Brute force algorithm Every subset of  is represented by a string of bits of length |  |. G T A R T R N A G G A... A T G T A When |  | < computer word For instance, given the DNA alphabet  ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=(1,0,1,0)...I(N)=(1,1,1,1) Then the operation “ A belongs to set X ” is made with I(A) and I(X) >0 I(A) & I(T)>0

Classes in the text: Brute force algorithm Every subset of  is represented by a string of bits of length |  |. G T A R T R N A G G A... A T G T A When |  | < computer word For instance, given the DNA alphabet  ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=(1,0,1,0)...I(N)=(1,1,1,1) Then the operation “ A belongs to set X ” is made with I(A) and I(X) >0 I(A) & I(T)>0 I(A) & I(R)>0 I(T) & I(T)>0I(G) & I(R)>0I(T) & I(A)>0

Classes in the text: Brute force algorithm Every subset of  is represented by a string of bits of length |  |. G T A R T R N A G G A... A T G T A When |  | < computer word For instance, given the DNA alphabet  ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=(1,0,1,0)...I(N)=(1,1,1,1) Then the operation “ A belongs to set X ” is made with I(A) and I(X) >0 I(A) & I(T)>0 I(A) & I(R)>0 I(T) & I(T)>0I(G) & I(R)>0I(T) & I(A)>0 I(A) & I(N)>0 I(T) & I(R)>0... Which is the cost?

Classes in the text Experimental efficiency (Navarro & Raffinot) |  | Long. pattern Horspool BNDM BOM BNDM : Backward Nondeterministic Dawg Matching BOM : Backward Oracle Matching w

Classes in the text: Horspool algorithm Text : Pattern : Sufix search Which is the next position of the window? How the comparison is made? Pattern : Text : a Shift until the next ocurrence of “a” (or “t”,”r”,…) in the pattern: a aa aa a We need a shift table with the extended alphabet.

Classes in the text :Horspool example Given the pattern ATGTA The shift table is: A 4 C 5 G 2 T 1 R ? … N ?

Classes in the text :Horspool example Given the pattern ATGTA The shift table is: A 4 C 5 G 2 T 1 R 2 … N ?

Classes in the text :Horspool example Given the pattern ATGTA The shift table is: A 4 C 5 G 2 T 1 R 2 … N 1 text :G T A R T R N A A G G A … A T G T A

Classes in the text :Horspool example Given the pattern ATGTA The shift table is: A 4 C 5 G 2 T 1 R 2 … N 1 text :G T A R T R N A A G G A... A T G T A …

Classes in the text Experimental efficiency (Navarro & Raffinot) |  | Long. pattern Horspool BNDM BOM BNDM : Backward Nondeterministic Dawg Matching BOM : Backward Oracle Matching w

Text : Pattern : Search for suffixes of T that are factors of the pattern Classes in the text: BNDM algorithm Which is the next position of the window ? How the comparison is made? …that is denoted as D 2 = Depends on the value of the leftmost bit of D Once the next character x is read D 3 = D 2 <<1 & B(x) B(x): mask of x in the pattern P. For instance, if B(x) = ( ) D = ( ) & ( ) = ( ) x

Classes in the text : BNDM example Given the pattern ATGTA The masks of bits are B(A) = ( ) B(R)=( ) B(C) = ( ) … B(G) = ( ) B(N)=( ) B(T) = ( )

Classes in the text : BNDM example Given the pattern ATGTA B(A) = ( ) B(R)=( ) B(C) = ( ) … B(G) = ( ) B(N)=( ) B(T) = ( ) The masks of bits are

Classes in the text : BNDM example Given the pattern ATGTA text :G T A R T R N A G G A C G... A T G T A B(A) = ( ) B(R)=( ) B(C) = ( ) … B(G) = ( ) B(N)=( ) B(T) = ( ) D 1 = ( ) D 2 = ( ) & ( ) = ( ) D 2 = ( ) & ( ) = ( ) The masks of bits are

Classes in the text : BNDM example Given the pattern ATGTA text :G T A R T R N A G G A C G... A T G T A B(A) = ( ) B(R)=( ) B(C) = ( ) … B(G) = ( ) B(N)=( ) B(T) = ( ) D 1 = ( ) D 2 = ( ) & ( ) = ( ) D 1 = ( ) D 2 = ( ) & ( ) = ( ) D 2 = ( ) & ( ) = ( ) D 3 = ( ) & ( ) = ( ) D 4 = ( ) & ( ) = ( ) D 5 = ( ) & ( ) = ( ) The masks of bits are

Classes in the text : BNDM example Given the pattern ATGTA text :G T A R T R N A G G A C G... A T G T A B(A) = ( ) B(R)=( ) B(C) = ( ) … B(G) = ( ) B(N)=( ) B(T) = ( ) D 1 = ( ) D 2 = ( ) & ( ) = ( ) D 1 = ( ) D 2 = ( ) & ( ) = ( ) D 2 = ( ) & ( ) = ( ) D 3 = ( ) & ( ) = ( ) D 4 = ( ) & ( ) = ( ) D 5 = ( ) & ( ) = ( ) … The masks of bits are

Classes in the text Experimental efficiency (Navarro & Raffinot) |  | Long. pattern Horspool BNDM BOM BNDM : Backward Nondeterministic Dawg Matching BOM : Backward Oracle Matching w

BOM algorithm (Backward Oracle Matching) Which is the next position of the window? How the comparison is made? Text : Pattern : Automata: Factor Oracle Check if the suffix is a factor The position determined by the last character of the text with a transition in the automata

Classes in the text: BOM example The we build the AFO of the inverse pattern of ATGTATG … and we try to find… :G T A R T R N A A T G… A T G T A T G GGATT AT T A G It’s not possible any improvement!

Multiple string matching |  | Wu-Manber SBOM lmin (5 strings) Wu-Manber SBOM (10 strings) Ad AC Wu-Manber SBOM (1000 strings) Ad AC Wu-Manber SBOM (100 strings) Ad AC

Classes in the text: Set Horspool algorithm Text : Patterns: By suffixes Which is the next position of the window? How the comparison is made? Trie of all inverse patterns ?

Set Horspool algorithm Search for ATGTATG,TATG,ATAAT,ATGTG 4. Find the patterns T A G G A T T T T G A A A A T 1. Construct the trie of GTATGTA, GTAT, TAATA i GTGTA 2. Determine lmin=4 A 1 C 4 (lmin) G 2 T 1 3. Determine the shift table

Classes in the text: Set Horspool Search for the patterns ATGTATG,TATG,ATAAT,ATGTG T A A G G A T T T T G A A A A T text: ARTGNCTATGTGACA… It’s not possible any improvement!

Multiple string matching |  | Wu-Manber SBOM lmin (5 strings) Wu-Manber SBOM (10 strings) Ad AC Wu-Manber SBOM (1000 strings) Ad AC Wu-Manber SBOM (100 strings) Ad AC

Classes in the text: SBOM algorithm Which is the next position of the window? How the comparison is made? Text : Pattern : Automata: Factor Oracle (Inverse patterns of length lmin) Check if the suffix is a factor of any pattern The position determined by the last character of the text with a transition in the automata

Classes in the text: SBOM example Search for the patterns ATGTATG, TAATG,TAATAAT i AATGTG GGATT T T A G A T A A text: ACATN C TAGC TA TA ATAATGTATG A It’s not possible any improvement!

Extended alphabets Classes in the: text pattern Horspool ✓ BNDM ✓ BOM ✗ Set-Horspool ✗ SBOM ✗

Extended search Second part Classes in the pattern

Classes in the pattern: Brute force algorithm Text : over  From left to right: prefix Which is the next position of the window? How the comparison is made? Pattern : Text : The window is shifted only one cell We need the operation: belongs to a set ? ? Pattern : over 2 |∑|

Classes in the pattern: Brute force algorithm Every subset is represented by a string of bits of length |  |. G T A C T A G A G G A C G T A T G T A C T G... A T N T R When |  | < computer word For instance, given the DNA alphabet  ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=(1,0,1,0,),..., I(N)=(1,1,1,1) Then the operation “ A belongs to set X ” is made with I(A) and I(X) >0 I(T) and I(R) >0 I(A) and I(R) >0 I(T) and I(T) >0 I(C) and I(N) >0 I(A) and I(T) >0 …

Classes in the text Experimental efficiency (Navarro & Raffinot) |  | Long. pattern Horspool BNDM BOM BNDM : Backward Nondeterministic Dawg Matching BOM : Backward Oracle Matching w

Classes in the pattern: Horspool algorithm Text : Pattern : Sufix search Which is the next position of the window? How the comparison is made? Pattern : Text : a Shift until the next ocurrence of “a” in the pattern: a aa aa a We need a preprocessing phase to construct the shift table.

Classes in the pattern: Horspool example Given the pattern ATNTR The shift table is: ACGTACGT

Classes in the pattern: Horspool example Given the pattern ATNTR The shift table is: A 2 C G T

Classes in the pattern: Horspool example Given the pattern ATNTR The shift table is: A 2 C 2 G T

Classes in the pattern: Horspool example Given the pattern ATNTR The shift table is: A 2 C 2 G 2 T

Classes in the pattern: Horspool example Given the pattern ATNTR The shift table is: A 2 C 2 G 2 T 1 text :G T A C T A G A T A T G A G... A T N T R

Classes in the pattern: Horspool example Given the pattern ATNTR The shift table is: A 2 C 2 G 2 T 1 text :G T A C T A G A T A T G A G... A T N T R A T G T A Shorter shifts!

Classes in the text Experimental efficiency (Navarro & Raffinot) |  | Long. pattern Horspool BNDM BOM BNDM : Backward Nondeterministic Dawg Matching BOM : Backward Oracle Matching w

Text : Pattern : Search for suffixes of T that are factors of the pattern Classes in the text: BNDM algorithm Which is the next position of the window ? How the comparison is made? …that is denoted as D 2 = Depends on the value of the leftmost bit of D Once the next character x is read D 3 = D 2 <<1 & B(x) B(x): mask of x in the pattern P. For instance, if B(x) = ( ) D = ( ) & ( ) = ( ) x

Classes in the pattern : BNDM example Given the pattern ATNTR The masks of bits of symbols are B(A) = ( ) B(C) = ( ) B(G) = ( ) B(T) = ( )

Classes in the pattern : BNDM example Given the pattern ATNTR The masks of bits of symbols are B(A) = ( ) B(C) = ( ) B(G) = ( ) B(T) = ( )

Classes in the pattern : BNDM example Given the pattern ATNTR The masks of bits of symbols are B(A) = ( ) B(C) = ( ) B(G) = ( ) B(T) = ( )

Classes in the pattern : BNDM example Given the pattern ATNTR The masks of bits of symbols are B(A) = ( ) B(C) = ( ) B(G) = ( ) B(T) = ( )

Classes in the pattern : BNDM example Given the pattern ATNTR text :G T A C T A G A G G A C G T A T G T A C T G... A T N T R The masks of bits of symbols are B(A) = ( ) B(C) = ( ) B(G) = ( ) B(T) = ( ) D 1 = ( ) D 2 = ( ) & ( ) = ( ) D 1 = ( ) D 2 = ( ) & ( ) = ( ) D 3 = ( ) & ( ) = ( ) … D 1 = ( ) D 2 = ( ) & ( ) = ( ) D 3 = ( ) & ( ) = ( ) D 4 = ( ) & ( ) = ( ) A T N T R

Classes in the text Experimental efficiency (Navarro & Raffinot) |  | Long. pattern Horspool BNDM BOM BNDM : Backward Nondeterministic Dawg Matching BOM : Backward Oracle Matching w

BOM algorithm (Backward Oracle Matching) Which is the next position of the window? How the comparison is made? Text : Pattern : Automata: Factor Oracle Check if the suffix is a factor The position determined by the last character of the text with a transition in the automata

Classes in the pattern: BOM example Given the pattern ATGTATG, the AFO is GGATT AT T A G We should apply the SBOM algorithm! but for the patter ATNTRTG?

Multiple string matching |  | Wu-Manber SBOM lmin (5 strings) Wu-Manber SBOM (10 strings) Ad AC Wu-Manber SBOM (1000 strings) Ad AC Wu-Manber SBOM (100 strings) Ad AC

Set Horspool algorithm Text : Patterns: By suffixes Which is the next position of the window? How the comparison is made? a Trie of all inverse patterns We shift until a is aligned with the first a in the trie not longer than lmin, or lmin

Set Horspool algorithm Search for ATNTARG,RTGR,NTTNAR,ATRTG 4. Find the patterns 1. Construct the trie of the 46 possible inverse patterns 2. Determine lmin=4 A 1 C 2 G 1 T 1 3. Determine the shift table

Multiple string matching |  | Wu-Manber SBOM lmin (5 strings) Wu-Manber SBOM (10 strings) Ad AC Wu-Manber SBOM (1000 strings) Ad AC Wu-Manber SBOM (100 strings) Ad AC

SBOM algorithm Which is the next position of the window? How the comparison is made? Text : Pattern : Automata: Factor Oracle (Inverse patterns of length lmin) Check if the suffix is a factor of any pattern The position determined by the last character of the text with a transition in the automata

Classes in the patterns: SBOM example the Automata Factor Oracle of all 21 possible patterns is built … Given the patterns ATGNARG, TRATR,TAATAAT i ANTNTGR

Multiple string matching |  | Wu-Manber SBOM lmin (5 strings) Wu-Manber SBOM (10 strings) Ad AC Wu-Manber SBOM (1000 strings) Ad AC Wu-Manber SBOM (100 strings) Ad AC

Extended alphabets Classes in the: text pattern Horspool ✓ ✓ BNDM ✓ ✓ BOM ✗ ≈ Set-Horspool ✗ ≈ SBOM ✗ ≈