Information Retrieval (Recuperació de la informació) 06/04/2019
- Modern Information Retrieval (1999), Ricardo Baeza-Yates and Berthier Ribeiro-Neto
- Flexible Pattern Matching in Strings (2002), Gonzalo Navarro and Mathieu Raffinot
- Algorithms on Strings (2001), M. Crochemore, C. Hancart and T. Lecroq
- http://www-igm.univ-mlv.fr/~lecroq/string/index.html
String Matching
String matching: definition of the problem (text, pattern)
Exact matching: depends on what we have, the text or the patterns
- The patterns ---> Data structures for the patterns
  - 1 pattern ---> The algorithm depends on |p| and |Σ|
  - k patterns ---> The algorithm depends on k, |p| and |Σ|
  - Extensions
  - Regular expressions
- The text ---> Data structure for the text (suffix tree, ...)
Approximate matching: dynamic programming
Sequence alignment (pairwise and multiple)
Sequence assembly: hash algorithm
Probabilistic search: Hidden Markov Models
Regular expression
A regular expression ℛ is a string over the set of symbols Σ ∪ { ε, |, ·, *, (, ) }, defined recursively as follows:
- ε (the empty string) is a regular expression
- A character of Σ is a regular expression
- ( ℛ ) is a regular expression
- ℛ1 · ℛ2 is a regular expression
- ℛ1 | ℛ2 is a regular expression
- ℛ * is a regular expression
In the last four rules, ℛ, ℛ1 and ℛ2 are themselves regular expressions.
As you have seen this morning.
Regular language
The language defined by a regular expression ℛ is the set of strings generated by ℛ.
The problem of searching for a regular expression in the text T is to find all the factors of T that belong to this language.
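To make the search problem concrete, here is a brute-force sketch that tests every factor of T against the expression. It relies on Python's standard `re` module, whose syntax accepts bb*(b|b*a) unchanged. The function name `factors_in_language` is made up for this illustration, and the enumeration of all factors is only meant to spell out the definition, not to be efficient.

```python
import re


def factors_in_language(pattern, text):
    """Return all (start, end) pairs such that text[start:end] belongs to
    the language of `pattern`. Positions are 0-based, end is exclusive."""
    compiled = re.compile(pattern)
    found = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            # fullmatch with pos/endpos checks the whole factor text[start:end]
            if compiled.fullmatch(text, start, end):
                found.append((start, end))
    return found


print(factors_in_language("bb*(b|b*a)", "bbbaab"))
```

The automaton-based methods in the following slides avoid this enumeration by reading the text once, character by character.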
Methods
Regular expression ---> Parse tree ---> NFA ---> DFA ---> Strings found
- From the DFA: search with a deterministic finite automaton
- From the NFA: search with the bit-parallel Thompson automaton
Search with a deterministic finite automaton
Given the regular expression bb*(b|b*a), the NFA is:
[Figure: NFA for bb*(b|b*a), with states 1, 2 and 3 and transitions labeled a and b]
Since the text cannot be spelled out directly on the NFA, the NFA is transformed into a DFA:
[Figure: equivalent DFA, with states 1, 3 and 12 and transitions labeled a and b]
What is the cost? And what is the search process?
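The NFA-to-DFA transformation can be sketched with the classical subset construction. Since the slide's own NFA figure is not recoverable, this sketch assumes the Thompson NFA for bb*(b|b*a) with states 0 to 12 (the transition tables `CHAR` and `EPS` are an assumption, reconstructed from the ε-closure table given later in the deck). The names `closure` and `determinize` are illustrative.

```python
# Assumed Thompson NFA for bb*(b|b*a): state 0 is initial, state 12 is final.
CHAR = {(0, "b"): 1, (2, "b"): 3, (6, "b"): 7, (8, "a"): 9, (10, "b"): 11}
EPS = {1: [2, 4], 3: [2, 4], 4: [5, 10], 5: [6, 8], 7: [6, 8], 9: [12], 11: [12]}
FINAL = 12


def closure(states):
    """Return the set of states reachable through epsilon-transitions."""
    seen = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in EPS.get(s, []):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)


def determinize(alphabet="ab"):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start = closure({0})
    dfa = {}
    todo = [start]
    while todo:
        subset = todo.pop()
        if subset in dfa:
            continue
        dfa[subset] = {}
        for ch in alphabet:
            moved = {CHAR[(s, ch)] for s in subset if (s, ch) in CHAR}
            target = closure(moved)
            dfa[subset][ch] = target
            todo.append(target)
    return start, dfa


start, dfa = determinize()
accepting = [subset for subset in dfa if FINAL in subset]
print(len(dfa), "DFA states,", len(accepting), "accepting")
```

For this expression only five subsets are reachable (counting the empty one), but in the worst case the number of subsets grows exponentially with the number of NFA states, which is the cost the slide asks about. Note that this DFA recognizes the language from a fixed starting point; searching inside a text additionally requires keeping the initial state active at every position.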
Search example with DFA
Given the regular expression bb*(b|b*a) and the DFA:
[Figure: DFA for bb*(b|b*a), with states 1, 3 and 12 and transitions labeled a and b]
The search on the text:
b b b a a b a a b b …
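As a rough illustration of the search process, the sketch below runs a small hand-built DFA over the text of the slide and reports where occurrences end. The DFA is an assumption: it was derived by hand for "any prefix followed by bb*(b|b*a)" over the alphabet {a, b}, and its state names A to D are hypothetical labels, not the numbering of the slide's figure.

```python
# Hand-built search DFA (assumed, not the slide's figure).
# A: nothing useful read yet; B: one b read;
# C: an occurrence ending in a was just read; D: at least two b's were read.
DFA = {
    "A": {"a": "A", "b": "B"},
    "B": {"a": "C", "b": "D"},
    "C": {"a": "A", "b": "B"},
    "D": {"a": "C", "b": "D"},
}
ACCEPTING = {"C", "D"}


def dfa_search(text, dfa=DFA, start="A", accepting=ACCEPTING):
    """Return the 1-based positions where an occurrence ends."""
    state = start
    ends = []
    for pos, ch in enumerate(text, start=1):
        # A character outside the alphabet cannot continue any occurrence,
        # so the automaton falls back to the start state.
        state = dfa[state].get(ch, start)
        if state in accepting:
            ends.append(pos)
    return ends


print(dfa_search("bbbaabaabb"))  # [2, 3, 4, 7, 10]
```

Each text character costs one table lookup, so the search time is linear in the text length once the DFA has been built.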
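Parse tree
A parse tree is a tree such that:
- internal nodes are labeled by operators
- leaves are labeled by characters of Σ and ε
The tree of each expression is built recursively:
- ( ℛ ): the tree of ℛ
- ℛ1 · ℛ2: a node "·" whose children are the trees of ℛ1 and ℛ2
- ℛ1 | ℛ2: a node "|" whose children are the trees of ℛ1 and ℛ2
- ℛ *: a node "*" whose only child is the tree of ℛ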
Parse tree: example
Given the regular expression bb*(b|b*a), the parse tree is:
·
  b
  *
    b
  |
    b
    ·
      *
        b
      a
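A parse tree like this one can be produced by a small recursive-descent parser. The sketch below is one possible way to do it, under some assumptions: concatenation is implicit (no "·" in the input), ε is not supported, and the tree is returned as nested tuples such as `("|", left, right)`. The function name `parse_regex` and the tuple encoding are choices made for this illustration. Concatenation is grouped to the left, so the shape can differ slightly from the tree drawn on the slide while describing the same language.

```python
def parse_regex(pattern):
    """Parse a regular expression into a nested-tuple parse tree.

    Leaves are single characters; internal nodes are ("|", l, r),
    (".", l, r) for concatenation, and ("*", child).
    """
    pos = 0

    def peek():
        return pattern[pos] if pos < len(pattern) else None

    def parse_union():
        nonlocal pos
        node = parse_concat()
        while peek() == "|":
            pos += 1
            node = ("|", node, parse_concat())
        return node

    def parse_concat():
        node = parse_star()
        # Keep concatenating until the end, a "|" or a closing parenthesis.
        while peek() is not None and peek() not in "|)":
            node = (".", node, parse_star())
        return node

    def parse_star():
        nonlocal pos
        node = parse_atom()
        while peek() == "*":
            pos += 1
            node = ("*", node)
        return node

    def parse_atom():
        nonlocal pos
        ch = peek()
        if ch is None or ch in "|)*":
            raise ValueError(f"unexpected token at position {pos}")
        pos += 1
        if ch == "(":
            node = parse_union()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        return ch

    tree = parse_union()
    if pos != len(pattern):
        raise ValueError(f"unexpected token at position {pos}")
    return tree


print(parse_regex("bb*(b|b*a)"))
```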
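NFA (Thompson automaton)
From the regular expression, or from its parse tree, we define the automaton recursively:
- For a character a of Σ: two states joined by a transition labeled a
- ℛ1 · ℛ2: the automaton of ℛ1 followed by the automaton of ℛ2 (the final state of ℛ1 is the initial state of ℛ2)
- ℛ1 | ℛ2: a new initial state with ε-transitions to the initial states of ℛ1 and ℛ2, and ε-transitions from their final states to a new final state
- ℛ *: a new initial state and a new final state, with ε-transitions that either enter the automaton of ℛ or skip it, and that either loop back from the final state of ℛ to its initial state or exit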
Thompson automaton construction
bb*(b|b*a)
[Figure: the parse tree of bb*(b|b*a) and the Thompson automaton built from it, node by node]
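The construction can be sketched as a recursive walk over the parse tree. This is a minimal sketch under stated assumptions: the tree is given as nested tuples (`(".", l, r)`, `("|", l, r)`, `("*", child)`, or a single character), states are numbered in creation order, and the function name `thompson` is illustrative. In the example the two branches of the union are listed as (b*a | b), which is an assumption made so that the numbering comes out as in the ε-closure table of the deck.

```python
def thompson(tree):
    """Build a Thompson NFA from a nested-tuple parse tree.

    Returns (char_edges, eps_edges, final_state); state 0 is initial.
    """
    char_edges = []   # (source, character, target)
    eps_edges = []    # (source, target)
    count = 1         # state 0 already exists

    def new_state():
        nonlocal count
        count += 1
        return count - 1

    def build(node, start):
        """Attach the automaton of `node` at `start`; return its final state."""
        if isinstance(node, str):            # single character
            end = new_state()
            char_edges.append((start, node, end))
            return end
        if node[0] == ".":                   # concatenation: chain both parts
            return build(node[2], build(node[1], start))
        if node[0] == "|":                   # union: one entry state per branch
            ends = []
            for child in node[1:]:
                entry = new_state()
                eps_edges.append((start, entry))
                ends.append(build(child, entry))
            end = new_state()
            eps_edges.extend((e, end) for e in ends)
            return end
        if node[0] == "*":                   # star: enter, skip, loop or exit
            entry = new_state()
            inner_end = build(node[1], entry)
            end = new_state()
            eps_edges.extend([(start, entry), (start, end),
                              (inner_end, entry), (inner_end, end)])
            return end
        raise ValueError(f"unknown node: {node!r}")

    final = build(tree, 0)
    return char_edges, eps_edges, final


# bb*(b|b*a) with the union branches written as (b*a | b)
tree = (".", (".", "b", ("*", "b")), ("|", (".", ("*", "b"), "a"), "b"))
char_edges, eps_edges, final = thompson(tree)
print(char_edges)
print(final)
```

With this numbering every character transition goes from a state i to state i + 1, which is the property the bit-parallel algorithm exploits with a single shift.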
NFA: ε-closure (ε-equivalent states)
bb*(b|b*a)
[Figure: Thompson automaton with states 0 to 12; character transitions 0 -b-> 1, 2 -b-> 3, 6 -b-> 7, 8 -a-> 9, 10 -b-> 11; all other transitions are ε-transitions]
State   ε-closure
1       1, 2, 4, 5, 6, 8, 10
3       2, 3, 4, 5, 6, 8, 10
4       4, 5, 6, 8, 10
5       5, 6, 8
7       6, 7, 8
9       9, 12
11      11, 12
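The closures in this table can be recomputed with a simple graph traversal. The sketch below assumes a set of ε-transitions (`EPS`) that was reconstructed from the table itself, since the automaton figure is not available in text form; the function name `eps_closure` is illustrative.

```python
# Assumed epsilon-transitions of the Thompson automaton for bb*(b|b*a),
# reconstructed from the closure table (source state -> target states).
EPS = {1: [2, 4], 3: [2, 4], 4: [5, 10], 5: [6, 8], 7: [6, 8], 9: [12], 11: [12]}


def eps_closure(state, eps=EPS):
    """Return the sorted list of states reachable from `state`
    using only epsilon-transitions (the state itself included)."""
    closure = {state}
    stack = [state]
    while stack:
        s = stack.pop()
        for t in eps.get(s, []):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return sorted(closure)


for state in (1, 3, 4, 5, 7, 9, 11):
    print(state, eps_closure(state))
```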
Bit-parallel Thompson algorithm
bb*(b|b*a)
Text: ababbbaab
[Figure: Thompson automaton with states 0 to 12; character transitions 0 -b-> 1, 2 -b-> 3, 6 -b-> 7, 8 -a-> 9, 10 -b-> 11]
The ε-closures (table E):
State   ε-closure
1       1, 2, 4, 5, 6, 8, 10
3       2, 3, 4, 5, 6, 8, 10
4       4, 5, 6, 8, 10
5       5, 6, 8
7       6, 7, 8
9       9, 12
11      11, 12
The masks are:
B       0  1  2  3  4  5  6  7  8  9 10 11 12
a       0  0  0  0  0  0  0  0  0  1  0  0  0
b       0  1  0  1  0  0  0  1  0  0  0  1  0
The bit-vector D marks the active states. At the beginning only state 0 is active:
D       0  1  2  3  4  5  6  7  8  9 10 11 12
        1  0  0  0  0  0  0  0  0  0  0  0  0
At every step we shift D to the right, apply an "and" with the mask of the last character read, and then extend the result with the ε-closure of the active states.
Bit-parallel Thompson algorithm: first steps on the text ababbbaab
State 0 is kept active at every step, so that an occurrence can start at any text position.
             0  1  2  3  4  5  6  7  8  9 10 11 12
D            1  0  0  0  0  0  0  0  0  0  0  0  0   (start)
Read a:
shift        0  1  0  0  0  0  0  0  0  0  0  0  0
and B[a]     0  0  0  0  0  0  0  0  0  0  0  0  0
ε-closure    0  0  0  0  0  0  0  0  0  0  0  0  0
D            1  0  0  0  0  0  0  0  0  0  0  0  0
Read b:
shift        0  1  0  0  0  0  0  0  0  0  0  0  0
and B[b]     0  1  0  0  0  0  0  0  0  0  0  0  0
ε-closure    0  1  1  0  1  1  1  0  1  0  1  0  0
D            1  1  1  0  1  1  1  0  1  0  1  0  0
Read a:
shift        0  1  1  1  0  1  1  1  0  1  0  1  0
and B[a]     0  0  0  0  0  0  0  0  0  1  0  0  0
ε-closure    0  0  0  0  0  0  0  0  0  1  0  0  1
D            1  0  0  0  0  0  0  0  0  1  0  0  1
State 12 is final and is now active, so an occurrence (ba) ends at this position.
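The whole procedure can be sketched with Python integers used as bit-vectors. This is a minimal sketch under a few assumptions. State i is stored in bit i of the integer (least significant bit = state 0), so the slide's "shift to the right" towards higher state numbers is written `<< 1` here. The ε-closure extension is done with a loop over the active states, which is a simplification: it keeps the code short but does not reproduce a table-driven extension step. The names `bits`, `extend` and `bit_parallel_search` are illustrative.

```python
STATES = 13                      # states 0..12
FINAL = 1 << 12                  # state 12 is the final state


def bits(states):
    """Pack a list of state numbers into an integer bit mask."""
    mask = 0
    for s in states:
        mask |= 1 << s
    return mask


# Masks B: for each character, the target states of its transitions.
B = {"a": bits([9]), "b": bits([1, 3, 7, 11])}

# Table E: epsilon-closure of each state; states not listed close to themselves.
CLOSURES = {1: [1, 2, 4, 5, 6, 8, 10], 3: [2, 3, 4, 5, 6, 8, 10],
            4: [4, 5, 6, 8, 10], 5: [5, 6, 8], 7: [6, 7, 8],
            9: [9, 12], 11: [11, 12]}
E = [bits(CLOSURES.get(s, [s])) for s in range(STATES)]


def extend(d):
    """Union of the epsilon-closures of all states active in d."""
    result = 0
    for s in range(STATES):
        if d & (1 << s):
            result |= E[s]
    return result


def bit_parallel_search(text):
    """Return the 1-based end positions of the occurrences in text."""
    initial = extend(1)          # only state 0 is active at the beginning
    d = initial
    ends = []
    for pos, ch in enumerate(text, start=1):
        # shift, "and" with the mask of the character, extend with the
        # epsilon-closures, and keep the initial state active
        d = extend((d << 1) & B.get(ch, 0)) | initial
        if d & FINAL:
            ends.append(pos)
    return ends


print(bit_parallel_search("ababbbaab"))  # [3, 5, 6, 7]
```

The single shift is enough because, in the Thompson numbering used here, every character transition goes from a state i to state i + 1; the mask of the character then discards the states that were not reached by a transition with that label.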