Download presentation
Presentation is loading. Please wait.
1
Recuperació de la informació Modern Information Retrieval (1999) Ricardo-Baeza Yates and Berthier Ribeiro-Neto Flexible Pattern Matching in Strings (2002) Gonzalo Navarro and Mathieu Raffinot http://www-igm.univ-mlv.fr/~lecroq/string/index.html Algorismes de: Cerca de patrons (exacta i aproximada) (String matching i Pattern matching) Indexació de textos: Suffix trees, Suffix arrays
2
String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching: 1 pattern ---> The algorithm depends on |p| and | | k patterns ---> The algorithm depends on k, |p| and | | The text ----> Data structure for the text (suffix tree,...) The patterns ---> Data structures for the patterns Dynamic programming Sequence alignment (pairwise and multiple) Extensions Regular Expressions Probabilistic search: Sequence assembly: hash algorithm Hidden Markov Models
3
Exact string matching: one pattern (text on-line) Experimental efficiency (Navarro & Raffinot) 2 4 8 16 32 64 128 256 64 32 16 8 4 2 | | Long. pattern Horspool BNDM BOM BNDM : Backward Nondeterministic Dawg Matching BOM : Backward Oracle Matching w
4
Multiple string matching 5 10 15 20 25 30 35 40 45 8 4 2 | | Wu-Manber SBOM lmin (5 strings) 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (10 strings) Ad AC 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (1000 strings) Ad AC 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (100 strings) Ad AC
5
Trie Construct the trie of GTATGTA,GTAT,TAATA,GTGTA
6
Trie Construct the trie of GTATGTA,GTAT,TAATA,GTGTA G G A T T T A
7
Trie Construct the trie of GTATGTA,GTAT,TAATA,GTGTA G G A T T T A
8
Trie Construct the trie of GTATGTA,GTAT,TAATA,GTGTA A G G A T T T T A A A A T
9
Trie Construct the trie of GTATGTA,GTAT,TAATA,GTGTA T A G G A T T T T G A A A A T Which is the cost?
10
Set Horspool algorithm Text : Patterns: By suffixes Which is the next position of the window? How the comparison is made? a Trie of all inverse patterns We shift until a is aligned with the first a in the trie not longer than lmin, or lmin
11
Set Horspool algorithm Search for ATGTATG,TATG,ATAAT,ATGTG T A G G A T T T T G A A A A T 1. Construct the trie of GTATGTA, GTAT, TAATA i GTGTA 2. Determine lmin=
12
Set Horspool algorithm Search for ATGTATG,TATG,ATAAT,ATGTG T A G G A T T T T G A A A A T 1. Construct the trie of GTATGTA, GTAT, TAATA i GTGTA 2. Determine lmin=4 A 1 C 4 (lmin) G T 3. Determine the shift table
13
Set Horspool algorithm Search for ATGTATG,TATG,ATAAT,ATGTG T A G G A T T T T G A A A A T 1. Construct the trie of GTATGTA, GTAT, TAATA i GTGTA 2. Determine lmin=4 A 1 C 4 (lmin) G 2 T 3. Determine the shift table
14
Set Horspool algorithm Search for ATGTATG,TATG,ATAAT,ATGTG 4. Find the patterns T A G G A T T T T G A A A A T 1. Construct the trie of GTATGTA, GTAT, TAATA i GTGTA 2. Determine lmin=4 A 1 C 4 (lmin) G 2 T 1 3. Determine the shift table
15
Set Horspool algorithm Search for ATGTATG,TATG,ATAAT,ATGTG T A G G A T T T T G A A A A T text: ACATGCTATGTGACA… A 1 C 4 (lmin) G 2 T 1
16
Set Horspool algorithm Search for ATGTATG,TATG,ATAAT,ATGTG T A G G A T T T T G A A A A T text: ACATGCTATGTGACA… A 1 C 4 (lmin) G 2 T 1
17
Set Horspool algorithm Search for ATGTATG,TATG,ATAAT,ATGTG T A G G A T T T T G A A A A T text: ACATGCTATGTGACA… A 1 C 4 (lmin) G 2 T 1
18
Set Horspool algorithm Search for ATGTATG,TATG,ATAAT,ATGTG T A G G A T T T T G A A A A T text: ACATGCTATGTGACA… A 1 C 4 (lmin) G 2 T 1
19
Set Horspool algorithm Search for ATGTATG,TATG,ATAAT,ATGTG T A G G A T T T T G A A A A T text: ACATGCTATGTGACA… A 1 C 4 (lmin) G 2 T 1
20
Set Horspool algorithm Search for ATGTATG,TATG,ATAAT,ATGTG T A G G A T T T T G A A A A T text: ACATGCTATGTGACA… A 1 C 4 (lmin) G 2 T 1
21
Set Horspool algorithm Search for ATGTATG,TATG,ATAAT,ATGTG T A G G A T T T T G A A A A T text: ACATGCTATGTGACA… A 1 C 4 (lmin) G 2 T 1 …
22
Set Horspool algorithm Search for ATGTATG,TATG,ATAAT,ATGTG T A G G A T T T T G A A A A T text: ACATGCTATGTGACA… A 1 C 4 (lmin) G 2 T 1 Is the expected length of the shifts related with the number of patterns? … As more patterns we search for, shorter shifts we do!
23
AA 1 AC 3 ( LMIN-L+1 ) AG AT CA CC CG … 2 símbols Set Horspool algorithm Wu-Manber algorithm How the length of shifts can be increased? By reading blocks of symbols instead of only one! Given ATGTATG,TATG,ATAAT,ATGTG A 1 C 4 (lmin) G 2 T 1 1 símbol
24
AA 1 AC 3 ( LMIN-L+1 ) AG AT CA CC CG … 2 símbols Set Horspool algorithm Wu-Manber algorithm How the length of shifts can be increased? By reading blocks of symbols instead of only one! Given ATGTATG,TATG,ATAAT,ATGTG A 1 C 4 (lmin) G 2 T 1 1 símbol 3
25
AA 1 AC 3 ( LMIN-L+1 ) AG AT CA CC CG … 2 símbols Set Horspool algorithm Wu-Manber algorithm How the length of shifts can be increased? By reading blocks of symbols instead of only one! Given ATGTATG,TATG,ATAAT,ATGTG A 1 C 4 (lmin) G 2 T 1 1 símbol 3 1
26
AA 1 AC 3 ( LMIN-L+1 ) AG AT 1 CA CC CG … 2 símbols Set Horspool algorithm Wu-Manber algorithm How the length of shifts can be increased? By reading blocks of symbols instead of only one! Given ATGTATG,TATG,ATAAT,ATGTG A 1 C 4 (lmin) G 2 T 1 1 símbol 3 333333
27
AA 1 AC 3 ( LMIN-L+1 ) AG 3 AT 1 CA 3 CC 3 CG 3 … 2 símbols Set Horspool algorithm Wu-Manber algorithm How the length of shifts can be increased? By reading blocks of symbols instead of only one! Given ATGTATG,TATG,ATAAT,ATGTG AA 1 AT 1 GT 1 TA 2 TG 2 A 1 C 4 (lmin) G 2 T 1 1 símbol
28
Wu-Manber algorithm Search for ATGTATG,TATG,ATAAT,ATGTG T A A G G A T T T T G A A A A T text: ACATGCTATGTGACATAATA AA 1 AT 1 GT 1 TA 2 TG 2
29
Wu-Manber algorithm Search for ATGTATG,TATG,ATAAT,ATGTG T A A G G A T T T T G A A A A T text: ACATGCTATGTGACATAATA AA 1 AT 1 GT 1 TA 2 TG 2
30
Wu-Manber algorithm Search for ATGTATG,TATG,ATAAT,ATGTG T A A G G A T T T T G A A A A T text: ACATGCTATGTGACATAATA AA 1 AT 1 GT 1 TA 2 TG 2
31
Wu-Manber algorithm Search for ATGTATG,TATG,ATAAT,ATGTG T A A G G A T T T T G A A A A T text: ACATGCTATGTGACATAATA … AA 1 AT 1 GT 1 TA 2 TG 2 log |Σ| 2*lmin*k But given k patterns, how many symbols we should take ?
32
Multiple string matching 5 10 15 20 25 30 35 40 45 8 4 2 | | Wu-Manber SBOM lmin (5 strings) 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (10 strings) Ad AC 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (1000 strings) Ad AC 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (100 strings) Ad AC
33
BOM algorithm (Backward Oracle Matching) Which is the next position of the window? How the comparison is made? Text : Pattern : Automata: Factor Oracle Check if the suffix is a factor of any pattern The position determined by the last character of the text with a transition in the automata
34
Factor Oracle of k strings How can we build the Factor Oracle of GTATGTA, GTAA, TAATA i GTGTA ? T A A GGATT T T A G A 1,4 3 2 A
35
Factor Oracle of k strings How can we build the Factor Oracle of GTATGTA, GTAA, TAATA i GTGTA ? T A A GGATT T T A G A 1,4 3 2 A
36
Factor Oracle of k strings Given the Factor Oracle of GTATGTA GT
37
Factor Oracle of k strings Given the Factor Oracle of GTATGTA GAT T
38
Factor Oracle of k strings Given the Factor Oracle of GTATGTA GAT T T A
39
Factor Oracle of k strings Given the Factor Oracle of GTATGTA GGAT T T A
40
Factor Oracle of k strings Given the Factor Oracle of GTATGTA GGAT T T A G T
41
Factor Oracle of k strings Given the Factor Oracle of GTATGTA GGATT T T A G … we insert GTAA A 1
42
Factor Oracle of k strings …inserting GTAA GGATT T T A G A 1 A 2
43
Factor Oracle of k strings Given the AFO of GTATGTA and GTAA A GGATT T T A G A … we insert TAATA 1 2
44
Factor Oracle of k strings … inserting TAATA T A GGATT T T A G A A 1 2 A 3
45
Factor Oracle of k strings Given the AFO of GTATGTA, GTAA and TAATA T A A GGATT T T A G A 1 3 2 A …we insert GTGTA
46
Factor Oracle of k strings …inserting GTGTA T A A GGATT T T A G A 1 3 2 A
47
Factor Oracle of k strings This is the Automata Factor Oracle of GTATGTA, GTAA, TAATA and GTGTA T A A GGATT T T A G A 1,4 3 2 A
48
SBOM algorithm Which is the next position of the window? How the comparison is made? Text : Pattern : Automata: Factor Oracle (Inverse patterns of length lmin) Check if the suffix is a factor of any pattern The position determined by the last character of the text with a transition in the automata
49
SBOM algorithm: example … the we build the Automata Factor Oracle of GTATG, GTAAT, TAATA and GTGTA of length lmin=5 We search for the patterns ATGTATG, TAATG,TAATAAT i AATGTG GGATT T T A G A T A A 14 23 A
50
SBOM algorithm: example Search for ATGTATG, TAATG,TAATAAT i AATGTG GGATT T T A G A T A A 14 23 text: ACATGCTAGCTATAATAATGTATG A
51
SBOM algorithm: example Search for ATGTATG, TAATG,TAATAAT i AATGTG GGATT T T A G A T A A 14 23 text: ACATGCTAGCTATAATAATGTATG A
52
SBOM algorithm: example Search for ATGTATG, TAATG,TAATAAT i AATGTG GGATT T T A G A T A A 14 23 text: ACATGCTAGCTATAATAATGTATG A
53
SBOM algorithm: example Search for ATGTATG, TAATG,TAATAAT i AATGTG GGATT T T A G A T A A 14 23 text: ACATGCTAGCTATAATAATGTATG A
54
SBOM algorithm: example Search for ATGTATG, TAATG,TAATAAT i AATGTG GGATT T T A G A T A A 14 23 text: ACATGCTAGCTATAATAATGTATG A
55
SBOM algorithm: example Search for ATGTATG, TAATG,TAATAAT i AATGTG GGATT T T A G A T A A 14 23 text: ACATGCTAGCTATAATAATGTATG A
56
SBOM algorithm: example Search for ATGTATG, TAATG,TAATAAT i AATGTG GGATT T T A G A T A A 14 23 text: ACATGCTAGCTATAATAATGTATG A
57
SBOM algorithm: example Search for ATGTATG, TAATG,TAATAAT i AATGTG GGATT T T A G A T A A 14 23 text: ACATGCTAGCTATAATAATGT… A
58
Multiple string matching 5 10 15 20 25 30 35 40 45 8 4 2 | | Wu-Manber SBOM lmin (5 strings) 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (10 strings) Ad AC 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (1000 strings) Ad AC 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (100 strings) Ad AC
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.