Information Retrieval. Modern Information Retrieval (1999), Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Flexible Pattern Matching in Strings (2002)


1 Information Retrieval. Modern Information Retrieval (1999), Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Flexible Pattern Matching in Strings (2002), Gonzalo Navarro and Mathieu Raffinot. http://www-igm.univ-mlv.fr/~lecroq/string/index.html Algorithms for: pattern search, exact and approximate (string matching and pattern matching); text indexing: suffix trees, suffix arrays

2 String Matching String matching: the definition of the problem (text, pattern) depends on what we have: the text or the patterns. Exact matching; approximate matching. 1 pattern ---> the algorithm depends on |p| and |Σ|. k patterns ---> the algorithm depends on k, |p| and |Σ|. The text ---> data structure for the text (suffix tree, ...). The patterns ---> data structures for the patterns. Dynamic programming. Sequence alignment (pairwise and multiple). Extensions: regular expressions. Probabilistic search. Sequence assembly: hash algorithm. Hidden Markov Models.

3 Exact string matching: one pattern (text on-line) Experimental efficiency (Navarro & Raffinot) [Figure: map of the most efficient algorithm by alphabet size |Σ| (2 to 64) and pattern length (2 to 256); regions: Horspool, BNDM, BOM] BNDM: Backward Nondeterministic Dawg Matching. BOM: Backward Oracle Matching.

4 Multiple string matching [Figure: maps of the most efficient algorithm by alphabet size |Σ| (2, 4, 8) and minimum pattern length lmin (5 to 45), one panel each for 5, 10, 100 and 1000 strings; regions: Wu-Manber, SBOM and, in the panels for 10 or more strings, Ad AC]

5-9 Trie Construct the trie of GTATGTA, GTAT, TAATA, GTGTA [Figure: the trie, built node by node over slides 6 to 9] What is the cost?
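
The construction on these slides can be sketched as follows. This is a minimal illustration, not the authors' code: the nested-dictionary representation, the `"$"` end marker and the function names are assumptions of this example. Each string is walked once, one dictionary step per symbol, so under this representation the cost is proportional to the total length of the strings (here 7 + 4 + 5 + 5 = 21 steps).

```python
def build_trie(words):
    """Build a trie as nested dicts; the key "$" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for symbol in word:
            # Reuse the child if it exists, otherwise create it.
            node = node.setdefault(symbol, {})
        node["$"] = word
    return root


def count_nodes(node):
    """Count trie nodes, root included (end markers are not nodes)."""
    return 1 + sum(count_nodes(child)
                   for key, child in node.items() if key != "$")


def contains(trie, word):
    """Return True only if `word` was inserted as a whole word."""
    node = trie
    for symbol in word:
        if symbol not in node:
            return False
        node = node[symbol]
    return "$" in node


trie = build_trie(["GTATGTA", "GTAT", "TAATA", "GTGTA"])
print(count_nodes(trie))                               # 16 (15 nodes + root)
print(contains(trie, "GTAT"), contains(trie, "GTA"))   # True False
```

Shared prefixes such as GT and GTA are stored once, which is why the trie has fewer nodes than the 21 symbols inserted.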

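10 Set Horspool algorithm [Figure: the text with the patterns aligned under the search window] How is the comparison made? By suffixes, with a trie of all the reversed patterns. What is the next position of the window? Let a be the last text symbol in the window: we shift until a is aligned with the first a in the trie at a depth no greater than lmin, or by lmin if there is none.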

11-14 Set Horspool algorithm Search for ATGTATG, TATG, ATAAT, ATGTG [Figure: trie of the reversed patterns]
1. Construct the trie of the reversed patterns GTATGTA, GTAT, TAATA and GTGTA
2. Determine lmin = 4
3. Determine the shift table: A 1, C 4 (lmin), G 2, T 1
4. Find the patterns
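
Steps 2 and 3 can be reproduced with a short sketch. The function name and the explicit `alphabet` parameter are assumptions of this example. For every symbol it keeps the smallest distance, counted from the last position of any pattern, at which the symbol occurs, and falls back to lmin when the symbol does not occur within that range.

```python
def horspool_shift_table(patterns, alphabet="ACGT"):
    """Return (lmin, shift table) for the Set Horspool algorithm."""
    lmin = min(len(p) for p in patterns)
    # Default: a symbol that never occurs near the end allows a shift of lmin.
    shift = {symbol: lmin for symbol in alphabet}
    for p in patterns:
        # d is the distance of a symbol from the last position of the pattern;
        # the last position itself (d = 0) is excluded.
        for d in range(1, lmin):
            symbol = p[len(p) - 1 - d]
            shift[symbol] = min(shift.get(symbol, lmin), d)
    return lmin, shift


lmin, shift = horspool_shift_table(["ATGTATG", "TATG", "ATAAT", "ATGTG"])
print(lmin)    # 4
print(shift)   # {'A': 1, 'C': 4, 'G': 2, 'T': 1}
```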

15-21 Set Horspool algorithm Search for ATGTATG, TATG, ATAAT, ATGTG text: ACATGCTATGTGACA… Shift table: A 1, C 4 (lmin), G 2, T 1 [Figure: trie of the reversed patterns; the search window advances over the text frame by frame]

22 Set Horspool algorithm Is the expected length of the shifts related to the number of patterns? The more patterns we search for, the shorter the shifts!
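
The whole search can be sketched as below, under some assumptions of this example: the trie is a nested dictionary with a `"$"` key storing the pattern that ends at a node (so `"$"` must not occur in the text), and matches are reported as (start position, pattern) pairs with 0-based positions. At each window position the text is read backwards through the trie of the reversed patterns, then the window is shifted according to its last symbol.

```python
def set_horspool(text, patterns):
    """Report (start, pattern) for every occurrence of any pattern in text."""
    lmin = min(len(p) for p in patterns)

    # Trie of the reversed patterns.
    trie = {}
    for p in patterns:
        node = trie
        for symbol in reversed(p):
            node = node.setdefault(symbol, {})
        node["$"] = p

    # Shift table: smallest distance from the end of a pattern, capped at lmin.
    shift = {}
    for p in patterns:
        for d in range(1, lmin):
            symbol = p[len(p) - 1 - d]
            shift[symbol] = min(shift.get(symbol, lmin), d)

    matches = []
    pos = lmin - 1                      # index of the last symbol of the window
    while pos < len(text):
        node, i = trie, pos
        # Read the text backwards from the end of the window.
        while i >= 0 and text[i] in node:
            node = node[text[i]]
            if "$" in node:
                matches.append((i, node["$"]))
            i -= 1
        pos += shift.get(text[pos], lmin)
    return sorted(matches)


print(set_horspool("ACATGCTATGTGACA",
                   ["ATGTATG", "TATG", "ATAAT", "ATGTG"]))
# [(6, 'TATG'), (7, 'ATGTG')]
```

With a single pattern such as ATGTATG the table entry for G would be 4 instead of 2, which illustrates the remark on slide 22: every added pattern can only lower the entries.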

23-27 Set Horspool algorithm → Wu-Manber algorithm How can the length of the shifts be increased? By reading blocks of symbols instead of only one! Given ATGTATG, TATG, ATAAT, ATGTG Shift table for 1 symbol: A 1, C 4 (lmin), G 2, T 1 Shift table for 2 symbols (blocks of length L = 2): AA 1, AC 3 (lmin-L+1), AG 3, AT 1, CA 3, CC 3, CG 3, … Entries below the default of 3: AA 1, AT 1, GT 1, TA 2, TG 2
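
The 2-symbol table can be computed the same way as the 1-symbol one. The sketch below follows the table shown on the slides (blocks are measured from the end of each pattern and the block at the very end is excluded); the function name and the `block_len` parameter are assumptions of this example. Only blocks whose shift is below the default lmin - L + 1 are stored.

```python
def block_shift_table(patterns, block_len=2):
    """Return (default shift, table) for blocks of `block_len` symbols."""
    lmin = min(len(p) for p in patterns)
    default = lmin - block_len + 1
    shift = {}
    for p in patterns:
        # d is the distance between the end of the block and the end of p.
        for d in range(1, default):
            end = len(p) - d
            block = p[end - block_len:end]
            shift[block] = min(shift.get(block, default), d)
    return default, shift


default, table = block_shift_table(["ATGTATG", "TATG", "ATAAT", "ATGTG"])
print(default)                  # 3
print(sorted(table.items()))
# [('AA', 1), ('AT', 1), ('GT', 1), ('TA', 2), ('TG', 2)]
```

Any block missing from the table, such as AC or CC, gets the default shift of 3.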

28-31 Wu-Manber algorithm Search for ATGTATG, TATG, ATAAT, ATGTG text: ACATGCTATGTGACATAATA Shift table for 2 symbols: AA 1, AT 1, GT 1, TA 2, TG 2 [Figure: trie of the reversed patterns; the search window advances over the text frame by frame] But given k patterns, how many symbols should we take? log_|Σ| (2 · lmin · k)
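
A possible sketch of the block-based search follows. It keeps the shift table of the slides but, as an assumption of this example, verifies candidates through a dictionary keyed by the last block of each pattern instead of the trie drawn on the slides; the function names and the (start, pattern) output format are also choices of this example. The helper evaluates the block-length formula from slide 31 and returns a real number that would still have to be rounded.

```python
import math


def suggested_block_length(alphabet_size, lmin, k):
    """Block length suggested on slide 31: log base |alphabet| of 2*lmin*k."""
    return math.log(2 * lmin * k, alphabet_size)


def wu_manber_search(text, patterns, block_len=2):
    """Report (start, pattern) pairs; assumes lmin >= block_len."""
    lmin = min(len(p) for p in patterns)
    default = lmin - block_len + 1
    shift, by_last_block = {}, {}
    for p in patterns:
        by_last_block.setdefault(p[-block_len:], []).append(p)
        for d in range(1, default):
            end = len(p) - d
            block = p[end - block_len:end]
            shift[block] = min(shift.get(block, default), d)

    matches = []
    pos = lmin - 1                      # index of the last symbol of the window
    while pos < len(text):
        block = text[pos - block_len + 1:pos + 1]
        # Only patterns ending with this block can end at `pos`.
        for p in by_last_block.get(block, []):
            start = pos - len(p) + 1
            if start >= 0 and text.startswith(p, start):
                matches.append((start, p))
        pos += shift.get(block, default)
    return sorted(matches)


print(wu_manber_search("ACATGCTATGTGACATAATA",
                       ["ATGTATG", "TATG", "ATAAT", "ATGTG"]))
# [(6, 'TATG'), (7, 'ATGTG'), (14, 'ATAAT')]
print(suggested_block_length(4, 4, 4))   # about 2.5 for |Σ| = 4, lmin = 4, k = 4
```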

32 Multiple string matching [Figure: maps of the most efficient algorithm by alphabet size |Σ| (2, 4, 8) and minimum pattern length lmin (5 to 45), one panel each for 5, 10, 100 and 1000 strings; regions: Wu-Manber, SBOM and, in the panels for 10 or more strings, Ad AC]

33 BOM algorithm (Backward Oracle Matching) [Figure: the text with the pattern aligned under the search window] How is the comparison made? With an automaton, the factor oracle: check whether the suffix read so far is a factor of any pattern. What is the next position of the window? The one determined by the last text character that has a transition in the automaton.

34-35 Factor Oracle of k strings How can we build the factor oracle of GTATGTA, GTAA, TAATA and GTGTA? [Figure: the finished automaton, with terminal states labelled 1,4, 3 and 2]

36-40 Factor Oracle of k strings Given the factor oracle of GTATGTA [Figure: the automaton for GTATGTA, drawn state by state]

41-42 Factor Oracle of k strings Given the factor oracle of GTATGTA, we insert GTAA [Figure: the automaton with terminal states 1 and 2]

43-44 Factor Oracle of k strings Given the factor oracle of GTATGTA and GTAA, we insert TAATA [Figure: the automaton with terminal states 1, 2 and 3]

45-46 Factor Oracle of k strings Given the factor oracle of GTATGTA, GTAA and TAATA, we insert GTGTA [Figure: the automaton with terminal states 1, 3 and 2]

47 Factor Oracle of k strings This is the factor oracle automaton of GTATGTA, GTAA, TAATA and GTGTA [Figure: the finished automaton, with terminal states labelled 1,4, 3 and 2]
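
One way to build such an automaton is sketched below. Note the assumption: this sketch builds the trie of the strings first and then adds the extra transitions in breadth-first order using a supply function, which is the construction usually given for the factor oracle of a set of strings. The slides instead insert the strings one at a time, so their automaton may have a different shape (for instance, shared terminal states). All names are hypothetical.

```python
from collections import deque


def build_factor_oracle(words):
    """Return (trans, terminal): trans[state][symbol] -> state, state 0 = root."""
    trans, parent, terminal = [{}], [None], {}
    # 1. Trie of the words, with states numbered in creation order.
    for idx, word in enumerate(words):
        state = 0
        for symbol in word:
            if symbol not in trans[state]:
                trans.append({})
                parent.append((state, symbol))
                trans[state][symbol] = len(trans) - 1
            state = trans[state][symbol]
        terminal.setdefault(state, []).append(idx)
    # 2. Breadth-first order over trie edges only (computed before step 3).
    order, queue = [], deque([0])
    while queue:
        state = queue.popleft()
        for child in trans[state].values():
            order.append(child)
            queue.append(child)
    # 3. Add external transitions by following the supply links.
    supply = [None] * len(trans)
    for current in order:
        par, symbol = parent[current]
        down = supply[par]
        while down is not None and symbol not in trans[down]:
            trans[down][symbol] = current
            down = supply[down]
        supply[current] = 0 if down is None else trans[down][symbol]
    return trans, terminal


def accepts(trans, word):
    """True if `word` can be read from the root (every state is accepting)."""
    state = 0
    for symbol in word:
        if symbol not in trans[state]:
            return False
        state = trans[state][symbol]
    return True


trans, terminal = build_factor_oracle(["GTATGTA", "GTAA", "TAATA", "GTGTA"])
print(accepts(trans, "ATGT"), accepts(trans, "AAT"), accepts(trans, "C"))
# True True False
```

The oracle accepts at least every factor of every string; it may also accept a few strings that are not factors, which is why the search algorithms built on it verify candidate matches.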

48 SBOM algorithm [Figure: the text with the patterns aligned under the search window] How is the comparison made? With an automaton, the factor oracle of the reversed patterns cut to length lmin: check whether the suffix read so far is a factor of any pattern. What is the next position of the window? The one determined by the last text character that has a transition in the automaton.

49 SBOM algorithm: example We search for the patterns ATGTATG, TAATG, TAATAAT and AATGTG … then we build the factor oracle automaton of GTATG, GTAAT, TAATA and GTGTA, of length lmin = 5 [Figure: the automaton, with terminal states labelled 1, 4, 2 and 3]

50-57 SBOM algorithm: example Search for ATGTATG, TAATG, TAATAAT and AATGTG text: ACATGCTAGCTATAATAATGTATG [Figure: the factor oracle; the search window advances over the text frame by frame]
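
The example can be replayed with the sketch below. It rests on several assumptions of this example: the oracle is built with the breadth-first, trie-based construction (which may differ from the automaton drawn on the slides), it is built on the reversed last lmin symbols of each pattern as in the slide's word list, and candidates are verified by direct comparison. Function names and the (start, pattern) output format are hypothetical.

```python
from collections import deque


def build_oracle(words):
    """Factor oracle of a set of words (trie plus supply-link transitions)."""
    trans, parent, terminal = [{}], [None], {}
    for idx, word in enumerate(words):
        state = 0
        for symbol in word:
            if symbol not in trans[state]:
                trans.append({})
                parent.append((state, symbol))
                trans[state][symbol] = len(trans) - 1
            state = trans[state][symbol]
        terminal.setdefault(state, []).append(idx)
    order, queue = [], deque([0])
    while queue:                          # breadth-first order over trie edges
        state = queue.popleft()
        for child in trans[state].values():
            order.append(child)
            queue.append(child)
    supply = [None] * len(trans)
    for current in order:
        par, symbol = parent[current]
        down = supply[par]
        while down is not None and symbol not in trans[down]:
            trans[down][symbol] = current
            down = supply[down]
        supply[current] = 0 if down is None else trans[down][symbol]
    return trans, terminal


def sbom_search(text, patterns):
    """Report (start, pattern) for every occurrence of any pattern in text."""
    lmin = min(len(p) for p in patterns)
    trans, terminal = build_oracle([p[-lmin:][::-1] for p in patterns])
    matches = []
    pos = lmin - 1                        # index of the last symbol of the window
    while pos < len(text):
        state, i = 0, pos
        # Read the window backwards while the oracle has a transition.
        while i > pos - lmin and text[i] in trans[state]:
            state = trans[state][text[i]]
            i -= 1
        if i == pos - lmin:               # whole window read: verify candidates
            for idx in terminal.get(state, []):
                p = patterns[idx]
                start = pos - len(p) + 1
                if start >= 0 and text.startswith(p, start):
                    matches.append((start, p))
            pos += 1
        else:                             # failed at text[i]: restart just after it
            pos = i + lmin
    return sorted(matches)


print(sbom_search("ACATGCTAGCTATAATAATGTATG",
                  ["ATGTATG", "TAATG", "TAATAAT", "AATGTG"]))
# [(12, 'TAATAAT'), (15, 'TAATG'), (17, 'ATGTATG')]
```

On this text the first three windows fail after at most four symbols and are skipped with shifts of 2, 4 and 4 positions, which is where the method gains over moving one position at a time.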

58 Multiple string matching [Figure: maps of the most efficient algorithm by alphabet size |Σ| (2, 4, 8) and minimum pattern length lmin (5 to 45), one panel each for 5, 10, 100 and 1000 strings; regions: Wu-Manber, SBOM and, in the panels for 10 or more strings, Ad AC]

