Presentation is loading. Please wait.

Presentation is loading. Please wait.

Recuperació de la informació Modern Information Retrieval (1999) Ricardo-Baeza Yates and Berthier Ribeiro-Neto Flexible Pattern Matching in Strings (2002)

Similar presentations


Presentation on theme: "Recuperació de la informació Modern Information Retrieval (1999) Ricardo-Baeza Yates and Berthier Ribeiro-Neto Flexible Pattern Matching in Strings (2002)"— Presentation transcript:

1 Recuperació de la informació Modern Information Retrieval (1999) Ricardo-Baeza Yates and Berthier Ribeiro-Neto Flexible Pattern Matching in Strings (2002) Gonzalo Navarro and Mathieu Raffinot Algorithms on strings (2001) M. Crochemore, C. Hancart and T. Lecroq http://www-igm.univ-mlv.fr/~lecroq/string/index.html

2 String Matching String matching: definition of the problem (text,pattern) depends on what we have: text or patterns Exact matching: Approximate matching: 1 pattern ---> The algorithm depends on |p| and |  | k patterns ---> The algorithm depends on k, |p| and |  | The text ----> Data structure for the text (suffix tree,...) The patterns ---> Data structures for the patterns Dynamic programming Sequence alignment (pairwise and multiple) Extensions Regular Expressions Probabilistic search: Sequence assembly: hash algorithm Hidden Markov Models

3 Extended string matching There are characters in the text that represent sets of simbols 1. Classes of characters in the tetx. There are characters in the pattern that represent sets of simbols 2. Classes of characters in the pattern. There are classes of characters represented by one symbol. For instace the IUPAC code for the DNA alphabet is: R = {G,A} Y = {T,C} K = {G,T} M = {A,C} S = {G,C} W = {A,T} B = {G,T,C } D = {G,A,T} H = {A,C,T} V = {G,C,A} N = {A,G,C,T} (any)

4 Extended alphabets First part Classes in the text

5 Classes in the text: Brute force algorithm Text : over 2 |∑| Pattern over  From left to right: prefix Which is the next position of the window? How the comparison is made? Pattern : Text : The window is shifted only one cell We need the operation: belongs to a set ? ?

6 Classes in the text: Brute force algorithm Every subset of  is represented by a string of bits of length |  |. When |  | < computer word For instance, given the DNA alphabet  ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=(,,, )

7 Classes in the text: Brute force algorithm Every subset of  is represented by a string of bits of length |  |. When |  | < computer word For instance, given the DNA alphabet  ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=(1,0,1,0)...I(N)=(,,, )

8 Classes in the text: Brute force algorithm Every subset of  is represented by a string of bits of length |  |. When |  | < computer word For instance, given the DNA alphabet  ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=(1,0,1,0)...I(N)=(1,1,1,1) Then the operation “ A belongs to set X ” is made with...

9 Classes in the text: Brute force algorithm Every subset of  is represented by a string of bits of length |  |. G T A R T R N A G G A... A T G T A When |  | < computer word For instance, given the DNA alphabet  ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=(1,0,1,0)...I(N)=(1,1,1,1) Then the operation “ A belongs to set X ” is made with I(A) and I(X) >0 I(A) & I(T)>0

10 Classes in the text: Brute force algorithm Every subset of  is represented by a string of bits of length |  |. G T A R T R N A G G A... A T G T A When |  | < computer word For instance, given the DNA alphabet  ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=(1,0,1,0)...I(N)=(1,1,1,1) Then the operation “ A belongs to set X ” is made with I(A) and I(X) >0 I(A) & I(T)>0 I(A) & I(R)>0 I(T) & I(T)>0I(G) & I(R)>0I(T) & I(A)>0

11 Classes in the text: Brute force algorithm Every subset of  is represented by a string of bits of length |  |. G T A R T R N A G G A... A T G T A When |  | < computer word For instance, given the DNA alphabet  ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=I(G,A)=(1,0,1,0)...I(N)=(1,1,1,1) Then the operation “ A belongs to set X ” is made with I(A) and I(X) >0 I(A) & I(T)>0 I(A) & I(R)>0 I(T) & I(T)>0I(G) & I(R)>0I(T) & I(A)>0 I(A) & I(N)>0 I(T) & I(R)>0... Which is the cost?

12 Classes in the text Experimental efficiency (Navarro & Raffinot) 2 4 8 16 32 64 128 256 64 32 16 8 4 2 |  | Long. pattern Horspool BNDM BOM BNDM : Backward Nondeterministic Dawg Matching BOM : Backward Oracle Matching w

13 Classes in the text: Horspool algorithm Text : Pattern : Sufix search Which is the next position of the window? How the comparison is made? Pattern : Text : a Shift until the next ocurrence of “a” (or “t”,”r”,…) in the pattern: a aa aa a We need a shift table with the extended alphabet.

14 Classes in the text :Horspool example Given the pattern ATGTA The shift table is: A 4 C 5 G 2 T 1 R ? … N ?

15 Classes in the text :Horspool example Given the pattern ATGTA The shift table is: A 4 C 5 G 2 T 1 R 2 … N ?

16 Classes in the text :Horspool example Given the pattern ATGTA The shift table is: A 4 C 5 G 2 T 1 R 2 … N 1 text :G T A R T R N A A G G A … A T G T A

17 Classes in the text :Horspool example Given the pattern ATGTA The shift table is: A 4 C 5 G 2 T 1 R 2 … N 1 text :G T A R T R N A A G G A... A T G T A …

18 Classes in the text Experimental efficiency (Navarro & Raffinot) 2 4 8 16 32 64 128 256 64 32 16 8 4 2 |  | Long. pattern Horspool BNDM BOM BNDM : Backward Nondeterministic Dawg Matching BOM : Backward Oracle Matching w

19 Text : Pattern : Search for suffixes of T that are factors of the pattern Classes in the text: BNDM algorithm Which is the next position of the window ? How the comparison is made? …that is denoted as D 2 = 1 0 0 0 1 0 0 Depends on the value of the leftmost bit of D Once the next character x is read D 3 = D 2 <<1 & B(x) B(x): mask of x in the pattern P. For instance, if B(x) = ( 0 0 1 1 0 0 0) D = (0 0 0 1 0 0 0) & (0 0 1 1 0 0 0 ) = (0 0 0 1 0 0 0 ) x

20 Classes in the text : BNDM example Given the pattern ATGTA The masks of bits are B(A) = ( 1 0 0 0 1 ) B(R)=( ) B(C) = ( 0 0 0 0 0 ) … B(G) = ( 0 0 1 0 0 ) B(N)=( ) B(T) = ( 0 1 0 1 0 )

21 Classes in the text : BNDM example Given the pattern ATGTA B(A) = ( 1 0 0 0 1 ) B(R)=(1 0 1 0 1) B(C) = ( 0 0 0 0 0 ) … B(G) = ( 0 0 1 0 0 ) B(N)=( ) B(T) = ( 0 1 0 1 0 ) The masks of bits are

22 Classes in the text : BNDM example Given the pattern ATGTA text :G T A R T R N A G G A C G... A T G T A B(A) = ( 1 0 0 0 1 ) B(R)=(1 0 1 0 1) B(C) = ( 0 0 0 0 0 ) … B(G) = ( 0 0 1 0 0 ) B(N)=(1 1 1 1 1) B(T) = ( 0 1 0 1 0 ) D 1 = ( 0 1 0 1 0 ) D 2 = ( 1 0 1 0 0 ) & ( 1 0 1 0 1 ) = ( 1 0 1 0 0 ) D 2 = ( 0 1 0 0 0 ) & ( 1 0 0 0 1 ) = ( 0 0 0 0 0 ) The masks of bits are

23 Classes in the text : BNDM example Given the pattern ATGTA text :G T A R T R N A G G A C G... A T G T A B(A) = ( 1 0 0 0 1 ) B(R)=(1 0 1 0 1) B(C) = ( 0 0 0 0 0 ) … B(G) = ( 0 0 1 0 0 ) B(N)=(1 1 1 1 1) B(T) = ( 0 1 0 1 0 ) D 1 = ( 0 1 0 1 0 ) D 2 = ( 1 0 1 0 0 ) & ( 1 0 1 0 1 ) = ( 1 0 1 0 0 ) D 1 = ( 1 0 0 0 1 ) D 2 = ( 0 0 0 1 0 ) & ( 1 1 1 1 1 ) = ( 0 0 0 1 0 ) D 2 = ( 0 1 0 0 0 ) & ( 1 0 0 0 1 ) = ( 0 0 0 0 0 ) D 3 = ( 0 0 1 0 0 ) & ( 1 0 1 0 1 ) = ( 0 0 1 0 0 ) D 4 = ( 0 1 0 0 0 ) & ( 0 1 0 1 0 ) = ( 0 1 0 0 0 ) D 5 = ( 1 0 0 0 0 ) & ( 1 0 1 0 1 ) = ( 1 0 0 0 0) The masks of bits are

24 Classes in the text : BNDM example Given the pattern ATGTA text :G T A R T R N A G G A C G... A T G T A B(A) = ( 1 0 0 0 1 ) B(R)=(1 0 1 0 1) B(C) = ( 0 0 0 0 0 ) … B(G) = ( 0 0 1 0 0 ) B(N)=(1 1 1 1 1) B(T) = ( 0 1 0 1 0 ) D 1 = ( 0 1 0 1 0 ) D 2 = ( 1 0 1 0 0 ) & ( 1 0 1 0 1 ) = ( 1 0 1 0 0 ) D 1 = ( 1 0 0 0 1 ) D 2 = ( 0 0 0 1 0 ) & ( 1 1 1 1 1 ) = ( 0 0 0 1 0 ) D 2 = ( 0 1 0 0 0 ) & ( 1 0 0 0 1 ) = ( 0 0 0 0 0 ) D 3 = ( 0 0 1 0 0 ) & ( 1 0 1 0 1 ) = ( 0 0 1 0 0 ) D 4 = ( 0 1 0 0 0 ) & ( 0 1 0 1 0 ) = ( 0 1 0 0 0 ) D 5 = ( 1 0 0 0 0 ) & ( 1 0 1 0 1 ) = ( 1 0 0 0 0) … The masks of bits are

25 Classes in the text Experimental efficiency (Navarro & Raffinot) 2 4 8 16 32 64 128 256 64 32 16 8 4 2 |  | Long. pattern Horspool BNDM BOM BNDM : Backward Nondeterministic Dawg Matching BOM : Backward Oracle Matching w

26 BOM algorithm (Backward Oracle Matching) Which is the next position of the window? How the comparison is made? Text : Pattern : Automata: Factor Oracle Check if the suffix is a factor The position determined by the last character of the text with a transition in the automata

27 Classes in the text: BOM example The we build the AFO of the inverse pattern of ATGTATG … and we try to find… :G T A R T R N A A T G… A T G T A T G GGATT AT T A G It’s not possible any improvement!

28 Multiple string matching 5 10 15 20 25 30 35 40 45 8 4 2 |  | Wu-Manber SBOM lmin (5 strings) 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (10 strings) Ad AC 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (1000 strings) Ad AC 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (100 strings) Ad AC

29 Classes in the text: Set Horspool algorithm Text : Patterns: By suffixes Which is the next position of the window? How the comparison is made? Trie of all inverse patterns ?

30 Set Horspool algorithm Search for ATGTATG,TATG,ATAAT,ATGTG 4. Find the patterns T A G G A T T T T G A A A A T 1. Construct the trie of GTATGTA, GTAT, TAATA i GTGTA 2. Determine lmin=4 A 1 C 4 (lmin) G 2 T 1 3. Determine the shift table

31 Classes in the text: Set Horspool Search for the patterns ATGTATG,TATG,ATAAT,ATGTG T A A G G A T T T T G A A A A T text: ARTGNCTATGTGACA… It’s not possible any improvement!

32 Multiple string matching 5 10 15 20 25 30 35 40 45 8 4 2 |  | Wu-Manber SBOM lmin (5 strings) 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (10 strings) Ad AC 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (1000 strings) Ad AC 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (100 strings) Ad AC

33 Classes in the text: SBOM algorithm Which is the next position of the window? How the comparison is made? Text : Pattern : Automata: Factor Oracle (Inverse patterns of length lmin) Check if the suffix is a factor of any pattern The position determined by the last character of the text with a transition in the automata

34 Classes in the text: SBOM example Search for the patterns ATGTATG, TAATG,TAATAAT i AATGTG GGATT T T A G A T A A 14 23 text: ACATN C TAGC TA TA ATAATGTATG A It’s not possible any improvement!

35 Extended alphabets Classes in the: text pattern Horspool ✓ BNDM ✓ BOM ✗ Set-Horspool ✗ SBOM ✗

36 Extended search Second part Classes in the pattern

37 Classes in the pattern: Brute force algorithm Text : over  From left to right: prefix Which is the next position of the window? How the comparison is made? Pattern : Text : The window is shifted only one cell We need the operation: belongs to a set ? ? Pattern : over 2 |∑|

38 Classes in the pattern: Brute force algorithm Every subset is represented by a string of bits of length |  |. G T A C T A G A G G A C G T A T G T A C T G... A T N T R When |  | < computer word For instance, given the DNA alphabet  ={A,C,G,T} : I(A)=(1,0,0,0), I(C)=(0,1,0,0),... I(R)=(1,0,1,0,),..., I(N)=(1,1,1,1) Then the operation “ A belongs to set X ” is made with I(A) and I(X) >0 I(T) and I(R) >0 I(A) and I(R) >0 I(T) and I(T) >0 I(C) and I(N) >0 I(A) and I(T) >0 …

39 Classes in the text Experimental efficiency (Navarro & Raffinot) 2 4 8 16 32 64 128 256 64 32 16 8 4 2 |  | Long. pattern Horspool BNDM BOM BNDM : Backward Nondeterministic Dawg Matching BOM : Backward Oracle Matching w

40 Classes in the pattern: Horspool algorithm Text : Pattern : Sufix search Which is the next position of the window? How the comparison is made? Pattern : Text : a Shift until the next ocurrence of “a” in the pattern: a aa aa a We need a preprocessing phase to construct the shift table.

41 Classes in the pattern: Horspool example Given the pattern ATNTR The shift table is: ACGTACGT

42 Classes in the pattern: Horspool example Given the pattern ATNTR The shift table is: A 2 C G T

43 Classes in the pattern: Horspool example Given the pattern ATNTR The shift table is: A 2 C 2 G T

44 Classes in the pattern: Horspool example Given the pattern ATNTR The shift table is: A 2 C 2 G 2 T

45 Classes in the pattern: Horspool example Given the pattern ATNTR The shift table is: A 2 C 2 G 2 T 1 text :G T A C T A G A T A T G A G... A T N T R

46 Classes in the pattern: Horspool example Given the pattern ATNTR The shift table is: A 2 C 2 G 2 T 1 text :G T A C T A G A T A T G A G... A T N T R A T G T A Shorter shifts!

47 Classes in the text Experimental efficiency (Navarro & Raffinot) 2 4 8 16 32 64 128 256 64 32 16 8 4 2 |  | Long. pattern Horspool BNDM BOM BNDM : Backward Nondeterministic Dawg Matching BOM : Backward Oracle Matching w

48 Text : Pattern : Search for suffixes of T that are factors of the pattern Classes in the text: BNDM algorithm Which is the next position of the window ? How the comparison is made? …that is denoted as D 2 = 1 0 0 0 1 0 0 Depends on the value of the leftmost bit of D Once the next character x is read D 3 = D 2 <<1 & B(x) B(x): mask of x in the pattern P. For instance, if B(x) = ( 0 0 1 1 0 0 0) D = (0 0 0 1 0 0 0) & (0 0 1 1 0 0 0 ) = (0 0 0 1 0 0 0 ) x

49 Classes in the pattern : BNDM example Given the pattern ATNTR The masks of bits of symbols are B(A) = ( ) B(C) = ( ) B(G) = ( ) B(T) = ( )

50 Classes in the pattern : BNDM example Given the pattern ATNTR The masks of bits of symbols are B(A) = ( 1 0 1 0 1 ) B(C) = ( ) B(G) = ( ) B(T) = ( )

51 Classes in the pattern : BNDM example Given the pattern ATNTR The masks of bits of symbols are B(A) = ( 1 0 1 0 1 ) B(C) = ( 0 0 1 0 0 ) B(G) = ( ) B(T) = ( )

52 Classes in the pattern : BNDM example Given the pattern ATNTR The masks of bits of symbols are B(A) = ( 1 0 1 0 1 ) B(C) = ( 0 0 1 0 0 ) B(G) = ( 0 0 1 0 1 ) B(T) = ( )

53 Classes in the pattern : BNDM example Given the pattern ATNTR text :G T A C T A G A G G A C G T A T G T A C T G... A T N T R The masks of bits of symbols are B(A) = ( 1 0 1 0 1 ) B(C) = ( 0 0 1 0 0 ) B(G) = ( 0 0 1 0 1 ) B(T) = ( 0 1 1 1 0 ) D 1 = ( 0 1 1 1 0 ) D 2 = ( 1 1 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 ) D 1 = ( 0 0 1 0 1 ) D 2 = ( 0 1 0 1 0 ) & ( 0 0 1 0 1 ) = ( 0 0 0 0 0 ) D 3 = ( 0 1 0 0 0 ) & ( 1 0 1 0 1 ) = ( 0 0 0 0 0 ) … D 1 = ( 1 0 1 0 1 ) D 2 = ( 0 1 0 1 0 ) & ( 0 1 1 1 0 ) = ( 0 1 0 1 0 ) D 3 = ( 1 0 1 0 0 ) & ( 0 0 1 0 1 ) = ( 0 0 1 0 0 ) D 4 = ( 0 1 0 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 0 0 0 ) A T N T R

54 Classes in the text Experimental efficiency (Navarro & Raffinot) 2 4 8 16 32 64 128 256 64 32 16 8 4 2 |  | Long. pattern Horspool BNDM BOM BNDM : Backward Nondeterministic Dawg Matching BOM : Backward Oracle Matching w

55 BOM algorithm (Backward Oracle Matching) Which is the next position of the window? How the comparison is made? Text : Pattern : Automata: Factor Oracle Check if the suffix is a factor The position determined by the last character of the text with a transition in the automata

56 Classes in the pattern: BOM example Given the pattern ATGTATG, the AFO is GGATT AT T A G We should apply the SBOM algorithm! but for the patter ATNTRTG?

57 Multiple string matching 5 10 15 20 25 30 35 40 45 8 4 2 |  | Wu-Manber SBOM lmin (5 strings) 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (10 strings) Ad AC 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (1000 strings) Ad AC 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (100 strings) Ad AC

58 Set Horspool algorithm Text : Patterns: By suffixes Which is the next position of the window? How the comparison is made? a Trie of all inverse patterns We shift until a is aligned with the first a in the trie not longer than lmin, or lmin

59 Set Horspool algorithm Search for ATNTARG,RTGR,NTTNAR,ATRTG 4. Find the patterns 1. Construct the trie of the 46 possible inverse patterns 2. Determine lmin=4 A 1 C 2 G 1 T 1 3. Determine the shift table

60 Multiple string matching 5 10 15 20 25 30 35 40 45 8 4 2 |  | Wu-Manber SBOM lmin (5 strings) 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (10 strings) Ad AC 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (1000 strings) Ad AC 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (100 strings) Ad AC

61 SBOM algorithm Which is the next position of the window? How the comparison is made? Text : Pattern : Automata: Factor Oracle (Inverse patterns of length lmin) Check if the suffix is a factor of any pattern The position determined by the last character of the text with a transition in the automata

62 Classes in the patterns: SBOM example the Automata Factor Oracle of all 21 possible patterns is built … Given the patterns ATGNARG, TRATR,TAATAAT i ANTNTGR

63 Multiple string matching 5 10 15 20 25 30 35 40 45 8 4 2 |  | Wu-Manber SBOM lmin (5 strings) 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (10 strings) Ad AC 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (1000 strings) Ad AC 5 10 15 20 25 30 35 40 45 8 4 2 Wu-Manber SBOM (100 strings) Ad AC

64 Extended alphabets Classes in the: text pattern Horspool ✓ ✓ BNDM ✓ ✓ BOM ✗ ≈ Set-Horspool ✗ ≈ SBOM ✗ ≈


Download ppt "Recuperació de la informació Modern Information Retrieval (1999) Ricardo-Baeza Yates and Berthier Ribeiro-Neto Flexible Pattern Matching in Strings (2002)"

Similar presentations


Ads by Google