Information Retrieval (Recuperació de la informació)

1 Information Retrieval (Recuperació de la informació)
12/09/2018
Modern Information Retrieval (1999), Ricardo Baeza-Yates and Berthier Ribeiro-Neto
Flexible Pattern Matching in Strings (2002), Gonzalo Navarro and Mathieu Raffinot
Algorithms on Strings (2001), M. Crochemore, C. Hancart and T. Lecroq

2 String Matching
String matching: definition of the problem (text, pattern)
Exact matching: depends on what we have, text or patterns
  The patterns ---> Data structures for the patterns
    1 pattern ---> The algorithm depends on |p| and |Σ|
    k patterns ---> The algorithm depends on k, |p| and |Σ|
    Extensions
    Regular expressions
  The text ---> Data structure for the text (suffix tree, ...)
Approximate matching:
  Dynamic programming
  Sequence alignment (pairwise and multiple)
  Sequence assembly: hash algorithm
Probabilistic search: Hidden Markov Models

3 Extended string matching
Classes of characters: some DNA files or patterns contain extra characters such as N or R, where N = {A,C,G,T} and R = {G,A}.
Bounded length gaps: patterns such as ATx(2,3)TA, where x(2,3) means any 2 or 3 characters.
Optional characters: patterns such as AC?ACT?T?A, where C? means that C may or may not appear in the text.
Wild cards: patterns such as AT*TA, where * means an arbitrarily long string.
Repeatable characters: patterns such as AT[TA]*AT, where [TA]* means that TA can appear zero or more times.
As you have seen this morning ....

4 Classes of characters
There are classes of characters represented by one symbol. For instance, the IUPAC code for the DNA alphabet is:
  R = {G,A}     Y = {T,C}     K = {G,T}     M = {A,C}
  S = {G,C}     W = {A,T}
  B = {G,T,C}   D = {G,A,T}   H = {A,C,T}   V = {G,C,A}
  N = {A,G,C,T} (any)
1. Classes of characters in the text. There are characters in the text that represent sets of symbols.
2. Classes of characters in the pattern. There are characters in the pattern that represent sets of symbols.

5 Extended alphabets. First part: Classes in the text

6 Classes in the text: Brute force algorithm
How is the comparison made? From left to right: prefix search. The text is over 2^Σ (each text symbol stands for a subset of Σ) and the pattern is over Σ, so we need the operation "belongs to a set".
Which is the next position of the window? The window is shifted only one cell.

12 Classes in the text: Brute force algorithm
When |Σ| < computer word size, every subset of Σ is represented by a string of bits of length |Σ|. For instance, given the DNA alphabet Σ = {A,C,G,T}:
  I(A) = (1,0,0,0), I(C) = (0,1,0,0), ...
  I(R) = I({G,A}) = (1,0,1,0), ..., I(N) = (1,1,1,1)
Then the operation "A belongs to set X" is made with: I(A) & I(X) > 0
text:    G T A R T R N A G G A ...
pattern: A T G T A
Comparisons for successive windows, written as I(pattern symbol) & I(text symbol) and listed from the last character of each window:
  window G T A R T: I(A) & I(T) > 0
  window T A R T R: I(A) & I(R) > 0, I(T) & I(T) > 0, I(G) & I(R) > 0, I(T) & I(A) > 0
  window A R T R N: I(A) & I(N) > 0, I(T) & I(R) > 0
  ...
Which is the cost?
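The bit-vector membership test above can be sketched in a few lines of Python. This is only an illustrative sketch: the names BASE, CLASSES, I and brute_force are assumptions made for the example, and only the two classes used on the slide (R and N) are included. Since every window is checked against up to m pattern characters, the worst-case cost is O(m·n) for a pattern of length m and a text of length n.

```python
# Bit-vector encoding of subsets of the DNA alphabet, as on the slide:
# one bit per symbol, in the order (A, C, G, T).
BASE = {"A": 0b1000, "C": 0b0100, "G": 0b0010, "T": 0b0001}
CLASSES = {"R": "GA", "N": "AGCT"}  # assumption: only the classes of the example

def I(symbol):
    """Bit vector of a base or of a class symbol."""
    members = CLASSES.get(symbol, symbol)
    mask = 0
    for base in members:
        mask |= BASE[base]
    return mask

def brute_force(text, pattern):
    """Start positions where every pattern symbol belongs to the text class."""
    m = len(pattern)
    hits = []
    for pos in range(len(text) - m + 1):  # the window moves one cell at a time
        if all((I(pattern[i]) & I(text[pos + i])) > 0 for i in range(m)):
            hits.append(pos)
    return hits

print(brute_force("GTARTRNAGGA", "ATGTA"))  # [3]
```

The single hit is the window R T R N A, where A belongs to R, G belongs to R and T belongs to N.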

13 Classes in the text: experimental efficiency (Navarro & Raffinot)
[Figure: map of the most efficient algorithm (Horspool, BOM, BNDM) as a function of the alphabet size |Σ| (ticks 2 to 64) and the pattern length, with w marked on the pattern-length axis. BNDM: Backward Nondeterministic Dawg Matching. BOM: Backward Oracle Matching.]

14 Classes in the text: Horspool algorithm
How is the comparison made? By suffix search.
Which is the next position of the window? If "a" is the last character of the window, shift until the next occurrence of "a" (or "t", "r", ...) in the pattern is aligned with it.
We need a shift table with the extended alphabet.

18 Classes in the text: Horspool example
Given the pattern ATGTA, the shift table is:
  A 4   C 5   G 2   T 1   R 2   N 1
text: G T A R T R N A A G G A ...
      A T G T A
        A T G T A
            A T G T A
                    A T G T A
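A minimal sketch of how such a shift table could be computed and used, assuming the natural rule that a class symbol in the text gets the smallest shift among its members (this reproduces R = 2 and N = 1 above). The names IUPAC, shift_table and horspool are illustrative, not taken from the slides.

```python
# IUPAC classes from the earlier slide; plain bases map to themselves.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
         "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT"}

def shift_table(pattern):
    """Horspool shifts for a plain pattern, extended to class symbols."""
    m = len(pattern)
    base = {c: m for c in "ACGT"}
    for i, c in enumerate(pattern[:-1]):   # the last position is excluded
        base[c] = m - 1 - i                # rightmost occurrence wins
    # A class symbol may stand for any member, so take the safest shift.
    return {sym: min(base[c] for c in members) for sym, members in IUPAC.items()}

def horspool(text, pattern):
    """Start positions where the pattern is compatible with the text."""
    m = len(pattern)
    shift = shift_table(pattern)
    hits, pos = [], 0
    while pos <= len(text) - m:
        if all(pattern[i] in IUPAC[text[pos + i]] for i in range(m)):
            hits.append(pos)
        pos += shift[text[pos + m - 1]]    # shift on the last window symbol
    return hits

table = shift_table("ATGTA")
print({s: table[s] for s in "ACGTRN"})
print(horspool("GTARTRNAAGGA", "ATGTA"))   # [3]
```

On the slide's text the window visits positions 0, 1, 3 and 7, which are the four alignments drawn above.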

20 Classes in the text: BNDM algorithm
How is the comparison made? Search for suffixes of the text window that are factors of the pattern. Once the next character x is read:
  D3 = (D2 << 1) & B(x)
where B(x) is the mask of x in the pattern P. For instance, if B(x) = ( ), then D = ( ) & ( ) = ( ).
Which is the next position of the window? It depends on the value of the leftmost bit of D.

25 Classes in the text: BNDM example
Given the pattern ATGTA, the masks of bits are:
  B(A) = (1 0 0 0 1)    B(R) = (1 0 1 0 1)
  B(C) = (0 0 0 0 0)    ...
  B(G) = (0 0 1 0 0)    B(N) = (1 1 1 1 1)
  B(T) = (0 1 0 1 0)
text: G T A R T R N A G G A C G ...
First window:
  D1 = ( )
  D2 = ( ) & ( ) = ( )
  D2 = ( ) & ( ) = ( )
Next window:
  D1 = ( )
  D2 = ( ) & ( ) = ( )
  D3 = ( ) & ( ) = ( )
  D4 = ( ) & ( ) = ( )
  D5 = ( ) & ( ) = ( )
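The following sketch shows one way BNDM could handle classes in the text: the mask of a class symbol is taken as the OR of the masks of its members. The bit layout (leftmost bit for the first pattern position) and the names CLASSES, build_masks and bndm are assumptions made for the example; the update step is the usual BNDM one, an AND with B(x) followed by a left shift.

```python
# Assumption: only the bases and the two classes used in the example.
CLASSES = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "GA", "N": "AGCT"}

def build_masks(pattern):
    """B(x) for bases, then B(class) as the OR of the masks of its members."""
    m = len(pattern)
    base = {c: 0 for c in "ACGT"}
    for i, c in enumerate(pattern):
        base[c] |= 1 << (m - 1 - i)       # leftmost bit = first pattern position
    masks = {}
    for sym, members in CLASSES.items():
        mask = 0
        for c in members:
            mask |= base[c]
        masks[sym] = mask
    return masks

def bndm(text, pattern):
    """Start positions of the windows compatible with the pattern."""
    m, n = len(pattern), len(text)
    B = build_masks(pattern)
    full, top = (1 << m) - 1, 1 << (m - 1)
    hits, pos = [], 0
    while pos <= n - m:
        j, last, D = m, m, full
        while D:
            D &= B[text[pos + j - 1]]     # read the window right to left
            j -= 1
            if D & top:                   # a prefix of the pattern was recognized
                if j > 0:
                    last = j              # remember it for the next shift
                else:
                    hits.append(pos)      # the whole window matched
            D = (D << 1) & full
        pos += last
    return hits

print(format(build_masks("ATGTA")["R"], "05b"))   # 10101
print(bndm("GTARTRNAGGACG", "ATGTA"))             # [3]
```

Because ATGTA is a palindrome, the masks are the same whichever end of the pattern the leftmost bit is assigned to.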

27 BOM algorithm (Backward Oracle Matching)
How is the comparison made? With an automaton, the Factor Oracle: check if the suffix is a factor.
Which is the next position of the window? The position determined by the last character of the text with a transition in the automaton.

28 Classes in the text: BOM example
Then we build the factor oracle automaton (AFO) of the inverse pattern of ATGTATG
[Figure: factor oracle automaton with transitions labelled G, A, T, ...]
and we try to find it in the text:
text:    G T A R T R N A A T G ...
pattern: A T G T A T G
No improvement is possible!

29 Multiple string matching
[Figure: four maps of the most efficient algorithm (Wu-Manber, SBOM, Ad AC) as a function of the alphabet size |Σ| (ticks 2, 4, 8) and the minimum pattern length lmin, for sets of 5, 10, 100 and 1000 strings.]

30 Classes in the text: Set Horspool algorithm
How is the comparison made? By suffixes, using the trie of all inverse patterns.
Which is the next position of the window?

31 Set Horspool algorithm
Search for ATGTATG, TATG, ATAAT, ATGTG
1. Construct the trie of GTATGTA, GTAT, TAATA and GTGTA
2. Determine lmin = 4
3. Determine the shift table: A 1, C 4 (lmin), G 2, T 1
4. Find the patterns
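Steps 2 and 3 can be sketched as follows, assuming the usual Set Horspool rule: the shift of a character is its smallest distance to the end of any pattern (the last position excluded), capped at lmin. The function name set_horspool_shifts is illustrative.

```python
def set_horspool_shifts(patterns, alphabet="ACGT"):
    """Return (lmin, shift table) for a set of plain patterns."""
    lmin = min(len(p) for p in patterns)
    shift = {c: lmin for c in alphabet}     # default: jump a whole lmin window
    for p in patterns:
        for dist in range(1, lmin):         # distance from the last character
            c = p[-1 - dist]
            shift[c] = min(shift[c], dist)
    return lmin, shift

lmin, shift = set_horspool_shifts(["ATGTATG", "TATG", "ATAAT", "ATGTG"])
print(lmin, shift)   # 4 {'A': 1, 'C': 4, 'G': 2, 'T': 1}
```

C never occurs in the patterns, so it keeps the default shift lmin, as noted on the slide.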

32 Classes in the text: Set Horspool
Search for the patterns ATGTATG, TATG, ATAAT, ATGTG
[Figure: trie of the inverse patterns]
text: ARTGNCTATGTGACA...
No improvement is possible!

34 Classes in the text: SBOM algorithm
How is the comparison made? With an automaton, the Factor Oracle of the inverse patterns of length lmin: check if the suffix is a factor of any pattern.
Which is the next position of the window? The position determined by the last character of the text with a transition in the automaton.

35 Classes in the text: SBOM example
Search for the patterns ATGTATG, TAATG, TAATAAT and AATGTG
[Figure: factor oracle automaton of the inverse patterns, with terminal states numbered 1 to 4]
text: ACATN C TAGC TA TA ATAATGTATG
No improvement is possible!

36 Extended alphabets
Classes in the:    text    pattern
Horspool            ✓
BNDM                ✓
BOM                 ✗
Set-Horspool        ✗
SBOM                ✗

37 Extended search. Second part: Classes in the pattern

38 Classes in the pattern: Brute force algorithm
How is the comparison made? From left to right: prefix search. The text is over Σ and the pattern is over 2^Σ (each pattern symbol stands for a subset of Σ), so we need the operation "belongs to a set".
Which is the next position of the window? The window is shifted only one cell.

39 Classes in the pattern: Brute force algorithm
When |Σ| < computer word size, every subset is represented by a string of bits of length |Σ|. For instance, given the DNA alphabet Σ = {A,C,G,T}:
  I(A) = (1,0,0,0), I(C) = (0,1,0,0), I(R) = (1,0,1,0), ..., I(N) = (1,1,1,1)
Then the operation "A belongs to set X" is made with: I(A) & I(X) > 0
text:    G T A C T A G A G G A C G T A T G T A C T G ...
pattern: A T N T R
Comparisons for successive windows, written as I(text symbol) & I(pattern symbol) and listed from the last character of each window:
  window G T A C T: I(T) & I(R) > 0
  window T A C T A: I(A) & I(R) > 0, I(T) & I(T) > 0, I(C) & I(N) > 0, I(A) & I(T) > 0

41 Classes in the pattern: Horspool algorithm
How is the comparison made? By suffix search.
Which is the next position of the window? If "a" is the last character of the window, shift until the next occurrence of "a" in the pattern is aligned with it.
We need a preprocessing phase to construct the shift table.

47 Classes in the pattern: Horspool example
Given the pattern ATNTR, the shift table is:
  A 2   C 2   G 2   T 1
text: G T A C T A G A T A T G A G ...
Windows tried, in order: A T N T R, A T N T R, A T N T R, A T N T R, A T G T A, A T N T R
Shorter shifts!
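A sketch of the preprocessing for classes in the pattern, assuming that a class symbol at position i counts as an occurrence of each of its members at that position. That is why the N in ATNTR caps the shift of A, C and G at 2. The names CLASSES, shift_table and horspool are illustrative.

```python
# Assumption: only the bases and the two classes used in the example.
CLASSES = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "GA", "N": "AGCT"}

def shift_table(pattern, alphabet="ACGT"):
    """Horspool shifts over the plain alphabet for a pattern with classes."""
    m = len(pattern)
    shift = {c: m for c in alphabet}
    for i, sym in enumerate(pattern[:-1]):   # the last position is excluded
        for c in CLASSES[sym]:               # every member occurs at position i
            shift[c] = m - 1 - i             # later positions overwrite earlier ones
    return shift

def horspool(text, pattern):
    """Start positions where each text character belongs to the pattern class."""
    m = len(pattern)
    shift = shift_table(pattern)
    hits, pos = [], 0
    while pos <= len(text) - m:
        if all(text[pos + i] in CLASSES[pattern[i]] for i in range(m)):
            hits.append(pos)
        pos += shift[text[pos + m - 1]]
    return hits

print(shift_table("ATNTR"))                  # {'A': 2, 'C': 2, 'G': 2, 'T': 1}
print(horspool("GTACTAGATATGAG", "ATNTR"))   # [7]
```

With this table no shift exceeds 2, so the window is tried at six positions on the slide's text (0, 1, 3, 5, 7 and 9).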

49 Classes in the pattern: BNDM algorithm
How is the comparison made? Search for suffixes of the text window that are factors of the pattern. Once the next character x is read:
  D3 = (D2 << 1) & B(x)
where B(x) is the mask of x in the pattern P. For instance, if B(x) = ( ), then D = ( ) & ( ) = ( ).
Which is the next position of the window? It depends on the value of the leftmost bit of D.

54 Classes in the pattern : BNDM example
Given the pattern ATNTR, the masks of bits of symbols are:
  B(A) = ( )   B(C) = ( )   B(G) = ( )   B(T) = ( )
text: G T A C T A G A G G A C G T A T G T A C T G ...
First window:
  D1 = ( )
  D2 = ( ) & ( ) = ( )
  D3 = ( ) & ( ) = ( )
Next window:
  D1 = ( )
  D2 = ( ) & ( ) = ( )
Next window:
  D1 = ( )
  D2 = ( ) & ( ) = ( )
  D3 = ( ) & ( ) = ( )
  D4 = ( ) & ( ) = ( )
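As a hedged illustration, the masks for a pattern with classes can be built by setting, for each position, the bit of every base that the class at that position accepts; the search loop is then plain BNDM. The bit layout (leftmost bit for the first pattern position) and the names CLASSES, build_masks and bndm are assumptions, so the printed masks may be mirrored with respect to the ones on the original slide.

```python
# Assumption: only the bases and the two classes used in the example.
CLASSES = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "GA", "N": "AGCT"}

def build_masks(pattern, alphabet="ACGT"):
    """B(c) has a bit set at each pattern position whose class contains c."""
    m = len(pattern)
    B = {c: 0 for c in alphabet}
    for i, sym in enumerate(pattern):
        for c in CLASSES[sym]:
            B[c] |= 1 << (m - 1 - i)
    return B

def bndm(text, pattern):
    """Start positions of the occurrences of a pattern with classes."""
    m, n = len(pattern), len(text)
    B = build_masks(pattern)
    full, top = (1 << m) - 1, 1 << (m - 1)
    hits, pos = [], 0
    while pos <= n - m:
        j, last, D = m, m, full
        while D:
            D &= B[text[pos + j - 1]]     # read the window right to left
            j -= 1
            if D & top:
                if j > 0:
                    last = j              # a pattern prefix starts at window offset j
                else:
                    hits.append(pos)
            D = (D << 1) & full
        pos += last
    return hits

for c, mask in build_masks("ATNTR").items():
    print(c, format(mask, "05b"))
print(bndm("GTACTAGAGGACGTATGTACTG", "ATNTR"))   # [14]
```

On the slide's text this sketch tries the windows starting at positions 0, 5, 10 and 14 and reports the occurrence A T G T A at position 14.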

56 BOM algorithm (Backward Oracle Matching)
How is the comparison made? With an automaton, the Factor Oracle: check if the suffix is a factor.
Which is the next position of the window? The position determined by the last character of the text with a transition in the automaton.

57 Classes in the pattern: BOM example
Given the pattern ATGTATG, the AFO is
[Figure: factor oracle automaton with transitions labelled G, A, T, ...]
but what about the pattern ATNTRTG? We should apply the SBOM algorithm!

59 Set Horspool algorithm
How is the comparison made? By suffixes, using the trie of all inverse patterns.
Which is the next position of the window? If "a" is the last character of the window, we shift until it is aligned with the first "a" in the trie at a depth not greater than lmin, or we shift by lmin.

60 Set Horspool algorithm
Search for ATNTARG, RTGR, NTTNAR, ATRTG
1. Construct the trie of the 46 possible inverse patterns
2. Determine lmin = 4
3. Determine the shift table: A 1, C 2, G 1, T 1
4. Find the patterns
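The count of 46 and the shift table can be checked with a short sketch that expands every class into its members and then applies the plain Set Horspool rule. The names CLASSES, expand and set_horspool_shifts are assumptions for the example, and the expansion is only practical while the number of combinations stays small.

```python
from itertools import product

# Assumption: only the bases and the two classes used in the example.
CLASSES = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "GA", "N": "AGCT"}

def expand(pattern):
    """All plain strings denoted by a pattern with classes."""
    return ["".join(t) for t in product(*(CLASSES[s] for s in pattern))]

def set_horspool_shifts(patterns, alphabet="ACGT"):
    """lmin and the shift table: smallest distance to a pattern end, capped at lmin."""
    lmin = min(len(p) for p in patterns)
    shift = {c: lmin for c in alphabet}
    for p in patterns:
        for dist in range(1, lmin):
            c = p[-1 - dist]
            shift[c] = min(shift[c], dist)
    return lmin, shift

plain = [s for p in ["ATNTARG", "RTGR", "NTTNAR", "ATRTG"] for s in expand(p)]
inverse = [s[::-1] for s in plain]          # the strings inserted into the trie
lmin, shift = set_horspool_shifts(plain)
print(len(inverse), lmin, shift)   # 46 4 {'A': 1, 'C': 2, 'G': 1, 'T': 1}
```

The four patterns expand to 8, 4, 32 and 2 strings respectively, which gives the 46 of step 1.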

62 SBOM algorithm
How is the comparison made? With an automaton, the Factor Oracle of the inverse patterns of length lmin: check if the suffix is a factor of any pattern.
Which is the next position of the window? The position determined by the last character of the text with a transition in the automaton.

63 Classes in the patterns: SBOM example
Given the patterns ATGNARG, TRATR, TAATAAT and ANTNTGR, the factor oracle automaton (AFO) of all 21 possible patterns is built ...

64 Extended alphabets
Classes in the:    text    pattern
Horspool            ✓        ✓
BNDM                ✓        ✓
BOM                 ✗        ≈
Set-Horspool        ✗        ≈
SBOM                ✗        ≈

67 Bounded length gaps : BNDM example
Given the pattern ATx(2,3)TA, the masks of bits are:
  B(A) = ( )   B(C) = ( )   B(G) = ( )   B(T) = ( )
text: A T A G T A G A G T ...
  D1 = ( )
  D2 = ( ) & ( ) = ( )
  D3 = ( ) & ( ) = ( )
  D4 = ( ) & ( ) = ( ) ?
  D5 = ( ) & ( ) = ( )
  D6 = ( ) & ( ) = ( )

68 Bounded length gaps : BNDM example
Given the pattern ATx(2,3)TA
text: A T A G T A G A G T ...
  D1 = ( )
  D2 = ( ) & ( ) = ( )
  D3 = ( ) & ( ) = ( ) → ( )
  D4 = ( ) & ( ) = ( )
  D5 = ( ) & ( ) = ( ) ?
  D6 = ( ) & ( ) = ( )
Let's see the automaton:
[Figure: automaton for A T x(2,3) T A, where * denotes any character; the legend marks the ε-transitions and the states in the masks F and I]
After each step the ε-transitions are applied with:
  D ← D | ( [F − (I & D)] & ¬F )
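To make the ε-closure formula concrete, here is a minimal sketch that uses a forward Shift-And scan instead of the backward BNDM scan of the slides, since the closure step is the same in both. It handles a single gap x(lo,hi) between a non-empty prefix and a suffix. The function name gap_search, the bit layout (bit k for the k-th position of the expanded pattern) and the placement of the I and F bits are assumptions made for the example: I marks the state just before the gap and F the state just after the last optional gap position, so that F − (I & D) switches on the optional states whenever the I state is active.

```python
def gap_search(text, prefix, lo, hi, suffix, alphabet="ACGT"):
    """End positions of the occurrences of prefix x(lo,hi) suffix (forward scan)."""
    symbols = list(prefix) + ["*"] * hi + list(suffix)   # '*' accepts any character
    B = {c: 0 for c in alphabet}
    for k, s in enumerate(symbols):
        for c in alphabet:
            if s == "*" or s == c:
                B[c] |= 1 << k
    I = 1 << (len(prefix) - 1)            # state from which the ε-transitions leave
    F = 1 << (len(prefix) + hi - lo)      # one past the last state reachable by ε
    accept = 1 << (len(symbols) - 1)
    D, ends = 0, []
    for pos, c in enumerate(text):
        D = ((D << 1) | 1) & B[c]         # ordinary Shift-And step
        D |= (F - (D & I)) & ~F           # ε-closure: D <- D | ((F - (I & D)) & ~F)
        if D & accept:
            ends.append(pos)
    return ends

print(gap_search("ATAGTAGAGT", "AT", 2, 3, "TA"))   # [5]
```

On the slide's text the only occurrence is A T, a gap of two characters (A G), then T A, ending at index 5 (counting from 0).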

