Information Retrieval (Recuperació de la informació)


1 Information Retrieval (Recuperació de la informació)
References:
Modern Information Retrieval (1999), Ricardo Baeza-Yates and Berthier Ribeiro-Neto
Flexible Pattern Matching in Strings (2002), Gonzalo Navarro and Mathieu Raffinot
Algorithms on strings (2001), M. Crochemore, C. Hancart and T. Lecroq
http://www-igm.univ-mlv.fr/~lecroq/string/index.html

2 String Matching
String matching: definition of the problem (text, pattern). The approach depends on what we have in advance: the text or the patterns.
Exact matching:
  The patterns ---> Data structures for the patterns
    1 pattern ---> The algorithm depends on |p| and |Σ|
    k patterns ---> The algorithm depends on k, |p| and |Σ|
  The text ---> Data structure for the text (suffix tree, ...)
Approximate matching:
  Dynamic programming
  Sequence alignment (pairwise and multiple)
  Sequence assembly: hash algorithm
Probabilistic search: Hidden Markov Models
Extensions: Regular expressions

3 Index
Part 1: Suffix trees
  Algorithms on strings, trees and sequences, Dan Gusfield, Cambridge University Press
Part 2: Suffix arrays
  Suffix arrays: a new method for on-line string searches, U. Manber and G. Myers

4 Suffix trees
Given string ababaas, its suffixes are:
1: ababaas
2: babaas
3: abaas
4: baas
5: aas
6: as
7: s
Suffix tree (each leaf shows the remaining edge label and the start position of its suffix):
root
  a
    ba
      baas,1
      as,3
    as,5
    s,6
  ba
    baas,2
    as,4
  s,7
What kind of queries?

5 Applications of Suffix trees
1. Exact string matching
Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?
[Figure: the suffix tree of ababaas.]
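The query on this slide can be tried out with a small sketch. It assumes an uncompressed suffix trie stored as nested dictionaries rather than the compressed suffix tree drawn on the slides, and the function names are hypothetical. A pattern occurs in the text exactly when it can be spelt from the root.

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` into a nested-dict trie (quadratic size)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Return True if `pattern` can be spelt from the root of the trie."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("ababaas")
results = {p: contains(trie, p) for p in ("abab", "aab", "ab")}
print(results)
```

For ababaas this reports that abab and ab occur and that aab does not. Each query costs time proportional to the pattern length, independent of the text length.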

6 Quadratic insertion algorithm
Given the string …
Invariant properties:
P1: the leaves of the suffixes from [marked part of the string; symbol lost] have been inserted in the suffix tree.
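A minimal sketch of quadratic insertion, under stated assumptions: the text ends with an end marker that occurs nowhere else (so no suffix is a prefix of another), edge labels are stored as explicit strings, and the class and function names are hypothetical. Each suffix is inserted by walking down from the root, which gives quadratic time overall.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}  # first character of edge label -> (label, child)
        self.leaf = leaf    # 1-based start position for leaves, else None


def insert_suffix(root, text, start):
    """Walk down from the root and add the leaf for text[start:]."""
    node, rest = root, text[start:]
    while True:
        first = rest[0]
        if first not in node.children:
            node.children[first] = (rest, Node(leaf=start + 1))
            return
        label, child = node.children[first]
        k = 0  # length of the common prefix of `rest` and the edge label
        while k < len(label) and k < len(rest) and label[k] == rest[k]:
            k += 1
        if k == len(label):       # whole edge matched: continue below it
            node, rest = child, rest[k:]
            continue
        mid = Node()              # mismatch inside the edge: split it
        mid.children[label[k]] = (label[k:], child)
        mid.children[rest[k]] = (rest[k:], Node(leaf=start + 1))
        node.children[first] = (label[:k], mid)
        return


def build_suffix_tree(text):
    root = Node()
    for start in range(len(text)):
        insert_suffix(root, text, start)
    return root


def leaves(node):
    """Collect the suffix start positions below `node`."""
    if node.leaf is not None:
        return [node.leaf]
    found = []
    for _, child in node.children.values():
        found.extend(leaves(child))
    return found


tree = build_suffix_tree("ababaas")
root_labels = sorted(label for label, _ in tree.children.values())
print(root_labels)
```

On ababaas the root ends up with the three edges a, ba and s, and the leaves below the edge a are the suffixes starting at positions 1, 3, 5 and 6.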

7-21 Quadratic insertion algorithm: example
Given the string ababaabbs.
[Figure sequence, slides 7 to 21: the suffixes are inserted one at a time. The tree starts as the single leaf ababaabbs,1 and then gains the leaf babaabbs,2. Inserting suffix 3 splits the first edge into aba with children baabbs,1 and abbs,3. Inserting suffix 4 splits the second edge into ba with children baabbs,2 and abbs,4. The later frames add the leaves abbs,5, bs,6 and bs,7 in the same way, splitting an edge wherever the new suffix diverges from an existing label.]

22 Generalized suffix tree
The suffix tree of many strings is called the generalized suffix tree, and it is the suffix tree of the concatenation of the strings.
For instance, the generalized suffix tree of ababaabb and aabaat is the suffix tree of ababaabbαaabaatβ, where α and β are distinct end markers that occur in neither string.

23-36 Generalized suffix tree: construction
Construction of the suffix tree of ababaabbαaabaaβ, given the suffix tree of ababaabbα.
[Figure sequence, slides 23 to 36: starting from the suffix tree of ababaabbα, the suffixes of aabaaβ are inserted one at a time, adding the leaves aaβ,1, β,2, β,3 and β,4 and splitting edges where needed. The last frame shows the generalized suffix tree of ababaabbαaabaaβ.]

37 Applications of Generalized Suffix trees
1. The substring problem for a database of strings DB
Does the DB contain any occurrence of the patterns abab, aab, and ab?
[Figure: the generalized suffix tree of ababaabbαaabaaβ.]
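The substring problem can be sketched as follows. This is a simplification, not the construction shown on the slides: it uses an uncompressed trie and, instead of concatenating the strings with end markers, it tags every node with the indices of the strings whose suffixes pass through it. The function names are hypothetical.

```python
def build_generalized_trie(strings):
    """Suffix trie over all strings; each node records which strings reach it."""
    root = {"ids": set(), "next": {}}
    for sid, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["next"].setdefault(ch, {"ids": set(), "next": {}})
                node["ids"].add(sid)
    return root


def strings_containing(trie, pattern):
    """Return the indices of the database strings that contain `pattern`."""
    node = trie
    for ch in pattern:
        node = node["next"].get(ch)
        if node is None:
            return set()
    return set(node["ids"])


db = ["ababaabb", "aabaa"]
trie = build_generalized_trie(db)
hits = {p: strings_containing(trie, p) for p in ("abab", "aab", "ab")}
print(hits)
```

With the two strings used on the slides, abab is found only in the first string, while aab and ab are found in both.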

38 Applications of Generalized Suffix trees
2. The longest common substring of two strings
[Figure: the generalized suffix tree of ababaabbαaabaaβ.]
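One way to read the longest common substring off such a structure is to look for the deepest node reached by suffixes of both strings. The sketch below assumes an uncompressed trie with string-index tags on the nodes (a simplification of the generalized suffix tree); the function name is hypothetical.

```python
def longest_common_substring(a, b):
    """Deepest trie node that is reached by suffixes of both strings."""
    root = {"ids": set(), "next": {}}
    for sid, s in enumerate((a, b)):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["next"].setdefault(ch, {"ids": set(), "next": {}})
                node["ids"].add(sid)

    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for ch, child in node["next"].items():
            if child["ids"] == {0, 1}:  # only descend through shared nodes
                stack.append((child, path + ch))
    return best


lcs = longest_common_substring("ababaabb", "aabaa")
print(lcs)
```

For ababaabb and aabaa the result is abaa. When several common substrings share the maximum length, this sketch returns one of them.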

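39 Definition of MUM
MUM: Maximal Unique Matching. A MUM is a match between two sequences that is unique (it occurs exactly once in each sequence) and maximal (it cannot be extended to the left or to the right).
[Figure: two sequences, … a a t g … c t g … and … c g t g … c c c …, drawn one above the other.]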
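The definition can be checked directly with a brute-force sketch. It does not use a suffix tree and is far slower than a tree-based method; it only spells out the three conditions. The function names and the two short DNA-like strings are made up for illustration.

```python
def count_occurrences(text, sub):
    """Count occurrences of `sub` in `text`, overlapping ones included."""
    count, pos = 0, text.find(sub)
    while pos != -1:
        count += 1
        pos = text.find(sub, pos + 1)
    return count


def find_mums(a, b):
    """Return (position in a, position in b, match) triples, 0-based."""
    mums = []
    for i in range(len(a)):
        for j in range(i + 1, len(a) + 1):
            sub = a[i:j]
            # Unique: exactly one occurrence in each sequence.
            if count_occurrences(a, sub) != 1 or count_occurrences(b, sub) != 1:
                continue
            k = b.find(sub)
            # Maximal: the neighbouring characters differ (or do not exist).
            left = i > 0 and k > 0 and a[i - 1] == b[k - 1]
            right = j < len(a) and k + len(sub) < len(b) and a[j] == b[k + len(sub)]
            if not left and not right:
                mums.append((i, k, sub))
    return mums


mums = find_mums("ttacgtg", "ggacgtc")
print(mums)
```

Here acgt is the only MUM: shorter shared pieces such as cg are unique but can still be extended, and single letters such as t or g occur more than once.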

40 Applications of Generalized Suffix trees
3. Finding MUMs.
[Figure: the generalized suffix tree of ababaabbαaabaaβ.]

41 Quadratic insertion algorithm
Given the string …
Invariant properties:
P1: the leaves of the suffixes from [marked part of the string; symbol lost] have been inserted in the suffix tree.

42 Linear insertion algorithm
Given the string …
Invariant properties:
P1: the leaves of the suffixes from [marked part of the string; symbol lost] have been inserted in the suffix tree.
P2: the string [symbol lost] is the longest string that can be spelt through the tree.

43-63 Linear insertion algorithm: example
Given the string ababaababb...
[Figure sequence, slides 43 to 63: the tree initially holds the leaves of suffixes 1 to 5. The frames then add the leaves b...,6, b...,7, b...,8 and b...,9, splitting edges where needed, while positions 6 to 9 of the string are marked in turn.]

64-72 Linear insertion algorithm
Given the string ababaababs.
[Figure sequence, slides 64 to 72: the tree starts as the single leaf ababaababs,1 and gains the leaf babaababs,2. The first edge is then split into aba with children baababs,1 and ababs,3, the second into ba with children baababs,2 and ababs,4, and the last frame adds the leaf ababs,5.]

73 Index
Part 1: Suffix trees
  Algorithms on strings, trees and sequences, Dan Gusfield, Cambridge University Press
Part 2: Suffix arrays
  Suffix arrays: a new method for on-line string searches, U. Manber and G. Myers

74 Suffix arrays
Given string ababaa#, its suffixes are:
1: ababaa#
2: babaa#
3: abaa#
4: baa#
5: aa#
6: a#
7: #
Suffix array: the same suffixes, but lexicographically sorted (taking # as smaller than every letter):
7: #
6: a#
5: aa#
3: abaa#
1: ababaa#
4: baa#
2: babaa#
What is the cost? O(n log(n))
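A minimal sketch of building the suffix array by sorting, assuming plain character order (in which # sorts before the lowercase letters) and a hypothetical function name. Note that this naive version compares whole suffixes, so each of the O(n log n) comparisons can itself take up to n character steps; reaching O(n log n) overall needs a more careful construction than the one shown here.

```python
def build_suffix_array(text):
    """Return the 1-based start positions of the suffixes in sorted order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


text = "ababaa#"
suffix_array = build_suffix_array(text)
for position in suffix_array:
    print(position, text[position - 1:])
```

For ababaa# the positions come out as 7, 6, 5, 3, 1, 4, 2.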

75 Applications of suffix arrays
1. Exact string matching
Does the sequence ababaa# contain any occurrence of the patterns abab, aab, and ab?
Binary search over the sorted suffixes.
[Figure: the suffix array of ababaa#.]

76-77 Search with cost O(log(n) |P|)
Invariant properties:
P1: α < query ≤ β, where α and β are the suffixes at the two ends of the current interval of the suffix array (positions 1 2 … n).
Algorithm: let γ be the suffix in the middle of the interval.
If γ < query then α = γ else β = γ
Cost: O(log(n) |P|)
Can it be improved to O(log(n)+|P|)?
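The binary search can be sketched as below. It assumes a suffix array built by plain sorting and reports 1-based positions to match the slides; the function names are hypothetical. Each of the O(log n) steps compares at most |P| characters, which gives the stated cost.

```python
def build_suffix_array(text):
    """0-based start positions of the suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted 1-based positions where `pattern` occurs in `text`."""
    m = len(pattern)
    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[i] + 1 for i in range(first, lo))


text = "ababaa#"
sa = build_suffix_array(text)
found = {p: find_occurrences(text, sa, p) for p in ("abab", "aab", "ab")}
print(found)
```

All suffixes that start with the pattern are adjacent in the suffix array, so the two bounds delimit every occurrence at once: abab at position 1, ab at positions 1 and 3, and no occurrence of aab.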

78-79 Fast search with cost O(log(n)+|P|)
Invariant properties:
P1: α < query ≤ β
P2: [symbol lost] matches pref(query)
Algorithm (x and y are labels in the slide figure):
If x < y then α = γ
x > y then β = γ
x = y then …
fi
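One simple way to avoid re-reading the matched prefix is sketched below. It is an assumption about the idea, not the slide's algorithm: it remembers how many characters of the query match the suffixes at both ends of the interval and starts each new comparison after the shorter of the two. The worst-case O(log(n)+|P|) bound needs additional precomputed longest-common-prefix values between suffixes, which this sketch omits. The function names are hypothetical and the text is assumed to be non-empty.

```python
def compare(text, start, pattern, k):
    """Compare pattern with text[start:], skipping k characters known to match.

    Returns (matched_length, order): order is 0 if pattern is a prefix of the
    suffix, -1 if pattern sorts before the suffix, +1 if it sorts after.
    """
    while k < len(pattern) and start + k < len(text) and text[start + k] == pattern[k]:
        k += 1
    if k == len(pattern):
        return k, 0
    if start + k == len(text) or text[start + k] < pattern[k]:
        return k, 1
    return k, -1


def contains(text, sa, pattern):
    lo, hi = 0, len(sa) - 1
    l, order = compare(text, sa[lo], pattern, 0)
    if order <= 0:
        return order == 0   # found, or pattern sorts before every suffix
    r, order = compare(text, sa[hi], pattern, 0)
    if order >= 0:
        return order == 0   # found, or pattern sorts after every suffix
    # Invariant: suffix(lo) < pattern < suffix(hi); every suffix in between
    # shares its first min(l, r) characters with the pattern.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        k, order = compare(text, sa[mid], pattern, min(l, r))
        if order == 0:
            return True
        if order > 0:
            lo, l = mid, k
        else:
            hi, r = mid, k
    return False


text = "ababaa#"
sa = sorted(range(len(text)), key=lambda i: text[i:])
answers = {p: contains(text, sa, p) for p in ("abab", "aab", "ab")}
print(answers)
```

Skipping min(l, r) characters is safe because the suffixes are sorted: anything between two suffixes that both share a prefix with the query must share that prefix too.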

