Presentation is loading. Please wait.

Presentation is loading. Please wait.

Suffix trees. Trie A tree representing a set of strings. a c b c e e f d b f e g { aeef ad bbfe bbfg c }

Similar presentations


Presentation on theme: "Suffix trees. Trie A tree representing a set of strings. a c b c e e f d b f e g { aeef ad bbfe bbfg c }"— Presentation transcript:

1 Suffix trees

2 Trie A tree representing a set of strings. a c b c e e f d b f e g { aeef ad bbfe bbfg c }

3 Trie (Cont) Assume no string is a prefix of another a c b c e e f d b f e g Each edge is labeled by a letter, no two edges outgoing from the same node are labeled the same. Each string corresponds to a leaf.

4 Compressed Trie Compress unary nodes, label edges by strings a c b c e e f d b f e g a c bbf c eef d e g 

5 Suffix tree Given a string s a suffix tree of s is a compressed trie of all suffixes of s To make these suffixes prefix-free we add a special character, say $, at the end of s

6 Suffix tree (Example) Let s=abab, a suffix tree of s is a compressed trie of all suffixes of s=abab$ { $ b$ ab$ bab$ abab$ } a b a b $ a b $ b $ $ $

7 Trivial algorithm to build a Suffix tree Put the largest suffix in Put the suffix bab$ in a b a b $ a b a b $ a b $ b

8 Put the suffix ab$ in a b a b $ a b $ b a b a b $ a b $ b $

9 Put the suffix b$ in a b a b $ a b $ b $ a b a b $ a b $ b $ $

10 Put the suffix $ in a b a b $ a b $ b $ $ a b a b $ a b $ b $ $ $

11 We will also label each leaf with the starting point of the corres. suffix. a b a b $ a b $ b $ $ $ 1 2 a b a b $ a b $ b 3 $ 4 $ 5 $

12 Analysis Takes O(n 2 ) time to build. We will see how to do it in O(n) time

13 What can we do with it ? Exact string matching: Given a Text T, |T| = n, preprocess it such that when a pattern P, |P|=m, arrives you can quickly decide when it occurs in T. We may also want to find all occurrences of P in T

14 Exact string matching In preprocessing we just build a suffix tree in O(n) time 1 2 a b a b $ a b $ b 3 $ 4 $ 5 $ Given a pattern P = ab we traverse the tree according to the pattern.

15 1 2 a b a b $ a b $ b 3 $ 4 $ 5 $ If we did not get stuck traversing the pattern then the pattern occurs in the text. Each leaf in the subtree below the node we reach corresponds to an occurrence. By traversing this subtree we get all k occurrences in O(n+k) time

16 Generalized suffix tree Given a set of strings S a generalized suffix tree of S is a compressed trie of all suffixes of s  S To associate each suffix with a unique string in S add a different special char to each s

17 Generalized suffix tree (Example) Let s 1 =abab and s 2 =aab here is a generalized suffix tree for s 1 and s 2 { $ # b$ b# ab$ ab# bab$ aab# abab$ } 1 2 a b a b $ a b $ b 3 $ 4 $ 5 $ 1 b # a 2 # 3 # 4 #

18 So what can we do with it ? Matching a pattern against a database of strings

19 Longest common substring (of two strings) Every node with a leaf descendant from string s 1 and a leaf descendant from string s 2 represents a maximal common substring and vice versa. 1 2 a b a b $ a b $ b 3 $ 4 $ 5 $ 1 b # a 2 # 3 # 4 # Find such node with largest “string depth”

20 Lowest common ancetors A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it

21 Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes 1 2 a b a b $ a b $ b 3 $ 4 $ 5 $ 1 b # a 2 # 3 # 4 #

22 Finding maximal palindromes A palindrome: caabaac, cbaabc Want to find all maximal palindromes in a string s Let s = cbaaba The maximal palindrome with center between i-1 and i is the LCP of the suffix at position i of s and the suffix at position m-i+1 of s r

23 Maximal palindromes algorithm Prepare a generalized suffix tree for s = cbaaba$ and s r = abaabc# For every i find the LCA of suffix i of s and suffix m-i+1 of s r

24 3 a abab a baaba$baaba$ b 3 $ 7 $ b 7 # c 1 6 abab c#c# 5 2 2 a$a$ c#c# a 5 6 $ 4 4 1 c#c# a$a$ $ abc#abc# c#c# Let s = cbaaba$ then s r = abaabc#

25 Analysis O(n) time to identify all palindromes

26 Can we construct a suffix tree in linear time ?

27 Ukkonen’s linear time construction ACTAATC A 1 A

28 A 1 AC

29 ACTAATC AC 1

30 ACTAATC AC 1 2 C

31 ACTAATC AC 1 ACT 2 C

32 ACTAATC ACT 1 2 C

33 ACTAATC ACT 1 2 CT

34 ACTAATC ACT 1 2 CT 3 T

35 ACTAATC ACT 1 ACTA 2 CT 3 T

36 ACTAATC ACTA 1 2 CT 3 T

37 ACTAATC ACTA 1 2 CTA 3 T

38 ACTAATC ACTA 1 2 CTA 3 TA

39 ACTAATC ACTA 1 ACTAA 2 CTA 3 TA

40 ACTAATC 1 ACTAA 2 CTA 3 TA ACTAA

41 ACTAATC ACTAA 1 2 CTAA 3 TA

42 ACTAATC ACTAA 1 2 CTAA 3 TAA

43 ACTAATC CTAA 1 ACTAA 2 CTAA 3 TAA A 4 A

44 Phases & extensions Phase i is when we add character i i In phase i we have i extensions of suffixes

45 ACTAATC CTAA 1 ACTAAT 2 CTAA 3 TAA A 4 A

46 ACTAATC CTAAT 1 ACTAAT 2 CTAA 3 TAA A 4 A

47 ACTAATC CTAAT 1 ACTAAT 2 CTAAT 3 TAA A 4 A

48 ACTAATC CTAAT 1 ACTAAT 2 CTAAT 3 TAAT A 4 A

49 ACTAATC CTAAT 1 ACTAAT 2 CTAAT 3 TAAT A 4 AT

50 ACTAATC CTAAT 1 ACTAAT 2 CTAAT 3 TAAT A 4 AT 5 T

51 Extension rules Rule 1: The suffix ends at a leaf, you add a character on the edge entering the leaf Rule 2: The suffix ended internally and the extended suffix does not exist, you add a leaf and possibly an internal node Rule 3: The suffix exists and the extended suffix exists, you do nothing

52 ACTAATC 1 2 CTAAT 3 TAAT A 4 AT 5 T CTAAT

53 ACTAATC CTAATC 1 ACTAATC 2 CTAAT 3 TAAT A 4 AT 5 T

54 ACTAATC CTAATC 1 ACTAATC 2 CTAATC 3 TAAT A 4 AT 5 T

55 ACTAATC CTAATC 1 ACTAATC 2 CTAATC 3 TAATC A 4 AT 5 T

56 ACTAATC CTAATC 1 ACTAATC 2 CTAATC 3 TAATC A 4 ATC 5 T

57 ACTAATC CTAATC 1 ACTAATC 2 CTAATC 3 TAATC A 4 ATC 5 TC

58 ACTAATC CTAATC 1 ACTAATC 2 CTAATC 3 T A 4 ATC 5 TC AATC 6 C

59 1 2 C 3 T A 4 ATCAC 5 TCAC AATCAC 6 CAC ACTAATCAC TAATCAC 7 AC CTAATCAC Skip forward..

60 1 2 C 3 T A 4 ATCAC 5 TCAC AATCAC 6 CAC ACTAATCACT TAATCAC 7 AC CTAATCAC

61 1 2 C 3 T A 4 ATCAC 5 TCAC AATCAC 6 CAC ACTAATCACT TAATCAC 7 AC CTAATCACT

62 1 2 C 3 T A 4 ATCAC 5 TCAC AATCAC 6 CAC ACTAATCACT TAATCACT 7 AC CTAATCACT

63 1 2 C 3 T A 4 ATCAC 5 TCAC AATCACT 6 CAC ACTAATCACT TAATCACT 7 AC CTAATCACT

64 1 2 C 3 T A 4 ATCACT 5 TCAC AATCACT 6 CAC ACTAATCACT TAATCACT 7 AC CTAATCACT

65 1 2 C 3 T A 4 ATCACT 5 TCACT AATCACT 6 CAC ACTAATCACT TAATCACT 7 AC CTAATCACT

66 1 2 C 3 T A 4 ATCACT 5 TCACT AATCACT 6 CACT ACTAATCACT TAATCACT 7 AC CTAATCACT

67 1 2 C 3 T A 4 ATCACT 5 TCACT AATCACT 6 CACT ACTAATCACT TAATCACT 7 ACT CTAATCACT

68 1 2 C 3 T A 4 ATCACT 5 TCACT AATCACT 6 CACT ACTAATCACTG TAATCACT 7 ACT CTAATCACT

69 1 2 C 3 T A 4 ATCACT 5 TCACT AATCACT 6 CACT ACTAATCACTG TAATCACT 7 ACT CTAATCACTG

70 1 2 C 3 T A 4 ATCACT 5 TCACT AATCACT 6 CACT ACTAATCACTG TAATCACTG 7 ACT CTAATCACTG

71 1 2 C 3 T A 4 ATCACT 5 TCACT AATCACTG 6 CACT ACTAATCACTG TAATCACTG 7 ACT CTAATCACTG

72 1 2 C 3 T A 4 ATCACTG 5 TCACT AATCACTG 6 CACT ACTAATCACTG TAATCACTG 7 ACT CTAATCACTG

73 1 2 C 3 T A 4 ATCACTG 5 TCACTG AATCACTG 6 CACT ACTAATCACTG TAATCACTG 7 ACT CTAATCACTG

74 1 2 C 3 T A 4 ATCACTG 5 TCACTG AATCACTG 6 CACTG ACTAATCACTG TAATCACTG 7 ACT CTAATCACTG

75 1 2 C 3 T A 4 ATCACTG 5 TCACTG AATCACTG 6 CACTG ACTAATCACTG TAATCACTG 7 ACTG CTAATCACTG

76 CT 1 C 3 T A 4 ATCACTG 5 TCACTG AATCACTG 6 CACTG ACTAATCACTG 7 ACTG AATCACTG 8 G 2 TAATCACTG

77 CT 12 C 3 T A 4 ATCACTG 5 TCACTG AATCACTG 6 CACTG ACTAATCACTG T 7 ACTG AATCACTG 8 G 9 G

78 CT 12 C 3 T A 4 ATCACTG 5 TCACTG AATCACTG 6 CACTG ACTAATCACTG T 7 ACTG AATCACTG 8 G 9 G 10 G 11 G

79 Observations At the first extension we must end at a leaf because no longer suffix exists (rule 1) i At the second extension we still most likely to end at a leaf. i We will not end at a leaf only if the second suffix is a prefix of the first

80 Say at some extension we do not end at a leaf i Is there a way to continue using i th character ? (Is it a prefix of a suffix where the next character is the i th character ?) Rule 3Rule 2 Then this suffix is a prefix of some other suffix (suffixes) We will not end at a leaf in subsequent extensions

81 Rule 3Rule 2 If we apply rule 3 then in all subsequent extensions we will apply rule 3 Otherwise we keep applying rule 2 until in some subsequent extensions we will apply rule 3 Rule 3

82 In terms of the rules that we apply a phase looks like: 1 1 1 1 1 1 1 2 2 2 2 3 3 3 3 We have nothing to do when applying rule 3, so once rule 3 happens we can stop We don’t really do anything significant when we apply rule 1 (the structure of the tree does not change)

83 Representation We do not really store a substring with each edge, but rather pointers into the starting position and ending position of the substring in the text With this representaion we do not really have to do anything when rule 1 applies

84 How do phases relate to each other 1 1 1 1 1 1 1 2 2 2 2 3 3 3 3 i The next phase we must have: 1 1 1 1 1 1 1 1 1 1 1 2/3 So we start the phase with the extension that was the first where we applied rule 3 in the previous phase

85 Suffix Links CT 12 C 3 T A 4 ATCACTG 5 TCACTG AATCACTG 6 CACTG ACTAATCACTG T 7 ACTG AATCACTG 8 G 9 G 10 G 11 G

86 CT 12 C 3 T A 4 ATCACTG 5 TCACTG AATCACTG 6 CACTG ACTAATCACTG T 7 ACTG AATCACTG 8 G 9 G 10 G 11 G

87 CT 12 C 3 T A 4 ATCACTG 5 TCACTG AATCACTG 6 CACTG ACTAATCACTG T 7 ACTG AATCACTG 8 G 9 G 10 G 11 G

88 CT 12 C 3 T A 4 ATCACTG 5 TCACTG AATCACTG 6 CACTG ACTAATCACTG T 7 ACTG AATCACTG 8 G 9 G 10 G 11 G

89 CT 12 C 3 T A 4 ATCACTG 5 TCACTG AATCACTG 6 CACTG ACTAATCACTG T 7 ACTG AATCACTG 8 G 9 G 10 G 11 G

90 Suffix Links From an internal node that corresponds to the string aβ to the internal node that corresponds to β (if there is such node) aβaβ β

91 Is there such a node ? v aβaβ Suppose we create v applying rule 2. Then there was a suffix aβx… and now we add aβy So there was a suffix βx… x.. β y

92 Is there such a node ? v aβaβ Suppose we create v applying rule 2. Then there was a suffix aβx… and now we add aβy So there was a suffix βx… x.. β If there was also a suffix βz… Then a node corresponding to β is there y x..z..

93 Is there such a node ? v aβaβ Suppose we create v applying rule 2. Then there was a suffix aβx… and now we add aβy So there was a suffix βx… x.. β If there was also a suffix βz… Then a node corresponding to β is there y x..y Otherwise it will be created in the next extension when we add βy

94 Inv: All suffix links are there except (possibly) of the last internal node added You are at the (internal) node corresponding to the last extension You start a phase at the last internal node of the first extension in which you applied rule 3 in the previous iteration i Remember: we apply rule 2 i

95 1) Go up one node (if needed) to find a suffix link 2) Traverse the suffix link 3) If you went up in step 1 along an edge that was labeled δ then go down consuming a string δ

96 v δ x.. δ y aβaβ β Create the new internal node if necessary

97 v δ x.. δ y aβaβ β Create the new internal node if necessary

98 v δ x.. δ y aβaβ β Create the new internal node if necessary, add the suffix y

99 v δ x.. δ y aβaβ β Create the new internal node if necessary, add the suffix and install a suffix link if necessary y

100 Analysis Handling all extensions of rule 1 and all extensions of rule 3 per phase take O(1) time  O(n) total How many times do we carry out rule 2 in all phases ? O(n) Does each application of rule 2 takes constant time ? No ! (going up and traversing the suffix link takes constant time, but then we go down possibly on many edges..)

101 v δ x.. δ y aβaβ β y

102 So why is it a linear time algorithm ? How much can the depth change when we traverse a suffix link ? It can decrease by at most 1

103 Punch line Each time we go up or traverse a suffix link the depth decreases by at most 1 When starting the depth is 0, final depth is at most n So during all applications of rule 2 together we cannot go down more than 3n times THM: The running time of Ukkonen’s algorithm is O(n)

104 Another application Suppose we have a pattern P and a text T and we want to find for each position of T the length of the longest substring of P that matches there. How would you do that in O(n) time ?

105

106 Drawbacks of suffix trees Suffix trees consume a lot of space It is O(n) but the constant is quite big Notice that if we indeed want to traverse an edge in O(1) time then we need an array of ptrs. of size |Σ| in each node

107 Suffix array We loose some of the functionality but we save space. Let s = abab Sort the suffixes lexicographically: ab, abab, b, bab The suffix array gives the indices of the suffixes in sorted order 3142

108 How do we build it ? Build a suffix tree Traverse the tree in DFS, lexicographically picking edges outgoing from each node and fill the suffix array. O(n) time

109 How do we search for a pattern ? If P occurs in T then all its occurrences are consecutive in the suffix array. Do a binary search on the suffix array Takes O(mlogn) time

110 Example Let S = mississippi i ippi issippi ississippi mississippi pi 8 5 2 1 10 9 7 4 11 6 3 ppi sippi sisippi ssippi ssissippi L R Let P = issa M

111 How do we accelerate the search ? L R Maintain l = LCP(P,L) Maintain r = LCP(P,R) M If l = r then start comparing M to P at l + 1 r

112 How do we accelerate the search ? L R Suppose we know LCP(L,M) If LCP(L,M) < l we go left If LCP(L,M) > l we go right If LCP(L,M) = l we start comparing at l + 1 M If l > r then r

113 Analysis of the acceleration If we do more than a single comparison in an iteration then max( l, r ) grows by 1 for each comparison  O(logn + m) time


Download ppt "Suffix trees. Trie A tree representing a set of strings. a c b c e e f d b f e g { aeef ad bbfe bbfg c }"

Similar presentations


Ads by Google