Suffix trees.


1 Suffix trees

2 Trie A tree representing a set of strings, for example { aeef, ad, bbfe, bbfg, c }. [Figure: the trie for this set.]

3 Trie (Cont) Assume no string is a prefix of another.
Each edge is labeled by a letter, and no two edges outgoing from the same node are labeled the same. Each string corresponds to a leaf. [Figure: the trie, one letter per edge.]
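A minimal sketch of the trie described above, assuming nested Python dictionaries as the node representation (one key per outgoing edge letter). The helper names `build_trie` and `leaves` are illustrative and not from the slides.

```python
def build_trie(words):
    """Insert each word letter by letter; a node is a dict {letter: child}."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            # setdefault guarantees at most one edge per letter out of a node
            node = node.setdefault(ch, {})
    return root


def leaves(node, prefix=""):
    """Spell out the string that leads to every leaf, in lexicographic order."""
    if not node:
        return [prefix]
    found = []
    for ch in sorted(node):
        found.extend(leaves(node[ch], prefix + ch))
    return found


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(leaves(trie))  # ['ad', 'aeef', 'bbfe', 'bbfg', 'c']
```

Because no string in the set is a prefix of another, every string ends at a leaf, which is why reading off the leaves recovers the set.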

4 Compressed Trie Compress unary nodes and label edges by strings. [Figure: the compressed trie, with edges a, bbf and c at the root, d and eef below a, and e and g below bbf.]
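One possible way to compress unary nodes, sketched on a dictionary-based trie. It assumes the set is prefix-free (as the slides do), so a node with a single child never marks the end of a string. The names `build_trie` and `compress` are illustrative.

```python
def build_trie(words):
    """Plain trie as nested dicts {letter: child}."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Merge every chain of unary nodes into one edge labeled by a string."""
    out = {}
    for ch, child in node.items():
        label = ch
        while len(child) == 1:              # unary node: swallow its only edge
            (nxt, grandchild), = child.items()
            label += nxt
            child = grandchild
        out[label] = compress(child)
    return out


compressed = compress(build_trie(["aeef", "ad", "bbfe", "bbfg", "c"]))
print(compressed)
```

After compression every internal node has at least two children, so a tree with k leaves has fewer than 2k nodes.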

5 Suffix tree Given a string s, a suffix tree of s is a compressed trie of all suffixes of s. To make these suffixes prefix-free we add a special character, say $, at the end of s.

6 Suffix tree (Example) Let s = abab. A suffix tree of s is a compressed trie of all suffixes of abab$: { $, b$, ab$, bab$, abab$ }. [Figure: the suffix tree of abab$.]

7 Trivial algorithm to build a Suffix tree
Put the longest suffix, abab$, in. Then put the suffix bab$ in. [Figure: the tree after each insertion.]

8 Put the suffix ab$ in. [Figure: the tree before and after the insertion.]

9 Put the suffix b$ in. [Figure: the tree before and after the insertion.]

10 Put the suffix $ in. [Figure: the tree before and after the insertion.]

11 We will also label each leaf with the starting point of the corresponding suffix. [Figure: the suffix tree of abab$ with leaves labeled 5 ($), 4 (b$), 3 (ab$), 2 (bab$) and 1 (abab$).]

12 Analysis The trivial algorithm takes O(n^2) time to build the tree. It can be done in O(n) time.
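The trivial construction can be sketched as follows: insert the suffixes one by one, longest first, splitting an edge wherever a new suffix diverges from it. This is only the quadratic algorithm, not a linear-time one. The `Node` class and the function names are assumptions made for this illustration.

```python
class Node:
    def __init__(self):
        self.children = {}   # first letter of edge -> (edge label, child node)
        self.start = None    # for leaves: 1-based starting point of the suffix


def build_suffix_tree(s):
    """Naive O(n^2) construction of the compressed trie of all suffixes of s$."""
    text = s + "$"
    root = Node()
    for i in range(len(text)):
        node, rest = root, text[i:]
        while True:
            edge = node.children.get(rest[0])
            if edge is None:                      # no edge to follow: new leaf
                leaf = Node()
                leaf.start = i + 1
                node.children[rest[0]] = (rest, leaf)
                break
            label, child = edge
            k = 0
            while k < len(label) and label[k] == rest[k]:
                k += 1
            if k == len(label):                   # whole edge matched: descend
                node, rest = child, rest[k:]
            else:                                 # diverged inside the edge: split it
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[rest[0]] = (label[:k], mid)
                node, rest = mid, rest[k:]
    return root


def leaf_labels(node, path=""):
    """Map each suffix (spelled from the root) to the label of its leaf."""
    if not node.children:
        return {path: node.start}
    out = {}
    for label, child in node.children.values():
        out.update(leaf_labels(child, path + label))
    return out


tree = build_suffix_tree("abab")
print(leaf_labels(tree))
# {'abab$': 1, 'ab$': 3, 'bab$': 2, 'b$': 4, '$': 5}
```

The unique terminator is what keeps the inner loop from running off the end of `rest`: no suffix of s$ is a prefix of another, so a mismatch always occurs before either string is exhausted.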

13 What can we do with it ? Exact string matching:
Given a text T, |T| = n, preprocess it such that when a pattern P, |P| = m, arrives you can quickly decide whether it occurs in T. We may also want to find all occurrences of P in T.

14 Exact string matching In preprocessing we just build a suffix tree in O(n) time. Given a pattern P = ab we traverse the tree according to the pattern. [Figure: the suffix tree of abab$ with labeled leaves.]

15 If we did not get stuck traversing the pattern then the pattern occurs in the text. Each leaf in the subtree below the node we reach corresponds to an occurrence. By traversing this subtree we get all k occurrences in O(m+k) time: O(m) to walk down the pattern and O(k) to visit the subtree. [Figure: the suffix tree of abab$ with labeled leaves.]
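A small sketch of the matching procedure. To stay short it uses an uncompressed suffix trie (nested dictionaries, quadratic space) rather than a real suffix tree, so only the walk down the pattern has the stated cost; the subtree traversal is O(k) only in the compressed tree. The function names and the `"leaf"` marker key are assumptions of this example.

```python
def build_suffix_trie(text):
    """Uncompressed trie of all suffixes of text$; leaves store 1-based starts."""
    text += "$"
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node["leaf"] = i + 1   # multi-letter key, cannot clash with an edge letter
    return root


def occurrences(root, pattern):
    """Walk down the pattern, then collect every leaf below the node reached."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []          # got stuck: the pattern does not occur
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, value in current.items():
            if key == "leaf":
                found.append(value)
            else:
                stack.append(value)
    return sorted(found)


trie = build_suffix_trie("abab")
print(occurrences(trie, "ab"))  # [1, 3]
print(occurrences(trie, "bb"))  # []
```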

16 Generalized suffix tree
Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of every s ∈ S. To associate each suffix with a unique string in S, add a different special character to the end of each s.

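17 Generalized suffix tree (Example)
Let s1 = abab and s2 = aab. Here is a generalized suffix tree for s1 and s2, built from the suffixes of abab$ and aab#: { $, #, b$, b#, ab$, ab#, bab$, aab#, abab$ }. [Figure: the generalized suffix tree, each leaf labeled with the starting point of its suffix.]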

18 So what can we do with it ? Matching a pattern against a database of strings

19 Longest common substring (of two strings)
Every node with a leaf descendant from string s1 and a leaf descendant from string s2 represents a maximal common substring and vice versa. Find such a node with the largest “string depth”. [Figure: the generalized suffix tree of s1 and s2.]
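The idea can be sketched with an uncompressed generalized suffix trie in which every node records which of the two strings has a suffix passing through it. This stands in for "has a leaf descendant from s1 and from s2" and needs quadratic space, unlike a real generalized suffix tree. The function name and the node layout are assumptions of this example.

```python
def longest_common_substring(s1, s2):
    """Deepest trie node reached by suffixes of both strings."""
    root = {"kids": {}, "owners": set()}
    for owner, s in enumerate((s1, s2)):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "owners": set()})
                node["owners"].add(owner)       # a suffix of string `owner` passes here
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):               # string depth = length of the path
            best = path
        for ch, child in node["kids"].items():
            if len(child["owners"]) == 2:       # shared by both strings
                stack.append((child, path + ch))
    return best


print(longest_common_substring("abab", "aab"))  # ab
```

If several common substrings share the maximum length, which one is returned depends on traversal order.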

20 Lowest common ancestors
A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it.

21 Why? The LCA of two leaves represents the longest common prefix (LCP) of these two suffixes. [Figure: the generalized suffix tree of s1 and s2.]

22 Lowest common ancestors

23 Write an Euler tour of the tree
The LCA of two nodes is the shallowest node visited between their occurrences in the tour. In the example the tour is 3 2 3 1 4 1 7 1 3 5 6 5 3, and LCA(1,5) = 3. [Figure: the example tree.]

24 [Figure: the Euler tour 3 2 3 1 4 1 7 1 3 5 6 5 3 written above the depth of each entry, 0 1 0 1 2 1 2 1 0 1 2 1 0; the LCA is found at the minimum depth within the queried range.]

25 Range minimum Preprocess an array such that, given i and j, you can find the minimum in [i,j] fast. The reduction from LCA takes linear time. [Figure: the Euler tour and its depth array.]
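The reduction from LCA to range minimum might look like this. The example tree is an assumption: it is one tree whose Euler tour is 3 2 3 1 4 1 7 1 3 5 6 5 3, the sequence that survives from the slide. The range minimum here is found by a plain scan, which is the trivial approach; the function names are illustrative.

```python
def euler_tour(tree, root):
    """Return the tour, the depth of each tour entry, and first occurrences."""
    tour, depth, first = [], [], {}

    def visit(node, d):
        first.setdefault(node, len(tour))
        tour.append(node)
        depth.append(d)
        for child in tree.get(node, []):
            visit(child, d + 1)
            tour.append(node)      # come back to the parent after each child
            depth.append(d)

    visit(root, 0)
    return tour, depth, first


def lca(tour, depth, first, u, v):
    """Shallowest tour entry between the first occurrences of u and v."""
    i, j = sorted((first[u], first[v]))
    k = min(range(i, j + 1), key=depth.__getitem__)   # trivial O(j - i) scan
    return tour[k]


tree = {3: [2, 1, 5], 1: [4, 7], 5: [6]}   # children lists, root is 3
tour, depth, first = euler_tour(tree, 3)
print(tour)                              # [3, 2, 3, 1, 4, 1, 7, 1, 3, 5, 6, 5, 3]
print(lca(tour, depth, first, 1, 5))     # 3
```

The tour has 2n - 1 entries for a tree with n nodes, so building it and the depth array is linear.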

26 Trivial algorithms for RMQ

27 Less trivial algorithms for RMQ
Try to use O(n log(n)) space to do a query in O(1) time.
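One well-known structure with these bounds is a sparse table: store the position of the minimum of every window whose length is a power of two, then answer a query with two overlapping windows. The slides do not say which structure they have in mind, so treat this as one candidate; the function names are illustrative.

```python
def build_sparse_table(a):
    """table[k][i] = index of the minimum of a[i : i + 2**k]."""
    n = len(a)
    table = [list(range(n))]
    k = 1
    while (1 << k) <= n:
        prev, half = table[-1], 1 << (k - 1)
        row = []
        for i in range(n - (1 << k) + 1):
            left, right = prev[i], prev[i + half]
            row.append(left if a[left] <= a[right] else right)
        table.append(row)
        k += 1
    return table


def range_min_index(a, table, i, j):
    """Index of a minimum of a[i..j] (inclusive), using two table lookups."""
    k = (j - i + 1).bit_length() - 1          # largest power of two <= length
    left, right = table[k][i], table[k][j - (1 << k) + 1]
    return left if a[left] <= a[right] else right


depths = [0, 1, 0, 1, 2, 1, 2, 1, 0, 1, 2, 1, 0]
table = build_sparse_table(depths)
print(range_min_index(depths, table, 3, 9))   # 8
```

The table has about log2(n) rows of at most n entries each, which is where the O(n log(n)) space comes from.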

28 Lowest common ancestors
A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it.

29 Why? The LCA of two leaves represents the longest common prefix (LCP) of these two suffixes. [Figure: the generalized suffix tree of s1 and s2.]

30 Finding maximal palindromes
A palindrome: caabaac, cbaabc. We want to find all maximal palindromes in a string s. Let s = cbaaba. The maximal palindrome with center between i-1 and i is given by the LCP of the suffix at position i of s and the suffix at position m-i+1 of sr, where sr is s reversed.

31 Maximal palindromes algorithm
Prepare a generalized suffix tree for s = cbaaba$ and sr = abaabc#. For every i find the LCA of suffix i of s and suffix m-i+1 of sr.

32 Let s = cbaaba$; then sr = abaabc#. [Figure: the generalized suffix tree of s and sr with numbered leaves.]

33 Analysis O(n) time to identify all maximal palindromes.
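The LCP formulation can be tried out without a suffix tree by comparing characters directly. This is a quadratic stand-in for the constant-time LCA query, not the linear-time algorithm. The sketch uses its own 0-based indexing: the center sits between `s[i-1]` and `s[i]`, and the matching suffix of the reversed string starts at `len(s) - i`. Function names are illustrative.

```python
def lcp_length(a, i, b, j):
    """Length of the longest common prefix of a[i:] and b[j:], by direct scan."""
    k = 0
    while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
        k += 1
    return k


def maximal_even_palindromes(s):
    """Map each center i (between s[i-1] and s[i]) to its maximal palindrome."""
    m, sr = len(s), s[::-1]
    result = {}
    for i in range(1, m):
        # sr[m - i:] is s[:i] read backwards, i.e. what lies left of the center
        k = lcp_length(s, i, sr, m - i)
        if k:
            result[i] = s[i - k:i + k]
    return result


print(maximal_even_palindromes("cbaaba"))   # {3: 'baab'}
```

The LCP gives the right half of the palindrome; the left half is its mirror image, so the palindrome has twice the LCP's length.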

34 Compression

35 LZ overview
Example string: a a c a a c a b c a b a b a c, with a cursor separating the previously coded part from the rest.
At cursor position i, find the longest prefix of S[i,m] that appears in S[1,i-1], and write its position and length.

36-45 The output grows as the cursor advances over the string, one triple per step:
(0,0,a)
(0,0,a)(1,1,c)
(0,0,a)(1,1,c)(1,3,a)
(0,0,a)(1,1,c)(1,3,a)(0,0,b)
(0,0,a)(1,1,c)(1,3,a)(0,0,b)(6,3,a)

46 Implement by maintaining a suffix tree for the coded part, which gives linear time and space.
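The parsing rule can be checked with a brute-force sketch that searches the coded part with `str.find` instead of maintaining a suffix tree, so it is far from linear time. It assumes each triple is (1-based position of the match, match length, next character), that the match must lie entirely inside the coded part, and that the leftmost match wins ties. The function name is illustrative.

```python
def lz_parse(s):
    """Greedy parse into (position, length, next_char) triples."""
    out, i = [], 0
    while i < len(s):
        pos, length = 0, 0
        # Try the longest candidate first; it must fit inside s[:i] and
        # leave one character to emit as the literal.
        for l in range(min(i, len(s) - i - 1), 0, -1):
            p = s.find(s[i:i + l], 0, i)     # search restricted to the coded part
            if p != -1:
                pos, length = p + 1, l       # report a 1-based position
                break
        out.append((pos, length, s[i + length]))
        i += length + 1
    return out


print(lz_parse("aacaacabcababac"))
# [(0, 0, 'a'), (1, 1, 'c'), (1, 3, 'a'), (0, 0, 'b'), (6, 3, 'a'), (11, 2, 'c')]
```

The first five triples agree with the ones shown on the slides; the last one is what this sketch produces for the remaining characters.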

47 LZ77: Sliding Window
The cursor separates the dictionary (previously coded) from the lookahead buffer. Dictionary and buffer “windows” are fixed length and slide with the cursor.
On each step:
Output (p,l,c)
p = relative position of the longest match in the dictionary
l = length of longest match
c = next char in buffer beyond longest match
Advance window by l + 1

48 LZ77: Example
String: a a c a a c a b c a b a a a c, dictionary size = 6. The first two outputs are (0,0,a) and (1,1,c). [Figure: each step marks the dictionary, the longest match and the next character.]
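A sketch of the sliding-window step, assuming p is the distance back from the cursor, a dictionary of 6 characters as in the example, and a maximum match length of 4. The `max_len` value is an assumption; the transcript does not state the buffer size. A match may run past the cursor into the lookahead, which the decoder handles by copying one character at a time. Function names are illustrative.

```python
def lz77_encode(s, window=6, max_len=4):
    """Emit (p, l, c) triples: distance back, match length, next character."""
    out, i = [], 0
    while i < len(s):
        best_p = best_l = 0
        limit = min(max_len, len(s) - i - 1)       # keep one character for c
        for p in range(1, min(window, i) + 1):     # candidate distances back
            l = 0
            while l < limit and s[i - p + l] == s[i + l]:
                l += 1
            if l > best_l:                         # nearest match wins ties
                best_p, best_l = p, l
        out.append((best_p, best_l, s[i + best_l]))
        i += best_l + 1                            # advance window by l + 1
    return out


def lz77_decode(triples):
    """Rebuild the text; copying char by char allows overlapping matches."""
    out = []
    for p, l, c in triples:
        for _ in range(l):
            out.append(out[-p])
        out.append(c)
    return "".join(out)


codes = lz77_encode("aacaacabcabaaac")
print(codes)
# [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]
```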

49 LZ77 Optimizations used by gzip
LZSS: Output one of the following formats: (0, position, length) or (1, char). Typically use the second format if length < 3.
On a a c a a c a b c a b a a a c the output starts (1,a) (1,a) (1,c) (0,3,4).
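The LZSS output rule can be sketched in the same style. The window of 6 and the maximum match length of 4 are assumptions carried over from the LZ77 example, position is taken to be the distance back from the cursor, and this is not a description of what gzip actually does internally. Function names are illustrative.

```python
def lzss_encode(s, window=6, max_len=4, min_len=3):
    """Emit (0, position, length) for long matches and (1, char) otherwise."""
    out, i = [], 0
    while i < len(s):
        best_p = best_l = 0
        limit = min(max_len, len(s) - i)
        for p in range(1, min(window, i) + 1):
            l = 0
            while l < limit and s[i - p + l] == s[i + l]:
                l += 1
            if l > best_l:
                best_p, best_l = p, l
        if best_l < min_len:            # short match: a literal is cheaper
            out.append((1, s[i]))
            i += 1
        else:
            out.append((0, best_p, best_l))
            i += best_l
    return out


def lzss_decode(codes):
    out = []
    for code in codes:
        if code[0] == 1:
            out.append(code[1])
        else:
            _, p, l = code
            for _ in range(l):
                out.append(out[-p])
    return "".join(out)


codes = lzss_encode("aacaacabcabaaac")
print(codes[:4])   # [(1, 'a'), (1, 'a'), (1, 'c'), (0, 3, 4)]
```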

50 Suffix array We lose some of the functionality but we save space.
Let s = abab. Sort the suffixes lexicographically: ab, abab, b, bab. The suffix array gives the indices of the suffixes in sorted order: 2 0 3 1 (positions counted from 0).

51 How do we build it ? Build a suffix tree.
Traverse the tree depth-first, picking the edges outgoing from each node in lexicographic order, and fill the suffix array with the leaves as they are reached. O(n) time. Can also build it directly.
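For small inputs the suffix array can also be obtained by sorting the suffixes directly. This sketch does exactly that, which costs far more than O(n) and is neither the suffix-tree traversal nor an efficient direct construction. Positions are 0-based, and the function name is illustrative.

```python
def suffix_array(s):
    """Starting positions of the suffixes of s in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


print(suffix_array("abab"))         # [2, 0, 3, 1]
print(suffix_array("mississippi"))  # [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
```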

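52 Example Let S = mississippi and let P = issa. The suffix array of S (positions counted from 0) and the suffixes it points to, with the binary-search pointers L, M and R at the first, middle and last entries:
10 i (L)
7 ippi
4 issippi
1 ississippi
0 mississippi
9 pi (M)
8 ppi
6 sippi
3 sissippi
5 ssippi
2 ssissippi (R)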

53 How do we search for a pattern ?
If P occurs in T then all its occurrences are consecutive in the suffix array. Do a binary search on the suffix array. This takes O(m log n) time. Can also do it in O(m + log(n)) with an additional array.
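A sketch of the O(m log n) search: two binary searches locate the block of suffixes that start with P, comparing only the first m characters of each suffix. The suffix array is built by plain sorting to keep the example self-contained; positions are 0-based and the function names are illustrative.

```python
def suffix_array(s):
    return sorted(range(len(s)), key=lambda i: s[i:])


def find_occurrences(text, sa, pattern):
    """All starting positions of pattern in text, via binary search on sa."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                       # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                       # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])          # the occurrences are consecutive in sa


text = "mississippi"
sa = suffix_array(text)
print(find_occurrences(text, sa, "issa"))  # []
print(find_occurrences(text, sa, "issi"))  # [1, 4]
```

Each of the O(log n) steps compares at most m characters, which gives the O(m log n) bound.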

