Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Linear Time Construction of Suffix Tree.


1 Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Linear Time Construction of Suffix Tree

2 Suffix tree. S = xabxac, with suffixes numbered by starting position: 1 = xabxac, 2 = abxac, 3 = bxac, 4 = xac, 5 = ac, 6 = c.

3 Suffix tree. S = xabxa, with suffixes numbered by starting position: 1 = xabxa, 2 = abxa, 3 = bxa, 4 = xa, 5 = a. [Figure: tree with edges xabxa, abxa and bxa]

4 Suffix tree (Example). Let s = abab. A suffix tree of s contains all the suffixes of s$ = abab$: { $, b$, ab$, bab$, abab$ }. [Figure: suffix tree of abab$]

5 Trivial algorithm to build a suffix tree, s = abab$. Put the longest suffix, abab$, in the tree. Then put the suffix bab$ in. [Figure: tree before and after inserting bab$]

6 Put the suffix ab$ in. Suffixes already in the tree: { abab$, bab$ }. [Figure: tree before and after inserting ab$]

7 Put the suffix b$ in. Suffixes already in the tree: { abab$, bab$, ab$ }. [Figure: tree before and after inserting b$]

8 Put the suffix $ in. Suffixes already in the tree: { abab$, bab$, ab$, b$ }. [Figure: tree before and after inserting $]

9 We will also label each leaf with the starting position of the corresponding suffix: abab$ = 1, bab$ = 2, ab$ = 3, b$ = 4, $ = 5. [Figure: suffix tree of abab$ with leaves 1 to 5]

10 Naive Construction – More Example. S = abbcbab#. [Figure: suffix tree of abbcbab# with leaves 1 to 7; edge labels include ab, b, #, bcbab#, cbab# and ab#]

11 Analysis. The naive construction takes O(n^2) time to build the tree. We will see how to do it in O(n) time.
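The naive construction from the preceding slides can be sketched in Python as follows. This is a minimal illustration, not the presenter's code: the names Node, insert, build_naive and leaves are hypothetical, edge labels are stored as plain strings for readability, and the input is assumed to end with a terminator that occurs nowhere else (such as $). Each of the n insertions may compare up to n characters, which is where the O(n^2) bound comes from.

```python
class Node:
    def __init__(self):
        self.children = {}   # first character of edge label -> (label, child Node)
        self.leaf = None     # 1-based starting position of the suffix, for leaves only


def insert(node, suffix, index):
    """Insert one suffix below `node`, splitting an edge where the match stops."""
    while True:
        first = suffix[0]
        if first not in node.children:
            leaf = Node()
            leaf.leaf = index
            node.children[first] = (suffix, leaf)
            return
        label, child = node.children[first]
        k = 0  # length of the common prefix of the edge label and the suffix
        while k < len(label) and k < len(suffix) and label[k] == suffix[k]:
            k += 1
        if k == len(label):
            # Whole edge matched: continue from the child node.
            node, suffix = child, suffix[k:]
            continue
        # Mismatch inside the edge: split it with a new internal node.
        mid = Node()
        mid.children[label[k]] = (label[k:], child)
        node.children[first] = (label[:k], mid)
        node, suffix = mid, suffix[k:]


def build_naive(s):
    """Insert the suffixes of s from longest to shortest (s ends with a unique terminator)."""
    root = Node()
    for start in range(len(s)):
        insert(root, s[start:], start + 1)
    return root


def leaves(node, prefix=""):
    """Map each root-to-leaf path label to its leaf number."""
    if node.leaf is not None:
        return {prefix: node.leaf}
    out = {}
    for label, child in node.children.values():
        out.update(leaves(child, prefix + label))
    return out
```

Calling leaves(build_naive("abab$")) recovers the five suffixes with the leaf numbers used on slide 9.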

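12 Ukkonen’s linear-time Suffix Tree Algorithm. Implicit Suffix Tree: obtained from the suffix tree for S$ as follows. 1. Remove every copy of the terminal symbol $ from the edge labels of the tree. 2. Then remove any edge that has no label. 3. Then remove any node that does not have at least two children.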

13 Implicit Suffix Tree – More Example. [Figure: suffix tree of abab$ with leaves 1 to 5, suffixes { abab$, bab$, ab$, b$, $ }] 1. Even though an implicit suffix tree may not have a leaf for each suffix, it does encode all the suffixes of S. 2. Let I_i denote the implicit suffix tree of the string S[1..i].
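To see which suffixes lose their leaf in an implicit suffix tree, the following sketch lists the suffixes that still end at a leaf. It relies on the fact that, without the terminator, a suffix ends at a leaf exactly when it is not a prefix of another suffix. The function name implicit_leaves is hypothetical, and the brute-force check is for illustration only.

```python
def implicit_leaves(s):
    """Suffixes of s (no terminator) that end at a leaf of the implicit suffix tree."""
    suffixes = [s[i:] for i in range(len(s))]
    return [t for t in suffixes
            if not any(u != t and u.startswith(t) for u in suffixes)]
```

For s = abab this keeps only abab and bab, since ab and b are prefixes of longer suffixes; for s = xabxa it keeps xabxa, abxa and bxa.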

14 Ukkonen’s Algorithm at a High Level. Construct an implicit suffix tree I_i for each prefix S[1..i] of S, starting from I_1 and incrementing i by one until I_m is built, where m is the length of the string S. The true suffix tree for S is constructed from I_m, and the time for the entire algorithm is O(m).

15 High-level Description of Ukkonen’s Algorithm. Ukkonen’s algorithm is divided into m phases. In phase i+1, tree I_{i+1} is constructed from I_i. Each phase i+1 is further divided into i+1 extensions, one for each of the i+1 suffixes of S[1..i+1].

16 Naïve Algorithm of Suffix Tree. Suffixes: { abab$, bab$, ab$, b$, $ }. [Figure: suffix tree of abab$ with leaves 1 to 5]

17 High-level of Ukkonen’s Algorithm (example). I_1: S[1..1], suffixes {a}. I_2: S[1..2], suffixes {ab, b}. I_3: S[1..3], suffixes {aba, ba, a}. [Figure: implicit trees I_1 to I_3, with axes labeled phases and extensions]

18 Phases and extensions for the same example: I_1: S[1..1] {a}; I_2: S[1..2] {ab, b}; I_3: S[1..3] {aba, ba, a}. Carried out naively, the extensions take O(m^3) time in total.
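The phase and extension structure can be enumerated directly. The sketch below (the function name phases is hypothetical) lists, for each phase i, the i suffixes of S[1..i] that the extensions of that phase must place in the tree. There are m(m+1)/2 extensions in total, and locating each suffix end naively costs up to O(m), which gives the O(m^3) figure.

```python
def phases(s):
    """For each phase i = 1..m, the suffixes of s[:i], one per extension."""
    return [[s[j:i] for j in range(i)] for i in range(1, len(s) + 1)]


def extension_count(s):
    """Total number of extensions over all phases: 1 + 2 + ... + m."""
    return sum(len(phase) for phase in phases(s))
```

For S = aba this reproduces the three phases listed on the slide.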

19 Suffix Extension Rules. Let I_i already be there; we want to extend it for i+1. Let β = S[j..i] be a suffix of S[1..i].
Rule 1: Path β ends at a leaf. Character S(i+1) is added to the end of the label of that leaf edge.
Rule 2: Some path from the end of string β starts with character S(i+1). In this case the string βS(i+1) is already in the tree, so do nothing.
[Figure: implicit trees for I_1: S[1..1] {a}; I_2: S[1..2] {ab, b}; I_3: S[1..3] {aba, ba, a}; I_4: S[1..4] {abab, bab, ab, b}]

20 Suffix Extension Rules. Let I_i already be there; we want to extend it for i+1.
Rule 3: No path from the end of string β starts with character S(i+1), but at least one labeled path continues from the end of β. Add a new node.
Example: let I_5 be drawn for the first five characters of axabxb (positions 123456). Now extend for 6:
axabxb, xabxb, abxb, bxb: Rule 1
xb: Rule 3
b: Rule 2
Running time so far: O(m^3).
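The three rules, using the numbering of these slides, can be checked by brute force on the string alone, without building a tree. In the sketch below the function name classify_extensions is hypothetical, and i is assumed to be at least 1. Path β ends at a leaf of I_i exactly when no character follows any occurrence of β inside S[1..i], so the set of following characters decides the rule.

```python
def classify_extensions(s, i):
    """Rule used by each extension j = 1..i+1 of phase i+1 (requires i >= 1).

    Rule numbers follow the slides: 1 = extend a leaf edge,
    2 = already in the tree, 3 = add a new node.
    """
    prefix, c = s[:i], s[i]          # prefix is S[1..i], c is S(i+1)
    rules = []
    for j in range(i + 1):           # 0-based start of beta = S[j+1..i]
        beta = prefix[j:]
        # Characters that follow some occurrence of beta inside the prefix.
        followers = {prefix[k + len(beta)]
                     for k in range(len(prefix) - len(beta))
                     if prefix[k:k + len(beta)] == beta}
        if not followers:
            rules.append(1)          # beta ends at a leaf
        elif c in followers:
            rules.append(2)          # beta followed by c is already present
        else:
            rules.append(3)          # beta continues, but not with c
    return rules
```

For the slide's example, classify_extensions("axabxb", 5) gives Rule 1 for the four longest suffixes, Rule 3 for xb and Rule 2 for b.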

21 Implementation and Speedup, Suffix Links. Definition: Let xα denote an arbitrary string, where x is a single character and α a substring (possibly empty). For an internal node v with path-label xα, if there is another node s(v) with path-label α, then a pointer from v to s(v) is called a suffix link. Does the root have a suffix link? No, because it is not an internal node. Every internal node has a suffix link.

22 Suffix Links – More Example. [Figure: suffix tree of abbcbab# with leaves 1 to 7 and a suffix link from v to s(v)] Corollary 6.1.1: In Ukkonen’s algorithm, any newly created internal node will have a suffix link from it by the end of the next extension. Corollary 6.1.2: In any implicit suffix tree I_i, if internal node v has path-label xα, then there is a node s(v) of I_i with path-label α.
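The suffix links of the abbcbab# example can be checked by brute force. The sketch below assumes the input ends with a unique terminator and uses the fact that the internal nodes of a suffix tree are exactly the non-empty substrings that are followed by at least two different characters. The names internal_labels and suffix_links are hypothetical, and an empty string stands for the root.

```python
from collections import defaultdict


def internal_labels(s):
    """Path-labels of the internal nodes (root excluded) of the suffix tree of s."""
    followers = defaultdict(set)
    for i in range(len(s)):
        for j in range(i + 1, len(s)):
            followers[s[i:j]].add(s[j])   # substring s[i:j] is followed by s[j]
    return {w for w, chars in followers.items() if len(chars) >= 2}


def suffix_links(s):
    """Map each internal node's path-label x+alpha to alpha ('' means the root)."""
    labels = internal_labels(s)
    links = {}
    for w in labels:
        target = w[1:]
        # The target must be the root or another internal node.
        assert target == "" or target in labels
        links[w] = target
    return links
```

For abbcbab# the internal nodes are ab and b, with a suffix link from ab to b and one from b to the root.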

23 Example: S = MISSISSIPI (positions 1234567890). Phases: 1: M, 2: MI, 3: MIS, 4: MISS, 5: MISSI, 6: MISSIS, 7: MISSISS, 8: MISSISSI, 9: MISSISSIP, 10: MISSISSIPI. [Figure: implicit suffix tree for MISSISSIPI with leaves 1 to 9, illustrating Corollary 6.1.1]

24 MISSISSIPI example, continued. [Figure: same implicit suffix tree as on the previous slide] How do suffix links help?

25 What is achieved so far? Not so much. Worst-case running time is O(m^2) for a phase.

26 Trick 1: Skip/Count Trick. There must be a γ path from s(v).

27 Trick 1: Skip/Count Trick. Walking down along γ character by character takes time proportional to |γ|. The skip/count trick reduces the traversal time to something proportional to the number of nodes on the path. [Figure: γ = zabcdefghy spelled along four edges of lengths 2, 2, 3 and 3] But what does it buy in terms of worst-case bounds?
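A minimal sketch of the skip/count walk follows, assuming a hypothetical Node class whose incoming edge label is stored as a pair of indices into the text. Because γ is known to be in the tree, only the first character of each edge is inspected to pick the child, and the rest of the edge is skipped by its length.

```python
class Node:
    def __init__(self, start=0, end=0):
        self.start, self.end = start, end   # incoming edge label is text[start:end]
        self.children = {}                  # first character of edge -> child Node


def skip_count(text, node, lo, hi):
    """Walk down from `node` along gamma = text[lo:hi], known to be in the tree.

    Returns (last node reached, number of characters of gamma left on the
    edge below that node, number of edges skipped).
    """
    hops = 0
    while lo < hi:
        child = node.children[text[lo]]     # one character comparison per edge
        edge_len = child.end - child.start
        if hi - lo < edge_len:
            break                           # gamma ends inside this edge
        lo += edge_len                      # skip the whole edge at once
        node = child
        hops += 1
    return node, hi - lo, hops


def chain(text, lengths):
    """Helper for the example: a single path whose edges have the given lengths."""
    root = Node()
    nodes, node, pos = [root], root, 0
    for n in lengths:
        child = Node(pos, pos + n)
        node.children[text[pos]] = child
        nodes.append(child)
        node, pos = child, pos + n
    return nodes
```

With the edge lengths 2, 2, 3 and 3 from the figure, walking zabcdefghy takes four hops instead of ten character comparisons.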

28 Lemma 6.1.2: Let (v, s(v)) be any suffix link traversed during Ukkonen’s algorithm. At that moment, the node-depth of v is at most one greater than the node-depth of s(v). [Figure: example node-depths v = 2, s(v) = 1; v = 3, s(v) = 3; v = 4, s(v) = 5]

29 Theorem 6.1.1: Using the skip/count trick, any phase of Ukkonen’s algorithm takes O(m) time. In a single extension, the algorithm:
– walks up at most one edge
– finds a suffix link and traverses it
– walks down some number of nodes
– applies the suffix extension rules
– and may add a suffix link
All operations except the down-walk take constant time, so only the down-walk time needs to be analyzed.

30 Analysis of the down-walk time in a phase:
– The up-walk decreases the current node-depth by at most one
– The suffix link traversal decreases the node-depth by at most another one (Lemma 6.1.2)
– Each down-walk moves to a greater node-depth
– Over the entire phase, the current node-depth is decremented at most 2m times
– Since no node can have depth greater than m, the total possible increment to the current node-depth is bounded by 3m over the entire phase
– So the total number of edge traversals is bounded by 3m
– Since each edge traversal takes constant time, all the down-walking in a phase is O(m)

31 Complexity. There are m phases. Each phase takes O(m). So the running time is O(m^2). Two more tricks and we are done.

32 Reference. Dan Gusfield, Algorithms on Strings, Trees and Sequences, Chapter 6.

