On the Sorting-Complexity of Suffix Tree Construction MARTIN FARACH-COLTON PAOLO FERRAGINA S. MUTHUKRISHNAN Requires Math fonts downloadable from herehere.


1 On the Sorting-Complexity of Suffix Tree Construction
Martin Farach-Colton, Paolo Ferragina, S. Muthukrishnan

2 Fact From the Previous Talk
(Harel and Tarjan 1984; Bender and Farach-Colton 2000)
A tree T with m nodes can be preprocessed in O(m) time so that, for any pair of its nodes u, v, lca(u, v) can be computed in constant time.

3 What's in This Paper
Bounds depend on the alphabet:
– Constant-size alphabet: O(n) (Weiner 1973)
– Unbounded alphabet: Θ(n log n)
– Alphabet {1…n}: linear time
   RAM algorithm
   DAM algorithm (I/O optimal)
The algorithm also works for PRAM and PDAM

4 Talk Outline
Suffix trees
– Reminder
– Tools
RAM algorithm for suffix tree construction
Conclusion

5 Suffix Trees
[Figure: suffix tree of S = 121112212221$ (n = 13); leaves, left to right: 3 4 1 5 8 12 2 7 11 6 10 9 13; example suffix S[8,13] = 12221$]

6 Suffix Tree Representation
[Figure: the suffix tree of S = 121112212221$ over the symbols 1, 2, $; leaves, left to right: 3 4 1 5 8 12 2 7 11 6 10 9 13]

7 Properties of Suffix Trees
lcp(σ(v), σ(w)) = |σ(lca(v, w))|
[Figure: suffix tree of the running example with nodes v, w and lca(v, w) marked; node annotations σ = 1 (L = 1), σ = 12 (L = 2), σ = 11 (L = 2)]

8 Suffix Links
Lemma [Weiner 1973]: Let a ∈ Σ and α ∈ Σ*. If there is a node v in T_S such that σ(v) = aα, then there is a node w in T_S such that σ(w) = α.
Define the suffix link as sl(v) = w.

9 Suffix Links
[Figure: suffix tree of the running example with suffix links drawn; node annotations σ = 1 (L = 1), σ = 12 (L = 2), σ = 122 (L = 3), σ = 2 (L = 1)]

10 Suffix Links Example
[Figure: suffix links in the suffix tree of the running example]

11 Suffix Arrays
Let Δ = {S_i | S_i ∈ Σ*, |S_i| = n_i} and let T be the compacted trie of Δ.
An in-order traversal of the leaves gives the strings in lexicographic order: S_{p_1}, …, S_{p_|Δ|}
– Sort array: A_T[i] = p_i
– Longest common prefix array: LCP_T[i] = lcp(S_{p_i}, S_{p_{i+1}})

12 Suffix Array Example
A_T   = 3 4 1 5 8 12 2 7 11 6 10 9 13
LCP_T = 2 1 2 3 1 0 2 2 1 3 2 0 -
[Figure: suffix tree of the running example with its leaves in the order of A_T; node annotation σ = 11 (L = 2)]
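The two arrays can be checked on the running example with a deliberately naive sketch. It assumes the character order 1 < 2 < $ (inferred from the leaf order in the figure, not stated on the slide); the function names are illustrative only, and the quadratic sort is not the construction discussed in the talk.

```python
def lcp(a, b):
    """Length of the longest common prefix of two sequences."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def suffix_array_naive(s, char_rank):
    """Return (A, LCP) with 1-based suffix start positions.

    char_rank maps each character to its sort order (assumed: 1 < 2 < $).
    Suffixes are sorted by direct comparison, so this is only for small inputs.
    """
    keyed = [char_rank[c] for c in s]
    order = sorted(range(len(s)), key=lambda i: keyed[i:])
    A = [i + 1 for i in order]
    LCP = [lcp(keyed[order[k]:], keyed[order[k + 1]:])
           for k in range(len(order) - 1)]
    return A, LCP


S = "121112212221$"
A, LCP = suffix_array_naive(S, {"1": 0, "2": 1, "$": 2})
print(A)    # [3, 4, 1, 5, 8, 12, 2, 7, 11, 6, 10, 9, 13]
print(LCP)  # [2, 1, 2, 3, 1, 0, 2, 2, 1, 3, 2, 0]
```

The LCP list has one entry fewer than A because the last suffix has no successor.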

13 RAM Algorithm
Input: string S
Output: T_S
Divide and conquer:
1. Recursively compute T_o, the compacted trie of the suffixes beginning at odd positions
2. Recursively compute T_e, the compacted trie of the suffixes beginning at even positions
3. Merge T_e and T_o to get T_S

14 Divide and Conquer Scheme
[Figure: recursion tree. Divide: A(n) into A(n/2), then A(n/4). Conquer. Merge: S(n/2) into S(n)]

15 RAM Algorithm Scheme
[Figure: flow diagram with the phases Divide, Conquer, Merge and steps numbered 1 to 6. Boxes: |S| = n, Σ = [n]; |S'| = n/2, Σ' = [n/2]; T_S'(n/2); A_Ts'(n/2), LCP_Ts'(n/2); A_To(n/2), LCP_To(n/2); A_Te(n/2), LCP_Te(n/2); A_T(n/2), LCP_T(n/2); T_S(n)]

16 Switching Representations
[Figure: the RAM algorithm scheme of slide 15]

17 Suffix Tree → Suffix Array
[Figure: the suffix tree and the arrays A_T, LCP_T of slide 12]

18 Suffix Array → Suffix Tree
[Figure: the suffix tree and the arrays A_T, LCP_T of slide 12]
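Both directions of the switch can be sketched in a few lines. This is one common way to do it (a stack holding the rightmost path of the tree), offered as an illustration rather than as the construction used in the talk; the `Node` class and function names are hypothetical, and the code assumes a unique terminator so that every lcp is shorter than both suffixes.

```python
class Node:
    def __init__(self, depth, leaf=None):
        self.depth = depth    # string depth L(v) = |sigma(v)|
        self.leaf = leaf      # suffix start position for a leaf, else None
        self.children = []    # children in lexicographic order


def tree_from_arrays(A, LCP, n):
    """Build the compacted trie from A_T and LCP_T in one left-to-right pass."""
    root = Node(0)
    stack = [root]                        # rightmost path, root first
    for k, pos in enumerate(A):
        h = LCP[k - 1] if k > 0 else 0    # lcp with the previous leaf
        last = None
        while stack[-1].depth > h:
            last = stack.pop()
        if stack[-1].depth < h:
            # No node of depth h exists yet: split the edge above `last`.
            mid = Node(h)
            stack[-1].children.pop()      # detach `last`
            mid.children.append(last)
            stack[-1].children.append(mid)
            stack.append(mid)
        leaf = Node(n - pos + 1, leaf=pos)
        stack[-1].children.append(leaf)
        stack.append(leaf)
    return root


def arrays_from_tree(node):
    """In-order traversal: return (A, LCP) for the leaves below node."""
    if node.leaf is not None:
        return [node.leaf], []
    A, LCP = [], []
    for child in node.children:
        sub_A, sub_LCP = arrays_from_tree(child)
        if A:
            # Adjacent leaves in different children meet at this node.
            LCP.append(node.depth)
        A += sub_A
        LCP += sub_LCP
    return A, LCP


A = [3, 4, 1, 5, 8, 12, 2, 7, 11, 6, 10, 9, 13]
LCP = [2, 1, 2, 3, 1, 0, 2, 2, 1, 3, 2, 0]
root = tree_from_arrays(A, LCP, 13)
print(arrays_from_tree(root) == (A, LCP))
```

Each node is pushed and popped at most once, so the tree is built in time linear in the number of leaves.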

19 Compressing S
[Figure: the RAM algorithm scheme of slide 15]

20 Compressing S
Input: |S| = n, Σ = [n]
Map character pairs into single characters:
– For i = 1 to n/2, form the pairs ⟨S[2i-1], S[2i]⟩
– Sort the pairs lexicographically by radix sort in O(n) time
– Remove duplicates
S'[i] = rank of ⟨S[2i-1], S[2i]⟩
Now |S'| = n/2 and Σ' = [n/2]

21 Example
S = 121112212221$, Σ = [13]
1. Pairs: ⟨1,2⟩ ⟨1,1⟩ ⟨1,2⟩ ⟨2,1⟩ ⟨2,2⟩ ⟨2,1⟩
2. Ordered pairs: ⟨1,1⟩ ⟨1,2⟩ ⟨1,2⟩ ⟨2,1⟩ ⟨2,1⟩ ⟨2,2⟩
3. Duplicates removed: ⟨1,1⟩ ⟨1,2⟩ ⟨2,1⟩ ⟨2,2⟩
4. S' = 212343$, Σ' = [4]
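The compression step can be reproduced on this example with a short sketch. It assumes the terminator is stripped before pairing and re-appended afterwards, and it uses Python's comparison sort where the slide calls for a linear-time radix sort; `compress` is an illustrative name.

```python
def compress(S):
    """Replace each pair <S[2i-1], S[2i]> by its rank among the distinct pairs.

    S is a list of integers of even length (terminator handled separately).
    """
    pairs = [(S[k], S[k + 1]) for k in range(0, len(S), 2)]
    # sorted() stands in for the radix sort of the slide.
    distinct = sorted(set(pairs))
    rank = {p: r + 1 for r, p in enumerate(distinct)}
    return [rank[p] for p in pairs]


S = [1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1]   # 121112212221 without '$'
S_prime = compress(S)
print(S_prime)   # [2, 1, 2, 3, 4, 3]
```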

22 Decompressing S
[Figure: the RAM algorithm scheme of slide 15]

23 Decompressing S
Input: A_Ts', LCP_Ts'
Notice: S[2i-1]···S[n]$ = S'[i]···S'[n/2]$
A_To[i] = 2 · A_Ts'[i] − 1
LCP_To[i] = 2 · LCP_Ts'[i] + 1 if S[A_To[i] + 2·LCP_Ts'[i]] = S[A_To[i+1] + 2·LCP_Ts'[i]], and LCP_To[i] = 2 · LCP_Ts'[i] otherwise
[Figure: two example suffixes, ···2211211 and ···2121211211]
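The two formulas can be tried on the running example. In this sketch the arrays of S' = 212343$ are written out by hand under the assumption that $ sorts after every other symbol; `decompress` is an illustrative name and positions are 1-based as on the slides.

```python
def decompress(S, A_sp, LCP_sp):
    """Turn the arrays of S' into the arrays of the odd suffixes of S.

    S is a string ending in a unique terminator; positions are 1-based.
    """
    def ch(p):
        return S[p - 1]

    A_o = [2 * p - 1 for p in A_sp]
    LCP_o = []
    for k, l in enumerate(LCP_sp):
        # After 2*l matching characters, one more may match: the first
        # character of the first differing pair.
        same = ch(A_o[k] + 2 * l) == ch(A_o[k + 1] + 2 * l)
        LCP_o.append(2 * l + (1 if same else 0))
    return A_o, LCP_o


S = "121112212221$"
A_sp = [2, 1, 3, 4, 6, 5, 7]      # suffix array of S' = 212343$ (assumed order)
LCP_sp = [0, 1, 0, 1, 0, 0]
A_o, LCP_o = decompress(S, A_sp, LCP_sp)
print(A_o)     # [3, 1, 5, 7, 11, 9, 13]
print(LCP_o)   # [1, 2, 0, 2, 1, 0]
```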

24 Building the Even Tree
[Figure: the RAM algorithm scheme of slide 15]

25 Building the Even Tree
Input: A_To, LCP_To
Observation: if P is an even suffix of S, then P = aP' where P' is an odd suffix of S
To get A_Te, apply radix sort to the even suffixes S[2i,n] using the keys ⟨S[2i], S[2i+1,n]⟩
lcp(S[2i,n], S[2j,n]) = lcp(S[2i+1,n], S[2j+1,n]) + 1 if S[2i] = S[2j], and 0 otherwise
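A minimal sketch of this step on the running example follows. It makes several simplifying assumptions: `sorted` replaces the radix sort, the lcp of two non-adjacent odd suffixes is taken as a plain minimum over a slice of LCP_To (where a constant-time structure such as the lca preprocessing of slide 2 would be used), and the terminator sits at an odd position so every even suffix has an odd successor. The names are illustrative.

```python
def even_arrays(S, A_o, LCP_o, char_rank):
    """Derive (A_e, LCP_e) for the even suffixes from the odd arrays."""
    rank_o = {p: r for r, p in enumerate(A_o)}   # rank of each odd suffix
    evens = range(2, len(S) + 1, 2)
    # Key <S[2i], rank of S[2i+1, n]>; sorted() stands in for radix sort.
    A_e = sorted(evens, key=lambda p: (char_rank[S[p - 1]], rank_o[p + 1]))
    LCP_e = []
    for a, b in zip(A_e, A_e[1:]):
        if S[a - 1] != S[b - 1]:
            LCP_e.append(0)
        else:
            lo, hi = rank_o[a + 1], rank_o[b + 1]
            # lcp of two odd suffixes = minimum of LCP_o between their ranks.
            LCP_e.append(1 + min(LCP_o[lo:hi]))
    return A_e, LCP_e


S = "121112212221$"
A_o = [3, 1, 5, 7, 11, 9, 13]
LCP_o = [1, 2, 0, 2, 1, 0]
A_e, LCP_e = even_arrays(S, A_o, LCP_o, {"1": 0, "2": 1, "$": 2})
print(A_e)     # [4, 8, 12, 2, 6, 10]
print(LCP_e)   # [1, 1, 0, 1, 3]
```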

26 Merging T_o and T_e
[Figure: the RAM algorithm scheme of slide 15]

27 Merging T_o and T_e
Input: ⟨A_To, LCP_To⟩ and ⟨A_Te, LCP_Te⟩
Trivial method: sort the suffixes lexicographically, Θ(n²)
What if we have an oracle for lcp(S[2i,n], S[2j-1,n])?
– Merge A_To and A_Te directly (like sorted lists)
– Compute LCP_T from previous results:
   1. lcp of adjacent odd suffixes by LCP_To
   2. lcp of adjacent even suffixes by LCP_Te
   3. lcp of an odd suffix and an even suffix by the oracle
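The oracle-based merge can be sketched as follows. The oracle here is a naive character-by-character comparison used purely as a stand-in; building a constant-time oracle is the subject of the remaining slides. The function names and the 1 < 2 < $ order are assumptions of this sketch.

```python
def naive_oracle(S):
    """Stand-in lcp oracle; relies on the unique terminator to stop."""
    def oracle(p, q):
        k = 0
        while S[p - 1 + k] == S[q - 1 + k]:
            k += 1
        return k
    return oracle


def merge_with_oracle(S, odd, even, char_rank, lcp_oracle):
    """Merge (A_o, LCP_o) and (A_e, LCP_e) into (A, LCP) like sorted lists."""
    (A_o, LCP_o), (A_e, LCP_e) = odd, even

    def less(p, q):
        l = lcp_oracle(p, q)              # one oracle call per comparison
        return char_rank[S[p - 1 + l]] < char_rank[S[q - 1 + l]]

    tagged = []                           # (position, source, index in source)
    i = j = 0
    while i < len(A_o) or j < len(A_e):
        take_odd = j == len(A_e) or (i < len(A_o) and less(A_o[i], A_e[j]))
        if take_odd:
            tagged.append((A_o[i], "o", i))
            i += 1
        else:
            tagged.append((A_e[j], "e", j))
            j += 1
    A = [p for p, _, _ in tagged]
    LCP = []
    for (p, sp, kp), (q, sq, _) in zip(tagged, tagged[1:]):
        if sp == sq == "o":
            LCP.append(LCP_o[kp])         # adjacent odd suffixes
        elif sp == sq == "e":
            LCP.append(LCP_e[kp])         # adjacent even suffixes
        else:
            LCP.append(lcp_oracle(p, q))  # odd/even pair
    return A, LCP


S = "121112212221$"
odd = ([3, 1, 5, 7, 11, 9, 13], [1, 2, 0, 2, 1, 0])
even = ([4, 8, 12, 2, 6, 10], [1, 1, 0, 1, 3])
A, LCP = merge_with_oracle(S, odd, even, {"1": 0, "2": 1, "$": 2}, naive_oracle(S))
print(A)
print(LCP)
```

Two suffixes from the same source that end up adjacent in the merged order were already adjacent in their own array, which is why a single lookup in LCP_To or LCP_Te suffices for them.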

28 Coupled-DFS (the uncompacted case)
[Figure: tries T_1 (subtrees A, B, C) and T_2 (subtrees D, E, F) and the merged trie T_M under construction]

29 [Figure: coupled DFS continued; first edge of T_M added]

30 [Figure: coupled DFS continued; T_M contains the merged subtree A+D]

31 Coupled-DFS (the uncompacted case)
[Figure: coupled DFS continued; T_M contains A+D and B]

32 Coupled-DFS (the uncompacted case)
[Figure: coupled DFS continued; T_M contains A+C, B and E]

33 Coupled-DFS (the uncompacted case)
[Figure: coupled DFS finished; T_M contains A+C, B, E and C+F]

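34 Coupled-DFS (the compacted case)
[Figure: compacted tries T_1 (subtrees A, B, C) and T_2 (subtrees D, E, F) with multi-character edge labels, and the merged trie T_M]

35 Coupled-DFS (the compacted case)
[Figure: the merged trie T_M of the compacted tries, containing D, C+F and a new subtree G]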

36 Over-Merging T_o and T_e
How do we merge compacted tries? An over-merge is like a merge, but:
– Compare only the first characters of edges
– For two edges of different lengths k < l, break the edge of length l into pieces of length k and l − k
– Identify edges by their first letter only

37 Over-Merge Example
[Figure: compacted tries T_1 (subtrees A, B, C) and T_2 (subtrees D, E, F) and their over-merge T_M, containing D, C+F and G]

38 Over-Merge of Running Example
[Figure: T_o for S = 121112212221$, with leaves 3 1 5 7 11 9 13]

39 Over-Merge of Running Example
[Figure: T_e for S = 121112212221$, with leaves 2 6 10 4 12 8]

40 Over-Merge of Running Example
[Figure: the over-merged tree T_M for S = 121112212221$]

41 Building the lcp Oracle
Definitions:
– A node in both T_M and T_o is odd
– A node in both T_M and T_e is even
– A node with both odd and even descendants is odd/even
For every odd/even node u, find leaves l_{2i} and l_{2j-1} such that u = lca(l_{2i}, l_{2j-1})
Compute d(u) = lca(l_{2i+1}, l_{2j})
Compute depth_d(u) = depth of u in the tree of d-pointers

42 Over-Merge of Running Example
[Figure: the over-merged tree T_M for S = 121112212221$]

43 Main Theorem
The function d defines a tree on the odd/even nodes of T_M, and for any l_{2i} and l_{2j-1} we have
depth_d(lca(l_{2i}, l_{2j-1})) = lcp(S[2i,n], S[2j-1,n])
where depth_d(u) is the depth of u in the tree of d-pointers.

44 Helpful Observations
Let u be an odd/even node in T_M. u is either even or odd, and so L(u) is defined.
Let u be an even node:
1. For l_{2i} and l_{2j} below u: lcp(S[2i,n], S[2j,n]) ≥ L(u)
2. For l_{2i'-1} and l_{2j'-1} below u: lcp(S[2i'-1,n], S[2j'-1,n]) ≥ L(u)
3. For l_{2i''} and l_{2j''-1} below u: lcp(S[2i'',n], S[2j''-1,n]) ≤ L(u)
The argument is symmetric if u is an odd node.

45 Lemma
The lcp value of any odd and even pair of leaves whose lca is u must be the same.
Proof: Suppose lca(l_{2i'}, l_{2j'-1}) = lca(l_{2i''}, l_{2j''-1}) = u. Then
lcp(S[2i',n], S[2j'-1,n]) = k ≤ L(u)
lcp(S[2i',n], S[2i'',n]) ≥ L(u) ≥ k
⇒ lcp(S[2i'',n], S[2j'-1,n]) = k
[Figure: the suffixes S[2i',n], S[2j'-1,n] and S[2i'',n] aligned, with the lengths k and L(u) marked]

46 Induction on the lcp
Pick a pair of odd and even suffixes S[2i',n] and S[2j'-1,n].
Base: if S[2i'] ≠ S[2j'-1] then lca = root (recall the merge procedure) ⇒ lcp = 0.
Assumption: suppose the theorem is true for lcp < k.
Step: lcp = k > 0 and u = lca(l_{2i}, l_{2j-1}) ⇒ u ≠ root. Suppose d(u) = lca(l_{2i'+1}, l_{2j'}); then
depth_d(u) =(1) 1 + depth_d(d(u)) =(2) 1 + lcp(S[2i'+1,n], S[2j',n]) =(3) lcp(S[2i,n], S[2j-1,n]) ∎
Here depth_d(u) is the depth of u in the tree of d-pointers.

47 Done!
[Figure: the RAM algorithm scheme of slide 15]

48 The End

