Published by Jasmin Hodge. Modified over 9 years ago.
1
On the Sorting-Complexity of Suffix Tree Construction
Martin Farach-Colton, Paolo Ferragina, S. Muthukrishnan
2
Fact From the Previous Talk
Harel and Tarjan 1984; Bender and Farach-Colton 2000:
A tree T with m nodes can be preprocessed in O(m) time so that, for any pair of its nodes u, v, lca(u, v) can be computed in constant time.
3
What's in This Paper
Bounds depend on the alphabet:
– Constant-size alphabet: O(n) (Weiner 1973)
– Unbounded alphabet: Θ(n log n)
– Integer alphabet {1…n}: linear time
  – RAM algorithm
  – DAM algorithm (I/O optimal)
The algorithm also works for PRAM and PDAM.
4
Talk Outline
– Suffix trees
  – Reminder
  – Tools
– RAM algorithm for suffix tree construction
– Conclusion
5
Suffix Trees
S = 121112212221$, n = 13
Example suffix: S[8,13] = 12221$
[Figure: suffix tree of S; the leaves, read left to right, are labeled 3, 4, 1, 5, 8, 12, 2, 7, 11, 6, 10, 9, 13]
6
Suffix Tree Representation
[Figure: representation of the suffix tree of S = 121112212221$, with leaves 3, 4, 1, 5, 8, 12, 2, 7, 11, 6, 10, 9, 13]
7
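Properties of Suffix Trees
– σ(v) is the string spelled on the path from the root to node v, and L(v) = |σ(v)|. Examples from the figure: σ = 1, L = 1; σ = 12, L = 2; σ = 11, L = 2.
– For leaves v and w: \(\mathrm{lcp}(\sigma(v), \sigma(w)) = |\sigma(\mathrm{lca}(v, w))|\)
[Figure: suffix tree of S with two leaves v, w and their lowest common ancestor lca(v, w) marked]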
8
Suffix Links
Lemma [Weiner 1973]: Let \(a \in \Sigma\) and \(\alpha \in \Sigma^*\). If there is a node v in \(T_S\) such that \(\sigma(v) = a\alpha\), then there is a node w in \(T_S\) such that \(\sigma(w) = \alpha\).
Define the suffix link as sl(v) = w.
9
Suffix Links
[Figure: suffix tree of S with suffix links between internal nodes; labeled nodes: σ = 1 (L = 1), σ = 12 (L = 2), σ = 122 (L = 3), σ = 2 (L = 1)]
10
Suffix Links Example
[Figure: suffix links in the suffix tree of the running example]
11
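Suffix Arrays
– Let \(\Delta = \{S_i \mid S_i \in \Sigma^*, |S_i| = n_i\}\) and let \(T_\Delta\) be the compacted trie of \(\Delta\).
– An in-order traversal of the leaves gives the strings in lexicographical order: \(S_{p_1}, \ldots, S_{p_{|\Delta|}}\)
– Sort array: \(A_T[i] = p_i\)
– Longest common prefix array: \(\mathrm{LCP}_T[i] = \mathrm{lcp}(S_{p_i}, S_{p_{i+1}})\)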
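The two arrays can be illustrated with a deliberately naive sketch on the running example S = 121112212221$. It sorts the suffixes directly, so it is quadratic or worse and is not the construction discussed in this talk. The function names are hypothetical, and encoding $ as the largest character is an assumption chosen to match the leaf order in the figures.

```python
# Running example; the terminator $ is encoded as 3 so that it sorts after
# the characters 1 and 2 (an assumption that matches the slides' leaf order).
S = [1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 3]

def suffix_array_naive(s):
    """Return the 1-indexed suffix start positions in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])

def lcp(s, i, j):
    """Longest common prefix length of the suffixes starting at i and j."""
    k = 0
    while max(i, j) - 1 + k < len(s) and s[i - 1 + k] == s[j - 1 + k]:
        k += 1
    return k

def lcp_array(s, sa):
    """LCP[i] = lcp of the i-th and (i+1)-th suffix in sorted order."""
    return [lcp(s, a, b) for a, b in zip(sa, sa[1:])]

A_T = suffix_array_naive(S)
LCP_T = lcp_array(S, A_T)
print(A_T)
print(LCP_T)
```

The sort array lists the leaves of the compacted trie from left to right, and each LCP entry is the string depth of the lca of two adjacent leaves.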
12
Suffix Array Example
i       1   2   3   4   5   6   7   8   9  10  11  12  13
A_T     3   4   1   5   8  12   2   7  11   6  10   9  13
LCP_T   2   1   2   3   1   0   2   2   1   3   2   0   -
Example: the lca of leaves 3 and 4 has σ = 11 and L = 2, so LCP_T[1] = 2.
13
RAM Algorithm
Input: string S. Output: T_S.
Divide and conquer:
1. Recursively compute T_o, the compacted trie of the suffixes beginning at odd positions
2. Compute T_e, the compacted trie of the suffixes beginning at even positions, from T_o
3. Merge T_e and T_o to get T_S
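The linear bound can be checked with a short recurrence. This is a sketch under two assumptions: there is a single recursive call, on the compressed string S' of length n/2, and all remaining work (compression, decompression, building the even tree, merging) takes at most c·n time for some hypothetical constant c.

```latex
% T(n): time to build T_S for |S| = n.
% c: assumed constant bounding the non-recursive, linear-time work.
\begin{align*}
T(n) &\le T(n/2) + c\,n \\
     &\le c\,n \left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \cdots\right) + T(1) \\
     &\le 2\,c\,n + O(1) = O(n).
\end{align*}
```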
14
Divide and Conquer Scheme
[Figure: recursion tree. Divide: A(n) splits into two A(n/2), then four A(n/4). Conquer and merge: the results S(n/2) are combined into S(n).]
15
RAM Algorithm Scheme
[Figure: flow diagram with numbered steps 1 to 6 in three phases]
– Divide: |S| = n, Σ = [n] is compressed to |S'| = n/2, Σ' = [n/2]
– Conquer: T_S' is built recursively, giving A_Ts', LCP_Ts', then A_To, LCP_To, then A_Te, LCP_Te
– Merge: A_T, LCP_T, and finally T_S
16
Switching Representations
[Figure: RAM algorithm scheme, repeated from slide 15]
17
Suffix Tree → Suffix Array
[Figure: suffix tree of the running example above its arrays A_T and LCP_T, as in the Suffix Array Example on slide 12; the node with σ = 11, L = 2 is highlighted]
18
Suffix Array → Suffix Tree
[Figure: arrays A_T and LCP_T of the running example and the suffix tree built from them, as in the Suffix Array Example on slide 12; the node with σ = 11, L = 2 is highlighted]
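Both conversions can be sketched in a few lines. The version below is a minimal illustration, not necessarily the procedure the authors use: it builds the compacted trie from the two arrays with a stack that holds the rightmost path, and reads the arrays back with a depth-first traversal. The node layout (dicts with `depth`, `children`, `leaf`) and the function names are hypothetical, and edge labels are left implicit since they follow from the node depths.

```python
# Arrays of the running example S = 121112212221$ (n = 13, $ sorts last).
SA = [3, 4, 1, 5, 8, 12, 2, 7, 11, 6, 10, 9, 13]
LCP = [2, 1, 2, 3, 1, 0, 2, 2, 1, 3, 2, 0]

def build_tree(sa, lcp, n):
    """Build the compacted trie from a suffix array and its LCP array."""
    root = {"depth": 0, "children": [], "leaf": None}
    stack = [root]  # rightmost root-to-leaf path of the tree built so far
    for k, start in enumerate(sa):
        h = lcp[k - 1] if k > 0 else 0
        last = None
        while stack[-1]["depth"] > h:
            last = stack.pop()
        if stack[-1]["depth"] < h:
            # Split the edge above `last` with a new node of string depth h.
            mid = {"depth": h, "children": [last], "leaf": None}
            stack[-1]["children"][-1] = mid
            stack.append(mid)
        leaf = {"depth": n - start + 1, "children": [], "leaf": start}
        stack[-1]["children"].append(leaf)
        stack.append(leaf)
    return root

def tree_to_arrays(root):
    """Read the sort array and LCP array back off the tree by DFS."""
    sa, lcp = [], []
    def visit(node):
        if node["leaf"] is not None:
            sa.append(node["leaf"])
            return
        for k, child in enumerate(node["children"]):
            if k > 0:
                # Leaves in adjacent child subtrees meet at this node.
                lcp.append(node["depth"])
            visit(child)
    visit(root)
    return sa, lcp

tree = build_tree(SA, LCP, 13)
```

Each node is pushed and popped at most once, so the construction takes time linear in the array length.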
19
Compressing S
[Figure: RAM algorithm scheme, repeated from slide 15]
20
Compressing S
Input: |S| = n, Σ = [n]
Map character pairs into single characters:
– For i = 1 to n/2, form the pairs (S[2i-1], S[2i])
– Sort the pairs lexicographically by radix sort: O(n)
– Remove duplicates
– S'[i] = rank of (S[2i-1], S[2i])
Now |S'| = n/2 and Σ' = [n/2]
21
Example
S = 121112212221$, Σ = [13]
1. Pairs: (1,2) (1,1) (1,2) (2,1) (2,2) (2,1)
2. Ordered pairs: (1,1) (1,2) (1,2) (2,1) (2,1) (2,2)
3. Duplicates removed: (1,1) (1,2) (2,1) (2,2)
4. S' = 212343$, Σ' = [4]
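The compression step can be reproduced with a small sketch. It assumes the input is the string without its terminator and has even length; `radix_sort_pairs` and `compress_pairs` are hypothetical names, and the two-pass bucket sort is one simple way to realize the radix sort mentioned on the slide.

```python
def radix_sort_pairs(pairs, sigma):
    """Stable LSD radix sort of pairs over the alphabet {1..sigma}."""
    for key in (1, 0):  # least significant component first
        buckets = [[] for _ in range(sigma + 1)]
        for p in pairs:
            buckets[p[key]].append(p)
        pairs = [p for bucket in buckets for p in bucket]
    return pairs

def compress_pairs(body):
    """Replace each pair (S[2i-1], S[2i]) by its rank among distinct pairs."""
    assert len(body) % 2 == 0, "sketch assumes an even number of characters"
    pairs = [(body[k], body[k + 1]) for k in range(0, len(body), 2)]
    ordered = radix_sort_pairs(pairs, max(body))
    rank = {}
    for p in ordered:  # duplicates keep the rank of their first occurrence
        if p not in rank:
            rank[p] = len(rank) + 1
    return [rank[p] for p in pairs]

# S = 121112212221$ without the terminator.
S_prime = compress_pairs([1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1])
print(S_prime)
```

Because the ranks preserve the lexicographic order of the pairs, the sorted order of the suffixes of S' matches the sorted order of the odd suffixes of S.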
22
Decompressing S
[Figure: RAM algorithm scheme, repeated from slide 15]
23
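Decompressing S
Input: A_Ts', LCP_Ts'
Notice: the odd suffix S[2i-1] ··· S[n]$ of S corresponds to the suffix S'[i] ··· S'[n/2]$ of S'. Each character of S' stands for two characters of S, so
\[ A_{T_o}[i] = 2 \cdot A_{T_{S'}}[i] - 1 \]
\[ \mathrm{LCP}_{T_o}[i] = 2 \cdot \mathrm{LCP}_{T_{S'}}[i] + \begin{cases} 1 & \text{if } S[A_{T_o}[i] + 2\,\mathrm{LCP}_{T_{S'}}[i]] = S[A_{T_o}[i+1] + 2\,\mathrm{LCP}_{T_{S'}}[i]] \\ 0 & \text{otherwise} \end{cases} \]
[Figure: example comparing two odd suffixes pair by pair]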
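The decompression formulas can be checked on the running example. The sketch below takes the arrays of S' = 212343$ as given (here written out by hand, with $ sorting last) and derives the arrays of the odd suffixes; `odd_arrays` is a hypothetical name and positions are 1-indexed as on the slides.

```python
# S = 121112212221$ with $ encoded as 3 (largest character, an assumption).
S = [1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 3]

# Sort array and LCP array of S' = 212343$.
A_TS1 = [2, 1, 3, 4, 6, 5, 7]
LCP_TS1 = [0, 1, 0, 1, 0, 0]

def odd_arrays(s, a_c, lcp_c):
    """Derive A_To and LCP_To from the arrays of the compressed string."""
    a_odd = [2 * p - 1 for p in a_c]
    lcp_odd = []
    for i, l in enumerate(lcp_c):
        x = a_odd[i] + 2 * l      # first position not covered by the
        y = a_odd[i + 1] + 2 * l  # l matching pairs, in each suffix
        # One extra character may still match inside the differing pair.
        extra = 1 if s[x - 1] == s[y - 1] else 0
        lcp_odd.append(2 * l + extra)
    return a_odd, lcp_odd

A_To, LCP_To = odd_arrays(S, A_TS1, LCP_TS1)
print(A_To, LCP_To)
```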
24
Building the Even Tree
[Figure: RAM algorithm scheme, repeated from slide 15]
25
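Building the Even Tree
Input: A_To, LCP_To
Observation: if P is an even suffix of S, then P = aP', where a ∈ Σ and P' is an odd suffix of S.
To get A_Te, apply radix sort to the even suffixes S[2i,n] using the keys (S[2i], S[2i+1,n]); the order of the odd suffixes S[2i+1,n] is already known from A_To.
\[ \mathrm{lcp}(S[2i,n], S[2j,n]) = \begin{cases} \mathrm{lcp}(S[2i+1,n], S[2j+1,n]) + 1 & \text{if } S[2i] = S[2j] \\ 0 & \text{otherwise} \end{cases} \]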
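A minimal sketch of this step on the running example follows. Two simplifications are assumptions of the sketch, not part of the talk: Python's built-in sort stands in for the radix sort, and the lcp of two odd suffixes is taken as a minimum over a slice of LCP_To, where a linear-time version would use a range-minimum structure. The name `even_arrays` is hypothetical.

```python
# S = 121112212221$ with $ encoded as 3 (largest character, an assumption).
S = [1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 3]
A_To = [3, 1, 5, 7, 11, 9, 13]
LCP_To = [1, 2, 0, 2, 1, 0]

def even_arrays(s, a_odd, lcp_odd):
    """Derive A_Te and LCP_Te from the arrays of the odd suffixes."""
    rank = {p: r for r, p in enumerate(a_odd)}
    evens = range(2, len(s) + 1, 2)
    # Key = (first character, rank of the odd suffix that follows it).
    a_even = sorted(evens, key=lambda p: (s[p - 1], rank.get(p + 1, -1)))
    lcp_even = []
    for p, q in zip(a_even, a_even[1:]):
        if s[p - 1] != s[q - 1]:
            lcp_even.append(0)
        else:
            # lcp of two odd suffixes = minimum of LCP_To between their ranks.
            r1, r2 = rank[p + 1], rank[q + 1]
            lcp_even.append(1 + min(lcp_odd[r1:r2]))
    return a_even, lcp_even

A_Te, LCP_Te = even_arrays(S, A_To, LCP_To)
print(A_Te, LCP_Te)
```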
26
Merging T_o and T_e
[Figure: RAM algorithm scheme, repeated from slide 15]
27
Merging T_o and T_e
Input: A_To, LCP_To and A_Te, LCP_Te
Trivial method: sort the suffixes lexicographically, Θ(n²)
What if we have an oracle for lcp(S[2i,n], S[2j-1,n])?
– Merge A_To and A_Te directly (like sorted lists)
– Compute LCP_T from previous results:
1. lcp of adjacent odd suffixes by LCP_To
2. lcp of adjacent even suffixes by LCP_Te
3. lcp of an odd suffix and an even suffix by the oracle
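The oracle-based merge can be sketched as follows. The oracle here is a naive character-by-character comparison, used only as a stand-in so the example runs; the point of the talk is that it can be answered in constant time. All function names are hypothetical.

```python
# S = 121112212221$ with $ encoded as 3 (largest character, an assumption).
S = [1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 3]
A_To, LCP_To = [3, 1, 5, 7, 11, 9, 13], [1, 2, 0, 2, 1, 0]
A_Te, LCP_Te = [4, 8, 12, 2, 6, 10], [1, 1, 0, 1, 3]

def naive_oracle(p, q):
    """Stand-in lcp oracle for the suffixes starting at positions p and q."""
    k = 0
    while max(p, q) - 1 + k < len(S) and S[p - 1 + k] == S[q - 1 + k]:
        k += 1
    return k

def merge_arrays(s, a_odd, a_even, oracle):
    """Merge two sorted suffix lists, one oracle call per comparison."""
    merged, i, j = [], 0, 0
    while i < len(a_odd) and j < len(a_even):
        p, q = a_odd[i], a_even[j]
        l = oracle(p, q)
        # The first mismatching characters decide the order.
        if s[p - 1 + l] < s[q - 1 + l]:
            merged.append(p)
            i += 1
        else:
            merged.append(q)
            j += 1
    return merged + a_odd[i:] + a_even[j:]

def merged_lcp(merged, a_odd, lcp_odd, a_even, lcp_even, oracle):
    """LCP of the merged array from LCP_To, LCP_Te and the oracle."""
    pos_odd = {p: k for k, p in enumerate(a_odd)}
    pos_even = {p: k for k, p in enumerate(a_even)}
    out = []
    for x, y in zip(merged, merged[1:]):
        if x % 2 == 1 and y % 2 == 1:
            out.append(lcp_odd[pos_odd[x]])    # adjacent in A_To as well
        elif x % 2 == 0 and y % 2 == 0:
            out.append(lcp_even[pos_even[x]])  # adjacent in A_Te as well
        else:
            out.append(oracle(x, y))
    return out

A_T = merge_arrays(S, A_To, A_Te, naive_oracle)
LCP_T = merged_lcp(A_T, A_To, LCP_To, A_Te, LCP_Te, naive_oracle)
```

Two suffixes of the same parity that are adjacent in the merged array are also adjacent in their own array, which is why a direct lookup is enough in the first two cases.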
28
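Coupled-DFS (the uncompacted case)
[Figure: uncompacted tries T1 (subtrees A, B, C) and T2 (subtrees D, E, F) with single-character edges, and the merged trie TM, still empty]
29
Coupled-DFS (the uncompacted case)
[Figure: same tries; TM now contains the first edge, labeled 1]
30
Coupled-DFS (the uncompacted case)
[Figure: same tries; TM contains the path 1, 1 leading to the merged subtree A+D]
31
Coupled-DFS (the uncompacted case)
[Figure: same tries; TM also contains subtree B, reached by an edge labeled 2]
32
Coupled-DFS (the uncompacted case)
[Figure: same tries; TM also contains subtree E, reached by an edge labeled 3]
33
Coupled-DFS (the uncompacted case)
[Figure: same tries; TM is complete, including the merged subtree C+F reached by an edge labeled 2]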
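For uncompacted tries the coupled traversal is short enough to write out. The sketch below represents a trie as nested dicts mapping a character to a child trie, which is an assumption of this example rather than the representation used in the talk; `build_trie`, `coupled_dfs_merge` and `words` are hypothetical names.

```python
def build_trie(strings):
    """Uncompacted trie as nested dicts: character -> child trie."""
    root = {}
    for w in strings:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
    return root

def coupled_dfs_merge(t1, t2):
    """Walk both tries in lockstep and return the merged trie."""
    merged = {}
    for ch in sorted(set(t1) | set(t2)):
        if ch in t1 and ch in t2:
            # The edge exists in both tries: descend in both at once.
            merged[ch] = coupled_dfs_merge(t1[ch], t2[ch])
        else:
            # The edge exists in one trie only: reuse that subtree as is.
            merged[ch] = t1[ch] if ch in t1 else t2[ch]
    return merged

def words(trie, prefix=""):
    """All root-to-leaf strings of a trie, in sorted order."""
    if not trie:
        return [prefix]
    out = []
    for ch in sorted(trie):
        out.extend(words(trie[ch], prefix + ch))
    return out

T1 = build_trie(["11$", "12$", "2$"])
T2 = build_trie(["11$", "13$", "2$"])
TM = coupled_dfs_merge(T1, T2)
print(words(TM))
```

Subtrees that occur in only one input are shared by reference rather than copied, so the work is proportional to the part of the tries that overlaps plus the number of edges at the visited nodes.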
34
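Coupled-DFS (the compacted case)
[Figure: compacted tries T1 (subtrees A, B, C) and T2 (subtrees D, E, F), whose edges carry multi-character labels such as 1234 and 122, and the merged trie TM]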
35
Coupled-DFS (the compacted case)
[Figure: same compacted tries; in TM the edge 1234 is split into 12 and 34, and TM contains subtree D, the merged subtree C+F and a new subtree G]
36
Over-Merging T_o and T_e
How do we merge compacted tries? An over-merge is like a merge, but:
– Compare only the first characters of edges
– For two edges with different lengths k < l, break the edge of length l into parts of length k and l-k
– Identify edges by their first letter only
37
Over-Merge Example
[Figure: over-merge of compacted tries T1 (subtrees A, B, C) and T2 (subtrees D, E, F) into TM, with edges matched by their first character only]
38
Over-Merge of Running Example
S = 121112212221$
[Figure: the odd tree T_o, with leaves 3, 1, 5, 7, 11, 9, 13]
39
Over-Merge of Running Example
S = 121112212221$
[Figure: the even tree T_e, with leaves 2, 4, 6, 8, 10, 12]
40
Over-Merge of Running Example
S = 121112212221$
[Figure: the over-merged tree T_M of T_o and T_e, with leaves 1 to 13]
41
Building the lcp Oracle
Definitions:
– A node in both T_M and T_o is odd
– A node in both T_M and T_e is even
– A node with both odd and even descendants is odd/even
For every odd/even node u:
– Find leaves \(l_{2i}\) and \(l_{2j-1}\) such that \(u = \mathrm{lca}(l_{2i}, l_{2j-1})\)
– Compute \(d(u) = \mathrm{lca}(l_{2i+1}, l_{2j})\)
– Compute \(\delta(u)\) = depth of u in the tree of d-pointers
42
Over-Merge of Running Example
S = 121112212221$
[Figure: the over-merged tree T_M of the running example, repeated from slide 40]
43
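Main Theorem
The function d defines a tree on the odd/even nodes of \(T_M\), and for any \(l_{2i}\) and \(l_{2j-1}\) we have
\[ \delta(\mathrm{lca}(l_{2i}, l_{2j-1})) = \mathrm{lcp}(S[2i,n], S[2j-1,n]) \]
where \(\delta(u)\) is the depth of u in the tree of d-pointers.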
44
Helpful Observations
Let u be an odd/even node in \(T_M\). u is either even or odd, and so L(u) is defined.
Let u be an even node:
1. For \(l_{2i}\) and \(l_{2j}\) below u: \(\mathrm{lcp}(S[2i,n], S[2j,n]) \ge L(u)\)
2. For \(l_{2i'-1}\) and \(l_{2j'-1}\) below u: \(\mathrm{lcp}(S[2i'-1,n], S[2j'-1,n]) \ge L(u)\)
3. For \(l_{2i''}\) and \(l_{2j''-1}\) below u: \(\mathrm{lcp}(S[2i'',n], S[2j''-1,n]) \le L(u)\)
The proof is symmetric if u is an odd node.
45
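Lemma
The lcp value of any odd and even pair of leaves whose lca is u must be the same.
Proof: Suppose \(\mathrm{lca}(l_{2i'}, l_{2j'-1}) = \mathrm{lca}(l_{2i''}, l_{2j''-1}) = u\). Then
– \(\mathrm{lcp}(S[2i',n], S[2j'-1,n]) = k \le L(u)\)
– \(\mathrm{lcp}(S[2i',n], S[2i'',n]) \ge L(u) \ge k\)
– hence \(\mathrm{lcp}(S[2i'',n], S[2j'-1,n]) = k\)
The same argument replaces j' by j''.
[Figure: the suffixes S[2i',n], S[2j'-1,n] and S[2i'',n] aligned, with the prefix lengths k and L(u) marked]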
46
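Induction on the lcp
Pick a pair of odd and even suffixes S[2i',n] and S[2j'-1,n].
Base: if \(S[2i'] \neq S[2j'-1]\), then lca = root (recall the merge procedure), so lcp = 0.
Assumption: suppose the theorem is true for smaller lcp values.
Step: lcp > 0, so \(u = \mathrm{lca}(l_{2i}, l_{2j-1})\) and \(u \neq \text{root}\). Suppose \(d(u) = \mathrm{lca}(l_{2i'+1}, l_{2j'})\); then
\[ \delta(u) \overset{(1)}{=} 1 + \delta(d(u)) \overset{(2)}{=} 1 + \mathrm{lcp}(S[2i'+1,n], S[2j',n]) \overset{(3)}{=} \mathrm{lcp}(S[2i,n], S[2j-1,n]) \]
(1) holds because \(\delta\) is the depth in the tree of d-pointers, (2) by the induction assumption, and (3) because the two suffixes share their first character and, by the lemma, every odd and even pair with lca u has the same lcp.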
47
Done!
[Figure: RAM algorithm scheme, repeated from slide 15]
48
The End