On the Sorting-Complexity of Suffix Tree Construction
MARTIN FARACH-COLTON, PAOLO FERRAGINA, S. MUTHUKRISHNAN
Fact From the Previous Talk
Harel and Tarjan 1984, Bender and Farach-Colton 2000
A tree T with m nodes can be preprocessed in O(m) time so that, for any pair of its nodes u, v, lca(u, v) can be computed in constant time.
What’s in This Paper
Bounds depend on the alphabet
– Constant-size alphabet: O(n) (Weiner 1973)
– Unbounded alphabet: Θ(n log n)
– Alphabet {1…n}: linear time
RAM algorithm
DAM algorithm (I/O optimal)
The algorithm also works for PRAM and PDAM
Talk Outline
Suffix trees
– Reminder
– Tools
RAM algorithm for suffix tree construction
Conclusion
Suffix Trees
S = 121112212221$, n = 13, S[8,13] = 12221$
[Figure: suffix tree of S]
Suffix Tree Representation
[Figure: representation of the suffix tree of the running example]
Properties of Suffix Trees
lcp(σ(v), σ(w)) = |σ(lca(v, w))|
[Figure: nodes v and w with lca(v, w); example nodes σ = 1 (L = 1), σ = 12 (L = 2), σ = 11 (L = 2)]
Suffix Links
Lemma [Weiner 1973]: Let a ∈ Σ and α ∈ Σ*. If there is a node v in T_S such that σ(v) = aα, then there is a node w in T_S such that σ(w) = α.
Define the suffix link as sl(v) = w.
Suffix Links
[Figure: suffix links between nodes σ = 1 (L = 1), σ = 12 (L = 2), σ = 122 (L = 3) and σ = 2]
Suffix Links Example
Suffix Arrays
Let Δ = {S_i | S_i ∈ Σ*, |S_i| = n_i}
T = compacted trie of Δ
In-order traversal of the leaves gives the strings in lexicographic order: S_{p_1}, …, S_{p_|Δ|}
– sort array: A_T[i] = p_i
– longest common prefix array: LCP_T[i] = lcp(S_{p_i}, S_{p_{i+1}})
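A minimal sketch of the two arrays just defined, computed naively for a small set of strings. The example strings, the function names and the `$` end marker are illustrative assumptions; Python's built-in sort stands in for the in-order traversal of the trie.

```python
def lcp(x, y):
    """Length of the longest common prefix of two strings."""
    k = 0
    while k < len(x) and k < len(y) and x[k] == y[k]:
        k += 1
    return k


def sort_and_lcp_arrays(strings):
    """Return (A, LCP) for a list of strings, with 1-based string ids."""
    order = sorted(range(len(strings)), key=lambda i: strings[i])
    A = [i + 1 for i in order]                      # A[i] = id of the i-th smallest string
    LCP = [lcp(strings[order[k]], strings[order[k + 1]])
           for k in range(len(order) - 1)]          # lcp of lexicographic neighbours
    return A, LCP


# Hypothetical input: four strings, each closed by the end marker "$".
strings = ["122$", "11$", "12$", "2$"]
A, LCP = sort_and_lcp_arrays(strings)
print(A, LCP)
```

The end marker guarantees that no string is a prefix of another, so every string ends at its own leaf of the compacted trie.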
Suffix Array Example
[Figure: arrays A_T and LCP_T read off a compacted trie; example node σ = 11, L = 2]
RAM Algorithm
Input: string S
Output: T_S
Divide and conquer:
1. Recursively compute T_o, the compacted trie of the suffixes beginning at odd positions
2. Recursively compute T_e, the compacted trie of the suffixes beginning at even positions
3. Merge T_e and T_o to get T_S
Divide and Conquer Scheme
[Figure: recursion tree with A(n) at the root, two A(n/2) subproblems and four A(n/4) subproblems (Divide, Conquer), combined by merge steps S(n/2) and S(n) (Merge)]
RAM Algorithm Scheme
[Figure: pipeline of the algorithm. Divide: S with |S| = n, Σ = [n] is mapped to S' with |S'| = n/2, Σ' = [n/2]. Conquer: T_S' gives A_Ts', LCP_Ts', then A_To, LCP_To, then A_Te, LCP_Te. Merge: A_T, LCP_T and finally T_S]
Switching Representations
Suffix Tree → Suffix Array
[Figure: A_T and LCP_T read off the tree; example node σ = 11, L = 2]
Suffix Array → Suffix Tree
[Figure: the tree rebuilt from A_T and LCP_T; example node σ = 11, L = 2]
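Switching between the two representations can be sketched as follows. This is one possible way to do it, not necessarily the construction used in the talk: the tree is rebuilt from the arrays with a stack holding the rightmost path, and the arrays are read back by an in-order traversal. The node layout (dicts with `depth`, `children`, `leaf`) and all names are assumptions made for illustration; strings are assumed to end with a unique end marker so that none is a prefix of another.

```python
def trie_from_arrays(strings, A, LCP):
    """Build a compacted trie from the sort array A (1-based ids) and LCP."""
    root = {"depth": 0, "children": [], "leaf": None}
    stack = [root]                          # rightmost path, root at the bottom
    for k, sid in enumerate(A):
        leaf = {"depth": len(strings[sid - 1]), "children": [], "leaf": sid}
        h = LCP[k - 1] if k > 0 else 0      # string depth where the new leaf branches off
        last = None
        while stack[-1]["depth"] > h:       # leave the nodes deeper than the branch point
            last = stack.pop()
        if stack[-1]["depth"] < h:          # branch point lies inside an edge: split it
            mid = {"depth": h, "children": [last], "leaf": None}
            stack[-1]["children"][-1] = mid
            stack.append(mid)
        stack[-1]["children"].append(leaf)
        stack.append(leaf)
    return root


def arrays_from_trie(root):
    """Read A and LCP back: leaves in order, lcp = depth of the lca of neighbours."""
    A, LCP = [], []

    def visit(node):
        if node["leaf"] is not None:
            A.append(node["leaf"])
            return
        for idx, child in enumerate(node["children"]):
            if idx > 0:                     # crossing between two children of this node
                LCP.append(node["depth"])
            visit(child)

    visit(root)
    return A, LCP


# Hypothetical input: the same four strings, already sorted via A.
strings = ["122$", "11$", "12$", "2$"]
root = trie_from_arrays(strings, [2, 3, 1, 4], [1, 2, 0])
A, LCP = arrays_from_trie(root)
print(A, LCP)
```

Each node is pushed and popped at most once, so both directions take time linear in the number of strings.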
Compressing S
Input: |S| = n, Σ = [n]
Map character pairs into single characters:
– For i = 1 to n/2, form the pairs (S[2i-1], S[2i])
– Sort the pairs lexicographically by radix sort: O(n)
– Remove duplicates
S'[i] = rank of (S[2i-1], S[2i])
Now |S'| = n/2 and Σ' = [n/2]
Example
S = 121112212221$, Σ = [13]
1. Pairs: (1,2) (1,1) (1,2) (2,1) (2,2) (2,1)
2. Ordered pairs: (1,1) (1,2) (1,2) (2,1) (2,1) (2,2)
3. Duplicates removed: (1,1) (1,2) (2,1) (2,2)
4. S' = 212343$, Σ' = [4]
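The compression step can be sketched in a few lines of Python, run on the character pairs of the example. The function name `compress` is an assumption, and `sorted` stands in for the linear-time radix sort named on the slide, so this sketch is not itself linear time.

```python
def compress(S):
    """Replace each pair (S[2i-1], S[2i]) by its rank among the distinct pairs.

    S is a list of integers of even length, without the end marker.
    """
    pairs = [(S[i], S[i + 1]) for i in range(0, len(S), 2)]
    distinct = sorted(set(pairs))           # sort and remove duplicates
    rank = {pair: r for r, pair in enumerate(distinct, start=1)}
    return [rank[pair] for pair in pairs]


S = [int(c) for c in "121112212221"]        # the pairs 12 11 12 21 22 21 of the example
S_prime = compress(S)
print(S_prime)
```

Because ranks preserve the order of the pairs, comparing two suffixes of S' gives the same result as comparing the corresponding odd suffixes of S.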
Decompressing S
Input: A_Ts', LCP_Ts'
Notice: S[2i-1] ··· S[n]$ = S'[i] ··· S'[n/2]$
A_To[i] = 2 · A_Ts'[i] – 1
LCP_To[i] = 2 · LCP_Ts'[i] + 1 if S[A_To[i] + 2·LCP_Ts'[i]] = S[A_To[i+1] + 2·LCP_Ts'[i]]
LCP_To[i] = 2 · LCP_Ts'[i] + 0 otherwise
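A small sketch of the decompression rule, using 1-based positions as on the slide. The names `char_at` and `odd_arrays` are assumptions, as is the convention that positions past the end of S read as the end marker (encoded as 0). The input arrays for S' = 212343 were worked out by hand for this illustration.

```python
def char_at(S, p):
    """1-based access to S; positions past the end read as the end marker 0."""
    return S[p - 1] if p <= len(S) else 0


def odd_arrays(S, A_c, LCP_c):
    """Derive A_To and LCP_To from the arrays of the compressed string S'."""
    A_o = [2 * p - 1 for p in A_c]          # suffix p of S' is suffix 2p-1 of S
    LCP_o = []
    for k in range(len(A_o) - 1):
        doubled = 2 * LCP_c[k]              # every shared character of S' is a shared pair of S
        same = char_at(S, A_o[k] + doubled) == char_at(S, A_o[k + 1] + doubled)
        LCP_o.append(doubled + (1 if same else 0))
    return A_o, LCP_o


S = [int(c) for c in "121112212221"]
A_c = [2, 1, 3, 6, 4, 5]                    # sort array of the suffixes of S' = 212343
LCP_c = [0, 1, 0, 1, 0]
A_o, LCP_o = odd_arrays(S, A_c, LCP_c)
print(A_o, LCP_o)
```

The extra comparison is needed because two different pairs of S can still agree in their first character.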
Building the Even Tree
Input: A_To, LCP_To
Observation: if P is an even suffix of S, then P = aP', where a is a single character and P' is an odd suffix of S
To get A_Te, apply radix sort to the even suffixes S[2i,n] using the keys (S[2i], S[2i+1,n]), where the second key is the rank of the odd suffix in A_To
lcp(S[2i,n], S[2j,n]) = lcp(S[2i+1,n], S[2j+1,n]) + 1 if S[2i] = S[2j]
lcp(S[2i,n], S[2j,n]) = 0 otherwise
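The derivation of the even arrays from the odd ones might look like this in Python. It is a sketch under stated assumptions: positions are 1-based, n is even, the empty suffix after the last character sorts first, Python's sort on (character, rank) keys stands in for radix sort, and the lcp of two odd suffixes is taken as a naive minimum over LCP_To rather than a constant-time lca query. All names are hypothetical.

```python
def even_arrays(S, A_o, LCP_o):
    """Derive A_Te and LCP_Te from A_To and LCP_To (1-based positions)."""
    n = len(S)                                       # assumed even
    rank_o = {p: r for r, p in enumerate(A_o)}       # rank of every odd suffix
    rank_o[n + 1] = -1                               # empty suffix sorts before all others
    A_e = sorted(range(2, n + 1, 2),
                 key=lambda p: (S[p - 1], rank_o[p + 1]))
    LCP_e = []
    for a, b in zip(A_e, A_e[1:]):
        if S[a - 1] != S[b - 1]:                     # first characters differ
            LCP_e.append(0)
            continue
        ra, rb = rank_o[a + 1], rank_o[b + 1]        # ra < rb by the sort order
        # lcp of the two odd remainders = minimum of LCP_To between their ranks
        rest = min(LCP_o[ra:rb]) if ra >= 0 else 0
        LCP_e.append(rest + 1)
    return A_e, LCP_e


S = [int(c) for c in "121112212221"]
A_o, LCP_o = [3, 1, 5, 11, 7, 9], [1, 2, 0, 2, 1]    # odd arrays, worked out by hand
A_e, LCP_e = even_arrays(S, A_o, LCP_o)
print(A_e, LCP_e)
```

Replacing the naive minimum by a range-minimum or lca structure is what keeps this step within linear time.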
Merging T_o and T_e
Input: A_To, LCP_To and A_Te, LCP_Te
Trivial method: sort the suffixes lexicographically, Θ(n²)
What if we have an oracle for lcp(S[2i,n], S[2j-1,n])?
Merge A_To and A_Te directly (like sorted lists)
Compute LCP_T from the previous results:
1. lcp of adjacent odd suffixes from LCP_To
2. lcp of adjacent even suffixes from LCP_Te
3. lcp of an odd suffix and an even suffix from the oracle
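The merge with an lcp oracle can be sketched as an ordinary merge of two sorted lists. Here the oracle is a naive character-by-character comparison, which is only a placeholder: the point of the talk is to answer these queries in constant time. The names `naive_oracle` and `merge_arrays` and the encoding of the end marker as 0 are assumptions; the input arrays were worked out by hand for the running example.

```python
def naive_oracle(S):
    """Placeholder oracle: lcp of the suffixes at 1-based positions p and q."""
    def lcp(p, q):
        k = 0
        while p + k <= len(S) and q + k <= len(S) and S[p + k - 1] == S[q + k - 1]:
            k += 1
        return k
    return lcp


def merge_arrays(S, A_o, LCP_o, A_e, LCP_e, oracle):
    """Merge the odd and even arrays into A_T and LCP_T."""
    def ch(p):
        return S[p - 1] if p <= len(S) else 0        # end marker past the end

    merged = []                                      # (position, source, index in source)
    i = j = 0
    while i < len(A_o) and j < len(A_e):
        p, q = A_o[i], A_e[j]
        k = oracle(p, q)                             # skip the common prefix
        if ch(p + k) < ch(q + k):                    # one comparison decides the order
            merged.append((p, "o", i))
            i += 1
        else:
            merged.append((q, "e", j))
            j += 1
    merged += [(A_o[t], "o", t) for t in range(i, len(A_o))]
    merged += [(A_e[t], "e", t) for t in range(j, len(A_e))]

    LCP = []
    for (p, sp, ip), (q, sq, _) in zip(merged, merged[1:]):
        if sp == sq == "o":
            LCP.append(LCP_o[ip])                    # neighbours in A_To
        elif sp == sq == "e":
            LCP.append(LCP_e[ip])                    # neighbours in A_Te
        else:
            LCP.append(oracle(p, q))                 # odd/even pair
    return [p for p, _, _ in merged], LCP


S = [int(c) for c in "121112212221"]
A_o, LCP_o = [3, 1, 5, 11, 7, 9], [1, 2, 0, 2, 1]
A_e, LCP_e = [12, 4, 8, 2, 10, 6], [1, 1, 0, 1, 3]
A_T, LCP_T = merge_arrays(S, A_o, LCP_o, A_e, LCP_e, naive_oracle(S))
print(A_T, LCP_T)
```

Two suffixes of the same parity that end up adjacent in the merged order were already adjacent in their own array, which is why their lcp can be copied directly.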
Coupled-DFS (the uncompacted case)
[Figure sequence: a coupled DFS over the tries T_1 and T_2 (nodes A to F, edge labels 1, 2, 3) builds the merged trie T_M step by step; nodes reached in both tries are merged (labels such as A+D and C+F), the others are copied]
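For uncompacted tries the coupled DFS is short enough to sketch directly. Tries are represented here as nested dictionaries mapping a character to a child, which is an assumption made for illustration, as are the function names and the example words.

```python
def insert(trie, word):
    """Insert a word into an uncompacted trie stored as nested dicts."""
    node = trie
    for c in word:
        node = node.setdefault(c, {})


def merge_tries(t1, t2):
    """Walk both tries together; edges with the same character are merged."""
    merged = {}
    for c in sorted(set(t1) | set(t2)):
        if c in t1 and c in t2:
            merged[c] = merge_tries(t1[c], t2[c])    # present in both: recurse in both
        elif c in t1:
            merged[c] = merge_tries(t1[c], {})       # only in the first trie: copy
        else:
            merged[c] = merge_tries({}, t2[c])       # only in the second trie: copy
    return merged


def words(trie, prefix=""):
    """List the root-to-leaf strings of a trie in lexicographic order."""
    if not trie:
        return [prefix]
    return [w for c in sorted(trie) for w in words(trie[c], prefix + c)]


# Hypothetical example tries.
T1, T2 = {}, {}
for w in ["112$", "12$"]:
    insert(T1, w)
for w in ["113$", "2$"]:
    insert(T2, w)
TM = merge_tries(T1, T2)
print(words(TM))
```

With one character per edge, two edges either match completely or not at all; that is exactly what fails for compacted tries, where edges carry strings of different lengths.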
Coupled-DFS (the compacted case)
[Figure: the same merge on compacted tries T_1 and T_2 (nodes A to F, edge labels 1, 2, 3); the merged trie T_M contains D, the merged node C+F and a new node G]
Over-Merging T_o and T_e
How do we merge compacted tries?
An over-merge is like a merge, but:
– Compare only the first characters of edges
– For two edges with different lengths k < l, break the edge of length l into two edges of lengths k and l-k
– Identify edges by their first letter only
Over-Merge Example
[Figure: over-merge of compacted tries T_1 and T_2 (nodes A to F) into T_M, which contains D, the merged node C+F and a new node G]
Over-Merge of Running Example
S = 121112212221$
[Figure sequence: the odd tree T_o, the even tree T_e and the over-merged tree T_M]
Building the lcp Oracle
Definitions
– A node in both T_M and T_o is odd
– A node in both T_M and T_e is even
– A node with both odd and even descendants is odd/even
For every odd/even node u, find leaves l_2i and l_2j-1 such that u = lca(l_2i, l_2j-1)
Compute d(u) = lca(l_2i+1, l_2j)
Compute δ(u) = depth of u in the tree of d-pointers
Over-Merge of Running Example
S = 121112212221$
[Figure: the over-merged tree T_M]
Main Theorem
The function d defines a tree on the odd/even nodes of T_M, and for any l_2i and l_2j-1 we have
δ(lca(l_2i, l_2j-1)) = lcp(S[2i,n], S[2j-1,n])
Helpful Observations
Let u be an odd/even node in T_M. u is either even or odd, and so L(u) is defined.
Let u be an even node:
1. For l_2i and l_2j below u: lcp(S[2i,n], S[2j,n]) ≥ L(u)
2. For l_2i'-1 and l_2j'-1 below u: lcp(S[2i'-1,n], S[2j'-1,n]) ≥ L(u)
3. For l_2i'' and l_2j''-1 below u: lcp(S[2i'',n], S[2j''-1,n]) ≤ L(u)
The proof is symmetric if u is an odd node.
Lemma
The lcp value of any odd and even pair of leaves whose lca is u must be the same.
Proof: Suppose lca(l_2i', l_2j'-1) = lca(l_2i'', l_2j''-1) = u
lcp(S[2i',n], S[2j'-1,n]) = k ≤ L(u)
lcp(S[2i',n], S[2i'',n]) ≥ L(u) ≥ k
⇒ lcp(S[2i'',n], S[2j'-1,n]) = k
[Figure: the suffixes S[2i',n], S[2j'-1,n] and S[2i'',n] aligned, with the lengths k and L(u) marked]
Induction on the lcp
Pick a pair of odd and even suffixes S[2i',n] and S[2j'-1,n].
Base: if S[2i'] ≠ S[2j'-1], then lca = root (recall the merge procedure) and lcp = 0.
Assumption: suppose the theorem is true for smaller lcp values, and let lcp > 0, so that u = lca(l_2i, l_2j-1) ≠ root.
Suppose d(u) = lca(l_2i'+1, l_2j'); then:
δ(u) = δ(d(u)) + 1 = lcp(S[2i'+1,n], S[2j',n]) + 1 = lcp(S[2i',n], S[2j'-1,n]) = lcp(S[2i,n], S[2j-1,n])
Done!
The End