Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA
What is Suffix Sorting? ✢ Given a string, give the sorted order of its suffixes. ✢ Why would we want that? ✤ More on this later.
Example: Mississippi$ Mississippi$ sissippi$ sippi$ i$ pi$ ppi$ ssissippi$ ississippi$ issippi$ ssippi$ ippi$ $ Mississippi$ sissippi$ sippi$ i$ pi$ ppi$ ssissippi$ ississippi$ issippi$ ssippi$ ippi$ $ { {
How fast can we sort? ✢ Sorting suffixes of a string can be no faster than sorting the characters of a string. ✢ Can we match the sorting lower bound?
Suffix Sorting ✢ We are sorting strings, so Radix Sort is natural. ✤ This yields an algorithm with time O(n 2 ) to O(n 2 logn), depending on assumptions on sortability of characters. ✥ O(n 2 logn) in comparison model. ✥ O(n 2 ) for small integers. In between in general word model. ✢ We can do better by combining Merge Sort with Radix Sort.
Building Blocks ✢ Range Reduction ✢ Radix Step ✢ Chunking
Range Reduction ✢ Observation: If we apply a monotone function to the characters, the sorted order doesn’t change.
Example: Mississippi$ Mississippi$ sissippi$ sippi$ i$ pi$ ppi$ ssissippi$ ississippi$ issippi$ ssippi$ ippi$ $ Mississippi$ sissippi$ sippi$ i$ pi$ ppi$ ssissippi$ ississippi$ issippi$ ssippi$ ippi$ $ i1 M2 p3 s4 $5
Range Reduction ✢ Our only range reduction operation will be: ✤ Replace every character by rank in sorted order of characters. ✤ After RR, length n string will be in [n] n i1 M2 p3 s4 $5
RR helps running time ✢ Radix Sort on raw input might take O(n 2 logn) time. ✢ RR takes at most O(nlogn) time. ✢ Radix Sort on small-integer inputs takes O(n 2 ) time. ✢ Total time is O(n 2 ).
Radix Step ✢ Recall that Radix Sort proceeds in steps: ✤ Lexicographically sort the last i characters of each string. ✤ Stably sort by preceding character. Now strings are lexicographically sorted by last i+1 characters. sorted
Using Radix Step ✢ If we have recursively sorted some subset of suffixes of a string: ✤ 1 step of radix will sort the preceding suffixes. ✢ Ex: If you have sorted suffixes at odd positions, then you get suffixes at even positions.
Example: Odd Suffixes
Example: Even Suffixes
Radix Step ✢ We normally think of Radix Sort as sorting one set of strings. ✢ But with suffixes (which are interrelated), we can use a Radix Step to get sorted orders of one set from another.
Where are we now? ✢ Step 1: Recursively sort odd suffixes. ✤ How? And how is it recursive? A recursive step must sort every suffix! We’ll get to that. ✢ Step 2: Get even suffixes in linear time. ✤ By Radix Step. ✢ Step 3: Merge!
Merging is tricky... ✢ F ‘97 gave first linear time solution. ✢ This yields an optimal suffix sorting routine. ✢ It’s a fun algorithm, but highly unintuitive. ✤ Or so I’ve been told!
Chunking ✢ Let’s solve the recursion problem first. ✢ Given two integers i and j, let 〈 i,j 〉 be their bit concatenation. ✤ If i,j ∈ [n], then 〈 i,j 〉 ∈ [n 2 ]. ✢ Given a string S = (s 1,s 2,...,s n ) ✢ Let S’ = ( 〈 s 1,s 2 〉, 〈 s 3,s 4 〉,..., 〈 s n-1,s n 〉 )
Chunking + Recursion: I ✢ Observation: The order of the odd suffixes of S = (s 1,s 2,...,s n ) is the same as the order of all suffixes of S’ = ( 〈 s 1,s 2 〉, 〈 s 3,s 4 〉,..., 〈 s n-1,s n 〉 ) ✤ Since bit concatenation preserves lexicographic ordering.
Example: 〈 21 〉〈 44 〉〈 14 〉〈 41 〉〈 33 〉〈 15 〉 〈 44 〉〈 14 〉〈 41 〉〈 33 〉〈 15 〉 〈 14 〉〈 41 〉〈 33 〉〈 15 〉 〈 41 〉〈 33 〉〈 15 〉 〈 33 〉〈 15 〉 〈 15 〉 base 8
Chunking + Recursion: II ✢ Chunking+Range Reduction = Recursion ✤ Input is in [n] n. ✤ Chunked Input is in [n 2 ] n/2. ✤ Range Reduced Chunking is in [n/2] n/2. ✤ So now problem instance is half the size and we can recurse.
Example: 〈 21 〉〈 44 〉〈 14 〉〈 41 〉〈 33 〉〈 15 〉
Recall Basic Operations ✢ Range Reduction ✢ Radix Step ✢ Chunking ✢ How we are ready for the whole algorithm.
Suffix Sorting ✢ Step 1: Chunk + Range Reduction. ✤ Recurse on new string. ✤ Get sorted order of odd suffixes. ✢ Step 2: Radix Step. ✤ Get sorted order of even suffixes. ✢ Step 3: Merge! ✤ We still don’t know how to do this.
The Trouble with Merging ✢ Let’s start merging. We need to see which is smaller: the smallest odd suffix s 2i-1 or the smallest even suffix s 2j. ✤ We can compare the first character. ✤ We can then compare s 2i with s 2j ✥ But this is another odd/even comparison.
The difference between 3 and 2 ✢ It’s possible to merge the lists. ✢ But Kärkkäinen & Sanders showed the cute way to merge. ✢ The modified the recursion to make the merge easy.
Mod 3 Recursion ✢ Given a string S = (s 1,s 2,...,s n ) ✢ Let S 1 = ( 〈 s 1,s 2,s 3 〉, 〈 s 4,s 5,s 6 〉,..., 〈 s n-2,s n-1,s n 〉 ) ✢ Let S 2 = ( 〈 s 2,s 3,s 4 〉, 〈 s 5,s 6,s 7 〉,..., 〈 s n-4,s n-3,s n-2 〉 ) ✢ Let O 12 be order of suffix congruent to 1 and 2 mod 3. ✤ You get this recursively from sorting the suffixes of S 1 S 2
Radix Step x 2 ✢ We have O 12 from the recursion. ✢ One Radix Step gives us O 01 ✢ Another Radix Step gives us O 02 ✢ Each pair of suffix is now compared in one list. ✢ Each suffix appears in two lists.
Merging... at last! ✢ To merge O 12, O 01, and O 12 note that: ✤ The smallest suffix is the first on two lists. ✤ Pop the smallest. ✤ Now the next smallest is the first on two lists. ✤ Etc.
Total time ✢ T(n) to sort suffix of strings in [n] n ✢ T(n) = recursion + 2*radix + merging ✢ T(n) = T(2n/3) + O(n) + O(n) ✢ T(n) = O(n) ✢ So the initial Range Reduction step into the integer alphabet is the bottleneck. ✤ So this algorithm is optimal for any alphabet.
Why did we want to sort suffixes anyway? ✢ It’s easy to go from Suffix Sorting to Suffix Arrays...
Suffix Arrays ✢ A suffix array is: ✤ The sorted order of suffixes ✤ Their pairwise adjacent lcp’s. ✢ They are handy as space efficient indexes. ✢ How do we compute the lcp’s from the suffix order?
Example: Mississippi$ ??????????? Mississippi$ ☟ ☝
Example: Mississippi$ ??0???????? Mississippi$ ☹ ☹
Example: Mississippi$ ??0???????? Mississippi$ ☟ ☝
Example: Mississippi$ ??0???????4 Mississippi$ ☺☺☺☺ ☺☺☺☺ ☹ ☹
Example: Mississippi$ ??0???????4 Mississippi$ ☺☺ ☺☺ ☹ ☹ ☟ ☝
Example: Mississippi$ ??0????3??4 Mississippi$ ☺☺ ☺☺ ☹ ☹ ☟ ☝
What’s a suffix tree? ✢ Compacted trie of all suffixes of a string. ✢ What’s it good for? ✤ Too many things to enumerate... ✤ See the stringology literature of the last 30 years!
Example: Mississippi$
Brute-force Algorithm ✢ We insert one suffix at a time. ✢ Each suffix insertion takes O(n) time. ✢ Total time is O(n 2 )
How low can we go? ✢ There is a leaf for each suffix ✤ That’s why we put a $ at end. ✢ Each internal node is branching ✤ Because we have a compacted trie. ✢ Each edge has a constant-size label. ✤ Just a pointer to string + length. ✢ So suffix tree has size O(n).
Weiner’s Algorithm ✢ Weiner showed a suffix-tree construction in O(n) time for binary alphabet. ✢ Still adds suffixes one at a time. ✤ But gets a speed-up on insertions by exploiting Suffix Links.
Defs: LCP ✢ Let li be the leaf for suffix S[1,i]. ✢ Let LCP(i,j) be the length of the longest common prefix of S[1,i] and S[1,j]. ✤ Ex: LCP(2,5) = 4 for Mississippi.
Suffix Links ✢ Claim: If some node in a suffix tree has string aα, for a a character and α a string, then some node in the suffix tree has string α.
Example: Mississippi$ issi
Example: Mississippi$ issi
Example: Mississippi$
Suffix Links: Proof ✢ How do we know a suffix link always exists? ✤ Maybe the link points into the middle of a edge...
Suffix Links: Proof Cartoon aα ii+1 j j+1
Adding Suffix Links ✢ Adding each suffix link is just a least common ancestors computation ✤ This takes O(n) preprocessing + O(1) per lca. ✤ Adding all suffix links is O(n) time.
Suffix Links Uses ✢ The first use of suffix links was to speed up suffix tree construction. ✤ If you keep suffix links on the partially constructed tree, you don’t have to start every insertion from the root. ✤ Yields a linear time construction!
So we’re done! ✢ The data structure has linear size. ✤ So linear lower bound. ✢ The algorithm runs in linear time. ✤ So linear upper bound. ✢ Declare victory and go home!
Alphabets everywhere. ✢ This algorithm runs in linear time for binary alphabet. ✢ For an alphabet of size s, it runs in O(n log s).
Lower Bounds ✢ Element uniqueness reduces to suffix tree construction, so in the algebraic decision tree model, the lower bound is Omega(n log n). ✢ If we require edges to be sorted, then sorting is lower bound.
Open Problem ✢ Is there an interesting lower bound for suffix trees where each node may have an arbitrary order of children? ✢ Such a tree is no good as an index, but just fine for LCA applications of suffix trees.
Back to Suffix Links ✢ Fun fact: The suffix links form a sl-tree. The length of the string at a node is the depth in the sl-tree.
Example: Mississippi$
Stripping a Suffix Tree: I ✢ If we are given a suffix tree with no edge labels, we can reconstruct the labels in linear time: ✤ Compute Suffix Links (by LCAs). ✤ Compute Depth in SL-tree. ✤ Each node selects a leaf and computes offset within its descendant suffix.
Example: Mississippi$
Suffix Arrays ✢ A suffix array is: ✤ The sorted order of suffixes ✤ Their pairwise adjacent lcp’s.
Example: Mississippi$
Suffix Arrays ✢ They are much more space efficient than suffix trees. ✢ You can still use them as indexes. ✢ You can build them as fast as suffix trees... ✤ By dfs of suffix tree. How about w/o suffix trees?
Building suffix trees from suffix arrays ✢ You can build the suffix tree left to right by keeping a stack of the right-most path. ✢ This gives you the shape of the tree. ✢ Then add link labels. ✢ Total construction is linear for any alphabet.
Example: Mississippi$
Example: Mississippi$ ☟
Example: Mississippi$ ☟
Example: Mississippi$ ☟
Example: Mississippi$ etc. ☟
Even Less Information! ✢ You don’t even need LCP information to build the suffix tree. ✢ You can compute LCP array from suffix order array in linear time. ✢ Big Conclusion: Given sorted order of suffixes, you can build the suffix array in linear time for any alphabet.
How do we sort suffixes? ✢ Mergesort! ✤ Careful choice of recursion. ✤ Careful merging.
How do we sort suffixes? ✢ Key Operations: ✤ Range Reduction. ✤ Radix Sorting. ✤ Clumping.