Published by Caitlin Chandler. Modified over 9 years ago.
1
Suffix Sorting & Related Algorithmics Martin Farach-Colton Rutgers University USA
2
Suffix Sorting & Related Algorithmics Martin Faraç-Kolton Rutgers University USA
3
What is Suffix Sorting? ✢ Given a string, produce the sorted order of its suffixes. ✢ Why would we want that? ✤ More on this later.
4
Example: Mississippi$ [Figure: the twelve suffixes of Mississippi$, numbered 1 to 12 by starting position, shown in text order and in sorted order.]
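As a baseline, the definition can be spelled out by sorting the suffixes directly. This is a minimal sketch, not the talk's algorithm: it compares whole suffixes, so it is far from linear time. The function name and the character order i < M < p < s < $ (sentinel largest, as in the Range Reduction example) are assumptions made for illustration.

```python
def naive_suffix_order(text, alphabet):
    """Return the 1-based start positions of the suffixes of `text`, sorted.

    `alphabet` lists the characters from smallest to largest (an assumption
    of this sketch; the talk only needs some fixed order on characters).
    """
    rank = {c: r for r, c in enumerate(alphabet, start=1)}
    codes = [rank[c] for c in text]
    # Compare whole suffixes: simple, but far from linear time.
    return sorted(range(1, len(codes) + 1), key=lambda i: codes[i - 1:])


order = naive_suffix_order("Mississippi$", "iMps$")
print(order)  # [8, 5, 2, 11, 1, 10, 9, 7, 4, 6, 3, 12]
```

Positions are 1-based to match the slides. With a different character order (for example Python's default, where $ sorts first) the result changes accordingly.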
5
How fast can we sort? ✢ Sorting suffixes of a string can be no faster than sorting the characters of a string. ✢ Can we match the sorting lower bound?
6
Suffix Sorting ✢ We are sorting strings, so Radix Sort is natural. ✤ This yields an algorithm with time O(n^2) to O(n^2 log n), depending on assumptions on sortability of characters. ✥ O(n^2 log n) in comparison model. ✥ O(n^2) for small integers. In between in general word model. ✢ We can do better by combining Merge Sort with Radix Sort.
7
Building Blocks ✢ Range Reduction ✢ Radix Step ✢ Chunking
8
Range Reduction ✢ Observation: If we apply a strictly monotone (order-preserving) function to the characters, the sorted order doesn’t change.
9
Example: Mississippi$ becomes 214414413315 under the ranks i = 1, M = 2, p = 3, s = 4, $ = 5. [Figure: the suffixes of Mississippi$ beside the suffixes of 214414413315; both sort into the same order.]
10
Range Reduction ✢ Our only range reduction operation will be: ✤ Replace every character by rank in sorted order of characters. ✤ After RR, length n string will be in [n]^n. [Figure: the sorted characters of Mississippi$ with duplicates collapsed, giving the ranks i = 1, M = 2, p = 3, s = 4, $ = 5.]
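The range reduction operation can be sketched in a few lines. The function name and the optional `key` argument are assumptions of this sketch; the `key` is only there so the example can reproduce the character order i < M < p < s < $ used on the slides.

```python
def range_reduce(seq, key=None):
    """Replace each symbol by its 1-based rank among the distinct symbols.

    Sorting the distinct symbols costs O(n log n) in the comparison model.
    """
    distinct = sorted(set(seq), key=key)
    rank = {sym: r for r, sym in enumerate(distinct, start=1)}
    return [rank[sym] for sym in seq]


# Character order i < M < p < s < $ is passed in explicitly (an assumption;
# Python's default order would put $ first).
reduced = range_reduce("Mississippi$", key="iMps$".index)
print("".join(map(str, reduced)))  # 214414413315
```

Because the mapping preserves the order of characters, the suffixes of the reduced string sort in the same order as the suffixes of the original.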
11
RR helps running time ✢ Radix Sort on raw input might take O(n^2 log n) time. ✢ RR takes at most O(n log n) time. ✢ Radix Sort on small-integer inputs takes O(n^2) time. ✢ Total time is O(n^2).
12
Radix Step ✢ Recall that Radix Sort proceeds in steps: ✤ Lexicographically sort the last i characters of each string. ✤ Stably sort by preceding character. Now strings are lexicographically sorted by last i+1 characters.
13
Using Radix Step ✢ Suppose you have some set of suffixes sorted. ✤ Maybe suffixes in odd positions. ✢ One Radix Step backs up each suffix by one character. ✤ This gives us sorted order of even suffixes.
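One way to read this step: an even suffix is one character followed by an odd suffix, so listing the even positions by the rank of the odd suffix that follows them and then stably sorting by the leading character gives their order. The sketch below assumes a range-reduced string (small positive integers), 1-based positions, and a unique sentinel at the end; the function name is hypothetical.

```python
def even_order_from_odd(s, odd_order):
    """Derive the sorted order of even suffixes from that of the odd ones.

    s         -- range-reduced string as a list of small positive integers
    odd_order -- odd start positions (1-based) in sorted suffix order
    """
    n = len(s)
    # List even positions by the rank of the odd suffix that follows them.
    # If n is even, position n is followed by the empty suffix, which sorts first.
    by_tail = ([n] if n % 2 == 0 else []) + [p - 1 for p in odd_order if p > 1]
    # Stable bucket sort on the leading character s_i: linear for small integers.
    buckets = [[] for _ in range(max(s) + 1)]
    for i in by_tail:
        buckets[s[i - 1]].append(i)
    return [i for bucket in buckets for i in bucket]


s = [2, 1, 4, 4, 1, 4, 4, 1, 3, 3, 1, 5]            # 214414413315
print(even_order_from_odd(s, [5, 11, 1, 9, 7, 3]))  # [8, 2, 10, 4, 6, 12]
```

The odd order passed in here was computed by brute force for the running example; in the algorithm it comes from the recursive call.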
14
Example: 214414413315 [Figure: Odd Suffixes. The suffixes starting at odd positions, shown in text order and in sorted order.]
15
Example: 214414413315 [Figure: Even Suffixes. Each even suffix is one character followed by an odd suffix; one Radix Step puts them in sorted order.]
16
Radix Step ✢ Normal Radix Step: ✤ Every string gets a little more sorted. ✢ Our use of Radix Step: ✤ The order of one set of suffixes is determined from the order of another set of suffixes.
17
Where are we now? ✢ Step 1: Recursively sort odd suffixes. ✤ How? And how is it recursive? A recursive step must sort every suffix! We’ll get to that. ✢ Step 2: Sort even suffixes in linear time. ✤ By Radix Step. ✢ Step 3: Merge!
18
Merging is tricky... ✢ Farach ’97 gave the first linear-time solution. ✢ This yields an optimal suffix sorting routine. ✢ It’s a fun algorithm... ✤ but highly unintuitive. ✤ Or so I’ve been told!
20
Merging is tricky... ✢ Farach ’97 gave the first linear-time solution. ✢ This yields an optimal suffix sorting routine. ✢ It’s a fun algorithm... ✤ but highly unintuitive. ✤ Or so people keep telling me!
21
Chunking ✢ Let’s solve the recursion problem first. ✢ Given two integers i and j, let 〈i,j〉 be their bit concatenation. ✤ If i,j ∈ [n], then 〈i,j〉 ∈ [n^2]. ✢ Given a string S = (s_1, s_2, ..., s_n) ✢ Let S’ = (〈s_1,s_2〉, 〈s_3,s_4〉, ..., 〈s_{n-1},s_n〉)
22
Chunking + Recursion: I ✢ Observation: The order of the odd suffixes of S = (s_1, s_2, ..., s_n) is the same as the order of all suffixes of S’ = (〈s_1,s_2〉, 〈s_3,s_4〉, ..., 〈s_{n-1},s_n〉) ✤ Since bit concatenation preserves lexicographic ordering.
23
Example: 214414413315 ✢ S’ = 〈21〉〈44〉〈14〉〈41〉〈33〉〈15〉, which reads as 17 36 12 33 27 13 when each pair is taken as a base-8 number. ✢ Suffix x of S’ corresponds to odd suffix 2x-1 of S. [Figure: the odd suffixes of S beside the suffixes of S’; both sort into the same order.]
24
Chunking + Recursion: II ✢ Chunking+Range Reduction = Recursion ✤ Input is in [n]^n. ✤ Chunked Input is in [n^2]^{n/2}. ✤ Range Reduced Chunking is in [n/2]^{n/2}. ✤ So now problem instance is half the size and we can recurse.
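Chunking followed by range reduction can be sketched as below. The helper names, the explicit `base` argument (a power of two makes the product a true bit concatenation), and the 0 padding for odd lengths are assumptions of this sketch. A naive sort stands in for the recursive call so the example stays short.

```python
def range_reduce(seq):
    """Replace each symbol by its 1-based rank among the distinct symbols."""
    rank = {sym: r for r, sym in enumerate(sorted(set(seq)), start=1)}
    return [rank[sym] for sym in seq]


def chunk_pairs(s, base):
    """<s_1,s_2>, <s_3,s_4>, ... as integers; `base` must exceed every symbol."""
    padded = list(s) + [0] * (len(s) % 2)  # pad odd lengths with 0 (assumption)
    return [padded[i] * base + padded[i + 1] for i in range(0, len(padded), 2)]


def odd_suffix_order(s, base):
    """Sorted order of the odd suffixes of s, via the half-length string S'."""
    reduced = range_reduce(chunk_pairs(s, base))
    # A naive sort stands in here for the recursive call on `reduced`.
    order = sorted(range(1, len(reduced) + 1), key=lambda x: reduced[x - 1:])
    return [2 * x - 1 for x in order]  # suffix x of S' is suffix 2x-1 of S


s = [2, 1, 4, 4, 1, 4, 4, 1, 3, 3, 1, 5]   # 214414413315
print(chunk_pairs(s, 8))                   # [17, 36, 12, 33, 27, 13]
print(range_reduce(chunk_pairs(s, 8)))     # [3, 6, 1, 5, 4, 2]
print(odd_suffix_order(s, 8))              # [5, 11, 1, 9, 7, 3]
```

The reduced string has length n/2 over an alphabet of at most n/2 symbols, which is what lets the same routine be applied to it again.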
25
Example: 214414413315 ✢ 〈21〉〈44〉〈14〉〈41〉〈33〉〈15〉 range-reduces to 361542. [Figure: the suffixes of 361542.]
26
Recall Basic Operations ✢ Range Reduction ✢ Radix Step ✢ Chunking ✢ Now we are ready for the whole algorithm.
27
Suffix Sorting ✢ Step 1: Chunk + Range Reduction. ✤ Recurse on new string. ✤ Get sorted order of odd suffixes. ✢ Step 2: Radix Step. (Not 2nd Recursion!) ✤ Get sorted order of even suffixes. ✢ Step 3: Merge! ✤ We still don’t know how to do this.
28
The Trouble with Merging ✢ Know how the odd suffixes compare. ✢ Know how the even suffixes compare. ✢ No idea how odd & even compare!
29
The difference between 3 and 2 ✢ It’s possible to merge the lists. ✤ By “unintuitive” algorithm. ✢ But Kärkkäinen & Sanders showed the elegant way to merge. ✤ From Plato/Erdős’ Book of Proofs? ✢ They modified the recursion to make the merge easy.
30
Mod 3 Recursion ✢ Given a string S = (s_1, s_2, ..., s_n) ✢ Let S_1 = (〈s_1,s_2,s_3〉, 〈s_4,s_5,s_6〉, ..., 〈s_{n-2},s_{n-1},s_n〉) ✢ Let S_2 = (〈s_2,s_3,s_4〉, 〈s_5,s_6,s_7〉, ..., 〈s_{n-1},s_n,$〉) ✢ Let O_12 be the order of the suffixes at positions congruent to 1 or 2 mod 3. ✤ You get this recursively from sorting the suffixes of S_1 S_2.
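The construction of S_1 S_2 and the mapping back to positions in S can be sketched as follows. The function names are hypothetical, triples are padded with 0 rather than $ (an assumption of this sketch), and the input is assumed to end in a unique sentinel. A naive sort again stands in for the recursive call.

```python
def range_reduce(seq):
    """Replace each symbol by its 1-based rank among the distinct symbols."""
    rank = {sym: r for r, sym in enumerate(sorted(set(seq)), start=1)}
    return [rank[sym] for sym in seq]


def mod3_sample_order(s):
    """Sorted order of the suffixes at positions = 1 or 2 (mod 3), 1-based."""
    n = len(s)
    padded = list(s) + [0, 0]                      # 0 padding is an assumption
    starts = list(range(1, n + 1, 3)) + list(range(2, n + 1, 3))  # S_1 then S_2
    triples = [tuple(padded[i - 1:i + 2]) for i in starts]
    s12 = range_reduce(triples)
    # A naive sort stands in here for the recursive call on s12.
    order = sorted(range(len(s12)), key=lambda x: s12[x:])
    return s12, [starts[x] for x in order]


s = [2, 1, 4, 4, 1, 4, 4, 1, 3, 3, 1, 5]   # 214414413315
s12, o12 = mod3_sample_order(s)
print(s12)  # [4, 7, 6, 5, 2, 2, 1, 3]
print(o12)  # [8, 5, 2, 11, 1, 10, 7, 4]
```

The recursive instance has length about 2n/3, which is where the T(2n/3) term in the running time comes from.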
31
Example: 214414413315 ✢ S_1 = 〈214〉〈414〉〈413〉〈315〉 and S_2 = 〈144〉〈144〉〈133〉〈155〉. ✢ S_1 S_2 range-reduces to S_12 = 47652213. ✢ Suffix x of S_12 is suffix 3(x-4)-1 of S if x > 4, and suffix 3x-2 otherwise. [Figure: the suffixes of S_12 in sorted order, mapped back to positions in S.]
32
Radix Step x 2 ✢ We have O_12 from the recursion. ✢ One Radix Step gives us O_01. ✢ Another Radix Step gives us O_02. ✢ Each pair of suffixes is now compared in one list. ✢ Each suffix appears in two lists.
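The two Radix Steps can be sketched with one helper that backs every suffix in a sorted list up by one character. The name `back_up` and its arguments are assumptions of this sketch; the string is assumed range-reduced and positions are 1-based.

```python
def back_up(s, order, residues, period=3):
    """One Radix Step: from the sorted order of the suffixes whose positions
    have a residue in `residues`, derive the sorted order of the suffixes
    that start one position earlier."""
    n = len(s)
    # The empty suffix at position n + 1 sorts first if it belongs to the class.
    tails = ([n + 1] if (n + 1) % period in residues else []) + list(order)
    # Stable bucket sort on the preceding character.
    buckets = [[] for _ in range(max(s) + 1)]
    for p in tails:
        if p > 1:
            buckets[s[p - 2]].append(p - 1)  # s[p - 2] is character s_{p-1}
    return [p for bucket in buckets for p in bucket]


s = [2, 1, 4, 4, 1, 4, 4, 1, 3, 3, 1, 5]   # 214414413315
o12 = [8, 5, 2, 11, 1, 10, 7, 4]           # from the recursion
o01 = back_up(s, o12, {1, 2})
o02 = back_up(s, o01, {0, 1})
print(o01)  # [1, 10, 9, 7, 4, 6, 3, 12]
print(o02)  # [8, 5, 2, 11, 9, 6, 3, 12]
```

Every position now appears in exactly two of the three lists, and any two positions share at least one list.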
33
Example: 214414413315 [Figure: the three sorted lists O_12, O_01 and O_02, each shown with its suffixes.]
34
Merging... at last! ✢ An example is worth a thousand words...
35
Example: 214414413315 [Figure: merge animation, first frame. The lists O_12, O_01 and O_02 before merging.]
36
Example: 214414413315 [Figure: merge animation. The lists O_12, O_01 and O_02 above the merged output.]
37
Example: 214414413315 [Figure: merge animation. Entries already merged are removed from the front of the lists.]
38
Example: 214414413315 [Figure: merge animation, continued.]
39
Example: 214414413315 [Figure: merge animation, continued.]
40
Example: 214414413315 [Figure: merge animation, continued.]
41
Example: 214414413315 [Figure: merge animation, last frame shown. The remaining entries are merged the same way ("etc.").]
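One way to read the merge from the two facts stated earlier (every suffix appears in two lists, every pair is compared in one list): the smallest suffix not yet output must be at the front of both lists that contain it, and no other position can be at the front of two lists. The sketch below implements that reading; it is an interpretation of the animation, and the function name is hypothetical.

```python
def merge_three(o12, o01, o02):
    """Merge the three pairwise orders into one sorted order of all suffixes."""
    lists = [o12, o01, o02]
    idx = [0, 0, 0]                      # front of each list
    total = (len(o12) + len(o01) + len(o02)) // 2
    merged = []
    while len(merged) < total:
        heads = [lst[i] if i < len(lst) else None for lst, i in zip(lists, idx)]
        for a, b in ((0, 1), (0, 2), (1, 2)):
            if heads[a] is not None and heads[a] == heads[b]:
                # This position leads both of its lists, so it is the minimum.
                merged.append(heads[a])
                idx[a] += 1
                idx[b] += 1
                break
        else:
            raise ValueError("the three lists are not mutually consistent")
    return merged


print(merge_three([8, 5, 2, 11, 1, 10, 7, 4],
                  [1, 10, 9, 7, 4, 6, 3, 12],
                  [8, 5, 2, 11, 9, 6, 3, 12]))
# [8, 5, 2, 11, 1, 10, 9, 7, 4, 6, 3, 12]
```

Each step does a constant amount of work and outputs one position, so the merge is linear in n.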
42
Full Disclosure ✢ Kärkkäinen & Sanders have a more space-efficient but less talk-efficient merging routine. ✤ They get O_0 from O_12 and then merge by looking at up to two characters. ✤ I prefer to waste space.
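A sketch of that style of merge, under stated assumptions: the string is range-reduced and ends in a unique sentinel, positions are 1-based, and a comparison sort is used to build O_0 where a radix step would be used in a linear-time version. The function and helper names are hypothetical. A suffix at a position divisible by 3 is compared with one from O_12 using one or two characters plus a rank looked up in O_12.

```python
def ks_merge(s, o12):
    """Merge O_12 with O_0 using at most two characters per comparison."""
    n = len(s)
    rank = {p: r for r, p in enumerate(o12, start=1)}

    def ch(i):                 # character s_i, or 0 past the end
        return s[i - 1] if i <= n else 0

    def rk(i):                 # rank of suffix i within O_12, or 0 past the end
        return rank.get(i, 0)

    # O_0: positions divisible by 3, ordered by (first character, rank of rest).
    o0 = sorted(range(3, n + 1, 3), key=lambda i: (ch(i), rk(i + 1)))

    def less(i, j):
        """True if suffix i (i = 0 mod 3) sorts before suffix j (j in O_12)."""
        if j % 3 == 1:
            return (ch(i), rk(i + 1)) < (ch(j), rk(j + 1))
        return (ch(i), ch(i + 1), rk(i + 2)) < (ch(j), ch(j + 1), rk(j + 2))

    merged, a, b = [], 0, 0
    while a < len(o0) and b < len(o12):
        if less(o0[a], o12[b]):
            merged.append(o0[a])
            a += 1
        else:
            merged.append(o12[b])
            b += 1
    return merged + o0[a:] + o12[b:]


s = [2, 1, 4, 4, 1, 4, 4, 1, 3, 3, 1, 5]   # 214414413315
print(ks_merge(s, [8, 5, 2, 11, 1, 10, 7, 4]))
# [8, 5, 2, 11, 1, 10, 9, 7, 4, 6, 3, 12]
```

Only O_12 and its rank table are kept, which is the space saving compared with holding three lists.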
43
Total time ✢ T(n): time to sort the suffixes of a string in [n]^n ✢ T(n) = recursion + 2*radix + merging ✢ T(n) = O(n) + T(2n/3) + O(n) + O(n) ✢ T(n) = O(n)
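The last step is a geometric series. Writing the three linear terms as a single cn, with c a constant introduced here for illustration, the recurrence unrolls as:

```latex
\begin{align*}
T(n) &= T\!\left(\tfrac{2n}{3}\right) + cn \\
     &= cn + c\left(\tfrac{2}{3}\right)n + c\left(\tfrac{2}{3}\right)^{2}n + \cdots \\
     &\le cn \sum_{k \ge 0} \left(\tfrac{2}{3}\right)^{k} = 3cn = O(n).
\end{align*}
```

The same argument works for the odd/even recursion, where the subproblem has size n/2 and the sum is bounded by 2cn.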
44
Total time ✢ The initial Range Reduction step into the integer alphabet is the bottleneck. ✤ So this algorithm is optimal for any alphabet.
45
Why did we want to sort suffixes anyway? ✢ Lots of interesting compressed index stuff ✤ See Paolo Ferragina’s talk ✢ It’s easy to go from Suffix Sorting to Suffix Arrays...
46
Suffix Arrays ✢ A suffix array is: ✤ The sorted order of suffixes ✤ Their pairwise adjacent lcp’s. ✢ They are handy as space efficient indexes. ✢ How do we compute the lcp’s from the suffix order?
47
Example: Mississippi$ [Figure: Mississippi$ with its sorted suffix order and the lcp of each adjacent pair.]
48
Example: Mississippi$ ??????????? [Figure: the two hands (☟ ☝) at their starting positions; no lcp entry filled in yet.]
49
Example: Mississippi$ ??0???????? [Figure: the first comparison is a mismatch (☹); one entry is set to 0.]
50
Example: Mississippi$ ??0???????? [Figure: the hands (☟ ☝) move to the next pair of suffixes.]
51
Example: Mississippi$ ??0???????4 [Figure: four matches (☺) followed by a mismatch (☹); one entry is set to 4.]
52
Example: Mississippi$ ??0???????4 [Figure: the hands (☟ ☝) move on; the matches already found are kept.]
53
Example: Mississippi$ ??0????3??4 [Figure: the next entry is set to 3.]
54
Time ✢ Bottom hand only moves forward ✢ Top hand only moves as much as the bottom hand ✢ Linear time ✢ Due to: ✤ Kasai, Lee, Arimura, Arikawa & Park CPM01
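A sketch of this computation in the style of Kasai et al.: suffixes are visited in text order, and the number of matched characters h drops by at most one between consecutive visits, which gives the linear bound. The function name and the 1-based position convention are assumptions of this sketch, and entries are listed in sorted-suffix order, which may differ from the layout drawn on the slides.

```python
def lcp_from_suffix_order(s, order):
    """lcp[r-1] = longest common prefix of suffixes order[r-1] and order[r].

    s     -- the string (any indexable sequence)
    order -- 1-based suffix start positions in sorted order
    """
    n = len(s)
    rank = {p: r for r, p in enumerate(order)}  # 0-based rank of each suffix
    lcp = [0] * (n - 1)
    h = 0                                       # characters known to match
    for i in range(1, n + 1):                   # visit suffixes in text order
        r = rank[i]
        if r == 0:
            h = 0                               # smallest suffix: no predecessor
            continue
        j = order[r - 1]                        # predecessor in sorted order
        while i + h <= n and j + h <= n and s[i + h - 1] == s[j + h - 1]:
            h += 1
        lcp[r - 1] = h
        h = max(h - 1, 0)                       # next suffix keeps h - 1 matches
    return lcp


order = [8, 5, 2, 11, 1, 10, 9, 7, 4, 6, 3, 12]  # order under i < M < p < s < $
print(lcp_from_suffix_order("Mississippi$", order))
# [1, 4, 1, 0, 0, 1, 0, 2, 1, 3, 0]
```

On this input the first three entries to be filled in are 0, 4 and 3, for the suffixes starting at positions 1, 2 and 3.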
55
Why do we care about suffix arrays? ✢ As I said, they are good indexes. ✢ Also, suffix arrays lead to suffix trees... ✤ But what’s a suffix tree?
56
What’s a suffix tree? ✢ Compacted trie of all suffixes of a string. ✢ What’s it good for? ✤ Too many things to enumerate... ✤ Algorithmic workhorse for stringology over the last 30 years!
57
Example: Mississippi$
69
Brute-force Algorithm ✢ We insert one suffix at a time. ✢ Each suffix insertion takes O(n) time. ✢ Total time is O(n^2).
70
Turning Suffix Arrays into Suffix Trees ✢ There is a linear-time algorithm for turning a suffix array into a suffix tree. ✢ The short version of the algorithm is: ✤ Cartesian Trees
71
Building suffix trees from suffix arrays ✢ In left-to-right construction of suffix tree, each new suffix is rightmost leaf. ✢ You can build the suffix tree left to right by keeping a stack of the right-most path. ✢ Total construction is linear for any alphabet.
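The stack-based construction can be sketched as below. The `Node` class, the function names, and the choice to store only string depths (not edge labels) are assumptions of this sketch; it also assumes a unique sentinel, so that no suffix is a prefix of another. `lcp[k-1]` is the lcp of the suffixes at `order[k-1]` and `order[k]`.

```python
class Node:
    def __init__(self, depth, suffix=None):
        self.depth = depth      # string depth: length of the path label
        self.suffix = suffix    # start position for a leaf, None if internal
        self.children = []


def suffix_tree_from_array(n, order, lcp):
    """Build the compacted trie left to right, keeping the rightmost path."""
    root = Node(0)
    stack = [root]                       # rightmost path, root at the bottom
    for k, p in enumerate(order):
        l = lcp[k - 1] if k > 0 else 0
        last = None
        while stack[-1].depth > l:       # climb until string depth <= l
            last = stack.pop()
        if stack[-1].depth < l:
            # Split the edge leading to `last` with a node at string depth l.
            mid = Node(l)
            stack[-1].children[-1] = mid  # `last` was the rightmost child
            mid.children.append(last)
            stack.append(mid)
        leaf = Node(n - p + 1, suffix=p)
        stack[-1].children.append(leaf)  # the new suffix is the rightmost leaf
        stack.append(leaf)
    return root


def leaves(node):
    """Leaf positions in left-to-right order."""
    if node.suffix is not None:
        return [node.suffix]
    return [p for child in node.children for p in leaves(child)]


order = [8, 5, 2, 11, 1, 10, 9, 7, 4, 6, 3, 12]
lcp = [1, 4, 1, 0, 0, 1, 0, 2, 1, 3, 0]
root = suffix_tree_from_array(12, order, lcp)
print(leaves(root) == order)   # True
print(len(root.children))      # 5
```

Each node is pushed and popped at most once, so the construction is linear regardless of alphabet size; child order follows the suffix order that was passed in.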
72
Example: Mississippi$ [Figure: the suffix order and lcp arrays from which the tree is built.]
73
Example: Mississippi$ [Figure: tree construction animation over the suffix order and lcp arrays; ☟ marks the current suffix.]
74
Example: Mississippi$ [Figure: tree construction animation, continued; ☟ marks the current suffix.]
75
Example: Mississippi$ [Figure: tree construction animation, continued; ☟ marks the current suffix.]
76
Example: Mississippi$ [Figure: tree construction animation, last frame shown; the remaining suffixes are added the same way ("etc.").]
77
Suffix Tree Construction ✢ Suffix Sorting + LCP + Cartesian Tree ✢ O(Alphabet Sorting) + O(n) + O(n) + O(n) ✤ = O(Alphabet Sorting) ✢ Is this optimal?
78
Suffix Tree Optimality ✢ If you are using Suffix Tree as a trie, then each node must be sorted, and the construction is optimal. ✢ If you are using Suffix Tree + LCAs (=LCP), then the order of children is irrelevant. ✤ The children of each node can be in any order, and it need not even be consistent between nodes.
79
Suffix Tree Optimality ✢ For small integers, construction is already O(n), so this is optimal, even for Scrambled Suffix Trees. ✢ In comparison model, suffix trees have a lower bound from element uniqueness (depends on degree of root) so we have an optimal algorithm. ✢ For large integers (word model of computation), lower bound is linear, upper bound is super-linear.
80
Open Problem ✢ Close the gap in the time for building a large-alphabet suffix tree, when child order is irrelevant. ✢ Related to Deterministic Hashing Open Problem: ✤ Given n large integers, can you map them to small integers (poly n) in linear time in the word model?