1 Suffix Sorting & Related Algorithmics Martin Farach-Colton Rutgers University USA

2 What is Suffix Sorting? ✢ Given a string, give the sorted order of its suffixes. ✢ Why would we want that? ✤ More on this later.

3 Example: Mississippi$ [Figure: the suffixes Mississippi$, ississippi$, ssissippi$, sissippi$, issippi$, ssippi$, sippi$, ippi$, ppi$, pi$, i$, $ and their sorted order.]

4 How fast can we sort? ✢ Sorting suffixes of a string can be no faster than sorting the characters of a string. ✢ Can we match the sorting lower bound?

5 Suffix Sorting ✢ We are sorting strings, so Radix Sort is natural. ✤ This yields an algorithm with time O(n^2) to O(n^2 log n), depending on assumptions on sortability of characters. ✥ O(n^2 log n) in comparison model. ✥ O(n^2) for small integers. In between in general word model. ✢ We can do better by combining Merge Sort with Radix Sort.
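As a baseline for the bounds above, here is a minimal sketch of the naive approach: sort the start positions by comparing whole suffixes. The function name `naive_suffix_order` is made up for illustration, and the input is the rank string 214414413315 that the deck uses for Mississippi$. Each comparison can cost O(n), which is where the quadratic behaviour comes from.

```python
def naive_suffix_order(s):
    """Return the 1-based start positions of the suffixes of s, in sorted order.

    Each comparison may scan O(n) symbols, so this is the slow baseline,
    not the linear-time algorithm developed in the slides.
    """
    n = len(s)
    return sorted(range(1, n + 1), key=lambda i: s[i - 1:])


# Rank string for Mississippi$ with i=1, M=2, p=3, s=4, $=5.
S = [2, 1, 4, 4, 1, 4, 4, 1, 3, 3, 1, 5]
print(naive_suffix_order(S))  # → [8, 5, 2, 11, 1, 10, 9, 7, 4, 6, 3, 12]
```

Later sketches can be checked against this baseline on small inputs.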

6 Building Blocks ✢ Range Reduction ✢ Radix Step ✢ Chunking

7 Range Reduction ✢ Observation: If we apply a monotone function to the characters, the sorted order doesn’t change.

8 Example: Mississippi$ [Figure: with ranks i=1, M=2, p=3, s=4, $=5, Mississippi$ becomes 2 1 4 4 1 4 4 1 3 3 1 5, and every suffix is rewritten the same way.]

9 Range Reduction ✢ Our only range reduction operation will be: ✤ Replace every character by rank in sorted order of characters. ✤ After RR, length n string will be in [n]^n. ✢ Ex: i=1, M=2, p=3, s=4, $=5.
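A small sketch of range reduction, assuming a hypothetical helper `range_reduce` with an optional `key` that fixes the character order. Passing the order i, M, p, s, $ reproduces the rank table from the example; Python's default character order would put $ first and give different ranks (but an equally valid reduction).

```python
def range_reduce(s, key=None):
    """Replace every symbol by its 1-based rank among the distinct symbols of s.

    `key` is an optional ordering function (an assumption of this sketch);
    the sort costs O(n log n) in the comparison model.
    """
    rank = {c: r for r, c in enumerate(sorted(set(s), key=key), start=1)}
    return [rank[c] for c in s]


# Character order used in the slides' example: i < M < p < s < $.
slide_order = "iMps$".index
print(range_reduce("Mississippi$", key=slide_order))
# → [2, 1, 4, 4, 1, 4, 4, 1, 3, 3, 1, 5]
```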

10 RR helps running time ✢ Radix Sort on raw input might take O(n^2 log n) time. ✢ RR takes at most O(n log n) time. ✢ Radix Sort on small-integer inputs takes O(n^2) time. ✢ Total time is O(n^2).

11 Radix Step ✢ Recall that Radix Sort proceeds in steps: ✤ Lexicographically sort the last i characters of each string. ✤ Stably sort by preceding character. Now strings are lexicographically sorted by last i+1 characters. sorted

12 Using Radix Step ✢ If we have recursively sorted some subset of suffixes of a string: ✤ 1 step of radix will sort the preceding suffixes. ✢ Ex: If you have sorted suffixes at odd positions, then you get suffixes at even positions.
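The odd-to-even step can be sketched as follows, under the assumption that positions are 1-based and the characters are already range reduced. An even suffix is one character followed by an odd suffix, so walking the odd suffixes in sorted order and bucketing their predecessors by that character keeps ties in the right order. The name `radix_step` is hypothetical, and `sorted(buckets)` stands in for the counting array a linear-time version would use.

```python
def radix_step(s, odd_sorted):
    """Given the odd positions in sorted suffix order, return the even ones."""
    n = len(s)
    # If n is even, the suffix at position n is a single character and has no
    # odd suffix after it; it is the smallest within its bucket, so it goes first.
    candidates = [n] if n % 2 == 0 else []
    # Predecessors of the odd suffixes, already ordered by what follows them.
    candidates += [p - 1 for p in odd_sorted if p > 1]
    buckets = {}
    for p in candidates:  # stable distribution by the preceding character
        buckets.setdefault(s[p - 1], []).append(p)
    return [p for c in sorted(buckets) for p in buckets[c]]


S = [2, 1, 4, 4, 1, 4, 4, 1, 3, 3, 1, 5]
print(radix_step(S, [5, 11, 1, 9, 7, 3]))  # → [8, 2, 10, 4, 6, 12]
```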

13 Example: 214414413315 [Figure: the odd suffixes 214414413315, 4414413315, 14413315, 413315, 3315, 15, shown in sorted order.] Odd Suffixes

14 Example: 214414413315 [Figure: the even suffixes 14414413315, 414413315, 4413315, 13315, 315, 5, shown in sorted order.] Even Suffixes

15 Radix Step ✢ We normally think of Radix Sort as sorting one set of strings. ✢ But with suffixes (which are interrelated), we can use a Radix Step to get sorted orders of one set from another.

16 Where are we now? ✢ Step 1: Recursively sort odd suffixes. ✤ How? And how is it recursive? A recursive step must sort every suffix! We’ll get to that. ✢ Step 2: Get even suffixes in linear time. ✤ By Radix Step. ✢ Step 3: Merge!

17 Merging is tricky... ✢ F ‘97 gave first linear time solution. ✢ This yields an optimal suffix sorting routine. ✢ It’s a fun algorithm, but highly unintuitive. ✤ Or so I’ve been told!

18 Chunking ✢ Let’s solve the recursion problem first. ✢ Given two integers i and j, let 〈i,j〉 be their bit concatenation. ✤ If i,j ∈ [n], then 〈i,j〉 ∈ [n^2]. ✢ Given a string S = (s_1, s_2, ..., s_n) ✢ Let S’ = (〈s_1,s_2〉, 〈s_3,s_4〉, ..., 〈s_{n-1},s_n〉)

19 Chunking + Recursion: I ✢ Observation: The order of the odd suffixes of S = (s_1, s_2, ..., s_n) is the same as the order of all suffixes of S’ = (〈s_1,s_2〉, 〈s_3,s_4〉, ..., 〈s_{n-1},s_n〉) ✤ Since bit concatenation preserves lexicographic ordering.

20 Example: 214414413315 [Figure: chunking gives 〈21〉〈44〉〈14〉〈41〉〈33〉〈15〉, which in base 8 is 17 36 12 33 27 13; the suffixes of the chunked string line up with the odd suffixes 214414413315, 4414413315, 14413315, 413315, 3315, 15.]

21 Chunking + Recursion: II ✢ Chunking+Range Reduction = Recursion ✤ Input is in [n]^n. ✤ Chunked Input is in [n^2]^(n/2). ✤ Range Reduced Chunking is in [n/2]^(n/2). ✤ So now problem instance is half the size and we can recurse.
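A sketch of chunking followed by range reduction on the running example, using base 8 as in the slides. The helper names are hypothetical, the zero padding for odd lengths is an assumption of this sketch, and a naive sort stands in for the recursive call.

```python
def chunk_pairs(s, base):
    """Concatenate adjacent symbols: <a,b> = a*base + b (base must exceed every symbol)."""
    padded = list(s) + [0] * (len(s) % 2)  # pad odd lengths with a smallest symbol
    return [padded[i] * base + padded[i + 1] for i in range(0, len(padded), 2)]


def range_reduce(s):
    """Replace each symbol by its 1-based rank among the distinct symbols."""
    rank = {c: r for r, c in enumerate(sorted(set(s)), start=1)}
    return [rank[c] for c in s]


S = [2, 1, 4, 4, 1, 4, 4, 1, 3, 3, 1, 5]
chunked = chunk_pairs(S, base=8)   # [17, 36, 12, 33, 27, 13]
reduced = range_reduce(chunked)    # [3, 6, 1, 5, 4, 2], half the length of S

# Stand-in for the recursive call: sort the suffixes of the reduced string.
order = sorted(range(len(reduced)), key=lambda k: reduced[k:])
odd_sorted = [2 * k + 1 for k in order]  # chunk k starts at position 2k+1 of S
print(odd_sorted)  # → [5, 11, 1, 9, 7, 3]
```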

22 Example: 214414413315 [Figure: range reduction maps 〈21〉〈44〉〈14〉〈41〉〈33〉〈15〉 to 3 6 1 5 4 2, whose suffixes are 361542, 61542, 1542, 542, 42, 2.]

23 Recall Basic Operations ✢ Range Reduction ✢ Radix Step ✢ Chunking ✢ Now we are ready for the whole algorithm.

24 Suffix Sorting ✢ Step 1: Chunk + Range Reduction. ✤ Recurse on new string. ✤ Get sorted order of odd suffixes. ✢ Step 2: Radix Step. ✤ Get sorted order of even suffixes. ✢ Step 3: Merge! ✤ We still don’t know how to do this.

25 The Trouble with Merging ✢ Let’s start merging. We need to see which is smaller: the smallest odd suffix s_{2i-1} or the smallest even suffix s_{2j}. ✤ We can compare the first character. ✤ If those tie, we can then compare s_{2i} with s_{2j+1}. ✥ But this is another odd/even comparison.

26 The difference between 3 and 2 ✢ It’s possible to merge the lists. ✢ But Kärkkäinen & Sanders showed the cute way to merge. ✢ They modified the recursion to make the merge easy.

27 Mod 3 Recursion ✢ Given a string S = (s_1, s_2, ..., s_n) ✢ Let S_1 = (〈s_1,s_2,s_3〉, 〈s_4,s_5,s_6〉, ..., 〈s_{n-2},s_{n-1},s_n〉) ✢ Let S_2 = (〈s_2,s_3,s_4〉, 〈s_5,s_6,s_7〉, ..., 〈s_{n-4},s_{n-3},s_{n-2}〉) ✢ Let O_12 be the order of the suffixes at positions congruent to 1 and 2 mod 3. ✤ You get this recursively from sorting the suffixes of S_1 S_2
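To see the mod 3 construction on the running example, here is a sketch that builds the triples for positions congruent to 1 and 2 mod 3, concatenates them, and sorts the suffixes of the result. It assumes the string ends in a unique terminator (so comparisons never run across the boundary between the two halves), pads the last triples with 0, and uses a naive sort where the real algorithm would range reduce and recurse. The name `mod3_order` is hypothetical.

```python
def mod3_order(s):
    """Order of the suffixes at 1-based positions p with p % 3 in {1, 2}."""
    n = len(s)
    padded = list(s) + [0, 0]  # 0 is assumed smaller than every symbol
    starts = list(range(1, n + 1, 3)) + list(range(2, n + 1, 3))
    triples = [tuple(padded[p - 1:p + 2]) for p in starts]
    # Stand-in for "range reduce the triples and recurse".
    order = sorted(range(len(triples)), key=lambda k: triples[k:])
    return [starts[k] for k in order]


S = [2, 1, 4, 4, 1, 4, 4, 1, 3, 3, 1, 5]
print(mod3_order(S))  # → [8, 5, 2, 11, 1, 10, 7, 4]
```

The triple string has about 2n/3 symbols, which is what gives the T(2n/3) term in the running time.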

28 Radix Step x 2 ✢ We have O_12 from the recursion. ✢ One Radix Step gives us O_01. ✢ Another Radix Step gives us O_02. ✢ Each pair of suffixes is now compared in some list. ✢ Each suffix appears in two lists.

29 Merging... at last! ✢ To merge O_12, O_01, and O_02 note that: ✤ The smallest suffix is the first on two lists. ✤ Pop the smallest. ✤ Now the next smallest is the first on two lists. ✤ Etc.
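The merge rule can be sketched directly: look at the heads of the three lists, take the position that heads two of them, and pop it from both. In this illustration the three lists are obtained by filtering a known suffix order, which is an assumption made only to keep the example short; in the algorithm they come from the recursion and the two Radix Steps. The function name is hypothetical.

```python
from collections import Counter, deque


def three_way_merge(o12, o01, o02):
    """Merge three sorted lists in which every suffix appears exactly twice."""
    lists = [deque(o12), deque(o01), deque(o02)]
    merged = []
    while any(lists):
        heads = Counter(q[0] for q in lists if q)
        # The overall smallest remaining suffix is the head of both of its lists.
        smallest = next(p for p, count in heads.items() if count == 2)
        merged.append(smallest)
        for q in lists:
            if q and q[0] == smallest:
                q.popleft()
    return merged


order = [8, 5, 2, 11, 1, 10, 9, 7, 4, 6, 3, 12]  # suffix order of 214414413315
o12 = [p for p in order if p % 3 in (1, 2)]
o01 = [p for p in order if p % 3 in (0, 1)]
o02 = [p for p in order if p % 3 in (0, 2)]
print(three_way_merge(o12, o01, o02) == order)  # → True
```

Each step inspects at most three heads, so the whole merge is linear.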

30 Total time ✢ T(n): time to sort the suffixes of a string in [n]^n ✢ T(n) = recursion + 2*radix + merging ✢ T(n) = T(2n/3) + O(n) + O(n) ✢ T(n) = O(n) ✢ So the initial Range Reduction step into the integer alphabet is the bottleneck. ✤ So this algorithm is optimal for any alphabet.
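The step from the recurrence to the linear bound is a geometric series. A short worked version, assuming the non-recursive work is at most cn for some constant c and that T(1) is constant:

```latex
\begin{align*}
T(n) &\le T\!\left(\tfrac{2n}{3}\right) + cn \\
     &\le cn\left(1 + \tfrac{2}{3} + \left(\tfrac{2}{3}\right)^{2} + \cdots\right) + O(1) \\
     &\le cn \cdot \frac{1}{1 - \tfrac{2}{3}} + O(1) \;=\; 3cn + O(1) \;=\; O(n).
\end{align*}
```

The same calculation with the odd/even recursion, T(n) = T(n/2) + O(n), also gives O(n); the mod 3 version only changes the constant.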

31 Why did we want to sort suffixes anyway? ✢ It’s easy to go from Suffix Sorting to Suffix Arrays...

32 Suffix Arrays ✢ A suffix array is: ✤ The sorted order of suffixes ✤ Their pairwise adjacent lcp’s. ✢ They are handy as space efficient indexes. ✢ How do we compute the lcp’s from the suffix order?

33 Example: Mississippi$ [Figure: LCP array, nothing computed yet: ? ? ? ? ? ? ? ? ? ? ?]

34 Example: Mississippi$ [Figure: first comparison fails immediately; LCP array so far: ? ? 0 ? ? ? ? ? ? ? ?]

35 Example: Mississippi$ [Figure: LCP array so far: ? ? 0 ? ? ? ? ? ? ? ?]

36 Example: Mississippi$ [Figure: four characters match before a mismatch; LCP array so far: ? ? 0 ? ? ? ? ? ? ? 4]

37 Example: Mississippi$ [Figure: LCP array so far: ? ? 0 ? ? ? ? ? ? ? 4]

38 Example: Mississippi$ [Figure: LCP array so far: ? ? 0 ? ? ? ? 3 ? ? 4]
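One well-known linear-time way to get the adjacent LCPs from the suffix order is the technique usually attributed to Kasai et al. The slides do not name their method, so treating it as the same one is an assumption. The sketch visits suffixes in text order and reuses the previous match length minus one, which mirrors the way the example goes from a 4 to a 3 without rescanning. Positions are 1-based, and `lcp[r]` compares the suffixes at ranks r and r+1.

```python
def lcp_from_suffix_order(s, order):
    """Adjacent LCP lengths for suffixes listed in sorted order (1-based positions)."""
    n = len(s)
    rank = [0] * (n + 1)
    for r, p in enumerate(order):
        rank[p] = r
    lcp = [0] * (n - 1)
    h = 0  # current match length, carried from one text position to the next
    for p in range(1, n + 1):
        r = rank[p]
        if r == n - 1:  # last suffix in sorted order has no successor
            h = 0
            continue
        q = order[r + 1]
        while p + h <= n and q + h <= n and s[p + h - 1] == s[q + h - 1]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1  # dropping the first character loses at most one match
    return lcp


S = [2, 1, 4, 4, 1, 4, 4, 1, 3, 3, 1, 5]
order = [8, 5, 2, 11, 1, 10, 9, 7, 4, 6, 3, 12]
print(lcp_from_suffix_order(S, order))  # → [1, 4, 1, 0, 0, 1, 0, 2, 1, 3, 0]
```

Since h drops by at most one per position and never exceeds n, the inner loop does O(n) work overall.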

40 What’s a suffix tree? ✢ Compacted trie of all suffixes of a string. ✢ What’s it good for? ✤ Too many things to enumerate... ✤ See the stringology literature of the last 30 years!

41 Example: Mississippi$

53 Brute-force Algorithm ✢ We insert one suffix at a time. ✢ Each suffix insertion takes O(n) time. ✢ Total time is O(n^2).

54 How low can we go? ✢ There is a leaf for each suffix ✤ That’s why we put a $ at end. ✢ Each internal node is branching ✤ Because we have a compacted trie. ✢ Each edge has a constant-size label. ✤ Just a pointer to string + length. ✢ So suffix tree has size O(n).

55 Weiner’s Algorithm ✢ Weiner showed a suffix-tree construction in O(n) time for a binary alphabet. ✢ Still adds suffixes one at a time. ✤ But gets a speed-up on insertions by exploiting Suffix Links.

56 Defs: LCP ✢ Let l_i be the leaf for suffix S[i,n]. ✢ Let LCP(i,j) be the length of the longest common prefix of S[i,n] and S[j,n]. ✤ Ex: LCP(2,5) = 4 for Mississippi (ississippi and issippi share issi).

57 Suffix Links ✢ Claim: If some node in a suffix tree has string aα, for a a character and α a string, then some node in the suffix tree has string α.
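The claim can be checked by brute force on the example. This sketch relies on a standard fact: the internal nodes of a suffix tree of a string with a unique terminator are exactly the substrings that are followed by at least two different characters (the root is the empty string). If aα is followed by two different characters, then so is α, which is the whole argument. The function name is hypothetical and the enumeration is cubic, so it is only meant for small strings.

```python
def internal_node_strings(s):
    """Strings spelled by the internal nodes of the suffix tree of s."""
    followers = {}
    n = len(s)
    for i in range(n):
        for j in range(i, n):
            # s[i:j] occurs at i and is followed there by s[j].
            followers.setdefault(s[i:j], set()).add(s[j])
    return {w for w, nxt in followers.items() if len(nxt) >= 2}


nodes = internal_node_strings("Mississippi$")
print(sorted(nodes))  # → ['', 'i', 'issi', 'p', 's', 'si', 'ssi']

# Suffix-link claim: dropping the first character of a node string gives a node string.
print(all(w[1:] in nodes for w in nodes if w))  # → True
```

On this example the links form the chain issi, ssi, si, i, root, which matches the node issi highlighted in the slides.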

58 Example: Mississippi$ issi

59 Example: Mississippi$ issi

60 Example: Mississippi$

62 Suffix Links: Proof ✢ How do we know a suffix link always exists? ✤ Maybe the link points into the middle of an edge...

63 Suffix Links: Proof Cartoon aα ii+1 j j+1

64 Adding Suffix Links ✢ Adding each suffix link is just a least common ancestors computation ✤ This takes O(n) preprocessing + O(1) per lca. ✤ Adding all suffix links is O(n) time.

65 Suffix Links Uses ✢ The first use of suffix links was to speed up suffix tree construction. ✤ If you keep suffix links on the partially constructed tree, you don’t have to start every insertion from the root. ✤ Yields a linear time construction!

66 So we’re done! ✢ The data structure has linear size. ✤ So linear lower bound. ✢ The algorithm runs in linear time. ✤ So linear upper bound. ✢ Declare victory and go home!

67 Alphabets everywhere. ✢ This algorithm runs in linear time for a binary alphabet. ✢ For an alphabet of size s, it runs in O(n log s) time.

68 Lower Bounds ✢ Element uniqueness reduces to suffix tree construction, so in the algebraic decision tree model, the lower bound is Ω(n log n). ✢ If we require edges to be sorted, then sorting is a lower bound.

69 Open Problem ✢ Is there an interesting lower bound for suffix trees where each node may have an arbitrary order of children? ✢ Such a tree is no good as an index, but just fine for LCA applications of suffix trees.

70 Back to Suffix Links ✢ Fun fact: The suffix links form a tree, the sl-tree. The length of the string at a node is its depth in the sl-tree.

71 Example: Mississippi$ 2 1 11 3 4

72 Stripping a Suffix Tree: I ✢ If we are given a suffix tree with no edge labels, we can reconstruct the labels in linear time: ✤ Compute Suffix Links (by LCAs). ✤ Compute Depth in SL-tree. ✤ Each node selects a leaf and computes offset within its descendant suffix.

73 Example: Mississippi$

74 2 1 11 3 4

75 2 1 11 3 4

76 Suffix Arrays ✢ A suffix array is: ✤ The sorted order of suffixes ✤ Their pairwise adjacent lcp’s.

77 Example: Mississippi$ 821119106374125 11001013204

78 Suffix Arrays ✢ They are much more space efficient than suffix trees. ✢ You can still use them as indexes. ✢ You can build them as fast as suffix trees... ✤ By DFS of the suffix tree. How about without suffix trees?

79 Building suffix trees from suffix arrays ✢ You can build the suffix tree left to right by keeping a stack of the right-most path. ✢ This gives you the shape of the tree. ✢ Then add link labels. ✢ Total construction is linear for any alphabet.
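A sketch of the left-to-right construction with a stack holding the right-most path. It only builds the shape and string depths, assumes 1-based positions and a unique terminator, and takes the suffix order and adjacent LCPs as given. The `Node` class and function names are hypothetical.

```python
class Node:
    def __init__(self, depth, suffix=None):
        self.depth = depth    # length of the string spelled from the root
        self.suffix = suffix  # start position for a leaf, None for internal nodes
        self.children = []


def tree_from_suffix_array(n, order, lcp):
    """Build the suffix tree shape from the sorted suffixes and adjacent LCPs."""
    root = Node(0)
    stack = [root]  # the right-most path, root at the bottom
    for r, p in enumerate(order):
        h = lcp[r - 1] if r > 0 else 0
        last = None
        while stack[-1].depth > h:  # climb until depth <= h
            last = stack.pop()
        if stack[-1].depth < h:
            # The branching point lies inside the edge to `last`: split it.
            mid = Node(h)
            stack[-1].children.pop()  # `last` is always the right-most child
            stack[-1].children.append(mid)
            mid.children.append(last)
            stack.append(mid)
        leaf = Node(n - p + 1, p)
        stack[-1].children.append(leaf)
        stack.append(leaf)
    return root


def leaves(node):
    if node.suffix is not None:
        return [node.suffix]
    return [p for child in node.children for p in leaves(child)]


def internal_depths(node):
    if node.suffix is not None:
        return []
    return [node.depth] + [d for c in node.children for d in internal_depths(c)]


order = [8, 5, 2, 11, 1, 10, 9, 7, 4, 6, 3, 12]
lcp = [1, 4, 1, 0, 0, 1, 0, 2, 1, 3, 0]
root = tree_from_suffix_array(12, order, lcp)
print(leaves(root) == order)   # → True
print(internal_depths(root))   # → [0, 1, 4, 1, 1, 2, 3]
```

Every node is pushed and popped at most once, so the construction is linear regardless of alphabet size.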

80 Example: Mississippi$ 825111910637412 11001013204 821119106374125 11001013204

81 Example: Mississippi$ 825111910637412 11001013204 1 ☟ 821119106374125 11001013204

82 Example: Mississippi$ 825111910637412 11001013204 1 4 ☟ 821119106374125 11001013204

83 Example: Mississippi$ 825111910637412 11001013204 1 4 ☟ 821119106374125 11001013204

84 Example: Mississippi$ 825111910637412 11001013204 1 4 etc. ☟ 821119106374125 11001013204

85 Even Less Information! ✢ You don’t even need LCP information to build the suffix tree. ✢ You can compute the LCP array from the suffix order array in linear time. ✢ Big Conclusion: Given the sorted order of suffixes, you can build the suffix tree in linear time for any alphabet.

86 How do we sort suffixes? ✢ Mergesort! ✤ Careful choice of recursion. ✤ Careful merging.

87 How do we sort suffixes? ✢ Key Operations: ✤ Range Reduction. ✤ Radix Sorting. ✤ Chunking.

