Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Slides:



Advertisements
Similar presentations
Boosting Textual Compression in Optimal Linear Time.
Advertisements

Text Indexing The Suffix Array. Basic notation and facts Occurrences of P in T = All suffixes of T having P as a prefix SUF(T) = Sorted set of suffixes.
Introduction to Computer Science 2 Lecture 7: Extended binary trees
Heaps1 Part-D2 Heaps Heaps2 Recall Priority Queue ADT (§ 7.1.3) A priority queue stores a collection of entries Each entry is a pair (key, value)
QuickSort Average Case Analysis An Incompressibility Approach Brendan Lucier August 2, 2005.
Text Indexing The Suffix Array. Basic notation and facts Occurrences of P in T = All suffixes of T having P as a prefix SUF(T) = Sorted set of suffixes.
Sorting Comparison-based algorithm review –You should know most of the algorithms –We will concentrate on their analyses –Special emphasis: Heapsort Lower.
MS 101: Algorithms Instructor Neelima Gupta
© The McGraw-Hill Companies, Inc., Chapter 2 The Complexity of Algorithms and the Lower Bounds of Problems.
Deterministic Selection and Sorting Prepared by John Reif, Ph.D. Analysis of Algorithms.
Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.
Constant-Time LCA Retrieval
What about the trees of the Mississippi? Suffix Trees explained in an algorithm for indexing large biological sequences Jacob Kleerekoper & Marjolijn Elsinga.
1 Suffix tree and suffix array techniques for pattern analysis in strings Esko Ukkonen Univ Helsinki Erice School 30 Oct 2005 Modified Alon Itai 2006.
Suffix Trees and Suffix Arrays
Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.
1 Data structures for Pattern Matching Suffix trees and suffix arrays are a basic data structure in pattern matching Reported by: Olga Sergeeva, Saint.
Advanced Topics in Algorithms and Data Structures Lecture pg 1 Recursion.
1 Sorting Problem: Given a sequence of elements, find a permutation such that the resulting sequence is sorted in some order. We have already seen: –Insertion.
Advanced Topics in Algorithms and Data Structures 1 Rooting a tree For doing any tree computation, we need to know the parent p ( v ) for each node v.
Lecture 5: Linear Time Sorting Shang-Hua Teng. Sorting Input: Array A[1...n], of elements in arbitrary order; array size n Output: Array A[1...n] of the.
On Demand String Sorting over Unbounded Alphabets Carmel Kent Moshe Lewenstein Dafna Sheinwald.
Sorting Heapsort Quick review of basic sorting methods Lower bounds for comparison-based methods Non-comparison based sorting.
2 -1 Chapter 2 The Complexity of Algorithms and the Lower Bounds of Problems.
Suffix arrays. Suffix array We loose some of the functionality but we save space. Let s = abab Sort the suffixes lexicographically: ab, abab, b, bab The.
Chapter 4: Divide and Conquer The Design and Analysis of Algorithms.
Amortized Rigidness in Dynamic Cartesian Trees Iwona Białynicka-Birula and Roberto Grossi Università di Pisa STACS 2006.
Suffix trees and suffix arrays presentation by Haim Kaplan.
Lecture 5: Master Theorem and Linear Time Sorting
Tries. (Compacted) Trie y s 1 z stile zyg 5 etic ial ygy aibelyite czecin omo systile syzygetic syzygial syzygy szaibelyite szczecin.
CSE 326: Data Structures Sorting Ben Lerner Summer 2007.
1 Today’s Material Lower Bounds on Comparison-based Sorting Linear-Time Sorting Algorithms –Counting Sort –Radix Sort.
DAST 2005 Week 4 – Some Helpful Material Randomized Quick Sort & Lower bound & General remarks…
The Complexity of Algorithms and the Lower Bounds of Problems
Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.
Sorting in Linear Time Lower bound for comparison-based sorting
CSE 373 Data Structures Lecture 15
1 Time Analysis Analyzing an algorithm = estimating the resources it requires. Time How long will it take to execute? Impossible to find exact value Depends.
The LCA Problem Revisited Michael A.Bender & Martin Farach-Colton Presented by: Dvir Halevi.
The LCA Problem Revisited
Introduction to Algorithms Jiafen Liu Sept
The Selection Problem. 2 Median and Order Statistics In this section, we will study algorithms for finding the i th smallest element in a set of n elements.
Sorting Fun1 Chapter 4: Sorting     29  9.
Read Alignment Algorithms. The Problem 2 Given a very long reference sequence of length n and given several short strings.
On the Sorting-Complexity of Suffix Tree Construction MARTIN FARACH-COLTON PAOLO FERRAGINA S. MUTHUKRISHNAN Requires Math fonts downloadable from herehere.
Suffix trees. Trie A tree representing a set of strings. a b c e e f d b f e g { aeef ad bbfe bbfg c }
Chapter 18: Searching and Sorting Algorithms. Objectives In this chapter, you will: Learn the various search algorithms Implement sequential and binary.
Liang, Introduction to Java Programming, Ninth Edition, (c) 2013 Pearson Education, Inc. All rights reserved. 1 Chapter 25 Sorting.
Sorting Lower Bounds n Beating Them. Recap Divide and Conquer –Know how to break a problem into smaller problems, such that –Given a solution to the smaller.
Suffix Tree 6 Mar MinKoo Seo. Contents  Basic Text Searching  Introduction to Suffix Tree  Suffix Trees and Exact Matching  Longest Common Substring.
CS6045: Advanced Algorithms Sorting Algorithms. Sorting So Far Insertion sort: –Easy to code –Fast on small inputs (less than ~50 elements) –Fast on nearly-sorted.
Liang, Introduction to Java Programming, Tenth Edition, (c) 2013 Pearson Education, Inc. All rights reserved. 1 Chapter 23 Sorting.
Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications.
Linear Time Suffix Array Construction Using D-Critical Substrings
Chapter 11 Sorting Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and Mount.
15-853:Algorithms in the Real World
Tries 07/28/16 11:04 Text Compression
Tries 5/27/2018 3:08 AM Tries Tries.
COMP9319 Web Data Compression and Search
Two equivalent problems
Tries 9/14/ :13 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
Quick-Sort 11/14/2018 2:17 PM Chapter 4: Sorting    7 9
Quick-Sort 11/19/ :46 AM Chapter 4: Sorting    7 9
Suffix trees.
Suffix trees and suffix arrays
Quick-Sort 2/23/2019 1:48 AM Chapter 4: Sorting    7 9
Tries 2/23/2019 8:29 AM Tries 2/23/2019 8:29 AM Tries.
Suffix Arrays and Suffix Trees
Sorting We have actually seen already two efficient ways to sort:
Presentation transcript:

Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA

What is Suffix Sorting? ✢ Given a string, give the sorted order of its suffixes. ✢ Why would we want that? ✤ More on this later.

Example: Mississippi$ Mississippi$ sissippi$ sippi$ i$ pi$ ppi$ ssissippi$ ississippi$ issippi$ ssippi$ ippi$ $ Mississippi$ sissippi$ sippi$ i$ pi$ ppi$ ssissippi$ ississippi$ issippi$ ssippi$ ippi$ $ { {

How fast can we sort? ✢ Sorting suffixes of a string can be no faster than sorting the characters of a string. ✢ Can we match the sorting lower bound?

Suffix Sorting ✢ We are sorting strings, so Radix Sort is natural. ✤ This yields an algorithm with time O(n 2 ) to O(n 2 logn), depending on assumptions on sortability of characters. ✥ O(n 2 logn) in comparison model. ✥ O(n 2 ) for small integers. In between in general word model. ✢ We can do better by combining Merge Sort with Radix Sort.

Building Blocks ✢ Range Reduction ✢ Radix Step ✢ Chunking

Range Reduction ✢ Observation: If we apply a monotone function to the characters, the sorted order doesn’t change.

Example: Mississippi$ Mississippi$ sissippi$ sippi$ i$ pi$ ppi$ ssissippi$ ississippi$ issippi$ ssippi$ ippi$ $ Mississippi$ sissippi$ sippi$ i$ pi$ ppi$ ssissippi$ ississippi$ issippi$ ssippi$ ippi$ $ i1 M2 p3 s4 $5

Range Reduction ✢ Our only range reduction operation will be: ✤ Replace every character by rank in sorted order of characters. ✤ After RR, length n string will be in [n] n i1 M2 p3 s4 $5

RR helps running time ✢ Radix Sort on raw input might take O(n 2 logn) time. ✢ RR takes at most O(nlogn) time. ✢ Radix Sort on small-integer inputs takes O(n 2 ) time. ✢ Total time is O(n 2 ).

Radix Step ✢ Recall that Radix Sort proceeds in steps: ✤ Lexicographically sort the last i characters of each string. ✤ Stably sort by preceding character. Now strings are lexicographically sorted by last i+1 characters. sorted

Using Radix Step ✢ If we have recursively sorted some subset of suffixes of a string: ✤ 1 step of radix will sort the preceding suffixes. ✢ Ex: If you have sorted suffixes at odd positions, then you get suffixes at even positions.

Example: Odd Suffixes

Example: Even Suffixes

Radix Step ✢ We normally think of Radix Sort as sorting one set of strings. ✢ But with suffixes (which are interrelated), we can use a Radix Step to get sorted orders of one set from another.

Where are we now? ✢ Step 1: Recursively sort odd suffixes. ✤ How? And how is it recursive? A recursive step must sort every suffix! We’ll get to that. ✢ Step 2: Get even suffixes in linear time. ✤ By Radix Step. ✢ Step 3: Merge!

Merging is tricky... ✢ F ‘97 gave first linear time solution. ✢ This yields an optimal suffix sorting routine. ✢ It’s a fun algorithm, but highly unintuitive. ✤ Or so I’ve been told!

Chunking ✢ Let’s solve the recursion problem first. ✢ Given two integers i and j, let 〈 i,j 〉 be their bit concatenation. ✤ If i,j ∈ [n], then 〈 i,j 〉 ∈ [n 2 ]. ✢ Given a string S = (s 1,s 2,...,s n ) ✢ Let S’ = ( 〈 s 1,s 2 〉, 〈 s 3,s 4 〉,..., 〈 s n-1,s n 〉 )

Chunking + Recursion: I ✢ Observation: The order of the odd suffixes of S = (s 1,s 2,...,s n ) is the same as the order of all suffixes of S’ = ( 〈 s 1,s 2 〉, 〈 s 3,s 4 〉,..., 〈 s n-1,s n 〉 ) ✤ Since bit concatenation preserves lexicographic ordering.

Example: 〈 21 〉〈 44 〉〈 14 〉〈 41 〉〈 33 〉〈 15 〉 〈 44 〉〈 14 〉〈 41 〉〈 33 〉〈 15 〉 〈 14 〉〈 41 〉〈 33 〉〈 15 〉 〈 41 〉〈 33 〉〈 15 〉 〈 33 〉〈 15 〉 〈 15 〉 base 8

Chunking + Recursion: II ✢ Chunking+Range Reduction = Recursion ✤ Input is in [n] n. ✤ Chunked Input is in [n 2 ] n/2. ✤ Range Reduced Chunking is in [n/2] n/2. ✤ So now problem instance is half the size and we can recurse.

Example: 〈 21 〉〈 44 〉〈 14 〉〈 41 〉〈 33 〉〈 15 〉

Recall Basic Operations ✢ Range Reduction ✢ Radix Step ✢ Chunking ✢ How we are ready for the whole algorithm.

Suffix Sorting ✢ Step 1: Chunk + Range Reduction. ✤ Recurse on new string. ✤ Get sorted order of odd suffixes. ✢ Step 2: Radix Step. ✤ Get sorted order of even suffixes. ✢ Step 3: Merge! ✤ We still don’t know how to do this.

The Trouble with Merging ✢ Let’s start merging. We need to see which is smaller: the smallest odd suffix s 2i-1 or the smallest even suffix s 2j. ✤ We can compare the first character. ✤ We can then compare s 2i with s 2j ✥ But this is another odd/even comparison.

The difference between 3 and 2 ✢ It’s possible to merge the lists. ✢ But Kärkkäinen & Sanders showed the cute way to merge. ✢ The modified the recursion to make the merge easy.

Mod 3 Recursion ✢ Given a string S = (s 1,s 2,...,s n ) ✢ Let S 1 = ( 〈 s 1,s 2,s 3 〉, 〈 s 4,s 5,s 6 〉,..., 〈 s n-2,s n-1,s n 〉 ) ✢ Let S 2 = ( 〈 s 2,s 3,s 4 〉, 〈 s 5,s 6,s 7 〉,..., 〈 s n-4,s n-3,s n-2 〉 ) ✢ Let O 12 be order of suffix congruent to 1 and 2 mod 3. ✤ You get this recursively from sorting the suffixes of S 1 S 2

Radix Step x 2 ✢ We have O 12 from the recursion. ✢ One Radix Step gives us O 01 ✢ Another Radix Step gives us O 02 ✢ Each pair of suffix is now compared in one list. ✢ Each suffix appears in two lists.

Merging... at last! ✢ To merge O 12, O 01, and O 12 note that: ✤ The smallest suffix is the first on two lists. ✤ Pop the smallest. ✤ Now the next smallest is the first on two lists. ✤ Etc.

Total time ✢ T(n) to sort suffix of strings in [n] n ✢ T(n) = recursion + 2*radix + merging ✢ T(n) = T(2n/3) + O(n) + O(n) ✢ T(n) = O(n) ✢ So the initial Range Reduction step into the integer alphabet is the bottleneck. ✤ So this algorithm is optimal for any alphabet.

Why did we want to sort suffixes anyway? ✢ It’s easy to go from Suffix Sorting to Suffix Arrays...

Suffix Arrays ✢ A suffix array is: ✤ The sorted order of suffixes ✤ Their pairwise adjacent lcp’s. ✢ They are handy as space efficient indexes. ✢ How do we compute the lcp’s from the suffix order?

Example: Mississippi$ ??????????? Mississippi$ ☟ ☝

Example: Mississippi$ ??0???????? Mississippi$ ☹ ☹

Example: Mississippi$ ??0???????? Mississippi$ ☟ ☝

Example: Mississippi$ ??0???????4 Mississippi$ ☺☺☺☺ ☺☺☺☺ ☹ ☹

Example: Mississippi$ ??0???????4 Mississippi$ ☺☺ ☺☺ ☹ ☹ ☟ ☝

Example: Mississippi$ ??0????3??4 Mississippi$ ☺☺ ☺☺ ☹ ☹ ☟ ☝

What’s a suffix tree? ✢ Compacted trie of all suffixes of a string. ✢ What’s it good for? ✤ Too many things to enumerate... ✤ See the stringology literature of the last 30 years!

Example: Mississippi$

Brute-force Algorithm ✢ We insert one suffix at a time. ✢ Each suffix insertion takes O(n) time. ✢ Total time is O(n 2 )

How low can we go? ✢ There is a leaf for each suffix ✤ That’s why we put a $ at end. ✢ Each internal node is branching ✤ Because we have a compacted trie. ✢ Each edge has a constant-size label. ✤ Just a pointer to string + length. ✢ So suffix tree has size O(n).

Weiner’s Algorithm ✢ Weiner showed a suffix-tree construction in O(n) time for binary alphabet. ✢ Still adds suffixes one at a time. ✤ But gets a speed-up on insertions by exploiting Suffix Links.

Defs: LCP ✢ Let li be the leaf for suffix S[1,i]. ✢ Let LCP(i,j) be the length of the longest common prefix of S[1,i] and S[1,j]. ✤ Ex: LCP(2,5) = 4 for Mississippi.

Suffix Links ✢ Claim: If some node in a suffix tree has string aα, for a a character and α a string, then some node in the suffix tree has string α.

Example: Mississippi$ issi

Example: Mississippi$ issi

Example: Mississippi$

Suffix Links: Proof ✢ How do we know a suffix link always exists? ✤ Maybe the link points into the middle of a edge...

Suffix Links: Proof Cartoon aα ii+1 j j+1

Adding Suffix Links ✢ Adding each suffix link is just a least common ancestors computation ✤ This takes O(n) preprocessing + O(1) per lca. ✤ Adding all suffix links is O(n) time.

Suffix Links Uses ✢ The first use of suffix links was to speed up suffix tree construction. ✤ If you keep suffix links on the partially constructed tree, you don’t have to start every insertion from the root. ✤ Yields a linear time construction!

So we’re done! ✢ The data structure has linear size. ✤ So linear lower bound. ✢ The algorithm runs in linear time. ✤ So linear upper bound. ✢ Declare victory and go home!

Alphabets everywhere. ✢ This algorithm runs in linear time for binary alphabet. ✢ For an alphabet of size s, it runs in O(n log s).

Lower Bounds ✢ Element uniqueness reduces to suffix tree construction, so in the algebraic decision tree model, the lower bound is Omega(n log n). ✢ If we require edges to be sorted, then sorting is lower bound.

Open Problem ✢ Is there an interesting lower bound for suffix trees where each node may have an arbitrary order of children? ✢ Such a tree is no good as an index, but just fine for LCA applications of suffix trees.

Back to Suffix Links ✢ Fun fact: The suffix links form a sl-tree. The length of the string at a node is the depth in the sl-tree.

Example: Mississippi$

Stripping a Suffix Tree: I ✢ If we are given a suffix tree with no edge labels, we can reconstruct the labels in linear time: ✤ Compute Suffix Links (by LCAs). ✤ Compute Depth in SL-tree. ✤ Each node selects a leaf and computes offset within its descendant suffix.

Example: Mississippi$

Suffix Arrays ✢ A suffix array is: ✤ The sorted order of suffixes ✤ Their pairwise adjacent lcp’s.

Example: Mississippi$

Suffix Arrays ✢ They are much more space efficient than suffix trees. ✢ You can still use them as indexes. ✢ You can build them as fast as suffix trees... ✤ By dfs of suffix tree. How about w/o suffix trees?

Building suffix trees from suffix arrays ✢ You can build the suffix tree left to right by keeping a stack of the right-most path. ✢ This gives you the shape of the tree. ✢ Then add link labels. ✢ Total construction is linear for any alphabet.

Example: Mississippi$

Example: Mississippi$ ☟

Example: Mississippi$ ☟

Example: Mississippi$ ☟

Example: Mississippi$ etc. ☟

Even Less Information! ✢ You don’t even need LCP information to build the suffix tree. ✢ You can compute LCP array from suffix order array in linear time. ✢ Big Conclusion: Given sorted order of suffixes, you can build the suffix array in linear time for any alphabet.

How do we sort suffixes? ✢ Mergesort! ✤ Careful choice of recursion. ✤ Careful merging.

How do we sort suffixes? ✢ Key Operations: ✤ Range Reduction. ✤ Radix Sorting. ✤ Clumping.