Presentation is loading. Please wait.

Presentation is loading. Please wait.

On Demand String Sorting over Unbounded Alphabets Carmel Kent Moshe Lewenstein Dafna Sheinwald.

Similar presentations


Presentation on theme: "On Demand String Sorting over Unbounded Alphabets Carmel Kent Moshe Lewenstein Dafna Sheinwald."— Presentation transcript:

1 On Demand String Sorting over Unbounded Alphabets Carmel Kent Moshe Lewenstein Dafna Sheinwald

2 On-Demand String Sorting Preprocessing a set of n strings for an efficient subsequent repeated: extract the next lexicographically smallest string Motivation, e.g.: Search engines recurrently return the next best k (k < n) pages Pages typically ranked by relevance, but can also by values of specified field Our Heap of Strings (HoS ) preprocesses in O (n) time and extracts next smallest in O (log n) time + amortized O (N ) time, over all operations on all n strings N = total length of the n strings. Combines Heap and Longest Common Prefix properties Works for unbounded alphabets e Sorting of Strings is possible in O(n log n + N ) time. Implementing all classic Heap operations, only paying extra O (N ) amortized, is not simple

3 LCP Def: lcp (S 1, S 2 ) denotes the length of the largest common prefix of S 1 and S 2 Folklore Lemma: For strings S 1, S 2,…, S m, lcp (S 1, S m ) = min 1 [ i<m lcp (S i, S i+1 ) e For strings S 1 m S 2, S 3 S 2 [ S 3 t lcp (S 1, S 2 ) [ lcp (S 1, S 3 ) Equivalently: lcp (S 1, S 2 ) > lcp (S 1, S 3 ) t S 2 > S 3 e For strings S 1 [ S 2, S 3 S 2 [ S 3 t lcp (S 1, S 2 ) m lcp (S 1, S 3 ) Equivalently: lcp (S 1, S 2 ) S 3

4 Heap of Strings (HoS) Binary (balanced) tree Each node n holds a string S (n) and an lcp value lcp (n) Each node n satisfies the HoS property: S(n) m S(p) for parent p of n lcp (n) = lcp (S(n), S(p) )

5 Procedures for Heapify Basic Step for 2 (BS2): Given strings S 1, S 2, compare the two, letter by letter, until smaller is identified, and the lcp of the two is determined. Basic Step for 3 (BS3): Given strings S 1, S 2, and S 3, compare all three, letter by letter, until smallest, S i, is identified, and the lcp of each of the other strings with S i is determined.

6 More Procedures for Heapify Basic Step for 2, starting in position l, BS2(l ) : Given strings S 1, S 2, with common prefix of length l compare the two, letter by letter, starting from position l until smaller, S i, is identified, and the lcp of the two is determined. Basic Step for 3, starting in position l, BS3(l ) : Given strings S 1, S 2, and S 3, with common prefix of length l compare all three, letter by letter, starting from position l until smallest, S i, is identified, and the lcp of each of the other strings with S i is determined.

7 Heapify - based on the classic O(n) process Strings thrown into a binary balanced tree Bottom up, Subtrees are made into HoS-s

8 Merge two little HoS-es and a root node into one big HoS BS3 of three nodes. (two larger get lcp wrt smallest) At most one subtree gets a new root This new root, as well as two children, all have larger strings than grand root, and have lcp-s wrt it Swap, if needed, to get smallest positioned as root

9 Merge two little HoS-es and a root node into one big HoS Comparing lcp-s suffices. On equality, read, from lcp on, BS2(l), or BS3(l), until smallest is found and updated lcp determined. Always record lcp with strings found larger, and swap, if needed, to make smallest the parent. Thus maintaining HoS property If swap needed, continue recursively sifting down

10 Merge two little HoS-es and a root node into one big HoS: Sifting Down On each swap, a sub-HoS gets a new root. That new root, as well as its two children, all have larger strings than, and lcp wrt, old root now positioned as parent of new root. e Comparing lcp suffices to tell smallest of On tie, read more of two (or three) strings, starting at common lcp, until smallest found and updated lcp is determined Record updated lcp with larger string(s).

11 Sifting down in a HoS of height h O(h) node operations For each string comparison, at least one string has its lcp field increase by the number of letter comparisons made.

12 Heapifying into a HoS of n strings of total length N takes O(n+N) time A string with lcp = l never gets its prefix of length l participating in any letter comparison e No more than a total of O(N) letter comparisons e Heapifying completes in O(n) node operations + a total of O(N) letter comparisons.

13 Extracting next smallest string from a HoS Extract the root Both (now orphan) children are larger than, and have lcp wrt, their gone parent Comparing lcp-s suffices to find smaller of the two. In case of a tie, BS2(l ) finds smaller and updates lcp; record updated lcp with the larger child Promote smaller child to vacant parent position Recurse in subtree rooted by promoted child HoS property maintained. Tree might become unbalanced, but not higher.

14 Extracting next smallest string from a HoS For a HoS of height h O(h) node operations. Letter comparison only from common lcp on. e No more than a total of O(N) letter comparisons for heapify followed by sequential extraction of all strings For each letter comparison, at least one lcp grows. lcp never decreases. HoS becomes unbalanced But height does not grow. Thm: Sorting of n strings of total length N, over unbounded alphabet, is possible in O(n log n + N) time, using O(n) space.

15 O (n log n + N ) string sorting String sorting is a classical problem, appears in textbooks [Knuth, AHU] Variants: multikey sorting, parallel sorting Weight balanced ternary search trie [M 79] achieves this runtime QuickSort with average sorting time of O (n log n + N ) [BS, 97] Multi phase merge sort, for enhancing cache utilization [I, 05. IBM in the ’80s]

16 O (n log n + N ) string sorting Indexing data structures: suffix trees, suffix arrays, BIS [AKLL, 05] for suffixes of same string Allow O (n log n + N ) sorting Some can adjust to a general set of strings (BIS) All use O (n log n + N) just to build the data structure and get the first result out.

17 Efficient On Demand Sorting Thm: On Demand Sorting of n strings of total length N can be done with the extraction of the first result in O(n + N 1 ) time, after which the retrieval of further results in O(log n + N i ) time for the i-th result, with S i N i [ N.

18 Variation: Find the smallest k < n strings Maintain a HoS of k elements, with parents LARGER than children. root holds largest of smallest k Build a HoS from arbitrary k of the set For each remaining string in the set: compare with root and determine lcp of the two if new is larger than root – discard new otherwise, discard root, and sift down

19 Find the smallest k < n strings O (n log k + N) to identify k smallest of n + O ( k log k) to get these sorted.

20 Can HoS do additional operations of the ordinary (integer) Heap? Already seen: Extract min costs O(height) yet does not maintain tree balanced The classic delete takes the last leaf and sifts it down from root, thus maintaining balance Leaf loses the lcp it has gained, and compares again its leading letters Classic insertion by sifting up Some nodes get their grandparent becoming their parent, need to decrease their lcp

21 BIS Insertion, creating embedded data structure Original nodes do not move. Grandparent do not become parents When pumping up for extraction, smallest node leaves BIS and becomes HoS node. HoS node never gets into a BIS.

22 BIS Balanced Indexing Structures AKLL, 2005 Adapted here from suffixes to any set of strings BIS is an AVL tree with fixed size extra info (lcp s and pointers) in nodes allows to insert a string of length l to a tree of size n in O (log n + l ) time Deletion in O(log n)

23 Altogether Thm: It is possible to construct a heap of n strings in O(n) time and support further string insertions and smallest string extractions in time O(log(n) + log(m)) + O(N) amortized over the whole sequence of heap operations, where m is the number of strings inserted post heapifying and were not extracted yet.

24 Conclusion Combining basic elements, we support a modern concept: On Demand lcp is proved again an interesting, useful measure (lcp with what?) This is a real need: basic sort and k smallest sort are implemented in a search engine product


Download ppt "On Demand String Sorting over Unbounded Alphabets Carmel Kent Moshe Lewenstein Dafna Sheinwald."

Similar presentations


Ads by Google