On Demand String Sorting over Unbounded Alphabets Carmel Kent Moshe Lewenstein Dafna Sheinwald.



On-Demand String Sorting. Preprocess a set of n strings so that the operation "extract the next lexicographically smallest string" can then be repeated efficiently. Motivation, e.g.: search engines repeatedly return the next best k (k < n) pages. Pages are typically ranked by relevance, but can also be ordered by the values of a specified field. Our Heap of Strings (HoS) preprocesses in O(n) time and extracts the next smallest string in O(log n) time, plus an amortized O(N) time over all operations on all n strings, where N is the total length of the n strings. It combines the heap and longest common prefix properties, and works for unbounded alphabets. ⇒ Sorting of strings is possible in O(n log n + N) time. Implementing all classic heap operations while paying only an extra amortized O(N) is not simple.

LCP. Def: lcp(S_1, S_2) denotes the length of the longest common prefix of S_1 and S_2.
Folklore Lemma: for strings S_1 ≤ S_2 ≤ … ≤ S_m, lcp(S_1, S_m) = min_{1 ≤ i < m} lcp(S_i, S_{i+1}).
Hence, for strings S_1 ≥ S_2, S_3: S_2 ≤ S_3 ⇒ lcp(S_1, S_2) ≤ lcp(S_1, S_3). Equivalently: lcp(S_1, S_2) > lcp(S_1, S_3) ⇒ S_2 > S_3.
Hence, for strings S_1 ≤ S_2, S_3: S_2 ≤ S_3 ⇒ lcp(S_1, S_2) ≥ lcp(S_1, S_3). Equivalently: lcp(S_1, S_2) < lcp(S_1, S_3) ⇒ S_2 > S_3.
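The folklore lemma can be checked on a small example. The following is a minimal sketch; the helper name `lcp` and the sample words are illustrative assumptions, not taken from the slides.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# The lemma is stated for lexicographically sorted strings.
words = sorted(["banana", "band", "bandana", "bat", "can"])
adjacent = [lcp(words[i], words[i + 1]) for i in range(len(words) - 1)]

# lcp of any two strings equals the minimum adjacent lcp between them.
for i in range(len(words)):
    for j in range(i + 1, len(words)):
        assert lcp(words[i], words[j]) == min(adjacent[i:j])
```

The same identity is what lets a HoS decide most comparisons from stored lcp values alone, without reading letters.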

Heap of Strings (HoS). A balanced binary tree. Each node n holds a string S(n) and an lcp value lcp(n). Each node n satisfies the HoS property: S(n) ≥ S(p) for the parent p of n, and lcp(n) = lcp(S(n), S(p)).
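To make the HoS property concrete, here is a small checker. It is only a sketch: it assumes the tree is stored in an array with the parent of position i at position (i - 1) // 2, a layout the slides do not prescribe, and the name `is_hos` is hypothetical.

```python
import os.path

def is_hos(strs, lcps):
    """Check the HoS property on an array-stored binary tree, where the
    parent of position i is position (i - 1) // 2 (an assumed layout)."""
    for i in range(1, len(strs)):
        parent = strs[(i - 1) // 2]
        # commonprefix works character by character on plain strings
        shared = len(os.path.commonprefix([strs[i], parent]))
        if strs[i] < parent or lcps[i] != shared:
            return False
    return True

strs = ["ape", "banana", "apple", "bandana", "band", "apply", "ban"]
lcps = [0, 0, 2, 3, 3, 4, 0]      # the root's lcp value is unused
print(is_hos(strs, lcps))          # True
```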

Procedures for Heapify. Basic Step for 2 (BS2): given strings S_1 and S_2, compare the two, letter by letter, until the smaller is identified and the lcp of the two is determined. Basic Step for 3 (BS3): given strings S_1, S_2, and S_3, compare all three, letter by letter, until the smallest, S_i, is identified and the lcp of each of the other strings with S_i is determined.

More Procedures for Heapify. Basic Step for 2, starting in position l, BS2(l): given strings S_1 and S_2 with a common prefix of length l, compare the two, letter by letter, starting from position l, until the smaller, S_i, is identified and the lcp of the two is determined. Basic Step for 3, starting in position l, BS3(l): given strings S_1, S_2, and S_3 with a common prefix of length l, compare all three, letter by letter, starting from position l, until the smallest, S_i, is identified and the lcp of each of the other strings with S_i is determined.
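A minimal sketch of BS2(l) follows. The function name `bs2`, its return shape, and the convention that a proper prefix counts as the smaller string are assumptions made for illustration; the three-string variant BS3(l) follows the same pattern.

```python
def bs2(s1, s2, l=0):
    """Compare s1 and s2 letter by letter from position l, assuming they
    already agree on their first l letters.
    Returns (smaller, lcp, letter_comparisons); smaller is 1 or 2."""
    comparisons = 0
    while l < len(s1) and l < len(s2):
        comparisons += 1
        if s1[l] != s2[l]:
            return (1 if s1[l] < s2[l] else 2), l, comparisons
        l += 1
    # One string ran out: the shorter one (a prefix of the other) is smaller.
    return (1 if len(s1) <= len(s2) else 2), l, comparisons

print(bs2("bandana", "bandit"))      # (1, 4, 5)
print(bs2("bandana", "bandit", 4))   # (1, 4, 1)
```

Starting at the known common prefix skips the letters already matched, which is the saving the later amortized analysis relies on.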

Heapify: based on the classic O(n) process. Strings are thrown into a balanced binary tree. Bottom up, subtrees are made into HoS-s.

Merge two little HoS-es and a root node into one big HoS. Apply BS3 to the three nodes (the two larger ones get their lcp with respect to the smallest). At most one subtree gets a new root. This new root, as well as its two children, all have strings larger than the grand root, and have lcp-s with respect to it. Swap, if needed, to get the smallest positioned as root.

Merge two little HoS-es and a root node into one big HoS. Comparing lcp-s suffices. On equality, read on from the lcp position, with BS2(l) or BS3(l), until the smallest is found and the updated lcp is determined. Always record the lcp with the strings found larger, and swap, if needed, to make the smallest the parent, thus maintaining the HoS property. If a swap was needed, continue recursively, sifting down.

Merge two little HoS-es and a root node into one big HoS: sifting down. On each swap, a sub-HoS gets a new root. That new root, as well as its two children, all have strings larger than the old root, which is now positioned as the parent of the new root, and all have their lcp-s with respect to it. ⇒ Comparing lcp-s suffices to tell the smallest of these three. On a tie, read more of the two (or three) strings, starting at the common lcp, until the smallest is found and the updated lcp is determined. Record the updated lcp with the larger string(s).

Sifting down in a HoS of height h: O(h) node operations. For each string comparison, at least one string has its lcp field increase by the number of letter comparisons made.

Heapifying into a HoS of n strings of total length N takes O(n + N) time. A string with lcp = l never has its prefix of length l participate in any letter comparison. ⇒ No more than a total of O(N) letter comparisons. ⇒ Heapifying completes in O(n) node operations plus a total of O(N) letter comparisons.
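The bottom-up heapify with lcp-guided sifting can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the tree is stored in an array (children of position i at 2i + 1 and 2i + 2), and the names `basic_step`, `sift_down`, and `heapify` are hypothetical.

```python
def basic_step(strs, ids, start):
    """BS2(l)/BS3(l): compare strs[i] for i in ids letter by letter from
    position `start`, their known common prefix length.
    Returns (winner, {loser: lcp(loser, winner)})."""
    alive, lost, p = list(ids), {}, start
    while len(alive) > 1:
        # "" stands for end-of-string and sorts before every letter
        keys = [strs[i][p:p + 1] for i in alive]
        low = min(keys)
        for i, key in zip(alive, keys):
            if key != low:
                lost[i] = p
        alive = [i for i, key in zip(alive, keys) if key == low]
        if low == "":                  # the survivors are equal strings
            for i in alive[1:]:
                lost[i] = p
            alive = alive[:1]
        p += 1
    return alive[0], lost

def sift_down(strs, lcp, i):
    """Restore the HoS property below position i. Invariant: the lcp fields
    of i and of its children all refer to one common, not larger string."""
    n = len(strs)
    while True:
        cand = [c for c in (i, 2 * i + 1, 2 * i + 2) if c < n]
        if len(cand) == 1:
            return
        top = max(lcp[c] for c in cand)     # larger lcp means smaller string
        tied = [c for c in cand if lcp[c] == top]
        best, lost = basic_step(strs, tied, top)  # letters are read only on ties
        for c, l in lost.items():
            lcp[c] = l                      # losers record lcp with the winner
        if best == i:
            return
        strs[i], strs[best] = strs[best], strs[i]
        lcp[i], lcp[best] = lcp[best], lcp[i]
        i = best

def heapify(strs):
    """Turn the list into a HoS in place; return the lcp-with-parent array."""
    lcp = [0] * len(strs)
    for i in range(len(strs) // 2 - 1, -1, -1):
        sift_down(strs, lcp, i)
    return lcp

words = ["banana", "band", "apple", "bandana", "ape", "apply", "ban"]
lcps = heapify(words)
print(words)  # ['ape', 'banana', 'apple', 'bandana', 'band', 'apply', 'ban']
print(lcps)   # [0, 0, 2, 3, 3, 4, 0]
```

Strings whose lcp is strictly below the maximum keep their stored value, because by the folklore lemma it already equals their lcp with the winner.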

Extracting next smallest string from a HoS. Extract the root. Both (now orphan) children are larger than their gone parent and have their lcp with respect to it. Comparing lcp-s suffices to find the smaller of the two. In case of a tie, BS2(l) finds the smaller and updates the lcp; record the updated lcp with the larger child. Promote the smaller child to the vacant parent position. Recurse in the subtree rooted at the promoted child. The HoS property is maintained. The tree might become unbalanced, but not higher.

Extracting next smallest string from a HoS. For a HoS of height h: O(h) node operations. Letter comparisons are made only from the common lcp on. ⇒ No more than a total of O(N) letter comparisons for heapify followed by sequential extraction of all strings: for each letter comparison at least one lcp grows, and an lcp never decreases. The HoS becomes unbalanced, but its height does not grow. Thm: sorting n strings of total length N, over an unbounded alphabet, is possible in O(n log n + N) time, using O(n) space.
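The extraction procedure can be sketched as below. Two simplifications are assumed and should be read as such: the HoS is built with a stand-in (`heapq.heapify` plus naively computed lcp values) rather than the O(n + N) heapify, and vacated array slots are marked with `None`. All function names are hypothetical.

```python
import heapq

def bs2(a, b, l):
    """BS2(l): a and b agree on their first l letters; compare from there.
    Returns (True if a <= b, lcp(a, b))."""
    while l < len(a) and l < len(b) and a[l] == b[l]:
        l += 1
    return a[l:l + 1] <= b[l:l + 1], l

def build_hos(strings):
    """Stand-in for the real heapify: heapq orders the array, then the lcp
    of every node with its parent is computed naively."""
    strs = list(strings)
    heapq.heapify(strs)
    lcp = [0] * len(strs)
    for i in range(1, len(strs)):
        _, lcp[i] = bs2(strs[i], strs[(i - 1) // 2], 0)
    return strs, lcp

def extract_min(strs, lcp):
    """Remove and return the root; promote the smaller child level by level.
    Vacated slots become None, so the tree may lose balance but not grow."""
    smallest, i = strs[0], 0
    while True:
        kids = [c for c in (2 * i + 1, 2 * i + 2)
                if c < len(strs) and strs[c] is not None]
        if not kids:
            strs[i] = None
            return smallest
        best = max(kids, key=lambda c: lcp[c])   # larger lcp, smaller string
        if len(kids) == 2 and lcp[kids[0]] == lcp[kids[1]]:
            a, b = kids
            a_first, l = bs2(strs[a], strs[b], lcp[a])
            best, other = (a, b) if a_first else (b, a)
            lcp[other] = l       # the larger child records the updated lcp
        strs[i], lcp[i] = strs[best], lcp[best]
        i = best

def hos_sort(strings):
    """Sort by building a HoS and extracting the minimum n times."""
    strs, lcp = build_hos(strings)
    return [extract_min(strs, lcp) for _ in range(len(strs))]

print(hos_sort(["band", "ban", "bandana", "apple", "ban", "", "b"]))
```

The promoted child keeps its lcp unchanged, since that value was measured against the string that has just moved up to become its parent.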

O(n log n + N) string sorting. String sorting is a classical problem that appears in textbooks [Knuth, AHU]. Variants: multikey sorting, parallel sorting. A weight-balanced ternary search trie [M 79] achieves this runtime. QuickSort with an average sorting time of O(n log n + N) [BS, 97]. Multi-phase merge sort, for enhancing cache utilization [I, 05. IBM in the ’80s].

O(n log n + N) string sorting. Indexing data structures for the suffixes of a single string (suffix trees, suffix arrays, BIS [AKLL, 05]) allow O(n log n + N) sorting. Some can be adapted to a general set of strings (BIS). All spend O(n log n + N) just to build the data structure and get the first result out.

Efficient On Demand Sorting. Thm: On Demand Sorting of n strings of total length N can be done with the extraction of the first result in O(n + N_1) time, after which the i-th result is retrieved in O(log n + N_i) time, with Σ_i N_i ≤ N.

Variation: find the smallest k < n strings. Maintain a HoS of k elements with parents LARGER than children, so that the root holds the largest of the smallest k. Build a HoS from an arbitrary k strings of the set. For each remaining string in the set: compare it with the root and determine the lcp of the two; if the new string is larger than the root, discard the new string; otherwise, discard the root, put the new string in its place, and sift down.

Find the smallest k < n strings. O(n log k + N) time to identify the k smallest of n, plus O(k log k) to get these sorted.
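The selection logic of this variation can be illustrated with an ordinary bounded max-heap. Note the swapped-in technique: the sketch below uses Python's `heapq` with plain string comparisons instead of a HoS, so it shows the discard-or-replace scheme but not the O(N) bound on letter comparisons. The names `Rev` and `k_smallest` are hypothetical.

```python
import heapq

class Rev:
    """Wrapper that reverses string order, so that heapq's min-heap keeps
    the LARGEST string at the root."""
    __slots__ = ("s",)

    def __init__(self, s):
        self.s = s

    def __lt__(self, other):
        return self.s > other.s

def k_smallest(strings, k):
    """Return the k smallest strings in sorted order (assumes 0 < k <= n)."""
    heap = [Rev(s) for s in strings[:k]]     # arbitrary k strings of the set
    heapq.heapify(heap)                      # root = largest of the current k
    for s in strings[k:]:
        if s < heap[0].s:                    # otherwise discard the new string
            heapq.heapreplace(heap, Rev(s))  # discard root, sift new one down
    return sorted(r.s for r in heap)         # the final O(k log k) sort

print(k_smallest(["pear", "apple", "fig", "banana", "cherry"], 3))
# ['apple', 'banana', 'cherry']
```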

Can HoS do additional operations of the ordinary (integer) heap? Already seen: extract-min costs O(height), yet does not keep the tree balanced. The classic delete takes the last leaf and sifts it down from the root, thus maintaining balance; in a HoS, that leaf loses the lcp it has gained and compares its leading letters again. Classic insertion works by sifting up; in a HoS, some nodes get their grandparent as their new parent and need to decrease their lcp.

BIS. Insertion, creating an embedded data structure. Original nodes do not move. Grandparents do not become parents. When pumping up for extraction, the smallest node leaves the BIS and becomes a HoS node. A HoS node never gets into a BIS.

BIS: Balanced Indexing Structures [AKLL, 2005]. Adapted here from suffixes to any set of strings. A BIS is an AVL tree with fixed-size extra information (lcp-s and pointers) in its nodes, which allows inserting a string of length l into a tree of size n in O(log n + l) time. Deletion takes O(log n) time.

Altogether. Thm: it is possible to construct a heap of n strings in O(n) time and support further string insertions and smallest-string extractions in time O(log(n) + log(m)) each, plus O(N) amortized over the whole sequence of heap operations, where m is the number of strings that were inserted after heapifying and have not been extracted yet.

Conclusion. Combining basic elements, we support a modern concept: On Demand. lcp proves once again to be an interesting, useful measure (lcp with what?). This is a real need: basic sort and k-smallest sort are implemented in a search engine product.