PhD Thesis Iwona Bialynicka-Birula Ranked Queries in Index Data Structures
21 September 2008 Iwona Bialynicka-Birula – PhD Thesis
Outline
  Background
  The problem
  State of the art
  Rank-sensitivity
  Making suffix trees rank-sensitive
  Experimental results
  A general framework
  Dynamic Cartesian trees
Part I Introduction and background
Rank-sensitivity
  Output-sensitive
    l – size of output set
    Query time: O(s(n) + l), s(n) = o(n)
  Rank-sensitive
    k – runtime parameter
    Query time: O(s(n) + k), k ≤ l
    Results in rank order
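The two query contracts above can be sketched in a few lines of Python. The function names, the `matches` predicate, and the convention that a smaller rank value means a better rank are assumptions made for illustration only. The top-k function is a baseline, not a rank-sensitive structure: it still scans all l matches, which is exactly the cost the rank-sensitive bound avoids.

```python
import heapq

def report_all(items, matches):
    """Output-sensitive contract: return all l matching items, unordered."""
    return [item for item in items if matches(item)]

def report_top_k(items, matches, rank, k):
    """Rank-sensitive contract: return the k best matches in rank order.

    Baseline only: it scans every match, so it costs O(l log k), not O(k).
    """
    return heapq.nsmallest(k, (it for it in items if matches(it)), key=rank)

# Hypothetical data: (page name, rank), smaller rank is better.
pages = [("a.html", 3), ("b.html", 1), ("c.txt", 2), ("d.html", 5)]
is_html = lambda page: page[0].endswith(".html")
top2 = report_top_k(pages, is_html, rank=lambda page: page[1], k=2)
```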
Motivation
  Output-sensitive data structures can still be too costly
  Most often additional criteria exist
  Examples
    Web pages – PageRank or similar
    Geometrical objects – Z-order
    Various databases – physical location
    News items – time stamp
    Biological databases – biological relevance
  Real-time systems
Suffix trees
  [Figure: suffix tree of the string senselessness$]
Range trees
Priority search trees
  Heap with respect to y coordinate
  Left subtree < right subtree
    inorder not necessarily x order (balanced)
  Three-sided query
    Input: x0, x1, y1
    Output: 〈x, y〉 : x0 ≤ x ≤ x1, y ≤ y1
  Dynamic version
    All points in leaves + possibly on root path
    Red-black rotations + "pushing down"
Dynamic priority search tree
  [Figure: dynamic priority search tree over the points 〈1, 2〉 〈3, 9〉 〈4, 18〉 〈5, 12〉 〈8, 13〉 〈10, 7〉 〈11, 1〉 〈12, 3〉 〈14, 5〉 〈15, 4〉]
  Example query: 4 ≤ x ≤ 13, y ≤ 11
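A three-sided query on a priority search tree can be sketched as follows. This is a simplified static variant, assuming a min-heap on y and a median split on x; it is not the dynamic red-black version shown on the slide, and the names `PSTNode`, `build_pst` and `three_sided` are hypothetical. The construction re-sorts at every level for brevity.

```python
class PSTNode:
    def __init__(self, point, split, left, right):
        self.point = point   # point with the smallest y in this subtree
        self.split = split   # points with x <= split go left, others right
        self.left = left
        self.right = right

def build_pst(points):
    """Build a static priority search tree from a list of (x, y) pairs."""
    if not points:
        return None
    pts = sorted(points)                  # order by x
    top = min(pts, key=lambda p: p[1])    # heap condition on y
    pts.remove(top)
    if not pts:
        return PSTNode(top, top[0], None, None)
    split = pts[(len(pts) - 1) // 2][0]   # median x of the remaining points
    left = [p for p in pts if p[0] <= split]
    right = [p for p in pts if p[0] > split]
    return PSTNode(top, split, build_pst(left), build_pst(right))

def three_sided(node, x0, x1, y1, out=None):
    """Report all points with x0 <= x <= x1 and y <= y1."""
    if out is None:
        out = []
    if node is None or node.point[1] > y1:
        return out                        # heap order: nothing below qualifies
    if x0 <= node.point[0] <= x1:
        out.append(node.point)
    if x0 <= node.split:
        three_sided(node.left, x0, x1, y1, out)
    if x1 > node.split:
        three_sided(node.right, x0, x1, y1, out)
    return out

# Points and query taken from the example slide.
points = [(11, 1), (1, 2), (12, 3), (3, 9), (4, 18),
          (10, 7), (15, 4), (5, 12), (14, 5), (8, 13)]
result = three_sided(build_pst(points), 4, 13, 11)
```

The pruning on `node.point[1] > y1` is what makes the query output-sensitive: a subtree is abandoned as soon as its smallest y exceeds the bound.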
Cartesian trees
  Heap with respect to y coordinate
  Inorder matches x order
  Set of points uniquely determines tree shape
  Dynamic version presented in Part IV
Cartesian tree example
  〈2, 22〉 〈21, 20〉 〈18, 19〉 〈6, 17〉 〈20, 16〉 〈7, 15〉 〈8, 13〉 〈5, 12〉 〈9, 11〉 〈17, 10〉 〈3, 9〉 〈16, 8〉 〈10, 7〉 〈22, 6〉 〈15, 4〉 〈12, 3〉 〈1, 2〉 〈11, 1〉 〈19, 21〉 〈4, 18〉 〈14, 5〉
Part II Adding rank-sensitivity to suffix trees
The problem
Naive solution (1)
  For each node, store best-ranking descendant
  For each leaf–ancestor pair, store successor
  Problem: quadratic space
Naive solution (2)
  Store only distinct successors
  Space is now O(n log n)
  Problem: non-constant time, not rank-sensitive
Ranked tree
  Store predecessors, not successors
    There is now a need to store the 1st, 2nd, 4th, 8th, ..., 2^l-th best-ranking descendants instead of just the first
    O(log n) per node
  Store only distinct predecessors
    O(log n) per node
  Augment list with pointers to quickly access any light depth
    O(log n) per node
Example
Complexity
  Space: O(n log n)
  Query time: O(k)
    Amortized over the k elements reported
    No additional search cost if pointer to node given (e.g. suffix trees)
Experimental results
  Used various texts and queries
    random, English, DNA
    up to 2×10^6 characters long
  Query time depends only on k
  For total results < even faster than unsorted subtree traversal
    4–5 times faster than traversal + sorting
  For all values tested faster than traversal + sorting
Part III Rank-sensitivity – a general framework
Generic solution
  Tree data structures
  Result set is obtained from
    An interval of consecutive leaves
    or O(polylog n) such disjoint intervals
  Examples
    Suffix trees
    Range trees
    Hierarchy
Our results in this model
  D – output-sensitive data structure
    Query time: O(t(n) + l)
    Space: |D| in memory words
    s(n) – number of items stored in D (incl. copies)
  D′ – rank-sensitive version of D
    Static version
      Query time: O(t(n) + k)
      Space: |D| + O(s(n) log^ε n) for any 0 < ε < 1
    Dynamic version
      Query time: O(t(n) + k) + O(log n / log log n) per interval
      Space: |D| + O(s(n) log n / log log n)
      Update: O(log n) per copy
Basic idea
  O(n log n) space
Query
  Reduced to merging O(log n) lists
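The reduction to list merging can be illustrated with a lazy k-way merge that stops after k items. This sketch assumes each list holds rank values already sorted with the best (smallest) first, and it uses Python's binary heap via `heapq.merge`, which costs O(log m) per reported item for m lists. It does not reproduce the constant-time-per-item machinery of the thesis.

```python
import heapq
from itertools import islice

def top_k_merged(sorted_lists, k):
    """Return the k best items across several lists, each sorted by rank."""
    # heapq.merge is lazy, so only about k items are ever pulled from it.
    return list(islice(heapq.merge(*sorted_lists), k))

# Hypothetical rank lists, one per tree node covering the query interval.
lists = [[1, 4, 9], [2, 3, 10], [5, 6]]
best = top_k_merged(lists, 4)
```

Because the merge is lazy, the work depends on k and on the number of lists, not on the total length of the lists.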
Space reduction in static case
  Chazelle 1988
  O(n log n) space
Dynamic case
  Store explicit values in lists
  Weight-balanced B-tree
    Degree proportional to log n/log log n
  Dynamic fractional cascading
  Multi-Q-heaps
    Constant-depth hierarchical pipeline of heaps
Multi-Q-heaps
  Similar to Q-heap
  Stores up to O(log N/log log N) integers
  The integers are from 0...O(N)
  Search, find-min, insert, delete takes O(1)
  Requires lookup tables of O(N) space
  Performs operations on any subset of items
  Simple implementation, no special instructions
Multi-Q-heaps in our solution
  Constant depth
  [Figure: hierarchy of multi-Q-heaps; labels: O(log N), log N / log log N]
Multi-Q-heaps in our solution
  Nodes have non-constant degree
Part IV Dynamic Cartesian Trees
Cartesian trees
  Vuillemin 1980
  Nodes store points 〈x, y〉
    y value can be viewed as priority
  Recursive definition
    Root stores point with greatest y value
    x value partitions remaining points (left and right subtrees)
Cartesian tree example
  〈2, 22〉 〈21, 20〉 〈18, 19〉 〈6, 17〉 〈20, 16〉 〈7, 15〉 〈8, 13〉 〈5, 12〉 〈9, 11〉 〈17, 10〉 〈3, 9〉 〈16, 8〉 〈10, 7〉 〈22, 6〉 〈15, 4〉 〈12, 3〉 〈1, 2〉 〈11, 1〉 〈19, 21〉 〈4, 18〉 〈14, 5〉
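The recursive definition translates almost directly into code. The sketch below is a quadratic-time illustration of the definition, assuming distinct x and y values; the nested-tuple representation `(point, left, right)` and the function names are choices made here, not taken from the thesis.

```python
def cartesian_tree(points):
    """Return the Cartesian tree as nested (point, left, right) tuples."""
    if not points:
        return None
    root = max(points, key=lambda p: p[1])            # greatest y value
    left = [p for p in points if p[0] < root[0]]      # x partitions the rest
    right = [p for p in points if p[0] > root[0]]
    return (root, cartesian_tree(left), cartesian_tree(right))

def inorder(tree):
    """List the points of the tree in inorder."""
    if tree is None:
        return []
    point, left, right = tree
    return inorder(left) + [point] + inorder(right)

# Points from the example slide.
points = [(2, 22), (21, 20), (18, 19), (6, 17), (20, 16), (7, 15), (8, 13),
          (5, 12), (9, 11), (17, 10), (3, 9), (16, 8), (10, 7), (22, 6),
          (15, 4), (12, 3), (1, 2), (11, 1), (19, 21), (4, 18), (14, 5)]
tree = cartesian_tree(points)
```

Since every step is forced by the point set, the input order does not matter, which is the uniqueness property the slides rely on.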
Applications
  Priority queue
  Randomized searching (treaps)
  Range and dominance searching
  RMQ (Range Maximum Query)
  LCA (Least Common Ancestor)
  Integer sorting
  Memory management
  Suffix trees
  ...
From RMQ to LCA
  [Figure: array 2, 22, 9, 18, 12, 17, 15, 13, 11, 7, 1, 3, 5, 4, 8, 10, 19, 21, 16, 20, ... with highlighted subrange 18, 12, 17, 15, 13, 11, 7]
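The reduction can be checked on a small example: the maximum of a subrange is the value at the lowest common ancestor of the range endpoints in the Cartesian tree built over the array. The sketch below assumes distinct values, builds parent pointers with the standard linear-time stack method, and answers LCA by walking up the tree, which takes time proportional to the tree height. Constant-time LCA structures exist but are not shown, and all names here are hypothetical.

```python
def cartesian_parents(values):
    """Parent index of each position in the max-Cartesian tree (root: -1)."""
    parent = [-1] * len(values)
    stack = []                        # indices with decreasing values
    for i, v in enumerate(values):
        last = -1
        while stack and values[stack[-1]] < v:
            last = stack.pop()
        if last != -1:
            parent[last] = i          # last popped becomes left child of i
        if stack:
            parent[i] = stack[-1]     # i becomes right child of stack top
        stack.append(i)
    return parent

def lca(parent, i, j):
    """Lowest common ancestor of nodes i and j, by walking up from both."""
    ancestors = set()
    while i != -1:
        ancestors.add(i)
        i = parent[i]
    while j not in ancestors:
        j = parent[j]
    return j

def range_max(values, parent, i, j):
    """Maximum of values[i..j] (inclusive) via the LCA of i and j."""
    return values[lca(parent, i, j)]

# The 20 values that are legible on the slide.
values = [2, 22, 9, 18, 12, 17, 15, 13, 11, 7,
          1, 3, 5, 4, 8, 10, 19, 21, 16, 20]
parent = cartesian_parents(values)
highlighted = range_max(values, parent, 3, 9)   # subrange 18 ... 7
```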
From LCP array to suffix tree
  Sorted suffixes: $ I$ IPPI$ ISSIPPI$ ISSISSIPPI$ MISSISSIPPI$ PI$ PPI$ SIPPI$ SISSIPPI$ SSIPPI$ SSISSIPPI$
  [Figure: suffix tree of MISSISSIPPI$ obtained from the LCP array]
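For reference, the sorted suffixes and the LCP array for this example can be computed naively as below. This is a quadratic sketch for illustration, not an efficient construction, and the function names are hypothetical. A min-heap Cartesian tree over the resulting LCP array then gives the branching structure of the suffix tree.

```python
def suffix_array(text):
    """Start positions of the suffixes of text in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def lcp_array(text, sa):
    """lcp[i] is the common prefix length of suffixes sa[i] and sa[i + 1]."""
    lcp = []
    for a, b in zip(sa, sa[1:]):
        n = 0
        while (a + n < len(text) and b + n < len(text)
               and text[a + n] == text[b + n]):
            n += 1
        lcp.append(n)
    return lcp

text = "MISSISSIPPI$"
sa = suffix_array(text)
suffixes = [text[i:] for i in sa]
lcp = lcp_array(text, sa)
```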
History
  Static setting
    O(n) construction time, provided elements already sorted
  Randomized
    Random priority values – treaps
    O(log n) expected height
    O(log n) expected update time
    Non-uniform probability distributions yield O(√n) or even O(n) height
  Dynamic and deterministic
    ???
Our result
  Dynamic Cartesian tree
    Supports insertion
    Supports weak deletion
    Maintains actual tree structure between each operation
    O(log n) amortized time per operation
Solution outline
  Combinatorial analysis
    How many tree elements change due to n insertions?
    Notion of entropy is exploited
  Auxiliary structure for accessing tree
    Needed to quickly access tree elements which need to change
    Based on the interval tree
Insertion
  Existing points: 〈2, 22〉 〈21, 20〉 〈18, 19〉 〈6, 17〉 〈20, 16〉 〈7, 15〉 〈8, 13〉 〈5, 12〉 〈9, 11〉 〈17, 10〉 〈3, 9〉 〈16, 8〉 〈10, 7〉 〈22, 6〉 〈15, 4〉 〈12, 3〉 〈1, 2〉 〈11, 1〉 〈19, 21〉 〈4, 18〉 〈14, 5〉
  Inserted point: 〈13, 14〉
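Because the point set determines the tree shape uniquely, the result of an insertion can be reproduced with an ordinary treap-style insertion: place the point as a leaf by x, then rotate it up while it violates the heap order on y. The sketch below is only that textbook method, assuming distinct x and y values and using hypothetical names. It is not the thesis algorithm, and a single insertion may need a number of rotations linear in the height.

```python
class Node:
    def __init__(self, x, y):
        self.x, self.y = x, y
        self.left = self.right = None

def rotate_right(node):
    pivot = node.left
    node.left, pivot.right = pivot.right, node
    return pivot

def rotate_left(node):
    pivot = node.right
    node.right, pivot.left = pivot.left, node
    return pivot

def insert(root, x, y):
    """Insert (x, y) and return the new root (max-heap on y, BST on x)."""
    if root is None:
        return Node(x, y)
    if x < root.x:
        root.left = insert(root.left, x, y)
        if root.left.y > root.y:
            root = rotate_right(root)
    else:
        root.right = insert(root.right, x, y)
        if root.right.y > root.y:
            root = rotate_left(root)
    return root

def as_tuple(node):
    """Convert the tree to nested ((x, y), left, right) tuples."""
    if node is None:
        return None
    return ((node.x, node.y), as_tuple(node.left), as_tuple(node.right))

# Points from the slide, followed by the inserted point (13, 14).
points = [(2, 22), (21, 20), (18, 19), (6, 17), (20, 16), (7, 15), (8, 13),
          (5, 12), (9, 11), (17, 10), (3, 9), (16, 8), (10, 7), (22, 6),
          (15, 4), (12, 3), (1, 2), (11, 1), (19, 21), (4, 18), (14, 5)]
root = None
for x, y in points:
    root = insert(root, x, y)
root = insert(root, 13, 14)
```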
Insertion – worst case
  〈1, 16〉 〈7, 4〉 〈8, 2〉 〈17, 15〉 〈16, 13〉 〈15, 11〉 〈14, 9〉 〈13, 7〉 〈12, 5〉 〈11, 3〉 〈10, 1〉 〈2, 14〉 〈3, 12〉 〈4, 10〉 〈5, 8〉 〈6, 6〉 〈9, 17〉
Analysis – main idea
  Inserting new elements does not require comparing y coordinates of existing points
  In turn, deleting points does
  Conclusion: insertions reduce tree information content...
  so information entropy can be used as a potential function in an amortized analysis
Insertion revisited
  [Figure: tree edges annotated with > comparisons]
Insertion reversed (deletion)
  [Figure: the same edges with the comparisons unknown (?)]
Formally...
  Tree T induces partial order ≺_T on nodes
    Defined by the heap condition
  Partial order ≺_T has ℒ(T) linear extensions
    Linear extensions are permutations satisfying the order, i.e. P[i] ≺_T P[j] ⇒ i < j
  We define missing entropy: ℋ(T) = log ℒ(T)
    Information needed to sort nodes given tree topology
  [Figure: tree on nodes A–J and some of its linear extensions: A B H J F G I D E C; H A J F G I D B E C; A J D F H G I B E C; D J I A H F G E B C; H J D A F G I E B C; A D H F G J I B E C; ...]
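For a tree-shaped heap order, the number of linear extensions has a closed form: n! divided by the product of all subtree sizes. The sketch below uses that standard formula to compute ℒ(T) and ℋ(T). The nested-tuple tree representation `(left, right)`, the function names, and the choice of base-2 logarithm are assumptions made here for illustration.

```python
from math import factorial, log2

def size_and_product(tree):
    """Return (number of nodes, product of all subtree sizes)."""
    if tree is None:
        return 0, 1
    left_size, left_prod = size_and_product(tree[0])
    right_size, right_prod = size_and_product(tree[1])
    size = left_size + right_size + 1
    return size, size * left_prod * right_prod

def linear_extensions(tree):
    """Number of node orderings consistent with the heap condition."""
    n, prod = size_and_product(tree)
    return factorial(n) // prod

def missing_entropy(tree):
    """Bits needed to fully sort the nodes given only the tree topology."""
    return log2(linear_extensions(tree))

leaf = (None, None)
path = ((leaf, None), None)            # 3 nodes in a chain: order is known
fork = (leaf, leaf)                    # root with two children
complete7 = (fork, fork)               # complete binary tree on 7 nodes
```

A path has exactly one linear extension, so its missing entropy is zero, while bushier trees leave more of the order undetermined.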
Missing entropy
  Can be zero: ℋ(T) = 0
  Or can be up to ℋ(T) = O(n log n)
  When an insertion affects k edges, ℋ(T) increases by Ω(k)
So what now?
  Amortized number of edge modifications is O(log n) per insertion into an initially empty tree
  Node modifications are always constant
  But how to access the edges to modify?
    Without increasing the complexity
Implementation overview
  Companion interval tree stores tree edges
  Edges in Cartesian tree are either disjoint or nested
    So the interval tree has additional properties
  Operations are tailored to the special case of the Cartesian tree
Insertion once again
  1. Find parent
  2. Edges affected
  3a. Delete 2
  3b. Insert 3
  4. Shrink k
Action implementations
  1. Find parent: O(log n)
     Uses the interval tree as a search tree
  2. Edges affected: O(log n + k)
     Special kind of stabbing query
  3. Insert and delete: O(1) ∗ O(log n)
     Standard interval tree operations
  4. Shrink: k ∗ O(1)
     Emulating using inserts and deletes would yield O(k ∗ log n)
     Amortized argument based on the fact that shrinking edge travels down
Summary
  Rank-sensitivity
  Rank-sensitive suffix trees + experimental results
  A general framework
  Dynamic Cartesian trees
Thank you! Questions?