I/O-Algorithms Lars Arge Aarhus University February 27, 2007

I/O-Model
Parameters
  N = # elements in problem instance
  B = # elements that fit in a disk block
  M = # elements that fit in main memory
  T = output size in searching problem
We often assume that M > B^2
I/O: Movement of a block between memory and disk

Fundamental Bounds
              Internal      External
Scanning:     N             N/B
Sorting:      N log N       (N/B) log_{M/B} (N/B)
Permuting:    N             min{N, (N/B) log_{M/B} (N/B)}
Searching:    log_2 N       log_B N
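To get a feel for the gap between internal and external bounds, the following minimal sketch evaluates the standard I/O-model bounds (scanning N/B, sorting (N/B) log_{M/B} (N/B), permuting min{N, sort}, searching log_B N) without constant factors. The function names and the sample sizes are illustrative assumptions, not part of the lecture.

```python
import math


def scan_ios(n, b):
    """Blocks read by one linear scan of n elements: ceil(n / b)."""
    return math.ceil(n / b)


def sort_ios(n, m, b):
    """(n/b) * log_{m/b}(n/b): the sorting bound, constant factors ignored."""
    blocks = n / b
    return blocks * max(1.0, math.log(blocks) / math.log(m / b))


def permute_ios(n, m, b):
    """min{n, sort(n)}: move elements one by one, or sort them."""
    return min(n, sort_ios(n, m, b))


def search_ios(n, b):
    """log_b(n): the searching bound, constant factors ignored."""
    return math.log(n) / math.log(b)
```

For example, with the hypothetical sizes N = 10^9, M = 10^6 and B = 10^3, `sort_ios` evaluates to about 2 * 10^6 I/Os, far below the roughly 3 * 10^10 operations suggested by N log_2 N.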

Fundamental Data Structures
B-trees: Node degree Θ(B) ⇒ queries in O(log_B N + T/B)
  – Rebalancing using split/fuse ⇒ updates in O(log_B N)
Weight-balanced B-trees: Weight rather than degree constraint
  – Ω(w(v)) updates below v between rebalancing operations on v
Persistent B-trees:
  – Update in current version in O(log_B N)
  – Search in all previous versions in O(log_B N + T/B)
Buffer trees
  – Batching of operations to obtain O((1/B) log_{M/B} (N/B)) amortized bounds ⇒ construction algorithms

Last time: Interval management
Maintain N intervals with unique endpoints dynamically such that a stabbing query with point x can be answered efficiently
Static solution: Persistent B-tree
  – Linear space and O(log_B N + T/B) query
Dynamic solution: External interval tree
  – O(log_B N) update

Internal Interval Tree
Base tree on endpoints: "slab" X_v associated with each node v
Interval stored in highest node v where it contains midpoint of X_v
Intervals I_v associated with v stored in
  – Left slab list sorted by left endpoint (search tree)
  – Right slab list sorted by right endpoint (search tree)
Linear space and O(log N) update

Internal Interval Tree
Query with x on left side of midpoint of X_root
  – Search left slab list left-right until finding non-stabbed interval
  – Recurse in left child
O(log N + T) query bound
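The stabbing query on the internal interval tree can be sketched as follows. This is a simplified static version under stated assumptions: the slab lists are plain sorted Python lists rather than search trees, the tree is built once from all intervals (no updates), and the names `IntervalNode` and `stab` are hypothetical.

```python
class IntervalNode:
    """Static interval tree node; intervals are (left, right) tuples."""

    def __init__(self, intervals):
        endpoints = sorted(p for iv in intervals for p in iv)
        self.mid = endpoints[len(endpoints) // 2]
        # Intervals containing the midpoint are stored in this node.
        here = [iv for iv in intervals if iv[0] <= self.mid <= iv[1]]
        left = [iv for iv in intervals if iv[1] < self.mid]
        right = [iv for iv in intervals if iv[0] > self.mid]
        self.by_left = sorted(here)                           # left slab list
        self.by_right = sorted(here, key=lambda iv: -iv[1])   # right slab list
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None


def stab(node, x):
    """Report all intervals containing x, following one root-leaf path."""
    out = []
    while node is not None:
        if x < node.mid:
            # Every stored interval ends right of x; scan by left endpoint
            # until the first interval that starts after x.
            for iv in node.by_left:
                if iv[0] > x:
                    break
                out.append(iv)
            node = node.left
        else:
            # Symmetric case: scan by decreasing right endpoint.
            for iv in node.by_right:
                if iv[1] < x:
                    break
                out.append(iv)
            node = node.right
    return out
```

Each node is charged O(1) plus the number of intervals it reports, which is where the O(log N + T) bound comes from.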

Externalizing Interval Tree
Natural idea:
  – Block tree
  – Use B-tree for slab lists
Number of stabbed intervals in a large slab list may be small (or zero)
  – We can be forced to do an I/O in each of O(log N) nodes

Externalizing Interval Tree
Idea:
  – Decrease fan-out to Θ(√B) ⇒ height remains O(log_B N)
  – Θ(√B) slabs define Θ(B) multislabs
  – Interval stored in two slab lists (as before) and one multislab list
  – Intervals in small multislab lists collected in underflow structure
  – Query answered in v by looking at 2 slab lists rather than O(log N)

External Interval Tree
Linear space, O(log_B N + T/B) query, O(log_B N) update
General solution techniques:
  – Filtering: Charge part of query cost to output
  – Bootstrapping:
    * Use O(B^2) size structure in each internal node
    * Constructed using persistence
    * Dynamic using global rebuilding
  – Weight-balanced B-tree: Split/fuse in amortized O(1)

Three-Sided Range Queries
Interval management: "1.5 dimensional" search
More general 2d problem: Dynamic 3-sided range searching
  – Maintain set of points in plane such that given query (q_1, q_2, q_3), all points (x,y) with q_1 ≤ x ≤ q_2 and y ≥ q_3 can be found efficiently
[Figure: interval (x_1, x_2) viewed as a point and a stabbing query x as a query anchored at (x,x); three-sided query (q_1, q_2, q_3)]

Three-Sided Range Queries
Report all points (x,y) with q_1 ≤ x ≤ q_2 and y ≥ q_3
Static solution:
  – Sweep top-down inserting x in persistent B-tree at (x,y)
  – Answer query by performing range query with [q_1, q_2] in B-tree at q_3
Optimal:
  – O(N/B) space
  – O(log_B N + T/B) query
  – O((N/B) log_{M/B} (N/B)) construction
Dynamic? In internal memory: priority search tree

Internal Priority Search Tree
Base tree on x-coordinates with nodes augmented with points
Heap on y-coordinates
  – Decreasing y values on root-leaf path
  – (x,y) on path from root to leaf holding x
  – If v holds point then parent(v) holds point
[Figure: example priority search tree]

Internal Priority Search Tree
Linear space
Insert of (x,y) (assuming fixed x-coordinate set):
  – Compare y with y-coordinate in root
  – Smaller: Recursively insert (x,y) in subtree on path to x
  – Bigger: Insert in root and recursively insert old point in subtree
O(log N) update
[Figure: example priority search tree, inserting (10,21)]

Internal Priority Search Tree
Query with (q_1, q_2, q_3) starting at root v:
  – Report point in v if satisfying query
  – Visit both children of v if point reported
  – Always visit child(ren) of v on path(s) to q_1 and q_2
O(log N + T) query
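A three-sided query on an internal priority search tree can be sketched as below. This is a simplified static variant, not necessarily the exact structure in the lecture: each node stores the highest remaining point and splits the rest at the median x-coordinate, x-coordinates are assumed distinct, and the query prunes a subtree as soon as its top point lies below q_3 (which the heap order permits). The names `PSTNode` and `three_sided` are hypothetical.

```python
class PSTNode:
    """Static priority search tree on a non-empty list of (x, y) points
    sorted by x, with distinct x-coordinates."""

    def __init__(self, pts):
        # Heap order: the point with the largest y goes in this node.
        top = max(range(len(pts)), key=lambda i: pts[i][1])
        self.point = pts[top]
        rest = pts[:top] + pts[top + 1:]
        mid = len(rest) // 2
        # Points with x < split go left, points with x >= split go right.
        self.split = rest[mid][0] if rest else None
        self.left = PSTNode(rest[:mid]) if mid > 0 else None
        self.right = PSTNode(rest[mid:]) if rest else None


def three_sided(node, q1, q2, q3, out):
    """Append all points with q1 <= x <= q2 and y >= q3 to out."""
    if node is None or node.point[1] < q3:
        return out  # everything below has even smaller y
    if q1 <= node.point[0] <= q2:
        out.append(node.point)
    if node.split is not None:
        if q1 < node.split:
            three_sided(node.left, q1, q2, q3, out)
        if q2 >= node.split:
            three_sided(node.right, q1, q2, q3, out)
    return out
```

A visited node is either on the search path to q_1 or q_2, reports its point, or is pruned immediately and can be charged to its parent, which mirrors the O(log N + T) argument.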

Externalizing Priority Search Tree
Natural idea: Block tree
Problem:
  – O(log_B N) I/Os to follow paths to q_1 and q_2
  – But O(T) I/Os may be used to visit other nodes ("overshooting")
  ⇒ O(log_B N + T) query

Externalizing Priority Search Tree
Solution idea:
  – Store B points in each node
    * O(B^2) points stored in each supernode
    * B output points can pay for "overshooting"
  – Bootstrapping:
    * Store O(B^2) points in each supernode in static structure

External Priority Search Tree
Base tree: Weight-balanced B-tree with branching parameter B/4 and leaf parameter B on x-coordinates
Points in "heap order":
  – Root stores B top points for each of the child slabs
  – Remaining points stored recursively
Points in each node stored in "B^2-structure"
  – Persistent B-tree structure for static problem
Linear space

External Priority Search Tree
Query with (q_1, q_2, q_3) starting at root v:
  – Query B^2-structure and report points satisfying query
  – Visit child v if
    * v on path to q_1 or q_2
    * All points corresponding to v satisfy query

External Priority Search Tree
Analysis:
  – O(1) I/Os used to visit node v (plus output)
  – O(log_B N) nodes on path to q_1 or q_2
  – For each node v not on path to q_1 or q_2 visited, B points reported in parent(v)
  ⇒ O(log_B N + T/B) query

External Priority Search Tree
Insert (x,y) (ignoring insert in base tree and rebalancing):
  – Find relevant node v:
    * Query B^2-structure to find B points in root corresponding to node u on path to x
    * If y smaller than y-coordinates of all B points then recursively search in u
  – Insert (x,y) in B^2-structure of v
  – If B^2-structure contains > B points for child u, remove lowest point and insert recursively in u
Delete: Similarly

External Priority Search Tree
Analysis:
  – Update visits O(log_B N) nodes
  – B^2-structure queried/updated in each node
    * One query
    * One insert and one delete
B^2-structure analysis:
  – Query: O(1 + T/B)
  – Update: O(1) using global rebuilding
    * Store updates in update block
    * Rebuild after B updates using O(B) I/Os
⇒ O(log_B N) I/O updates
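The "update block plus global rebuilding" idea can be sketched in isolation. In this minimal illustration the static structure is just a sorted list standing in for the persistent-B-tree-based structure, and the class name `BufferedStatic`, its fields and the `rebuilds` counter are hypothetical. The point is only the pattern: log updates in a buffer of size B, consult the buffer on every query, and rebuild the static part once B updates have accumulated.

```python
class BufferedStatic:
    """Static point set plus an update block, rebuilt after b updates."""

    def __init__(self, points, b):
        self.b = b
        self.static = sorted(points)  # stand-in for the static structure
        self.pending = []             # the "update block": (op, point) pairs
        self.rebuilds = 0

    def _live(self):
        # Apply the logged updates, in order, on top of the static points.
        live = set(self.static)
        for op, p in self.pending:
            if op == "ins":
                live.add(p)
            else:
                live.discard(p)
        return live

    def _log(self, op, p):
        self.pending.append((op, p))
        if len(self.pending) >= self.b:
            # Global rebuild: its cost is spread over the b logged updates.
            self.static = sorted(self._live())
            self.pending = []
            self.rebuilds += 1

    def insert(self, p):
        self._log("ins", p)

    def delete(self, p):
        self._log("del", p)

    def query(self, q1, q2, q3):
        """Three-sided query answered from static part and update block."""
        return sorted(p for p in self._live()
                      if q1 <= p[0] <= q2 and p[1] >= q3)
```

If a rebuild costs O(B) I/Os and happens once per B updates, each update pays O(1) amortized, which is the accounting used in the analysis above.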

Dynamic Base Tree
Deletion:
  – Delete point as previously
  – Delete x-coordinate from base tree using global rebuilding
  ⇒ O(log_B N) I/Os amortized
Insertion:
  – Insert x-coordinate in base tree and rebalance (using splits)
  – Insert point as previously
Split: Boundary in v becomes boundary in parent(v)
[Figure: node v split into v' and v'']

Dynamic Base Tree
Split: When v splits, B new points needed in parent(v)
One point obtained from v' (v'') using "bubble-up" operation:
  – Find top point p in v'
  – Insert p in B^2-structure
  – Remove p from B^2-structure of v'
  – Recursively bubble-up point to v'
Bubble-up in O(log_B w(v)) I/Os
  – Follow one path from v to leaf
  – Uses O(1) I/O in each node
⇒ Split in O(B log_B w(v)) = O(w(v)) I/Os

Dynamic Base Tree
O(1) amortized split cost:
  – Cost: O(w(v))
  – Weight-balanced base tree: Ω(w(v)) inserts below v between splits
External Priority Search Tree
  – Space: O(N/B)
  – Query: O(log_B N + T/B)
  – Updates: O(log_B N) I/Os amortized
Amortization can be removed from update bound in several ways
  – Utilizing lazy rebuilding

Summary/Conclusion: Priority Search Tree
We have now discussed structures for special cases of two-dimensional range searching
  – Space: O(N/B)
  – Query: O(log_B N + T/B)
  – Updates: O(log_B N)
Cannot be obtained for general (4-sided) 2d range searching:
  – O(log_B N + T/B) query requires Ω((N/B) log_B N / log_B log_B N) space
  – O(N/B) space requires Ω(√(N/B) + T/B) query

External Range Tree
Base tree: Weight-balanced tree with branching parameter log_B N and leaf parameter B on x-coordinates
  ⇒ O(log_B N / log_B log_B N) height
Points below each node stored in 4 linear space secondary structures:
  – "Right" priority search tree
  – "Left" priority search tree
  – B-tree on y-coordinates
  – Interval (priority search) tree
⇒ O((N/B) log_B N / log_B log_B N) space

External Range Tree
Secondary interval tree:
  – Connect points in each slab in y-order
  – Project obtained segments onto the y-axis
  – Intervals stored in priority search tree
    * Interval augmented with pointer to corresponding points in y-coordinate B-tree in corresponding child node

External Range Tree
Query with (q_1, q_2, q_3, q_4) answered in top node with q_1 and q_2 in different slabs v_1 and v_2
Points in slab v_1
  – Found with 3-sided query in v_1 using right priority search tree
Points in slab v_2
  – Found with 3-sided query in v_2 using left priority search tree
Points in slabs between v_1 and v_2
  – Answer stabbing query with q_3 using interval tree ⇒ first point above q_3 in each of the slabs
  – Find points using y-coordinate B-tree in slabs

External Range Tree
Query analysis:
  – O(log_B N / log_B log_B N) I/Os to find relevant node
  – O(log_B N + T/B) I/Os to answer two 3-sided queries
  – O(log_B N) I/Os to query interval tree
  – O(log_B N + T/B) I/Os to traverse B-trees
⇒ O(log_B N + T/B) I/Os

External Range Tree
Insert:
  – Insert x-coordinate in weight-balanced B-tree
    * Split of v can be performed in O(w(v) log_B w(v)) I/Os
    ⇒ O(log_B^2 N / log_B log_B N) I/Os
  – Update secondary structures in all nodes on one root-leaf path
    * Update priority search trees
    * Update interval tree
    * Update B-tree
    ⇒ O(log_B^2 N / log_B log_B N) I/Os
Delete:
  – Similar and using global rebuilding

Summary: External Range Tree
2d range searching in O((N/B) log_B N / log_B log_B N) space
  – O(log_B N + T/B) I/O query
  – O(log_B^2 N / log_B log_B N) I/O update
Optimal among O(log_B N + T/B) query structures

kdB-tree
kd-tree:
  – Recursive subdivision of point set into two halves using vertical/horizontal line
  – Horizontal line on even levels, vertical on odd levels
  – One point in each leaf
Linear space and logarithmic height

kd-Tree: Query
Query
  – Recursively visit nodes corresponding to regions intersecting query
  – Report points in subtrees/nodes completely contained in query
Query analysis
  – Horizontal line intersects Q(N) = 2 + 2Q(N/4) = O(√N) regions
  – Query covers T regions
⇒ O(√N + T) I/Os worst-case
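The recursive kd-tree query can be sketched as follows. This is an in-memory illustration under stated assumptions: nodes are plain tuples, the tree splits on y at even depths (a horizontal line) and on x at odd depths, and the shortcut of reporting a completely contained subtree without further tests is omitted for brevity. The function names `build` and `range_query` are hypothetical.

```python
def build(pts, depth=0):
    """Build a kd-tree with one point per leaf from a list of (x, y)."""
    if not pts:
        return None
    if len(pts) == 1:
        return ("leaf", pts[0])
    axis = 1 if depth % 2 == 0 else 0  # even level: horizontal line (split on y)
    pts = sorted(pts, key=lambda p: p[axis])
    mid = len(pts) // 2
    split = pts[mid][axis]
    # Left subtree holds coordinates <= split, right subtree >= split.
    return ("node", axis, split,
            build(pts[:mid], depth + 1), build(pts[mid:], depth + 1))


def range_query(node, lo, hi, out):
    """Append all points p with lo <= p <= hi (componentwise) to out."""
    if node is None:
        return out
    if node[0] == "leaf":
        p = node[1]
        if lo[0] <= p[0] <= hi[0] and lo[1] <= p[1] <= hi[1]:
            out.append(p)
        return out
    _, axis, split, left, right = node
    # Only descend into regions that intersect the query rectangle.
    if lo[axis] <= split:
        range_query(left, lo, hi, out)
    if hi[axis] >= split:
        range_query(right, lo, hi, out)
    return out
```

A query is given as its lower-left and upper-right corners, for example `range_query(root, (q1, q3), (q2, q4), [])` for the rectangle [q1, q2] x [q3, q4].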

kdB-tree
kdB-tree:
  – Stop subdivision when leaf contains between B/2 and B points
  – BFS-blocking of internal nodes
Query as before
  – Analysis as before but each region now contains Θ(B) points
⇒ O(√(N/B) + T/B) I/O query
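The recurrence behind the query bound can be unrolled explicitly. The sketch below counts regions crossed by one axis-parallel line: two levels of subdivision turn a region of N points into four regions of N/4 points, of which the line crosses at most two. Writing Q_B for the blocked variant (recursion stops at Θ(B) points per leaf) is a notational assumption made here.

```latex
\begin{align*}
Q(N)   &= 2 + 2\,Q(N/4) \\
       &= 2 + 4 + 8 + \dots + 2^{\log_4 N} + 2^{\log_4 N}\,Q(1) \\
       &= O\bigl(2^{\log_4 N}\bigr) = O\bigl(\sqrt{N}\bigr), \\
Q_B(N) &= O\bigl(2^{\log_4 (N/B)}\bigr) = O\bigl(\sqrt{N/B}\bigr).
\end{align*}
```

A rectangular query has four sides, so it crosses O(√(N/B)) leaf regions in total; the remaining visited leaves lie completely inside the query and contribute the T/B term.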

Construction of kdB-tree
Simple algorithm
  – Find median of y-coordinates (construct root)
  – Distribute points based on median
  – Recursively build subtrees
  – Construct BFS-blocking top-down
Idea in improved algorithm
  – Construct log √(M/B) levels at a time using O(N/B) I/Os

Construction of kdB-tree
Sort N points by x- and by y-coordinates using O((N/B) log_{M/B} (N/B)) I/Os
Building log √(M/B) levels (√(M/B) nodes) in O(N/B) I/Os:
  1. Construct √(M/B) by √(M/B) grid with N/√(M/B) points in each slab
  2. Count number of points in each grid cell and store in memory
  3. Find slab s with median x-coordinate
  4. Scan slab s to find median x-coordinate and construct node
  5. Split slab containing median x-coordinate and update counts
  6. Recurse on each side of median x-coordinate using grid (step 3)
⇒ Grid grows to Θ(M/B) during algorithm
⇒ Each node constructed in O(N/(√(M/B) · B)) I/Os

kdB-tree
kdB-tree:
  – Linear space
  – Query in O(√(N/B) + T/B) I/Os
  – Construction in O((N/B) log_{M/B} (N/B)) I/Os
  – Point search in O(log_B N) I/Os
Dynamic?
  – Deletions relatively easily in O(log_B N) I/Os (partial rebuilding)

kdB-tree Insertion using Logarithmic Method
Partition point set S into subsets S_0, S_1, …, S_{log N}, with |S_i| = 2^i or |S_i| = 0
Build kdB-tree D_i on S_i
Query: Query each D_i
Insert: Find first empty D_i and construct D_i out of the new point and the elements in S_0, S_1, …, S_{i-1}
  – O((1/B) log_{M/B} (N/B)) I/Os per moved point
  – Point moved O(log N) times
⇒ O((1/B) log_{M/B} (N/B) · log N) I/Os amortized
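The logarithmic method itself is independent of the underlying static structure, so it can be sketched with a stand-in. In the minimal illustration below each D_i is just a sorted list in place of a kdB-tree, and the class name `LogMethod` and its methods are hypothetical. What it shows is the binary-counter behaviour: level i is either empty or holds exactly 2^i points.

```python
class LogMethod:
    """Insert-only dynamization of a static structure (here: a sorted list)."""

    def __init__(self):
        # levels[i] is either empty or a static structure on exactly 2**i points.
        self.levels = []

    def insert(self, p):
        carry = [p]
        i = 0
        # Collect S_0, ..., S_{i-1} until the first empty level i is found.
        while i < len(self.levels) and self.levels[i]:
            carry.extend(self.levels[i])
            self.levels[i] = []
            i += 1
        if i == len(self.levels):
            self.levels.append([])
        # "Construct D_i": 1 + (2**i - 1) = 2**i points are rebuilt from scratch.
        self.levels[i] = sorted(carry)

    def range_query(self, x1, x2, y1, y2):
        """Query every non-empty D_i and concatenate the answers."""
        return [p for level in self.levels for p in level
                if x1 <= p[0] <= x2 and y1 <= p[1] <= y2]
```

After 11 insertions the level sizes are 1, 2, 0, 8, the binary representation of 11 read from the least significant bit. A point only ever moves to a higher level, which is why it is rebuilt O(log N) times.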

kdB-tree Insertion and Deletion
Insert: Use logarithmic method ignoring deletes
Delete: Simply delete point p from relevant D_i
  – i can be calculated based on # insertions since p was inserted
  – # insertions calculated by storing insertion number of each point in separate B-tree
  ⇒ O(log_B N) extra update cost
To maintain O(log N) structures D_i
  – Perform global rebuild after every Θ(N) updates
  ⇒ O((1/B) log_{M/B} (N/B)) extra update cost

Summary: kdB-tree
2d range searching in O(N/B) space
  – Query in O(√(N/B) + T/B) I/Os
  – Construction in O((N/B) log_{M/B} (N/B)) I/Os
  – Updates in O((1/B) log_{M/B} (N/B) · log N + log_B N) = O(log_B^2 N) I/Os amortized
Optimal query among linear space structures

O-Tree Structure
O-tree:
  – B-tree on vertical slabs
  – B-tree on horizontal slabs in each vertical slab
  – kdB-tree on points in each leaf

O-Tree Query
Perform range search with q_1 and q_2 in vertical B-tree
  – Query all kdB-trees in leaves of two horizontal B-trees with x-interval intersected but not spanned by query
  – Perform range search with q_3 and q_4 in horizontal B-trees with x-interval spanned by query
    * Query all kdB-trees with range intersected by query

O-Tree Query Analysis
Vertical B-tree query:
Query of all kdB-trees in leaves of two horizontal B-trees:
Query horizontal B-trees:
Query kdB-trees not completely in query:
Query in kdB-trees completely contained in query:
⇒ O(√(N/B) + T/B) I/Os

O-Tree Update
Insert:
  – Search in vertical B-tree: O(log_B N) I/Os
  – Search in horizontal B-tree: O(log_B N) I/Os
  – Insert in kdB-tree: I/Os
Use global rebuilding when structures grow too big/small
  – B-trees not contain elements
  – kdB-trees not contain elements
⇒ O(log_B N) I/Os
Deletes can be handled in O(log_B N) I/Os similarly

Summary: O-Tree
2d range searching in linear space
  – O(√(N/B) + T/B) I/O query
  – O(log_B N) I/O update
Optimal among structures using linear space
Can be extended to work in d dimensions with optimal query bound

Summary/Conclusion: 3 and 4-sided Queries
3-sided 2d range searching: External priority search tree
  – O(log_B N + T/B) query, O(N/B) space, O(log_B N) update
General (4-sided) 2d range searching:
  – External range tree: O(log_B N + T/B) query, O((N/B) log_B N / log_B log_B N) space, O(log_B^2 N / log_B log_B N) update
  – O-tree: O(√(N/B) + T/B) query, O(N/B) space, O(log_B N) update

Summary/Conclusion: Tools and Techniques
Tools:
  – B-trees
  – Persistent B-trees
  – Buffer trees
  – Logarithmic method
  – Weight-balanced B-trees
  – Global rebuilding
Techniques:
  – Bootstrapping
  – Filtering

References
External Memory Geometric Data Structures. Lecture notes by Lars Arge.
  – Sections 7-9