I/O-Algorithms Lars Arge Aarhus University February 16, 2006.

Slides:



Advertisements
Similar presentations
Temporal Databases S. Srinivasa Rao April 12, 2007
Advertisements

Chapter 4: Trees Part II - AVL Tree
I/O-Algorithms Lars Arge Fall 2014 September 25, 2014.
Lars Arge 1/43 Big Terrain Data Analysis Algorithms in the Field Workshop SoCG June 19, 2012 Lars Arge.
B+-Trees (PART 1) What is a B+ tree? Why B+ trees? Searching a B+ tree
Chapter 6: Transform and Conquer
External Memory Geometric Data Structures
I/O-Algorithms Lars Arge January 31, Lars Arge I/O-algorithms 2 Random Access Machine Model Standard theoretical model of computation: –Infinite.
Multiversion Access Methods - Temporal Indexing. Basics A data structure is called : Ephemeral: updates create a new version and the old version cannot.
I/O-Algorithms Lars Arge University of Aarhus February 21, 2005.
I/O-Algorithms Lars Arge Aarhus University February 27, 2007.
I/O-Algorithms Lars Arge Spring 2011 March 8, 2011.
I/O-Algorithms Lars Arge Aarhus University February 13, 2007.
I/O-Algorithms Lars Arge Aarhus University March 16, 2006.
I/O-Algorithms Lars Arge Spring 2009 February 2, 2009.
I/O-Algorithms Lars Arge Spring 2009 January 27, 2009.
I/O-Algorithms Lars Arge Spring 2007 January 30, 2007.
I/O-Algorithms Lars Arge Aarhus University February 7, 2005.
I/O-Algorithms Lars Arge University of Aarhus February 13, 2005.
I/O-Algorithms Lars Arge Spring 2009 April 28, 2009.
I/O-Algorithms Lars Arge University of Aarhus March 1, 2005.
I/O-Algorithms Lars Arge Spring 2009 March 3, 2009.
I/O-Algorithms Lars Arge Aarhus University February 6, 2007.
I/O-Algorithms Lars Arge Spring 2006 February 2, 2006.
Lars Arge1, Mark de Berg2, Herman Haverkort3 and Ke Yi1
I/O-Algorithms Lars Arge Aarhus University March 5, 2008.
I/O-Efficient Structures for Orthogonal Range Max and Stabbing Max Queries Second Year Project Presentation Ke Yi Advisor: Lars Arge Committee: Pankaj.
I/O-Algorithms Lars Arge Aarhus University April 16, 2008.
I/O-Algorithms Lars Arge Aarhus University February 9, 2006.
TTIT33 Algorithms and Optimization – DALG Lecture 6 Jan Maluszynski - HT TTIT33- Algorithms and optimization Algorithms Lecture 5 Balanced Search.
I/O-Algorithms Lars Arge Aarhus University March 9, 2006.
I/O-Algorithms Lars Arge Aarhus University February 14, 2008.
I/O-Algorithms Lars Arge Aarhus University March 6, 2007.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part B Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
I/O-Algorithms Lars Arge University of Aarhus March 7, 2005.
1 Geometric index structures April 15, 2004 Based on GUW Chapter , [Arge01] Sections 1, 2.1 (persistent B- trees), 3-4 (static versions.
B + -Trees (Part 1). Motivation AVL tree with N nodes is an excellent data structure for searching, indexing, etc. –The Big-Oh analysis shows most operations.
1 Database Tuning Rasmus Pagh and S. Srinivasa Rao IT University of Copenhagen Spring 2007 February 8, 2007 Tree Indexes Lecture based on [RG, Chapter.
Homework #3 Due Thursday, April 17 Problems: –Chapter 11: 11.6, –Chapter 12: 12.1, 12.2, 12.3, 12.4, 12.5, 12.7.
I/O-Algorithms Lars Arge Spring 2008 January 31, 2008.
CS4432: Database Systems II
Indexing. Goals: Store large files Support multiple search keys Support efficient insert, delete, and range queries.
I/O-Algorithms Lars Arge Fall 2014 August 28, 2014.
Heavily based on slides by Lars Arge I/O-Algorithms Thomas Mølhave Spring 2012 February 9, 2012.
1 B Trees - Motivation Recall our discussion on AVL-trees –The maximum height of an AVL-tree with n-nodes is log 2 (n) since the branching factor (degree,
Bin Yao Spring 2014 (Slides were made available by Feifei Li) Advanced Topics in Data Management.
External Memory Algorithms for Geometric Problems Piotr Indyk (slides partially by Lars Arge and Jeff Vitter)
B-trees and kd-trees Piotr Indyk (slides partially by Lars Arge from Duke U)
Bin Yao Spring 2014 (Slides were made available by Feifei Li) Advanced Topics in Data Management.
Mehdi Mohammadi March Western Michigan University Department of Computer Science CS Advanced Data Structure.
Lars Arge Presented by Or Ozery. I/O Model Previously defined: N = # of elements in input M = # of elements that fit into memory B = # of elements per.
B + -Trees. Motivation An AVL tree with N nodes is an excellent data structure for searching, indexing, etc. The Big-Oh analysis shows that most operations.
Starting at Binary Trees
1 Tree Indexing (1) Linear index is poor for insertion/deletion. Tree index can efficiently support all desired operations: –Insert/delete –Multiple search.
Lecture 2: External Memory Indexing Structures CS6931 Database Seminar.
IKI 10100: Data Structures & Algorithms Ruli Manurung (acknowledgments to Denny & Ade Azurat) 1 Fasilkom UI Ruli Manurung (Fasilkom UI)IKI10100: Lecture17.
Bin Yao (Slides made available by Feifei Li) R-tree: Indexing Structure for Data in Multi- dimensional Space.
Lecture 11COMPSCI.220.FS.T Balancing an AVLTree Two mirror-symmetric pairs of cases to rebalance the tree if after the insertion of a new key to.
External Memory Geometric Data Structures Lars Arge Duke University June 27, 2002 Summer School on Massive Datasets.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 B+-Tree Index Chapter 10 Modified by Donghui Zhang Nov 9, 2005.
Lecture 1: Basic Operators in Large Data CS 6931 Database Seminar.
COMP 5704 Project Presentation Parallel Buffer Trees and Searching Cory Fraser School of Computer Science Carleton University, Ottawa, Canada
Michal Balas1 I/O-efficient Point Location using Persistent B-Trees Lars Arge, Andrew Danner, and Sha-Mayn Teh Department of Computer Science, Duke University.
Multiway Search Trees Data may not fit into main memory
Advanced Topics in Data Management
Advanced Topics in Data Management
Chapter 6: Transform and Conquer
8th Workshop on Massive Data Algorithms, August 23, 2016
General External Merge Sort
Presentation transcript:

I/O-Algorithms Lars Arge Aarhus University February 16, 2006

Lars Arge I/O-algorithms 2 I/O-Model Parameters N = # elements in problem instance B = # elements that fits in disk block M = # elements that fits in main memory K = # output size in searching problem We often assume that M>B 2 I/O: Movement of block between memory and disk D P M Block I/O

Lars Arge I/O-algorithms 3 Fundamental Bounds Internal External Scanning: N Sorting: N log N Permuting Searching:

Lars Arge I/O-algorithms 4 B-tree B-tree with branching parameter b and leaf parameter k (b,k≥8) –All leaves on same level and contain between 1/4k and k elements –Except for the root, all nodes have degree between 1/4b and b –Root has degree between 2 and b B-tree with leaf parameter –O(N/B) space –Height – amortized leaf rebalance operations – amortized internal node rebalance operations B-tree with branching parameter B c, 0<c≤1, and leaf parameter B –Space O(N/B), updates, queries

Lars Arge I/O-algorithms 5 Secondary Structures When secondary structures used, a rebalance on v often require O(w(v)) I/Os (w(v) is weight of v) –If inserts have to be made below v between operations  O(1) amortized split bound  amortized insert bound Nodes in standard B-tree do not have this property  tree

Lars Arge I/O-algorithms 6 Weight-balanced B-tree Idea: Combination of B-tree and BB[  ]-tree –Weight constraint on nodes instead of degree constraint –Rebalancing performed using split/fuse as in B-tree Weight-balanced B-tree with parameters b and k (b>8, k≥8) –All leaves on same level and contain between k/4 and k elements –Internal node v at level l has w(v) < –Except for the root, internal node v at level l have w(v)> –The root has more than one child level l-1 level l

Lars Arge I/O-algorithms 7 Weight-balanced B-tree Weight-balanced B-tree with branching parameter b and leaf paramater k=Ω(B) –O(N/B) space –Height – rebalancing operations after update –Ω(w(v)) updates below v between consecutive operations on v Weight-balanced B-tree with branching parameter B c and leaf parameter B –Updates in and queries in I/Os Construction bottom-up in I/O level l-1 level l

Lars Arge I/O-algorithms 8 Persistent B-tree Update current version (getting new version) Query all versions N is number of updates performed –O(N/B) space – update – query in any version

Lars Arge I/O-algorithms 9 Persistent B-tree Idea: Elements augmented with “existence interval” and stored in one structure Persistent B-tree with parameter b (>16): –Directed graph *Nodes contain elements augmented with existence interval *At any time t, nodes with elements alive at time t form B-tree with leaf and branching parameter b –B-tree with leaf and branching parameter b on indegree 0 node  If b=B: –Query at any time t in I/Os

Lars Arge I/O-algorithms 10 B-tree Construction In internal memory we can sort N elements in O(N log N) time using a balanced search tree: –Insert all elements one-by-one (construct tree) –Output in sorted order using in-order traversal Same algorithm using B-tree use I/Os –A factor of non-optimal As discussed we could build B-tree bottom-up in I/Os –But what about persistent B-tree? –In general we would like to have dynamic data structure to use in algorithms  I/O operations

Lars Arge I/O-algorithms 11 Main idea: Logically group nodes together and add buffers –Insertions done in a “lazy” way – elements inserted in buffers. –When a buffer runs full elements are pushed one level down. –Buffer-emptying in O(M/B) I/Os  every block touched constant number of times on each level  inserting N elements (N/B blocks) costs I/Os. Buffer-tree Technique B B M elements fan-out M/B

Lars Arge I/O-algorithms 12 Definition: –B-tree with branching parameter and leaf parameter B –Size M buffer in each internal node Updates: –Add time-stamp to insert/delete element –Collect B elements in memory before inserting in root buffer –Perform buffer-emptying when buffer runs full Basic Buffer-tree $m$ blocks M B

Lars Arge I/O-algorithms 13 Basic Buffer-tree Note: –Buffer can be larger than M during recursive buffer-emptying *Elements distributed in sorted order  at most M elements in buffer unsorted –Rebalancing needed when “leaf-node” buffer emptied *Leaf-node buffer-emptying only performed after all full internal node buffers are emptied $m$ blocks M B

Lars Arge I/O-algorithms 14 Basic Buffer-tree Internal node buffer-empty: –Load first M (unsorted) elements into memory and sort them –Merge elements in memory with rest of (already sorted) elements –Scan through sorted list while *Removing “matching” insert/deletes *Distribute elements to child buffers –Recursively empty full child buffers Emptying buffer of size X takes O(X/B+M/B)=O(X/B) I/Os $m$ blocks M

Lars Arge I/O-algorithms 15 Basic Buffer-tree Buffer-empty of leaf node with K elements in leaves –Sort buffer as previously –Merge buffer elements with elements in leaves –Remove “matching” insert/deletes obtaining K’ elements –If K’<K then *Add K-K’ “dummy” elements and insert in “dummy” leaves Otherwise *Place K elements in leaves *Repeatedly insert block of elements in leaves and rebalance Delete dummy leaves and rebalance when all full buffers emptied K

Lars Arge I/O-algorithms 16 Basic Buffer-tree Invariant: Buffers of nodes on path from root to emptied leaf-node are empty  Insert rebalancing (splits) performed as in normal B-tree Delete rebalancing: v’ buffer emptied before fuse of v –Necessary buffer emptyings performed before next dummy- block delete –Invariant maintained v v v’ v v’’

Lars Arge I/O-algorithms 17 Basic Buffer-tree Analysis: –Not counting rebalancing, a buffer-emptying of node with X ≥ M elements (full) takes O(X/B) I/Os  total full node emptying cost I/Os –Delete rebalancing buffer-emptying (non-full) takes O(M/B) I/Os  cost of one split/fuse O(M/B) I/Os –During N updates *O(N/B) leaf split/fuse * internal node split/fuse  Total cost of N operations: I/Os

Lars Arge I/O-algorithms 18 Basic Buffer-tree Emptying all buffers after N insertions: Perform buffer-emptying on all nodes in BFS-order  resulting full-buffer emptyings cost I/Os empty non-full buffers using O(M/B)  O(N/B) I/Os  N elements can be sorted using buffer tree in I/Os $m$ blocks M B

Lars Arge I/O-algorithms 19 Batching of operations on B-tree using M-sized buffers – I/O updates amortized –All buffers emptied in I/Os One-dim. rangesearch operations can also be supported in I/Os amortized –Search elements handle lazily like updates –All elements in relevant sub-trees reported during buffer-emptying –Buffer-emptying in O(X/B+T’/B), where T’ is reported elements Using buffer technique persistent B-tree built in I/O Summary/Conclusion: Buffer-tree $m$ blocks

Lars Arge I/O-algorithms 20 Basic buffer tree can be used in external priority queue To delete minimal element: –Empty all buffers on leftmost path –Delete elements in leftmost leaf and keep in memory –Deletion of next M minimal elements free –Inserted elements checked against minimal elements in memory I/Os every O(M) delete  amortized Buffered Priority Queue B

Lars Arge I/O-algorithms 21 Other External Priority Queues Buffer technique can be used on other priority queue structures –Heap –Tournament tree Priority queue supporting update often used in graph algorithms – on tournament tree –Major open problem to do it in I/Os Worst case efficient priority queue has also been developed –B operations require I/Os

Lars Arge I/O-algorithms 22 Other Buffer-tree Technique Results Attaching  (B) size buffers to normal B-tree can also be use to improve update bound Buffered segment tree –Has been used in batched range searching and rectangle intersection algorithm Has been used on String B-tree to obtain I/O-efficient string sorting algorithms

Lars Arge I/O-algorithms 23 Summary/Conclusions: Fund. Data Structures B-tree –O(N/B) space, O(log B N) update, O(log B N+T/B) query Weight-balanced B-tree –Ω(w(v)) updates below v between consecutive operations on v Persistent B-tree –Query in any previous version Buffer tree –Batching of operations to obtain bounds

Lars Arge I/O-algorithms 24 References External Memory Geometric Data Structures Lecture notes by Lars Arge. –Section 5

Lars Arge I/O-algorithms 25 Flow Accumulation Flow accumulation on grid terrain model: –Initially one unit of water in each grid cell –Water (initial and received) distributed from each cell to lowest lower neighbor cell (if existing) –Flow accumulation of cell is total flow through it Flow accumulation (in a more general form) a basic index used in environmental sciences (e.g. to used to compute drainage network)

Lars Arge I/O-algorithms 26 Computing Flow Accumulation Process (sweep) points by decreasing height. At each cell: –Read flow from flow grid and neighbor heights from height grid –Update flow (flow grid) for downslope neighbors  One sweep  O(N log N) time algorithm

Lars Arge I/O-algorithms 27 Flow Accumulation Computed for Appalachian Mountains (800km x 800km) by Duke University environmental researchers –100m resolution  ~ 64M cells  ~128MB raw data (~500MB processing)  14 days (on 512MB machine) – ~ 1.2GB at 30m resolution, ~12GB at 10m, ~1.2TB at 1m Problem: Cells of same height distributed over the terrain  scattered access to flow grid and height grid  Ω(N) I/Os

Lars Arge I/O-algorithms 28 I/O-Efficient Flow Accumulation Eliminating height grid scattered accesses –Augment each cell with height of 8 neighbors Eliminating flow grid scattered accesses –Utilize that flow to neighbor cell is only needed when sweep reaches its elevation: *Distribute flow by inserting element in priority queue with priority equal to neighbor’s height (and grid position) *Flow of cell obtained using DeleteMin operations  Turns O(N) grid accesses into O(N) priority queue operations  algorithms Appalachian Mountains in 3 hours!