Lecture 2: External Memory Indexing Structures CS6931 Database Seminar.

Slides:



Advertisements
Similar presentations
Augmenting Data Structures Advanced Algorithms & Data Structures Lecture Theme 07 – Part I Prof. Dr. Th. Ottmann Summer Semester 2006.
Advertisements

I/O-Algorithms Lars Arge Fall 2014 September 25, 2014.
B+-Trees (PART 1) What is a B+ tree? Why B+ trees? Searching a B+ tree
B+-trees. Model of Computation Data stored on disk(s) Minimum transfer unit: a page = b bytes or B records (or block) N records -> N/B = n pages I/O complexity:
External Memory Geometric Data Structures
Indexes. Primary Indexes Dense Indexes Pointer to every record of a sequential file, (ordered by search key). Can make sense because records may be much.
Indexes. Primary Indexes Dense Indexes Pointer to every record of a sequential file, (ordered by search key). Can make sense because records may be much.
COMP 451/651 Indexes Chapter 1.
Spatial Indexing I Point Access Methods. PAMs Point Access Methods Multidimensional Hashing: Grid File Exponential growth of the directory Hierarchical.
I/O-Algorithms Lars Arge University of Aarhus February 21, 2005.
1 Advanced Database Technology Anna Östlin Pagh and Rasmus Pagh IT University of Copenhagen Spring 2004 March 4, 2004 INDEXING II Lecture based on [GUW,
I/O-Algorithms Lars Arge Aarhus University February 27, 2007.
I/O-Algorithms Lars Arge Spring 2011 March 8, 2011.
CSE332: Data Abstractions Lecture 9: B Trees Dan Grossman Spring 2010.
Multiple-key indexes Index on one attribute provides pointer to an index on the other. If V is a value of the first attribute, then the index we reach.
B+-tree and Hashing.
I/O-Algorithms Lars Arge Aarhus University February 13, 2007.
I/O-Algorithms Lars Arge Aarhus University March 16, 2006.
I/O-Algorithms Lars Arge Spring 2009 February 2, 2009.
I/O-Algorithms Lars Arge Spring 2009 January 27, 2009.
I/O-Algorithms Lars Arge Spring 2007 January 30, 2007.
I/O-Algorithms Lars Arge Aarhus University February 16, 2006.
I/O-Algorithms Lars Arge Aarhus University February 7, 2005.
I/O-Algorithms Lars Arge University of Aarhus February 13, 2005.
I/O-Algorithms Lars Arge Spring 2009 April 28, 2009.
I/O-Algorithms Lars Arge University of Aarhus March 1, 2005.
I/O-Algorithms Lars Arge Spring 2009 March 3, 2009.
I/O-Algorithms Lars Arge Aarhus University February 6, 2007.
Course Review COMP171 Spring Hashing / Slide 2 Elementary Data Structures * Linked lists n Types: singular, doubly, circular n Operations: insert,
Lars Arge1, Mark de Berg2, Herman Haverkort3 and Ke Yi1
I/O-Algorithms Lars Arge Aarhus University March 5, 2008.
I/O-Efficient Structures for Orthogonal Range Max and Stabbing Max Queries Second Year Project Presentation Ke Yi Advisor: Lars Arge Committee: Pankaj.
I/O-Algorithms Lars Arge Aarhus University February 9, 2006.
I/O-Algorithms Lars Arge Aarhus University March 9, 2006.
I/O-Algorithms Lars Arge Aarhus University February 14, 2008.
I/O-Algorithms Lars Arge Aarhus University March 6, 2007.
I/O-Algorithms Lars Arge University of Aarhus March 7, 2005.
1 Geometric index structures April 15, 2004 Based on GUW Chapter , [Arge01] Sections 1, 2.1 (persistent B- trees), 3-4 (static versions.
B + -Trees (Part 1). Motivation AVL tree with N nodes is an excellent data structure for searching, indexing, etc. –The Big-Oh analysis shows most operations.
B + -Trees (Part 1) COMP171. Slide 2 Main and secondary memories  Secondary storage device is much, much slower than the main RAM  Pages and blocks.
Primary Indexes Dense Indexes
1 Database Tuning Rasmus Pagh and S. Srinivasa Rao IT University of Copenhagen Spring 2007 February 8, 2007 Tree Indexes Lecture based on [RG, Chapter.
Spatial Indexing I Point Access Methods. Spatial Indexing Point Access Methods (PAMs) vs Spatial Access Methods (SAMs) PAM: index only point data Hierarchical.
Spatial Indexing I Point Access Methods. Spatial Indexing Point Access Methods (PAMs) vs Spatial Access Methods (SAMs) PAM: index only point data Hierarchical.
AALG, lecture 11, © Simonas Šaltenis, Range Searching in 2D Main goals of the lecture: to understand and to be able to analyze the kd-trees and.
Indexing. Goals: Store large files Support multiple search keys Support efficient insert, delete, and range queries.
Heavily based on slides by Lars Arge I/O-Algorithms Thomas Mølhave Spring 2012 February 9, 2012.
 B+ Tree Definition  B+ Tree Properties  B+ Tree Searching  B+ Tree Insertion  B+ Tree Deletion.
1 B Trees - Motivation Recall our discussion on AVL-trees –The maximum height of an AVL-tree with n-nodes is log 2 (n) since the branching factor (degree,
UNC Chapel Hill M. C. Lin Orthogonal Range Searching Reading: Chapter 5 of the Textbook Driving Applications –Querying a Database Related Application –Crystal.
External Memory Algorithms for Geometric Problems Piotr Indyk (slides partially by Lars Arge and Jeff Vitter)
14/13/15 CMPS 3130/6130 Computational Geometry Spring 2015 Windowing Carola Wenk CMPS 3130/6130 Computational Geometry.
B-trees and kd-trees Piotr Indyk (slides partially by Lars Arge from Duke U)
Bin Yao Spring 2014 (Slides were made available by Feifei Li) Advanced Topics in Data Management.
Lars Arge Presented by Or Ozery. I/O Model Previously defined: N = # of elements in input M = # of elements that fit into memory B = # of elements per.
2IL50 Data Structures Fall 2015 Lecture 9: Range Searching.
Starting at Binary Trees
1 Tree Indexing (1) Linear index is poor for insertion/deletion. Tree index can efficiently support all desired operations: –Insert/delete –Multiple search.
Bin Yao (Slides made available by Feifei Li) R-tree: Indexing Structure for Data in Multi- dimensional Space.
Lecture 11COMPSCI.220.FS.T Balancing an AVLTree Two mirror-symmetric pairs of cases to rebalance the tree if after the insertion of a new key to.
1 CPS216: Data-intensive Computing Systems Operators for Data Access (contd.) Shivnath Babu.
Indexes. Primary Indexes Dense Indexes Pointer to every record of a sequential file, (ordered by search key). Can make sense because records may be much.
Review for Exam 2 Topics covered (since exam 1): –Splay Tree –K-D Trees –RB Tree –Priority Queue and Binary Heap –B-Tree For each of these data structures.
Lecture 3: External Memory Indexing Structures (Contd) CS6931 Database Seminar.
External Memory Geometric Data Structures Lars Arge Duke University June 27, 2002 Summer School on Massive Datasets.
Advanced Topics in Data Management
B+-Trees and Static Hashing
B+Trees The slides for this text are organized into chapters. This lecture covers Chapter 9. Chapter 1: Introduction to Database Systems Chapter 2: The.
CPS216: Advanced Database Systems
Tree-Structured Indexes
Presentation transcript:

Lecture 2: External Memory Indexing Structures CS6931 Database Seminar

External Memory Data Structures Names: –I/O-efficient data structures –Disk-based data structures (index structures) used in DB –Disk-resilient data structures (index structures) used in DB –Secondary indexes used in DB Other Data structures –Queue, stack *O(N/B) space, O(1/B) push, O(1/B) pop –Priority queue *O(N/B) space, O(1/B ∙ log M/B N/B) insert, delete-max Mainly used in algorithms

External Memory Data Structures General-purpose data structures –Space: linear or near-linear (very important) –Query: logarithmic in B or 2 for any query (very important) –Update: logarithmic in B or 2 (important) In some sense, more useful than I/O-algorithms –Structure stored in disk most of the time –DB typically maintains many data structures for many different data sets: can’t load all of them to memory –Nearly all index structures in large DB are disk based

–If nodes stored arbitrarily on disk  Search in I/Os  Rangesearch in I/Os Binary search tree: –Standard method for search among N elements –We assume elements in leaves –Search traces at least one root-leaf path External Search Trees

Bottom-up BFS blocking: –Block height –Output elements blocked  Range query in I/Os Optimal: O(N/B) space and query

Maintaining BFS blocking during updates? –Balance normally maintained in search trees using rotations Seems very difficult to maintain BFS blocking during rotation –Also need to make sure output (leaves) is blocked! External Search Trees x y x y

B-trees BFS-blocking naturally corresponds to tree with fan-out B-trees balanced by allowing node degree to vary –Rebalancing performed by splitting and merging nodes

(a,b)-tree uses linear space and has height  Choosing a,b = each node/leaf stored in one disk block   space and query (a,b)-tree T is an (a,b)-tree (a≥2 and b≥2a-1) –All leaves on the same level (contain between a and b elements) –Except for the root, all nodes have degree between a and b –Root has degree between 2 and b  tree

(a,b)-Tree Insert Insert: Search and insert element in leaf v DO v has b+1 elements/children Split v: make nodes v’ and v’’ with and elements insert element (ref) in parent(v) (make new root if necessary) v=parent(v) Insert touch nodes v v’v’’

(a,b)-Tree Insert

(a,b)-Tree Delete Delete: Search and delete element from leaf v DO v has a-1 elements/children Fuse v with sibling v’: move children of v’ to v delete element (ref) from parent(v) (delete root if necessary) If v has >b (and ≤ a+b-1<2b) children split v v=parent(v) Delete touch nodes v v

(a,b)-Tree Delete

13 External Searching: B-Tree Each node (except root) has fan-out between B/2 and B Size: O(N/B) blocks on disk Search: O(log B N) I/Os following a root-to-leaf path Insertion and deletion: O(log B N) I/Os

Summary/Conclusion: B-tree B-trees: (a,b)-trees with a,b = –O(N/B) space –O(log B N+T/B) query –O(log B N) update B-trees with elements in the leaves sometimes called B + -tree –Now B-tree and B + tree are synonyms Construction in I/Os –Sort elements and construct leaves –Build tree level-by-level bottom-up

2D Range Searching q3q3 q2q2 q1q1 q4q4

Quadtree No worst-case bound! Hard to block!

kd-tree kd-tree: –Recursive subdivision of point-set into two half using vertical/horizontal line –Horizontal line on even levels, vertical on uneven levels –One point in each leaf  Linear space and logarithmic height

kd-Tree: Query Query –Recursively visit nodes corresponding to regions intersecting query –Report point in trees/nodes completely contained in query Query analysis –Horizontal line intersect Q(N) = 2+2Q(N/4) = regions –Query covers T regions  I/Os worst-case

kdB-tree kdB-tree: –Bottom-up BFS blocking –Same as B-tree Query as before –Analysis as before but each region now contains Θ(B) points  I/O query

Construction of kdB-tree Simple algorithm –Find median of y-coordinates (construct root) –Distribute point based on median –Recursively build subtrees –Construct BFS-blocking top-down (can compute the height in advance) Idea in improved algorithm –Construct levels at a time using O(N/B) I/Os

Construction of kdB-tree Sort N points by x- and by y-coordinates using I/Os Building levels ( nodes) in O(N/B) I/Os: 1. Construct by grid with points in each slab 2. Count number of points in each grid cell and store in memory 3. Find slab s with median x-coordinate 4. Scan slab s to find median x-coordinate and construct node 5. Split slab containing median x-coordinate and update counts 6. Recurse on each side of median x-coordinate using grid (step 3)  Grid grows to during algorithm  Each node constructed in I/Os

kdB-tree kdB-tree: –Linear space –Query in I/Os –Construction in O(sort(N)) I/Os –Height Dynamic? –Difficult to do splits/merges or rotations …