1 Database Tuning Rasmus Pagh and S. Srinivasa Rao IT University of Copenhagen Spring 2007 February 8, 2007 Tree Indexes Lecture based on [RG, Chapter.

Presentation transcript:

1 Database Tuning Rasmus Pagh and S. Srinivasa Rao IT University of Copenhagen Spring 2007 February 8, 2007 Tree Indexes Lecture based on [RG, Chapter 10] and [Pagh03, ] Slides based on Notes 04: Indexing for Stanford CS 245, fall 2002 by Hector Garcia-Molina

2 Today
• Indexes
• Primary, secondary, dense, sparse
• B-trees
• Analysis of B-trees
• B-tree variants and extensions

3 Why indexing?
• Support queries like these more efficiently:
  SELECT * FROM R WHERE a = 11
  SELECT * FROM R WHERE 8 <= b AND b < 42
• Indexing an attribute (or set of attributes) speeds up finding tuples with specific values.
• Goal of an index: look at as few blocks as possible to find the matching record(s).

4 Sequential files
• Store the relation in sorted order according to the search key.
• Search: binary search, using a number of I/Os logarithmic in the number of blocks used by the relation.
• Drawback: expensive to maintain.
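
A minimal sketch of the block-level binary search described on this slide, assuming the file is modelled as a Python list of sorted blocks and each list access stands for one block read. The function name `find_block` and the block sizes are illustrative assumptions, not part of the lecture.

```python
def find_block(blocks, key):
    """Binary search over sorted blocks; returns (block index or None, reads)."""
    lo, hi = 0, len(blocks) - 1
    reads = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]          # one simulated block read (I/O)
        reads += 1
        if key < block[0]:
            hi = mid - 1             # key can only be in an earlier block
        elif key > block[-1]:
            lo = mid + 1             # key can only be in a later block
        else:
            return (mid if key in block else None), reads
    return None, reads

# Hypothetical file: 1024 blocks of 4 even keys each (0, 2, 4, ...).
blocks = [[2 * (4 * b + j) for j in range(4)] for b in range(1024)]
idx, reads = find_block(blocks, 8002)
```

With 1024 blocks, the loop reads at most 11 blocks, which is the logarithmic behaviour the slide refers to.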

5 Primary and secondary indexes
• In a primary index, records are stored in an order determined by the search key, e.g. sequentially.
• A relation can have at most one primary index. (Often on the primary key.)
• A secondary index cannot take advantage of any specific order, hence it has to be dense.
• A secondary index can have a second, sparse level.

6 Dense index
• For each record, store the key and a pointer to the record in the sequential file.
• Why? The index uses less space than the file, hence it takes less time to search. Time (I/Os) is logarithmic in the number of blocks used by the index.
• Need not access the data for some kinds of queries.
• Can also be used as a secondary index, i.e. with another order of records.

7 Sparse index
• Store the first value in each block of the sequential file and a pointer to the block.
• Uses even less space than a dense index, but the block has to be searched, even for unsuccessful searches.
• Time (I/Os) is logarithmic in the number of blocks used by the index.
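
The sparse-index lookup can be sketched as follows, assuming the sequential file is a list of sorted blocks and the index is an in-memory list of (first key, block number) pairs. The helper names are hypothetical.

```python
from bisect import bisect_right

def build_sparse_index(blocks):
    """One (first key, block number) entry per block of the sorted file."""
    return [(block[0], i) for i, block in enumerate(blocks)]

def sparse_lookup(index, blocks, key):
    """Return the block number holding key, or None if key is absent."""
    first_keys = [k for k, _ in index]
    pos = bisect_right(first_keys, key) - 1   # last entry with first key <= key
    if pos < 0:
        return None                           # key is smaller than every stored key
    block_no = index[pos][1]
    # The data block itself must be read, even when the search fails.
    return block_no if key in blocks[block_no] else None

blocks = [[10, 20], [30, 40], [50, 60]]       # illustrative data
index = build_sparse_index(blocks)
```

Looking up 35 still has to inspect block 1 before reporting a miss, which is the cost the slide mentions for unsuccessful searches.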

8 Multiple levels of indexes
• If an index is small enough, it can be stored in internal memory. Only one I/O is used.
• If the index is too large, an index of the index can be used.
• Generalize, and you have a B-tree. The top-level index has size equal to one block.
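
A rough sketch of building index levels until the top level fits in one block. It assumes, for simplicity, that a data block and an index block both hold `per_block` entries; the function name and sizes are illustrative.

```python
def build_levels(keys, per_block):
    """Return the index levels, bottom first; the last one fits in one block."""
    levels = []
    current = sorted(keys)
    while len(current) > per_block:
        # Sparse level: keep the first key of each block of the level below.
        current = current[::per_block]
        levels.append(current)
    return levels

levels = build_levels(range(1000), per_block=10)
sizes = [len(level) for level in levels]
```

For 1000 keys and 10 entries per block this gives two index levels with 100 and 10 entries, so the top level occupies a single block.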

9 B-trees
• Can be seen as a general form of multi-level indexes.
• Generalize the usual (binary) search trees.
• Allow efficient insertions and deletions at the expense of using slightly more space (than sequential files).
• Popular variant: B+-tree

10 B+-tree example
[Figure: example B+-tree with the root marked]
Each node is stored in one disk block.

11 Sample internal node
[Figure: internal node with keys 57, 81, 95 and four pointers: to keys k < 57, to keys 57 ≤ k < 81, to keys 81 ≤ k < 95, and to keys k ≥ 95]

12 Sample leaf node
[Figure: leaf with keys 57, 81, 85. One pointer arrives from an internal node, one sequence pointer leads to the next leaf, and the remaining pointers lead to the records with keys 57, 81 and 85]
Alternative: records in leaves

13 Searching a B+-tree
[Figure: search path for the tuple with key 101]
Question: How does one search for a range of keys?
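
One possible answer to the range question, as a sketch: descend once to the leaf that could hold the lower bound, then follow the sequence pointers. The classes and the small tree below are hypothetical and are not the tree from the slide's figure; record pointers are omitted.

```python
from bisect import bisect_right

class Leaf:
    def __init__(self, keys):
        self.keys = keys          # sorted keys (record pointers omitted)
        self.next = None          # sequence pointer to the next leaf

class Internal:
    def __init__(self, keys, children):
        self.keys = keys          # len(children) == len(keys) + 1
        self.children = children

def find_leaf(node, key):
    """Follow one pointer per level down to the leaf that could hold key."""
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.keys, key)]
    return node

def range_search(root, lo, hi):
    """Keys k with lo <= k <= hi: descend once, then walk sequence pointers."""
    leaf, out = find_leaf(root, lo), []
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out
            if k >= lo:
                out.append(k)
        leaf = leaf.next
    return out

l1, l2, l3 = Leaf([3, 5, 11]), Leaf([30, 35]), Leaf([100, 101, 110])
l1.next, l2.next = l2, l3
root = Internal([30, 100], [l1, l2, l3])
```

The cost is one root-to-leaf path plus one block read per leaf that overlaps the range.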

14 B+-tree invariants on nodes
• Suppose a node (stored in a block) has space for n keys and n+1 pointers.
• Don't want a block to be too empty: it should have at least ⌈(n+1)/2⌉ non-null pointers. (Different from the textbook (RG) notation!)
• Exception: the root, which may have only 2 non-null pointers (only 1 key).

15 Other B+-tree invariants
(1) All leaves are at the same lowest level (perfectly balanced tree).
(2) Pointers in leaves point to records, except for the sequence pointer.

16 Problem session: Analysis of B+-trees
• What is the height of a B+-tree with N leaves and room for n pointers in a node?

17 Insertion into B+-tree
(a) simple case: space available in leaf
(b) leaf overflow
(c) non-leaf overflow
(d) new root
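
Cases (a) and (b) can be sketched for a single leaf as below. This is a simplified illustration under stated assumptions: a leaf is a sorted list of at most n keys, record pointers are omitted, and on overflow the first key of the new right leaf is copied up to the parent. The function name and example keys are hypothetical.

```python
from bisect import insort

def insert_into_leaf(keys, key, n):
    """Insert key into a sorted leaf holding at most n keys.

    Returns (left, right, separator). right and separator are None when the
    leaf had room (case a). On overflow (case b) the n+1 keys are split and
    the first key of the new right leaf becomes the separator for the parent.
    """
    keys = list(keys)
    insort(keys, key)
    if len(keys) <= n:
        return keys, None, None
    mid = (len(keys) + 1) // 2        # left leaf keeps the larger half
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]

simple = insert_into_leaf([30, 31], 32, n=3)
overflow = insert_into_leaf([3, 5, 11], 7, n=3)
```

Cases (c) and (d) apply the same split to internal nodes, except that the separator moves up rather than being copied; they are not covered by this sketch.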

18 (a) Insert key = 32 n=

19 (b) Insert key = 7 n=

20 (c) Insert key = 160 n=

21 (d) New root, insert 45 n= new root

22 Deletion from B+-tree
(a) Simple case (no example)
(b) Coalesce with neighbour (sibling)
(c) Re-distribute keys
(d) Cases (b) or (c) at non-leaf

23 (b) Coalesce with sibling - Delete 50 n=

24 (c) Redistribute keys - Delete 50 n=

25 (d) Non-leaf coalesce - Delete 37 n= new root

26 Alternative B+-tree deletion
• In practice, coalescing is often not implemented (hard, and often not worth it).
• An alternative is to mark deleted nodes.
• Periodic global rebuilding may be used to remove marked nodes when they start taking too much space.
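
A toy sketch of marking plus global rebuilding. It assumes a sorted key list stands in for the leaf level, marks individual keys rather than nodes, and rebuilds once more than a chosen fraction of the entries is marked. The class name and the threshold are illustrative assumptions.

```python
class MarkedIndex:
    """Sorted key list standing in for the leaf level of a B+-tree."""

    def __init__(self, keys, max_dead_fraction=0.5):
        self.keys = sorted(keys)
        self.dead = set()                       # keys marked as deleted
        self.max_dead_fraction = max_dead_fraction

    def delete(self, key):
        if key in self.keys and key not in self.dead:
            self.dead.add(key)                  # mark only; no rebalancing
            if len(self.dead) > self.max_dead_fraction * len(self.keys):
                self.rebuild()

    def contains(self, key):
        return key in self.keys and key not in self.dead

    def rebuild(self):
        """Global rebuilding: drop the marked keys and start afresh."""
        self.keys = [k for k in self.keys if k not in self.dead]
        self.dead.clear()

idx = MarkedIndex(range(4))
idx.delete(0)
size_after_one_delete = len(idx.keys)   # still 4: the key is only marked
idx.delete(1)
idx.delete(2)                           # third mark exceeds half, triggers rebuild
```

The rebuild cost is spread over the deletions that triggered it, which is why skipping coalescing is often acceptable in practice.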

27 Problem session: Analysis of B+-trees
• What is the worst-case I/O cost of
  - searching?
  - inserting and deleting?

28 B+-tree summary
• Height ≤ 1 + log_{n/2} N, typically 3 or 4.
• Best search time we could hope for!
• By keeping the top node(s) in memory, the number of I/Os can be reduced.
• Updates: same cost as search, except for rebalancing.
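
To see why the height is typically 3 or 4, the bound can be evaluated for plausible sizes. The values of N and n below are hypothetical, and the function simply evaluates 1 + log_{n/2} N without rounding.

```python
import math

def height_bound(num_leaves, n):
    """Evaluate the bound 1 + log_{n/2}(N) for N leaves and n pointers per node."""
    return 1 + math.log(num_leaves, n / 2)

h_million = height_bound(10**6, 256)       # a million leaves, 256 pointers per node
h_hundred_million = height_bound(10**8, 256)
```

Even a hundredfold increase in the number of leaves adds less than one level in this example.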

29 Problem session
• Prove that B-trees are optimal in terms of search time among pointer-based indexes. That is, suppose we want to search among N keys, that internal memory can hold M keys/pointers, and that a disk block can hold n keys/pointers. Further, suppose that the only way of accessing disk blocks is by following pointers. Show that a search takes at least ⌈log_n (N/M)⌉ I/Os in the worst case.
• Hint: Consider the sizes of the set of blocks that can be accessed in at most t I/Os.
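
One way to follow the hint is the counting sketch below. It assumes that every slot, in memory or in a block, holds either one key or one pointer, and it writes K(t) for the largest number of keys that can be reached with at most t I/Os; K(t) is notation introduced here, not in the lecture.

```latex
% Sketch under the assumption that each slot holds one key or one pointer.
\begin{align*}
K(0) &\le M
  && \text{(slots in internal memory)} \\
K(t) &\le n \cdot K(t-1)
  && \text{(following a pointer costs one I/O and exposes at most $n$ slots)} \\
\Rightarrow\quad K(t) &\le M n^{t} \\
N \le K(t) \le M n^{t}
  &\;\Longrightarrow\; t \ge \log_n (N/M)
  && \text{(all $N$ keys must be reachable)}
\end{align*}
```

Since t is an integer, some search needs at least the ceiling of log_n (N/M) I/Os.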

30 Sorting using B-trees
• In internal memory, sorting can be done in O(N log N) time by inserting the N keys into a balanced search tree.
• The number of I/Os for sorting by inserting into a B-tree is O(N log_B N).
• This is more than a factor B slower than multiway mergesort (Feb 22 lecture).
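
A back-of-the-envelope comparison, ignoring constant factors. The mergesort estimate (N/B) log_{M/B}(N/B) is the standard external-memory sorting bound and is not stated on the slide; the sizes N, M and B are hypothetical and counted in keys.

```python
import math

def btree_sort_ios(N, B):
    """Order-of-magnitude I/O count N * log_B(N) for N B-tree insertions."""
    return N * math.log(N, B)

def mergesort_ios(N, M, B):
    """Standard multiway mergesort estimate (N/B) * log_{M/B}(N/B)."""
    return (N / B) * math.log(N / B, M / B)

N, M, B = 10**9, 10**6, 10**3
ratio = btree_sort_ios(N, B) / mergesort_ios(N, M, B)
```

For these sizes the ratio comes out around 1500, which is above B = 1000 and consistent with the slide's claim.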

31 Next: Buffering in B-trees
• Based on slides by Gerth Brodal, covering a paper published in 2003 at the SODA conference.
• Using buffering techniques could be the next big thing in DB indexing.
• A nice thesis subject!

32 More on rebalancing
• "It will be a rare event that calls for splitting or merging of blocks" (GUW, page 645).
• This is true (in particular at the top levels), but a little hard to see.
• It is easier to see for weight-balanced B-trees.

33 Weight-balanced B-trees (based on [Pagh03], where n corresponds to B/2)
• Remove the B+-tree invariant: there must be ⌈(n+1)/2⌉ non-null pointers in a node.
• Add a new weight invariant: a node at height i must have weight (number of leaves in the subtree below) between (n/4)^i and 4(n/4)^i. (Again, the root is an exception.)

34 Weight-balanced B-trees
Consequences of the weight invariant:
• Tree height is ≤ 1 + log_{n/4} N (almost the same).
• A node at height i with weight, e.g., 2(n/4)^i will not need rebalancing until there have been at least (n/4)^i updates in its subtree. (Why?)
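
A small numeric sketch of the "(Why?)": each update changes the weight of a node by at most one, so the distance to the nearer bound of the invariant limits how soon rebalancing can be needed. The function names and the values n = 64, i = 2 are illustrative.

```python
def weight_bounds(n, i):
    """Allowed weight range for a non-root node at height i."""
    low = (n / 4) ** i
    return low, 4 * low

def updates_absorbed(weight, n, i):
    """Updates the node can absorb while still satisfying the invariant."""
    low, high = weight_bounds(n, i)
    return min(weight - low, high - weight)

low, high = weight_bounds(64, 2)            # (n/4)^i = 256, 4(n/4)^i = 1024
slack = updates_absorbed(2 * low, 64, 2)    # node of weight 2(n/4)^i
```

Here the node of weight 512 is 256 deletions from the lower bound and 512 insertions from the upper bound, so at least (n/4)^i = 256 updates must happen first.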

35 Rebalancing weight
[Figure: node with subtrees labelled A, B, Y, Z]
New insertion in subtree. More than 4(n/4)^i leaves in subtree ⇒ weight balance invariant violated.

36 Rebalancing weight
[Figure: node with subtrees labelled A, B, Y, Z]
The node is split into two nodes of weight around 2(n/4)^i, i.e., far from violating the invariant (details in [Pagh03]).

37 Weight-balanced B-trees
Summary of properties:
• Deletions are similar to insertions (or: use marking and global rebuilding).
• Search in time O(log_n N).
• A node at height i is rebalanced (costing O(1) I/Os) once for every Ω((n/4)^i) updates in its subtree.

38 Other kinds of B-trees
• String B-trees: fast searches even if keys span many blocks. (April 19 lecture.)
• Persistent B-trees: make searches in any previous version of the tree, e.g. "find x at time t". The time for a search is O(log_B N), where N is the total number of keys inserted in the tree. (April 12 lecture.)

39 Summary
• Indexing is a "key" database technology.
• Conventional indexes (when few updates).
• B-trees (and variants) are more flexible.
  - The choice of most DBMSs.
  - Range queries.
  - Deterministic/reliable.
• Theoretically "optimal": O(log_B N) I/Os per operation.
• Buffering can be used to achieve fast updates, at the cost of increasing the height of the tree.