CPSC 231 B-Trees (D.H.)
LEARNING OBJECTIVES
Problems with simple indexing.
Multilevel indexing: B-Tree.
–B-Tree creation: insertion and deletion of nodes.
–B-Tree complexity.
–Merging and redistribution.
–Advantages of B-Trees.
Problems related to storing large indexes on the disk
If the index file is too large to be kept in main memory, then it has to be stored on the disk.
If the index file has to be stored on the disk, then:
–searching the index should be faster than binary search
–inserting and deleting records in the index must be as fast as searching it
Indexing with Binary Search Trees
There are two problems with this approach if the index is large and has to be kept on the disk:
–Binary searching requires too many seeks
–Keeping an index in sorted order is very expensive
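To make "too many seeks" concrete, the sketch below counts the worst-case number of probes a binary search needs over a sorted index. It assumes that every probe lands on a different disk block and therefore costs one seek; the function name is illustrative and not taken from the course text.

```python
def binary_search_probes(n_keys: int) -> int:
    """Worst-case number of probes for binary search over n_keys sorted keys.

    Equals floor(log2(n_keys)) + 1, computed exactly with integer arithmetic.
    Assumption: each probe costs one disk seek.
    """
    return n_keys.bit_length()

print(binary_search_probes(1_000_000))  # 20
```

Under that assumption, an index of one million keys may need up to 20 seeks per lookup, well above the few seeks a disk-based index can afford.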
Early attempts to alleviate the binary search problems
AVL tree - a height-balanced binary tree in which insertions and deletions can be performed with minimal accesses to internal nodes. See fig. 9.8, p. 378.
Paged binary tree - a binary tree that is divided into sub-trees. Each sub-tree is kept in a separate page that can be read or written in a single disk access. See fig. 9.12, p. 380.
Problems with AVL trees
AVL trees, while balanced, still require too many disk accesses to search for a key. (Searches that require more than 5-6 disk accesses are unacceptable.)
Problems with Paged Binary Trees
While the number of disk accesses is greatly reduced, paged binary trees suffer from inefficient disk usage. This is due to the number of unnecessary references in each sub-tree. Another drawback of this method is the complexity of maintaining the paged structure if the number of random insertions is large.
Multi-level and Multi-record Indexing: A Better Approach to Tree Indexes
All indexing methods discussed so far involved so-called simple indexes, i.e. index structures made of ordered, linear sequences of records consisting of (Key, Offset) pairs.
Multilevel indexes are tree-structured indexes in which each record consists of an ordered list of keys. Sometimes those records are referred to as pages. (WHY?)
B-Trees
A B-Tree of order m is a multilevel index tree with the following properties:
–Every node has a maximum of m descendants.
–Every node except the root and the leaves has at least ceiling(m/2) descendants.
–The root has at least two descendants (unless it is a leaf).
–All of the leaves appear on the same level.
–The leaf level forms a complete, ordered index of the associated data file.
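The shape rules above can be sketched as a small checker. The `Node` class and the function names are hypothetical, introduced only for illustration. The sketch checks the descendant-count and leaf-level rules, and does not check that the leaf level forms an ordered index.

```python
from dataclasses import dataclass, field
from math import ceil

@dataclass
class Node:
    keys: list
    children: list = field(default_factory=list)  # empty list marks a leaf

def leaf_depths(node, depth=0):
    """Return the set of depths at which leaves occur below node."""
    if not node.children:
        return {depth}
    depths = set()
    for child in node.children:
        depths |= leaf_depths(child, depth + 1)
    return depths

def satisfies_btree_shape(root, m):
    """Check the descendant-count and same-level rules for order m."""
    def check(node, is_root):
        n = len(node.children)
        if n == 0:
            return True                 # a leaf has no descendants to count
        if n > m:
            return False                # too many descendants
        if is_root and n < 2:
            return False                # a non-leaf root needs at least two
        if not is_root and n < ceil(m / 2):
            return False                # interior node is less than half full
        return all(check(child, False) for child in node.children)
    return check(root, True) and len(leaf_depths(root)) == 1

good = Node(["D", "T"], [Node(["A", "C", "D"]), Node(["S", "T"])])
print(satisfies_btree_shape(good, 4))  # True
```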
B-Trees - a bottom-up approach
B-Trees are built upward from the leaf level, so creation of new pages always starts at the leaf level.
Insertion of a new key into a B-Tree
Insertion begins with a search that starts at the root of the tree and proceeds all the way down to the leaf level.
After finding the insertion location at the leaf level, it inserts the new key, checks for overflow in the leaf record (page, node), splits the record if it overflows, and modifies the tree on the upward path. See example 9.14, page 389.
Splitting
Splitting is the creation of two nodes out of one when the original node becomes overfull. Splitting results in the need to promote a key to a higher-level node to provide an index separating the two new nodes.
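Insertion with splitting can be sketched in a few lines. This is a minimal in-memory illustration, not the textbook's implementation. It assumes the convention suggested by the deletion rules in these slides, in which each interior node stores the largest key of each of its children. The `Node` class and the `insert` and `_insert` names are hypothetical.

```python
import bisect

class Node:
    """A page: sorted keys, plus one child per key for interior pages."""
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children  # None marks a leaf

def _insert(node, key, m):
    """Insert key below node; return the one or two nodes that replace it."""
    if node.children is None:
        bisect.insort(node.keys, key)
    else:
        # Descend into the first child whose largest key is >= key,
        # or into the last child when key exceeds every stored key.
        i = min(bisect.bisect_left(node.keys, key), len(node.keys) - 1)
        parts = _insert(node.children[i], key, m)
        node.children[i:i + 1] = parts
        node.keys[i:i + 1] = [p.keys[-1] for p in parts]
    if len(node.keys) <= m:
        return [node]
    # Overflow: split the node and hand both halves to the caller.
    mid = (len(node.keys) + 1) // 2
    right = Node(node.keys[mid:],
                 None if node.children is None else node.children[mid:])
    node.keys = node.keys[:mid]
    if node.children is not None:
        node.children = node.children[:mid]
    return [node, right]

def insert(root, key, m):
    """Insert key into the tree of order m; return the (possibly new) root."""
    parts = _insert(root, key, m)
    if len(parts) == 1:
        return parts[0]
    # The root itself split, so the tree grows upward by one level.
    return Node([p.keys[-1] for p in parts], parts)

root = Node([])
for k in "CSDTA":
    root = insert(root, k, 4)
print(root.keys, [c.keys for c in root.children])
# ['D', 'T'] [['A', 'C', 'D'], ['S', 'T']]
```

Inserting A into the full leaf C D S T overflows it, so the leaf splits and a new root is created above the two halves. This shows the tree growing from the leaf level upward.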
Worst Case Search Depth
In the worst case, what is the maximum number of disk accesses required to locate a key in the tree? This is the same as asking how deep a tree will be.
Worst Case Search Depth Formula
The formula (see p. 403 of text) is
d <= 1 + log_{ceiling(m/2)}(N/2)
where d is an upper bound on the depth of a B-Tree of order m with N keys.
For N = 1,000,000 and m = 512, d <= 3.37.
How does this compare with a binary search?
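The bound can be evaluated numerically. The sketch below assumes the standard worst-case depth bound d <= 1 + log_{ceiling(m/2)}(N/2). It also assumes an order of m = 512, which is consistent with the quoted result of 3.37 for N = 1,000,000. The function name is illustrative.

```python
import math

def max_depth(n_keys: int, order: int) -> float:
    """Upper bound on B-Tree depth: 1 + log base ceiling(order/2) of (n_keys/2)."""
    min_fanout = math.ceil(order / 2)
    return 1 + math.log(n_keys / 2, min_fanout)

# Assumption: order 512, chosen because it reproduces the slide's 3.37.
print(round(max_depth(1_000_000, 512), 2))  # 3.37
```

Since depth is a whole number, at most 3 disk accesses are needed, whereas a binary search over the same 1,000,000 keys needs about log2(N), roughly 20 probes.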
Deleting a Key from a B-Tree
Rules for deleting a key K from a node n in a B-Tree:
–If n has more than the minimum number of keys and K is not the largest in n, then delete K.
–If n has more than the minimum number of keys and K is the largest in n, then delete K and modify the higher-level indexes to reflect the new largest key in n.
Rules for deleting a key K from a node n (continued):
–If n has exactly the minimum number of keys and one of its siblings has few enough keys, merge n with its sibling and delete a key from the parent node.
–If n has exactly the minimum number of keys and one of the siblings of n has extra keys, redistribute by moving some keys from this sibling to n, and modify the higher-level indexes to reflect the largest keys in the affected nodes.
See example fig. 9.21, p. 404.
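The four deletion rules amount to a decision procedure, sketched below. The function, its parameters, and the returned labels are hypothetical. When both merging and redistribution are possible, the sketch tries merging first, following the order of the rules as listed; this ordering is an assumption, and an implementation is free to prefer redistribution.

```python
def deletion_action(node_keys, sibling_keys, key_is_largest, min_keys, max_keys):
    """Pick the rule that applies when deleting one key from a node.

    node_keys      -- number of keys in the node before the deletion
    sibling_keys   -- key counts of the adjacent siblings
    key_is_largest -- True if the deleted key is the node's largest key
    """
    if node_keys > min_keys:
        # Rules 1 and 2: the node stays at least half full.
        return "delete and update parent" if key_is_largest else "delete"
    # Rule 3: merge if the remaining keys fit together with a sibling's keys.
    if any(node_keys - 1 + s <= max_keys for s in sibling_keys):
        return "merge"
    # Rule 4: otherwise borrow from a sibling that has keys to spare.
    if any(s > min_keys for s in sibling_keys):
        return "redistribute"
    return "no rule applies"

# Order 4: between 2 and 4 keys per node.
print(deletion_action(3, [2], False, 2, 4))  # delete
print(deletion_action(2, [2], False, 2, 4))  # merge
print(deletion_action(2, [4], False, 2, 4))  # redistribute
```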
Merging
When a B-Tree node underflows (becomes less than 50% full), it sometimes becomes necessary to combine the node with an adjacent node, thus decreasing the total number of nodes in the tree. Since merging changes the number of nodes in the tree, its effects can require reorganization at many levels of the tree.
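A merge of two adjacent leaves can be sketched on plain lists. The sketch assumes that the parent stores the largest key of each child, as the deletion rules in these slides suggest. The function name and list-based layout are illustrative only.

```python
def merge_children(parent_keys, children, i):
    """Merge children[i] with children[i + 1]; return updated parent and children.

    Assumption: parent_keys[j] is the largest key of children[j].
    """
    merged = children[i] + children[i + 1]
    new_children = children[:i] + [merged] + children[i + 2:]
    # The merged node keeps the right sibling's largest key,
    # so the left node's entry is removed from the parent.
    new_parent_keys = parent_keys[:i] + parent_keys[i + 1:]
    return new_parent_keys, new_children

# Order 4 (minimum 2 keys): leaf ['H'] has underflowed.
parent, leaves = merge_children(["D", "H", "T"],
                                [["A", "C", "D"], ["H"], ["S", "T"]], 1)
print(parent, leaves)  # ['D', 'T'] [['A', 'C', 'D'], ['H', 'S', 'T']]
```

The parent loses one key here. If that leaves the parent under its own minimum, the same decision repeats one level up, which is how a merge can propagate through the tree.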
Redistribution
When a B-Tree node underflows (becomes less than 50% full), it may be possible to move keys into the node from an adjacent node with the same parent. This helps ensure that the 50% (m/2) full property is maintained. When keys are redistributed, it becomes necessary to alter the contents of the parent as well.
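Redistribution between two adjacent leaves can be sketched the same way. The sketch pools the keys of both siblings, divides them as evenly as possible, and refreshes the parent's entries. It assumes the parent stores the largest key of each child, and the even split is one possible policy rather than the only one. The function name is hypothetical.

```python
def redistribute(parent_keys, children, i):
    """Even out children[i] and children[i + 1]; return updated parent and children.

    Assumption: parent_keys[j] is the largest key of children[j].
    """
    pooled = children[i] + children[i + 1]
    mid = (len(pooled) + 1) // 2
    left, right = pooled[:mid], pooled[mid:]
    new_children = children[:i] + [left, right] + children[i + 2:]
    # The number of nodes is unchanged; only the separating keys change.
    new_parent_keys = parent_keys[:i] + [left[-1], right[-1]] + parent_keys[i + 2:]
    return new_parent_keys, new_children

# Order 4 (minimum 2 keys): leaf ['M'] has underflowed, its left sibling is full.
parent, leaves = redistribute(["D", "M", "T"],
                              [["A", "B", "C", "D"], ["M"], ["S", "T"]], 0)
print(parent, leaves)  # ['C', 'M', 'T'] [['A', 'B', 'C'], ['D', 'M'], ['S', 'T']]
```

Unlike a merge, the parent keeps the same number of keys, so the change does not propagate further up the tree.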
Redistribution During Insertion
Redistribution can be used during insertion to postpone the creation of new pages. Using redistribution in place of splitting should make a B-Tree more efficient in space utilization.
Advantages of B-Trees
They are balanced (do not have overly long branches).
They are shallow (requiring few seeks).
They accommodate random insertions and deletions at a relatively low cost while remaining in balance.
They guarantee at least 50% storage utilization.