CS 206 Introduction to Computer Science II 12 / 01 / 2008 Instructor: Michael Eckmann
Michael Eckmann - Skidmore College - CS Fall 2008
Today's Topics
Questions/comments?
Finish up coding for insert in AVL trees
–Need to return the subtree root node each time
B-Trees
AVL tree
Would someone like to remind everyone what an AVL tree is?
AVL tree
An AVL tree is a BST with the added restriction that
–for all nodes, the height of the left subtree differs from the height of the right subtree by at most 1. The height of an empty subtree is -1.
A node in an AVL tree contains
–the data item
–a reference to the left child node
–a reference to the right child node
–the height of the node
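As a reminder of what that node looks like in code, here is a minimal sketch in Python (the language is chosen only for illustration, and the names `AVLNode`, `height` and `is_balanced_here` are assumptions, not the names used in our class code). It stores the four fields listed above and checks the balance restriction at a single node.

```python
class AVLNode:
    """One AVL tree node: data, left child, right child, and height."""

    def __init__(self, data, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right
        # A node's height is one more than the taller of its two subtrees.
        self.height = 1 + max(height(left), height(right))


def height(node):
    """Height of a subtree; an empty subtree has height -1."""
    return -1 if node is None else node.height


def is_balanced_here(node):
    """True if the two subtree heights of this node differ by at most 1."""
    return abs(height(node.left) - height(node.right)) <= 1
```

A single node has height 0 and is balanced. A chain of three nodes that all hang to the right has height 2 at its top node, and that node fails the check, which is the situation a rotation repairs.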
Other Balanced Trees
Red-Black trees are another form of balanced binary search tree that we will not discuss. There are others too.
What I'd like to introduce now is a tree that works well for large amounts of data, too big to fit entirely in memory (RAM). Part of the data will reside in memory and the rest (the bulk of it) on disk.
Why do we care? Any ideas?
Leading up to B-Trees
We care because disk accesses take so much longer than main memory accesses.
Some numbers on the board.
Leading up to B-Trees
We would rather do many, many calculations if that saves us from having to access the disk.
The worst-case height of an AVL tree is about 1.44 log_2 n, which is close to the optimal (least) height for a set of n nodes, but we'll see that for large amounts of data that reside on disk, it will not perform as well as we'd like.
The height of an optimal (perfectly balanced) BST is about log_2 n. That is the best we can do with binary trees, and it's not good enough. We want a small constant number of disk accesses.
So, instead of using a binary tree, we'll use an M-ary tree (where M > 2). The least height of an M-ary tree is about log_M n.
Example on the board for M=5.
B-Trees are discussed in section 19.8 in your text.
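To put numbers on the log_2 n versus log_M n comparison, here is a small sketch. It assumes the n items sit in the leaves, and the function name `min_height` is made up for this example. It finds the smallest height h for which an M-ary tree has room for n leaves, that is, the smallest h with M^h >= n.

```python
def min_height(n, m):
    """Smallest h such that an m-ary tree of height h can have n leaves."""
    h = 0
    capacity = 1  # number of leaves a tree of height h can have
    while capacity < n:
        capacity *= m
        h += 1
    return h


n = 10_000_000
print(min_height(n, 2))    # 24
print(min_height(n, 5))    # 11
print(min_height(n, 228))  # 3
```

Integer multiplication is used instead of a floating-point logarithm so that the result is exact at powers of M. If every level costs one disk access, the jump from 24 levels to 3 is the whole point of a large M.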
B-Trees
B-Trees were created to take that discrepancy in processing time into account (when we have large volumes of data to process). A B-tree can guarantee only a few disk accesses.
An M-ary B-Tree has the following properties
–data items are stored in leaves only
–interior (non-leaf) nodes store a maximum of M-1 keys
–the root is a leaf or has from 2 to M children
–all non-leaf nodes (except the root) have between ceil(M/2) and M children
–all leaves are at the same depth
–all leaves hold between ceil(L/2) and L data items
L means... (see next few slides)
All non-leaf nodes (except the root) have at least half of M children, so for large M this guarantees that the M-ary tree will not approach anything close to a binary tree (why is that a good thing?)
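The structural properties above can be checked mechanically. The following is only a sketch under assumptions of my own: a leaf is represented as a Python list of data items, an interior node as a tuple `(keys, children)`, an interior node with k children is assumed to hold exactly k-1 keys, and a root that is a leaf is allowed to hold fewer than ceil(L/2) items. The names `check_btree` and `_require` are hypothetical. It does not check that the keys are in sorted order.

```python
from math import ceil


def _require(condition, message):
    """Raise ValueError with the given message if the condition is false."""
    if not condition:
        raise ValueError(message)


def check_btree(node, m, l, is_root=True):
    """Return the depth of the leaves below node, or raise ValueError."""
    if isinstance(node, list):  # a leaf: just a list of data items
        low = 0 if is_root else ceil(l / 2)
        _require(low <= len(node) <= l, "leaf item count out of range")
        return 0
    keys, children = node  # an interior node
    low = 2 if is_root else ceil(m / 2)
    _require(low <= len(children) <= m, "child count out of range")
    _require(len(keys) == len(children) - 1, "need one fewer key than children")
    # Every child must report the same leaf depth.
    depths = {check_btree(child, m, l, is_root=False) for child in children}
    _require(len(depths) == 1, "leaves are at different depths")
    return depths.pop() + 1
```

For example, with M=5 and L=5, a root with keys `[10, 20]` over the three leaves `[1, 2, 3]`, `[10, 11, 12]` and `[20, 21, 22, 23]` passes, while the same shape with a two-item leaf is rejected because 2 is below ceil(5/2) = 3.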
B-Trees
Each node represents a disk block (the amount of data read in one disk access).
–example: if a disk block is 8k
  each node holds M-1 keys and M branches (links)
–let's say our keys are of size 32 bytes each
–and a link is 4 bytes
How would we determine M?
–the largest value such that a node doesn't hold more than 8k
Determine L by the number of records we can store in one block.
–if our records are 256 bytes, how many could we store in one block of 8k?
B-Trees
Each node represents a disk block (the amount of data read in one disk access).
–example: if a disk block is 8k (= 8192 bytes)
  each node holds M-1 keys and M branches (links)
–let's say our keys are of size 32 bytes each
–and a link is 4 bytes
How would we determine M?
–the largest value such that a node doesn't hold more than 8k
  32*(M-1) + 4*M <= 8192
  36*M <= 8224
  M = 228 (the largest M such that we don't go over 8192)
Determine L by the number of records we can store in one block.
–if our records are 256 bytes, how many could we store in one block of 8k?
  8192/256 = 32
From the rules of our B-Tree, each leaf then has to have between 16 and 32 records and each non-leaf node (except the root) has to have at least 114 children (up to 228 children).
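The arithmetic for M and L can be written out as a short sketch. The function names below are made up for this example. The only inputs are the block size, key size, link size and record size from the slide.

```python
def max_branching(block_size, key_size, link_size):
    """Largest M with (M-1)*key_size + M*link_size <= block_size."""
    # Rearranged: M * (key_size + link_size) <= block_size + key_size
    return (block_size + key_size) // (key_size + link_size)


def max_records_per_leaf(block_size, record_size):
    """Largest L such that L records fit in one block."""
    return block_size // record_size


M = max_branching(8192, 32, 4)
L = max_records_per_leaf(8192, 256)
print(M, L)                      # 228 32
print(-(-M // 2), -(-L // 2))    # 114 16  (ceil(M/2) and ceil(L/2))
```

As a check, M = 228 uses 32*227 + 4*228 = 8176 bytes, while M = 229 would need 8212 bytes, which is over 8192.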
B-Trees
Example:
–From the rules of our B-Tree, each leaf has to have between 16 and 32 records and each non-leaf node (except the root) has to have at least 114 children (up to 228 children).
–Pick the worst-case B-tree for our example, that is, the one with the fewest children per node, which gives the tallest possible B-Tree. For 10,000,000 records we'll have at most 625,000 leaves (10,000,000 / 16), which means our B-tree has at most 4 levels (the leaves are on level 4).
The worst-case height of an M-ary B-tree is approx. log_(M/2) n. Why?
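One way to see where the 4 comes from is to build the sparsest legal tree level by level: the root has as few as 2 children, every other interior node has ceil(M/2) children, and every leaf holds ceil(L/2) records. The sketch below does that count. The function name `worst_case_levels` is an assumption made for this example, and "levels" counts the root as level 1.

```python
def worst_case_levels(n_records, m, l):
    """Most levels a B-tree holding n_records can have (root is level 1)."""
    min_children = -(-m // 2)  # ceil(m / 2)
    min_records = -(-l // 2)   # ceil(l / 2)
    # Each leaf holds at least min_records, so this bounds the leaf count.
    max_leaves = n_records // min_records
    if max_leaves < 2:
        return 1  # everything fits in a root that is a leaf
    depth = 1          # depth of the leaves below the root
    fewest_leaves = 2  # the root needs only 2 children
    # Push the leaves one level deeper while the sparsest tree still fits.
    while fewest_leaves * min_children <= max_leaves:
        fewest_leaves *= min_children
        depth += 1
    return depth + 1


print(worst_case_levels(10_000_000, 228, 32))  # 4
```

With leaves three levels below the root, the sparsest tree has 2 * 114 * 114 = 25,992 leaves, which is under 625,000. One level deeper would need at least 2,963,088 leaves, which is more than we can have, so 4 levels is the worst case.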
B-Trees
Each disk read will get a block, which is a whole node. When we read an interior node's data from disk we get keys and links. When we read a leaf node's data from disk we get up to L data records.
Insert examples
–normal insert
–may need to split a leaf into two
  option of putting a child up for adoption to a neighbor
–may need to split parents
  may need to split the root (root will then have 2 children)
Splits are infrequent (for every split there will be roughly L/2 nonsplits).
Heightening of the tree is even more infrequent (the only way the tree gets higher is when an insertion leads to splitting the root).
–notice: for a tree with four levels, the root was split only 3 times during all those inserts. For M and L as we set them in the example, that is 3 times in 10,000,000 inserts.
Splits and heightening lead to more processing when they happen, but they don't happen often.
B-Trees
Let's see some insertions into an existing B-Tree
–a 5-ary B-tree, M=5 (with L=5 too), will be shown on the board.
First verify visually that it is indeed a 5-ary B-tree with L=5.
Insert 57 and rearrange the leaf (1 disk access).
Insert 55. The leaf is full (it has L=5 items already). With the 55 it would have L+1 = 6 items. (L+1)/2 = 3 is at least the minimum of ceil(L/2) = 3, so we can split it into 2 leaves of 3 items each.
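The leaf-level part of those two insertions can be sketched as follows. This is a simplification that models a leaf as a sorted Python list and ignores the parent. The function name `insert_into_leaf` is hypothetical, and the starting leaf contents in the usage lines are made-up values, not the tree drawn on the board.

```python
import bisect


def insert_into_leaf(leaf, key, max_items=5):
    """Insert key into a sorted leaf; return (left, right).

    right is None unless the leaf overflowed and had to be split.
    """
    bisect.insort(leaf, key)  # keep the leaf's items in sorted order
    if len(leaf) <= max_items:
        return leaf, None     # normal insert, no split needed
    mid = (len(leaf) + 1) // 2
    return leaf[:mid], leaf[mid:]  # split the L+1 items into two leaves


leaf, new_leaf = insert_into_leaf([48, 51, 54, 56], 57)
print(leaf, new_leaf)   # [48, 51, 54, 56, 57] None
leaf, new_leaf = insert_into_leaf(leaf, 55)
print(leaf, new_leaf)   # [48, 51, 54] [55, 56, 57]
```

After a split the parent would gain a link to the new leaf and a key separating the two leaves. If the parent already has M children, it has to be split in the same way.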
B-Trees
What would be the first step in deletion of a node?
Do we have to check anything after the deletion of a node?
B-Trees
Delete examples:
–normal delete
–may need to adopt if a leaf goes below the minimum
  ok if the neighbor is not already at the minimum
  if the neighbor is at the minimum too, join the two leaves together to get one leaf (a full one when L is odd, as with L=5)
–parent loses a child
–if the parent is now below the minimum, continue up
  if the root ever loses a child and would be left with only 1 child, reduce the height of the tree by one and make that child the root
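The three leaf-level cases (normal delete, adopt, join) can be sketched as below. As assumptions for the sake of a short example, leaves are sorted Python lists, the neighbor is always the sibling immediately to the right, and the parent's keys are not updated. The function name `delete_from_leaf` is hypothetical.

```python
def delete_from_leaf(leaf, neighbor, key, max_items=5):
    """Delete key from leaf; return (leaf, neighbor) afterwards.

    neighbor comes back as None if the two leaves had to be joined.
    """
    leaf.remove(key)  # raises ValueError if key is not in the leaf
    min_items = -(-max_items // 2)  # ceil(L / 2)
    if len(leaf) >= min_items:
        return leaf, neighbor       # normal delete
    if len(neighbor) > min_items:
        # Adopt: the neighbor can spare its smallest item.
        leaf.append(neighbor.pop(0))
        return leaf, neighbor
    # The neighbor is at the minimum too: join the two leaves into one.
    return leaf + neighbor, None


print(delete_from_leaf([1, 2, 3, 4], [10, 11, 12], 2))  # ([1, 3, 4], [10, 11, 12])
print(delete_from_leaf([1, 2, 3], [10, 11, 12, 13], 2)) # ([1, 3, 10], [11, 12, 13])
print(delete_from_leaf([1, 2, 3], [10, 11, 12], 2))     # ([1, 3, 10, 11, 12], None)
```

In the last case the parent has lost a child, so the same below-minimum check has to be repeated one level up.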