CSE332: Data Abstractions, Lecture 9: B-Trees. Tyler Robison, Summer 2010.


Our goal

- Problem: what if our dictionary has so much data that most of it resides on disk? Disk access is very slow.
- Say we had to do a disk access for each node access:
  - Unbalanced BST with n nodes: n disk accesses (worst case)
  - AVL tree: log2 n accesses (worst case); log2(2^30) = 30 disk accesses is still a bit slow
  - An improvement, but we can do better
- Idea: a balanced tree (logarithmic height) that is even shallower than an AVL tree, so that we can minimize disk accesses and exploit the disk-block size
  - Increase the branching factor to decrease the height
  - This gives height log3 n, log10 n, log50 n, etc., based on the branching factor
  - Asymptotically still O(log n) height, though

M-ary Search Tree

Build some kind of search tree with branching factor M:
- Say each node has a sorted array of M children (Node[])
- Choose M so a node fits snugly into one disk block (1 access per node)

- Number of hops for find: if balanced, logM n instead of log2 n
  - If M = 256, that's an 8x improvement; for example, with M = 256 and n = 2^40, that's 5 hops instead of 40
- To decide which branch to take, divide the keys into portions
  - Binary tree: less than the node value, or greater?
  - M-ary: in range 1? In range 2? In range 3? ... In range M?
- Runtime of find if balanced: O(log2 M · logM n)
  - Hmmm... logM n is the height we traverse. Why the log2 M multiplier? At each step, we find the correct child branch to take using binary search over the node's keys
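The per-node branch decision just described can be sketched with a stdlib binary search (a minimal sketch; the function name and the key convention are ours, not the lecture's):

```python
from bisect import bisect_right

def choose_branch(keys, x):
    # keys: one node's (up to M-1) sorted signpost keys.
    # Child i holds the values v with keys[i-1] <= v < keys[i],
    # so the child index is simply the number of keys <= x.
    # bisect_right is a binary search: O(log2 M) per node.
    return bisect_right(keys, x)
```

For example, with keys 3, 7, 12, 21, looking up 7 descends into child 2 (the subtree holding values ≥ 7 and < 12).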

Problems with how to proceed

- What should the order property be? How do we decide the 'portion' each child will take?
- How would you rebalance (ideally without more disk accesses)?
- Storing real data at inner nodes (as in a BST) seems kind of wasteful: to access a node, we have to load its data from disk, even though most of the time we won't use it

B+ Trees (we and the book say "B Trees")

- Two types of nodes: internal nodes and leaves
- Each internal node has room for up to M-1 keys and M children
  - In the example on the right, M = 7
  - No other data; all data is at the leaves!
- Think of the M-1 keys stored in internal nodes as 'signposts' used to find a path to a leaf
- Order property: the subtree between keys x and y contains only data that is ≥ x and < y (notice the ≥)
- Leaf nodes (not shown here) hold up to L sorted data items
- Remember: leaves store data; internal nodes are 'signposts'

(Figure: an example tree whose branches cover ranges such as x < 3, 3 ≤ x < 7, 7 ≤ x < 12, 12 ≤ x < 21, and 21 ≤ x; empty cells at the end are currently unused, but may get filled in later.)

What's the 'B' for? Wikipedia quotes a 1979 textbook: "The origin of 'B-tree' has never been explained by the authors. As we shall see, 'balanced,' 'broad,' or 'bushy' might apply. Others suggest that the 'B' stands for Boeing. Because of his contributions, however, it seems appropriate to think of B-trees as 'Bayer'-trees."

Find

- Different from a BST in that we don't store values at internal nodes
- But find is still an easy recursive algorithm, starting at the root
  - At each internal node, do binary search on the (up to) M-1 keys to find the branch to take
  - At the leaf, do binary search on the (up to) L data items
- But to get logarithmic running time, we need a balance condition...
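The find just described can be sketched as follows (a sketch under our own node representation; the class and function names are assumptions, not the course's code):

```python
from bisect import bisect_left, bisect_right

class Internal:
    """Signpost node: up to M-1 sorted keys and len(keys)+1 children."""
    def __init__(self, keys, children):
        self.keys, self.children = keys, children

class Leaf:
    """Holds up to L sorted data items."""
    def __init__(self, items):
        self.items = items

def find(node, x):
    # Walk down: binary search on the signposts picks the child whose
    # range contains x (data >= keys[i-1] and < keys[i]).
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.keys, x)]
    # At the leaf: binary search on the (up to) L data items.
    i = bisect_left(node.items, x)
    return i < len(node.items) and node.items[i] == x
```

A small usage example: for a root with signposts 12 and 21 over three leaves, find(root, 15) searches only the middle leaf.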

Structure Properties

Remember: M = max # children, L = max # items in a leaf

- Non-root internal nodes
  - Have between ⌈M/2⌉ and M children (inclusive), i.e., are at least half full
- Leaf nodes
  - All leaves are at the same depth
  - Have between ⌈L/2⌉ and L data items (inclusive), i.e., are at least half full
- Root (special case)
  - If the tree has ≤ L items, the root is a leaf (occurs when starting up; otherwise unusual) and can hold any number of items ≤ L
  - Else it has between 2 and M children
- (Any M > 2 and L will work; they are picked based on the disk-block size)
- Uh, why these bounds on internal nodes and children?
  - The upper bounds make sense: we don't want one long list, and we pick the values to fit in a block
  - Why lower bounds? They ensure the tree is sufficiently 'filled in': no unnecessary height, for instance, and guaranteed 'broadness'
  - We'll see more when we cover insert and delete

Example

Suppose M = 4 (max children) and L = 5 (max items at a leaf):
- All internal nodes have at least 2 children
- All leaves have at least 3 data items (only keys are shown)
- All leaves are at the same depth

Note: one internal node shows only 1 key, but it has 2 children, so it's okay.
Note on notation: inner nodes are drawn horizontally and leaves vertically to distinguish them; empty cells are included.

Balanced enough

It is not too difficult to show that the height h is logarithmic in the number of data items n:
- Let M > 2 (if M = 2, a list tree is legal; no good!)
- Because all nodes are at least half full (except the root, which may have only 2 children) and all leaves are at the same level, the minimum number of data items n for a tree of height h > 0 is

  n ≥ 2 · ⌈M/2⌉^(h-1) · ⌈L/2⌉

  (minimum number of leaves × minimum data per leaf)
- This is exponential in the height because ⌈M/2⌉ > 1
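The bound can be checked numerically (a small sketch; the function name is ours):

```python
import math

def min_items(M, L, h):
    # Minimum data items in a legal tree of height h > 0: the root has at
    # least 2 children, every other internal node at least ceil(M/2)
    # children, and every leaf at least ceil(L/2) items, giving
    # n >= 2 * ceil(M/2)^(h-1) * ceil(L/2).
    return 2 * math.ceil(M / 2) ** (h - 1) * math.ceil(L / 2)
```

Solving n ≥ 2·⌈M/2⌉^(h-1)·⌈L/2⌉ for h gives the logarithmic height claimed above.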

Disk Friendliness

What makes B trees so disk friendly?
- Many keys are stored in one node
  - All are brought into memory in one disk access, IF we pick M wisely
  - This makes the binary search over the M-1 keys totally worth it (insignificant compared to disk-access times)
- Internal nodes contain only keys
  - Any find wants only one data item; it would be wasteful to load unnecessary data items along with the internal nodes
  - So we bring only one leaf of data items into memory
  - Data-item size doesn't affect what M is
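Picking M "wisely" means the largest M whose M-1 keys and M child pointers fit in one disk block. A sketch, with illustrative sizes (the byte counts below are our assumptions, not numbers from the lecture):

```python
def max_branching(block_bytes, key_bytes, ptr_bytes):
    # Largest M such that M-1 keys plus M child pointers fit in one block:
    # (M-1)*key_bytes + M*ptr_bytes <= block_bytes
    # => M <= (block_bytes + key_bytes) / (key_bytes + ptr_bytes).
    return (block_bytes + key_bytes) // (key_bytes + ptr_bytes)
```

For example, a hypothetical 4096-byte block with 8-byte keys and 8-byte pointers gives M = 256, the branching factor used in the earlier example.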

Maintaining balance

- So this seems like a great data structure (and it is)
- But we haven't implemented the other dictionary operations yet: insert and delete
- As with AVL trees, the hard part is maintaining the structure properties
  - Example: for insert, there might not be room at the correct leaf
- Unlike AVL trees, there are no rotations

Building a B-Tree (insertions)

The empty B-Tree (the root will be a leaf at the beginning). M = 3, L = 3.
Remember: horizontal = internal node; vertical = leaf.

- Insert(3): the root leaf holds 3
- Insert(18): the root leaf holds 3, 18
- Insert(14): the root leaf holds 3, 14, 18; we just need to keep the data in order

Insert(30): ??? (M = 3, L = 3)

When we 'overflow' a leaf, we split it into 2 leaves, and the parent gains another child. If there is no parent (like here), we create one. How do we pick the key shown in it? The smallest element in the right tree (here the left leaf holds 3, 14, the right leaf holds 18, 30, and the new root's key is 18).

Insert(32), Insert(36), Insert(15) (M = 3, L = 3)

Inserting 36 overflows a leaf, so we split a leaf again (the parent gains another child); inserting 15 then fits in the leftmost leaf.

Insert(16) (M = 3, L = 3)

What now? The leaf overflows and splits, but its parent already has M = 3 children, so it overflows too: split the internal node (in this case, the root).

Insert(12), Insert(40), Insert(45), Insert(38) (M = 3, L = 3)

Note: given the leaves and the structure of the tree, we can always fill in the internal-node keys: each key is 'the smallest value in my right branch'.

Insertion Algorithm

1. Traverse from the root to the proper leaf. Insert the data in its leaf in sorted order.
2. If the leaf now has L+1 items, overflow! Split the leaf into two leaves:
   - The original leaf keeps ⌈(L+1)/2⌉ items
   - The new leaf gets ⌊(L+1)/2⌋ items
   Attach the new leaf to the parent, adding the new key to the parent in sorted order.
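Step 2's leaf split can be sketched as follows (the function name and plain-list representation are ours):

```python
def split_leaf(items):
    # items: the L+1 sorted items of an overflowing leaf.
    # The original leaf keeps ceil((L+1)/2) items, the new leaf gets the
    # rest, and the parent gains the smallest item of the new (right)
    # leaf as its signpost key.
    mid = (len(items) + 1) // 2
    left, right = items[:mid], items[mid:]
    return left, right, right[0]
```

This matches the Insert(30) example: splitting the overflowing leaf 3, 14, 18, 30 yields leaves 3, 14 and 18, 30, with 18 as the new signpost.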

Insertion algorithm, continued

3. If an internal node now has M+1 children, overflow! Split the node into two nodes:
   - The original node keeps ⌈(M+1)/2⌉ children
   - The new node gets ⌊(M+1)/2⌋ children
   Attach the new node to the parent, adding the new key to the parent in sorted order.

Splitting at a node (step 3) could make the parent overflow too:
- So repeat step 3 up the tree until a node doesn't overflow
- If the root overflows, make a new root with two children; this is the only case that increases the tree height

Efficiency of insert

- Find the correct leaf: O(log2 M · logM n)
- Insert in the leaf: O(L)
- Split the leaf: O(L)
- Split parents all the way up to the root: O(M · logM n)

Worst case for insert: O(L + M · logM n). But it's not that bad:
- Splits are not that common (nodes have to fill up first)
- Splitting the root is extremely rare
- Remember, disk accesses were the name of the game: O(logM n)

Another option (we won’t use) 21  Adoption  When leaf gains L+1 items, instead of splitting, try to put one ‘up for adoption’  Check neighboring leaves; if they have space, they can take it  If not, will have to split anyway  Doesn’t change worst-case asymptotic run-time  Can also use for internal node; pass off child pointers

Adoption example

Insert(31) with adoption: if the leaf overflows, try to pass an item on to a neighbor; if no neighbor has space, split.

Same example, with splitting

Insert(31) with splitting: the overflowing leaf is simply split. For this class, we'll stick with splitting (no adoption), but it's good to be aware of the alternative.