CSE 326: Data Structures Lecture #11 B-Trees Alon Halevy Spring Quarter 2001.


B-Tree Properties
Properties
– maximum branching factor of M
– the root has between 2 and M children
– other internal nodes have between ⌈M/2⌉ and M children
– internal nodes contain only search keys (no data)
– the smallest datum between search keys x and y equals x
– each (non-root) leaf contains between ⌈L/2⌉ and L keys
– all leaves are at the same depth
Result
– tree is ⌈log⌈M/2⌉(n/⌈L/2⌉)⌉ ± 1 deep, which is Θ(log n)
– all operations run in time proportional to depth
– operations pull in at least M/2 or L/2 items at a time

When Big-O is Not Enough
B-Tree is about log⌈M/2⌉(n/⌈L/2⌉) deep
= log⌈M/2⌉ n − log⌈M/2⌉ ⌈L/2⌉
= O(log⌈M/2⌉ n)
= O(log n) steps per operation (same as BST!)
Where's the beef?!
log₂(10,000,000) ≈ 24 disk accesses
log₂₀₀/₂(10,000,000) < 4 disk accesses
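To see where the "< 4" comes from, here is the arithmetic for M = 200 (so at least M/2 = 100 children per internal node):

log₁₀₀(10,000,000) = log₁₀(10⁷) / log₁₀(100) = 7 / 2 = 3.5, so fewer than 4 disk accesses,

whereas log₂(10,000,000) = 7 / log₁₀(2) ≈ 7 / 0.301 ≈ 23.3, roughly 24 disk accesses for a binary tree.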

B-Tree Nodes
[diagram: an internal node holding search keys k1 … ki, and a leaf holding data keys k1 … kj]
Internal node
– i search keys; i+1 subtrees; M − i − 1 inactive entries
Leaf
– j data keys; L − j inactive entries
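A minimal sketch of how these two node shapes might be declared, assuming fixed compile-time constants M and L and an integer Key type (the names are illustrative, not the course's actual declarations):

    const int M = 4;                  // maximum branching factor of an internal node
    const int L = 4;                  // maximum number of data keys in a leaf
    typedef int Key;

    struct LeafNode {
        int numKeys;                  // j active data keys
        Key keys[L];                  // slots numKeys .. L-1 are the inactive entries
    };

    struct InternalNode {
        int numKeys;                  // i active search keys
        Key keys[M - 1];              // the remaining M - i - 1 slots are inactive
        void* children[M];            // i + 1 active subtrees (leaves or internal nodes)
    };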

Example B-Tree with M = 4 and L =

Making a B-Tree (M = 3, L = 2)
Start with the empty B-Tree. Insert(3) gives a single leaf [3]; Insert(14) gives the leaf [3 14]. Now, Insert(1)?

Splitting the Root
Insert(1): too many keys in a leaf! So, split the leaf, and create a new root.

Insertions and Split Ends
Insert(59), then Insert(26): too many keys in a leaf! So, split the leaf, and add a new child to the parent.

Propagating Splits
Insert(5): the leaf splits and adds a new child to its parent. Now there are too many keys in an internal node! So, split the node, and create a new root.

Insertion in Boring Text
Insert the key in its leaf.
If the leaf ends up with L+1 items, overflow!
– Split the leaf into two nodes: the original with ⌈(L+1)/2⌉ items, the new one with ⌊(L+1)/2⌋ items
– Add the new child to the parent
– If the parent ends up with M+1 items, overflow!
If an internal node ends up with M+1 items, overflow!
– Split the node into two nodes: the original with ⌈(M+1)/2⌉ items, the new one with ⌊(M+1)/2⌋ items
– Add the new child to the parent
– If the parent ends up with M+1 items, overflow!
Split an overflowed root in two and hang the new nodes under a new root.
This makes the tree deeper!
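A compact sketch of the leaf-level half of this procedure, reusing the illustrative LeafNode declaration from earlier (a sketch under those assumptions, not the course's actual code):

    // Returns a newly created sibling leaf if the insertion forced a split, or
    // NULL if the key simply fit. The caller must add the sibling as a new child
    // of the parent, which may in turn overflow and split in the same way.
    LeafNode* insertIntoLeaf(LeafNode* leaf, Key key) {
        Key all[L + 1];                        // existing keys plus the new one, in order
        int n = 0;
        bool placed = false;
        for (int k = 0; k < leaf->numKeys; k++) {
            if (!placed && key < leaf->keys[k]) { all[n++] = key; placed = true; }
            all[n++] = leaf->keys[k];
        }
        if (!placed) all[n++] = key;

        if (n <= L) {                          // no overflow: copy back and we are done
            for (int k = 0; k < n; k++) leaf->keys[k] = all[k];
            leaf->numKeys = n;
            return NULL;
        }
        // Overflow (n == L+1): original keeps ceil((L+1)/2), sibling gets floor((L+1)/2).
        int keep = (L + 2) / 2;                // ceil((L+1)/2)
        LeafNode* sibling = new LeafNode();
        leaf->numKeys = keep;
        sibling->numKeys = n - keep;
        for (int k = 0; k < keep; k++) leaf->keys[k] = all[k];
        for (int k = keep; k < n; k++) sibling->keys[k - keep] = all[k];
        return sibling;                        // parent must adopt this as a new child
    }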

Deletion in B-trees Come to section tomorrow. Slides follow.

After More Routine Inserts Insert(89) Insert(79)

Deletion Delete(59)

Deletion and Adoption
Delete(5): a leaf has too few keys! So, borrow from a neighbor.

Deletion with Propagation
Delete(3): a leaf has too few keys! And no neighbor with surplus! So, delete the leaf. But now a node has too few subtrees!

Finishing the Propagation (More Adoption)
Adopt a neighbor.

A Bit More Adoption
Delete(1) (adopt a neighbor).

Pulling out the Root
Delete(26): a leaf has too few keys! And no neighbor with surplus! So, delete the leaf. Now a node has too few subtrees and no neighbor with surplus! Delete the node. But now the root has just one subtree!

Pulling out the Root (continued) The root has just one subtree! But that’s silly! Just make the one child the new root!

Deletion in Two Boring Slides of Text
Remove the key from its leaf.
If the leaf ends up with fewer than ⌈L/2⌉ items, underflow!
– Adopt data from a neighbor; update the parent
– If borrowing won't work, delete the node and divide its keys between the neighbors
– If the parent ends up with fewer than ⌈M/2⌉ items, underflow!
Why will dumping keys always work if borrowing doesn't?

Deletion Slide Two
If a node ends up with fewer than ⌈M/2⌉ items, underflow!
– Adopt subtrees from a neighbor; update the parent
– If borrowing won't work, delete the node and divide its subtrees between the neighbors
– If the parent ends up with fewer than ⌈M/2⌉ items, underflow!
If the root ends up with only one child, make the child the new root of the tree.
This reduces the height of the tree!
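A rough sketch of the leaf-level underflow fix-up described above, again using the illustrative LeafNode declaration (the merge direction and helper naming are assumptions; real code must also update the parent's search keys):

    // Called after removing a key from 'leaf'. 'left' is the sibling immediately to
    // the left, so all of its keys are smaller. Returns true if 'leaf' was deleted
    // (merged away): the parent loses a child and must be checked for underflow too.
    bool fixLeafUnderflow(LeafNode* leaf, LeafNode* left) {
        int minKeys = (L + 1) / 2;              // ceil(L/2)
        if (leaf->numKeys >= minKeys) {
            return false;                        // no underflow, nothing to do
        }
        if (left->numKeys > minKeys) {
            // Adoption: borrow the left sibling's largest key.
            for (int k = leaf->numKeys; k > 0; k--) {
                leaf->keys[k] = leaf->keys[k - 1];   // shift right to make room in front
            }
            leaf->keys[0] = left->keys[--left->numKeys];
            leaf->numKeys++;
            return false;                        // parent must update its search key
        }
        // No surplus anywhere: dump this leaf's keys into the left sibling, delete the leaf.
        // This always fits: (ceil(L/2) - 1) + ceil(L/2) <= L keys in total -- which is
        // why dumping keys always works when borrowing doesn't.
        for (int k = 0; k < leaf->numKeys; k++) {
            left->keys[left->numKeys++] = leaf->keys[k];
        }
        delete leaf;
        return true;                             // parent lost a child
    }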

Thinking about B-Trees
– B-Tree insertion can cause (expensive) splitting and propagation
– B-Tree deletion can cause (cheap) borrowing or (expensive) deletion and propagation
– Propagation is rare if M and L are large (Why?)
– Repeated insertions and deletions can cause thrashing
– If M = L = 128, then a B-Tree of height 4 will store at least 30,000,000 items; height 5: 2,000,000,000!
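The capacity claim follows from the minimum fan-out: the root has at least 2 children, every other internal node at least ⌈M/2⌉ = 64 children, and every leaf at least ⌈L/2⌉ = 64 keys. Counting the root as depth 0 and the leaves as depth h:

height 4: at least 2 · 64 · 64 · 64 = 524,288 leaves × 64 keys each = 2 · 64⁴ = 33,554,432 ≈ 30,000,000 items
height 5: at least 2 · 64⁵ = 2,147,483,648 ≈ 2,000,000,000 items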

Tree Summary
– BST: fast finds, inserts, and deletes; O(log n) on average (if data is random!)
– AVL trees: guaranteed O(log n) operations
– B-Trees: also guaranteed O(log n), but shallower depth makes them better for disk-based databases
What would be even better?
– How about O(1) finds and inserts?

Hash Table Approach
[diagram: a hash function f(x) maps the keys Zasha, Steve, Nic, Brad, Ed into slots of a table]
But… is there a problem in this pipe-dream?

Hash Table Dictionary Data Structure
Hash function: maps keys to integers
– result: can quickly find the right spot for a given entry
Unordered and sparse table
– result: cannot efficiently list all entries,
– cannot find min and max efficiently,
– cannot find all items within a specified range efficiently.
[diagram: f(x) mapping Zasha, Steve, Nic, Brad, Ed into table slots]

Hash Table Terminology
[diagram: f(x) mapping the keys Zasha, Steve, Nic, Brad, Ed into a table, with one collision]
– hash function
– keys
– collision
– load factor = (# of entries in table) / tableSize
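For example (an illustrative number, not from the slide): a table of size 10 holding 7 entries has a load factor of 7/10 = 0.7.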

Hash Table Code First Pass

    Value& find(Key& key) {
        int index = hash(key) % tableSize;
        return Table[index];
    }

What should the hash function be?
What should the table size be?
How should we resolve collisions?

A Good Hash Function…
…is easy (fast) to compute (O(1) and practically fast).
…distributes the data evenly (hash(a) ≠ hash(b)).
…uses the whole hash table (for all 0 ≤ k < size, there's an i such that hash(i) % size = k).

Good Hash Function for Integers
Choose
– tableSize is prime
– hash(n) = n % tableSize
Example:
– tableSize = 7
– insert(4), insert(17), find(12), insert(9), delete(17)
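Working through that example (the slide's picture of the table isn't in the transcript, so the slot numbers below are just the arithmetic):

insert(4):   4 % 7 = 4  → slot 4
insert(17): 17 % 7 = 3  → slot 3
find(12):   12 % 7 = 5  → look in slot 5 (not found)
insert(9):   9 % 7 = 2  → slot 2
delete(17): 17 % 7 = 3  → remove from slot 3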

Good Hash Function for Strings? Ideas?

Good Hash Function for Strings?
Sum the ASCII values of the characters.
Consider only the first 3 characters.
– Uses only 2,871 out of 17,576 entries in the table on English words.
Think of the string as a base 128 number. Let s = s₁s₂s₃s₄…sₙ: choose
– hash(s) = s₁ + s₂·128 + s₃·128² + s₄·128³ + … + sₙ·128ⁿ⁻¹
Problems:
– hash("really, really big") = well… something really, really big
– hash("one thing") % 128 = hash("other thing") % 128

Making the String Hash Easy to Compute
Use Horner's Rule

    int hash(String s) {
        int h = 0;
        for (int i = s.length() - 1; i >= 0; i--) {
            h = (s[i] + 128 * h) % tableSize;   // s[i] is the ASCII value of the i-th character
        }
        return h;
    }
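As a quick sanity check (tableSize = 7 here is just an illustrative value): hashing "abc", whose characters have ASCII values 97, 98, 99, the loop runs right to left:

i = 2: h = (99 + 128·0) % 7 = 99  % 7 = 1
i = 1: h = (98 + 128·1) % 7 = 226 % 7 = 2
i = 0: h = (97 + 128·2) % 7 = 353 % 7 = 3

which equals (97 + 98·128 + 99·128²) % 7 = 1,634,657 % 7 = 3, the base-128 value of "abc" mod tableSize.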

Universal Hashing
For any fixed hash function, there will be some pathological sets of inputs
– everything hashes to the same cell!
Solution: Universal Hashing
– Start with a large (parameterized) class of hash functions: no sequence of inputs is bad for all of them!
– When your program starts up, pick one of the hash functions to use at random (for the entire time)
– Now: no bad inputs, only unlucky choices! If the universal class is large, the odds of making a bad choice are very low
– If you do find you are in trouble, just pick a different hash function and re-hash the previous inputs

Universal Hash Function: “Random” Vector Approach
Parameterized by a prime size and a vector a = ⟨a₀, a₁, …, a_r⟩ where 0 ≤ aᵢ < size
Represent each key as r + 1 integers ⟨k₀, k₁, …, k_r⟩ where kᵢ < size
– size = 11, key = … ==> ⟨…⟩
– size = 29, key = “hello world” ==> ⟨…⟩
h_a(k) = (Σᵢ aᵢ·kᵢ) mod size: the dot product with a “random” vector!
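A small sketch of that dot product, assuming the key has already been broken into the integers k₀ … k_r and the random vector a was drawn once at program start-up (illustrative code, not from the lecture):

    #include <cstdlib>
    #include <vector>
    using std::vector;

    // size must be prime; each a[i] and k[i] must lie in [0, size).
    int universalHash(const vector<int>& a, const vector<int>& k, int size) {
        long long sum = 0;                      // keep the running dot product from overflowing
        for (size_t i = 0; i < k.size(); i++) {
            sum = (sum + (long long)a[i] * k[i]) % size;
        }
        return (int)sum;
    }

    // Draw the "random" vector once, when the program starts.
    vector<int> pickRandomVector(int r, int size) {
        vector<int> a(r + 1);
        for (int i = 0; i <= r; i++) {
            a[i] = rand() % size;               // 0 <= a[i] < size
        }
        return a;
    }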

Universal Hash Function
Strengths:
– works on any type as long as you can form the kᵢ's
– if we're building a static table, we can try many a's
– a random a has guaranteed good properties no matter what we're hashing
Weaknesses:
– must choose a prime table size larger than any kᵢ

Hash Function Summary
Goals of a hash function
– reproducible mapping from key to table entry
– evenly distribute keys across the table
– separate commonly occurring keys (neighboring keys?)
– complete quickly
Hash functions
– h(n) = n % size
– h(n) = (string as base 128 number) % size
– Universal hash function #1: dot product with a random vector

How to Design a Hash Function
– Know what your keys are
– Study how your keys are distributed
– Try to include all important information in a key in the construction of its hash
– Try to make “neighboring” keys hash to very different places
– Prune the features used to create the hash until it runs “fast enough” (very application dependent)

Collisions
Pigeonhole principle says we can't avoid all collisions
– try to hash m keys into n slots with m > n without collision
– try to put 6 pigeons into 5 holes
What do we do when two keys hash to the same entry?
– open hashing: put little dictionaries in each entry
– closed hashing: pick a next entry to try
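For concreteness, a toy sketch of the open-hashing (separate chaining) idea, where each table entry holds a little list of the keys that hashed there (illustrative code, not the course's; it reuses the Horner-style string hash from the earlier slide):

    #include <list>
    #include <string>
    #include <vector>
    using namespace std;

    // Each slot holds a "little dictionary" -- here, just a linked list of keys.
    class ChainedHashSet {
    public:
        explicit ChainedHashSet(int size) : table(size) {}

        void insert(const string& key) {
            table[hash(key)].push_back(key);        // colliding keys simply share a slot's list
        }

        bool find(const string& key) const {
            const list<string>& bucket = table[hash(key)];
            for (const string& k : bucket) {
                if (k == key) return true;
            }
            return false;
        }

    private:
        int hash(const string& s) const {           // Horner's rule, as in the earlier slide
            int h = 0;
            for (int i = (int)s.length() - 1; i >= 0; i--) {
                h = (s[i] + 128 * h) % (int)table.size();
            }
            return h;
        }
        vector<list<string>> table;
    };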