CSE332: Data Abstractions Lecture 10: More B Trees; Hashing Dan Grossman Spring 2010.


B+ Tree Review
M-ary tree with room for L data items at each leaf
Order property: the subtree between keys x and y contains only data that is ≥ x and < y (notice the ≥)
Balance property: all nodes and leaves at least half full, and all leaves at the same height
find and insert are efficient
–insert uses splitting to handle overflow, which may require splitting the parent, and so on recursively
[Diagram: internal node with keys 3, 7, 12, 21 dividing the key space into x<3, 3≤x<7, 7≤x<12, 12≤x<21, and 21≤x]
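The order property above is exactly what drives find: at each internal node, follow the child whose key range covers the search key, then scan the leaf. A minimal sketch in Java, with a simplified and hypothetical node layout (the class and helper names are illustrative, not from the slides):

```java
// Hypothetical sketch of B+ tree find. Internal nodes hold sorted keys;
// child i covers keys[i-1] <= key < keys[i]. Leaves hold the data items.
class BPlusTreeSketch {
    static class Node {
        boolean isLeaf;
        int[] keys;      // sorted separator keys (internal) or data items (leaf)
        Node[] children; // internal nodes: children.length == keys.length + 1
    }

    static Node leaf(int... items) {
        Node n = new Node(); n.isLeaf = true; n.keys = items; return n;
    }

    static Node internal(int[] keys, Node... kids) {
        Node n = new Node(); n.keys = keys; n.children = kids; return n;
    }

    // Walk down using the order property, then scan the leaf.
    static boolean find(Node n, int key) {
        while (!n.isLeaf) {
            int i = 0;
            while (i < n.keys.length && key >= n.keys[i]) i++;
            n = n.children[i];
        }
        for (int item : n.keys)
            if (item == key) return true;
        return false;
    }

    // Tiny example tree: separators 12 and 25 over three leaves.
    static Node buildExample() {
        return internal(new int[]{12, 25}, leaf(3, 7), leaf(12, 18), leaf(25, 30));
    }
}
```

With M=3 and L=3 this mirrors the running example used in the deletion slides below.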

Can do a little better with insert
Eventually we have to split up to the root (the tree will fill)
But we can sometimes avoid splitting via adoption
–Change which leaf is correct by changing parent keys
–This idea “in reverse” is necessary in deletion (next)
Example: Insert(31), handled by splitting
[Diagram: Insert(31) resolved by splitting the overfull leaf]

Adoption for insert
Same example, avoiding the split:
Example: Insert(31), handled by adoption
[Diagram: Insert(31) resolved by adoption; an item moves to a neighbor leaf and a parent key is updated]

And Now for Deletion… (M = 3, L = 3)

Delete(32)
[Diagram: tree state after removing 32 from its leaf]

Delete(15)
What’s wrong? Adopt from a neighbor!
[Diagram: the leaf underflows and adopts an item from a neighbor; the parent key is updated]

Delete(16)
Uh-oh, neighbors at their minimum!
Move in together and remove leaf – now parent might underflow; it has neighbors
[Diagram: the underflowing leaf merges with a neighbor; the parent loses a child and resolves its own underflow via its neighbors]

Delete(14)
[Diagram: tree state after removing 14]

Delete(18)
[Diagram: tree states as the underflow from removing 18 is resolved]

Deletion Algorithm
1. Remove the data from its leaf
2. If the leaf now has ⌈L/2⌉ − 1 items, underflow!
–If a neighbor has > ⌈L/2⌉ items, adopt and update the parent
–Else merge the node with a neighbor
 Guaranteed to have a legal number of items
 The parent now has one less node
3. If step (2) caused the parent to have ⌈M/2⌉ − 1 children, underflow!
–…

Deletion algorithm continued
3. If an internal node has ⌈M/2⌉ − 1 children
–If a neighbor has > ⌈M/2⌉ children, adopt and update the parent
–Else merge the node with a neighbor
 Guaranteed to have a legal number of children
 The parent now has one less node; may need to continue up the tree
If we merge all the way up through the root, that’s fine unless the root went from 2 children to 1
–In that case, delete the root and make its child the new root
–This is the only case that decreases tree height
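The underflow rule in step (2) condenses to a small decision: no underflow, adopt, or merge. A hedged sketch, with L = 3 as in the running example (class and method names are hypothetical):

```java
// Sketch of the leaf-underflow rule from the deletion algorithm:
// after removing an item, a leaf with fewer than ceil(L/2) items
// either adopts from a richer neighbor or merges with it.
class LeafUnderflowSketch {
    static final int L = 3;
    static final int MIN_ITEMS = (L + 1) / 2;   // ceil(L/2) = 2 for L = 3

    // Decide what happens to a leaf that just lost an item.
    static String afterRemove(int leafCount, int neighborCount) {
        if (leafCount >= MIN_ITEMS) return "ok";        // no underflow
        if (neighborCount > MIN_ITEMS) return "adopt";  // neighbor can spare one
        return "merge";                                 // parent loses a child
    }
}
```

The same shape applies one level up with M in place of L, which is why merges can cascade to the root.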

Efficiency of delete
Find correct leaf: O(log2 M · logM n)
Remove from leaf: O(L)
Adopt from or merge with neighbor: O(L)
Adopt or merge all the way up to the root: O(M logM n)
Total: O(L + M logM n)
But it’s not that bad:
–Merges are not that common
–Remember, disk accesses were the name of the game: O(logM n)
Note: worth comparing the insertion and deletion algorithms

B Trees in Java?
For most of our data structures, we have encouraged writing high-level, reusable code, such as in Java with generics. It is worth knowing enough about “how Java works” to understand why this is probably a bad idea for B trees
–Assuming our goal is an efficient number of disk accesses
–Java has many advantages, but it wasn’t designed for this
–If you just want a balanced tree with worst-case logarithmic operations, no problem
 If M=3, this is called a 2-3 tree
 If M=4, this is called a 2-3-4 tree
The key issue is extra levels of indirection…

Naïve approaches
Even if we assume data items have int keys, you cannot get the data representation you want for “really big data”:

interface Keyed<E> { int key(E e); }

class BTreeNode<E extends Keyed<E>> {
  static final int M = 128;
  int[] keys = new int[M-1];
  BTreeNode<E>[] children = new BTreeNode[M];
  int numChildren = 0;
  …
}

class BTreeLeaf<E> {
  static final int L = 32;
  E[] data = (E[]) new Object[L];
  int numItems = 0;
  …
}

What that looks like
[Diagram: a BTreeNode is 3 separate objects, each with “header words”: the node object plus its keys array and children array; a BTreeLeaf holds an array of references, so the data objects themselves are not in contiguous memory]

The moral
The whole idea behind B trees was to keep related data in contiguous memory. All the red references on the previous slide are inappropriate
–As a minor point, beware the extra “header words”
But that’s “the best you can do” in Java
–Again, the advantage is generic, reusable code
–But for your performance-critical web-index, this is not the way to implement your B-Tree for terabytes of data
C# may have better support for “flattening objects into arrays”; C and C++ definitely do
Levels of indirection matter!

Possible “fixes”
Don’t use generics
–No help: all non-primitive types are reference types
For internal nodes, use an array of pairs of keys and references
–No, that’s even worse!
Instead of an array, have M fields (key1, key2, key3, …)
–Gets the flattening, but now the code for shifting and binary search can’t use loops (tons of code for large M)
–Similar issue for leaf nodes

Conclusion: Balanced Trees
Balanced trees make good dictionaries because they guarantee logarithmic-time find, insert, and delete
–Essential and beautiful computer science
–But only if you can maintain balance within the time bound
AVL trees maintain balance by tracking height and allowing all children to differ in height by at most 1
B trees maintain balance by keeping nodes at least half full and all leaves at the same height
Other great balanced trees (see text; worth knowing they exist):
–Red-black trees: all leaves have depth within a factor of 2
–Splay trees: self-adjusting; amortized guarantee; no extra space for height information

Hash Tables
Aim for constant-time (i.e., O(1)) find, insert, and delete
–“On average” under some reasonable assumptions
A hash table is an array of some fixed size
Basic idea: a hash function maps the key space (e.g., integers, strings) to an index in [0, TableSize − 1]: index = h(key)

Hash tables
There are m possible keys (m typically large, even infinite), but we expect our table to have only n items, where n is much less than m (often written n << m)
Many dictionaries have this property
–Compiler: all possible identifiers allowed by the language vs. those used in some file of one program
–Database: all possible student names vs. students enrolled
–AI: all possible chess-board configurations vs. those considered by the current player
–…

Hash functions
An ideal hash function:
–Is fast to compute
–“Rarely” hashes two “used” keys to the same index
 Often impossible in theory; easy in practice
 Will handle collisions in next lecture

Who hashes what?
Hash tables can be generic
–To store elements of type E, we just need E to be:
 1. Comparable: order any two E (like with all dictionaries)
 2. Hashable: convert any E to an int
When hash tables are a reusable library, the division of responsibility generally breaks down into two roles: the client converts E to an int, and the library converts that int to a table index and resolves any collisions
We will learn both roles, but most programmers “in the real world” spend more time as clients while understanding the library

More on roles
Both roles must contribute to minimizing collisions (heuristically)
Client should aim for different ints for expected items
–Avoid “wasting” any part of E or the 32 bits of the int
Library should aim for putting “similar” ints in different indices
–Conversion to index is almost always “mod table-size”
–Using prime numbers for table-size is common
There is some ambiguity in terminology about which parts count as “hashing”

What to hash?
In lecture we will consider the two most common things to hash: integers and strings
–If you have objects with several fields, it is usually best to have most of the “identifying fields” contribute to the hash to avoid collisions
–Example: class Person { String first; String middle; String last; int age; }
–An inherent trade-off: hashing-time vs. collision-avoidance
 Bad idea(?): only use first name
 Good idea(?): only use middle initial
Admittedly, what-to-hash is often an unprincipled guess
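One hedged way to make all of the Person fields above contribute is to fold each field's hash into a running value with a multiplier. The 31-multiplier used here is a common Java convention, not something the slide prescribes:

```java
// Sketch of the slide's Person example: combine the identifying fields
// so that two people differing in any field usually hash differently.
class Person {
    String first, middle, last;
    int age;

    Person(String f, String m, String l, int a) {
        first = f; middle = m; last = l; age = a;
    }

    @Override
    public int hashCode() {
        int h = first.hashCode();       // start from one field's hash...
        h = 31 * h + middle.hashCode(); // ...then fold in each remaining field
        h = 31 * h + last.hashCode();
        h = 31 * h + age;
        return h;
    }
}
```

Using only one field (say, first name) is the "bad idea(?)" from the slide: everyone named Ann collides.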

Hashing integers
Key space = integers
Simple hash function: h(key) = key % TableSize
–Client: f(x) = x
–Library: g(x) = x % TableSize
–Fairly fast and natural
Example:
–TableSize = 10
–Insert 7, 18, 41, 34, 10
–(As usual, ignoring data “along for the ride”)

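The example above can be checked directly; a tiny sketch (class name hypothetical):

```java
// Demo of the slide's example: h(key) = key % TableSize with
// TableSize = 10, inserting 7, 18, 41, 34, 10.
// Resulting indices: 7 -> 7, 18 -> 8, 41 -> 1, 34 -> 4, 10 -> 0.
class ModHashDemo {
    static int hash(int key, int tableSize) {
        return key % tableSize;   // assumes non-negative keys
    }
}
```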

Collision-avoidance
With “x % TableSize”, the number of collisions depends on
–the ints inserted (obviously)
–TableSize
Larger table-size tends to help, but not always
–Example: 7, 18, 41, 34, 10 with TableSize = 10 and TableSize = 7
Technique: pick the table size to be prime. Why?
–Real-life data tends to have a pattern, and “multiples of 61” are probably less likely than “multiples of 60”
–Next time we’ll see that one collision-handling strategy does provably better with a prime table size

More on prime table size
If TableSize is 60 and…
–lots of data items are multiples of 5, we waste 80% of the table
–lots of data items are multiples of 10, we waste 90% of the table
–lots of data items are multiples of 2, we waste 50% of the table
If TableSize is 61…
–collisions can still happen, but 5, 10, 15, 20, … will fill the table
–collisions can still happen, but 10, 20, 30, 40, … will fill the table
–collisions can still happen, but 2, 4, 6, 8, … will fill the table
In general, if x and y are “co-prime” (meaning gcd(x,y) == 1), then (a * x) % y == (b * x) % y if and only if a % y == b % y
–So it is good to have a TableSize that has no common factors with any “likely pattern” x
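The co-prime claim can be demonstrated by counting how many distinct buckets the multiples of a "pattern" x actually reach. A small sketch (class and method names hypothetical):

```java
// With gcd(x, y) == 1, the multiples of x cycle through all y residues
// mod y, so no buckets are wasted. With a common factor, most are.
import java.util.HashSet;
import java.util.Set;

class CoPrimeDemo {
    // How many distinct buckets do the first y multiples of x hit, mod y?
    static int distinctBuckets(int x, int y) {
        Set<Integer> seen = new HashSet<>();
        for (int a = 1; a <= y; a++) seen.add((a * x) % y);
        return seen.size();
    }
}
```

With x = 5, a table of size 60 reaches only 12 of 60 buckets (the 80% waste from the slide), while a table of size 61 reaches all 61.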

Okay, back to the client
If keys aren’t ints, the client must convert to an int
–Trade-off: speed vs. distinct keys hashing to distinct ints
Very important example: strings
–Key space K = s0s1s2…sm-1 (where the si are chars: si ∈ [0,52] or si ∈ [0,256] or si ∈ [0,2^16])
–Some choices: which avoid collisions best?
 1. h(K) = s0 % TableSize
 2. h(K) = (s0 + s1 + … + sm-1) % TableSize
 3. h(K) = (s0 + s1·37 + s2·37^2 + … + sm-1·37^(m-1)) % TableSize
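A hedged sketch of the three choices (the summation formulas above were partly reconstructed; the 37 multiplier in choice 3 matches the later slide on combining hash functions). Choice 3 distinguishes anagrams such as “abcde” and “ebcda”, which choices 1 and 2 cannot:

```java
// The three candidate string hashes. Class and method names hypothetical.
class StringHashes {
    static int h1(String s, int tableSize) {   // choice 1: first char only
        return s.charAt(0) % tableSize;
    }

    static int h2(String s, int tableSize) {   // choice 2: sum of chars
        int sum = 0;
        for (int i = 0; i < s.length(); i++) sum += s.charAt(i);
        return sum % tableSize;
    }

    static int h3(String s, int tableSize) {   // choice 3: sum of s_i * 37^i
        int h = 0;
        for (int i = s.length() - 1; i >= 0; i--) h = h * 37 + s.charAt(i);
        return Math.floorMod(h, tableSize);    // h can overflow to negative
    }
}
```

Choice 1 collides for every pair of strings sharing a first character; choice 2 collides for all anagrams; choice 3 is slower but spreads much better.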

Specializing hash functions
How might you hash differently if all your strings were web addresses (URLs)?

Combining hash functions
A few rules of thumb / tricks:
1. Use all 32 bits (careful: that includes negative numbers)
2. Use different overlapping bits for different parts of the hash
–This is why a factor of 37^i works better than 256^i
–Example: “abcde” and “ebcda”
3. When smashing two hashes into one hash, use bitwise-xor
–bitwise-and produces too many 0 bits
–bitwise-or produces too many 1 bits
4. Rely on the expertise of others; consult books and other resources
5. If keys are known ahead of time, choose a perfect hash