CSE 326: Data Structures Lecture #14 (presentation transcript)

1 CSE 326: Data Structures Lecture #14
More, Bigger Hash Please
Bart Niswonger, Summer Quarter 2001

2 Today’s Outline
Project: rules of competition, comments?
Hashing: probing continued, rehashing, extendible hashing, case study
A whole lotta hashing going on! We’ll get to the hashing – but first the project stuff: new assignment, communication. Rules for maze runners? Comments?

3 Cost of a Database Query
This is a trace of how much time a simple database operation takes: this one lists all employees along with their job information, getting the employee and job information from separate databases. The thing to notice is that disk access takes something like 100 times as much time as processing. I told you disk access was expensive! BTW: the “index” in the picture is a B-Tree. For SQL Server, the I/O to CPU ratio is 300!

4 Extendible Hashing
Hashing technique for huge data sets:
- optimizes to reduce disk accesses
- each hash bucket fits on one disk block
- better than B-Trees if order is not important
Table contains:
- buckets, each fitting in one disk block, with the data
- a directory that fits in one disk block, used to hash to the correct bucket
OK, extendible hashing takes hashing and makes it work for huge data sets. The idea is very similar to B-Trees. A minimal sketch of this two-level layout in code follows below.
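To make the layout concrete, here is a minimal in-memory sketch of the directory-plus-buckets organization. It is illustrative only: BLOCK_CAPACITY, the class names, and the use of Python’s built-in hash are assumptions, not part of the lecture.

```python
# Minimal in-memory sketch of the extendible-hashing layout: a directory of
# 2**global_depth slots, each pointing at a bucket that "fits in one block".
BLOCK_CAPACITY = 4      # stand-in for how many records fit in one disk block
ADDRESS_BITS = 32       # directory bits are taken from the top of a 32-bit hash

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth   # how many leading hash bits all keys here share
        self.items = {}                  # key -> value; at most BLOCK_CAPACITY entries

class Directory:
    def __init__(self):
        self.global_depth = 1
        self.slots = [Bucket(1), Bucket(1)]   # 2**global_depth slots

    def slot_for(self, key):
        # Index the directory with the top global_depth bits of the hash.
        h = hash(key) & ((1 << ADDRESS_BITS) - 1)
        return h >> (ADDRESS_BITS - self.global_depth)
```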

5 Extendible Hash Table
Directory: entries labeled by k bits, each a pointer to the bucket holding all keys starting with those bits
Each block contains keys & data matching on the first j ≤ k bits
Directory for k = 3: 000 001 010 011 100 101 110 111
Buckets (number of matching prefix bits in parentheses):
(2) 00001 00011 00100 00110
(2) 01001 01011 01100
(3) 10001 10011
(3) 10101 10110 10111
(2) 11001 11011 11100 11110
Notice that those are the hashed keys I’m representing in the table. Actually, the buckets would store the original key/value pairs.
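As a quick check of the bit arithmetic, the snippet below (Python, purely illustrative) takes a few of the 5-bit hashed keys shown on the slide and prints the k = 3 directory slot each one indexes, i.e. its top three bits.

```python
# For each 5-bit hashed key from the slide, the directory slot is its top k = 3 bits.
K = 3
HASH_BITS = 5

for key in ["00110", "01011", "10011", "10110", "11110"]:
    value = int(key, 2)
    slot = value >> (HASH_BITS - K)          # keep only the leading K bits
    print(f"hashed key {key} -> directory slot {slot:0{K}b}")
# 00110 -> 001, 01011 -> 010, 10011 -> 100, 10110 -> 101, 11110 -> 111
```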

6 Inserting (easy case): insert(11011)
Before, the directory 000 001 010 011 100 101 110 111 points to buckets:
(2) 00001 00011 00100 00110
(2) 01001 01011 01100
(3) 10001 10011
(3) 10101 10110 10111
(2) 11001 11100 11110
After the insert, the last bucket becomes (2) 11001 11011 11100 11110; everything else is unchanged.
Inserting if there’s room in the bucket is easy. Just like in B-Trees. This takes 2 disk accesses: one to grab the directory, and one for the bucket. But what if there isn’t room?
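A sketch of the easy case in code (Python; the bucket representation and capacity are assumptions): read the directory slot, read that bucket, and if it still has room, just add the key.

```python
# Easy-case insert: the target bucket still has room, so two "disk reads"
# (directory, then bucket) and one write of the bucket are enough.
BLOCK_CAPACITY = 4

def insert_easy(directory, global_depth, key, hash_bits=5):
    slot = key >> (hash_bits - global_depth)   # top global_depth bits of the hashed key
    bucket = directory[slot]                   # access 1: directory; access 2: bucket
    if len(bucket["keys"]) < BLOCK_CAPACITY:
        bucket["keys"].append(key)             # room left: write the bucket back, done
        return True
    return False                               # bucket full: caller must split (next slide)

# The slide's example: insert hashed key 11011 into the bucket shared by slots 110 and 111.
bucket_11 = {"depth": 2, "keys": [0b11001, 0b11100, 0b11110]}
directory = {0b110: bucket_11, 0b111: bucket_11}
print(insert_easy(directory, global_depth=3, key=0b11011))   # True; the bucket now holds 4 keys
```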

7 Splitting: insert(11000)
Before, the directory 000 001 010 011 100 101 110 111 points to buckets:
(2) 00001 00011 00100 00110
(2) 01001 01011 01100
(3) 10001 10011
(3) 10101 10110 10111
(2) 11001 11011 11100 11110
After the insert, the full (2) bucket splits into (3) 11000 11001 11011 and (3) 11100 11110; directory entries 110 and 111 now point to different buckets.
Then, as in B-Trees, we need to split a node. In this case, that turns out to be easy because we have room in the directory for another node. How many disk accesses does this take? 3: one for the directory, one for the old bucket, and one for the new bucket.
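Here is a sketch of that split (Python, illustrative only): because the full bucket matches on fewer bits than the directory uses, we can split it on the next bit without growing the directory; we only redistribute its keys and repoint the affected directory slots.

```python
# Split a full bucket whose local depth is still less than the directory depth:
# distribute its keys on the next bit and repoint the directory slots that used it.
HASH_BITS = 5

def split_bucket(directory, global_depth, bucket):
    new_depth = bucket["depth"] + 1
    bit = HASH_BITS - new_depth                      # position of the distinguishing bit
    zeros = [k for k in bucket["keys"] if not (k >> bit) & 1]
    ones  = [k for k in bucket["keys"] if (k >> bit) & 1]
    b0 = {"depth": new_depth, "keys": zeros}
    b1 = {"depth": new_depth, "keys": ones}
    for slot, b in list(directory.items()):          # repoint every slot that shared the old bucket
        if b is bucket:
            directory[slot] = b1 if (slot >> (global_depth - new_depth)) & 1 else b0
    return b0, b1

# The slide's example: the pending key 11000 plus the old (2) bucket's keys get redistributed.
old = {"depth": 2, "keys": [0b11000, 0b11001, 0b11011, 0b11100, 0b11110]}
directory = {0b110: old, 0b111: old}
b0, b1 = split_bucket(directory, global_depth=3, bucket=old)
print([bin(k) for k in b0["keys"]], [bin(k) for k in b1["keys"]])
# ['0b11000', '0b11001', '0b11011'] ['0b11100', '0b11110']
```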

8 Rehashing: insert(10010)
No room to insert and no adoption!
Here the directory has only 2-bit entries (00 01 10 11), and the buckets shown, e.g. (2) 01101, (2) 10000 10001 10011 10111, and (2) 11001 11110, already match on all 2 bits. The bucket for prefix 10 is full, so it cannot split without more directory bits.
Expand the directory to 3 bits: 000 001 010 011 100 101 110 111. Now, it’s just a normal split.
Now, even though there’s tons of space in the table, we have to expand the directory. Once we’ve done that, we’re back to the case on the previous page. How many disk accesses does this take? 3! Just as for a normal split. The only problem is if the directory ends up too large to fit in a block. So, what does this say about the hash function we want for extendible hashing? We want something parameterized so we can rehash if necessary to get the buckets evened out.
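A sketch of the doubling step itself (Python, illustrative only): every existing directory entry is simply copied into the two new slots that extend its prefix by one bit, so no bucket data moves; after that, the full bucket splits exactly as on the previous slide.

```python
# Doubling the directory: each old slot's bucket pointer is copied into the two
# new slots that extend its prefix by one more bit. No bucket contents move yet.
def expand_directory(directory, global_depth):
    new_directory = {}
    for slot, bucket in directory.items():
        new_directory[(slot << 1)]     = bucket   # old prefix followed by 0
        new_directory[(slot << 1) | 1] = bucket   # old prefix followed by 1
    return new_directory, global_depth + 1

# The slide's situation: a 2-bit directory where the bucket for prefix 10 is full.
full_10 = {"depth": 2, "keys": [0b10000, 0b10001, 0b10011, 0b10111]}
other   = {"depth": 2, "keys": [0b11001, 0b11110]}
directory = {0b10: full_10, 0b11: other}
directory, global_depth = expand_directory(directory, 2)
print(sorted(bin(s) for s in directory))   # slots 100, 101, 110, 111 now exist
# After this, inserting 10010 is just a normal split of the prefix-10 bucket.
```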

9 When Directory Gets Too Large
Store only pointers to the items:
- (potentially) much smaller items
- fewer items in the directory
- one extra disk access!
Rehash:
- potentially better distribution over the buckets
- fewer unnecessary items in the directory
- can’t solve the problem if there’s simply too much data
What if these don’t work? Use a B-Tree to store the directory!
If the directory gets too large, we’re in trouble. In that case, we have two ways to fix it. One puts more keys in each bucket. The other evens out the buckets. Neither is guaranteed to work.

10 Rehash of Hashing
Hashing is a great data structure for storing unordered data that supports insert, delete & find.
Both separate chaining (open) and open addressing (closed) hashing are useful:
- separate chaining is flexible
- closed hashing uses less storage, but performs badly with load factors near 1
- extendible hashing is for very large disk-based data
Hashing pros and cons:
+ very fast
+ simple to implement, supports insert, delete, find
- lazy deletion necessary in open addressing, can waste storage
- does not support operations dependent on order: min, max, range

11 Case Study: Spelling Dictionary
Spelling dictionary:
- 30,000 words
- static
- arbitrary(ish) preprocessing time
Goals:
- fast spell checking
- minimal storage
Practical notes:
- almost all searches are successful
- words average about 8 characters in length
- 30,000 words at 8 bytes/word is 1/4 MB
- pointers are 4 bytes
- there are many regularities in the structure of English words
Why? Alright, let’s move on to the case study. Here’s the situation. Why are most searches successful? Because most words will be spelled correctly.

12 Solutions
- sorted array + binary search
- open hashing
- closed hashing + linear probing
Remember that we get arbitrary preprocessing time. What can we do with that? We should use a universal hash function (or at least a parameterized hash function), because that will let us rehash during preprocessing until we’re happy with the distribution; a sketch of one parameterized string hash follows below. What kind of hash function should we use?
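For illustration, here is one common parameterized string hash, a polynomial hash whose multiplier acts as the seed. The constants and names are assumptions, not from the lecture; the point is that preprocessing can keep re-seeding until the buckets look even.

```python
# A parameterized (seeded) polynomial string hash: changing `seed` changes the whole
# mapping, so preprocessing can try seeds until the table is well balanced.
def seeded_hash(word, seed, table_size):
    h = 0
    for ch in word:
        h = (h * seed + ord(ch)) % 2**32   # classic polynomial rolling hash
    return h % table_size

def bucket_counts(words, seed, table_size):
    counts = [0] * table_size
    for w in words:
        counts[seeded_hash(w, seed, table_size)] += 1
    return counts

# Preprocessing idea: keep trying seeds until no bucket (or probe chain) is too crowded.
words = ["hash", "table", "probe", "bucket", "split", "directory"]
for seed in (31, 37, 131):
    counts = bucket_counts(words, seed, table_size=8)
    print("seed", seed, counts, "longest chain:", max(counts))
```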

13 Storage
Assume words are strings & entries are pointers to strings:
- Array + binary search
- Closed hashing (open addressing)
- Open hashing (separate chaining)
Notice that all of these use the same amount of memory for the strings. How many pointers does each one use?
- Array: n, just a pointer to each string.
- Open hashing: n (a pointer to each string) + n/λ (a null pointer ending each list). Note, I’m playing a little game here; without an optimization, this would actually be 2n + n/λ.
- Closed hashing: n/λ.

14 Analysis
Binary search:
- storage: n pointers + words = 360 KB
- time: log2 n ≈ 15 probes/access, worst case
Open hashing:
- storage: n + n/λ pointers + words (λ = 1 → 600 KB)
- time: 1 + λ/2 probes/access, average (λ = 1 → 1.5)
Closed hashing:
- storage: n/λ pointers + words (λ = 0.5 → 480 KB)
- time: (1 + 1/(1 - λ))/2 probes/access, average (λ = 0.5 → 1.5)
Which one should we use? The answer, of course, is it depends! However, given that we want fast checking, either hash table is better than binary search. And, given that we want low memory usage, we should use the closed hash table. Can anyone argue for the binary search?
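The storage figures are easy to re-derive from the case-study numbers (30,000 words, about 8 bytes per word, 4-byte pointers); the short script below just redoes that arithmetic. Note that the 600 KB open-hashing figure matches the unoptimized 2n + n/λ pointer count mentioned on the storage slide.

```python
# Re-deriving the storage numbers from the case-study parameters (slide 11).
n = 30_000            # words
word_bytes = 8 * n    # ~8 bytes per word -> 240 KB of string data
ptr = 4               # bytes per pointer

def kb(x): return f"{x / 1000:.0f} KB"

# Binary search: one pointer per word plus the words themselves.
print("binary search:", kb(n * ptr + word_bytes))                   # 360 KB

# Open hashing at lambda = 1: 2n node pointers + n/lambda list heads, plus words.
lam = 1.0
print("open hashing:", kb((2 * n + n / lam) * ptr + word_bytes))    # 600 KB

# Closed hashing at lambda = 0.5: a table of n/lambda pointer slots, plus words.
lam = 0.5
print("closed hashing:", kb((n / lam) * ptr + word_bytes))          # 480 KB
```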

15 To Do
Start Project III (only 1 week!)
Start reading chapter 8 in the book.
Even if you don’t yet have a team, you should absolutely start thinking about Project III. The code should not be hard to write, I think, but the algorithm may take you a bit.

16 Coming Up
Disjoint-set union-find ADT
Quiz (Thursday)
Next time, we’ll look at the disjoint-set union-find ADT.

