CSE 326: Data Structures Lecture #15 When Keys Collide


1 CSE 326: Data Structures Lecture #15 When Keys Collide
Steve Wolfman Winter Quarter 2000 Today we’ll finish off that hash function stuff from Monday and move on to the exciting stuff! What happens… WHEN KEYS COLLIDE

2 Today's Outline
Hash functions
Collisions and collision resolution
Rehashing
After we finish hash functions and collision resolution, we'll discuss rehashing.

3 Alternate Universal Hash Function
Parameterized by k, a, and b:
  k * size should fit into an int
  a and b must be less than size
  h_{k,a,b}(x) = ((a*x + b) % (k*size)) / k   (integer division)
Here is an alternative universal hash function to the one we talked about on Monday. Note that, like the other one, this is parameterized. The key to these universal functions is that we can fiddle with the parameters to get a new (good) hash function. That means that in some sense there are no bad input cases for these (they're not always bad). This one is based on pseudo-random number generation. Essentially, it does what your computer does when calculating the next "random" number in a sequence.

4 Alternate Universal Hash Function: Example
Context: hash integers in a table of size 16; let k = 32, a = 100, b = 200
h_{k,a,b}(1000) = ((100*1000 + 200) % (32*16)) / 32 = (100200 % 512) / 32 = 360 / 32 = 11
Here's an example of how it works.
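To make the arithmetic above concrete, here is a minimal C++ sketch of this hash family (the struct and its member names are mine, not the slides'):

#include <cassert>

// Sketch of the alternate universal hash: h_{k,a,b}(x) = ((a*x + b) % (k*size)) / k.
// When k and size are both powers of 2, the % and / become a mask and a shift,
// which is where the "very efficient" claim on the next slide comes from.
struct AltUniversalHash {
    long k, a, b, size;   // long, so a*x + b is less likely to overflow

    long operator()(long x) const {
        return ((a * x + b) % (k * size)) / k;
    }
};

int main() {
    AltUniversalHash h = {32, 100, 200, 16};  // the slide's example parameters
    assert(h(1000) == 11);  // ((100*1000 + 200) % 512) / 32 == 360 / 32 == 11
    return 0;
}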

5 Thinking about the Alternate Universal Hash Function
Strengths:
  if we're building a static table, we can try many parameter values
  random a, b has guaranteed good properties no matter what we're hashing
  can choose any size table
  very efficient if k and size are powers of 2
Weaknesses:
  still need to turn non-integer keys into integers
We'll see later that we can actually do even better than this with universal hashing. We can "adapt" to input as we go!

6 Hash Function Summary
Goals of a hash function:
  reproducible mapping from key to table entry
  evenly distribute keys across the table
  separate commonly occurring keys (neighboring keys?)
  complete quickly
Hash functions:
  h(n) = n % size
  h(n) = (string as base-128 number) % size
  Multiplication hash: compute percentage through the table
  Universal hash function #1: dot product with random vector
  Universal hash function #2: next pseudo-random number
The idea of neighboring keys here may change from application to application. In one context, neighboring keys may be those with the same last characters or first characters… say, when hashing names in a school system. Many people may have the same last names or first names (but few will have the same of both).
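As a sketch of the "string as base-128 number" entry in the list above (the function name is mine, not the slides'), Horner's rule with a mod at each step keeps the intermediate value from overflowing:

#include <string>

// h(s) = (s interpreted as a base-128 number) % size, via Horner's rule.
// Reducing mod size at every step keeps the running value small.
unsigned hashString(const std::string & s, unsigned size) {
    unsigned h = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i)
        h = (h * 128 + static_cast<unsigned char>(s[i])) % size;
    return h;
}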

7 How to Design a Hash Function
Know what your keys are
Study how your keys are distributed
Try to include all important information in a key in the construction of its hash
Try to make "neighboring" keys hash to very different places
Prune the features used to create the hash until it runs "fast enough" (very application dependent)

8 Collisions Pigeonhole principle says we can’t avoid all collisions
try to hash m keys into n slots with m > n without collision
try to put 6 pigeons into 5 holes
What do we do when two keys hash to the same entry?
  open hashing: put little dictionaries in each entry
  closed hashing: pick a next entry to try
Shove extra pigeons in one hole!
The pigeonhole principle is a vitally important mathematical principle that asks what happens when you try to shove k+1 pigeons into k pigeon-sized holes. Don't snicker. The fact is that no hash function can perfectly hash m keys into fewer than m slots. They won't fit. What do we do? 1) Shove the pigeons in anyway (open hashing). 2) Try somewhere else when we're shoving two pigeons into the same place (closed hashing). Does closed hashing solve the original problem?

9 Open Hashing or Hashing with Chaining
h(a) = h(d), h(e) = h(b)
Put a little dictionary at each entry: choose the type as appropriate; the common case is an unordered linked list (chain)
Properties: λ can be greater than 1; performance degrades with the length of the chains
[Figure: a table of slots 0-6; slot 1 chains a and d; slot 3 chains e and b; slot 5 holds c]
What decides what is appropriate? Memory requirements, speed requirements, the expected size of the dictionaries, and how easy comparison is (< vs. ==). Why an unordered linked list then? Small memory requirement (near zero if the dictionary is empty!), fast enough if small, and it only needs == comparison. Where should I put a new entry? (Think splay trees.) What _might_ I do on a successful search?

10 Open Hashing Code
Dictionary & findBucket(const Key & k) {
  return table[hash(k) % table.size];
}
void insert(const Key & k, const Value & v) {
  findBucket(k).insert(k, v);
}
void delete(const Key & k) {   // pseudocode: "delete" is a reserved word in real C++
  findBucket(k).delete(k);
}
Value & find(const Key & k) {
  return findBucket(k).find(k);
}
Here's the code for an open hashing table. You can see that as long as we already have a dictionary implementation, this is pretty simple.

11 Load Factor in Open Hashing
Search cost:
  unsuccessful search: λ
  successful search: λ/2 + 1
Desired load factor: between ½ and 1
Let's analyze it. How long does an unsuccessful search take? We have to traverse the whole list wherever we hash to. How long is the whole list? On average, λ. How about successful search? On average we traverse only half of the list before we find the one we're looking for (but we still have to check that one): λ/2 + 1. So, what load factor might we want? Obviously, zero! But we can shoot for between ½ and 1.

12 Closed Hashing or Open Addressing
What if we only allow one key at each entry?
  two objects that hash to the same spot can't both go there
  first one there gets the spot; the next one must go in another spot
Properties: λ ≤ 1; performance degrades with the difficulty of finding the right spot
h(a) = h(d), h(e) = h(b)
[Figure: a table of slots 0-6; a in slot 1, d in 2, e in 3, b in 4, c in 5]
Alright, what if we stay inside the table? Then we have to try somewhere else when two things collide. That means we need a strategy for finding the next spot. Moreover, that strategy needs to be deterministic. Why? It also means that we cannot have a load factor larger than 1.

13 Probing
Probing how to:
  First probe: given a key k, hash to h(k)
  Second probe: if h(k) is occupied, try h(k) + f(1)
  Third probe: if h(k) + f(1) is occupied, try h(k) + f(2)
  and so forth
Probing properties:
  we force f(0) = 0
  the ith probe is to (h(k) + f(i)) mod size
  if i reaches size - 1, the probe has failed
  depending on f(), the probe may fail sooner
  long sequences of probes are costly!
Probing is our technique for finding a good spot. (The X-FILES graphic on the slide is just a joke: no X-Files probing, strictly above the belt. Thanks to Corey!) One probe goes to the normal hashed spot; the rest are our search for the true spot, and, if the load factor is less than 1, we hope that the true spot is out there. Here is what we need out of our probing function.

14 Linear Probing
f(i) = i
Probe sequence is:
  h(k) mod size
  (h(k) + 1) mod size
  (h(k) + 2) mod size
  …
findEntry using linear probing:
bool findEntry(const Key & k, Entry *& entry) {
  int probePoint = hash1(k);
  do {
    entry = &table[probePoint];
    probePoint = (probePoint + 1) % size;
  } while (!entry->isEmpty() && entry->key != k);
  return !entry->isEmpty();
}
One simple strategy is to just scan through the table. The code above does this, stopping when we find an empty spot or the key we're looking for.
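As a sketch of how insert would sit on top of findEntry (my own code; the Entry fields and helper are assumptions, not from the slides): findEntry leaves entry pointing either at the matching entry or at the first empty slot on the probe sequence.

// Sketch only: assumes Entry has key/value fields and a makeFull helper.
void insert(const Key & k, const Value & v) {
    Entry * entry;
    if (findEntry(k, entry)) {
        entry->value = v;    // key already present: overwrite its value
    } else {
        entry->key = k;      // entry is the first empty slot we probed
        entry->value = v;
        entry->makeFull();   // hypothetical: mark the slot occupied
    }
}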

15 Linear Probing Example
insert(76): 76%7 = 6 → slot 6 (1 probe)
insert(93): 93%7 = 2 → slot 2 (1 probe)
insert(40): 40%7 = 5 → slot 5 (1 probe)
insert(47): 47%7 = 5, occupied; try 6, occupied; try 0 → slot 0 (3 probes)
insert(10): 10%7 = 3 → slot 3 (1 probe)
insert(55): 55%7 = 6, occupied; try 0, occupied; try 1 → slot 1 (3 probes)
Final table: 0:47, 1:55, 2:93, 3:10, 4:empty, 5:40, 6:76
probes: 1 1 1 3 1 3
You can see that this works pretty well for an empty table and gets worse as the table fills up. There's another problem here. If a bunch of elements hash to the same spot, they mess each other up. But, worse, if a bunch of elements hash to the same area of the table, they mess each other up! (Even though the hash function isn't producing lots of collisions!) This phenomenon is called primary clustering.

16 Load Factor in Linear Probing
For any λ < 1, linear probing will find an empty slot
Search cost (for large table sizes):
  successful search: ½ (1 + 1/(1-λ))
  unsuccessful search: ½ (1 + 1/(1-λ)²)
Linear probing suffers from primary clustering
Performance quickly degrades for λ > ½
There's some really nasty math that shows that (because of things like primary clustering) each search with linear probing costs the amounts above. That's 2.5 probes per unsuccessful search for λ = 0.5, and 50.5 comparisons for λ = 0.9. We don't want to let λ get above ½ for linear probing.
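Plugging load factors into these formulas reproduces the numbers in the notes; a quick sketch (my own, not from the slides):

#include <cstdio>

// Knuth's linear-probing cost approximations, as quoted on the slide.
double successfulCost(double lambda) {
    return 0.5 * (1.0 + 1.0 / (1.0 - lambda));
}
double unsuccessfulCost(double lambda) {
    return 0.5 * (1.0 + 1.0 / ((1.0 - lambda) * (1.0 - lambda)));
}

int main() {
    std::printf("%.1f\n", unsuccessfulCost(0.5));  // 2.5 probes
    std::printf("%.1f\n", unsuccessfulCost(0.9));  // 50.5 probes
    return 0;
}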

17 Quadratic Probing
f(i) = i²
Probe sequence is:
  h(k) mod size
  (h(k) + 1) mod size
  (h(k) + 4) mod size
  (h(k) + 9) mod size
  …
findEntry using quadratic probing:
bool findEntry(const Key & k, Entry *& entry) {
  int probePoint = hash1(k), numProbes = 0;
  do {
    entry = &table[probePoint];
    numProbes++;
    // i² - (i-1)² = 2i - 1, so each step just adds 2*numProbes - 1
    probePoint = (probePoint + 2*numProbes - 1) % size;
  } while (!entry->isEmpty() && entry->key != k);
  return !entry->isEmpty();
}
Quadratic probing gets rid of primary clustering. Here, we just go to the next, then fourth, then ninth, then sixteenth, and so forth element. You can see in the code that we don't actually have to compute squares; still, this is a bit more complicated than linear probing (NOT MUCH!)

18 Quadratic Probing Example
insert(76): 76%7 = 6 → slot 6 (1 probe)
insert(40): 40%7 = 5 → slot 5 (1 probe)
insert(48): 48%7 = 6, occupied; try (6+1)%7 = 0 → slot 0 (2 probes)
insert(5): 5%7 = 5, occupied; try 6, occupied; try (5+4)%7 = 2 → slot 2 (3 probes)
insert(55): 55%7 = 6, occupied; try 0, occupied; try (6+4)%7 = 3 → slot 3 (3 probes)
Final table: 0:48, 2:5, 3:55, 5:40, 6:76
probes: 1 1 2 3 3
And, it works reasonably well. Here it does fine until the table is more than half full. Let's take a closer look at what happens when the table is half full.

19 Quadratic Probing Example
insert(76): 76%7 = 6 → slot 6 (1 probe)
insert(93): 93%7 = 2 → slot 2 (1 probe)
insert(40): 40%7 = 5 → slot 5 (1 probe)
insert(35): 35%7 = 0 → slot 0 (1 probe)
insert(47): 47%7 = 5, occupied; the probe sequence cycles through occupied slots and never finds an empty one
Table before insert(47): 0:35, 2:93, 5:40, 6:76
probes: 1 1 1 1, then insert(47) fails
What happens when we get to inserting 47? It hashes to slot 5. So we try 5, 6, 2, 0, 6, 2, … We never find an empty slot! Now, it's not as bad as all that: after size - 1 extra probes, we can tell we're not going to get anywhere, and we can quit. What do we do then?

20 Quadratic Probing Succeeds (for λ ≤ ½)
If size is prime and λ ≤ ½, then quadratic probing will find an empty slot in size/2 probes or fewer.
Show: for all 0 ≤ i, j ≤ size/2 with i ≠ j,
  (h(x) + i²) mod size ≠ (h(x) + j²) mod size
By contradiction: suppose that for some such i and j,
  (h(x) + i²) mod size = (h(x) + j²) mod size
  ⇒ i² mod size = j² mod size
  ⇒ (i² - j²) mod size = 0
  ⇒ [(i + j)(i - j)] mod size = 0
Since size is prime, that forces (i + j) mod size = 0 or (i - j) mod size = 0. But how can i + j = 0 or i + j = size when i ≠ j and i, j ≤ size/2? The same argument rules out (i - j) mod size = 0.
Well, we can guarantee that this succeeds for loads under ½. And, in fact, for large tables it _usually_ succeeds as long as we're not close to 1. But it's not guaranteed!
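The claim is also easy to sanity-check by brute force; a quick sketch of my own: for a prime table size, the first size/2 + 1 quadratic probe locations from any home slot are all distinct.

#include <cassert>
#include <set>

// For prime size, the probes h, h+1, h+4, ..., h+(size/2)² (mod size)
// never collide with each other, so a table at most half full has room.
int main() {
    const int size = 7;  // any prime works; try 11, 13, ...
    for (int h = 0; h < size; ++h) {
        std::set<int> slots;
        for (int i = 0; i <= size / 2; ++i)
            slots.insert((h + i * i) % size);
        assert((int)slots.size() == size / 2 + 1);  // no repeats
    }
    return 0;
}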

21 Quadratic Probing May Fail (for λ > ½)
For any i larger than size/2, there is some j smaller than i that adds with i to equal size (or a multiple of size), so probe i lands on the same slot as probe j did. D'oh!

22 Load Factor in Quadratic Probing
For any λ ≤ ½, quadratic probing will find an empty slot; for greater λ, quadratic probing may find a slot
Quadratic probing does not suffer from primary clustering
Quadratic probing does suffer from secondary clustering
How could we possibly solve this?
At λ = ½, about 1.5 probes per successful search. That's actually pretty close to perfect… but there are two problems. First, we might fail if the load factor is above ½. Second, quadratic probing still suffers from secondary clustering: multiple keys hashed to the same spot all follow the same probe sequence. How can we solve that?

23 Double Hashing
f(i) = i * hash2(k)
Probe sequence is:
  h1(k) mod size
  (h1(k) + 1*h2(k)) mod size
  (h1(k) + 2*h2(k)) mod size
  …
findEntry using double hashing:
bool findEntry(const Key & k, Entry *& entry) {
  int probePoint = hash1(k), hashIncr = hash2(k);
  do {
    entry = &table[probePoint];
    probePoint = (probePoint + hashIncr) % size;
  } while (!entry->isEmpty() && entry->key != k);
  return !entry->isEmpty();
}
What we need to do is not take the same size steps for keys that hash to the same spot. So, let's do linear probing, but let's use another hash function to choose the number of steps to take on each probe.

24 A Good Double Hash Function…
is quick to evaluate
differs from the original hash function
never evaluates to 0 (mod size)
One good choice: pick a prime R < size and use hash2(x) = R - (x mod R)
We have to be careful here. We still need a good hash function. Plus, the hash function has to differ substantially from the original (because things that hashed to the same place in the original should hash totally differently in the new one). And, it can never evaluate to 0 (or we'd just spin in place). One good possibility if the keys are already numerical is the choice above.
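For integer keys that choice is one line of C++ (a sketch; it assumes R is a prime smaller than the table size and x is nonnegative):

// hash2(x) = R - (x mod R) for a prime R < size.
// x % R is in [0, R-1], so the result is in [1, R]: never 0,
// and keys that agree on hash1 usually get different step sizes.
int hash2(int x, int R) {
    return R - (x % R);
}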

25 Double Hashing Example
insert(76): 76%7 = 6 → slot 6 (1 probe)
insert(93): 93%7 = 2 → slot 2 (1 probe)
insert(40): 40%7 = 5 → slot 5 (1 probe)
insert(47): 47%7 = 5, occupied; step = 5 - (47%5) = 3; (5+3)%7 = 1 → slot 1 (2 probes)
insert(10): 10%7 = 3 → slot 3 (1 probe)
insert(55): 55%7 = 6, occupied; step = 5 - (55%5) = 5; (6+5)%7 = 4 → slot 4 (2 probes)
Final table: 1:47, 2:93, 3:10, 4:55, 5:40, 6:76
probes: 1 1 1 2 1 2
Here's an example. I use R = 5. Notice that I don't have the problem I had with linear probing, where the 47 and 55 inserts were bad because of unrelated inserts.

26 Load Factor in Double Hashing
For any λ < 1, double hashing will find an empty slot (given an appropriate table size and hash2)
Search cost appears to approach optimal (random hash):
  successful search: (1/λ) ln(1/(1-λ))
  unsuccessful search: 1/(1-λ)
No primary clustering and no secondary clustering
One extra hash calculation
We can actually calculate things more precisely than this, but let's not. Basically, this gets us optimal performance (actually, we _can_ do better). Quadratic may be a better choice just to avoid the extra hash, however.

27 Deletion in Closed Hashing
[Figure: 2 and 7 hash to the same slot; 7 was placed after 2 by probing. delete(2) physically empties 2's slot, and then find(7) stops at that empty slot: where is it?!]
You know, I never mentioned deletion for closed hashing. What happens if we use normal physical deletion? If we delete 2 and then try to find 7, we can't! Fortunately, we can use lazy deletion. But we pay the usual penalties. Plus, we can actually run out of space in a hash table!
Must use lazy deletion! On insertion, treat a deleted item as an empty slot.
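A minimal sketch of lazy deletion (the state names and fields below are mine, not the slides'): each slot records whether it is empty, full, or deleted; probing walks past deleted slots, but insertion may reuse them.

typedef int Key;    // stand-ins for the table's real key and value types
typedef int Value;

enum SlotState { EMPTY, FULL, DELETED };

struct Entry {
    Key key;
    Value value;
    SlotState state;

    // Probe sequences stop only at truly empty slots, so a DELETED slot
    // doesn't cut off keys that probed past it before the delete.
    bool isEmpty() const { return state == EMPTY; }

    // On insertion, a deleted slot is as good as an empty one.
    bool isAvailable() const { return state != FULL; }
};

void lazyDelete(Entry & e) { e.state = DELETED; }  // mark, don't erase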

28 The Squished Pigeon Principle
An insert using closed hashing cannot work with a load factor of 1 or more.
An insert using closed hashing with quadratic probing may not work with a load factor of ½ or more.
Whether you use open or closed hashing, large load factors lead to poor performance!
How can we relieve the pressure on the pigeons?
This brings us to what happens when space does get tight. In other words, how do we handle the squished pigeon principle: it's hard to fit lots of pigeons into not enough holes. Whatever will we do? Remember what we did for circular arrays and d-Heaps? Just resize the array, right? But we can't just copy elements over, since their hash values change with the table size. Hint: remember what happened when we overran a d-Heap's array!

29 Rehashing
When the load factor gets "too large" (over a constant threshold on λ), rehash all the elements into a new, larger table:
  takes O(n), but amortized O(1) as long as we (just about) double the table size on the resize
  spreads keys back out, may drastically improve performance
  gives us a chance to retune parameterized hash functions
  avoids failure for closed hashing techniques
  allows arbitrarily large tables starting from a small table
  clears out lazily deleted items
So, we rehash the table. I said that over a constant threshold is the time to rehash. There are lots of alternative triggers: too many deleted items, a failed insertion, too many collisions, too long a collision sequence. Regardless, the effect is the same. Are there any problems here? In section, Zasha will talk about why this runs in good amortized time.
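A sketch of rehashing for the chaining (open hashing) case, with my own names and an int-keyed table for concreteness: every entry must be re-inserted, because its slot, hash(k) % size, changes when size does.

#include <cstddef>
#include <list>
#include <utility>
#include <vector>

typedef std::vector<std::list<std::pair<int, int> > > ChainedTable;

// Move every (key, value) pair into a table of (just about) double the
// size. O(n) for this call, amortized O(1) per insert across resizes.
void rehash(ChainedTable & table, unsigned (*hash)(int)) {
    ChainedTable bigger(2 * table.size() + 1);  // roughly double
    for (std::size_t b = 0; b < table.size(); ++b) {
        for (std::list<std::pair<int, int> >::iterator it = table[b].begin();
             it != table[b].end(); ++it) {
            bigger[hash(it->first) % bigger.size()].push_back(*it);
        }
    }
    table.swap(bigger);  // lazily deleted items (in the closed-hashing
                         // version) would simply not be copied over
}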

30 To Do Continue Project III Read chapter 5 in the book
Even if you don’t yet have a team you should absolutely start looking at the project III code. Hopefully, you’ve already started on your solution. Also, read that chapter 5. If it makes you feel any better, none of the universal hashing stuff is in chapter 5. Nor will it be on the exam.

31 Coming Up
Case study
Extendible hashing (hashing for HUGE data sets)
CIDR student interview (February 11th)
Disjoint-set union-find ADT
Fourth quiz (February 10th!)
Project III due (February 16th)
Next time, we'll talk through deciding what kind of table to use for an example. We'll look at extendible hashing for huge data sets. And, we'll have the CIDR student interview… so, come with your thinking caps on.

