CSE 221: Algorithms and Data Structures Lecture #8 What do you get when you put too many pigeons in one hole? Pigeon Hash Steve Wolfman 2009W1

Learning Goals After this unit, you should be able to: Define various forms of the pigeonhole principle; recognize and solve the specific types of counting and hashing problems to which they apply. Provide examples of the types of problems that can benefit from a hash data structure. Compare and contrast open addressing and chaining. Evaluate collision resolution policies. Describe the conditions under which hashing can degenerate from O(1) expected complexity to O(n). Identify the types of search problems that do not benefit from hashing (e.g. range searching) and explain why. Manipulate data in hash structures both irrespective of implementation and also within a given implementation.

Today’s Outline Constant-Time Dictionaries? Hash Table Outline Hash Functions Collisions and the Pigeonhole Principle Collision Resolution: Chaining Open-Addressing OK, so you didn’t need to know BSTs before now, but after today and Monday, you will!

Reminder: Dictionary ADT. Dictionary operations: create, destroy, insert, find, delete. Stores values associated with user-specified keys; values may be any (homogeneous) type; keys may be any (homogeneous) comparable type. Example entries: midterm → "would be tastier with brownies"; prog-project → "so painful… who invented templates?"; wolf → "the perfect mix of oomph and Scrabble value". Example operations: insert(brownies, "tasty"); find(wolf) returns "the perfect mix of oomph and Scrabble value". Dictionaries associate some key with a value, just like a real dictionary (where the key is a word and the value is its definition). In this example, I've stored 221 data associated with text reviews. This is probably the most valuable and widely used ADT we'll hit. I'll give you an example in a minute that should firmly entrench this concept.

Implementations So Far (insert / find / delete): unsorted list: O(1) / O(n) / O(n); trees: O(log n) / O(log n) / O(log n); special case of integer keys between 0 and k: O(1) / O(1) / O(1). Linked list: O(1), O(n), O(n). Unsorted array: O(1), O(n), O(n). Sorted array: O(n), O(log n), O(n). The sorted array is oh-so-close: O(log n) find time and almost O(log n) insert time. What's wrong? Let's look at how that search goes: draw the recursive calls (and potential recursive calls) in binary search. Note how it starts looking like a binary tree where the left subtrees have smaller elements and the right subtrees have bigger elements. What if we could store the whole thing in the structure this recursive search is building? How about O(1) insert/find/delete for any key type?

Hash Table Goal: We can do a[2] = "Andrew". We want to do a["Steve"] = "Andrew". (Figure: on the left, an array indexed 0 through k-1 holding "Andrew" at index 2; on the right, the same table indexed directly by string keys such as "Dan", "Steve", "Shervin", "Ben", …, "Martin", with "Andrew" stored under "Steve".)
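For a concrete sense of that target interface, here is a minimal C++ sketch (my addition, not from the slides) using the standard library's std::unordered_map, which is itself a hash table:

#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    // What we can already do with a plain array: index by small integers.
    std::string a[3];
    a[2] = "Andrew";

    // What we want: index by arbitrary keys, such as strings.
    std::unordered_map<std::string, std::string> table;
    table["Steve"] = "Andrew";
    std::cout << table["Steve"] << "\n";   // prints: Andrew
}

The rest of the lecture is about how a structure like that can be built.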

Today’s Outline Constant-Time Dictionaries? Hash Table Outline Hash Functions Collisions and the Pigeonhole Principle Collision Resolution: Chaining Open-Addressing OK, so you didn’t need to know BSTs before now, but after today and Monday, you will!

Hash Table Approach: keys (Zasha, Steve, Nic, Brad, Ed) are mapped by a function f(x) into slots of a table. But… is there a problem in this pipe dream?

Hash Table Dictionary Data Structure. Hash function: maps keys to integers; result: we can quickly find the right spot for a given entry. Unordered and sparse table; result: we cannot efficiently list all entries in order. (Figure: keys Zasha, Steve, Nic, Brad, Ed mapped by f(x) into the table.)

Hash Table Terminology: hash function, keys, collision, load factor λ = (# of entries in table) / tableSize. (Figure: keys Vicki, Steve, Dan, Ben, and Martin fed through a hash function f(x); two of them collide, i.e., hash to the same slot.)

Hash Table Code First Pass

Value & find(Key & key) {
  int index = hash(key) % tableSize;
  return Table[index];
}

What should the hash function be? What should the table size be? How should we resolve collisions?

Today’s Outline Constant-Time Dictionaries? Hash Table Outline Hash Functions Collisions and the Pigeonhole Principle Collision Resolution: Chaining Open-Addressing OK, so you didn’t need to know BSTs before now, but after today and Monday, you will!

A Good Hash Function… is easy (fast) to compute (O(1) and practically fast), distributes the data evenly (hash(a) % size ≠ hash(b) % size), and uses the whole hash table (for all 0 ≤ k < size, there's an i such that hash(i) % size = k).

Good Hash Function for Integers: choose a prime tableSize and hash(n) = n, so the slot is simply n % tableSize. Example with tableSize = 7: insert(4), insert(17), find(12), insert(9), delete(17). (Figure: a table with slots 0 through 6.)
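A brief worked trace of that example (my own annotation, not from the slides), using index = n % 7:

insert(4):  4 % 7 = 4, so 4 goes in slot 4.
insert(17): 17 % 7 = 3, so 17 goes in slot 3.
find(12):   12 % 7 = 5; slot 5 is empty, so 12 is not found.
insert(9):  9 % 7 = 2, so 9 goes in slot 2.
delete(17): 17 % 7 = 3; remove 17 from slot 3.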

Good Hash Function for Strings? Let s = s1s2s3s4…sn and choose hash(s) = s1 + s2·128 + s3·128² + s4·128³ + … + sn·128^(n-1); that is, think of the string as a base-128 number. Problems: hash("really, really big") = well… something really, really big; and hash("one thing") % 128 = hash("other thing") % 128, since mod 128 only the first character survives, and both strings start with 'o'.

Making the String Hash Easy to Compute: use Horner's Rule.

int hash(const string & s) {
  int h = 0;
  for (int i = s.length() - 1; i >= 0; i--) {
    h = (s[i] + 128*h) % tableSize;
  }
  return h;
}

Making the String Hash Cause Few Conflicts Ideas? Just make sure the tableSize is not a multiple of 128 (e.g., is prime!).

Hash Function Summary. Goals of a hash function: reproducible mapping from key to table entry; evenly distribute keys across the table; separate commonly occurring keys (neighboring keys?); complete quickly. Sample hash functions: h(n) = n % size; h(n) = (string as a base-128 number) % size; multiplication hash: compute percentage through the table; universal hash function #1: dot product with a random vector; universal hash function #2: next pseudo-random number. The idea of neighboring keys here may change from application to application. In one context, neighboring keys may be those with the same last characters or first characters… say, when hashing names in a school system. Many people may have the same last names or first names (but few will have the same of both).

How to Design a Hash Function: Know what your keys are, or study how your keys are distributed. Try to include all important information in a key in the construction of its hash. Try to make "neighboring" keys hash to very different places. Prune the features used to create the hash until it runs "fast enough" (very application dependent).

Today’s Outline Constant-Time Dictionaries? Hash Table Outline Hash Functions Collisions and the Pigeonhole Principle Collision Resolution: Chaining Open-Addressing OK, so you didn’t need to know BSTs before now, but after today and Monday, you will!

Collisions Pigeonhole principle says we can’t avoid all collisions try to hash without collision m keys into n slots with m > n try to put 6 pigeons into 5 holes The pigeonhole principle is a vitally important mathematical principle that asks what happens when you try to shove k+1 pigeons into k pigeon sized holes. Don’t snicker. But, the fact is that no hash function can perfectly hash m keys into fewer than m slots. They won’t fit. What do we do? 1) Shove the pigeons in anyway. 2) Try somewhere else when we’re shoving two pigeons in the same place. Does closed hashing solve the original problem?

The Pigeonhole Principle (informal): You can't put k+1 pigeons into k holes without putting two pigeons in the same hole. This place just isn't coo anymore. (Pigeonhole image by en:User:McKay, used under the Creative Commons Attribution-ShareAlike 3.0 license.)

The Pigeonhole Principle (formal): Let X and Y be finite sets where |X| > |Y|. If f : X → Y, then f(x1) = f(x2) for some x1, x2 ∈ X, where x1 ≠ x2. Now that's coo! (Figure: f mapping X to Y, with x1 ≠ x2 sent to the same element f(x1) = f(x2).)

The Pigeonhole Principle (Example #1): Suppose we have 5 colours of Halloween candy, and that there's lots of candy in a bag. How many pieces of candy do we have to pull out of the bag if we want to be sure to get 2 of the same colour? Choices: 2, 4, 6, 8, none of these. Answer: 6. With 6 candies and 5 possible colours, some colour must have two candies.

The Pigeonhole Principle (Example #2): If there are 1000 pieces of each colour, how many do we need to pull to guarantee that we'll get 2 black pieces of candy (assuming that black is one of the 5 colours)? Choices: 2, 4, 6, 8, none of these. Answer: 4002. This is not an appropriate problem for the pigeonhole principle! We don't know which hole has two pigeons!

The Pigeonhole Principle (Example #3): If 5 points are placed in a 6cm × 8cm rectangle, argue that there are two points that are not more than 5cm apart. Hint: how long is the diagonal? Consider the rectangle as four 3cm × 4cm boxes, and it really is the classic PHP: five darts, four boxes, so two must land in the same box. Therefore, those two are at most 5cm apart (opposite corners of a box, across its 3-4-5 diagonal).

The Pigeonhole Principle (Example #4): For a, b ∈ Z, we write a divides b as a|b, meaning ∃ c ∈ Z such that b = ac. Consider n+1 distinct positive integers, each ≤ 2n. Show that one of them must divide one of the others. For example, if n = 4, consider the following sets: {1, 2, 3, 7, 8}, {2, 3, 4, 7, 8}, {2, 3, 5, 7, 8}. Hint: any integer can be written as 2^k * q where k is a non-negative integer and q is odd. E.g., 129 = 2^0 * 129; 60 = 2^2 * 15. There are only n odd numbers between 0 and 2n. By the PHP, with n+1 numbers, two of them, when written as 2^k * q, must use the same odd number q but different k's. So, one is a power-of-2 multiple of the other, and hence divides it.

The Pigeonhole Principle (Full Glory): Let X and Y be finite sets with |X| = n, |Y| = m, and k = ⌈n/m⌉. If f : X → Y, then ∃ k values x1, x2, …, xk ∈ X such that f(x1) = f(x2) = … = f(xk). Informally: if n pigeons fly into m holes, at least 1 hole contains at least k = ⌈n/m⌉ pigeons. Proof: assume there's no such hole. Then every hole contains at most ⌈n/m⌉ - 1 pigeons, so there are at most (⌈n/m⌉ - 1)*m pigeons in all the holes. Since ⌈n/m⌉ < n/m + 1, that is fewer than (n/m + 1 - 1)*m = (n/m)*m = n pigeons, which is a contradiction. QED

Today’s Outline Constant-Time Dictionaries? Hash Table Outline Hash Functions Collisions and the Pigeonhole Principle Collision Resolution: Chaining Open-Addressing OK, so you didn’t need to know BSTs before now, but after today and Monday, you will!

Collision Resolution Pigeonhole principle says we can’t avoid all collisions try to hash without collision m keys into n slots with m > n try to put 6 pigeons into 5 holes What do we do when two keys hash to the same entry? chaining: put little dictionaries in each entry open addressing: pick a next entry to try shove extra pigeons in one hole! The pigeonhole principle is a vitally important mathematical principle that asks what happens when you try to shove k+1 pigeons into k pigeon sized holes. Don’t snicker. But, the fact is that no hash function can perfectly hash m keys into fewer than m slots. They won’t fit. What do we do? 1) Shove the pigeons in anyway. 2) Try somewhere else when we’re shoving two pigeons in the same place. Does closed hashing solve the original problem?

Hashing with Chaining: put a little dictionary at each entry; choose its type as appropriate; the common case is an unordered linked list (a chain). Properties: λ can be greater than 1; performance degrades with the length of the chains. (Figure: a size-7 table where h(a) = h(d) and h(e) = h(b), so one slot chains a and d, another chains e and b, and c sits alone in its own slot.) What decides what is appropriate? Memory requirements, speed requirements, expected size of the little dictionaries, and how easy comparison is (< vs. ==). Why an unordered linked list, then? Small memory requirement (near zero if the dictionary is empty!), fast enough if small, and it only needs == comparison. Where should I put a new entry? (Think splay trees.) What _might_ I do on a successful search?

Chaining Code

Dictionary & findBucket(const Key & k) {
  return table[hash(k) % table.size];
}

void insert(const Key & k, const Value & v) {
  findBucket(k).insert(k, v);
}

void delete(const Key & k) {
  findBucket(k).delete(k);
}

Value & find(const Key & k) {
  return findBucket(k).find(k);
}

Here's the code for an open hashing table. You can see that as long as we already have a dictionary implementation, this is pretty simple.
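To make that concrete, here is a small self-contained chaining sketch (my own illustration, not the course code); the bucket type, hash function, and table size are all assumptions:

#include <list>
#include <optional>
#include <string>
#include <vector>

class ChainedHashTable {
  // Each slot holds a small unordered "dictionary": a linked list (chain) of key/value pairs.
  std::vector<std::list<std::pair<int, std::string>>> table;

public:
  explicit ChainedHashTable(size_t size = 7) : table(size) {}

  // h(n) = n % tableSize, as in the integer-key example above.
  size_t bucket(int key) const { return static_cast<size_t>(key) % table.size(); }

  void insert(int key, const std::string & value) {
    table[bucket(key)].push_back({key, value});
  }

  std::optional<std::string> find(int key) const {
    for (const auto & entry : table[bucket(key)])
      if (entry.first == key) return entry.second;
    return std::nullopt;                     // not found
  }

  void remove(int key) {
    table[bucket(key)].remove_if([key](const std::pair<int, std::string> & e) {
      return e.first == key;
    });
  }
};

Note that the load factor can exceed 1 here; the chains simply get longer.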

Load Factor in Chaining. Search cost: an unsuccessful search costs about λ (traverse the whole chain at the hashed slot); a successful search costs about λ/2 + 1 (traverse half the chain on average, plus the entry found). Desired load factor: a small constant, say between 1/2 and 1. Let's analyze it. How long does an unsuccessful search take? Well, we have to traverse the whole list wherever we hash to. How long is the whole list? On average, λ. How about a successful search? On average we traverse only half of the list before we find the one we're looking for (but we still have to check that one): λ/2 + 1. So, what load factor might we want? Obviously, ZERO! But we can shoot for between 1/2 and 1.

Today’s Outline Constant-Time Dictionaries? Hash Table Outline Hash Functions Collisions and the Pigeonhole Principle Collision Resolution: Chaining Open-Addressing

Open Addressing: What if we only allow one key at each entry? Two objects that hash to the same spot can't both go there; the first one there gets the spot, and the next one must go in another spot. Properties: λ ≤ 1; performance degrades with the difficulty of finding the right spot. (Figure: a size-7 table where h(a) = h(d) and h(e) = h(b); a takes its slot and d spills into the next one, e takes its slot and b spills below it, and c sits in its own slot.) Alright, what if we stay inside the table? Then, we have to try somewhere else when two things collide. That means we need a strategy for finding the next spot. Moreover, that strategy needs to be deterministic. Why? It also means that we cannot have a load factor larger than 1.

Probing. How to probe: first probe, given a key k, hash to h(k); second probe, if h(k) is occupied, try h(k) + f(1); third probe, if h(k) + f(1) is occupied, try h(k) + f(2); and so forth. Probing properties: the ith probe is to (h(k) + f(i)) mod size, where f(0) = 0; if i reaches size, the insert has failed; depending on f(), the insert may fail sooner; long sequences of probes are costly! Probing is our technique for finding a good spot. Note: no X-Files probing: strictly above the belt (thanks to Corey :) One probe goes to the normal hashed spot; the rest are our search for the true spot, and, if the load factor is less than 1, we hope that the true spot is out there. Here is what we need out of our probing function.

Linear Probing: f(i) = i. The probe sequence is h(k) mod size, (h(k) + 1) mod size, (h(k) + 2) mod size, … One simple strategy is to just scan through the table, stopping when we find an empty spot or the key we're looking for. findEntry using linear probing:

bool findEntry(const Key & k, Entry *& entry) {
  int probePoint = hash1(k);
  do {
    entry = &table[probePoint];
    probePoint = (probePoint + 1) % size;
  } while (!entry->isEmpty() && entry->key != k);
  return !entry->isEmpty();
}

Linear Probing Example: insert(76) with 76%7 = 6; insert(93) with 93%7 = 2; insert(40) with 40%7 = 5; insert(47) with 47%7 = 5; insert(10) with 10%7 = 3; insert(55) with 55%7 = 6. Probes per insert: 1, 1, 1, 3, 1, 3. (Figure: six snapshots of the size-7 table; at the end, slot 0 holds 47, slot 1 holds 55, slot 2 holds 93, slot 3 holds 10, slot 5 holds 40, and slot 6 holds 76.) You can see that this works pretty well for an empty table and gets worse as the table fills up. There's another problem here. If a bunch of elements hash to the same spot, they mess each other up. But, worse, if a bunch of elements hash to the same area of the table, they mess each other up! (Even though the hash function isn't producing lots of collisions!) This phenomenon is called primary clustering.
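A small sketch (my own, not the course code) that replays this trace and counts probes; the table size and insertion order come from the example:

#include <cstdio>
#include <vector>

int main() {
    const int SIZE = 7;
    std::vector<int> table(SIZE, -1);          // -1 marks an empty slot
    int keys[] = {76, 93, 40, 47, 10, 55};

    for (int k : keys) {
        int probes = 0;
        int slot = k % SIZE;
        // Linear probing: f(i) = i, so step forward one slot at a time.
        while (true) {
            probes++;
            if (table[slot] == -1) { table[slot] = k; break; }
            slot = (slot + 1) % SIZE;
        }
        std::printf("insert(%2d): slot %d after %d probe(s)\n", k, slot, probes);
    }
    // Expected probe counts: 1 1 1 3 1 3, matching the slide.
}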

Load Factor in Linear Probing: For any λ < 1, linear probing will find an empty slot. Search cost (for large table sizes): successful search: ½(1 + 1/(1 - λ)); unsuccessful search: ½(1 + 1/(1 - λ)²). Linear probing suffers from primary clustering: values hashed close to each other probe the same slots. Performance quickly degrades for λ > 1/2. There's some really nasty math that shows that (because of things like primary clustering), each search with linear probing costs the amounts above. That's 2.5 probes per unsuccessful search for λ = 0.5 and 50.5 comparisons for λ = 0.9. We don't want to let λ get above 1/2 for linear probing.

Quadratic Probing: f(i) = i². The probe sequence is h(k) mod size, (h(k) + 1) mod size, (h(k) + 4) mod size, (h(k) + 9) mod size, … findEntry using quadratic probing:

bool findEntry(const Key & k, Entry *& entry) {
  int probePoint = hash1(k), numProbes = 0;
  do {
    entry = &table[probePoint];
    numProbes++;
    probePoint = (probePoint + 2*numProbes - 1) % size;
  } while (!entry->isEmpty() && entry->key != k);
  return !entry->isEmpty();
}

Quadratic probing gets rid of primary clustering. Here, we just go to the next, then fourth, then ninth, then sixteenth, and so forth element. You can see in the code that we don't actually have to compute squares (consecutive squares differ by 2*numProbes - 1); still, this is a bit more complicated than linear probing (NOT MUCH!)

Quadratic Probing Example: insert(76) with 76%7 = 6; insert(40) with 40%7 = 5; insert(48) with 48%7 = 6; insert(5) with 5%7 = 5; insert(55) with 55%7 = 6. Probes per insert: 1, 1, 2, 3, 3. (Figure: snapshots of the size-7 table; at the end, slot 0 holds 48, slot 2 holds 5, slot 3 holds 55, slot 5 holds 40, and slot 6 holds 76.) And, it works reasonably well. Here it does fine until the table is more than half full. Let's take a closer look at what happens when the table is half full.

Quadratic Probing Example (an insert that fails): insert(76) with 76%7 = 6; insert(93) with 93%7 = 2; insert(40) with 40%7 = 5; insert(35) with 35%7 = 0; insert(47) with 47%7 = 5. (Figure: snapshots of the size-7 table; before the last insert, slot 0 holds 35, slot 2 holds 93, slot 5 holds 40, and slot 6 holds 76.) What happens when we get to inserting 47? It hashes to slot 5. So we probe slots 5, 6, 2, 0, 0, 2, 6, … and we never find an empty slot! Now, it's not as bad as all that in inserting 47. Actually, after size-1 extra probes, we can tell we're not going to get anywhere, and we can quit. What do we do then?
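A minimal sketch (my own, not the course code) demonstrating that failed insert; the table contents and the cap of size probes follow the slide's example:

#include <cstdio>
#include <vector>

// Insert `key` with quadratic probing, giving up after SIZE probes.
// Returns the slot used, or -1 on failure.
int quadraticInsert(std::vector<int> & table, int key) {
    const int SIZE = (int)table.size();
    for (int i = 0; i < SIZE; i++) {
        int slot = (key % SIZE + i * i) % SIZE;   // f(i) = i*i
        if (table[slot] == -1) { table[slot] = key; return slot; }
    }
    return -1;   // the probe sequence never reached an empty slot
}

int main() {
    std::vector<int> table(7, -1);
    for (int k : {76, 93, 40, 35}) quadraticInsert(table, k);
    // The table is now more than half full; 47 hashes to an occupied slot and its
    // probe sequence (5, 6, 2, 0, 0, 2, 6, ...) repeats without touching slots 1, 3, or 4.
    std::printf("insert(47) -> slot %d\n", quadraticInsert(table, 47));   // prints -1
}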

Quadratic Probing Succeeds (for λ ≤ ½): If size is prime and λ ≤ ½, then quadratic probing will find an empty slot in size/2 probes or fewer. Show that for all 0 ≤ i, j ≤ ⌊size/2⌋ with i ≠ j, (h(x) + i²) mod size ≠ (h(x) + j²) mod size. By contradiction: suppose that for some such i, j, (h(x) + i²) mod size = (h(x) + j²) mod size. Then i² mod size = j² mod size, so (i² - j²) mod size = 0, i.e., [(i + j)(i - j)] mod size = 0. Since size is prime, it must divide i + j or i - j. But how can i + j = 0 or i + j = size when i ≠ j and i, j ≤ ⌊size/2⌋? The same argument applies to (i - j) mod size = 0. Well, we can guarantee that this succeeds for loads under 1/2. And, in fact, for large tables it _usually_ succeeds as long as we're not close to 1. But, it's not guaranteed!

Quadratic Probing May Fail (for λ > ½): For any i larger than size/2, there is some j smaller than i that adds with i to equal size (or a multiple of size). D'oh!

Load Factor in Quadratic Probing: For any λ ≤ ½, quadratic probing will find an empty slot; for greater λ, quadratic probing may fail to find one. Quadratic probing does not suffer from primary clustering, but it does suffer from secondary clustering: values hashed to the SAME index probe the same slots. How could we possibly solve this? At λ = 1/2, a successful search takes about 1.5 probes. That's actually pretty close to perfect… but there are two problems. First, we might fail if the load factor is above 1/2. Second, quadratic probing still suffers from secondary clustering. That's where multiple keys hashed to the same spot all follow the same probe sequence. How can we solve that?

Double Hashing: f(i) = i · hash2(k). The probe sequence is h1(k) mod size, (h1(k) + 1·h2(k)) mod size, (h1(k) + 2·h2(k)) mod size, … What we need to do is not take the same size steps for keys that hash to the same spot. So, let's do linear probing, but let's use another hash function to choose the number of steps to take on each probe. Code for finding the next probe:

bool findEntry(const Key & k, Entry *& entry) {
  int probePoint = hash1(k), hashIncr = hash2(k);
  do {
    entry = &table[probePoint];
    probePoint = (probePoint + hashIncr) % size;
  } while (!entry->isEmpty() && entry->key != k);
  return !entry->isEmpty();
}

A Good Double Hash Function… is quick to evaluate. differs from the original hash function. never evaluates to 0 (mod size). One good choice is to choose a prime R < size and: hash2(x) = R - (x mod R) We have to be careful here. We still need a good hash function. Plus, the hash function has to differ substantially from the original (because things that hashed to the same place in the original should hash totally differently in the new one). And, we can never evaluate to 0 (or we’d just spin in place). One good possibility if the keys are already numerical is …

Double Hashing Example: insert(76) with 76%7 = 6; insert(93) with 93%7 = 2; insert(40) with 40%7 = 5; insert(47) with 47%7 = 5 and step 5 - (47%5) = 3; insert(10) with 10%7 = 3; insert(55) with 55%7 = 6 and step 5 - (55%5) = 5. Probes per insert: 1, 1, 1, 2, 1, 2. (Figure: snapshots of the size-7 table; at the end, slot 1 holds 47, slot 2 holds 93, slot 3 holds 10, slot 4 holds 55, slot 5 holds 40, and slot 6 holds 76.) Here's an example. I use R = 5. Notice that I don't have the problem I had with linear probing where the 47 and 55 inserts were bad because of unrelated inserts.
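A matching sketch (again my own, not the course code) that replays this trace with hash2(x) = 5 - (x % 5):

#include <cstdio>
#include <vector>

int main() {
    const int SIZE = 7, R = 5;                 // R is a prime smaller than the table size
    std::vector<int> table(SIZE, -1);          // -1 marks an empty slot
    int keys[] = {76, 93, 40, 47, 10, 55};

    for (int k : keys) {
        int slot = k % SIZE;                   // hash1
        int step = R - (k % R);                // hash2: never 0, so we always move
        int probes = 1;
        while (table[slot] != -1) {
            slot = (slot + step) % SIZE;
            probes++;
        }
        table[slot] = k;
        std::printf("insert(%2d): slot %d, %d probe(s)\n", k, slot, probes);
    }
    // Expected probe counts: 1 1 1 2 1 2, matching the slide.
}

Because the table size is prime, every step size is relatively prime to it, so each probe sequence can reach every slot.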

Load Factor in Double Hashing: For any λ < 1, double hashing will find an empty slot (given an appropriate table size and hash2). Search cost appears to approach that of an ideal random hash: successful search: (1/λ)·ln(1/(1 - λ)); unsuccessful search: 1/(1 - λ). No primary clustering and no secondary clustering, at the price of one extra hash calculation. We can actually calculate things more precisely than this, but let's not. Basically, this gets us optimal performance (actually, we _can_ do better). Quadratic may be a better choice just to avoid the extra hash, however.

Deletion in Open Addressing: delete(2), then find(7). Where is it?! (Figure: a small table in which 7 was placed by probing past the slot holding 2.) You know, I never mentioned deletion for open addressing. What happens if we use normal physical deletion? If we delete 2 and then try to find 7, we can't: the search hits the now-empty slot where 2 used to be and stops before reaching 7. Fortunately, we can use lazy deletion. But, we pay the usual penalties. Plus, we can actually run out of space in a hash table! Must use lazy deletion! On insertion, treat a deleted item as an empty slot.
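A minimal sketch of lazy deletion (mine, not the course code): each slot carries a state, find probes past DELETED slots, and insert may reuse them. Linear probing is assumed, and the sketch assumes the table never fills completely:

#include <vector>

struct Slot {
    enum State { EMPTY, FULL, DELETED } state = EMPTY;
    int key = 0;
};

struct OpenTable {
    std::vector<Slot> table;
    explicit OpenTable(size_t n = 7) : table(n) {}

    void insert(int k) {
        size_t i = k % table.size();
        // A DELETED slot may be reused on insertion; it only matters during find.
        while (table[i].state == Slot::FULL) i = (i + 1) % table.size();
        table[i].state = Slot::FULL;
        table[i].key = k;
    }

    bool find(int k) const {
        size_t i = k % table.size();
        // Keep probing past DELETED slots: they may hide keys inserted after them.
        while (table[i].state != Slot::EMPTY) {
            if (table[i].state == Slot::FULL && table[i].key == k) return true;
            i = (i + 1) % table.size();
        }
        return false;
    }

    void remove(int k) {
        size_t i = k % table.size();
        while (table[i].state != Slot::EMPTY) {
            if (table[i].state == Slot::FULL && table[i].key == k) {
                table[i].state = Slot::DELETED;   // a tombstone, not a true empty slot
                return;
            }
            i = (i + 1) % table.size();
        }
    }
};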

The Squished Pigeon Principle An insert using open addressing cannot work with a load factor of 1 or more. An insert using open addressing with quadratic probing may not work with a load factor of ½ or more. Whether you use chaining or open addressing, large load factors lead to poor performance! How can we relieve the pressure on the pigeons? This brings us to what happens when space does get tight. In other words, how do we handle the squished pigeon principle: it’s hard to fit lots of pigeons into not enough extra holes. What ever will we do? Remember what we did for circular arrays and d-Heaps? Just resize the array, right? But we can’t just copy elements over since their hash values change with the table size. Hint: think resizable arrays!

Rehashing When the load factor gets “too large” (over a constant threshold on ), rehash all the elements into a new, larger table: takes O(n), but amortized O(1) as long as we (just about) double table size on the resize spreads keys back out, may drastically improve performance gives us a chance to retune parameterized hash functions avoids failure for open addressing techniques allows arbitrarily large tables starting from a small table clears out lazily deleted items So, we rehash the table. I said that over a constant threshold is the time to rehash. There are lots of alternatives: too many deleted items a failed insertion too many collisions too long a collision sequence Regardless, the effect is the same. Are there any problems here? In section, Zasha will talk about why this runs in good amortized time.
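A sketch of the resize step (my own illustration; the growth rule and threshold are assumptions), showing why every key must be rehashed rather than copied to the same index:

#include <vector>

// Rehashing sketch for a chaining table of int keys. When the load factor crosses
// the threshold, allocate a roughly doubled table (here 2n+1; a real implementation
// would pick the next prime) and re-insert every key, because each key's slot
// depends on the table size.
struct RehashingTable {
    std::vector<std::vector<int>> buckets;
    size_t entries = 0;

    RehashingTable() : buckets(7) {}                // start from a small table

    void insert(int k) {
        if (entries + 1 > buckets.size())           // threshold: load factor would exceed 1
            rehash(2 * buckets.size() + 1);
        buckets[k % buckets.size()].push_back(k);
        entries++;
    }

    void rehash(size_t newSize) {
        std::vector<std::vector<int>> bigger(newSize);
        for (const auto & chain : buckets)
            for (int k : chain)
                bigger[k % newSize].push_back(k);   // every slot is recomputed
        buckets.swap(bigger);
    }
};

Each rehash costs O(n), but since the table roughly doubles, the cost amortizes to O(1) per insert.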

To Do Start HW#3 (out soon) Start Project 4 (out soon… our last project!) Read Epp Section 7.3 and Koffman & Wolfgang Chapter 9

Coming Up Counting. What? Counting?! Yes, counting, like: If you want to find the average depth of a binary tree as a function of its size, you have to count how many different binary trees there are with n nodes. Or: How many different ways are there to answer the question “How many fingers am I holding up, and which ones are they?”

Extra Slides: Some Other Hashing Methods These are parameterized methods, which is handy if you know the keys in advance. In that case, you can randomly set the parameters a few times and pick a hash function that performs well. (Would that ever happen? How about when building a spell-check dictionary?)

Good Hashing: Multiplication Method. The hash function is defined by size plus a parameter A: hA(k) = ⌊size * (k*A mod 1)⌋ where 0 < A < 1. Example: size = 10, A = 0.485: hA(50) = ⌊10 * (50*0.485 mod 1)⌋ = ⌊10 * (24.25 mod 1)⌋ = ⌊10 * 0.25⌋ = ⌊2.5⌋ = 2. No restriction on size! If we're building a static table, we can try several As. More computationally intensive than a single mod.
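A quick check of that arithmetic (my own sketch):

#include <cmath>
#include <cstdio>

// Multiplication-method hash: floor(size * fractional_part(k * A)).
int multHash(int k, int size, double A) {
    double frac = std::fmod(k * A, 1.0);     // k*A mod 1, i.e., the fractional part
    return (int)std::floor(size * frac);
}

int main() {
    // size = 10, A = 0.485, k = 50: 50*0.485 = 24.25, frac = 0.25, 10*0.25 = 2.5, floor -> 2
    std::printf("%d\n", multHash(50, 10, 0.485));   // prints 2
}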

Good Hashing: Universal Hash Function. Parameterized by a prime size and a vector a = <a0 a1 … ar> where 0 <= ai < size. Represent each key as r + 1 integers with each ki < size: size = 11, key = 39752 ==> <3,9,7,5,2>; size = 29, key = "hello world" ==> <8,5,12,12,15,23,15,18,12,4>. Then ha(k) = (a0*k0 + a1*k1 + … + ar*kr) mod size.

Universal Hash Function: Example. Context: hash strings of length 3 in a table of size 131. Let a = <35, 100, 21>. Then ha("xyz") = (35*120 + 100*121 + 21*122) % 131 = 129, where 120, 121, and 122 are the character codes of 'x', 'y', and 'z'.
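A quick sketch verifying that arithmetic (mine, not from the slides):

#include <cstdio>
#include <string>
#include <vector>

// Dot-product universal hash: h_a(k) = (sum of a_i * k_i) mod size.
int universalHash(const std::string & key, const std::vector<int> & a, int size) {
    long long sum = 0;
    for (size_t i = 0; i < key.size(); i++)
        sum += (long long)a[i] * (unsigned char)key[i];   // k_i is the character code
    return (int)(sum % size);
}

int main() {
    std::printf("%d\n", universalHash("xyz", {35, 100, 21}, 131));   // prints 129
}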

Universal Hash Function Strengths: works on any type as long as you can form ki’s if we’re building a static table, we can try many a’s a random a has guaranteed good properties no matter what we’re hashing Weaknesses must choose prime table size larger than any ki

Alternate Universal Hash Function. Parameterized by k, a, and b: k * size should fit into an int; a and b must be less than size. Hk,a,b(x) = ⌊((a*x + b) mod (k*size)) / k⌋ (integer division by k).

Alternate Universe Hash Function: Example. Context: hash integers in a table of size 16. Let k = 32, a = 100, b = 200. Then hk,a,b(1000) = ((100*1000 + 200) % (32*16)) / 32 = (100200 % 512) / 32 = ⌊360 / 32⌋ = 11.
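And a matching sketch checking that arithmetic (mine):

#include <cstdio>

// Alternate universal hash: floor(((a*x + b) % (k*size)) / k).
int altHash(int x, int k, int a, int b, int size) {
    return ((a * x + b) % (k * size)) / k;    // integer division performs the floor
}

int main() {
    // k = 32, a = 100, b = 200, size = 16: (100*1000 + 200) % 512 = 360, and 360 / 32 = 11
    std::printf("%d\n", altHash(1000, 32, 100, 200, 16));   // prints 11
}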

Universal Hash Function Strengths: if we’re building a static table, we can try many a’s random a,b has guaranteed good properties no matter what we’re hashing can choose any size table very efficient if k and size are powers of 2 Weaknesses still need to turn non-integer keys into integers