SETS, HASH TABLES, AND DICTIONARIES CS16: Introduction to Data Structures & Algorithms Tuesday February 10,

Outline
1. Set ADT
2. Dictionary ADT
3. Hash Tables
4. Example: JUMBLE

Set
- A set is a collection of distinct elements (no repeats).
- Unlike a list or an array, a set doesn’t maintain any particular order of its elements.

Set ADT
- add(obj): adds an element to the set, if it is not there already.
- remove(obj): removes an element from the set, if it is there.
- boolean contains(obj): checks whether an object is in the set.
- int size(): returns the number of elements in the set.
- boolean isEmpty(): checks whether the set is empty.
- list enumerate(): returns a list of all the elements in some arbitrary order.

Simple Set Implementation
We could use an (expandable) array:
- add: add to end of array, O(1) (amortized, and only if we skip the duplicate check; checking for a duplicate first makes it O(n))
- contains: step through array, O(n)
- remove: find, then compress, O(n)
Can we do any better? Hold that thought…
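
As a rough sketch of the array-backed set described above, here is one possible Python version. The class name ArraySet and the snake_case method names are assumptions made for illustration, not part of the course code. This version does perform the duplicate check, so its add is O(n).

```python
class ArraySet:
    """Illustrative set backed by a Python list (an expandable array)."""

    def __init__(self):
        self._items = []

    def contains(self, obj):
        # Step through the array: O(n).
        for item in self._items:
            if item == obj:
                return True
        return False

    def add(self, obj):
        # The duplicate check costs O(n); the append itself is amortized O(1).
        if not self.contains(obj):
            self._items.append(obj)

    def remove(self, obj):
        # Find the element, then let the list close the gap: O(n).
        for i, item in enumerate(self._items):
            if item == obj:
                del self._items[i]
                return

    def size(self):
        return len(self._items)

    def is_empty(self):
        return len(self._items) == 0

    def enumerate(self):
        # Return a copy so callers cannot mutate the internal array.
        return list(self._items)
```

Every operation except size and is_empty walks the whole array in the worst case, which is the cost hashing is meant to avoid.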

Dictionary
- A dictionary is used to store (key, value) pairs, where the key is used to look up its corresponding value.
- Also known as a map.
- Applications:
  - address book (name → address)
  - …a dictionary (word → definition)

Dictionary ADT
- add(key, val): adds a (key, value) pair to the dictionary.
- V get(key): returns the value mapped to by the key.
- remove(key): removes the key and its corresponding value from the dictionary.
- int size(): returns the number of (key, value) pairs in the dictionary.
- boolean isEmpty(): checks whether the dictionary is empty.
GOAL: Implement a dictionary so that all of these methods run in O(1) time!

Hash Tables
- A hash table is an implementation of a dictionary.
- Hash tables are built using an array.
- h(key) is a “hash function” that takes in a key and returns an index into the array, where the key’s corresponding value will be stored.
- However, it’s possible that multiple keys will “hash” to the same index. How can we store multiple values at a single index?
- Let’s make the array an array of “buckets”, where each bucket is a list of the values whose keys hash to that index.
- In fact, we’ll store the (key, value) pair itself in the bucket, not just the value. Think about why this may be.
- Note: it is important that h(key) runs in constant time!

Hash Tables (2)
    table = array of some size
    h = some hash function

    function add(key, val):        // O(1), as long as h() is constant
        index = h(key)
        table[index].append((key, val))

    function get(key):             // depends on the size of the bucket!
        index = h(key)
        for (k, v) in table[index]:
            if k == key:
                return v
        error("key not found")
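
The bucket idea can be sketched in Python roughly as follows. The class name HashTable, the default of 7 buckets, and the modulus hash for integer keys are assumptions chosen for illustration. Unlike the pseudocode, this add overwrites the value when a key is already present, which is one common design choice rather than something the slides specify.

```python
class HashTable:
    """Illustrative dictionary using an array of buckets (lists of pairs)."""

    def __init__(self, num_buckets=7):
        # One empty bucket per array slot.
        self._table = [[] for _ in range(num_buckets)]

    def _h(self, key):
        # Assumption: keys are integers and we hash with a plain modulus.
        return key % len(self._table)

    def add(self, key, val):
        bucket = self._table[self._h(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, val)  # key already present: overwrite
                return
        bucket.append((key, val))

    def get(self, key):
        # Cost depends on how many pairs share this bucket.
        for k, v in self._table[self._h(key)]:
            if k == key:
                return v
        raise KeyError(key)
```

Storing the whole (key, value) pair is what lets get tell apart two keys that land in the same bucket.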

Hash Table Illustrated
[Figure: an array of 7 “buckets” holding (key, val) pairs. Keys are Banner ID #s; values are student names (David Laidlaw, Leah Steinberg, Patrick Maiden, Sarah Parker, Marley Rafson, Luke Fiorante, Surbhi Madan). Hash function: h(key) = key % 7.]

Hash Functions
- In the example on the last slide, the hash table had size 7, and the hash function used was: h(key) = key % 7
- If we expect ~150 students to be stored in our hash table, then we’re bound to have lots of collisions.
- If we’re lucky, the IDs will distribute themselves uniformly, so each bucket will contain about 150/7 students.
- But we’d still have to look through a list of length n/7 to find the right one, which is O(n).
How can we do better?

Hash Functions (2)
Solution: bigger table!
- We know Banner IDs have 8 digits. That means the largest possible ID is 99,999,999.
- Let’s make an array of size 100,000,000 and use the hash function: h(key) = key
- Since every ID gets its own index in the array, we’re guaranteed to have no collisions. All functions run in O(1)!
- But if we only need to keep track of 150 students, then …% of our array goes to waste.
- Besides, we might not even have enough memory for these kinds of shenanigans!

Hash Functions (3)
Solution: smaller table!
- Since we only expect to store ~150 students, let’s only allocate the space we need.
- Make an array of size 150, and use the hash function: h(key) = key % 150
- This would be great if we were guaranteed that the IDs were randomly distributed.
- But what if next year the registrar assigned new Banner IDs in multiples of 150? Then every ID would hash to bucket 0, and we’re screwed!
- Since we can’t count on our keys to be random, we’ll just have to make our hash function random!
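
To make the failure concrete, the following sketch counts bucket sizes for made-up IDs. The helper bucket_sizes and both ID lists are hypothetical, invented only for this illustration. It also suggests why merely switching to a different fixed modulus is not enough: keys that are multiples of the new modulus break it in the same way.

```python
from collections import Counter


def bucket_sizes(keys, table_size):
    """Count how many keys land in each bucket under h(key) = key % table_size."""
    return Counter(key % table_size for key in keys)


# Hypothetical IDs handed out in multiples of 150 and of 151.
multiples_of_150 = [150 * i for i in range(1, 151)]
multiples_of_151 = [151 * i for i in range(1, 151)]

# With a table of size 150, every multiple of 150 lands in bucket 0.
worst_case = bucket_sizes(multiples_of_150, 150)

# A table of size 151 spreads those same IDs out...
spread_out = bucket_sizes(multiples_of_150, 151)

# ...but a different set of unlucky IDs defeats it just as badly.
still_broken = bucket_sizes(multiples_of_151, 151)
```

Any fixed hash function has some set of keys that defeats it, which is the motivation for choosing the function at random.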

Universal Hashing
Magical universal hash function:
1. Pick a prime number greater than your expected capacity: 151. This is your array size.
2. Fix 4 random numbers between 0 and 150: a_1, a_2, a_3, a_4. These stay constant for the life of the hash table.
3. Break keys (Banner IDs) into 4 chunks: x_1, x_2, x_3, x_4 (e.g. 00, 23, 89, …).
Then: h(key) = (a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4) % 151
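
The three steps above might be sketched like this in Python. The function name make_universal_hash, the rng parameter, and the choice to treat a Banner ID as an integer padded to 8 digits and split into two-digit chunks are all assumptions made for illustration.

```python
import random


def make_universal_hash(table_size=151, rng=None):
    """Return (h, coeffs), where h hashes an 8-digit integer ID into range(table_size)."""
    rng = rng or random.Random()
    # Step 2: four random coefficients in 0..table_size-1, fixed for the table's life.
    coeffs = [rng.randrange(table_size) for _ in range(4)]

    def h(key):
        # Step 3: pad to 8 digits and split into four two-digit chunks.
        digits = f"{key:08d}"
        chunks = [int(digits[i:i + 2]) for i in range(0, 8, 2)]
        return sum(a * x for a, x in zip(coeffs, chunks)) % table_size

    return h, coeffs
```

For example, a hypothetical ID 12345678 is split into the chunks 12, 34, 56, 78 before the weighted sum is reduced mod 151.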

Universal Hashing Proof: Background
- Remember fractions and their inverses? The inverse of 3/4 is 4/3, because (3/4)*(4/3) = 1. Sometimes we write it like this: (3/4)^{-1} = 4/3
- Normally, integers don’t have (multiplicative) inverses, because you can’t multiply an integer i by another integer to get 1. (Unless i = 1 … duh.)
- But as soon as we enter modulo world, suddenly integers can have inverses too! Take the integers mod 7:
  - The inverse of 2 is 4, because 2*4 = 8 ≡ 1 (mod 7)
  - The inverse of 5 is 3, because 5*3 = 15 ≡ 1 (mod 7)
- But does every integer always have an inverse under any modulus? What about the integers mod 4? Does 2 have an inverse?
  - 2*0 = 0 ≡ 0 (mod 4)
  - 2*1 = 2 ≡ 2 (mod 4)
  - 2*2 = 4 ≡ 0 (mod 4)
  - 2*3 = 6 ≡ 2 (mod 4)
  Crap! No inverse!
- Turns out, an integer i (mod n) only has an inverse if i and n are relatively prime, which means the only positive integer that evenly divides both of them is 1. Then it definitely has an inverse, and that inverse is unique. Take Abstract Algebra to find out why! Woo!!
- So if we’re talking about the integers mod n, where n is a prime number, then every integer has an inverse, because they’re all relatively prime to n! Oh, except for 0. Because 0 × anything is still 0.
- Wow, we just talked about modular stuff AND prime numbers. Sounds like some serious foreshadowing!!!!!!
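
The examples above can be checked with a small brute-force search. The helper name mod_inverse is an assumption for this sketch, and it is written for clarity rather than speed.

```python
def mod_inverse(i, n):
    """Return the j in 1..n-1 with (i * j) % n == 1, or None if there is none."""
    for j in range(1, n):
        if (i * j) % n == 1:
            return j
    return None


inverse_of_2_mod_7 = mod_inverse(2, 7)  # 4, since 2*4 = 8
inverse_of_5_mod_7 = mod_inverse(5, 7)  # 3, since 5*3 = 15
inverse_of_2_mod_4 = mod_inverse(2, 4)  # None: 2 and 4 share the factor 2

# With a prime modulus such as 151, every value from 1 to 150 has an inverse.
all_invertible = all(mod_inverse(i, 151) is not None for i in range(1, 151))
```

In Python 3.8 and later, the built-in pow(i, -1, n) computes the same inverse and raises ValueError when none exists.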

Universal Hashing Proof
Now for the actual proof:
- Let n be the prime size of our array.
- Choose any 2 distinct Banner IDs, broken into their 4 chunks: (x_1, x_2, x_3, x_4) and (y_1, y_2, y_3, y_4)
- Because the IDs are distinct, we know that they must differ in at least 1 chunk. Without loss of generality, we can assume that they differ in the last one. That is, x_4 ≠ y_4
- Fix 4 random numbers for our hash function, h: a_1, a_2, a_3, a_4
- The probability that these 2 IDs will hash to the same bucket is the probability that: h(x_1, x_2, x_3, x_4) = h(y_1, y_2, y_3, y_4)
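
Universal Hashing Proof (2)
Now let’s simplify that last expression:
    a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4 ≡ a_1 y_1 + a_2 y_2 + a_3 y_3 + a_4 y_4 (mod n)
Subtract stuff from both sides:
    a_4 (x_4 – y_4) ≡ a_1 (y_1 – x_1) + a_2 (y_2 – x_2) + a_3 (y_3 – x_3) (mod n)
The right-hand side is just some number, c:
    a_4 (x_4 – y_4) ≡ c (mod n)
Multiply both sides by (x_4 – y_4)^{-1}:
    a_4 ≡ c (x_4 – y_4)^{-1} (mod n)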

Universal Hashing Proof (3)
- …Therefore, the probability that 2 distinct IDs will collide is the probability that a_4 ≡ c (x_4 – y_4)^{-1} (mod n)
- Because x_4 ≠ y_4, we know that (x_4 – y_4) ≠ 0
- And since we chose n to be prime, (x_4 – y_4) is guaranteed to have a unique inverse mod n.
- Therefore, there is only one possible value that c (x_4 – y_4)^{-1} could take, and only one value of a_4 that would satisfy this congruence.
- Since a_4 is randomly selected from n possible values, the probability that a_4 was chosen “right” is 1/n.
- Therefore, the probability that a particular ID, x, will collide with another given ID is 1/n = 1/151.
- This means the expected number of collisions between x and all other IDs is 149/151 ≈ 1. So the expected size of x’s bucket is ≈ 2.
OMG WE DID IT.
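
The key step, that exactly one choice of a_4 makes two given IDs collide, can be checked by brute force. In this sketch the function name colliding_a4_values and the sample chunk values and coefficients are made up for illustration.

```python
def colliding_a4_values(x, y, a123, n=151):
    """List every a_4 in 0..n-1 for which chunk tuples x and y hash to the same bucket.

    x, y  -- 4-tuples of chunks (x_1..x_4 and y_1..y_4)
    a123  -- the already-fixed coefficients (a_1, a_2, a_3)
    """
    def h(chunks, a4):
        coeffs = (*a123, a4)
        return sum(a * c for a, c in zip(coeffs, chunks)) % n

    return [a4 for a4 in range(n) if h(x, a4) == h(y, a4)]


# Two hypothetical IDs that differ in the last chunk (78 vs 32).
x = (12, 34, 56, 78)
y = (98, 76, 54, 32)
bad_choices = colliding_a4_values(x, y, a123=(5, 17, 100))
```

For any pair with x_4 ≠ y_4 and any fixed a_1, a_2, a_3, the returned list should contain exactly one value out of the 151 candidates, matching the 1/n probability in the proof.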

Back to Sets
- We can also use hashing to implement a set!
- There are no key-value pairs, just elements.
- Also called a Hash Set.

    function add(obj):
        if not contains(obj):      // a set holds no repeats
            index = h(obj)
            table[index].append(obj)

    function contains(obj):
        index = h(obj)
        for elt in table[index]:
            if elt == obj:
                return true
        return false
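
A hash set along these lines could look like the following in Python. The class name HashSet, the default of 151 buckets, and the use of Python's built-in hash() in place of h are assumptions for this sketch.

```python
class HashSet:
    """Illustrative set using an array of buckets; each bucket is a list of elements."""

    def __init__(self, num_buckets=151):
        self._table = [[] for _ in range(num_buckets)]
        self._size = 0

    def _bucket(self, obj):
        # Assumption: the built-in hash() stands in for the hash function h.
        return self._table[hash(obj) % len(self._table)]

    def contains(self, obj):
        # Linear scan, but only over one bucket.
        return obj in self._bucket(obj)

    def add(self, obj):
        bucket = self._bucket(obj)
        if obj not in bucket:  # keep elements distinct
            bucket.append(obj)
            self._size += 1

    def remove(self, obj):
        bucket = self._bucket(obj)
        if obj in bucket:
            bucket.remove(obj)
            self._size -= 1

    def size(self):
        return self._size
```

Tracking the count in a field keeps size at O(1) instead of summing the bucket lengths on every call.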

HashMap vs HashSet
Hash Map:
- Maps keys to values
- There is no ordering
Hash Set:
- No keys, just values. That is, it is like a HashMap where the key and value are the same.
- There is no ordering

Example: JUMBLE
- Leah is making a Jumble for the Daily Herald. There should only be one solution for a set of jumbled letters.
- How can she find all 5-letter words for which there is no other valid permutation?
- Input: list of all 5-letter words in English (each word represented as an array of 5 characters)
- Output: all words for which no other permutation is a word

Example: JUMBLE Plan
- Naive solution: for every valid word, find ALL of its permutations, and check if each permutation is an English word. Keep track of a list for each word and return the words that have only a single valid permutation.
- The problem with this: generating every permutation for every word is very expensive! For a 5-letter word, there are as many as 5! = 120 permutations we would have to check.
- The better solution: sort the letters of each valid word alphabetically. Use the sorted letter combination as the key in the hashmap. Any two words that are permutations of each other will then have the same key, so they'll be mapped to the same "value", a list of the words made from that letter combination.
- We use only the valid English words to generate this, so we're never touching the tons and tons of non-valid letter combinations.

Example: JUMBLE Solution
    function jumble(words):
        // Input: list of words
        // Output: list of all words for which no other
        //         permutation is a word
        output = []
        permutations = dictionary()
        for each word in words:
            sortedKey = sort the letters of word alphabetically
            permlist = permutations.get(sortedKey) or []   // [] if the key is absent
            permlist.append(word)
            permutations.add(sortedKey, permlist)
        for each word in words:
            sortedKey = sort the letters of word alphabetically
            if permutations.get(sortedKey).length == 1:
                output.append(word)
        return output
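
One way the pseudocode above might translate into runnable Python is sketched below. It uses Python's built-in dict (through collections.defaultdict) as the hash table and takes words as strings rather than character arrays; both choices are assumptions, and the sample word list in the usage note is made up.

```python
from collections import defaultdict


def jumble(words):
    """Return the words for which no other word in the list is a permutation."""
    # Map each sorted-letter key to the list of words that share those letters.
    groups = defaultdict(list)
    for word in words:
        groups["".join(sorted(word))].append(word)

    # Keep a word only if it is alone in its group.
    return [word for word in words if len(groups["".join(sorted(word))]) == 1]
```

For instance, given the list "angel", "glean", "angle", "hello", "world", the first three words share the key "aegln" and are dropped, leaving "hello" and "world". Sorting a 5-letter word takes constant time, so the whole pass is expected O(n) in the number of words.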

Readings
- Dasgupta section 1.5 covers universal hashing.
- Dasgupta “Randomized algorithms: a virtual chapter” on page 39 motivates algorithms like hashing.