Hash Maps
Rem Collier
Room A1.02
School of Computer Science and Informatics
University College Dublin, Ireland
Hash Tables
- An array-based approach to implementing the Map ADT.
- Key features:
  - An array of size N.
  - A hash function, denoted h(k), which maps keys to integer values, known as hash values, in the range 0 to N-1:
      Algorithm hashFunction(k):
        return k % N
  - A collision handling strategy, which deals with the case where two keys have the same hash value.
- Two main types of collision handling strategy: separate chaining and open addressing.
Separate Chaining
- Basic strategy in which entries with the same hash value are “chained” together.
- Approach:
  - Use an array of Lists.
  - Collisions are resolved by adding the new entry to the end of the associated List.
- Example problem:
  - Create a hash table of size 13 that uses the following hash function: h(x) = x mod 13
  - Insert entries with the following keys: 18, 44, 41, 22, 59, 32, 31, 73
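The example problem can be worked through with a short Python sketch. The function name `build_chains` is hypothetical, and plain Python lists stand in for the List ADT; only the keys are stored, to keep the trace easy to read.

```python
def build_chains(keys, n=13):
    """Insert keys into a table of n buckets using h(x) = x mod n."""
    table = [None] * n              # empty cells start as None
    for k in keys:
        h = k % n                   # division-method hash function
        if table[h] is None:
            table[h] = []           # create the chain on first use
        table[h].append(k)          # collisions go to the end of the chain
    return table


table = build_chains([18, 44, 41, 22, 59, 32, 31, 73])
for index, chain in enumerate(table):
    if chain is not None:
        print(index, chain)
```

Keys 18, 44 and 31 all hash to 5, so they share one chain in insertion order; every other key lands in a bucket of its own.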
Pseudocode

  Algorithm put(k, v):
    h ← hashFunction(k)
    temp ← null
    entry ← new Entry(k, v)
    if (A[h] = null) then
      A[h] ← new List()
      A[h].insertLast(entry)
      size ← size + 1
    else
      P ← find(A[h], k)
      if (P = null) then
        A[h].insertLast(entry)
        size ← size + 1
      else
        e ← A[h].replace(P, entry)
        temp ← e.value()
    return temp

  Algorithm remove(k):
    h ← hashFunction(k)
    if (A[h] = null) then
      return null
    P ← find(A[h], k)
    if (P = null) then
      return null
    e ← A[h].remove(P)
    size ← size - 1
    return e.value()

  Algorithm get(k):
    h ← hashFunction(k)
    if (A[h] = null) then
      return null
    P ← find(A[h], k)
    if (P = null) then
      return null
    return P.element().value()

Note: size is incremented only when a new entry is added; replacing the value of an existing key leaves size unchanged.
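A minimal runnable sketch of these three operations follows. The class name `ChainedHashMap` and the helper `_find` are hypothetical, integer keys are assumed, and each chain is a Python list of `[key, value]` pairs in place of the List ADT.

```python
class ChainedHashMap:
    """Map backed by an array of chains (lists of [key, value] pairs)."""

    def __init__(self, n=13):
        self.n = n
        self.table = [None] * n
        self.size = 0

    def _hash(self, k):
        return k % self.n               # assumes integer keys

    def _find(self, chain, k):
        """Return the position of key k in the chain, or None."""
        for pos, entry in enumerate(chain):
            if entry[0] == k:
                return pos
        return None

    def put(self, k, v):
        h = self._hash(k)
        if self.table[h] is None:
            self.table[h] = []          # create the chain on first use
        chain = self.table[h]
        pos = self._find(chain, k)
        if pos is None:
            chain.append([k, v])        # new key: add to the end of the chain
            self.size += 1
            return None
        old = chain[pos][1]
        chain[pos] = [k, v]             # existing key: replace, size unchanged
        return old

    def get(self, k):
        chain = self.table[self._hash(k)]
        if chain is None:
            return None
        pos = self._find(chain, k)
        return None if pos is None else chain[pos][1]

    def remove(self, k):
        chain = self.table[self._hash(k)]
        if chain is None:
            return None
        pos = self._find(chain, k)
        if pos is None:
            return None
        self.size -= 1
        return chain.pop(pos)[1]        # return the removed value
```

As in the pseudocode, `put` returns the previous value when the key was already present and `None` otherwise.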
Performance
- The performance of get(), put() and remove() depends on the number of collisions:
  - Best case: no collisions occur, giving O(1) running time.
  - Worst case: every key has the same hash value, giving O(n) running time.
- Normally, hash table performance is measured as expected running time (see table).
- In practice, we try to achieve this by choosing a good hash function…

  Operation     Expected time
  size()        O(1)
  isEmpty()     O(1)
  get(k)        O(1)
  put(k,v)      O(1)
  remove(k)     O(1)
  keys()        O(n)
  values()      O(n)
  entries()     O(n)

Note: keys(), values() and entries() must visit every entry, so they take O(n) time.
Hash Functions
- Hash functions convert keys to integer hash values in the range 0 to N-1.
- Any data type / object can be a key (e.g. strings, doubles, bank accounts, …).
- To handle this, hash functions combine two basic maps:
  - Hash Code Map: assigns an integer value to each key.
  - Compression Map: converts that integer to an integer in the range 0 to N-1.
- The previous example used a compression map known as the division method (% N):
  - N should be prime.
  - Need to be wary of patterns in the hash codes of the form pN + q: all such codes compress to the same value, q mod N.
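The warning about hash codes of the form pN + q can be seen in a few lines of Python. The values N = 13 and q = 4 are chosen purely for illustration.

```python
N = 13                                      # table size (prime)
q = 4                                       # illustrative offset
codes = [p * N + q for p in range(5)]       # 4, 17, 30, 43, 56
compressed = [code % N for code in codes]   # division method
print(codes, compressed)
```

Every one of these codes compresses to 4, so a prime N alone does not protect against this pattern.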
The MAD Method
- A better compression map is the Multiply, Add and Divide (MAD) method.
- This method takes the hash code and:
  - Multiplies it by a constant value, known as the scale factor,
  - Adds a second constant value, known as the shift, and then
  - Returns the remainder when this value is divided by N.
- For a given hash code, i, this method takes the form: (ai + b) mod N
- Constraint: a % N should not equal 0…
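A small sketch of the formula as stated, (ai + b) mod N. The function name `mad_compress` and the constants a = 3, b = 5 are hypothetical. The check reflects the constraint: if a % N were 0, the term ai would vanish modulo N and every hash code would map to b mod N.

```python
def mad_compress(i, a, b, n):
    """Compress hash code i into the range 0..n-1 using (a*i + b) mod n."""
    if a % n == 0:
        # Otherwise every code would map to the same cell, b mod n.
        raise ValueError("a % n must not equal 0")
    return (a * i + b) % n


print([mad_compress(i, a=3, b=5, n=13) for i in range(5)])
```

With these constants the codes 0 to 4 map to 5, 8, 11, 1 and 4.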
Hash Code Maps
- Primitive data types:
  - Integer Cast: re-interpret the bits as an integer value, e.g. for a byte, k, use (int) k
  - Component Sum: break the bits into integer-size blocks, cast each block as an integer, and sum the values, e.g. for a long, k, use (int) (k >> 32) + (int) k
  - Polynomial Sum: same as component sum, but multiply each term by a power of a constant p, e.g. for a sequence S = c_1 c_2 … c_n, use
      h(S) = c_1 + c_2 p + c_3 p^2 + … + c_n p^(n-1)
- Objects:
  - Use the object's memory address (or adapt one of the above).
  - Has proven to be a simple but effective general solution.
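The component sum for a long can be sketched in Python as follows. This is only an approximation of the cast-based expression: Python integers are unbounded, so the sketch masks each block to 32 bits and keeps the result as an unsigned 32-bit value, whereas an (int) cast yields a signed value with the same bit pattern. The name `component_sum_long` is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def component_sum_long(k):
    """Hash code for a 64-bit value: add its high and low 32-bit blocks."""
    high = (k >> 32) & MASK32       # upper 32 bits
    low = k & MASK32                # lower 32 bits
    return (high + low) & MASK32    # wrap the sum to 32 bits


print(component_sum_long((1 << 32) + 2))   # high block 1, low block 2
```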
Hash Code Maps & Strings
- String = sequence of characters.
- Character encodings are integer numbers (typically 8 / 16 bit).
- Naïve solution: use component sum
  - h(“dog”) = (int) ‘d’ + (int) ‘o’ + (int) ‘g’ = 100 + 111 + 103 = 314
  - h(“god”) = (int) ‘g’ + (int) ‘o’ + (int) ‘d’ = 103 + 111 + 100 = 314 !?!?!
- Better solution: use polynomial sum (p = 3):
  - h(“god”) = 103 + 111*3 + 100*9 = 1,336
  - h(“dog”) = 100 + 111*3 + 103*9 = 1,360
- Experimental note: for 50,000 English words, a value of p = 33 results in fewer than 7 collisions!!!
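Both string hash codes can be reproduced with a short sketch. The function names are hypothetical, and `ord` stands in for the (int) cast of a character.

```python
def component_sum(s):
    """Naive hash code: sum of the character codes."""
    return sum(ord(c) for c in s)


def polynomial_sum(s, p):
    """Polynomial hash code: c_1 + c_2*p + c_3*p^2 + ..."""
    total = 0
    power = 1                       # p^0 for the first character
    for c in s:
        total += ord(c) * power
        power *= p
    return total


print(component_sum("dog"), component_sum("god"))          # anagrams collide
print(polynomial_sum("god", 3), polynomial_sum("dog", 3))  # anagrams differ
```

Because each character is weighted by its position, reordering the characters changes the polynomial sum but not the component sum.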
Separate Chaining
- Separate Chaining:
  - Uses an array of Lists.
  - Collisions result in new entries being added to the end of the corresponding list.
  - In theory, offers infinite capacity.
  - Drawbacks:
    - Uses an auxiliary data structure (List).
    - In practice, the number of collisions increases as the number of entries increases.
- Open Addressing:
  - Does not require an auxiliary data structure.
  - Has finite capacity, but supports rehashing.
Linear Probing
- Strategy:
  - Create an array of entries.
  - Use the hash value h(k) as an index into this array.
  - A collision occurs when h(k) is occupied.
  - Resolve the collision by placing the entry in the next (circularly) available array position.
  - This is done by “probing” consecutive positions in the array (e.g. h(k) + 1, h(k) + 2, …).
- Let's explore how this works through the following example:
  - Assume a hash table of size 13 that uses linear probing, together with the following hash function: h(x) = x mod 13
  - Insert entries with the following keys: 18, 44, 41, 22, 59, 32, 31, 73
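The example can be traced with a short sketch. The function name `linear_probe_insert` is hypothetical; the sketch stores keys only and assumes they are distinct.

```python
def linear_probe_insert(keys, n=13):
    """Insert distinct keys using h(x) = x mod n and linear probing."""
    table = [None] * n
    for k in keys:
        i = k % n                       # start at h(k)
        for _ in range(n):              # probe at most n cells
            if table[i] is None:
                table[i] = k            # first free cell wins
                break
            i = (i + 1) % n             # move to the next cell, circularly
        else:
            raise RuntimeError("table is full")
    return table


print(linear_probe_insert([18, 44, 41, 22, 59, 32, 31, 73]))
```

Key 44 collides with 18 at cell 5 and moves to cell 6; keys 32, 31 and 73 then have to probe past an increasingly long run of occupied cells, ending up in cells 8, 10 and 11.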
Retrieval with Linear Probing
- Consider a hash table, A, that uses linear probing.
- get(k):
  - We start at cell h(k).
  - We probe consecutive locations until one of the following occurs:
    - An item with key k is found, or
    - An empty cell is found, or
    - N cells have been unsuccessfully probed.

  Algorithm get(k):
    i ← hashFunction(k)
    p ← 0
    repeat
      c ← A[i]
      if c = null then
        return null
      else if c.key() = k then
        return c.value()
      else
        i ← (i + 1) mod N
        p ← p + 1
    until p = N
    return null
Removal of Entries
- One issue that we still need to resolve is how to remove entries from a linear probing hash table implementation:
  - Search is the key operation.
  - The current search algorithm terminates when either N entries have been checked, or a “gap” is found.
- Problem:
  - If we simply remove entries, they will be replaced by “gaps”.
  - These “gaps” would cause the search algorithm to stop.
- Solution: a special token (object) called AVAILABLE.
  - Removed entries are replaced by the AVAILABLE token.
  - A modified search algorithm could check whether each probe detects a valid entry, or the token.
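Updates with Linear Probing

  Algorithm put(k, v):
    h ← hashFunction(k)
    p ← 0
    available ← -1
    while (p < N) do
      e ← A[h]
      if (e = null) then
        if (available = -1) then
          A[h] ← new Entry(k, v)
        else
          A[available] ← new Entry(k, v)
        size ← size + 1
        return null
      if (e = AVAILABLE) then
        if (available = -1) then
          available ← h
      else if (e.key() = k) then
        temp ← e.value()
        A[h] ← new Entry(k, v)
        return temp
      h ← (h + 1) mod N
      p ← p + 1
    if (available ≠ -1) then
      A[available] ← new Entry(k, v)
      size ← size + 1
    return null

Note: the AVAILABLE token has no key, so it is tested for before e.key() is called. If all N cells are probed without finding an empty cell, the entry is still placed in the first AVAILABLE cell seen, if there was one.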
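Updates with Linear Probing

  Algorithm remove(k):
    h ← hashFunction(k)
    p ← 0
    while (p < N) do
      e ← A[h]
      if (e = null) then
        return null
      if (e ≠ AVAILABLE) and (e.key() = k) then
        temp ← e.value()
        A[h] ← AVAILABLE
        size ← size - 1
        return temp
      h ← (h + 1) mod N
      p ← p + 1
    return null

Note: cells holding the AVAILABLE token are skipped, since the token has no key.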
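The following is a minimal sketch of a linear probing map that uses an AVAILABLE token, assuming integer keys. The class name `LinearProbingMap` and the helper `_probe` are hypothetical; entries are stored as `(key, value)` tuples, and unlike the pseudocode, `put` raises an error when no free cell exists.

```python
AVAILABLE = object()    # sentinel that marks a removed entry


class LinearProbingMap:
    def __init__(self, n=13):
        self.n = n
        self.table = [None] * n
        self.size = 0

    def _probe(self, k):
        """Yield the n cell indices visited for key k, starting at h(k)."""
        h = k % self.n
        for j in range(self.n):
            yield (h + j) % self.n

    def get(self, k):
        for i in self._probe(k):
            e = self.table[i]
            if e is None:
                return None                 # a true gap ends the search
            if e is not AVAILABLE and e[0] == k:
                return e[1]
        return None

    def put(self, k, v):
        available = -1                      # first reusable cell seen so far
        for i in self._probe(k):
            e = self.table[i]
            if e is None:
                if available == -1:
                    available = i
                break                       # key cannot lie beyond a gap
            if e is AVAILABLE:
                if available == -1:
                    available = i
            elif e[0] == k:
                self.table[i] = (k, v)      # replace the existing entry
                return e[1]
        if available == -1:
            raise RuntimeError("table is full")
        self.table[available] = (k, v)
        self.size += 1
        return None

    def remove(self, k):
        for i in self._probe(k):
            e = self.table[i]
            if e is None:
                return None
            if e is not AVAILABLE and e[0] == k:
                self.table[i] = AVAILABLE   # leave a token, not a gap
                self.size -= 1
                return e[1]
        return None


m = LinearProbingMap()
for key in (18, 44, 31):                    # all three hash to cell 5
    m.put(key, str(key))
m.remove(44)                                # cell 6 now holds AVAILABLE
print(m.get(31))                            # still found, past the token
```

After the removal, a later `put` of a key that also hashes to 5 reuses cell 6 and does not extend the run of occupied cells.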
Double Hashing
- Idea: use a secondary hash function d(k).
- Probing is not linear, but based on the following equation, where i = h(k):
    (i + j·d(k)) mod N   for j = 0, 1, …, N - 1
- Restrictions:
  - The secondary hash function d(k) cannot have zero values.
  - The table size N must be a prime to allow probing of all the cells.
- Common choice of d(k):
    d(k) = q - (k mod q)   where q < N and q is a prime
  - The possible values for d(k) are 1, 2, …, q.
- Example:
  - N = 13, h(k) = k mod 13, d(k) = 7 - (k mod 7)
  - Insert keys: 18, 41, 22, 44, 59, 32, 31, 73
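The example can be traced with the sketch below. The function name `double_hash_insert` is hypothetical; it stores keys only and assumes they are distinct.

```python
def double_hash_insert(keys, n=13, q=7):
    """Insert distinct keys with h(k) = k mod n and d(k) = q - (k mod q)."""
    table = [None] * n
    for k in keys:
        i = k % n                       # primary hash value
        d = q - (k % q)                 # step size, always in 1..q
        for j in range(n):
            cell = (i + j * d) % n      # j-th probe position
            if table[cell] is None:
                table[cell] = k
                break
        else:
            raise RuntimeError("no free cell found")
    return table


print(double_hash_insert([18, 41, 22, 44, 59, 32, 31, 73]))
```

Only two keys need more than one probe: 44 (d = 5) tries cell 5 and then settles in cell 10, while 31 (d = 4) tries cells 5 and 9 before wrapping around to cell 0.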
Performance of Hashing
- In the worst case, searches, insertions and removals on a hash table take O(n) time.
  - This occurs when all the keys inserted into the dictionary collide.
- The load factor α = n/N also affects the performance of a hash table.
  - Assuming hash values are like random numbers, the expected number of probes for an (open addressing) insertion is: 1 / (1 - α)
- The expected running time of all the dictionary ADT operations in a hash table is O(1).
- In practice, hashing is very fast provided the load factor is not close to 100%.
  - Java's HashMap rehashes at a load factor of 75% by default.
- Applications of hash tables:
  - small databases
  - compilers
  - browser caches
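To get a feel for the formula 1 / (1 - α), the sketch below evaluates it for a few load factors. The function name `expected_probes` and the chosen load factors are illustrative.

```python
def expected_probes(load_factor):
    """Expected probes for an open addressing insertion: 1 / (1 - alpha)."""
    if not 0 <= load_factor < 1:
        raise ValueError("load factor must be in the range [0, 1)")
    return 1 / (1 - load_factor)


for alpha in (0.5, 0.75, 0.9):
    print(alpha, round(expected_probes(alpha), 2))
```

The estimate doubles from 2 probes at α = 0.5 to 4 at α = 0.75, and reaches about 10 at α = 0.9, which is why the load factor is kept well below 100%.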
Rehashing
- Rehashing is the process of expanding the capacity of a hash table.
  - It's a lot like an extendable array (e.g. Vector).
- Rehashing is performed when the load factor moves above a certain threshold.
- We rehash by:
  - Creating a new array (> 2N in size).
  - Specifying a new compression map (e.g. update the division method to work with the new size).
  - Inserting each entry into the new array.
- Given insertion is O(1), rehashing is an O(N) operation:
  - We have to check each index in the old array…
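The three rehashing steps can be sketched for a separate chaining table that stores keys only. The function names `next_prime` and `rehash` are hypothetical, and choosing the next prime above 2N as the new size is an assumption made here so that the division method keeps a prime table size.

```python
def next_prime(m):
    """Smallest prime >= m (trial division; fine for small tables)."""
    def is_prime(x):
        if x < 2:
            return False
        f = 2
        while f * f <= x:
            if x % f == 0:
                return False
            f += 1
        return True

    while not is_prime(m):
        m += 1
    return m


def rehash(table):
    """Return a new chained table of prime size > 2N with the same keys."""
    new_n = next_prime(2 * len(table) + 1)  # step 1: new, larger array
    new_table = [None] * new_n
    for chain in table:                     # visit every index of the old array
        if chain is None:
            continue
        for k in chain:
            h = k % new_n                   # step 2: division method, new size
            if new_table[h] is None:
                new_table[h] = []
            new_table[h].append(k)          # step 3: insert into the new array
    return new_table


old = [None] * 13
old[5] = [18, 44, 31]                       # three keys colliding at cell 5
new = rehash(old)
print(len(new), [i for i, c in enumerate(new) if c is not None])
```

With the new size of 29, the three keys that shared cell 5 are spread across cells 18, 15 and 2.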