Hashing
Amihood Amir, Bar Ilan University, 2014
Direct Addressing
In the old days:
    LD 1,1    (load location 1 into register 1)
    LD 2,2    (load location 2 into register 2)
    AD 1,2    (add register 2 into register 1)
    ST 1,3    (store register 1 into location 3)
Today: C <- A+B
Direct Addressing
Compiler keeps track of: C <- A+B, with A in location 1, B in location 2, C in location 3.
How? Tables? Linked lists? Skip lists? AVL trees? B-trees?
Encode?
Consider the ASCII (American Standard Code for Information Interchange) code of the alphabet.
Encode
In ASCII code, A to Z are 1000001 to 1011010 in binary, i.e. 65 to 90 in decimal. So if we subtract 64 from the ASCII code, we get locations 1 to 26.
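A minimal Python sketch of this trick (the function name location is ours):

    def location(letter):
        # ord('A') = 65, so subtracting 64 maps 'A'..'Z' to 1..26
        return ord(letter) - 64

    print(location('A'), location('Z'))   # 1 26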
Map
In general: consider a function h: U -> {0,…,m}, where U is the universe of keys and m is the number of keys. When we want to access the record of key k∈U, just go to location h(k); it will point to the location of the record.
Problems:
1. What happens if keys are not consecutive? E.g. parameters: pointer, variable, idnumber.
2. What happens if not all keys are used? E.g. ID numbers of students.
3. What happens if h(k) is not bijective?
Hash Tables:
If U is larger than m, h(k) cannot be bijective, thus we may encounter collisions: k1 ≠ k2 where h(k1) = h(k2). What do we do then? Solution 1: chaining.
Chaining
Chaining puts elements that hash to the same slot in a linked list.
(Diagram: universe U of keys, actual keys K = {k1,…,k8}; table T whose slots point to linked lists of the keys that hash there, e.g. one slot holds the list k1 -> k4.)
Hash operations:
CHAINED-HASH-INSERT(T,x): insert x at end of list T[h(key[x])].
CHAINED-HASH-SEARCH(T,k): search for an element with key k in list T[h(k)].
CHAINED-HASH-DELETE(T,x): delete x from list T[h(key[x])].
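A minimal Python sketch of these three operations, simplified so that each element is its own key; the class name and the use of Python's built-in hash as h are our assumptions, not part of the slides:

    class ChainedHashTable:
        def __init__(self, m):
            self.m = m
            self.slots = [[] for _ in range(m)]   # one chain (list) per slot

        def _h(self, k):
            return hash(k) % self.m               # stand-in hash function

        def insert(self, x):                      # CHAINED-HASH-INSERT(T, x)
            self.slots[self._h(x)].append(x)      # insert at end of list

        def search(self, k):                      # CHAINED-HASH-SEARCH(T, k)
            for x in self.slots[self._h(k)]:
                if x == k:
                    return x
            return None

        def delete(self, x):                      # CHAINED-HASH-DELETE(T, x)
            self.slots[self._h(x)].remove(x)      # assumes x is present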
Time:
Depends on the chains. We assume simple uniform hashing: any key is equally likely to hash into any of the slots.
Load Factor:
We will analyze average search time. This depends on how large the hash table is relative to the number of keys. Let n = number of keys and m = hash table size; n/m is the load factor. Write n/m = α.
Average Complexity of Search
Notation: the length of list T[j] is nj; Σj nj = n. Average value of nj: E(nj) = n/m = α.
Assumptions: h(k) is computed in constant time; h(k) is a simple uniform hash function.
Searching for new key
Compute h(k) in time O(1). Get to list T[h(k)] (uniformly distributed), whose length is nj. Need to search list T[h(k)]. Time for search: nj, and E(nj) = α. Conclude: average time to search for a new key is Θ(1+α). What is the worst case time?
Searching for Existing Key
Do we gain anything by searching for a key that is already in a list? Note: expected number of elements examined in a successful search for k = expected length of list T[h(k)] when k was inserted, plus 1 (since we insert elements at the end).
Searching for Existing Key
Let k1, …, kn be the keys in order of their insertion. When ki is inserted, the expected list length is (i-1)/m. So the expected length of a successful search is the average over all such lists:
Calculate:
(1/n) Σ_{i=1}^{n} (1 + (i-1)/m) = 1 + (1/(nm)) Σ_{i=1}^{n} (i-1) = 1 + n(n-1)/(2nm) = 1 + α/2 - α/(2n) = Θ(1+α)
Conclude:
Average time for searching in a hash table with chaining is Θ(1+α). Note: generally m is chosen as Θ(n), so α = O(1). Thus the average operation on a hash table with chaining takes constant time.
Choosing a Hash Function Assume keys are natural numbers: {0, 1, 2, … } A natural mapping of natural numbers to {0, 1, …, m} is by dividing by m+1 and taking the remainder, i.e. h(k) = k mod (m+1)
The Division Method
What are good m's? In the ASCII example: A to Z are 1000001 to 1011010 in binary, 65 to 90 in decimal. We subtracted 64 from the ASCII code and got locations 1 to 26. If we choose m=32, we achieve the same result.
What does it Mean?
32 = 2^5. Dividing by 32 and taking the remainder means taking the 5 LSBs of the binary representation. A is 1000001, whose 5 LSBs are 00001 = 1. Z is 1011010, whose 5 LSBs are 11010 = 26. Is this good? Here, yes. In general?
Want Even Distribution
We want the hash value to depend on all the bits. If m = 2^b then only the b least significant bits participate in the hashing. We would like all bits to participate. Solution: use for m a prime number not too close to a power of 2. Always a good idea: check the distribution of real application data.
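A minimal sketch of the division method in Python; the value m = 701 (a prime not too close to a power of 2) is our example choice:

    def h_division(k, m=701):
        # m prime and away from powers of 2, so all key bits influence the slot
        return k % m

    print(h_division(2391))   # 288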
The Multiplication Method
Choose 0 < σ < 1 and set h(k) = ⌊m·(kσ mod 1)⌋.
Meaning: 1. multiply the key by σ and take the fraction; 2. then multiply by m and take the floor.
Multiplication Method Example
Choose: σ = 0.618, m = 32. (Knuth recommends σ = (√5-1)/2 ≈ 0.618….)
k = 2391. h(k): 2391 × 0.618 = 1477.638; 0.638 × 32 = 20.416. So h(2391) = 20.
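A quick Python check of this example (variable names ours):

    import math

    sigma = 0.618          # slide's value; Knuth recommends (math.sqrt(5) - 1) / 2
    m, k = 32, 2391
    frac = (k * sigma) % 1.0           # fractional part of k*sigma
    print(math.floor(m * frac))        # 20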
Why does this involve all bits?
Assume k uses w bits, where w is a computer word, and word operations take constant time. Assume m = 2^p. Easy implementation of h(k):
Implementing h(k)
Multiply the w-bit key k by the w-bit integer ⌊σ·2^w⌋, giving a 2w-bit product. Getting rid of the high word means dividing by 2^w: the low word is the fractional part of kσ scaled by 2^w. h(k) is the top p bits of that low word.
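A sketch of this word-arithmetic implementation in Python, assuming w = 32; the constant 2654435769 is ⌊((√5-1)/2)·2^32⌋:

    def h_mult(k, p, w=32):
        s = 2654435769                   # floor(sigma * 2^w), sigma = (sqrt(5)-1)/2
        prod = (k * s) & ((1 << w) - 1)  # low word of k*s = frac(k*sigma) * 2^w
        return prod >> (w - p)           # top p bits of the low word

    print(h_mult(2391, p=5))  # 23 (exact sigma; the truncated 0.618 above gave 20)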
Worst Case
Problem: a malicious adversary can make sure all keys hash to the same entry. Usual solution: randomize. But then how do we get to the correct hash table entry upon search? Answer: choose the hash function at random once, when the table is created, and use that same function for every operation.
Universal Hashing
We want to make sure that the keys k1, k2, …, kn do not hash to the same table entry every time. How can we make sure of that? Use a number of different hash functions and employ them at random.
Universal Hashing
A collection H of hash functions from universe U to {0,…,m} is universal if for every pair of keys x,y∈U, where x≠y, the number of hash functions h∈H for which h(x)=h(y) is |H|/m. This means: if we choose a function h∈H at random, the chance of collision between x and y, where x≠y, is 1/m.
Constructing a Universal Collection of Hash Functions
Choose m to be prime. Decompose a key k into r+1 pieces of ⌊log m⌋ bits each: k = [k0, k1, …, kr], so each piece has value ki < m.
Universal Hashing Construction
Choose randomly among the m^(r+1) sequences of length r+1 of elements from {0,…,m-1}. Each sequence is a = [a0, …, ar], ai ∈ {0,…,m-1}. Each sequence defines a hash function
h_a(k) = (Σ_{i=0}^{r} ai·ki) mod m.
The universal class is: H = ∪_a {h_a}.
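A minimal Python sketch of this construction (function names ours); keys are given already decomposed into their digits ki < m:

    import random

    def make_hash(m, r):
        # pick a = [a0..ar] uniformly at random; this selects h_a from H
        a = [random.randrange(m) for _ in range(r + 1)]
        def h_a(key_digits):             # key k = [k0, ..., kr], each ki < m
            return sum(ai * ki for ai, ki in zip(a, key_digits)) % m
        return h_a

    h = make_hash(m=13, r=2)             # m must be prime
    print(h([3, 7, 11]))                 # some slot in {0,...,12}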
This is a Universal Class
Need to show that the probability that x≠y collide is 1/m. Because x≠y, there is some i for which xi≠yi. Wlog assume i=0. Since m is prime, for any fixed [a1,…,ar] there is exactly one value that a0 can take that satisfies h(x)=h(y). Why?
Proof Continuation…
h(x)-h(y) ≡ 0 (mod m) means: a0(x0-y0) ≡ -Σ_{i=1}^{r} ai(xi-yi) (mod m). But x0-y0 is non-zero, and for prime m it has a unique multiplicative inverse modulo m. Thus a0 ≡ -(x0-y0)^{-1} Σ_{i=1}^{r} ai(xi-yi) (mod m): there is only one value between 0 and m-1 that a0 can take.
End of Proof
Conclude: each pair x≠y may collide once for each [a1,…,ar], and there are m^r possible values for [a1,…,ar]. However, since there are m^(r+1) different hash functions, the keys x and y collide with probability m^r / m^(r+1) = 1/m.
Open Addressing
Solution 2 to the collision problem. We don't want to use chaining (to save the space and complexity of pointers). Where should we put the colliding key? Inside the hash table, in an empty slot. Problem: which empty slot? We assume a probing hashing function.
Idea behind Probing
Hashing function of the form h: U × {0,…,m} -> {0,…,m}. The initial hashing function: h(k,0). h(k,i) for i=1,…,m gives subsequent values. We try locations h(k,0), h(k,1), etc. until an open slot is found in the hash table. If none is found, we report an overflow error.
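A minimal open-addressing sketch in Python, parameterized by a probe function h(k,i) as above (function names ours; empty slots are None):

    def oa_insert(T, k, h):
        for i in range(len(T)):          # try h(k,0), h(k,1), ...
            j = h(k, i)
            if T[j] is None:
                T[j] = k
                return j
        raise OverflowError("hash table overflow")

    def oa_search(T, k, h):
        for i in range(len(T)):
            j = h(k, i)
            if T[j] is None:             # k would have been inserted here
                return None
            if T[j] == k:
                return j
        return None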
Average Complexity of Probing
Assume uniform hashing, i.e. a probe sequence for key k is equally likely to be any permutation of {0,…,m}. What is the expected number of probes with load factor α? Note: since this is open-address hashing, α = n/m < 1.
Probing Complexity – Case I
Case I: key not in hash table. Key not in table ⇒ every probe, except for the last, is to an occupied slot. Let pi = Pr{exactly i probes access occupied slots}. Then the expected number of probes is
1 + Σ_{i=0}^{∞} i·pi.
Probing Complexity – Case I
Note: for i>n, pi = 0. Claim: Σ_{i=0}^{∞} i·pi = Σ_{i=1}^{∞} qi, where qi = Pr{at least i probes access occupied slots}. Why?
Probing Complexity – Case I
(Diagram: each term i·pi is counted as i unit columns; summing column by column gives q1 + q2 + q3 + q4 + ….)
What is qi?
q1 = n/m: there are n elements in the table, so the probability of having something in the first slot accessed is n/m.
q2 = (n/m)·((n-1)/(m-1)).
In general, qi = (n/m)·((n-1)/(m-1))···((n-i+1)/(m-i+1)) ≤ (n/m)^i = α^i.
Probing Complexity – Case I
Conclude: the expected complexity of inserting an element into the hash table is
1 + Σ_{i=1}^{∞} qi ≤ 1 + Σ_{i=1}^{∞} α^i = 1/(1-α).
Example: if the hash table is half full, the expected number of probes is 2. If it is 90% full, the expected number is 10.
Probing Complexity – Case II
Case II: key in hash table. As in the chaining case, expected search time for a key in the table = expected insertion time. We have shown: the expected time to insert key i+1, when i keys occupy the table, is 1/(1-i/m) = m/(m-i).
Probing Complexity – Case II
Conclude: expected key insertion time, averaged over the n keys:
(1/n) Σ_{i=0}^{n-1} m/(m-i) = (m/n) Σ_{i=0}^{n-1} 1/(m-i) = (1/α) Σ_{k=m-n+1}^{m} 1/k.
Need to compute Σ_{k=m-n+1}^{m} 1/k.
Probing Complexity – Case II
In general: what is Σ_{k=a}^{b} f(k)?
Probing Complexity – Case II
Consider any monotonic function f(x)>0. Approximate Σ_{k=a}^{b} f(k) by the area under the function.
Probing Complexity – Case II
(Figure: f(x) plotted over [a-1, b+1]; the sum Σ_{k=a}^{b} f(k) is a union of unit-width rectangles, bounded by the areas under f between a-1 and b and between a and b+1.)
Probing Complexity – Case II
Conclude: approximate Σ_{k=m-n+1}^{m} 1/k by ∫_{m-n}^{m} (1/x) dx = ln m - ln(m-n) = ln(m/(m-n)) = ln(1/(1-α)).
Probing Complexity – Case II
Conclude: expected key insertion time is (1/α)·ln(1/(1-α)).
Example: if the hash table is half full, the expected number of probes is 2·ln 2 ≈ 1.4. If it is 90% full, the expected number is ≈ 2.6.
Types of Probing
What functions can be used for probing?
1. Linear
2. Quadratic
3. Double Hashing
Linear Probing
    j <- h(k)
    If T[j] occupied, then repeat until finding an empty slot:
        j <- j+1
    End
    T[j] <- k
Attention:
1. Wrap around when the end of the table is reached.
2. If the table is full, then overflow error.
A runnable sketch follows.
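A minimal runnable version in Python handling both attention points (function names ours; empty slots are None):

    def linear_probe_insert(T, k, h0):
        m = len(T)
        j = h0(k)
        for _ in range(m):           # at most m probes: detects a full table
            if T[j] is None:
                T[j] = k
                return j
            j = (j + 1) % m          # wrap around at end of table
        raise OverflowError("hash table overflow")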
Linear Probing - Discussion
Pro: easy to implement. Con: clustering. An empty slot following a cluster of length i has probability (i+1)/m of being filled next.
Idea behind Probing
Hashing function of the form h: U × {0,…,m} -> {0,…,m}. The initial hashing function, h'(k), gives the first value; if that slot is taken, we try h(k,i) for i=0,…,m. In linear probing we have: h(k,i) = (h'(k)+i) mod (m+1).
For Uniformity:
There are m! different probe paths of length m; ideally we should be able to generate them all. Linear probing generates only m paths, so it is clearly not uniform. The more different paths generated, the better.
Quadratic Probing
h(k,i) = (h'(k) + c1·i + c2·i^2) mod m
Works better, but:
1. Still only m different paths.
2. Note that if h'(x)=h'(y) then h(x,i)=h(y,i) for all i. This causes what is called secondary clustering.
Double Hashing
h(k,i) = (h1(k) + i·h2(k)) mod m
To get a permutation, h2(k) must be relatively prime to m. Possibility: h1(k) = k mod m, h2(k) = 1 + (k mod m'), with either m a power of 2 and m' odd, or m and m' both prime with distance 2 between them.
Double Hashing Example
h(k,i) = (h1(k) + i·h2(k)) mod m, with h1(k) = k mod 13, h2(k) = 1 + (k mod 11).
Probing seq. of 14: 14 mod 13 = 1, (1+4) mod 13 = 5, (1+8) mod 13 = 9.
Probing seq. of 27: 27 mod 13 = 1, (1+6) mod 13 = 7, (1+12) mod 13 = 0.
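A quick Python check of this example (the key 27 reconstructs the slide's second sequence, which starts at 1 with step h2 = 6):

    def h1(k): return k % 13
    def h2(k): return 1 + (k % 11)
    def probe(k, i): return (h1(k) + i * h2(k)) % 13

    print([probe(14, i) for i in range(3)])   # [1, 5, 9]
    print([probe(27, i) for i in range(3)])   # [1, 7, 0]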
Double Hashing Example
Advantage: m^2 different probe sequences are generated (one per pair (h1(k), h2(k))).
Deletions
Problem: if an element is simply removed, we may later think a key is not in the table (its probe path is cut short by the emptied slot). Solution: when an element is deleted, mark its slot with a flag; searches skip flagged slots, insertions may reuse them. Meaning: many deletions can cause long searches. Therefore: in a very dynamic setting, use chaining.
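A minimal sketch of the flag ("tombstone") solution, continuing the open-addressing code above (names ours):

    DELETED = object()                   # the deletion flag ("tombstone")

    def oa_delete(T, k, h):
        for i in range(len(T)):
            j = h(k, i)
            if T[j] is None:
                return                   # k not in table
            if T[j] == k:
                T[j] = DELETED           # flag, don't empty: keys inserted after k
                return                   # may have probed past this slot

    # Search must skip DELETED slots and stop only at truly empty (None) slots;
    # insert may reuse DELETED slots.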