1 © Love Ekenberg Hashing Love Ekenberg

In General
These slides give an overview of hashing techniques for storing data efficiently and show how hash tables and hash functions are used. Three hashing techniques are treated here: separate linking, linear probing, and double hashing.

Storage
Representing the data directly in an array requires too much space. Suppose we want to represent all words in Swedish of at most 10 letters. Because the Swedish alphabet contains 29 letters, the array would need up to 29^10 + 29^9 + … + 29 slots, i.e., one for every possible 10-letter string, every possible 9-letter string, etc. There are only about 250,000 words in Swedish, so only a tiny fraction of the slots would hold a meaningful word.
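To get a feeling for the sizes involved, the sum above can be evaluated directly. This is only an illustrative sketch; the constant names are made up for this example, and the figure of 250,000 words is the approximate one quoted on the slide.

```python
ALPHABET_SIZE = 29          # letters in the Swedish alphabet
MAX_LENGTH = 10             # longest word length considered
MEANINGFUL_WORDS = 250_000  # approximate figure quoted on the slide

# Number of possible letter strings of length 1..MAX_LENGTH:
# 29^10 + 29^9 + ... + 29
possible = sum(ALPHABET_SIZE ** k for k in range(1, MAX_LENGTH + 1))

# Share of array slots that would hold a real word.
fraction_used = MEANINGFUL_WORDS / possible

print(f"possible strings: {possible:.3e}")
print(f"fraction that are real words: {fraction_used:.1e}")
```

Fewer than one slot in a billion would be occupied, which is why a direct array representation is not practical.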

Projection
100 million words can be projected into a number of boxes.
Problem: Addressing the elements is difficult and could become almost as demanding memory-wise as storing the words themselves. The boxes may also fill unevenly.
Solution: List only the meaningful words in each box. The name of the box then becomes a heading for the list. Distribute the words as randomly as possible among the boxes in order to achieve even storage.

Hash Tables
Hash tables attempt to strike a balance between memory and efficiency requirements. Headings are set up for addressing the boxes, and each box can then be searched sequentially. In the example below there are M boxes, where M is an appropriately chosen number. The number (heading) for an element x is generated from x itself via a hash function, h(x).
  Heading
  0
  1
  2
  …
  h(x)  -> element 1 -> element 2 -> … -> element n
  …
  M-1

Hash Functions
A hash function takes an argument x and generates a value between 0 and M-1, where M is the number of boxes (headings) in the table. The value h(x) is where element x is put. The idea is to combine direct access with searching in a list, where each list (in the best case) holds only 1/M of the elements of the original set and the elements are more or less evenly distributed among the boxes. For example, 100 words distributed evenly over 10 boxes give 100/10 = 10 elements in each box.

Example
Let ORD be a function that yields a letter's position in the Swedish alphabet. If the elements to be stored are Swedish words, an element x can be written a_1 a_2 … a_k, where each a_i is a letter. Then f(x) can be defined as ORD(a_1) + ORD(a_2) + … + ORD(a_k). Lastly, let the hash function be h(x) = f(x) MOD M. (x MOD M yields the remainder of x divided by M. For example, 75 MOD 10 = 5 since 75 = 7 × 10 + 5.)
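The definition above can be sketched in a few lines of Python. The alphabet string and the sample word "hej" are assumptions made for this illustration; ORD, f and h follow the definitions on the slide.

```python
# Assumption for this sketch: the 29-letter Swedish alphabet in its usual order.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"

def ORD(letter: str) -> int:
    """Position of a letter in the Swedish alphabet, counting from 1."""
    return SWEDISH_ALPHABET.index(letter.lower()) + 1

def f(word: str) -> int:
    """Sum of the letter positions: ORD(a_1) + ... + ORD(a_k)."""
    return sum(ORD(a) for a in word)

def h(word: str, M: int) -> int:
    """Hash value: f(x) MOD M, always in the range 0..M-1."""
    return f(word) % M

print(f("hej"), h("hej", 10))  # h=8, e=5, j=10, so f = 23 and h = 3
```

Because only the sum of the letters matters, anagrams such as "hej" and "jeh" always collide under this function.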

Example (cont.)
Store the following string:
  anyone lived in a pretty how town

  type word = array [1..10] of char;

  function h(x: word): integer;
  var i, sum: integer;
  begin
    sum := 0;
    for i := 1 to 10 do
      sum := sum + ORD(x[i]);   { add the code of each character }
    h := sum MOD M              { M is the number of boxes }
  end;

Example (cont.)
  Word     Sum   Bucket
  anyone   778   3
  lived    692   2
  in       471   1
  a        385   0
  pretty   808   3
  how      558   3
  town     648   3
Here the choice of hash function is important. The example displays a certain unevenness because too many elements come under heading 3.
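The Sum and Bucket columns can be reproduced under two assumptions that the slides do not state explicitly: ORD returns the ASCII code of a character, each word is padded with spaces to 10 characters, and M = 5. The sketch below is a check of the table under those assumptions, not the author's code.

```python
M = 5        # assumption: table size inferred from the Bucket column
WIDTH = 10   # words are stored in an array of 10 characters

def padded_sum(word: str) -> int:
    """Sum of ASCII codes after padding the word with spaces to WIDTH."""
    return sum(ord(c) for c in word.ljust(WIDTH))

words = "anyone lived in a pretty how town".split()
table = {w: (padded_sum(w), padded_sum(w) % M) for w in words}

for w, (s, bucket) in table.items():
    print(f"{w:<8}{s:>5}{bucket:>4}")
```

Four of the seven words land in bucket 3, which is the unevenness the slide points out.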

Operations on Hash Tables
We can now operate on hash tables in various ways. Common operations are:
– inserting elements
– deleting elements
– checking elements (look up)
The algorithm for performing one of these operations is:
1. Calculate h(x).
2. Use the array of pointers to find the list h(x) of elements.
3. Carry out the operation.

Example
  Proc bucketInsert(x: Word; L: List)    { inserts element x }
    if L = NIL then                      { NIL is the end of the list }
      new(L)                             { a new element is created in list L }
      L.element := x                     { the element is x }
      L.next := NIL                      { nothing follows x, so the list ends here; field name next is assumed }
    else if L.element <> x then
      bucketInsert(x, L.next)            { x differs from the current element: try the next one }
If the element x differs from the current element in the list, the procedure is called again with the next element in the list.
Suppose we want to delete 'pretty'. Calculate h(pretty), which is 3. The second cell of list 3 contains pretty, and it is deleted.
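The three operations can be sketched together as follows. This is a minimal illustration, not the slides' implementation: Python lists stand in for the linked lists, the class and method names are invented for the example, and the hash function (ASCII sum of the word padded to 10 characters, with M = 5) is an assumption chosen to match the table of sums shown earlier.

```python
class ChainedHashTable:
    """Hash table with one list of elements per heading (box)."""

    def __init__(self, M, hash_sum):
        self.M = M
        self.hash_sum = hash_sum               # maps an element to an integer
        self.buckets = [[] for _ in range(M)]  # one list per heading 0..M-1

    def _bucket(self, x):
        # Steps 1 and 2: calculate h(x) and find the corresponding list.
        return self.buckets[self.hash_sum(x) % self.M]

    def insert(self, x):
        bucket = self._bucket(x)
        if x not in bucket:    # like bucketInsert: do nothing if x is present
            bucket.append(x)   # otherwise add x at the end of the list

    def contains(self, x):
        return x in self._bucket(x)

    def delete(self, x):
        bucket = self._bucket(x)
        if x in bucket:
            bucket.remove(x)


def padded_ascii_sum(word, width=10):
    """Assumed hash sum: ASCII codes of the word padded with spaces."""
    return sum(ord(c) for c in word.ljust(width))


table = ChainedHashTable(5, padded_ascii_sum)
for w in "anyone lived in a pretty how town".split():
    table.insert(w)

print(table.buckets[3])   # list under heading 3 before the deletion
table.delete("pretty")
print(table.buckets[3])   # 'pretty' was the second cell and is now gone
```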

Complexity of Operations on Hash Tables
Finding the hash number is O(1), assuming the hash function is not too complicated. In addition, the operation itself requires O(N/M), where N is the number of elements stored. This holds because, for example, bucketInsert requires time proportional to the number of elements in the list, which on average is the total number of elements N divided by the number of boxes M.

Separate linking
The technique described here is called separate linking (also known as separate chaining). It divides the elements among a number of boxes; within each box the elements are linked to each other sequentially. It should now be easy to accept the following theorem.
Theorem: Separate linking reduces the number of comparisons for a sequential search by a factor of M (on average).

Some Observations
Let N be the total number of elements and M be the number of headings.
If N and M are close, the cost of an operation is about O(1).
If M > N, then O(1) still holds provided at most one element is placed under each heading. It is therefore pointless to extend the table further.
If N is much larger than M, then a larger M can (and should) be chosen and all the elements moved to the new table. This takes time O(N), which is no longer than the time O(N · O(1)) = O(N) it took to insert the elements into the original table.
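Moving all elements to a larger table can be sketched as below. The function name rehash and the integer test data are assumptions for this illustration; each element is visited exactly once, which is where the O(N) cost comes from.

```python
def rehash(buckets, new_M, h):
    """Move every element into a new table with new_M boxes.

    Each element is hashed and appended once, so the total work is O(N).
    """
    new_buckets = [[] for _ in range(new_M)]
    for bucket in buckets:
        for x in bucket:
            new_buckets[h(x) % new_M].append(x)
    return new_buckets


# Illustrative data: N = 20 integers crammed into M = 2 boxes.
old = [[] for _ in range(2)]
for x in range(20):
    old[x % 2].append(x)

new = rehash(old, 23, lambda x: x)   # a larger table with M = 23

print(max(len(b) for b in old))   # longest list before the move
print(max(len(b) for b in new))   # longest list after the move
```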

Linear Probing
If the number of elements in the table can be assessed in advance, then M > N can be chosen and so-called 'open addressing methods' used. This means that we know there is a box available for every element, so linked lists are not needed. The advantage is direct access to the elements, never requiring a search through linked lists. A suitable technique in this case is linear probing.
If a collision occurs, the next box is tried:
– If there is free space: insert (or delete, or check) the element and finish.
– Otherwise continue with the following box.

Example
Let M = 19. Store the letters of the string ASEARCHINGEXAMPLE using the hash function ORD(x) MOD 19, as below.
  Key:   A  S  E  A  R  C  H  I  N  G  E  X  A  M  P  L  E
  Hash:  1  0  5  1 18  3  8  9 14  7  5  5  1 13 16 12  5
Clearly several elements come under the same heading, which should be avoided. When such collisions occur, a simple trick is to move the element to the next available space, i.e., to test the next box. If there is an element there, test the box after that, and so on. Continue in this way until an empty box is found.

Example (cont.)
The first collision occurs when trying to place the second A, i.e., upon reaching ASEA. The hash function prescribes placing it under heading 1.
  Heading:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
  Element:  S  A  -  -  -  E  -  -  -  -  -  -  -  -  -  -  -  -  -
However, heading 1 is taken, and since there are no elements under heading 2, the A can be put there.
  Heading:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
  Element:  S  A  A  -  -  E  -  -  -  -  -  -  -  -  -  -  -  -  -

Example (cont.)
Continuing like this will gradually yield the following table.
  Heading:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
  Element:  S  A  A  C  -  E  -  G  H  I  -  -  -  -  N  -  -  -  R
The next element is a new E. Heading 5 is taken, but heading 6 is free, so E can be put under heading 6.
  Heading:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
  Element:  S  A  A  C  -  E  E  G  H  I  -  -  -  -  N  -  -  -  R
The next element is X. The hash function ORD(x) MOD 19 projects X onto 5, which is taken, so the algorithm tries 6, which is also taken, and then 7, which is also taken. Continuing in this way, 10 is finally found to be free and X is placed there.
  Heading:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
  Element:  S  A  A  C  -  E  E  G  H  I  X  -  -  -  N  -  -  -  R
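The insertions walked through above can be replayed with a short sketch. The function names are chosen for this illustration, ORD is taken to be the position of an upper-case letter in the alphabet (A = 1), and the loop assumes the table is never completely full (M > N).

```python
M = 19

def ORD(letter: str) -> int:
    """Position of an upper-case letter in the alphabet, A = 1."""
    return ord(letter) - ord("A") + 1

def linear_probe_insert(table, key):
    """Insert key at the first free box at or after ORD(key) MOD M."""
    i = ORD(key) % M
    while table[i] is not None:   # collision: test the next box, wrapping around
        i = (i + 1) % M
    table[i] = key
    return i

table = [None] * M
for key in "ASEARCHINGEX":
    linear_probe_insert(table, key)

# Show the table with '-' for empty boxes.
print(" ".join(c or "-" for c in table))
```

The X ends up under heading 10 after five occupied boxes (5 to 9) have been tested, which illustrates how runs of filled boxes lengthen the search.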

Theorem
The following holds but need not be proved.
Linear probing uses approximately 1/2 + 1/(2(1 - N/M)^2) operations on average for an unsuccessful search (or an insertion) and 1/2 + 1/(2(1 - N/M)) on average for a successful search.

Double Hashing
As can be seen from the example above, linear probing is inefficient when nearby boxes begin to fill up. This is termed clustering. An alternative is double hashing, which is used to avoid clustering and uses a second function h_2(v) to shunt elements along. Instead of moving one step, to (h_1(x) + 1) MOD M, as in linear probing, the next box examined is (h_1(x) + h_2(h_1(x))) MOD M, where h_1(x) is the first hash function.
A good function is h_2(h_1(x)) = M - 2 - (h_1(x) MOD (M - 2)).
Another is h_2(h_1(x)) = 8 - (h_1(x) MOD 8). (See the example below.)

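Example
The table below shows the projections of the functions h_1 and h_2.
  Key:   A  S  E  A  R  C  H  I  N  G  E  X  A  M  P  L  E
  h_1:   1  0  5  1 18  3  8  9 14  7  5  5  1 13 16 12  5
  h_2:   7  5  3  7  6  5  8  7  2  1  3  8  7  3  8  4  3
When a collision occurs, the first box to be examined is that at position (h_1(x) + h_2(h_1(x))) MOD 19, where h_2(h_1(x)) = 8 - (h_1(x) MOD 8). For example:
h_2(h_1(A)) = h_2(1) = 8 - (1 MOD 8) = 8 - 1 = 7.
h_2(h_1(P)) = h_2(16) = 8 - (16 MOD 8) = 8.
Note that the h_2 row of the table matches 8 - (ORD(x) MOD 8), computed from the letter itself. This coincides with h_2(h_1(x)) whenever ORD(x) < 19, i.e., for every letter here except S and X.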

Example (cont.)
The first collision occurs upon trying to insert the second A, i.e., upon arriving at ASEA. The hash function prescribes placing it under heading 1.
  Heading:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
  Element:  S  A  -  -  -  E  -  -  -  -  -  -  -  -  -  -  -  -  -
h_2(h_1(A)) = h_2(1) = 8 - (1 MOD 8) = 8 - 1 = 7.
1 + 7 = 8, and since there are no elements under heading 8, the new A is put there.
  Heading:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
  Element:  S  A  -  -  -  E  -  -  A  -  -  -  -  -  -  -  -  -  -
In this way the elements can be spread out using both functions.
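The step from linear probing to double hashing is small, as the following sketch suggests. It uses h_1(x) = ORD(x) MOD 19 and h_2(v) = 8 - (v MOD 8) as defined on the slides; the function names and the choice to repeat the same step on every further collision are assumptions of this illustration.

```python
M = 19

def ORD(letter: str) -> int:
    """Position of an upper-case letter in the alphabet, A = 1."""
    return ord(letter) - ord("A") + 1

def h1(key: str) -> int:
    return ORD(key) % M

def h2(v: int) -> int:
    return 8 - (v % 8)   # always between 1 and 8, never 0

def double_hash_insert(table, key):
    """Insert key, moving h2(h1(key)) boxes at a time after a collision."""
    i = h1(key)
    step = h2(h1(key))
    for _ in range(M):            # at most M probes are ever needed
        if table[i] is None:
            table[i] = key
            return i
        i = (i + step) % M
    raise RuntimeError("no free box found")

table = [None] * M
positions = [double_hash_insert(table, key) for key in "ASEA"]
print(positions)   # the second A collides at 1 and moves 7 steps on
```

Because M = 19 is prime and the step is between 1 and 8, the probe sequence can reach every box before it repeats.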

The Choice of Function
Naturally h_2(x) should be chosen wisely. For example, neither M nor h_2(x) should be a divisor of the other.
Example: Let M be 10 and h_2(x) = x MOD 6. Now try to store the string EEEE. ORD(E) = 5 and h_2(5) = 5.
The first E is put under heading 5. Because ORD(E) = 5 and h_2(5) = 5, the second E is sent on to 5 + 5 = 10. Since 10 MOD 10 = 0, it comes under heading 0.
  Heading:  0  1  2  3  4  5  6  7  8  9
  Element:  E  -  -  -  -  E  -  -  -  -
The third E starts at heading 5, which is taken, so it is sent on to heading (5 + 5) MOD 10 = 0. With this heading also taken, E is sent to (0 + 5) MOD 10 = 5 again. The algorithm has returned to its starting point without h_2 having found a free space, in spite of several available spaces remaining. Similar behaviour occurs whenever the value of h_2 is a divisor of M (more generally, whenever it shares a common factor with M).
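The cycling can be made visible by listing which boxes a probe sequence ever reaches. The helper name probe_sequence is invented for this sketch; the comparison with a step of 3 is an added illustration, not part of the slide.

```python
from math import gcd

def probe_sequence(start, step, M):
    """Boxes visited before the sequence returns to one already seen."""
    seen = []
    i = start % M
    while i not in seen:
        seen.append(i)
        i = (i + step) % M
    return seen

M = 10
bad = probe_sequence(5, 5, M)    # step 5 divides M = 10
good = probe_sequence(5, 3, M)   # step 3 has no common factor with 10

print(bad)    # only two boxes are ever tested
print(good)   # every box is tested once

# The number of boxes reached is M divided by gcd(step, M).
print(M // gcd(5, M), M // gcd(3, M))
```

This is one reason for choosing M prime: every step between 1 and M-1 then reaches all boxes.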

Theorem
The following holds but need not be proved.
Double hashing uses approximately 1/(1 - N/M) operations on average for an unsuccessful search (or an insertion) and -ln(1 - N/M)/(N/M) on average for a successful search.
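To compare the two open addressing methods, the formulas can be evaluated for a few load factors N/M. The sketch below assumes the usual textbook form of both theorems, i.e., average probe counts for an unsuccessful and a successful search; the function names are made up for this example.

```python
from math import log

def linear_probing_probes(alpha):
    """Approximate average probes (unsuccessful, successful), alpha = N/M < 1."""
    miss = 0.5 + 1 / (2 * (1 - alpha) ** 2)
    hit = 0.5 + 1 / (2 * (1 - alpha))
    return miss, hit

def double_hashing_probes(alpha):
    """Approximate average probes (unsuccessful, successful), 0 < alpha < 1."""
    miss = 1 / (1 - alpha)
    hit = -log(1 - alpha) / alpha
    return miss, hit

for alpha in (0.5, 0.75, 0.9):
    lp_miss, lp_hit = linear_probing_probes(alpha)
    dh_miss, dh_hit = double_hashing_probes(alpha)
    print(f"N/M={alpha:.2f}  linear: {lp_miss:6.2f} {lp_hit:5.2f}"
          f"  double: {dh_miss:6.2f} {dh_hit:5.2f}")
```

Both methods are cheap while the table is half full; as N/M approaches 1 the cost of linear probing grows much faster, which reflects the clustering described earlier.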

