CPSC 252 Hashing

Hashing

We have already seen that we can search for a key item in an array using either linear or binary search. It would be better if we could use the key to index directly into the array. Let's consider some fairly simple examples:

Example: Suppose we are running a club and want to keep data (such as telephone number, address, etc.) for each member. If membership numbers are in the range 1..N, we can store data for member j in slot j-1 of an array. Hence the membership number (the key) can be used to index into the array and access the member's data (the value) in O(1) time.
Example: Suppose we want to count the frequency with which characters appear in a file. We can use the character's ASCII code (the key) to index into an array of frequencies (the values). In this case, our array of frequencies will have 256 entries, one for each possible 8-bit character code.

It could be argued that not all files will contain occurrences of all these different characters, so we are potentially wasting memory. This is true, but in exchange we gain the benefit of accessing frequencies in O(1) time.

Suppose we make the call:

    int freq = getFrequency( 'A' );

Since the ASCII code for 'A' is 65, the getFrequency function will return the frequency found at array index 65.
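The character-counting idea can be sketched as follows. This is a minimal illustration, not the course's implementation: countFrequencies is a hypothetical helper, and this version of getFrequency takes the frequency array as an extra parameter, which the call shown in the slide does not.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical helper: count how often each byte value occurs in `text`.
// The character code (0..255) is used directly as the array index.
std::array<std::size_t, 256> countFrequencies(const std::string& text) {
    std::array<std::size_t, 256> freq{};  // value-initialised to all zeros
    for (unsigned char c : text) {        // unsigned char avoids negative indices
        ++freq[c];
    }
    return freq;
}

// Look up the frequency of one character in O(1) time.
// Assumption: the frequency array is passed in explicitly.
std::size_t getFrequency(const std::array<std::size_t, 256>& freq, unsigned char c) {
    return freq[c];
}
```

Iterating over the text as unsigned char matters: on platforms where plain char is signed, codes above 127 would otherwise become negative indices.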
In these first couple of examples, it has been easy to see how a key can be constructed that can be used to index into an array. In general the process is not so straightforward.

Example: Suppose we want to keep track of VISA accounts. VISA card numbers are 16 digits long, allowing for a total of 10^16 different numbers. If we were to dimension an array of accounts of size 10^16, we would have far more accounts than there are people on the planet. (Not to mention the difficulty of finding a machine with enough memory.)

It is obviously not reasonable to consider using the VISA number to index into an array.
Example: What if the key is a string? National Insurance numbers in the UK, for example, are alphanumeric. We cannot use a string to index into an array.

Hashing solves these problems. We define a function that transforms a key into an array index. Such a function is called a hash function.
Hash Functions

When we design a hash function it should:

- be easy to compute. Every time we want to retrieve a value we have to invoke the hash function to map the key value into a corresponding index into the array. We would lose some of the benefits of our O(1) access time if the hash function were very inefficient.

- distribute key values evenly across the range of array indices. Mapping all key values to one particular index makes for a very easy hash function, but the key values will all collide. We will discuss collisions a little later on.
Strategies for building hash functions

Truncation: use only part of the key as the hash index.

For example, if we want to build a table of student records at UBC, rather than using the entire student number (8 digits long) as the key value, we could use only the first 5 digits. We would therefore require an array of student records dimensioned to have size 100,000.

Note that some students may have the first 5 digits of their student number in common and so collisions could occur (more later).
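A minimal sketch of truncation, assuming student numbers are stored as unsigned integers of at most 8 digits. The function name truncationHash is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: hash an 8-digit student number by keeping its first 5 digits.
// Assumes studentNumber <= 99999999; shorter numbers are treated as if they
// were padded with leading zeros.
std::uint32_t truncationHash(std::uint32_t studentNumber) {
    return studentNumber / 1000;  // drop the last 3 digits; result is in 0..99999
}
```

Student numbers 12345678 and 12345999 both hash to 12345, which illustrates the kind of collision mentioned above.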
Folding: partition the key into parts and combine the parts using arithmetic operations.

For example, we could take a VISA number and partition it into one 6-digit number plus two 5-digit numbers. We could then add these three numbers together to come up with a hash index.
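One way the folding example might look in code. The split used here (first 6 digits, next 5, last 5) is one reading of the description, and foldingHash is a hypothetical name.

```cpp
#include <cstdint>

// Hypothetical helper: fold a 16-digit card number into a smaller hash index.
// Assumption: the 6-digit part is taken from the front of the number.
std::uint64_t foldingHash(std::uint64_t cardNumber) {
    const std::uint64_t low    = cardNumber % 100000;             // last 5 digits
    const std::uint64_t middle = (cardNumber / 100000) % 100000;  // middle 5 digits
    const std::uint64_t high   = cardNumber / 10000000000ULL;     // first 6 digits
    return high + middle + low;  // at most 999999 + 99999 + 99999 = 1199997
}
```

For the card number 1234567890123456 the three parts are 123456, 78901 and 23456, giving a hash index of 225813. A table indexed this way needs about 1.2 million slots rather than 10^16.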
Modular Arithmetic: use the modulus operator to produce a hash index in a required range.

For example, suppose a company having about 150 employees wants to use an employee's social insurance number as a key value. We might dimension the array of employee data to be 200 (leaving room for expansion). We can then generate a hash index from the social insurance number as follows:

    hashIndex = SINumber % 200;

Again we would expect that this hash function could lead to collisions.
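A small sketch of the modulus approach, assuming social insurance numbers fit in a 32-bit unsigned integer. The names kEmployeeTableSize and modHash are hypothetical.

```cpp
#include <cstdint>

// Table size from the example: about 150 employees, with room for expansion.
constexpr std::uint32_t kEmployeeTableSize = 200;

// Map a social insurance number to an index in 0..199.
std::uint32_t modHash(std::uint32_t siNumber) {
    return siNumber % kEmployeeTableSize;
}
```

Because 1000 is a multiple of 200, only the last three digits of the number affect the result: 123456789 and 987654389 both hash to 189, so these two keys collide.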
Note: in general we must be sure to use the most significant parts of a key, that is, the parts that best distinguish one item from another.

Suppose for example that a company assigns employee ID numbers as follows: bpqrcs, where b is a branch number, c is a check digit and pqrs are particular to the employee. We want to construct a hash function that extracts digits pqrs from the employee ID:

    int hash( int key )
    {
        int hashCode;
        hashCode = ( ( key / 100 ) % 1000 );      // digits pqr
        hashCode = hashCode * 10 + ( key % 10 );  // append digit s
        return hashCode;
    }

For example, the ID 712345 (b = 7, pqr = 123, c = 4, s = 5) produces the hash code 1235.
Non-integer keys and Alphanumeric Keys

Suppose we have a table dimensioned to hold 5000 records and keys that consist of strings that are 6 characters long. We can apply numeric operations to the ASCII codes of the characters in the string in order to determine a hash index:

    int hash( unsigned char* key )
    {
        int hashCode = 0;
        int index = 0;
        while( key[ index ] != '\0' )
            hashCode += int( key[ index++ ] );
        return hashCode % 5000;
    }

This function returns hash codes in the range 0 to 1530 (6 * 255), so only the first 1531 slots in the array will be used!
The hash function above fails to distribute key values uniformly across the available range of indices. We obviously want to increase the range of hash codes that are produced:

    while( key[ index ] != '\0' )
        hashCode = 2 * hashCode + int( key[ index++ ] );

Now, before the % operation, the hashCode for a 6-character key is in the range 0 to 16065 (that is, 255 * (2^6 - 1)). When we apply the % operation we therefore end up with hash codes in the range 0 to 4999.
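Putting the revised loop into a complete function gives something like the sketch below. It uses unsigned arithmetic so that overflow on long keys wraps around rather than being undefined. The names hashString and kStringTableSize are hypothetical.

```cpp
#include <cstddef>

// Table size from the example above.
constexpr unsigned int kStringTableSize = 5000;

// Hash a null-terminated string by doubling the running code before adding
// each character, so that the position of a character affects the result.
unsigned int hashString(const unsigned char* key) {
    unsigned int hashCode = 0;
    for (std::size_t i = 0; key[i] != '\0'; ++i) {
        hashCode = 2 * hashCode + key[i];
    }
    return hashCode % kStringTableSize;
}
```

Unlike the plain sum, this version distinguishes keys that contain the same characters in a different order: "ab" hashes to 2 * 97 + 98 = 292, while "ba" hashes to 2 * 98 + 97 = 293.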
Collision Resolution

Many of the hash functions that we have seen will result in more than one key being mapped to the same index. This is known as a collision. A collision resolution strategy is an algorithm for specifying what to do when a collision occurs.

Linear Probing

Suppose we want to insert an item into the table but the slot at the hashed index is already occupied. The linear probing algorithm increases the hashed index by one at a time until an empty slot is found. If we reach the end of the array, we wrap around to the beginning and continue the search.

Example: suppose we have 10 items each having a 2-digit key. We define the hash code as follows:

    hashCode = key % 10; // use last digit of key
Let's consider how an array of dimension 10 will be filled if the following key values are used to insert data into the table in the order presented:

    Key:       32  47  26  34  87  39  78  61  48  66
    Hash:
    # probes:

Note that as the table starts to fill, the number of probes increases. In the last case, we perform O(N) probes and so we appear to have lost the benefits of hashing.
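The example can be worked through with a short sketch of linear-probing insertion. It assumes non-negative integer keys, uses -1 to mark an empty slot, and counts every slot examined (including the first) as one probe. That counting convention is an assumption, and ProbeResult and insertLinear are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

struct ProbeResult {
    std::size_t slot;    // index where the key was stored
    std::size_t probes;  // number of slots examined, including the first
};

// Insert `key` into `table` (where -1 marks an empty slot) using linear probing.
// Assumes key >= 0 and that the table has at least one empty slot.
ProbeResult insertLinear(std::vector<int>& table, int key) {
    const std::size_t n = table.size();
    std::size_t index = static_cast<std::size_t>(key) % n;  // hashCode = key % n
    std::size_t probes = 1;
    while (table[index] != -1) {
        index = (index + 1) % n;  // step forward, wrapping at the end of the array
        ++probes;
    }
    table[index] = key;
    return {index, probes};
}
```

Inserting the ten keys above in order into a table of size 10 places them in slots 2, 7, 6, 4, 8, 9, 0, 1, 3, 5, using 1, 1, 1, 1, 2, 1, 3, 1, 6 and 10 probes respectively under this convention. The last insertion examines every slot in the table.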
A major drawback to linear probing is clustering. This occurs when occupied slots cluster together in groups. This condition is known as primary clustering. Once clustering has started to occur, the number of probes needed to find an empty slot starts to increase.

It can be determined empirically that:

- when a table is 10% full there is no clustering
- when a table is 30-50% full, clusters start to form
- when a table reaches 70% full, clusters start colliding with each other, resulting in secondary clustering.

By reducing clustering, we reduce the number of probes necessary to find an empty slot when a collision occurs. The results above suggest that with linear probing we should keep the table no more than about half full.
We now see that the time efficiency gained from hashing comes at the cost of space efficiency. If we want to create a table of 10,000 employee records, we should dimension the array to have size 20,000 so that the array is always at most half full.

There are other drawbacks to linear probing:

- it is very expensive to expand the table if it starts to become full. We have to create a larger array and then rehash and re-insert all of the items into the new table in order to redistribute them uniformly throughout the table.

- when the table reaches about 75% full, the search time degrades to O(N), at which point we might as well revert to a linear search (array or list), which is much simpler to implement.
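The expansion step described in the first drawback can be sketched as follows. The representation is an assumption (non-negative integer keys, with -1 marking an empty slot), as are the hash function key % size and the name rehash.

```cpp
#include <cstddef>
#include <vector>

// Sketch: copy every key from `oldTable` into a new table of `newSize` slots.
// Each key is hashed again because its index depends on the table size.
// Assumes keys are non-negative and newSize exceeds the number of stored keys.
std::vector<int> rehash(const std::vector<int>& oldTable, std::size_t newSize) {
    std::vector<int> newTable(newSize, -1);
    for (int key : oldTable) {
        if (key == -1) continue;  // skip empty slots
        std::size_t index = static_cast<std::size_t>(key) % newSize;
        while (newTable[index] != -1) {
            index = (index + 1) % newSize;  // linear probing in the new table
        }
        newTable[index] = key;
    }
    return newTable;
}
```

Every stored item is visited once, so expanding the table costs at least O(N) work. For instance, keys 7, 12 and 3 occupy the adjacent slots 2, 3 and 4 of a 5-slot table, but after rehashing into 11 slots they land in slots 7, 1 and 3, and the cluster is broken up.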
Deleting an item from a hash table

When linear probing is used to insert items into a hash table, we have to be careful about how we remove items from that table.

Suppose that the following keys are used to insert data into a table in the order given: , , , ,

And that the hash code is generated as:

    hashCode = key % 10000;

so we use only the last four digits of the key.

We will assume that we can mark each slot in the array as either empty or occupied.
[Figure: partial picture of the hash table]

Suppose that we search for an entry whose key hashes to 4473: we start at location 4473 and probe forward until we either find the item or reach an empty slot, whichever happens first.

Now suppose that we delete the entry in slot 4474 by marking that slot as empty. If we search again, using the strategy above, for an entry that was stored beyond slot 4474, the search will end at slot 4474 (which is now empty) and we cannot find the entry, even though it is still in the table.
One way to address this problem is to assign to each location a flag which can have one of the values: empty, occupied or dirty.

Initially, every location is marked as empty. When an item is inserted into a location, that location is marked as occupied. When an item is deleted, we mark the location as dirty.

Now, when we search for an item, we treat all the dirty locations as occupied and hence our search continues until we find a slot that is marked as empty.

However, when we want to insert an item, we treat all the dirty locations as empty.
[Figure: partial picture of the hash table]

On deleting an item, we mark its slot as dirty.

When we search for an item, on finding a dirty slot, we keep searching. We stop searching when we find the item, when we find a slot that is marked as empty, or when we have searched the entire array.

If we now try to insert an item, we insert it into the first slot that is marked either empty or dirty.
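The three-state scheme can be sketched as a small class. This is an illustration under stated assumptions, not the course's implementation: it stores unsigned integer keys only (no associated values), hashes with key % size, requires a table size greater than zero, and expects the caller not to insert a key that is already present. The names SlotState and ProbingTable are hypothetical.

```cpp
#include <cstddef>
#include <vector>

enum class SlotState { Empty, Occupied, Dirty };

class ProbingTable {
public:
    // Assumes size > 0.
    explicit ProbingTable(std::size_t size)
        : keys_(size, 0), states_(size, SlotState::Empty) {}

    // Insert: dirty slots are treated as empty, so they can be reused.
    // Returns false if every slot is occupied.
    bool insert(unsigned int key) {
        std::size_t index = key % keys_.size();
        for (std::size_t i = 0; i < keys_.size(); ++i) {
            if (states_[index] != SlotState::Occupied) {
                keys_[index] = key;
                states_[index] = SlotState::Occupied;
                return true;
            }
            index = (index + 1) % keys_.size();
        }
        return false;
    }

    bool contains(unsigned int key) const { return find(key) != keys_.size(); }

    // Delete: mark the slot as dirty rather than empty.
    bool remove(unsigned int key) {
        const std::size_t index = find(key);
        if (index == keys_.size()) return false;
        states_[index] = SlotState::Dirty;
        return true;
    }

private:
    // Search: dirty slots are treated as occupied, so probing continues past
    // them. Returns the slot index, or keys_.size() if the key is not found.
    std::size_t find(unsigned int key) const {
        std::size_t index = key % keys_.size();
        for (std::size_t i = 0; i < keys_.size(); ++i) {
            if (states_[index] == SlotState::Empty) break;
            if (states_[index] == SlotState::Occupied && keys_[index] == key) {
                return index;
            }
            index = (index + 1) % keys_.size();
        }
        return keys_.size();
    }

    std::vector<unsigned int> keys_;
    std::vector<SlotState> states_;
};
```

With a table of size 10, inserting 13, 23 and 33 fills slots 3, 4 and 5. After removing 23, a search for 33 still succeeds because it probes past the dirty slot 4, and a later insertion of 43 reuses that slot.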