CSE 373, Copyright S. Tanimoto, 2002 Hashing -
Motivation
Many applications need to store "associations." Rapid retrieval is sometimes more important than storage efficiency. Hashing is a flexible family of techniques for implementing associations between keys and values.
Mathematical Description
Suppose we have a function mapping keys to values.
Its domain is a set KEYS. For example, KEYS = {0, 1, ..., m-1}, or KEYS = the set of all possible ASCII strings.
Its range is a set of possible values. For example, the values may themselves be ASCII strings.
How can we represent such a function?
A Dictionary Abstract Data Type
A function expressed as a finite set of (key, value) pairs.
A: KEYS → VALUES
A = { (key_1, value_1), ..., (key_n, value_n) }
Methods:
PUT: DICTIONARIES × KEYS × VALUES → DICTIONARIES
GET: DICTIONARIES × KEYS → VALUES
GETALLKEYS: DICTIONARIES → 2^KEYS
For any set S, the set of all possible subsets of S is written 2^S, and is called the power set of S.
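The three method signatures above can be sketched as pure functions over a finite set of pairs. This is only an illustration: the names put, get, and get_all_keys are hypothetical, and representing a dictionary as a Python frozenset (which requires hashable keys and values) is an assumption made for the sketch, not something the slides prescribe.

```python
def put(d, key, value):
    """PUT: DICTIONARIES x KEYS x VALUES -> DICTIONARIES.

    Returns a new dictionary; the argument d is left unchanged.
    """
    kept = {(k, v) for (k, v) in d if k != key}  # drop any old pair for key
    return frozenset(kept | {(key, value)})


def get(d, key):
    """GET: DICTIONARIES x KEYS -> VALUES."""
    for k, v in d:
        if k == key:
            return v
    raise KeyError(key)


def get_all_keys(d):
    """GETALLKEYS: DICTIONARIES -> 2^KEYS (a subset of KEYS)."""
    return frozenset(k for k, _ in d)


empty = frozenset()
d = put(put(empty, 1978, "VAX-11/780"), 1982, "IBM-PC")
```

Because put returns a new set rather than modifying its argument, it matches the signature literally: a dictionary goes in and a dictionary comes out.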
One Implementation of a Dictionary: An Association List
( (key_1, value_1), ..., (key_n, value_n) )
[Diagram: a list of cells (key_1, value_1) → (key_2, value_2) → ... → (key_n, value_n)]
Worst case time for GET is Θ(n) cell examinations.
Expected case time for a successful GET is about n/2 cell examinations, also Θ(n).
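To make the cell-examination counts concrete, here is a minimal sketch of an association list. The class name AssociationList and the examined counter are hypothetical names introduced for illustration, and a Python list stands in for the chain of cells.

```python
class AssociationList:
    """Dictionary stored as a sequence of (key, value) cells."""

    def __init__(self):
        self.cells = []      # the (key, value) pairs, in insertion order
        self.examined = 0    # cells examined by the most recent get()

    def put(self, key, value):
        # Overwrite an existing association, otherwise append a new cell.
        for i, (k, _) in enumerate(self.cells):
            if k == key:
                self.cells[i] = (key, value)
                return
        self.cells.append((key, value))

    def get(self, key):
        # Scan from the front, counting each cell that is looked at.
        self.examined = 0
        for k, v in self.cells:
            self.examined += 1
            if k == key:
                return v
        raise KeyError(key)


al = AssociationList()
for i in range(10):
    al.put(i, str(i))
al.get(9)   # the last cell: all 10 cells are examined
```

With n = 10 equally likely keys, a successful get examines (n + 1)/2 = 5.5 cells on average, which is the "about n/2" figure quoted above.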
Hashing: Practical Implementations of the Dictionary ADT.
A hash table is a 1-dimensional array in which each cell stores zero or more associations of a dictionary.
The array index for (key_i, value_i) is determined by applying a "hash function" to the key, h(key_i), and then possibly taking additional steps, depending on the particular hashing method and whether there are any "collisions".
Hashing with Chains
[Diagram: key_i is hashed to table entry h(key_i), which heads the chain (key_i1, value_i1) → (key_i2, value_i2) → ... → (key_in, value_in); another table entry heads the chain (key_j1, value_j1) → (key_j2, value_j2).]
Each table entry is the head of a linked list of elements, all of which share the same hash value h(key_i). This is sometimes called "open" hashing.
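A minimal sketch of hashing with chains follows. The class name ChainedHashTable and its parameters are hypothetical, and each chain is a Python list rather than a hand-built linked list, which is an assumption made to keep the example short.

```python
class ChainedHashTable:
    """Each table entry holds a chain of (key, value) pairs."""

    def __init__(self, size=10, hash_fn=hash):
        self.size = size
        self.hash_fn = hash_fn
        self.table = [[] for _ in range(size)]  # one (initially empty) chain per entry

    def _chain(self, key):
        # All keys with the same hash value share one chain.
        return self.table[self.hash_fn(key) % self.size]

    def put(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)   # replace an existing association
                return
        chain.append((key, value))

    def get(self, key):
        for k, v in self._chain(key):
            if k == key:
                return v
        raise KeyError(key)


def digit_sum(key):
    """Example hash function: sum of the decimal digits of key."""
    return sum(int(d) for d in str(key))


t = ChainedHashTable(size=10, hash_fn=digit_sum)
t.put(1984, "Macintosh")
t.put(1993, "Java")   # same hash value as 1984, so it joins the same chain
```

A collision costs nothing extra at insertion time here; the price is paid at lookup, where the chain for that entry must be scanned.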
Closed Hashing
[Diagram: key_i is hashed to table entry h(key_i), which stores key_i and value_i directly.]
The associations are all stored within the hash table itself, not on linked lists.
Hashing Example
Keys: 4-digit numbers. Values: names.
Let h(d1d2d3d4) = (d1 + d2 + d3 + d4) mod 10. E.g., h(1978) = (1 + 9 + 7 + 8) mod 10 = 5.
Data: (1978, VAX-11/780), (1982, IBM-PC), (1984, Macintosh)
0: 1982 IBM-PC
1:
2: 1984 Macintosh
3:
4:
5: 1978 VAX-11/780
6:
7:
8:
9:
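The example can be reproduced with a few lines of Python. This is a sketch only: the function name h follows the slide, while the use of a 10-element list with None for empty cells is an assumption.

```python
def h(key):
    """Sum of the four decimal digits of key, mod 10."""
    return sum(int(d) for d in str(key)) % 10


table = [None] * 10   # cells 0..9, all empty
data = [(1978, "VAX-11/780"), (1982, "IBM-PC"), (1984, "Macintosh")]
for year, name in data:
    table[h(year)] = (year, name)   # no collisions occur for these three keys

for index, cell in enumerate(table):
    print(index, cell)
```

The three keys land in cells 5, 0, and 2 respectively, matching the table above.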
Hashing Example (continued)
Let h(d1d2d3d4) = (d1 + d2 + d3 + d4) mod 10. E.g., h(1978) = 5.
Put (1993, Java). h(1993) = (1 + 9 + 9 + 3) mod 10 = 2, but cell 2 already holds 1984. Collision!
0: 1982 IBM-PC
1:
2: 1984 Macintosh
3:
4:
5: 1978 VAX-11/780
6:
7:
8:
9:
Hashing Example (continued)
Linear Probing: Try (h(1993) + 1) mod 10 = 3.
If we keep getting collisions, the formula is (h(d1d2d3d4) + k) mod 10 on the kth attempt.
If all 10 positions are full, a new hash table must be created and all the old elements placed in the new table.
0: 1982 IBM-PC
1:
2: 1984 Macintosh
3: 1993 Java
4:
5: 1978 VAX-11/780
6:
7:
8:
9:
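Linear probing for this example might be sketched as follows. The function name put_linear is hypothetical, and raising an exception when the table is full stands in for the rebuild step, which is not shown.

```python
def h(key):
    """Sum of the decimal digits of key, mod 10."""
    return sum(int(d) for d in str(key)) % 10


def put_linear(table, key, value):
    """Store (key, value) using linear probing; return the cell index used."""
    n = len(table)
    for k in range(n):                 # k = 0 is the first try, k = 1 the next, ...
        slot = (h(key) + k) % n
        if table[slot] is None or table[slot][0] == key:
            table[slot] = (key, value)
            return slot
    raise RuntimeError("all positions full: build a new table and reinsert")


table = [None] * 10
for year, name in [(1978, "VAX-11/780"), (1982, "IBM-PC"),
                   (1984, "Macintosh"), (1993, "Java")]:
    put_linear(table, year, name)
```

Here 1993 hashes to cell 2, finds it occupied by 1984, and settles in cell 3 on the next attempt.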
Collision Resolution Methods
Linear Probing: (h(key) + c k) mod n
Quadratic Probing: (h(key) + c k^2) mod n
Double hashing: (h(key) + i h2(key)) mod n, where i counts attempts and h2 is a second hash function
Rehashing: Create a larger hash table and try again. (Only perform when the load factor is at least 0.5.)
Note: n should normally be a prime number to help avoid collisions. The constant c is normally small, and typically is 1.
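The probe formulas can be compared side by side with a short sketch. The function names are hypothetical, the double-hashing form assumes the common convention of adding a multiple of a second hash value to the first, and the sample values h1 = 2, h2 = 3, n = 11 are chosen only for illustration.

```python
def linear_probe(h1, k, n, c=1):
    """Cell tried on attempt k with linear probing."""
    return (h1 + c * k) % n


def quadratic_probe(h1, k, n, c=1):
    """Cell tried on attempt k with quadratic probing."""
    return (h1 + c * k * k) % n


def double_hash_probe(h1, h2, k, n):
    """Cell tried on attempt k with double hashing; h2 must be nonzero mod n."""
    return (h1 + k * h2) % n


def load_factor(count, n):
    """Fraction of the n cells that are occupied."""
    return count / n


n = 11  # a prime table size
print([linear_probe(2, k, n) for k in range(4)])          # [2, 3, 4, 5]
print([quadratic_probe(2, k, n) for k in range(4)])       # [2, 3, 6, 0]
print([double_hash_probe(2, 3, k, n) for k in range(4)])  # [2, 5, 8, 0]
```

Linear probing visits neighboring cells, while the other two schemes spread successive attempts further apart.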
Deletion in Closed Hash Tables
Simply removing an association can break "chains" formed after collisions and make it difficult to perform GET operations on associations that collided with now-deleted associations.
Therefore, it is common to use a "delete bit" to mark an association as deleted.
When the table gets too full of associations and deleted associations, rehashing is necessary. The rehashing may use a table of the same size as or smaller than before (if many of the entries are deleted entries), or it may use a larger table.
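One way to realize the "delete bit" is a special marker value stored in the cell. The sketch below assumes linear probing; the names ProbingTable, EMPTY, and DELETED are hypothetical, and rehashing is left out.

```python
EMPTY = None
DELETED = object()   # marker playing the role of the "delete bit"


class ProbingTable:
    """Closed hash table with linear probing and deletion markers."""

    def __init__(self, n=10, hash_fn=hash):
        self.n = n
        self.hash_fn = hash_fn
        self.slots = [EMPTY] * n

    def _probe(self, key):
        start = self.hash_fn(key) % self.n
        for k in range(self.n):
            yield (start + k) % self.n

    def _find(self, key):
        # Skip over DELETED cells; only a truly EMPTY cell ends the search.
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                break
            if slot is not DELETED and slot[0] == key:
                return i
        raise KeyError(key)

    def get(self, key):
        return self.slots[self._find(key)][1]

    def delete(self, key):
        self.slots[self._find(key)] = DELETED   # mark, do not empty

    def put(self, key, value):
        first_free = None
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY or slot is DELETED:
                if first_free is None:
                    first_free = i              # remember a reusable cell
                if slot is EMPTY:
                    break                       # key cannot be further along
            elif slot[0] == key:
                self.slots[i] = (key, value)    # update in place
                return
        if first_free is None:
            raise RuntimeError("table full: rehashing needed")
        self.slots[first_free] = (key, value)
```

If delete simply reset the cell to EMPTY, a later get for a key that had probed past that cell would stop too early and report the key as missing.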