Hash Tables CSIT 402 Data Structures II
Hashing Goal Perform inserts, deletes, and finds in constant average time Topics Hash table, hash function, collisions Collision handling Separate chaining Open addressing: linear probing, quadratic probing, double hashing Rehashing Load factor
Goal Develop a structure that will allow user to insert/delete/find records in constant average time –structure will be a table (relatively small) –table completely contained in memory –implemented by an array –capitalizes on ability to access any element of the array in constant time
Hash Table Applications Symbol table in compilers Accessing tree or graph nodes by name –E.g., city names in Google maps Maintaining a transposition table in games –Remember previous game situations and the move taken (avoid re- computation) Dictionary lookups –Spelling checkers –Natural language understanding (word sense) Heavily used in text processing languages –E.g., Perl, Python, etc.
Hash Function Assume table (array) size is N Function f(x) maps any key x to an int between 0 and N−1 For example, assume that N=15, that key x is a non-negative int between 0 and MAX_INT, and hash function f(x) = x % 15
Hash Function Thus, since f(x) = x % 15, if x = f(x) = Storing the keys in the array is not a problem. Array: _ _ 47 _ _ _ _ _ _ _
Hash Function What happens when you try to insert: x = 65 ? x = 65 f(x) = 5 Array: _ _ 47 _ _ _ _ _ _ _ 65(?)
Hash Function What happens when you try to insert: x = 65 ? x 65 f(x) 5 Array: (?) This is called a collision.
Handling Collisions Separate Chaining Open Addressing –Linear Probing –Quadratic Probing –Double Hashing
Handling Collisions Separate Chaining
Let each array element be the head of a chain. Array: 35 Where would you store: 29, 16, 14, 99, 127 ?
Separate Chaining Let each array element be the head of a chain: Array: Where would you store: 29, 16, 14, 99, 127 ? Note that we use the insertAtHead() method when inserting new keys.
Handling Collisions Linear Probing
Let key x be stored in element f(x)=t of the array Array: (?) What do you do in case of a collision? If the hash table is not full, attempt to store key in array elements (t+1)%N, (t+2)%N, (t+3)%N … until you find an empty slot.
Linear Probing Where do you store 65 ? Array: attempts
Linear Probing If the hash table is not full, attempt to store key in array elements (t+1)%N, (t+2)%N, … Array: Where would you store: 29, 16, 14, 99, 127 ?
Linear Probing If the hash table is not full, attempt to store key in array elements (t+1)%N, (t+2)%N, … Array: Where would you store: 16, 14, 99, 127 ?
Linear Probing If the hash table is not full, attempt to store key in array elements (t+1)%N, (t+2)%N, … Array: attempt Where would you store: 14, 99, 127 ?
Linear Probing If the hash table is not full, attempt to store key in array elements (t+1)%N, (t+2)%N, … Array: attempts Where would you store: 99, 127 ?
Linear Probing If the hash table is not full, attempt to store key in array elements (t+1)%N, (t+2)%N, … Array: attempts Where would you store: 127 ?
Linear Probing Leads to problem of clustering. Elements tend to cluster in dense intervals in the array.
Drawbacks of Linear Probing Works until array is full, but as number of items N approaches TableSize, access time approaches O(N) Very prone to cluster formation (as in our example) –If a key hashes anywhere into a cluster, finding a free cell involves going through the entire cluster – and making it grow! –Primary clustering – clusters grow when keys hash to values close to each other Can have cases where table is empty except for a few clusters –Does not satisfy good hash function criterion of distributing keys uniformly
Handling Collisions Quadratic Probing
Let key x be stored in element f(x)=t of the array Array: (?) What do you do in case of a collision? If the hash table is not full, attempt to store key in array elements (t+1 2 )%N, (t+2 2 )%N, (t+3 2 )%N … until you find an empty slot.
Quadratic Probing Where do you store 65 ? f(65)=t=5 Array: t t+1 t+4 t+9 attempts
Quadratic Probing If the hash table is not full, attempt to store key in array elements (t+1 2 )%N, (t+2 2 )%N … Array: t+1 t attempts Where would you store: 29, 16, 14, 99, 127 ?
Quadratic Probing If the hash table is not full, attempt to store key in array elements (t+1 2 )%N, (t+2 2 )%N … Array: t attempts Where would you store: 16, 14, 99, 127 ?
Quadratic Probing If the hash table is not full, attempt to store key in array elements (t+1 2 )%N, (t+2 2 )%N … Array: t+1 t+4 t attempts Where would you store: 14, 99, 127 ?
Quadratic Probing If the hash table is not full, attempt to store key in array elements (t+1 2 )%N, (t+2 2 )%N … Array: t t+1 t+4 attempts Where would you store: 99, 127 ?
Quadratic Probing If the hash table is not full, attempt to store key in array elements (t+1 2 )%N, (t+2 2 )%N … Array: t attempts Where would you store: 127 ?
Quadratic Probing Tends to distribute keys better Alleviates problem of clustering
Handling Collisions Double Hashing
Let key x be stored in element f(x)=t of the array Array: (?) What do you do in case of a collision? Define a second hash function f 2 (x)=d. Attempt to store key in array elements (t+d)%N, (t+2d)%N, (t+3d)%N … until you find an empty slot.
Double Hashing Collision Resolution Strategy –Apply if hash function produces collision Typical second hash function f 2 (x)=R − ( x % R ) where R is a prime number, R < N
Double Hashing Where do you store 65 ? f(65)=t=5 Collision at 5 means apply double hashing Let f 2 (x)= 11 − (x % 11) f 2 (65)=d=1 Note: R=11, N=15 Attempt to store key in array elements (t+d)%N, (t+2d)%N, (t+3d)%N … Array: t t=(t+2)%15 attempts t=(t+d)%15 =(t+1)%15
Double Hashing If the hash table is not full, attempt to store key in array elements (t+d)%N, (t+2d)%N … f 2 (x)= 11 − (x % 11) f 2 (29)=d=4 But f(29)= 14 => No collision => No f 2 this time Array: t attempt Where would you store: 29, 16, 14, 99, 127 ?
Double Hashing If the hash table is not full, attempt to store key in array elements (t+d)%N, (t+2d)%N … Let f 2 (x)= 11 − (x % 11) f 2 (16)=d=6 But f(16)= 1 => No collision Array: t attempt Where would you store: 16, 14, 99, 127 ?
Double Hashing If the hash table is not full, attempt to store key in array elements (t+d)%N, (t+2d)%N … Let f 2 (x)= 11 − (x % 11) f 2 (14)=d=8 f(14)= 14 => Collision Array: t+16=(14+16)%15 t+8=(14+8)%15 t attempts Where would you store: 14, 99, 127 ? Initially hashes to 14%15=14
Double Hashing If the hash table is not full, attempt to store key in array elements (t+d)%N, (t+2d)%N … Let f 2 (x)= 11 − (x % 11) f 2 (99)=d=11 Array: (t+22)%15 (t+11)%15 t (t+33)%15 attempts Where would you store: 99, 127 ? f(99)= 9 => No collision First application of 2ndary hash function
Double Hashing If the hash table is not full, attempt to store key in array elements (t+d)%N, (t+2d)%N … Let f 2 (x)= 11 − (x % 11) f 2 (127)=d=5 Array: (t+10)%15 t (t+5)%15 attempts Where would you store: 127 ?
Delete Element Collision Ramifications Chaining Linear Probing Quadratic Probing Double Hashing
Chaining/Buckets Deletion from Linked List No Issues Linear Probing Issue: Removal of element within cluster Deletion Issues
[ 0 ][ 1 ][ 2 ][ 3 ][ 4 ][ 5 ] [ 700] Number Number Number Number Number The location where the element was must not be left as an ordinary "empty spot" since that could interfere with searches and deletions (why not insertions?). The location must be marked in some special way so that a search can tell that the spot used to have something in it.
Deletion Issues General Solutions Each slot can be marked as empty, deleted, or occupied Cascade following elements one slot back, according to the collision handling scheme Remove all successor elements and reinsert Second table of deleted items. Commonly used in search engines. Rebuild hash table Move element at end of cluster to fill slot Some of these are not foolproof!
Searching For a Key Elephant in the room If there’s a collision, how do you know when you found your item, since all mapped to the same slot? The key is assumed to be a unique value Keep in mind: The hash function maps sparse data into an array or similar It is applied to the key
Hash Tables in C++ STL STL has hash table containers –hash_set –hash_map
Hash Set in STL #include struct eqstr { bool operator()(const char* s1, const char* s2) const { return strcmp(s1, s2) == 0; } }; void lookup(const hash_set, eqstr>& Set, const char* word) { hash_set, eqstr>::const_iterator it = Set.find(word); cout << word << ": " << (it != Set.end() ? "present" : "not present") << endl; } int main() { hash_set, eqstr> Set; Set.insert("kiwi"); lookup(Set, “kiwi"); } KeyHash fn Key equality test
Hash Map in STL #include struct eqstr { bool operator() (const char* s1, const char* s2) const { return strcmp(s1, s2) == 0; } }; int main() { hash_map, eqstr> months; months["january"] = 31; months["february"] = 28; … months["december"] = 31; cout " << months[“january"] << endl; } KeyDataHash fn Key equality test Internally treated like insert (or overwrite if key already present)
12/26/03Hashing - Lecture 1049 Simple Hashes It's possible to have very simple hash functions if you are certain of your keys For example, –suppose we know that the keys s will be real numbers uniformly distributed over 0 s < 1 –Then a very fast, very good hash function is hash(s) = floor(s·m) where m is the size of the table
12/26/03Hashing - Lecture 1050 Nonnumerical Keys Many hash functions assume that the universe of keys is the natural numbers N={0,1,…} Need to find a function to convert the actual key to a natural number quickly and effectively before or during the hash calculation Generally work with the ASCII character codes when converting strings to numbers
12/26/03Hashing - Lecture 1051 Load Factor of a Hash Table Let N = number of items to be stored Load factor = N/TableSize –TableSize = 101 and N =505, then = 5 –TableSize = 101 and N = 10, then = 0.1 Average length of chained list = and so average time for accessing an item = O(1) + O( ) –Want to be smaller than 1 but close to 1 if good hashing function (i.e. TableSize N) –With chaining hashing continues to work for > 1
Conclusions Best to choose N=prime number Issue of Load Factor = fraction of hash table occupied –should rehash when load factor between 0.5 and 0.7 Rehashing –approximately double the size of hash table (select N=nearest prime) –redefine hash function(s) –rehash keys into new table