Hash Tables: Buckets/Chaining (CS 261 – Data Structures)
Hash Tables
Hash tables are similar to Vectors, except:
- Elements can be indexed by values other than integers
- A single position may hold more than one element
Arbitrary values (hash keys) map to integers by means of a hash function.
Computing a hash function is usually a two-step process:
1. Transform the value (or key) to an integer
2. Map that integer to a valid hash table index
Example: storing names
- Compute an integer from a name
- Map the integer to an index in a table (i.e., a vector, array, etc.)
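The two-step process above can be sketched in C as follows. This is a minimal illustration, not the course's hash function: the names `hashString` and `hashIndex` and the multiply-by-31 scheme are assumptions chosen for the example.

```c
#include <assert.h>

/* Step 1 (hypothetical helper): transform a string key into an integer.
   Uses a multiply-by-31 polynomial; unsigned arithmetic wraps around
   instead of overflowing. */
unsigned long hashString(const char *key) {
    unsigned long h = 0;
    while (*key != '\0') {
        h = h * 31 + (unsigned char) *key;
        key++;
    }
    return h;
}

/* Step 2: map that integer to a valid table index in [0, tableSize). */
int hashIndex(const char *key, int tableSize) {
    assert(tableSize > 0);
    return (int) (hashString(key) % (unsigned long) tableSize);
}
```

Calling `hashIndex("Angie", 5)` yields some index from 0 to 4; which names share an index depends entirely on the hash function chosen.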
Hash Tables
Say we’re storing names: Angie, Joe, Abigail, Linda, Mark, Max, Robert, John. The hash function maps each name to a table index:
Index  Names
0      Angie, Robert
1      Linda
2      Joe, Max, John
3      (empty)
4      Abigail, Mark
Hash Tables: Resolving Collisions
There are three general approaches to resolving collisions:
- Open address hashing: if a spot is full, probe for the next empty spot
- Chaining (or buckets): keep a collection at each table entry
- Caching: save the most recently accessed values, with a slower search otherwise
Today we will examine Chaining/Buckets
Resolving Collisions: Chaining / Buckets
Maintain a collection (i.e., a Bag ADT) at each table entry:
- Chaining/buckets: maintain a linked list (or other collection-type data structure, such as an AVL tree) at each table entry
- Overflow: keep a separate overflow area that is linked (maintains a link to the next free position)
Chaining example:
Index  Chain
0      Angie, Robert
1      Linda
2      Joe, Max, John
3      (empty)
4      Abigail, Mark
[Figure: the same names stored in a table with a linked overflow area and a "Next free" pointer]
Hash Table Implementation: Initialization
    #include <assert.h>
    #include <stdlib.h>

    struct HashTable {
        struct List **table;  /* Hash table: array of pointers to Lists. */
        int cnt;              /* Number of elements stored. */
        int size;             /* Number of buckets. */
    };

    void initHashTable(struct HashTable *ht, int size) {
        int i;
        ht->size = size;
        ht->cnt = 0;
        ht->table = (struct List **) malloc(size * sizeof(struct List *));
        assert(ht->table != 0);
        for (i = 0; i < size; i++)
            ht->table[i] = newList();  /* Start every bucket as an empty list. */
    }
Hash Table Implementation: Add
    void addHashTable(struct HashTable *ht, TYPE val) {
        /* Compute hash table bucket index. */
        int idx = HASH(val) % ht->size;
        if (idx < 0) idx += ht->size;  /* % can be negative for negative hashes. */
        /* Add to bucket. */
        addList(ht->table[idx], val);
        ht->cnt++;
        /* Next step: reorganize if load factor too large. */
    }
Hash Table: Contains & Remove
Both just use linked list functions on the correct bucket:
- Contains: find the correct bucket, then see if the element is there
- Remove: slightly trickier, because you want to decrement the count only if the element is actually in the list
Alternative: instead of keeping a count in the hash table, call count on each list. What are the pros and cons of this?
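A minimal sketch of contains and remove, assuming integer elements and a hand-rolled singly linked chain in place of the course's List type. Every name here (`struct Link`, `struct IntHashTable`, `bucketIndex`, and the four functions) is hypothetical, and the value serves as its own hash.

```c
#include <assert.h>
#include <stdlib.h>

struct Link {
    int value;
    struct Link *next;
};

struct IntHashTable {
    struct Link **table;  /* One singly linked chain per bucket. */
    int cnt;              /* Number of stored elements. */
    int size;             /* Number of buckets. */
};

int bucketIndex(const struct IntHashTable *ht, int val) {
    int idx = val % ht->size;
    if (idx < 0) idx += ht->size;
    return idx;
}

void initIntHashTable(struct IntHashTable *ht, int size) {
    int i;
    assert(size > 0);
    ht->table = (struct Link **) malloc(size * sizeof(struct Link *));
    assert(ht->table != 0);
    for (i = 0; i < size; i++) ht->table[i] = 0;
    ht->size = size;
    ht->cnt = 0;
}

void addIntHashTable(struct IntHashTable *ht, int val) {
    int idx = bucketIndex(ht, val);
    struct Link *lnk = (struct Link *) malloc(sizeof(struct Link));
    assert(lnk != 0);
    lnk->value = val;
    lnk->next = ht->table[idx];  /* Push onto the front of the chain. */
    ht->table[idx] = lnk;
    ht->cnt++;
}

/* Search only the one bucket the value hashes to. */
int containsIntHashTable(const struct IntHashTable *ht, int val) {
    struct Link *cur;
    for (cur = ht->table[bucketIndex(ht, val)]; cur != 0; cur = cur->next)
        if (cur->value == val) return 1;
    return 0;
}

/* Remove one occurrence. The count is decremented only when a link is
   actually unlinked. Returns 1 if something was removed, 0 otherwise. */
int removeIntHashTable(struct IntHashTable *ht, int val) {
    struct Link **pp = &ht->table[bucketIndex(ht, val)];
    while (*pp != 0) {
        if ((*pp)->value == val) {
            struct Link *dead = *pp;
            *pp = dead->next;
            free(dead);
            ht->cnt--;
            return 1;
        }
        pp = &(*pp)->next;
    }
    return 0;
}
```

Returning a flag from the remove function is one way to let the caller know whether the count changed; the course's list functions may expose this differently.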
Hash Table Size
Load factor: λ = n / m, where n is the number of elements and m is the size of the table
- The load factor represents the average number of elements at each table entry
- For chaining, the load factor can be greater than 1
- Want the load factor to remain small
- Same as open address hashing: if the load factor becomes larger than some fixed limit (say, 8), double the table size
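The resize rule above might look like the following sketch. It assumes integer elements that act as their own hash, and all names (`struct Node`, `struct Table`, `indexFor`, `MAX_LOAD`, and the functions) are hypothetical. Doubling changes each element's index, so every node has to be relinked into its new bucket.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_LOAD 8.0  /* The fixed limit suggested in the text. */

struct Node {
    int value;
    struct Node *next;
};

struct Table {
    struct Node **buckets;
    int cnt;   /* n: number of elements. */
    int size;  /* m: number of buckets.  */
};

int indexFor(int val, int size) {
    int idx = val % size;
    if (idx < 0) idx += size;
    return idx;
}

void initTable(struct Table *t, int size) {
    int i;
    assert(size > 0);
    t->buckets = (struct Node **) malloc(size * sizeof(struct Node *));
    assert(t->buckets != 0);
    for (i = 0; i < size; i++) t->buckets[i] = 0;
    t->size = size;
    t->cnt = 0;
}

double loadFactor(const struct Table *t) {
    return (double) t->cnt / t->size;
}

/* Double the number of buckets and relink every node into its new bucket. */
void resizeTable(struct Table *t) {
    int i, newSize = 2 * t->size;
    struct Node **old = t->buckets;
    struct Node **fresh = (struct Node **) malloc(newSize * sizeof(struct Node *));
    assert(fresh != 0);
    for (i = 0; i < newSize; i++) fresh[i] = 0;
    for (i = 0; i < t->size; i++) {
        struct Node *cur = old[i];
        while (cur != 0) {
            struct Node *next = cur->next;
            int idx = indexFor(cur->value, newSize);
            cur->next = fresh[idx];
            fresh[idx] = cur;
            cur = next;
        }
    }
    free(old);
    t->buckets = fresh;
    t->size = newSize;
}

void addTable(struct Table *t, int val) {
    int idx = indexFor(val, t->size);
    struct Node *n = (struct Node *) malloc(sizeof(struct Node));
    assert(n != 0);
    n->value = val;
    n->next = t->buckets[idx];
    t->buckets[idx] = n;
    t->cnt++;
    if (loadFactor(t) > MAX_LOAD)  /* Reorganize once the limit is exceeded. */
        resizeTable(t);
}
```

Starting from 2 buckets, the table in this sketch stays at size 2 through 16 elements (load factor exactly 8) and doubles to 4 when the 17th is added.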
Hash Tables: Algorithmic Complexity
Assumptions:
- Time to compute the hash function is constant
- Chaining uses a linked list
Worst case analysis: all values hash to the same position
Best case analysis: the hash function uniformly distributes the values (all buckets have the same number of objects in them)
Find element operation:
- Worst case for open addressing: O(n)
- Worst case for chaining: O(n) (O(log n) if the buckets are AVL trees)
- Best case for open addressing: O(1)
- Best case for chaining: O(1)
Hash Tables: Average Case
Assume the hash function distributes elements uniformly (a BIG if)
- Average case for all operations: O(λ), where λ = n / m is the load factor (number of elements divided by table size)
- Want to keep the load factor relatively small
- Resize the table (doubling its size) if the load factor is larger than some fixed limit (e.g., 8)
- Resizing only improves things IF the hash function distributes values uniformly
- What happens if the hash value is always zero?
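To explore the closing question, the sketch below counts how long the longest chain would be for a given hash function. It assumes integer keys; `hashIdentity`, `hashZero`, and `longestChain` are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Two hypothetical hash functions for integer keys. */
int hashIdentity(int key) { return key; }
int hashZero(int key) { (void) key; return 0; }

/* Length of the longest chain if keys[0..n-1] were stored in m buckets. */
int longestChain(const int keys[], int n, int m, int (*hash)(int)) {
    int i, idx, longest = 0;
    int *len = (int *) calloc(m, sizeof(int));  /* Chain length per bucket. */
    assert(len != 0);
    for (i = 0; i < n; i++) {
        idx = hash(keys[i]) % m;
        if (idx < 0) idx += m;
        len[idx]++;
        if (len[idx] > longest) longest = len[idx];
    }
    free(len);
    return longest;
}
```

With the keys 0 through 63, `hashIdentity` gives a longest chain of 8 in 8 buckets and 4 in 16 buckets, so doubling helps. With `hashZero` the longest chain is 64 in both cases: every key lands in bucket 0, and resizing changes nothing.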
When should you use hash tables?
- Data values must have good hash functions defined (e.g., string, double), or you must write your own hash function
- Need to know that values are uniformly distributed
- Otherwise, a skip list or AVL tree is often faster
Your Turn
Worksheet 25: Hash Tables using Buckets
- Use a linked list for buckets
- Keep track of the number of elements
- Resize the table if the load factor is bigger than 8
Questions?
Hash Tables: Hash-like Sorting
Hash Tables: Sorting
Can create very fast sort programs using hash tables
Unfortunately, these sorts are not general purpose: they only work with positive integer values (or other data that is readily mapped into positive integer values)
Examples:
- Counting sort
- Radix sort
Hash Table Sorting: Counting Sort
Quickly sort non-negative integer values from a limited range:
1. Count (tally) the occurrences of each value
2. Recreate the sorted values according to the tally
Example: sort 1,000 integer elements with values between 0 and 19
Count (tally) the occurrences of each value (value: count):
 0: 47     4: 32     8: 41    12: 43    16: 12
 1: 92     5: 114    9: 3     13: 17    17: 15
 2: 12     6: 16    10: 36    14: 132   18: 63
 3: 14     7: 37    11: 92    15: 93    19: 89
Recreate the sorted values according to the tally: 47 zeros, 92 ones, 12 twos, …
Counting Sort: Implementation
    #include <assert.h>
    #include <stdlib.h>

    /* Sort an array of integers, each element between 0 and max. */
    void countSort(int data[], int n, int max) {
        int i, j, k;
        /* One zero-initialized counter per possible value. */
        int *cnt = (int *) calloc(max + 1, sizeof(int));
        assert(cnt != 0);
        for (i = 0; i < n; i++)   /* Count the occurrences */
            cnt[data[i]]++;       /* of each value. */
        /* cnt holds the number of occurrences of numbers from 0 to max. */
        i = 0;                    /* Now put values */
        for (j = 0; j <= max; j++)  /* back into the array. */
            for (k = cnt[j]; k > 0; k--)
                data[i++] = j;
        free(cnt);                /* Release the tally array. */
    }
Radix Sort
- Another specialized sorting algorithm
- Has historical ties to punch cards
Sorting Punch Cards
- It was far too easy to drop a tray of cards, which could be a disaster
- The convention became to put a sequence number on each card, typically in positions 72-80
- The deck could then be rebuilt by sorting on these positions
- A machine called a sorter was used to re-sort the cards
Mechanical Sorter: Sorts a Single Column
Mechanical Sorter
1. First sort on column 80
2. Then collect the piles, keeping them in order, and sort on column 79
3. Repeat for each of the columns down to 72
At the end, the result is completely sorted. Try it.
Hash Table Sorting: Radix Sort
Sorts positive integer values over any range
- Hash table size of 10 (0 through 9)
- Values are hashed according to their least significant digit (the “ones” digit)
- Values are then rehashed according to the next significant digit (the tens digit) while keeping their relative ordering
- The process is repeated until we run out of digits
Can also sort by hashing on:
- Characters in a string: table size of 26 (‘A’ through ‘Z’)
- Bytes in an integer: table size of 256 (as opposed to 10 above)
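The passes described above can be sketched in C as follows. This is one possible arrangement, not the worksheet's solution: the name `radixSort` and the flat `buckets` array (ten regions of n slots each) are assumptions. Appending to a bucket and then collecting buckets 0 through 9 in order preserves the relative ordering from the previous pass.

```c
#include <assert.h>
#include <stdlib.h>

#define RADIX 10  /* Table size: one bucket per decimal digit. */

/* Sort n non-negative integers in place, least significant digit first. */
void radixSort(int data[], int n) {
    int i, b, k, max = 0;
    long long divisor;
    int count[RADIX];  /* Number of values currently in each bucket. */
    int *buckets;      /* Bucket b occupies slots b*n .. b*n + n - 1. */

    if (n <= 1) return;
    buckets = (int *) malloc((size_t) RADIX * n * sizeof(int));
    assert(buckets != 0);

    for (i = 0; i < n; i++) {
        assert(data[i] >= 0);
        if (data[i] > max) max = data[i];
    }

    /* One pass per digit of the largest value. */
    for (divisor = 1; max / divisor > 0; divisor *= RADIX) {
        for (b = 0; b < RADIX; b++) count[b] = 0;
        /* Hash each value on its current digit, appending to that bucket. */
        for (i = 0; i < n; i++) {
            b = (int) ((data[i] / divisor) % RADIX);
            buckets[(size_t) b * n + count[b]++] = data[i];
        }
        /* Collect buckets 0..9 in order, keeping the order within each. */
        k = 0;
        for (b = 0; b < RADIX; b++)
            for (i = 0; i < count[b]; i++)
                data[k++] = buckets[(size_t) b * n + i];
    }
    free(buckets);
}
```

Reserving n slots for every bucket costs extra memory but keeps the sketch short; linked lists per bucket, as in the chaining table, would avoid that.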
Radix Sort: Example
Data: 624 762 852 426 197 987 269 146 415 301 730 78 593
Bucket  Pass 1      Pass 2      Pass 3
0       730         301         78
1       301         415         146, 197
2       762, 852    624, 426    269
3       593         730         301
4       624         146         415, 426
5       415         852         593
6       426, 146    762, 269    624
7       197, 987    78          730, 762
8       78          987         852
9       269         593, 197    987
Your Turn Worksheet 26: Radix Sorting Questions