Joe Meehean

BST
- easy to implement
- average-case times of O(log N)
- worst-case times of O(N)
AVL trees
- harder to implement
- worst-case times of O(log N)
Can we do better in the average case?

"Dictionary" ADT
- average-case time O(1) for lookup, insert, and delete
Idea
- store keys (and associated values) in an array
- compute each key's array index as a function of its value
- take advantage of the array's fast random access
An alternative implementation for sets and maps

Goal
- store info about a company's 50 employees
- each employee has a unique employee ID in the range 100-200
Approach
- use an array of size 101 (the range of IDs)
- store employee E's info in array[E - 100]
Result
- insert, lookup, and delete are each O(1)
- wasted space: 51 locations

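A minimal sketch of this direct-address scheme, in the C++ style of the later slides (Employee and EmployeeTable are illustrative names, not from the lecture):

    #include <string>

    struct Employee { int id; std::string name; };

    // Direct-address table: IDs 100-200 map to indexes 0-100 via id - 100.
    class EmployeeTable {
        Employee slots[101];        // one cell per possible ID
        bool     used[101] = {};    // which cells hold a real employee
    public:
        void insert(const Employee& e) {                     // O(1)
            slots[e.id - 100] = e;
            used[e.id - 100] = true;
        }
        void remove(int id) { used[id - 100] = false; }      // O(1)
        const Employee* lookup(int id) const {               // O(1)
            return used[id - 100] ? &slots[id - 100] : nullptr;
        }
    };
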
Hash tables offer less functionality than trees; they cannot efficiently
- find the min
- find the max
- print the entire table in sorted order
We must be very careful how we use them.

Hash table: the underlying array
Hash function: converts a key to an index into the array
- in the example: hash(x) = x - 100
TableSize: the size of the underlying array or vector
Bucket: a single cell of the hash table array
Collision: when two keys hash to the same bucket

We assume the keys we are using have a hash function, or that we can define good hash functions for them.
We also assume keys overload the operators == and !=.

- How do we make a good hash function?
- What should we do about collisions?
- How large should we make our hash table?

A hash function should be fast.
Keys should be evenly distributed
- different keys should have different hash values
It should reduce the space needed
- e.g., student IDs are 10 digits, but we do not need an array of size 10,000,000,000 when there are only ~3,000 students

Convert the key to an int n
- scramble up the data
- ensure the data spreads over the entire integer space
Return n % TableSize
- ensures that n doesn't fall off the end of the table

Method 1
- convert each char to an int
- sum them
- return sum % TableSize
Advantages
- simple
- time is O(key length)

Method 1
- convert each char to an int, sum them, return sum % TableSize
Problems
- short keys may not reach the end of the table: the sum of the characters can be far less than TableSize
- maps all permutations to the same hash: hash("able") = hash("bale")
- time is O(key length)

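A sketch of Method 1 (illustrative code, not the lecture's):

    #include <string>

    // Method 1: sum the character codes, then reduce mod TableSize.
    int hash1(const std::string& key, int tableSize) {
        int sum = 0;
        for (char c : key)
            sum += static_cast<unsigned char>(c);
        return sum % tableSize;   // note: hash1("able") == hash1("bale")
    }
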
Method 2
- multiply individual chars by different values, then sum:
  a[0]·37^n + a[1]·37^(n-1) + … + a[n-1]·37, i.e., the sum of the terms a[i]·37^(n-i)
Advantages
- produces a big range of values
- permutations hash to different values

Method 2
- multiply individual chars by different values, then sum
Disadvantages
- relies on integer overflow
- need to worry about negative hashes
Handling a negative hash:
    hash = hash % TableSize
    if (hash < 0)
        hash += TableSize

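A sketch of Method 2 with the negative-hash fix (illustrative; note that overflowing a signed int is technically undefined behavior in C++, so production code would accumulate in an unsigned type):

    #include <string>

    // Method 2: weight each character by a power of 37 (computed
    // Horner-style), let the int overflow, then repair a negative result.
    int hash2_string(const std::string& key, int tableSize) {
        int h = 0;
        for (char c : key)
            h = 37 * h + static_cast<unsigned char>(c);  // may go negative
        h = h % tableSize;
        if (h < 0)                                       // the slide's fix
            h += tableSize;
        return h;
    }
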
Fast hash vs. evenly distributed hash
- a faster hash often leads to a less evenly distributed one
- even distribution leads to slower hashing
String example
- could use only some of the characters: faster, but more collisions are likely

- How do we make a good hash function?
- What should we do about collisions?
- How large should we make our hash table?

What if two keys hash to the same bucket (array entry)?
Make the array entries linked lists (or trees)
- different keys with the same hash value are stored in the same list (or tree)
- commonly called chained bucket hashing, or just chaining

TableSize = 10
Keys: 10-digit student IDs
hashfn = sum of digits % TableSize

ID (Key)     Value  Sum  Hash Code
9014638161   A      39   9
9103287648   B      48   8
4757414352   C      42   2
8377690440   D      48   8
9031397831   E      44   4

Placing the keys into the table (chained buckets):
- bucket 2 → C
- bucket 4 → E
- bucket 8 → B, D (a collision: both sum to 48)
- bucket 9 → A
- all other buckets are empty

During a lookup, how can we tell which value we want if there are > 1 entries in the bucket?
- compare the keys: buckets store keys and values

- How do we make a good hash function?
- What should we do about collisions?
- How large should we make our hash table?

Table size is related to the hashing function.
Some hashing functions lead to data clustered together; using a prime TableSize helps resolve this issue
- the hashing function is not likely to share a factor with the table size

If the number of keys is known in advance
- make the hash table a little larger: a prime near 1.25 × the number of keys
- a little room to avoid collisions: trades space for potentially faster lookups
If the number of keys is not known in advance
- plan to expand the array as needed (coming up in another lecture)

Lookup key k
1. compute h = hash(k)
2. see if k is in the list in hashtable[h]
Insert key k
1. compute h = hash(k)
2. make sure k is not already in hashtable[h]
3. add k to the list in hashtable[h]
Delete key k
1. compute h = hash(k)
2. remove k from the list in hashtable[h]

    template <class K, class Hash>
    class HashSet {
      private:
        vector< list<K> > table;
        int currentSize;
        Hash hashfn;
      public:
        ...
        bool contains(const K&) const;
        void insert(const K&);
        void remove(const K&);
    };

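A sketch of how contains might be implemented for this chained design (my code, not the lecture's; it assumes the usual std includes and that hashfn returns a non-negative value):

    template <class K, class Hash>
    bool HashSet<K, Hash>::contains(const K& key) const {
        int h = hashfn(key) % table.size();   // pick the bucket
        for (const K& item : table[h])        // scan its chain
            if (item == key)                  // uses K's operator==
                return true;
        return false;
    }
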
Recall chaining hash tables
- array cells store linked lists
- 2 keys with the same hash end up in the same list
Chaining hash tables require 2 data structures: the hash table and the linked lists
Can we solve collisions with more hashing, using just one data structure?

No linked lists in the array cells.
Collisions are handled using alternative hashes
- try cells h0(x), h1(x), h2(x), … until an empty cell is found
- hi(x) = hash(x) + f(i), where f(i) is the collision resolution strategy
Probing: looking in alternative hash locations

f(i) is a linear function, often f(i) = i
If a collision occurs, look in the next cell: hash(x) + 1
- keep looking until an empty cell is found: hash(x) + 2, hash(x) + 3, …
- use the modulus to wrap around the table
Should eventually find an empty cell, if the table is not full

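A sketch of the probe loop (illustrative: the table stores ints, 0 marks an empty cell, and the table is assumed not full):

    #include <vector>

    // Linear probing: try hash(x), hash(x)+1, hash(x)+2, ... mod TableSize
    // until an empty cell (or the key itself) is found.
    int probeLinear(const std::vector<int>& table, int key) {
        int tableSize = table.size();
        int h = key % tableSize;                   // h0(x)
        while (table[h] != 0 && table[h] != key)
            h = (h + 1) % tableSize;               // f(i) = i, wrapping
        return h;
    }
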
Example: simple hash h(x) = x % TableSize, TableSize = 10
- Insert 89: h0(89) = 9 is empty, so 89 goes in cell 9.
- Insert 18: h0(18) = 8 is empty, so 18 goes in cell 8.
- Insert 49: h0(49) = 9 collides with 89; h1(49) = 0 (wrapping around) is empty, so 49 goes in cell 0.
- Insert 58: h0(58) = 8 collides with 18; h1 = 9 collides with 89; h2 = 0 collides with 49; h3(58) = 1 is empty, so 58 goes in cell 1.

Advantages
- no need for lists
- the collision resolution function is fast
Disadvantages
- requires more bookkeeping
- primary clustering

What if an entry is deleted and we try to look up another entry that collided with it?
- Delete 89: cell 9 becomes empty.
- Lookup 49: h0(49) = 9 is now empty, so the probe stops and wrongly concludes 49 is not in the table, even though 49 sits in cell 0.

Need extra information per cell to differentiate between states
- ACTIVE: cell contains a valid key
- EMPTY: cell never contained a valid key
- DELETED: cell previously contained a valid key
All cells start EMPTY.
Lookup: keep probing until you find the key or an EMPTY cell.

Example (with per-cell states: A = ACTIVE, E = EMPTY, D = DELETED)
- Delete 89: cell 9 is marked DELETED rather than EMPTY.
- Lookup 49: h0(49) = 9 is DELETED, not EMPTY, so probing continues; h1(49) = 0 finds 49.

Should we insert into deleted cells?

Insert
- must probe to the 1st EMPTY cell, to prevent duplicates
- but should insert into the 1st EMPTY or DELETED cell
- doubles the run time
Special case: insert a key, delete it, then reinsert it
- can insert into the DELETED cell it previously occupied
- the lookup knows the item is not in the table when it finds the deleted entry

    template <class K>
    class HashSet {
      private:
        vector< HashEntry<K> > table;
        int currentSize;
        ...
    };

    template <class K>
    class HashEntry {
      public:
        enum EntryType { ACTIVE, EMPTY, DELETED };
      private:
        K element;
        EntryType info;
        friend class HashSet<K>;
    };

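A sketch of a probing lookup built on these states (my illustration; it assumes linear probing and a hashfn member like the chained version's):

    template <class K>
    bool HashSet<K>::contains(const K& key) const {
        int h = hashfn(key) % table.size();
        while (table[h].info != HashEntry<K>::EMPTY) {   // stop only at EMPTY
            if (table[h].info == HashEntry<K>::ACTIVE
                  && table[h].element == key)
                return true;                             // found a live entry
            h = (h + 1) % table.size();                  // probe past DELETED
        }
        return false;                                    // hit EMPTY: not present
    }
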
No more bucket lists.
Use a collision resolution strategy: hi(x) = hash(x) + f(i)
- if a collision occurs, try the next cell (f(i) = i)
- repeat until you find an empty cell
Needs extra bookkeeping: ACTIVE, EMPTY, DELETED

What could go wrong? How can we fix it?
"Professor Meehean, you haven't told us what 'it' is yet."

Clusters of data
- resolving a collision takes several attempts, and each resolution makes the cluster even bigger
- too many 9's eat up all of 8's space, then the 8's eat up 7's space, etc.
Inserting keys into space that should be empty results in collisions
- clusters have overrun whole chunks of the hash table

Example: the table holds 49 (cell 0), 58 (cell 1), 29 (cell 2), 18 (cell 8), and 89 (cell 9)
- Insert 30: h0(30) = 0 collides with 49; h1 = 1 collides with 58; h2 = 2 collides with 29; h3(30) = 3 is finally empty, so 30 goes in cell 3.
- A key that hashed to a supposedly empty region still took four probes, and the cluster at cells 0-2 has now grown to cover cells 0-3.

This only gets worse as the load factor gets larger: as memory use becomes more efficient, performance gets worse.

Primary clustering is caused by the linear nature of linear probing
- collisions end up right next to each other
What if we jumped farther away on a collision? f(i) = i^2
If a collision occurs, try hash(x) + 1, hash(x) + 4, hash(x) + 9, …

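The same probe loop with f(i) = i^2 (again illustrative, with 0 as the empty marker):

    #include <vector>

    // Quadratic probing: try hash(x)+1, hash(x)+4, hash(x)+9, ... mod TableSize.
    int probeQuadratic(const std::vector<int>& table, int key) {
        int tableSize = table.size();
        int home = key % tableSize;                 // h0(x)
        int h = home;
        for (int i = 1; table[h] != 0 && table[h] != key; ++i)
            h = (home + i * i) % tableSize;         // hi(x) = hash(x) + i^2
        return h;
    }
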
Example: hi(x) = h(x) + i^2, with 18 in cell 8 and 89 in cell 9
- Insert 58: h0(58) = 8 collides with 18.
- h1(58) = h(58) + 1 = 9 collides with 89.
- h2(58) = h(58) + 4 = 12, and 12 % 10 = 2: cell 2 is empty, so 58 goes in cell 2.

Quadratic probing eliminates primary clustering.
Keys with the same hash probe the same alternative cells
- clusters still exist per bucket, just spread out
- called secondary clustering
Can we beat secondary clustering?

If the first hashing function causes a collision, try a second hashing function
hi(x) = hash(x) + f(i), with f(i) = i·hash2(x)
- h0(x) = hash(x)
- h1(x) = hash(x) + hash2(x)
- h2(x) = hash(x) + 2·hash2(x)
- h3(x) = hash(x) + 3·hash2(x)

hash2(x) must be carefully selected.
It can never be 0
- otherwise h1(x) = hash(x) + 1·0, h2(x) = hash(x) + 2·0, and so on: h1(x) = h2(x) = h3(x) = … = hn(x), and the probe never moves
It must eventually probe all cells
- (quadratic probing only probed half of them)
- requires TableSize to be prime

A common choice: hash2(x) = R - (x % R)
- where R is a prime smaller than TableSize
- the previous value of TableSize?

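A sketch of double hashing with this hash2, using R = 7 to match the example that follows (0 again marks an empty cell):

    #include <vector>

    // Double hashing: hi(x) = hash(x) + i*hash2(x), hash2(x) = R - (x % R).
    int probeDouble(const std::vector<int>& table, int key) {
        int tableSize = table.size();
        const int R = 7;                     // prime smaller than TableSize
        int step = R - (key % R);            // hash2(x); never 0
        int h = key % tableSize;             // h0(x)
        while (table[h] != 0 && table[h] != key)
            h = (h + step) % tableSize;      // can cycle forever if TableSize
        return h;                            // shares a factor with step!
    }
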
Example: hi(x) = hash(x) + i·hash2(x), hash2(x) = R - (x % R), R = 7
The table holds 18 (cell 8) and 89 (cell 9).
- Insert 49: h0(49) = 9 collides with 89.
- hash2(49) = 7 - (49 % 7) = 7 - 0 = 7, so h1(49) = 9 + 1·7 = 16, and 16 % 10 = 6: cell 6 is empty, so 49 goes in cell 6.

Why a prime TableSize is important
The table (TableSize = 10) holds 69 (cell 0), 58 (cell 3), 49 (cell 6), 18 (cell 8), and 89 (cell 9).
Insert 23: hash(23) = 3 and hash2(23) = 7 - (23 % 7) = 7 - 2 = 5, so hi(23) = (3 + i·5) % 10
- h0(23) = 3 collides with 58
- h1(23) = (3 + 1·5) % 10 = 8 collides with 18
- h2(23) = (3 + 2·5) % 10 = 3 collides with 58 again
- h3(23) = (3 + 3·5) % 10 = 8 collides with 18 again
The problem: 5 is a factor of 10, so the probe sequence wraps forever, landing on the same two buckets.
If TableSize is prime, the result of hash2(x) can never be a factor of it.

What do we do when the hash table gets too full?
- a problem for both chained and probing hash tables
- degrades performance
- may cause insert failure for quadratic probing

Create another table 2× the size (really, the nearest prime ≥ 2× the table size).
Scan the original table
- compute a new hash for each valid entry
- insert it into the new table

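A sketch of rehashing for the int-based probing table from the earlier sketches (nextPrime is an assumed helper; probeLinear is the earlier probe loop):

    #include <vector>

    int nextPrime(int n);                                     // assumed helper
    int probeLinear(const std::vector<int>& table, int key);  // earlier sketch

    void rehash(std::vector<int>& table) {
        std::vector<int> old = table;               // keep the valid entries
        int newSize = nextPrime(2 * (int)old.size());
        table.assign(newSize, 0);                   // ~2x, prime, all empty
        for (int key : old)
            if (key != 0)                           // re-insert each valid entry
                table[probeLinear(table, key)] = key;  // hash uses the new size
    }
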
Example: quadratic probing, hash(x) = x % TableSize, TableSize = 5
The table holds 58 (cell 3, ACTIVE) and 49 (cell 4, ACTIVE); the rest are EMPTY.
- Insert 23: h0(23) = 3 collides with 58; h1 = (3 + 1) % 5 = 4 collides with 49; h2 = (3 + 4) % 5 = 2 is empty, so 23 goes in cell 2.
Now rehash into a new table of size 11 (the nearest prime ≥ 2 × 5), all cells EMPTY:
- re-insert 23: 23 % 11 = 1 → cell 1
- re-insert 58: 58 % 11 = 3 → cell 3
- re-insert 49: 49 % 11 = 5 → cell 5

Rehashing is O(N).
For initialization or offline (batch) use, the cost is amortized
- there are at least N/2 inserts between rehashes
For interactive use, it can cause periodic unresponsiveness
- the program is snappy for N/2 - 1 operations, then the N/2-th causes a rehash

How do we use C++ hash_maps and hash_sets?
When should we use a map
- backed by a hash table?
- backed by a tree (e.g., BST, B+ tree)?
When should we use a set
- backed by a hash table?
- backed by a tree (e.g., BST, B+ tree)?

unordered_map and unordered_set
- alternative implementations of map and set that use a hash table
- require a hash unary functor
- require an equals predicate functor
- C++11 only

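A small usage sketch (the Point type and its functors are illustrative, not from the lecture):

    #include <unordered_set>

    struct Point { int x, y; };

    struct PointHash {                     // hash unary functor
        size_t operator()(const Point& p) const {
            return std::hash<int>()(p.x) * 37 + std::hash<int>()(p.y);
        }
    };

    struct PointEq {                       // equals predicate functor
        bool operator()(const Point& a, const Point& b) const {
            return a.x == b.x && a.y == b.y;
        }
    };

    int main() {
        std::unordered_set<Point, PointHash, PointEq> points;
        points.insert({1, 2});
        return points.count({1, 2});       // 1: found by hash, then equality
    }
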
Lookup key k
1. compute h = hash(k)
2. see if k is in the list in hashtable[h]
Time for lookup = time for step 1 + time for step 2
Worst case for step 2: all keys hash to the same index
- O(# keys in table) = O(N)

If the hash function distributes keys uniformly
- the probability that hash(k) = h is 1/TableSize for every h in the range 0 to TableSize - 1
Then the probability of a collision is N/TableSize
- if N ≤ TableSize, then p(collision) ≤ 1

The loophole compacts to:
- if the hash function distributes keys uniformly
- AND the subset of keys used distributes uniformly
- AND the # of keys ≤ TableSize
- AND the hash function is O(1)
- then the average time for lookup is O(1)

Insert key k
1. compute h = hash(k)
2. put k in the table at or near table[h]
Complexity
- hash function: should be O(1)
- collision resolution: O(N)
  - chained: must check all keys in the list
  - probing: the probe may hit every other filled cell
Worst case: O(N); loophole average case: O(1)

Delete key k
1. compute h = hash(k)
2. remove k from at or near table[h]
Complexity: same as lookup and insert
- O(N) in the worst case
- O(1) in the loophole average case

Loophole: limited collisions
- O(1) average complexity for lookup, insert, and delete
Worst-case times
- insert: O(N) even with the loophole (rehashing makes this possible)
- lookup, delete: O(N)

Hash tables are an alternative implementation for sets and maps, but…
With a balanced tree, all operations are O(log N)
- safe, middle-of-the-road performance
Hash implementations are a gamble
- potentially O(1) operations
- potentially O(N) operations
Some operations are not efficient
- printing in sorted order
- finding the largest/smallest

Use hashing only when you can be positive there will be a small # of hash-key collisions
- not just a small probability: an actual, worst-case small # of collisions
1. all keys are known in advance, and hashing them doesn't cause a large # of collisions
2. the map/set will always store all of the keys
- no collisions introduced by the modulus
- no key similarities introduced by a selective sample