Data Structures & Algorithms Hash Tables

Dictionary
get(theKey): find if an item with theKey is in dictionary
put(theKey, theElement): add an item to the dictionary
remove(theKey): delete (an item) with theKey from dictionary

Unsorted Array
Example contents: c e d b
get(theKey): O(N) time
put(theKey, theElement): O(N) time to find duplicate, O(1) to add
remove(theKey): O(N) time

Sorted Array
Example contents: b c d e
get(theKey): O(lg N) time
put(theKey, theElement): O(lg N) time to find duplicate, O(N) to add
remove(theKey): O(N) time
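The sorted-array costs above can be illustrated with a minimal sketch. The class name `SortedArrayDict` and the choice to overwrite the element when a duplicate key is put are assumptions made for this example, not something the slides specify.

```python
from bisect import bisect_left

class SortedArrayDict:
    """Dictionary kept as two parallel arrays sorted by key (illustrative sketch)."""

    def __init__(self):
        self._keys = []
        self._elements = []

    def _find(self, key):
        # Binary search: O(lg N) comparisons.
        i = bisect_left(self._keys, key)
        found = i < len(self._keys) and self._keys[i] == key
        return i, found

    def get(self, key):
        i, found = self._find(key)
        return self._elements[i] if found else None

    def put(self, key, element):
        i, found = self._find(key)          # O(lg N) to find a duplicate
        if found:
            self._elements[i] = element     # assumption: overwrite on duplicate
        else:
            self._keys.insert(i, key)       # O(N): later items shift right
            self._elements.insert(i, element)

    def remove(self, key):
        i, found = self._find(key)
        if found:
            del self._keys[i]               # O(N): later items shift left
            del self._elements[i]
        return found

d = SortedArrayDict()
for k in "cedb":
    d.put(k, k.upper())
print(d._keys)
```

The binary search is cheap; it is the shifting inside `insert` and `del` that makes put and remove O(N).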

Unsorted Chain
(figure: linked chain starting at firstNode and ending in null)
get(theKey): O(N) time
put(theKey, theElement): O(N) time to verify duplicate, O(1) to add
remove(theKey): O(N) time

Sorted Chain
Example: firstNode → b → c → d → e → null
get(theKey): O(N) time
put(theKey, theElement): O(N) time to verify duplicate, O(1) to add
remove(theKey): O(N) time

Costs of Insertion and Search
Columns: Worst Case (Insert, Search, Select); Avg Case (Insert, Search Hit, Search Miss)
Rows: Key-Indexed Array, Ordered Array, Ordered Linked List, Unordered Array, Unordered Linked List
N = number of items, M = size of container

Costs of Insertion and Search (continued)
Columns: Worst Case (Insert, Search, Select); Avg Case (Insert, Search Hit, Search Miss)
Rows: Binary Search, Binary Search Tree, Red-Black Tree, Randomized Tree, Hashing
N = number of items, M = size of container

Binary Search Trees
get(theKey): O(N) time worst case, O(lg N) time average
put(theKey, theElement): O(N) worst case, O(lg N) average to find the position, then O(1) to add
remove(theKey): O(N) time worst case, O(lg N) time average

Other BSTs Splay trees –

Skip Lists Skip lists are linked lists… except with extra links

Skip Lists (figure: example skip list holding the keys A A C E E G H IA L M N P R)

Skip List Search Algo Sketch: Start at highest level

Skip List Insert Algo Sketch: Find where new node should go

Skip List Time
Prop 13.10: Search and insertion in a randomized skip list with parameter t take about $(t \log_t N)/2 = (t/(2 \lg t)) \lg N$ comparisons, on the average.
We expect about $\log_t N$ levels, that about t nodes were skipped on the previous level for each link, and that we go through about half the links on each level.

Skip List Space
Prop 13.11: Skip lists with parameter t have about $(t/(t-1))N$ links on the average.
There are N links on the bottom, N/t on the next level, about $N/t^2$ on the next, and so on.
The total number of links is about $N(1 + 1/t + 1/t^2 + \dots) = N/(1 - 1/t) = (t/(t-1))N$

Skip List Tradeoff
Picking the parameter t gives a time/space trade-off.
When t = 2, skip lists need about lg N comparisons and 2N links on average, like the best BST types.
Larger t gives longer search and insert times, but uses less space.
The choice t = e (base of the natural log) minimizes the expected number of comparisons (differentiate the expression in Prop 13.10).
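To make the trade-off concrete, the two estimates from Prop 13.10 and Prop 13.11 can be tabulated for a few values of t. This is only a sketch of the formulas; the function names and the sample N are chosen for illustration.

```python
from math import e, log2

def expected_comparisons(t, n):
    """Prop 13.10 estimate: (t / (2 lg t)) * lg n comparisons."""
    return t / (2 * log2(t)) * log2(n)

def expected_links(t, n):
    """Prop 13.11 estimate: (t / (t - 1)) * n links."""
    return t / (t - 1) * n

n = 1024  # illustrative number of items
for t in (2, e, 3, 8, 16):
    print(f"t={t:6.3f}  comparisons~{expected_comparisons(t, n):6.2f}  "
          f"links~{expected_links(t, n):8.1f}")
```

With t = 2 the estimates give lg N comparisons and 2N links. The comparison count is smallest near t = e, while the link count keeps falling as t grows.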

Skip List Other Functions

Remove, join, and select functions are straightforward extensions.

Hash Tables
Expected time for insert, find, remove is constant, but worst case is O(N)
Idea is to squeeze a big key space into a smaller key space, so all keys fit into a fairly small table
Challenge is to avoid duplicate keys in the smaller key space (collisions) …
… and to deal with them if they occur anyway
So here they are ...

Ideal Hash Tables Uses a 1D array (or table) table[0:b-1].

Ideal Hash Tables Pairs are: (22,a), (33,c), (3,d), (73,e), (85,f).

What Could Go Wrong Where to put (26,g)?

What Could Go Wrong
A collision occurs when two items with different keys have the same home bucket
A bucket may be able to store more than one item...
If bucket is full, then we have an overflow
If buckets are of size 1, then overflows occur on every collision
We must deal with these somehow!
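A small sketch shows how the pair (26,g) can collide with the pairs listed earlier. The hash function is not shown in this excerpt, so the home bucket `key // 11` and the table size of 8 below are assumptions made purely for illustration.

```python
def home_bucket(key, b=8):
    # Assumed hash function for illustration only; not taken from the slides.
    return (key // 11) % b

def place(pairs, b=8):
    """Put each pair in its home bucket (bucket size 1); return table and overflows."""
    table = [None] * b
    overflows = []
    for key, element in pairs:
        i = home_bucket(key, b)
        if table[i] is None:
            table[i] = (key, element)
        else:
            overflows.append(((key, element), table[i]))  # collision, hence overflow
    return table, overflows

pairs = [(22, "a"), (33, "c"), (3, "d"), (73, "e"), (85, "f")]
table, overflows = place(pairs)
print(table, overflows)
table, overflows = place(pairs + [(26, "g")])
print(overflows)
```

Under these assumptions the five original pairs land in distinct buckets, while (26,g) shares home bucket 2 with (22,a) and overflows.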

Hash Table Issues What is size of table?

Hash Function First – convert to integer if not already

Hash Function Let KeySpace be the set of all possible keys

Hash Function
Uniform hash functions make collisions (hence overflows) unlikely when keys are randomly chosen
For any table size b, if the key space is 32-bit integers, k % b will distribute the keys uniformly
In practice, keys tend to be non-uniform and correlated
So we want the hash function to help break up correlations
What effect does modulus b have???

Selecting the Modulus The modulus is the table size

Selecting the Modulus
If modulus b is odd, then even keys will map to even and odd buckets, and odd keys will map to odd and even buckets
Odd/even bias in keys does NOT lead to bias in buckets!
So pick odd b!
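The effect of an odd versus an even modulus can be checked directly. In this sketch the all-even key set is a made-up example of biased keys.

```python
def buckets_used(keys, b):
    """Return the sorted list of home buckets (key % b) that the keys land in."""
    return sorted({key % b for key in keys})

even_keys = range(0, 200, 2)         # hypothetical key set with an all-even bias
print(buckets_used(even_keys, 16))   # even modulus: only even buckets are used
print(buckets_used(even_keys, 17))   # odd modulus: every bucket is used
```

With b = 16 half of the table can never be a home bucket for these keys. With b = 17 the same keys spread over all 17 buckets.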

Table Size
Typically want a table of size about twice the number of entries
Depends on how much space you are willing to "waste" on empty buckets
Depends also on how expensive it is to deal with collisions/overflows
Also, subject to avoiding bias through the choice of b
Which also depends on the hash function itself (if it maps pretty randomly, then you may not need to worry about bias)

Collisions and Overflows

Linear Probing If collision, then overflow

Linear Probing Example
modulus = b (number of buckets) = 17. Home bucket = key % 17.
Table (bucket: key): 0: 34, 1: 0, 2: 45, 6: 6, 7: 23, 8: 7, 11: 28, 12: 12, 13: 29, 14: 11, 15: 30, 16: 33; buckets 3-5 and 9-10 are empty
Find pairs whose keys are 26, 18, 45
26→9: empty, hence a miss
18→1: filled, but key is not 18, so try 2
2: filled, but key is not 18, so try 3
3: empty, hence a miss
45→11: buckets 11, 12, 13, 14, 15, 16, 0, 1 are all filled and none holds 45; bucket 2 holds 45, so found it!
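The searches above can be replayed with a minimal linear-probing sketch. The insertion order is the key sequence given on the hash-chains slide (6, 12, 34, 29, 28, 11, 23, 7, 0, 33, 30, 45). The function names are illustrative, and the code assumes the table never fills completely and that no duplicate keys are inserted.

```python
def build_table(keys, b=17):
    """Insert keys with linear probing; home bucket = key % b."""
    table = [None] * b
    for key in keys:
        i = key % b
        while table[i] is not None:      # collision: try the next bucket, wrapping
            i = (i + 1) % b
        table[i] = key
    return table

def search(table, key):
    """Return (bucket holding key or None, list of buckets examined)."""
    b = len(table)
    i = key % b
    examined = []
    for _ in range(b):
        examined.append(i)
        if table[i] is None:             # empty bucket: a miss
            return None, examined
        if table[i] == key:              # a hit
            return i, examined
        i = (i + 1) % b
    return None, examined                # table full and key absent

table = build_table([6, 12, 34, 29, 28, 11, 23, 7, 0, 33, 30, 45])
print(search(table, 26))   # (None, [9])
print(search(table, 18))   # (None, [1, 2, 3])
print(search(table, 45))   # (2, [11, 12, 13, 14, 15, 16, 0, 1, 2])
```

The search for 45 examines nine buckets even though only two other keys (28 and 11) share its home bucket, which is the clustering effect discussed later.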


Linear Probing Example (removing key 0)
modulus = b (number of buckets) = 17. Home bucket = key % 17. Same table as in the search example.
Remove pair whose key is 0
0→0: filled, but key is not 0, so try 1
1: filled, and key is 0, so delete
But now a search for 45 would find a hole: a miss!
Search rest of cluster for replacement
2: key is 45 → 11, which is “<=” 1 when wrapping around the table, so move 45 to replace the 0 item

Linear Probing Example (removing key 29)
modulus = b (number of buckets) = 17. Home bucket = key % 17. Same table as in the search example.
Remove pair whose key is 29
29→12: filled, but key is not 29, so try 13
13: filled, and key is 29, so delete
But now a search for 11 would find a hole: a miss!
Search rest of cluster for replacement
14: key is 11 (home bucket 11), which is “less than” the hole position 13, so move 11 to replace the 29 item

Linear Probing Example (removing key 29, continued)
After the move: 13: 11, 14: empty (the hole), 15: 30, 16: 33
Can we stop? No, continue to search the cluster
15: key is 30→13, so shift left and continue
16: key is 33→16, what do we do? We can't shift it. Are we done?
No, continue to search the cluster
0: key is 34→0, can't shift; continue
1: key is 0→0, what do we do? Can't shift it past 0, so it stays. Are we done?
Not yet, there are still non-empty buckets
2: key is 45→11, so shift! Are we done?
3: empty, done!
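The removal walk-through can be sketched as a "delete, then pull cluster members back into the hole" routine. This is one common way to write it, not necessarily the code behind the slides. The cyclic-distance test is how this sketch expresses the informal "home bucket is at or before the hole" rule.

```python
EXAMPLE = [34, 0, 45, None, None, None, 6, 23, 7, None, None, 28, 12, 29, 11, 30, 33]

def remove(table, key):
    """Remove key from a linear-probing table (home bucket = key % b); True if removed."""
    b = len(table)
    i = key % b
    for _ in range(b):                   # locate the key by probing
        if table[i] is None:
            return False                 # hit an empty bucket: key absent
        if table[i] == key:
            break
        i = (i + 1) % b
    else:
        return False                     # probed every bucket without finding key
    table[i] = None
    hole = i
    j = (hole + 1) % b
    while table[j] is not None:          # scan the rest of the cluster
        home = table[j] % b
        # The item at j may fill the hole only if its home bucket is not
        # cyclically after the hole, i.e. its probe path passes through the hole.
        if (j - home) % b >= (j - hole) % b:
            table[hole] = table[j]
            table[j] = None
            hole = j
        j = (j + 1) % b
    return True

table = list(EXAMPLE)
remove(table, 29)
print(table)
```

Removing 29 from the example table moves 11 to bucket 13, 30 to bucket 14, and 45 to bucket 15, while 33, 34, and 0 stay in place, matching the steps above.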


Linear Probing Performance
Worst case for insert/find/remove: Θ(N), where N is the number of items
When does this happen? All items have the same home bucket!
Observation: insertion of a key with one hash value can make the search for a key with a different hash value take much longer!!!
Clustering!!!

Linear Probing Expected Performance (large N)
Loading density α = #items / #buckets
α = 12/17 in example
$S_N$ = # buckets examined on hit
$U_N$ = # buckets examined on miss
Insertion and removal governed by $U_N$

Linear Probing
Loading density α = #items / #buckets
α = 12/17 in example; α < 0.75 recommended
$S_N \approx \frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right)$
$U_N \approx \frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^2}\right)$
α       S_N     U_N
0.5     1.5     2.5
0.75    2.5     8.5
0.9     5.5     50.5
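The table values follow directly from the two estimates. A short sketch (function names are illustrative) reproduces them and also evaluates the example load of 12/17.

```python
def hit_cost_linear(alpha):
    """S_N estimate for linear probing: (1/2)(1 + 1/(1 - alpha))."""
    return 0.5 * (1 + 1 / (1 - alpha))

def miss_cost_linear(alpha):
    """U_N estimate for linear probing: (1/2)(1 + 1/(1 - alpha)^2)."""
    return 0.5 * (1 + 1 / (1 - alpha) ** 2)

for alpha in (0.5, 12 / 17, 0.75, 0.9):
    print(f"alpha={alpha:.3f}  S_N~{hit_cost_linear(alpha):.2f}  "
          f"U_N~{miss_cost_linear(alpha):.2f}")
```

The miss cost grows with the square of 1/(1 - α), which is why it rises so sharply once the load passes 0.75.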

Linear Probing Design
Suppose you want at most 10 compares on a hit, and at most 13 compares on a miss
What is the most your load density should be?
$S_N \approx \frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right) \le 10$
$U_N \approx \frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^2}\right) \le 13$
Work it out. Left half do hits, right do misses.

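Linear Probing Design
$S_N \approx \frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right) \le 10$
$\frac{1}{1-\alpha} \le 19$
$\frac{1}{19} \le 1-\alpha$
$\alpha \le 18/19$
$U_N \approx \frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^2}\right) \le 13$
$\frac{1}{(1-\alpha)^2} \le 25$
$\frac{1}{1-\alpha} \le 5$
$\alpha \le 4/5$
Take the smaller of the two, so $\alpha \le 4/5$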

Linear Probing Design
Suppose you want at most 10 compares on a hit, and at most 13 compares on a miss
Your load density should be <= 4/5
So if you know there will be at most 1000 entries, design a table of size.....
b = 1000 * 5/4
b = 1250, ... but maybe there is a better choice...
Might pick 1259 as the smallest b >= 1250 that has no prime factors < 20
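The design calculation can be scripted as a check. This sketch inverts the two linear-probing estimates for given bounds on hit and miss compares, then searches upward for a table size with no prime factor below 20. The function names are illustrative; `Fraction` is used for the load so the division 1000 / (4/5) is exact.

```python
from fractions import Fraction
from math import ceil, sqrt

SMALL_PRIMES = (2, 3, 5, 7, 11, 13, 17, 19)   # all primes below 20

def max_load(max_hit, max_miss):
    """Largest load density meeting both bounds under the linear-probing estimates."""
    alpha_hit = 1 - 1 / (2 * max_hit - 1)          # from S_N <= max_hit
    alpha_miss = 1 - 1 / sqrt(2 * max_miss - 1)    # from U_N <= max_miss
    return min(alpha_hit, alpha_miss)

def table_size(max_entries, alpha):
    """Smallest b >= max_entries / alpha that has no prime factor below 20."""
    b = ceil(max_entries / alpha)
    while any(b % p == 0 for p in SMALL_PRIMES):
        b += 1
    return b

print(max_load(10, 13))                    # close to 0.8
print(table_size(1000, Fraction(4, 5)))    # 1259
```

The miss bound is the binding one here: the hit bound alone would allow a load of 18/19.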

Linear Probing Design
Suppose you want at most 10 compares on a hit, and at most 13 compares on a miss
Your load density should be <= 4/5
If you don't know how many entries there will be, then what?
Start out with some “reasonable” size, and “double” the table if load > 4/5
Easy to monitor load....

Linear Probing Design
Doubling table size...
But when we increase the table size, we also change b, which changes the hash function!
Must re-enter all items in the hash table!
When do we shrink the hash table?
Certainly not before load < (4/5)/2 = 0.4
Hysteresis => shrink when load < 0.2
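A minimal sketch of the grow/shrink policy follows. The thresholds 4/5 and 0.2 come from the slide. Growing to 2b + 1 and shrinking to an odd size near b/2 are assumptions made here so that b stays odd, and the function names are illustrative.

```python
def rehash(keys, new_b):
    """Re-enter every key into a fresh table of new_b buckets (linear probing)."""
    new_table = [None] * new_b
    for key in keys:
        if key is None:                  # skip empty buckets of an old table
            continue
        i = key % new_b                  # home bucket changes with the modulus
        while new_table[i] is not None:
            i = (i + 1) % new_b
        new_table[i] = key
    return new_table

def resize_if_needed(table, grow_at=0.8, shrink_at=0.2):
    """Return a resized copy if the load is outside [shrink_at, grow_at], else table."""
    b = len(table)
    n = sum(key is not None for key in table)
    load = n / b
    if load > grow_at:
        return rehash(table, 2 * b + 1)       # assumption: 2b + 1 keeps b odd
    if load < shrink_at and b > 1:
        return rehash(table, (b // 2) | 1)    # assumption: odd size near b / 2
    return table

table = rehash(range(14), 17)            # 14/17 is above 4/5
print(len(resize_if_needed(table)))
```

Because the shrink threshold (0.2) is well below half the grow threshold, a table that has just grown or shrunk does not immediately resize again.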

Non-Linear Probing
Linear probing ends up making big clusters, which increases time for everything!
Remember: clusters are adjacent non-empty hash table entries
If the home bucket is in a cluster, you add to it, making big clusters even bigger!
Increases time for non-synonyms!!!
Other strategies:
Quadratic probing
Random probing

Non-Linear Probing
Quadratic probing: instead of proceeding to the very next slot, H, H+1, H+2, H+3, ...
Square the retry number to get the distance from the home bucket: $H, H+1^2, H+2^2, H+3^2, \dots$
This way you quickly escape the cluster in which the home bucket lives
Still have some locality
Used by Berkeley Fast File System
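The quadratic probe sequence is easy to generate. This sketch takes every probe modulo b; the home bucket 11 and b = 17 are borrowed from the earlier linear-probing example.

```python
def quadratic_probes(home, b, count):
    """First `count` buckets probed: home, home + 1^2, home + 2^2, ... (all mod b)."""
    return [(home + i * i) % b for i in range(count)]

print(quadratic_probes(11, 17, 5))              # [11, 12, 15, 3, 10]
print(len(set(quadratic_probes(11, 17, 17))))   # distinct buckets reached in 17 probes
```

One caveat this exposes: with b = 17 the sequence reaches only 9 distinct buckets, so quadratic probing does not by itself guarantee that every slot is examined.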

Non-Linear Probing
Random probing: instead of proceeding to the very next slot, H, H+1, H+2, H+3, ...
Pick a random distance from H: H, H+R(1), H+R(2), H+R(3), ...
All are modulo b
Where does R(.) come from?

Non-Linear Probing
Random probing:
Can produce a (pseudo-)random permutation R(.) of all non-zero bucket indices, setting R(0)=0
This way you quickly escape the cluster in which the home bucket lives
But what about collisions? All synonyms will follow the same sequence
Also: poor locality
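One way to obtain R(.) is to shuffle the non-zero offsets once with a seeded generator. This is a sketch under that assumption; the seed and function names are illustrative.

```python
import random

def random_offsets(b, seed=0):
    """R(0) = 0 followed by a pseudo-random permutation of the offsets 1..b-1."""
    offsets = list(range(1, b))
    random.Random(seed).shuffle(offsets)   # fixed seed: same R(.) on every run
    return [0] + offsets

def random_probes(home, offsets):
    """Probe sequence H, H+R(1), H+R(2), ... with every probe taken modulo b."""
    b = len(offsets)
    return [(home + r) % b for r in offsets]

R = random_offsets(17)
print(random_probes(11, R))
```

Every bucket is visited exactly once, but two keys with the same home bucket (synonyms) still follow the identical sequence, as the slide points out.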

Non-Linear Probing
Double Hashing: instead of proceeding to the very next slot, H, H+1, H+2, H+3, ...
Pick a random probe stride R: H, H+R, H+2R, H+3R, ...
This way you quickly escape the cluster in which the home bucket lives
Keys with different home buckets are not likely to add to the same “cluster” now

Double Hashing
But what could go wrong?
What if R is not relatively prime to the table size (hence the modulus b)?
Then we don't consider all of the slots in the table! BAD
So make sure R is relatively prime to b
But where does R come from?
Can use a second hash function...
... subject to the constraints above
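The relatively-prime requirement can be checked with a short sketch. The second hash function `1 + key % (b - 1)` is a common textbook choice and is an assumption here; the slides do not say which one they use. With a prime b, every stride it produces is relatively prime to b.

```python
from math import gcd

def second_hash(key, b):
    """Hypothetical stride function: a value in 1..b-1, never 0."""
    return 1 + key % (b - 1)

def double_hash_probes(key, b):
    """Probe sequence H, H+R, H+2R, ... (mod b) for the given key."""
    home, stride = key % b, second_hash(key, b)
    return [(home + i * stride) % b for i in range(b)]

def slots_reached(home, stride, b):
    """Number of distinct buckets visited in b probes with a fixed stride."""
    return len({(home + i * stride) % b for i in range(b)})

print(double_hash_probes(45, 17)[:3])               # [11, 8, 5]
print(slots_reached(3, 4, 16), gcd(4, 16))          # stride shares a factor with b
print(slots_reached(3, 5, 16), gcd(5, 16))          # stride relatively prime to b
```

With b = 16 and stride 4 only 4 of the 16 slots are ever examined, while stride 5 reaches all 16.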

Double Hashing
Loading density α = #items / #buckets
$S_N \approx \frac{1}{\alpha}\ln\frac{1}{1-\alpha}$
$U_N \approx \frac{1}{1-\alpha}$ (analysis complex)
α       S_N     U_N
0.5     1.4     2
0.75    1.8     4
0.9     2.6     10
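The double-hashing estimates can be evaluated the same way as the linear-probing ones. This short sketch (function names are illustrative) reproduces the table to one decimal place.

```python
from math import log

def hit_cost_double(alpha):
    """S_N estimate for double hashing: (1/alpha) ln(1/(1 - alpha))."""
    return (1 / alpha) * log(1 / (1 - alpha))

def miss_cost_double(alpha):
    """U_N estimate for double hashing: 1/(1 - alpha)."""
    return 1 / (1 - alpha)

for alpha in (0.5, 0.75, 0.9):
    print(f"alpha={alpha}  S_N~{hit_cost_double(alpha):.1f}  "
          f"U_N~{miss_cost_double(alpha):.1f}")
```

At α = 0.9 a miss costs about 10 probes here, against about 50.5 under the linear-probing estimate.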

Hash Chains
Alternative to probing
Use hash table entries to point to linked lists, each holding all the synonyms with that home bucket

Hash Chains
Advantage: never, ever increase time of non-synonym by insertion
Disadvantage: more complex structure, more space
modulus = b (number of buckets) = 17. Home bucket = key % 17.
Put in pairs whose keys are 6, 12, 34, 29, 28, 11, 23, 7, 0, 33, 30, 45
Chains by home bucket: 0: 0, 34; 6: 6, 23; 7: 7; 11: 11, 28, 45; 12: 12, 29; 13: 30; 16: 33; all other buckets are empty
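A minimal chained table can be sketched as follows. Python lists stand in for the linked lists of the slides, and the class name, the append-at-end order, and overwriting on a duplicate key are assumptions made for this example.

```python
class ChainedHashTable:
    """Hash table with one chain per bucket; home bucket = key % b."""

    def __init__(self, b=17):
        self.b = b
        self.chains = [[] for _ in range(b)]   # Python lists stand in for linked lists

    def put(self, key, element):
        chain = self.chains[key % self.b]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, element)      # assumption: overwrite on duplicate
                return
        chain.append((key, element))

    def get(self, key):
        for k, element in self.chains[key % self.b]:
            if k == key:
                return element
        return None

    def remove(self, key):
        chain = self.chains[key % self.b]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False

table = ChainedHashTable()
for key in [6, 12, 34, 29, 28, 11, 23, 7, 0, 33, 30, 45]:
    table.put(key, str(key))
print([k for k, _ in table.chains[11]])   # synonyms with home bucket 11
```

A search only ever walks the chain of its own home bucket, so inserting 45 lengthens the chain for bucket 11 and nothing else.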


