Data Structures & Algorithms Hash Tables



Presentation on theme: "Data Structures & Algorithms Hash Tables"— Presentation transcript:

1 Data Structures & Algorithms Hash Tables
Richard Newman, based on slides by S. Sahni and the book by R. Sedgewick

2 Dictionary get(theKey) find if an item with theKey is in dictionary
put(theKey, theElement) add an item to the dictionary; remove(theKey) delete the item with theKey from the dictionary

3 Unsorted Array get(theKey) O(N) time put(theKey, theElement)
(example array: c e d b) get(theKey): O(N) time; put(theKey, theElement): O(N) time to find a duplicate, O(1) to add; remove(theKey): O(N) time.

4 Sorted Array get(theKey) O(lg N) time put(theKey, theElement)
(example array: b c d e) get(theKey): O(lg N) time; put(theKey, theElement): O(lg N) time to find a duplicate, O(N) to add; remove(theKey): O(N) time.

5 Unsorted Chain get(theKey) O(N) time put(theKey, theElement)
(chain: firstNode → … → null) get(theKey): O(N) time; put(theKey, theElement): O(N) time to verify no duplicate, O(1) to add; remove(theKey): O(N) time.

6 Sorted Chain get(theKey) O(N) time put(theKey, theElement)
(chain: firstNode → b → c → d → e → null) get(theKey): O(N) time; put(theKey, theElement): O(N) time to verify no duplicate, O(1) to add; remove(theKey): O(N) time.

7 Costs of Insertion and Search
[Table of worst-case and average-case costs (insert, search, select, search hit, search miss) for key-indexed array, ordered array, ordered linked list, unordered array, and unordered linked list.] N = number of items, M = size of container

8 Costs of Insertion and Search
[Table of worst-case and average-case costs (insert, search, select, search hit, search miss) for binary search, binary search tree, red-black tree, randomized tree, and hashing.] N = number of items, M = size of container

9 Binary Search Trees get(theKey)
O(N) time worst case, O(lg N) time average; put(theKey, theElement): O(N) worst case, O(lg N) average to find a duplicate, O(1) to add; remove(theKey): O(N) time worst case, O(lg N) time average.

10 Other BSTs Randomized – as the insertion search proceeds, recursively make the new node the root of the current subtree with the appropriate (uniform) probability. AVL Tree – rotate subtrees to maintain balance. 2-3-4 Tree – adjust the number of children per node so that all leaves stay at the same depth. Red-Black Tree – rotate to maintain the same number of black edges on all paths to leaves, with no two consecutive red edges.

11 Other BSTs Splay trees –
good for non-uniform searching, or for searching that displays temporal correlation (i.e., after searching for something, you are likely to search for it again soon) ... Skip lists – use a linked-list structure, but with extra links that allow tree-like speeds with a simpler structure. We will explore them now...

12 Skip Lists Skip lists are linked lists… Except with extra links
That allow the ADT to skip over large portions of the list at a time during search. Defn 13.5: A skip list is an ordered linked list where each node contains a variable number of links, with the i-th links in the nodes implementing singly linked lists that skip the nodes with fewer than i links.

13 Skip Lists A A C E E G H I L M N P R
Linked list – search for N: how many steps? Ans: 11 links followed. Skip list with jumps of 3 – search for N: steps? Ans: 6. More links can help – how many steps now? Ans: 4

14 Skip List Search Algo Sketch: Start at highest level
Search the linked list at that level. If you find the item – great! If you find the end of the list or a larger key, then drop down a level and resume, until there are no more levels.
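A minimal Java sketch of this search, assuming a hypothetical SkipNode class whose next array holds one forward link per level (names and layout are illustrative, not from the slides):

class SkipNode {
    int key;
    SkipNode[] next;                       // next[i] = successor at level i
    SkipNode(int key, int levels) { this.key = key; this.next = new SkipNode[levels]; }
}

class SkipListSearch {
    // head is a header node with 'levels' links and no meaningful key
    static boolean contains(SkipNode head, int levels, int key) {
        SkipNode cur = head;
        for (int lvl = levels - 1; lvl >= 0; lvl--) {          // start at the highest level
            while (cur.next[lvl] != null && cur.next[lvl].key < key)
                cur = cur.next[lvl];                           // follow links at this level
            // end of this level's list or a larger key: drop down one level
        }
        SkipNode candidate = cur.next[0];                      // bottom list contains every node
        return candidate != null && candidate.key == key;
    }
}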

15 Skip List Insert Algo Sketch: Find where new node should go
Build a node with j links. Connect each predecessor node in the j lists to the new node. Connect the new node to its successor (if any) in the j lists. But what should j be? We want roughly one node in every t^j to have at least j+1 links, so each link skips about t nodes – use a random function to decide.
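One common way to pick j, sketched in Java (randomLevel and the class name are illustrative; any scheme that grants an extra level with probability 1/t behaves the same way):

import java.util.Random;

class SkipLevel {
    static final Random RNG = new Random();

    // With parameter t, keep adding a level with probability 1/t,
    // so about one node in t^j ends up with more than j links.
    static int randomLevel(int t, int maxLevels) {
        int j = 1;
        while (j < maxLevels && RNG.nextInt(t) == 0)
            j++;
        return j;
    }
}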

16 Skip List Time Prop 13.10: Search and insertion in a randomized skip list with parameter t take about (t log_t N)/2 = (t/(2 lg t)) lg N comparisons, on the average. We expect about log_t N levels, each link skips about t nodes on the level below, and we go through about half the links on each level.

17 Skip List Space Prop 13.11: Skip lists with parameter t have about (t/(t-1))N links on the average. There are N links on the bottom, N/t on the next level, about N/t^2 on the next, and so on. The total number of links is about N(1 + 1/t + 1/t^2 + …) = N/(1 - 1/t) = (t/(t-1))N

18 Skip List Tradeoff Picking the parameter t gives a time/space trade-off. When t = 2, skip lists need about lg N comparisons and 2N links on average, like the best BST types. Larger t gives longer search and insert times but uses less space. The choice t = e (the base of the natural log) minimizes the expected number of comparisons (differentiate the equation in Prop 13.10).

19 Skip List Other Functions
Remove, join, and select functions are straightforward extensions.

20 Hash Tables So skip lists can also give good performance, similar to the best tree structures. Can we do better? Key-indexed searching has great performance – O(1) time for almost everything. But big constraints: array size = key range, and no duplicate keys.

21 Hash Tables Expected time for insert, find, remove is constant, but the worst case is O(N). The idea is to squeeze a big key space into a smaller one, so all keys fit into a fairly small table. The challenge is to avoid distinct keys landing in the same slot … and to deal with them when they do anyway. So here they are ...

22 Ideal Hash Tables Uses a 1D array (or table) table[0:b-1].
Each position of this array is a bucket. A bucket can normally hold only one dictionary pair. Uses a hash function f that converts each key k into an index in the range [0, b-1]. f(k) is the home bucket for key k. Every dictionary pair (key, element) is stored in its home bucket table[f(key)].

23 Ideal Hash Tables Pairs are: (22,a), (33,c), (3,d), (73,e), (85,f).
Hash table is table[0:7], b = 8. Hash function is key/11 (integer division). Pairs are stored in the table as below: 22/11=2, 33/11=3, 3/11=0, 73/11=6, 85/11=7. Everything is fast – constant time! What could possibly go wrong? Table: [0]=(3,d), [1] empty, [2]=(22,a), [3]=(33,c), [4] empty, [5] empty, [6]=(73,e), [7]=(85,f)

24 What Could Go Wrong Where to put (26,g)?
(22,a) is already in slot 2, and 26/11 = 2 as well! Keys with the same home bucket are called synonyms: 22 and 26 are synonyms under the /11 hash function. Table: [0]=(3,d), [2]=(22,a), [3]=(33,c), [6]=(73,e), [7]=(85,f) – and (26,g) has nowhere to go.

25 What Could Go Wrong A collision occurs when two items with different keys have the same home bucket A bucket may be able to store more than one item... If bucket is full, then we have an overflow If buckets are of size 1, then overflows occur on every collision We must deal with these somehow! 26 26 26 26

26 Hash Table Issues What is size of table?
Want it to be small for efficiency … big enough to reduce collisions. What is the hash function? Want it fast (so the time constant is small) … but "random" enough to avoid collisions. How do we deal with overflows?

27 Hash Function First – convert to integer if not already
Char to int, e.g.; repeat and combine for a string; use imagination for other objects. Next – reduce the space of integers to the table size: divide by some number (example: /11), or – more often – take the key modulo the table size.
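A minimal Java sketch of this two-step recipe (the multiplier 31 and the names here are illustrative choices, not prescribed by the slides):

class StringHash {
    // Fold the characters of a string into one integer, then reduce mod table size b.
    static int hash(String key, int b) {
        int h = 0;
        for (int i = 0; i < key.length(); i++)
            h = 31 * h + key.charAt(i);    // combine characters into one integer
        return Math.floorMod(h, b);        // home bucket in [0, b-1], even if h went negative
    }

    public static void main(String[] args) {
        System.out.println(hash("hello", 17));   // some bucket in 0..16
    }
}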

28 Hash Function Let KeySpace be the set of all possible keys
Could be unbounded (with some distribution). A uniform hash function maps keys over all of KeySpace to buckets so that the number of keys per bucket is about the same for all buckets. Equivalently: a random key maps to any given bucket with probability 1/b, where b is the number of buckets.

29 Hash Function Uniform hash functions make collisions (hence overflows) unlikely when keys are randomly chosen. For any table size b, if the key space is 32-bit integers chosen at random, k % b will distribute keys uniformly over the buckets. In practice, keys tend to be non-uniform and correlated, so we want the hash function to help break up correlations. What effect does the modulus b have???

30 Selecting the Modulus The modulus is the table size
If the modulus b is even, then even keys will always map to even buckets and odd keys will always map to odd buckets. Not good: bias in keys leads to bias in buckets!

31 Selecting the Modulus If modulus b is odd, Then even keys will map to
even and odd buckets, and odd keys will map to odd and even buckets. Odd/even bias in keys does NOT lead to bias in buckets! So pick odd b!

32 Selecting the Modulus Similar effects are seen with moduli that are multiples of small primes (3, 5, 7, ...). The effect diminishes as the prime grows. Ideally, pick b to be a prime number! Or at least avoid any prime factors less than 20 in b. For convenient resizing, you may end up with just odd numbers (b → 2b + 1)
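A small Java sketch of that last rule – starting from a desired size, step up to the first b with no prime factor below 20 (helper names are illustrative):

class TableSize {
    static boolean hasSmallPrimeFactor(int b) {
        int[] smallPrimes = {2, 3, 5, 7, 11, 13, 17, 19};
        for (int p : smallPrimes)
            if (b % p == 0) return true;   // b shares a factor below 20
        return false;
    }

    static int nextGoodSize(int desired) {
        int b = desired;
        while (hasSmallPrimeFactor(b)) b++;
        return b;
    }

    public static void main(String[] args) {
        System.out.println(nextGoodSize(1250));  // 1259, as in the design example later on
    }
}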

33 Table Size Typically want a table of size about twice the number of entries. Depends on how much space you are willing to "waste" on empty buckets, and also on how expensive it is to deal with collisions/overflows. Also subject to avoiding bias via the choice of b, which in turn depends on the hash function itself (if it maps pretty randomly, then bias may not be a worry).

34 Collisions and Overflows
Can handle collisions by making the bucket larger, allowing it to hold multiple pairs: an array, or a linked list (hash chain). Or, may allow probing into the table on overflow: linear probing, quadratic probing, random probing

35 Linear Probing If collision, then overflow
Walk through the table until you find an empty bucket, and place the item in that bucket. To find a key, you must not only look at its home bucket but keep walking through the table until you either find the item, or find an empty bucket. Remove must take care to preserve the linear-probe search.

36 Linear Probing Example
modulus = b (number of buckets) = 17. Home bucket = key % 17. Put in the pairs whose keys are 6, 12, 34, 29, 28, 11, 23, 7, 0, 33, 30, 45. Probes: 6→6; 12→12; 34→0; 29→12, then 13; 28→11; 11→11, 12, 13, 14; 23→6, then 7; 7→7, 8; 0→0, 1; 33→16; 30→13, 14, 15; 45→11, 12, 13, 14, 15, 16, 0, 1, 2. Resulting table: [0]=34, [1]=0, [2]=45, [3]-[5] empty, [6]=6, [7]=23, [8]=7, [9]-[10] empty, [11]=28, [12]=12, [13]=29, [14]=11, [15]=30, [16]=33
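A minimal Java sketch of linear-probing put and search for this kind of table (keys only, no elements; b = 17 and the test keys just mirror the example above; it assumes the table never fills completely):

class LinearProbing {
    Integer[] table;   // null marks an empty bucket
    int b;

    LinearProbing(int b) { this.b = b; this.table = new Integer[b]; }

    void put(int key) {
        int i = Math.floorMod(key, b);                 // home bucket
        while (table[i] != null && table[i] != key)    // walk until an empty bucket (or duplicate)
            i = (i + 1) % b;
        table[i] = key;
    }

    boolean contains(int key) {
        int i = Math.floorMod(key, b);
        while (table[i] != null) {                     // an empty bucket means a miss
            if (table[i] == key) return true;          // found it: a hit
            i = (i + 1) % b;
        }
        return false;
    }

    public static void main(String[] args) {
        LinearProbing h = new LinearProbing(17);
        for (int k : new int[]{6, 12, 34, 29, 28, 11, 23, 7, 0, 33, 30, 45}) h.put(k);
        System.out.println(h.contains(45));   // true  (9 probes, wrapping around)
        System.out.println(h.contains(26));   // false (bucket 9 is empty)
    }
}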

37 Linear Probing Example
modulus = b (number of buckets) = 17. Home bucket = key % 17. Find the pairs whose keys are 26, 18, 45 (table as above). 26→9: empty, hence a miss. 18→1: filled, but the key is not 18, so try 2; 2: filled, but the key is not 18, so try 3; 3: empty, hence a miss. 45→11, 12, 13, 14, 15, 16, 0, 1: all filled with other keys; 2: holds 45 – found it!!!!


39 Linear Probing Example
modulus = b (number of buckets) = 17. Home bucket = key % 17. Remove the pair whose key is 0 (table as above). 0→0: filled, but the key is not 0, so try 1; 1: filled, and the key is 0, so delete it. But now a search for 45 would find a hole – a miss! Search the rest of the cluster for a replacement: 2: key 45's home bucket is 11, so its probe path passes through the hole at 1 before reaching 2 – move 45 to replace the deleted item.

40 Linear Probing Example
modulus = b (number of buckets) = 17. Home bucket = key % 17. Remove the pair whose key is 29 (table as above). 29→12: filled, but the key is not 29, so try 13; 13: filled, and the key is 29, so delete it. But now a search for 11 would find a hole – a miss! Search the rest of the cluster for a replacement: 14: key 11's home bucket is 11, at or before the hole at 13, so move 11 to replace the 29 item.

41 Linear Probing Example
modulus = b (number of buckets) = 17. Home bucket = key % 17. Remove the pair whose key is 29, continued (the hole is now at 14). Can we stop? No – continue to search the cluster. 15: key 30's home bucket is 13, so shift it left into the hole and continue (hole now at 15). 16: key 33's home bucket is 16 – what do we do? We can't shift it past its home bucket. Are we done?

42 Linear Probing Example
modulus = b (number of buckets) = 17. Home bucket = key % 17. Remove the pair whose key is 29, continued. No – continue to search the cluster. 0: key 34's home bucket is 0 – can't shift it; continue. 1: key 0's home bucket is 0 – what do we do? We can't shift it past its home bucket, so it stays. Not done yet – still non-empty buckets. 2: key 45's home bucket is 11, so shift it into the hole at 15! Are we done? 3: empty – done!
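A Java sketch of this removal procedure (Knuth-style cluster repair) for the Integer[] table layout used in the earlier linear-probing sketch; it assumes the table is not full and stores keys only:

class LinearProbingRemove {
    static void remove(Integer[] table, int b, int key) {
        int i = Math.floorMod(key, b);
        while (table[i] != null && table[i] != key)    // locate the slot holding key
            i = (i + 1) % b;
        if (table[i] == null) return;                  // key not present
        table[i] = null;                               // make a hole

        int j = i;
        while (true) {
            j = (j + 1) % b;
            if (table[j] == null) return;              // reached the end of the cluster
            int r = Math.floorMod(table[j], b);        // home bucket of the key at j
            // Leave table[j] alone only if its home bucket r lies cyclically in (i, j];
            // otherwise a search for it would stop at the hole, so shift it into i.
            boolean homeAfterHole = (i < j) ? (i < r && r <= j) : (r > i || r <= j);
            if (!homeAfterHole) {
                table[i] = table[j];
                table[j] = null;                       // the hole moves to j
                i = j;
            }
        }
    }
}

Run on the table from the insertion example, remove(table, 17, 29) reproduces the moves in the walkthrough above: 11 shifts into 13, 30 into 14, 33, 34, and 0 stay put, and 45 shifts into 15.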


44 Linear Probing Performance Worst case for insert/find/remove:
Θ(N), where N is the number of items. When does this happen? When all items hash to the same home bucket, forming one long cluster! Observation: insertion of a key with one hash value can make the search time for a key with a different hash value take much longer!!! Clustering!!!

45 Linear Probing Expected Performance (large N)
Loading density α = #items / #buckets; α = 12/17 in the example. S_N = expected number of buckets examined on a search hit; U_N = expected number of buckets examined on a search miss. Insertion and removal are governed by U_N.

46 Linear Probing Loading density α = #items / #buckets
α = 12/17 in the example; α < 0.75 recommended. S_N ≈ (1/2)(1 + 1/(1-α)); U_N ≈ (1/2)(1 + 1/(1-α)^2). Typical values: α = 0.5 → S_N ≈ 1.5, U_N ≈ 2.5; α = 0.75 → S_N ≈ 2.5, U_N ≈ 8.5; α = 0.9 → S_N ≈ 5.5, U_N ≈ 50.5

47 Linear Probing Design Suppose you want at most 10 compares on a hit,
And at most 13 compares on a miss. What is the most your load density should be? S_N ≈ (1/2)(1 + 1/(1-α)) <= 10; U_N ≈ (1/2)(1 + 1/(1-α)^2) <= 13. Work it out. Left half do hits, right half do misses.

48 Linear Probing Design S_N ≈ (1/2)(1 + 1/(1-α)) <= 10, so 1/(1-α) <= 19
so 1/19 <= 1-α, so α <= 18/19. U_N ≈ (1/2)(1 + 1/(1-α)^2) <= 13, so 1/(1-α)^2 <= 25, so 1/(1-α) <= 5, so α <= 4/5. Take the smaller of the two, so α <= 4/5

49 Linear Probing Design Suppose you want at most 10 compares on a hit,
And at most 13 compares on a miss. Your load density should be <= 4/5. So if you know there will be at most 1000 entries, design a table of size..... b = 1000 * 5/4, b = 1250, ... but maybe a better choice... might pick 1259 as the smallest b >= 1250 that has no prime factors < 20

50 Linear Probing Design Suppose you want at most 10 compares on a hit,
And at most 13 compares on a miss. Your load density should be <= 4/5. If you don't know how many entries there will be – then what? Start out with some "reasonable" size, and "double" the table if the load exceeds 4/5. It is easy to monitor the load....

51 Linear Probing Design Doubling table size...
But when we increase the table size, we also change b, which changes the hash function! Must re-enter all items into the new hash table! When do we shrink the hash table? Certainly not before load < (4/5)/2 = 0.4. Hysteresis => shrink when load < 0.2
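A Java sketch of the grow-and-rehash step (reusing the illustrative nextGoodSize helper from the table-size sketch; linear probing, keys only):

class Rehash {
    // Roughly double the table, nudge the size past small prime factors,
    // and re-enter every key, since the new modulus is a new hash function.
    static Integer[] grow(Integer[] oldTable) {
        int newB = nextGoodSize(2 * oldTable.length);
        Integer[] newTable = new Integer[newB];
        for (Integer key : oldTable) {
            if (key == null) continue;
            int i = Math.floorMod(key, newB);          // new home bucket under the new modulus
            while (newTable[i] != null) i = (i + 1) % newB;
            newTable[i] = key;                         // linear probing into the new table
        }
        return newTable;
    }

    static int nextGoodSize(int desired) {             // smallest b >= desired with no prime factor < 20
        int[] smallPrimes = {2, 3, 5, 7, 11, 13, 17, 19};
        for (int b = desired; ; b++) {
            boolean ok = true;
            for (int p : smallPrimes) if (b % p == 0) { ok = false; break; }
            if (ok) return b;
        }
    }
}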

52 Non-Linear Probing Linear probing ends up making big clusters – this increases time for everything! Remember – clusters are adjacent non-empty hash table entries. If the home bucket is in a cluster, you add to it – making big clusters even bigger! This increases time even for non-synonyms!!! Other strategies: quadratic probing, random probing

53 Non-Linear Probing Quadratic probing: instead of proceeding to the very next slot, H, H+1, H+2, H+3, ..., square the retry number to get the distance from the home bucket: H, H+1², H+2², H+3², ... This way you quickly escape the cluster in which the home bucket lives. Still has some locality. Used by the Berkeley Fast File System.
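A tiny Java sketch of the quadratic probe sequence (just the index arithmetic; b = 17 and key 29 are reused from the earlier examples for illustration):

class QuadraticProbe {
    // The i-th retry looks at (home + i*i) mod b instead of (home + i) mod b.
    static int probe(int key, int attempt, int b) {
        int home = Math.floorMod(key, b);
        return (home + attempt * attempt) % b;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++)
            System.out.print(probe(29, i, 17) + " ");   // 12 13 16 4 11 – quickly leaves the cluster
        System.out.println();
    }
}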

54 Non-Linear Probing Random probing: instead of proceeding to the very next slot, H, H+1, H+2, H+3, ..., pick a random distance from H: H, H+R(1), H+R(2), H+R(3), ... All indices are taken modulo b. Where does R(.) come from?

55 Non-Linear Probing Random probing:
Can produce a (pseudo-)random permutation R(.) of all non-zero bucket indices, setting R(0) = 0. This way you quickly escape the cluster in which the home bucket lives. But – what about collisions? All synonyms will follow the same probe sequence. Also – poor locality.

56 Non-Linear Probing Double hashing: instead of proceeding to the very next slot, H, H+1, H+2, H+3, ..., pick a random probe stride R for the key: H, H+R, H+2R, H+3R, ... This way you quickly escape the cluster in which the home bucket lives, and keys with different home buckets are not likely to add to the same "cluster" now.

57 Double Hashing But what could go wrong?
What if R is not relatively prime to the table size (hence to the modulus b)? Then we don't consider all of the slots in the table! BAD. So make sure R is relatively prime to b. But where does R come from? Can use a second hash function... ...subject to the constraints above.
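A Java sketch of double hashing along these lines; the particular second hash, 1 + key % (b-1), is just one common illustrative choice, and with prime b any stride in [1, b-1] is automatically relatively prime to b:

class DoubleHash {
    static int probe(int key, int attempt, int b) {
        int home = Math.floorMod(key, b);
        int stride = 1 + Math.floorMod(key, b - 1);    // stride R in [1, b-1], never 0
        return (home + attempt * stride) % b;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 4; i++)
            System.out.print(probe(29, i, 17) + " ");  // 12 9 6 3 – stride R = 14 for key 29
        System.out.println();
    }
}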

58 Double Hashing Loading density α = #items / #buckets
S_N ≈ (1/α) ln(1/(1-α)); U_N ≈ 1/(1-α) (the analysis is complex). Typical values: α = 0.5 → S_N ≈ 1.4, U_N ≈ 2; α = 0.75 → S_N ≈ 1.8, U_N ≈ 4; α = 0.9 → S_N ≈ 2.6, U_N ≈ 10

59 Hash Chains Alternative to probing
Use hash table entries to point to linked lists, each holding all the synonyms with that home bucket

60 Hash Chains Alternative to probing
Use hash table entries to point to linked lists, each holding all the synonyms with that home bucket. Advantage: insertion never increases the time of a non-synonym. Disadvantage: more complex structure, more space. Example: modulus = b (number of buckets) = 17, home bucket = key % 17; put in the pairs whose keys are 6, 12, 34, 29, 28, 11, 23, 7, 0, 33, 30, 45. Chains: bucket 0: {34, 0}; bucket 6: {6, 23}; bucket 7: {7}; bucket 11: {28, 11, 45}; bucket 12: {12, 29}; bucket 13: {30}; bucket 16: {33}; all other buckets empty

61 Hash Chains Advantage:
never, ever increases the time of a non-synonym by insertion. Disadvantage: more complex structure, more space
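A minimal Java sketch of chaining, one linked list per bucket (keys only; the names are illustrative and java.util.LinkedList stands in for the hand-built chains on the slides):

import java.util.LinkedList;

class HashChains {
    LinkedList<Integer>[] table;   // table[i] is the chain of synonyms with home bucket i
    int b;

    @SuppressWarnings("unchecked")
    HashChains(int b) {
        this.b = b;
        this.table = new LinkedList[b];
        for (int i = 0; i < b; i++) table[i] = new LinkedList<>();
    }

    void put(int key) {
        LinkedList<Integer> chain = table[Math.floorMod(key, b)];
        if (!chain.contains(key)) chain.add(key);     // only synonyms share this chain
    }

    boolean contains(int key) {
        return table[Math.floorMod(key, b)].contains(key);
    }

    public static void main(String[] args) {
        HashChains h = new HashChains(17);
        for (int k : new int[]{6, 12, 34, 29, 28, 11, 23, 7, 0, 33, 30, 45}) h.put(k);
        System.out.println(h.contains(45));   // true: chained in bucket 11 with 28 and 11
        System.out.println(h.contains(26));   // false: bucket 9's chain is empty
    }
}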

