Data Structures & Algorithms: Hash Tables
Richard Newman, based on slides by S. Sahni and the book by R. Sedgewick

Dictionary
get(theKey) – find the item with theKey in the dictionary, if present
put(theKey, theElement) – add an item to the dictionary
remove(theKey) – delete the item with theKey from the dictionary

Unsorted Array (e.g., c e d b)
get(theKey) – O(N) time
put(theKey, theElement) – O(N) time to check for a duplicate, O(1) to add
remove(theKey) – O(N) time

Sorted Array (e.g., b c d e)
get(theKey) – O(lg N) time
put(theKey, theElement) – O(lg N) time to check for a duplicate, O(N) to add
remove(theKey) – O(N) time

Unsorted Chain (firstNode → b → ... → null)
get(theKey) – O(N) time
put(theKey, theElement) – O(N) time to verify no duplicate, O(1) to add
remove(theKey) – O(N) time

Sorted Chain (firstNode → b → c → d → e → null)
get(theKey) – O(N) time
put(theKey, theElement) – O(N) time to verify no duplicate, O(1) to add
remove(theKey) – O(N) time

Costs of Insertion and Search

                          Worst Case                 Average Case
                       Insert  Search  Select    Insert   Hit    Miss
Key-Indexed Array        1       1       M         1       1      1
Ordered Array            N      lg N     1        N/2     lg N   lg N
Ordered Linked List      N       N       N        N/2     N/2    N/2
Unordered Array          1       N     N lg N      1      N/2     N
Unordered Linked List    1       N     N lg N      1      N/2     N

N = number of items, M = size of container

Costs of Insertion and Search

                          Worst Case                 Average Case
                       Insert  Search  Select    Insert   Hit    Miss
Binary Search            N      lg N     1        N/2     lg N   lg N
Binary Search Tree       N       N       N       lg N     lg N   lg N
Red-Black Tree          lg N    lg N    lg N     lg N     lg N   lg N
Randomized Tree          N*      N*      N*      lg N     lg N   lg N
Hashing                  1       N     N lg N      1       1      1

N = number of items, M = size of container; * = probabilistic bound

Binary Search Trees
get(theKey) – O(N) time worst case, O(lg N) average
put(theKey, theElement) – O(N) worst case, O(lg N) average to find the spot, O(1) to add
remove(theKey) – O(N) time worst case, O(lg N) average

Other BSTs
Randomized – as the search for the insertion point proceeds, make the new node the root of the current subtree with uniform probability
AVL Tree – rotate subtrees to maintain balance
2-3-4 Tree – adjust the number of children per node so that all leaves stay at the same depth
Red-Black Tree – rotate to maintain the same number of black edges on every path to a leaf, with no two consecutive red edges

Other BSTs
Splay trees – good for non-uniform searching, or for searches with temporal correlations (search for something, and you are likely to search for it again soon)
Skip lists – use a linked-list structure, but with extra links that allow tree-like speeds with simpler structure
We will explore skip lists now...

Skip Lists
Skip lists are linked lists... except with extra links that allow the ADT to skip over large portions of the list at a time during search.
Defn 13.5: A skip list is an ordered linked list where each node contains a variable number of links, with the ith links in the nodes implementing singly linked lists that skip the nodes with fewer than i links.

Skip Lists
Example list: A A C E E G H I L M N P R
Linked list – search for N: how many steps? Ans: 11 links followed
Skip list with jumps of 3 – search for N: steps? Ans: 6
More links can help – how many steps now? Ans: 4

Skip List Search
Algo sketch:
Start at the highest level
Search the linked list at that level
If you find the item – great!
If you find the end of the list or a larger key,
then drop down a level and resume,
until there are no more levels
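
A minimal C++ sketch of this search, under an assumed node layout (a key plus a vector of forward links, with next[i] the level-i successor, and header a sentinel holding links on every level); none of these names are fixed by the slides.

```cpp
#include <vector>

// Assumed node layout: next[i] is the successor on level i.
struct Node {
    int key;
    std::vector<Node*> next;
};

// Start at the highest level, scan right while the next key is smaller,
// and drop a level on a larger key or the end of a list.
Node* search(Node* header, int key, int levels) {
    Node* x = header;
    for (int i = levels - 1; i >= 0; --i)
        while (x->next[i] != nullptr && x->next[i]->key < key)
            x = x->next[i];                 // skip ahead on level i
    Node* cand = x->next[0];                // no more levels: check here
    return (cand != nullptr && cand->key == key) ? cand : nullptr;
}
```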

Skip List Insert
Algo sketch:
Find where the new node should go
Build a node with j links
Connect each predecessor node in the j lists to the new node
Connect the new node to its successor (if any) in the j lists
But what should j be? We want about one in every t nodes with j links to also have at least j+1 links, so that level-j links skip about t^j nodes – use a random function to decide.
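
A sketch of that randomized choice of j, with assumed names randomLevel and maxLevel (a cap on the height): a node gets at least j+1 links with probability 1/t^j.

```cpp
#include <cstdlib>

// Choose the number of links j for a new node: keep promoting with
// probability 1/t, so about one node in t with j links also gets j+1.
int randomLevel(int t, int maxLevel) {
    int j = 1;
    while (j < maxLevel && std::rand() % t == 0)
        ++j;
    return j;
}
```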

Skip List Time
Prop 13.10: Search and insertion in a randomized skip list with parameter t take about (t log_t N)/2 = (t/(2 lg t)) lg N comparisons, on average.
We expect about log_t N levels; each link on a level skips about t nodes on the level below, and we go through about half the links on each level.

Skip List Space
Prop 13.11: Skip lists with parameter t have about (t/(t-1))N links on average.
There are N links on the bottom level, about N/t on the next, about N/t^2 on the next, and so on. The total number of links is about N(1 + 1/t + 1/t^2 + ...) = N/(1 - 1/t) = (t/(t-1))N

Skip List Tradeoff
Picking the parameter t gives a time/space trade-off.
When t = 2, skip lists need about lg N comparisons and 2N links on average, like the best BST variants.
Larger t gives longer search and insert times but uses less space.
The choice t = e (the base of the natural logarithm) minimizes the expected number of comparisons (differentiate the equation in Prop 13.10).

Skip List Other Functions
Remove, join, and select are straightforward extensions.

Hash Tables
So skip lists can also give good performance, similar to the best tree structures. Can we do better?
Key-indexed searching has great performance – O(1) time for almost everything.
But big constraints:
Array size = key range
No duplicate keys

Hash Tables
Expected time for insert, find, and remove is constant, but the worst case is O(N).
The idea is to squeeze a big key space into a smaller one, so all keys fit into a fairly small table.
The challenge is to keep distinct keys from landing in the same spot...
...and to deal with them when they do anyway.
So here they are...

Ideal Hash Tables
Uses a 1D array (or table) table[0:b-1]. Each position of this array is a bucket. A bucket can normally hold only one dictionary pair.
Uses a hash function f that converts each key k into an index in the range [0, b-1]. f(k) is the home bucket for key k.
Every dictionary pair (key, element) is stored in its home bucket table[f(key)].
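
A minimal sketch of such a table in C++, assuming integer keys, char elements, one pair per bucket, and key % b as the hash function f (the slides' example divides by 11 instead); no collision handling yet.

```cpp
#include <optional>
#include <utility>
#include <vector>

struct IdealTable {
    std::vector<std::optional<std::pair<int, char>>> table; // buckets
    int b;                                                  // # of buckets
    explicit IdealTable(int buckets) : table(buckets), b(buckets) {}

    int f(int k) const { return k % b; }          // home bucket of key k

    void put(int k, char e) { table[f(k)].emplace(k, e); }  // store pair

    std::optional<char> get(int k) const {        // constant-time lookup
        const auto& slot = table[f(k)];
        if (slot && slot->first == k) return slot->second;
        return std::nullopt;
    }
};
```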

Ideal Hash Tables
Pairs are: (22,a), (33,c), (3,d), (73,e), (85,f). Hash table is table[0:7], b = 8. Hash function is key/11.
Pairs are stored in the table as below: 22/11=2, 33/11=3, 3/11=0, 73/11=6, 85/11=7

[0] (3,d)   [1] –   [2] (22,a)   [3] (33,c)   [4] –   [5] –   [6] (73,e)   [7] (85,f)

Everything is fast – constant time! What could possibly go wrong?

What Could Go Wrong
Where to put (26,g)? 26/11 = 2, but (22,a) is already in slot 2!
Keys with the same home bucket are called synonyms.
22 and 26 are synonyms under the /11 hash function.

[0] (3,d)   [1] –   [2] (22,a) ← (26,g)?   [3] (33,c)   [4] –   [5] –   [6] (73,e)   [7] (85,f)

What Could Go Wrong
A collision occurs when two items with different keys have the same home bucket.
A bucket may be able to store more than one item...
If the bucket is full, then we have an overflow.
If buckets are of size 1, then overflows occur on every collision.
We must deal with these somehow!

Hash Table Issues
What is the size of the table? Want it small for efficiency... but big enough to reduce collisions.
What is the hash function? Want it fast (so the time constant is small)... but "random" enough to avoid collisions.
How do we deal with overflows?

Hash Function
First – convert the key to an integer if it is not one already:
char to int, e.g.; repeat and combine for a string; use imagination for other objects.
Next – reduce the space of integers to the table size:
divide by some number (example: /11), or, more often, take the key modulo the table size.
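
A sketch of this two-step recipe for strings; the multiplier 31 is a common choice, not one the slides prescribe.

```cpp
#include <cstddef>
#include <string>

// Step 1: repeat-and-combine the characters into one integer.
// Step 2: reduce the integer into [0, b-1] with the modulus.
std::size_t hashString(const std::string& key, std::size_t b) {
    std::size_t h = 0;
    for (unsigned char c : key)
        h = h * 31 + c;
    return h % b;
}
```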

Hash Function
Let KeySpace be the set of all possible keys. It could be unbounded (with some distribution).
A uniform hash function maps keys over all of KeySpace to buckets so that the number of keys per bucket is about the same for all buckets.
Equivalently: any random key maps to any given bucket with probability 1/b, where b is the number of buckets.

Hash Function
Uniform hash functions make collisions (hence overflows) unlikely when keys are randomly chosen.
For any table size b, if the key space is the 32-bit integers, k % b distributes keys uniformly.
In practice, keys tend to be non-uniform and correlated, so we want the hash function to help break up correlations.
What effect does the modulus b have?

Selecting the Modulus
The modulus is the table size.
If the modulus b is even, then even keys always map to even buckets and odd keys always map to odd buckets.
Not good – bias in the keys leads to bias in the buckets!

Selecting the Modulus
If the modulus b is odd, then even keys map to both even and odd buckets, and odd keys map to both odd and even buckets.
Odd/even bias in the keys does NOT lead to bias in the buckets.
So pick an odd b!

Selecting the Modulus
Similar effects are seen with moduli that are multiples of small primes (3, 5, 7, ...), though the effect diminishes as the prime grows.
Ideally, pick b to be a prime number! Or at least avoid any prime factor less than 20 in b.
For convenient resizing, you may end up with just odd numbers (b → 2b + 1).

Table Size
Typically want a table about twice the size of the number of entries.
Depends on how much space you are willing to "waste" on empty buckets.
Depends also on how expensive it is to deal with collisions/overflows.
Also subject to avoiding bias through the choice of b, which in turn depends on the hash function itself (if it maps pretty randomly, bias may not be a worry).

Collisions and Overflows
Can handle collisions by making the bucket larger, allowing it to hold multiple pairs:
an array, or a linked list (hash chain).
Or, may allow probing into the table on overflow:
linear probing, quadratic probing, random probing.

Linear Probing
If a collision causes an overflow, walk through the table until you find an empty bucket, and place the item there.
To find an item, you must not only look at its home bucket, but keep walking through the table until you either find the item or find an empty bucket (a miss).
Remove must take care to preserve the linear-probe search.
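
A sketch of linear-probing put and find, assuming non-negative integer keys (elements omitted), EMPTY = -1 marking free buckets, and at least one bucket always left empty so probes terminate.

```cpp
#include <vector>

struct ProbedTable {
    static constexpr int EMPTY = -1;   // assumes keys are non-negative
    std::vector<int> key;
    int b;
    explicit ProbedTable(int buckets) : key(buckets, EMPTY), b(buckets) {}

    void put(int k) {                          // insert (or overwrite) k
        int i = k % b;                         // home bucket
        while (key[i] != EMPTY && key[i] != k)
            i = (i + 1) % b;                   // walk, wrapping around
        key[i] = k;
    }
    bool find(int k) const {
        int i = k % b;
        while (key[i] != EMPTY) {              // stop at an empty bucket
            if (key[i] == k) return true;      // hit
            i = (i + 1) % b;
        }
        return false;                          // miss
    }
};
```

With b = 17 and the keys on the next slide, put reproduces the probe trace shown there.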

Linear Probing Example
modulus = b (number of buckets) = 17. Home bucket = key % 17.
Put in pairs whose keys are 6, 12, 34, 29, 28, 11, 23, 7, 0, 33, 30, 45:
6→6; 12→12; 34→0; 29→12, then 13; 28→11; 11→11, 12, 13, 14; 23→6, then 7; 7→7, 8; 0→0, 1; 33→16; 30→13, 14, 15; 45→11, 12, 13, 14, 15, 16, 0, 1, 2

Resulting table:
[0] 34   [1] 0    [2] 45   [3] –    [4] –    [5] –    [6] 6    [7] 23   [8] 7
[9] –    [10] –   [11] 28  [12] 12  [13] 29  [14] 11  [15] 30  [16] 33

Linear Probing Example
modulus = b = 17. Home bucket = key % 17. Find pairs whose keys are 26, 18, 45. (table as above)
26→9: empty, hence a miss.
18→1: filled, but the key is not 18, so try 2; 2: filled, but the key is not 18, so try 3; 3: empty, hence a miss.
45→11: probe 11, 12, 13, 14, 15, 16, 0, 1 – all filled, none is 45 – then 2: found it!

Linear Probing Example
modulus = b = 17. Home bucket = key % 17. Remove the pair whose key is 0. (table as above)
0→0: filled, but the key is not 0 (it is 34), so try 1.
1: filled, and the key is 0, so delete it.
But now a search for 45 would find a hole – a miss!
Search the rest of the cluster for a replacement:
2: the key is 45, whose home bucket is 11 – probing from 11 wraps around through bucket 1, so move 45 back to replace the deleted item.

Linear Probing Example
modulus = b = 17. Home bucket = key % 17. Remove the pair whose key is 29. (original table as above)
29→12: filled, but the key is not 29, so try 13.
13: filled, and the key is 29, so delete it.
But now a search for 11 would find a hole – a miss!
Search the rest of the cluster for a replacement:
14: the key is 11, whose home bucket 11 comes at or before the hole at 13, so move 11 to replace the 29 item.

Linear Probing Example (continued)
Can we stop? No – continue to search the cluster.
15: the key is 30, home bucket 13, so shift it left into the hole and continue.
16: the key is 33, home bucket 16 – what do we do? We can't shift it past its own home bucket. Are we done?

Linear Probing Example (continued)
Not yet – there are still non-empty buckets in the cluster.
0: the key is 34, home bucket 0 – can't shift; continue.
1: the key is 0, home bucket 0 – can't shift it past its home bucket, so it stays; continue.
2: the key is 45, home bucket 11, so shift it into the hole!
3: empty – done!
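
A sketch of this remove-and-shift logic for the same EMPTY = -1 layout as the earlier put/find sketch; the between helper (our name) decides whether a candidate's home bucket lies after the hole, in which case it must stay put.

```cpp
#include <vector>

// true if pos lies cyclically in the half-open interval (lo, hi]
static bool between(int lo, int pos, int hi) {
    if (lo <= hi) return lo < pos && pos <= hi;
    return lo < pos || pos <= hi;
}

// Delete k, then scan the rest of the cluster, shifting back any item
// whose home bucket is at or before the hole (as in the key-29 example).
void remove(std::vector<int>& key, int b, int k) {
    int i = k % b;
    while (key[i] != -1 && key[i] != k) i = (i + 1) % b;
    if (key[i] == -1) return;               // miss: nothing to delete
    key[i] = -1;                            // make the hole
    int hole = i, j = (i + 1) % b;
    while (key[j] != -1) {                  // scan to the cluster's end
        int home = key[j] % b;
        if (!between(hole, home, j)) {      // home not after the hole:
            key[hole] = key[j];             // shift into the hole
            key[j] = -1;
            hole = j;                       // the hole moves here
        }
        j = (j + 1) % b;
    }
}
```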

Linear Probing Performance
Worst case for insert/find/remove: Θ(N), where N is the number of items.
When does this happen? When all items end up in one big cluster!
Observation: inserting a key with one hash value can make searches for keys with different hash values take much longer. Clustering!

Linear Probing Expected Performance (large N)
Loading density α = #items / #buckets; α = 12/17 in the example.
S_N = expected number of buckets examined on a search hit
U_N = expected number of buckets examined on a search miss
Insertion and removal are governed by U_N.

Linear Probing
Loading density α = #items / #buckets; α = 12/17 in the example; α < 0.75 recommended.
S_N ≈ (1/2)(1 + 1/(1-α))
U_N ≈ (1/2)(1 + 1/(1-α)^2)

α      S_N    U_N
0.5    1.5    2.5
0.75   2.5    8.5
0.9    5.5    50.5

Linear Probing Design
Suppose you want at most 10 compares on a hit, and at most 13 compares on a miss. What is the most your load density should be?
S_N ≈ (1/2)(1 + 1/(1-α)) ≤ 10
U_N ≈ (1/2)(1 + 1/(1-α)^2) ≤ 13
Work it out. Left half do hits, right do misses.

Linear Probing Design
S_N ≈ (1/2)(1 + 1/(1-α)) ≤ 10
1/(1-α) ≤ 19
1/19 ≤ 1-α
α ≤ 18/19
U_N ≈ (1/2)(1 + 1/(1-α)^2) ≤ 13
1/(1-α)^2 ≤ 25
1/(1-α) ≤ 5
α ≤ 4/5
Take the smaller of the two, so α ≤ 4/5.

Linear Probing Design
Suppose you want at most 10 compares on a hit, and at most 13 compares on a miss: your load density should be ≤ 4/5.
So if you know there will be at most 1000 entries, design a table of size...
b = 1000 × 5/4 = 1250... but maybe a better choice...
Might pick 1259 as the smallest b ≥ 1250 that has no prime factor < 20.
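
A sketch of this sizing rule under the assumed name pickTableSize: start at the smallest b with n/b ≤ 4/5, then step forward until b has no prime factor below 20 (n assumed large enough that b exceeds the small primes themselves).

```cpp
// Smallest b >= ceil(5n/4) with no prime factor below 20.
int pickTableSize(int n) {
    const int smallPrimes[] = {2, 3, 5, 7, 11, 13, 17, 19};
    for (int b = (5 * n + 3) / 4; ; ++b) {
        bool ok = true;
        for (int p : smallPrimes)
            if (b % p == 0) { ok = false; break; }
        if (ok) return b;          // pickTableSize(1000) == 1259
    }
}
```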

Linear Probing Design
Suppose you want at most 10 compares on a hit and at most 13 on a miss, so your load density should be ≤ 4/5.
If you don't know how many entries there will be – then what?
Start out with some "reasonable" size, and "double" the table if the load exceeds 4/5.
The load is easy to monitor...

Linear Probing Design
Doubling the table size...
But when we increase the table size, we also change b, which changes the hash function!
So we must re-enter all items into the new hash table.
When do we shrink the hash table? Certainly not before load < (4/5)/2 = 0.4.
Hysteresis => shrink when load < 0.2.
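
A sketch of the grow step for the linear-probing layout used earlier (-1 = EMPTY): pick the new odd size 2b+1, then re-enter every item under the new modulus.

```cpp
#include <vector>

// Rehash into a table of size 2b+1 (odd, per the modulus advice).
std::vector<int> grow(const std::vector<int>& key, int& b) {
    int newB = 2 * b + 1;
    std::vector<int> fresh(newB, -1);
    for (int k : key) {
        if (k == -1) continue;               // skip empty buckets
        int i = k % newB;                    // new home bucket
        while (fresh[i] != -1) i = (i + 1) % newB;
        fresh[i] = k;                        // linear-probe insert
    }
    b = newB;
    return fresh;
}
```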

Non-Linear Probing
Linear probing ends up making big clusters, which increases the time for everything!
Remember – clusters are runs of adjacent non-empty hash table entries.
If the home bucket is in a cluster, you add to it, making big clusters even bigger, which increases time even for non-synonyms!
Other strategies: quadratic probing, random probing.

Non-Linear Probing
Quadratic probing: instead of proceeding to the very next slot (H, H+1, H+2, H+3, ...), square the retry number to get the distance from the home bucket: H, H+1^2, H+2^2, H+3^2, ...
This way you quickly escape the cluster in which the home bucket lives, while still keeping some locality.
Used by the Berkeley Fast File System.
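
The probe sequence itself, as a one-liner sketch (names ours):

```cpp
// Bucket examined on the i-th retry: home + i^2, wrapped mod b.
int quadraticProbe(int home, int i, int b) {
    return (home + i * i) % b;
}
```

Unlike linear probing, this sequence is not guaranteed to visit every bucket, so an insertion can fail even with empty slots remaining.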

Non-Linear Probing
Random probing: instead of proceeding to the very next slot (H, H+1, H+2, H+3, ...), pick a random distance from H: H, H+R(1), H+R(2), H+R(3), ... (all modulo b).
Where does R(.) come from?

Non-Linear Probing
Random probing: can use a (pseudo-)random permutation R(.) of all the non-zero bucket indices, setting R(0) = 0.
This way you quickly escape the cluster in which the home bucket lives.
But – what about collisions? All synonyms will follow the same sequence.
Also – poor locality.

Non-Linear Probing
Double hashing: instead of proceeding to the very next slot (H, H+1, H+2, H+3, ...), pick a random probe stride R: H, H+R, H+2R, H+3R, ...
This way you quickly escape the cluster in which the home bucket lives.
Keys with different home buckets are not likely to add to the same "cluster" now.

Double Hashing
But what could go wrong? What if R is not relatively prime to the table size (hence to the modulus b)?
Then we don't consider all of the slots in the table! BAD.
So make sure R is relatively prime to b.
But where does R come from? Can use a second hash function... subject to the constraint above.
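
A sketch of a double-hashing probe, assuming a prime table size b so that every stride in [1, b-1] is automatically relatively prime to b; the stride formula is one common choice of second hash, not the slides' specific function.

```cpp
// Bucket examined on the i-th attempt for key k, table size b (prime).
int doubleHashProbe(int k, int i, int b) {
    int home = k % b;                  // first hash: home bucket
    int stride = 1 + (k % (b - 1));    // second hash: stride in [1, b-1]
    return (home + i * stride) % b;
}
```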

Double Hashing
Loading density α = #items / #buckets
S_N ≈ (1/α) ln(1/(1-α))
U_N ≈ 1/(1-α)
(the analysis is complex)

α      S_N    U_N
0.5    1.4    2
0.75   1.8    4
0.9    2.6    10

Hash Chains
An alternative to probing: use the hash table entries to point to linked lists, each holding all the synonyms with that home bucket.

Hash Chains
Advantage: an insertion never increases the search time for a non-synonym.
Disadvantage: more complex structure, more space.
Example: modulus = b (number of buckets) = 17. Home bucket = key % 17. Put in pairs whose keys are 6, 12, 34, 29, 28, 11, 23, 7, 0, 33, 30, 45:

[0]  → 34 → 0
[6]  → 6 → 23
[7]  → 7
[11] → 28 → 11 → 45
[12] → 12 → 29
[13] → 30
[16] → 33
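
A sketch of chaining, assuming integer keys (elements omitted) and insertion at the front of each chain:

```cpp
#include <forward_list>
#include <vector>

struct ChainedTable {
    std::vector<std::forward_list<int>> bucket;   // one chain per bucket
    int b;
    explicit ChainedTable(int buckets) : bucket(buckets), b(buckets) {}

    bool find(int k) const {
        for (int x : bucket[k % b])    // scan only this key's synonyms
            if (x == k) return true;
        return false;
    }
    void put(int k) {                  // O(1) add after duplicate check
        if (!find(k)) bucket[k % b].push_front(k);
    }
};
```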

Hash Chains
Advantage: an insertion never increases the search time for a non-synonym.
Disadvantage: more complex structure, more space.