
Joe Meehean

 BST: easy to implement; average-case times O(log N), worst-case times O(N)  AVL trees: harder to implement; worst-case times O(log N)  Can we do better in the average case?

 “Dictionary” ADT: average-case time O(1) for lookup, insert, and delete  Idea: store keys (and associated values) in an array; compute each key’s array index as a function of its value; take advantage of the array’s fast random access  An alternative implementation for sets and maps

 Goal: store info about a company’s 50 employees; each employee has a unique employee ID in the range 100–200  Approach: use an array of size 101 (the range of IDs); store employee E’s info in array[E-100]  Result: insert, lookup, and delete are each O(1); wasted space: 51 locations

 Less functionality than trees  Hash tables cannot efficiently: find min, find max, or print the entire table in sorted order  Must be very careful how we use them

 Hash table: the underlying array  Hash function: converts a key to an index into the table; in the example, hash(x) = x - 100  TableSize: size of the underlying array or vector  Bucket: a single cell of the hash table array  Collision: when two keys hash to the same bucket

 Keys we use must already have a hash function, or we must be able to define a good hash function for them  Keys must overload the following operators: == and !=

 How do we make a good hash function?  What should we do about collisions?  How large should we make our hash table?

 Hash function should be fast  Keys should be evenly distributed: different keys should have different hash values  Should reduce space needed: e.g., student IDs are 10 digits, but we do not need an array of size 10,000,000,000 when there are only ~3,000 students

 Convert the key to an int n: scramble up the data so it spreads over the entire integer space  Return n % TableSize: ensures that n doesn’t fall off the end of the table

 Method 1: convert each char to an int, sum them, return sum % TableSize  Advantages: simple; time is O(key length)

 Method 1: convert each char to an int, sum them, return sum % TableSize  Problems: short keys may not reach the end of the table (sum of characters is far less than TableSize); maps all permutations to the same hash (hash(“able”) = hash(“bale”)); time is O(key length)

 Method 2: multiply individual chars by different values, then sum: a[0]·37^n + a[1]·37^(n-1) + … + a[n-1]·37, i.e., Σ a[i]·37^(n-i)  Advantages: produces a big range of values; permutations hash to different values
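A minimal sketch of Method 2 in the document's C++ style (the function name and the use of unsigned arithmetic are illustrative choices, not from the slides):

```cpp
#include <string>

// Method 2: multiply the running hash by 37 and add each character,
// then reduce modulo the table size. Using an unsigned accumulator
// gives well-defined wrap-around instead of signed overflow, so the
// result is never negative.
unsigned int hashString(const std::string& key, unsigned int tableSize) {
    unsigned int h = 0;
    for (char c : key)
        h = 37 * h + static_cast<unsigned char>(c);
    return h % tableSize;
}
```

Note that unlike Method 1, permutations such as "able" and "bale" hash to different values because each character is weighted by a different power of 37.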

 Method 2: multiply individual chars by different values, then sum  Disadvantages: relies on integer overflow; need to worry about negative hashes  Handling a negative hash: hash = hash % TableSize; if (hash < 0) hash += TableSize
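The slide's fix for a negative hash can be sketched as a small helper (the function name is illustrative); in C++, % truncates toward zero, so a negative accumulator stays negative after the modulus and must be shifted back into range:

```cpp
// Map a possibly-negative signed hash back into [0, tableSize),
// exactly as the slide suggests: take the modulus, then add
// TableSize once if the result came out negative.
int fixNegativeHash(int hash, int tableSize) {
    hash %= tableSize;        // may be negative in C++ (truncated division)
    if (hash < 0)
        hash += tableSize;    // shift into the valid index range
    return hash;
}
```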

 Fast hash vs. evenly distributed hash: a faster hash often leads to a less even distribution; an even distribution leads to a slower hash  String example: could use only some of the characters; faster, but more collisions are likely

 How do we make a good hash function?  What should we do about collisions?  How large should we make our hash table?

 What if two keys hash to the same bucket (array entry)?  Make array entries linked lists (or trees): different keys with the same hash value are stored in the same list (or tree); commonly called chained bucket hashing, or just chaining

 TableSize = 10  Keys: 10-digit student IDs  hashfn = sum of digits % TableSize  (table: entries A–E with columns ID (Key), Value, Sum, Hash Code)

(diagram: the hash table’s buckets, with the colliding entries among A–E chained together in lists)

 During a lookup, how can we tell which value we want if there are > 1 entries in the bucket?  Compare the keys: buckets store keys and values

 How do we make a good hash function?  What should we do about collisions?  How large should we make our hash table?


 Related to the hashing function  Some hashing functions lead to data clustered together  Using a prime TableSize helps resolve this issue: the hashing function is unlikely to share a factor with the table size

 If the number of keys is known in advance: make the hash table a little larger, a prime near 1.25 × the number of keys; a little room to avoid collisions; trades space for potentially faster lookup  If the number of keys is not known in advance: plan to expand the array as needed (coming up in another lecture)

 Lookup Key k: 1. compute h = hash(k) 2. see if k is in the list in hashtable[h]  Insert Key k: 1. compute h = hash(k) 2. make sure k is not already in hashtable[h] 3. add k to the list in hashtable[h]  Delete Key k: 1. compute h = hash(k) 2. remove k from the list in hashtable[h]
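The three operations above can be sketched with a minimal chained hash set for ints (the class name is illustrative, and hash(k) = k % TableSize is used for brevity):

```cpp
#include <list>
#include <vector>

// Chained hash table: each array cell holds a list of the keys
// that hashed to that bucket.
class ChainedHashSet {
    std::vector<std::list<int>> table;
public:
    explicit ChainedHashSet(int size) : table(size) {}
    int hash(int k) const {
        int n = static_cast<int>(table.size());
        return ((k % n) + n) % n;             // keep negative keys in range
    }
    bool contains(int k) const {              // lookup: scan the bucket's list
        for (int x : table[hash(k)])
            if (x == k) return true;
        return false;
    }
    void insert(int k) {                      // insert: no duplicates allowed
        if (!contains(k)) table[hash(k)].push_back(k);
    }
    void remove(int k) {                      // delete: erase from the bucket's list
        table[hash(k)].remove(k);
    }
};
```

With TableSize = 10, the keys 15 and 25 both hash to bucket 5 and end up chained in the same list.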

template <class K, class Hash>
class HashSet {
private:
    vector<list<K>> table;   // each bucket holds a chain of keys
    int currentSize;
    Hash hashfn;
public:
    …
    bool contains(const K&) const;
    void insert(const K&);
    void remove(const K&);
};


 Recall chaining hash tables: array cells store linked lists; 2 keys with the same hash end up in the same list  Chaining hash tables require 2 data structures: a hash table and a linked list  Can we solve collisions with more hashing, using just one data structure?

 No linked lists in the array cells  Collisions are handled using an alternative hash: try cells h0(x), h1(x), h2(x), … until an empty cell is found, where hi(x) = hash(x) + f(i) and f(i) is the collision resolution strategy  Probing: looking for alternative hash locations


 f(i) is a linear function, often f(i) = i  If a collision occurs, look in the next cell: hash(x) + 1; keep looking until an empty cell is found: hash(x) + 2, hash(x) + 3, …; use the modulus to wrap around the table  Should eventually find an empty cell if the table is not full
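Linear probing's search for a cell can be sketched as a standalone helper (the function name and the use of std::optional to mark empty cells are illustrative choices; TableSize = 10 matches the example that follows):

```cpp
#include <optional>
#include <vector>

// Linear probing: h_i(x) = (hash(x) + i) % TableSize.
// Returns the index where key x lives or should be inserted,
// or -1 if every cell is occupied by other keys.
int probeLinear(const std::vector<std::optional<int>>& table, int x) {
    int n = static_cast<int>(table.size());
    int h = x % n;                         // h_0(x), assuming x >= 0
    for (int i = 0; i < n; ++i) {
        int idx = (h + i) % n;             // hash(x) + i, wrapped with modulus
        if (!table[idx] || *table[idx] == x)
            return idx;                    // empty cell, or the key itself
    }
    return -1;                             // table is full
}
```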

 Simple hash: h(x) = x % TableSize, with TableSize = 10  Insert 89: h0(89) = 9, cell 9 is empty  Insert 18: h0(18) = 8, cell 8 is empty  Insert 49: h0(49) = 9 collides with 89; h1(49) = 0 is empty  Insert 58: h0(58) = 8, h1(58) = 9, and h2(58) = 0 are all occupied; h3(58) = 1 is empty

 Advantages: no need for lists; the collision resolution function is fast  Disadvantages: requires more bookkeeping; primary clustering

 What if an entry is deleted and we try to look up another entry that collided with it?  Delete 89: cell 9 becomes empty  Lookup 49: h0(49) = 9 is now empty, so the search stops and incorrectly reports that 49 is not in the table

 Need extra information per cell  Differentiate between states: ACTIVE (cell contains a valid key), EMPTY (cell never contained a valid key), DELETED (cell previously contained a valid key)  All cells start EMPTY  Lookup: keep looking until you find the key or an EMPTY cell
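The lazy-deletion lookup rule can be sketched as follows (the names are illustrative; linear probing and hash(k) = k % TableSize keep the sketch short):

```cpp
#include <vector>

enum class State { EMPTY, ACTIVE, DELETED };

struct Entry {
    int key;
    State info;
};

// Lazy deletion: probe past DELETED cells, stop only at the key
// itself or at a truly EMPTY cell (which proves the key is absent).
bool lazyContains(const std::vector<Entry>& table, int k) {
    int n = static_cast<int>(table.size());
    for (int i = 0; i < n; ++i) {
        const Entry& e = table[(k % n + i) % n];
        if (e.info == State::EMPTY)
            return false;                  // never used: key cannot be further on
        if (e.info == State::ACTIVE && e.key == k)
            return true;
        // DELETED or a different key: keep probing
    }
    return false;
}
```

This reproduces the slides' scenario: with 89 marked DELETED in cell 9, a lookup of 49 probes past cell 9 and still finds 49 in cell 0.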

Delete 89: the cell’s state changes from ACTIVE to DELETED (the key is left in place)  Lookup 49: h0(49) = 9 is DELETED, so keep probing; h1(49) = 0 is ACTIVE and holds 49: found

 Should we?

 Inserting into deleted cells  Insert: find the 1st empty cell to prevent duplicates, but find the 1st empty or deleted cell to insert into; this doubles the run time  Special case: insert a key, delete it, reinsert it; can insert into the previously occupied deleted cell; lookup knows the item is not in the table when it finds the DELETED entry

template <class K> class HashSet;   // forward declaration for the friend below

template <class K>
class HashEntry {
public:
    enum EntryType { ACTIVE, EMPTY, DELETED };
private:
    K element;
    EntryType info;
    friend class HashSet<K>;
};

template <class K>
class HashSet {
private:
    vector<HashEntry<K>> table;
    int currentSize;
    ...
};


 No more bucket lists  Use a collision resolution strategy: hi(x) = hash(x) + f(i)  If a collision occurs, try the next cell, f(i) = i; repeat until you find an empty cell  Need extra bookkeeping: ACTIVE, EMPTY, DELETED

 What could go wrong?  How can we fix it? “Professor Meehean, you haven’t told us what ‘it’ is yet.”

 Clusters of data: require several attempts to resolve collisions, which makes the cluster even bigger; too many 9’s eat up all of 8’s space, then the 8’s eat up 7’s space, etc.  Inserting keys into space that should be empty results in collisions: clusters have overrun whole chunks of the hash table

Insert 30: h0(30), h1(30), and h2(30) all hit occupied cells inside an existing cluster; only h3(30) finds an empty cell

 Only gets worse as load factor gets larger  As memory use gets more efficient  Performance gets worse 59

 Primary clustering is caused by the linear nature of linear probing: collisions end up right next to each other  What if we jumped farther away on a collision? f(i) = i²  If a collision occurs: hash(x) + 1, hash(x) + 4, hash(x) + 9, …
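The quadratic probe sequence is easy to state as a helper (the function name is illustrative; x is assumed non-negative):

```cpp
// Quadratic probing: h_i(x) = (hash(x) + i*i) % TableSize,
// with hash(x) = x % TableSize. Successive probes jump by
// 1, 4, 9, ... cells instead of walking through a cluster.
int quadraticProbeIndex(int x, int i, int tableSize) {
    return (x % tableSize + i * i) % tableSize;
}
```

With TableSize = 10, inserting 58 probes cells 8, 9, then 2, jumping clear of the cluster around cells 8 to 0.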

61

 hi(x) = h(x) + i²  Insert 58: h0(58) collides; h1(58) = h(58) + 1 also collides; h2(58) = h(58) + 4 finds an empty cell

 Quadratic probing eliminates primary clustering  But keys with the same hash probe the same alternative cells: clusters still exist per bucket, just spread out; this is called secondary clustering  Can we beat secondary clustering?

 If the first hashing function causes a collision, try a second hashing function: hi(x) = hash(x) + f(i), with f(i) = i·hash2(x)  h0(x) = hash(x)  h1(x) = hash(x) + hash2(x)  h2(x) = hash(x) + 2·hash2(x)  h3(x) = hash(x) + 3·hash2(x)

 hash2(x) must be carefully selected  It can never be 0: otherwise h1(x) = h2(x) = … = hn(x) = hash(x), and the probe never moves  It must eventually probe all cells: quadratic probing reaches only half of them; this requires TableSize to be prime

 hash2(x) = R - (x % R), where R is a prime smaller than TableSize (e.g., the previous value of TableSize)
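Putting the two hashes together gives a small double-hashing helper (the function name is illustrative; x is assumed non-negative):

```cpp
// Double hashing: h_i(x) = (hash(x) + i * hash2(x)) % TableSize,
// with hash(x) = x % tableSize and hash2(x) = R - (x % R) for a
// prime R < tableSize. hash2 is never 0, so the probe always moves.
int doubleHashIndex(int x, int i, int tableSize, int R) {
    int step = R - (x % R);                        // hash2(x), in [1, R]
    return (x % tableSize + i * step) % tableSize; // i-th probe location
}
```

With TableSize = 10 and R = 7, inserting 49 gives hash2(49) = 7, so the second probe lands at (9 + 7) % 10 = 6, matching the worked example below.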

hi(x) = hash(x) + i·hash2(x), with hash2(x) = R - (x % R) and R = 7  Insert 49: h0(49) = 9 collides  hash2(49) = 7 - (49 % 7) = 7 - 0 = 7, so h1(49) = 9 + 7 = 16, which wraps around to 6

Why a prime TableSize is important  hi(x) = ((x % TableSize) + i·hash2(x)) % TableSize, with hash2(x) = 7 - (x % 7) and TableSize = 10  Insert 23: hash2(23) = 7 - (23 % 7) = 7 - 2 = 5, so hi(23) = (3 + 5i) % 10  h0 = 3 collides  h1 = (3 + 5) % 10 = 8 collides  h2 = (3 + 10) % 10 = 3 collides again  h3 = (3 + 15) % 10 = 8 collides again  Because 5 is a factor of 10, the probe sequence wraps forever, landing on the same two buckets; if TableSize is prime, the result of hash2(x) can never be a factor of it


 What to do when the hash table gets too full?  A problem for both chained and probing hash tables: degrades performance; may cause insert failure for quadratic probing

 Create another table 2× the size (really, the nearest prime to 2× the table size)  Scan the original table: compute the new hash for each valid entry and insert it into the new table
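The rehash procedure can be sketched for a probing table of ints (the function name is illustrative; linear probing and 2N+1 as a stand-in for "nearest prime near 2×" keep the sketch short):

```cpp
#include <optional>
#include <vector>

// Rehash: allocate a larger table, then re-insert every occupied
// cell of the old table, recomputing each hash with the NEW size.
std::vector<std::optional<int>>
rehash(const std::vector<std::optional<int>>& old) {
    int newSize = 2 * static_cast<int>(old.size()) + 1; // stand-in for nearest prime
    std::vector<std::optional<int>> fresh(newSize);     // all cells start empty
    for (const auto& cell : old) {
        if (!cell) continue;                            // skip empty cells
        int h = *cell % newSize;                        // new hash, new modulus
        while (fresh[h]) h = (h + 1) % newSize;         // linear probe to a free slot
        fresh[h] = *cell;
    }
    return fresh;
}
```

Starting from a size-5 table holding 15, 6, and 24, the new size-11 table places them at 15 % 11 = 4, 6 % 11 = 6, and 24 % 11 = 2.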

 Assume quadratic probing with hash(x) = x % TableSize  Insert 23: h0, h1, and h2 all hit occupied cells before an empty cell is found; the table is getting too full  Rehash: allocate a new, larger table with every cell EMPTY, then scan the old table cell by cell and re-insert each ACTIVE entry using the new table size

 Rehashing is O(N)  For initialization or offline (batch) use, the cost is amortized: at least N/2 inserts between rehashes  For interactive use it can cause periodic unresponsiveness: the program is snappy for N/2 - 1 operations, then the N/2th causes a rehash


 How do we use C++ hash maps and hash sets?  When should we use a map backed by a hash table vs. one backed by a tree (e.g., BST, B+)?  When should we use a set backed by a hash table vs. one backed by a tree?

 unordered_map and unordered_set: alternative implementations of map and set that use a hash table; require a hash unary functor and an equals predicate functor  C++11 only
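A sketch of supplying the hash functor to std::unordered_set (the functor and alias names are illustrative; the default std::equal_to serves as the equals predicate, so only the hash needs to be provided here):

```cpp
#include <string>
#include <unordered_set>

// A hash unary functor for strings, reusing the 37-multiplier
// scheme from earlier in the lecture.
struct StringHash {
    std::size_t operator()(const std::string& s) const {
        std::size_t h = 0;
        for (char c : s)
            h = 37 * h + static_cast<unsigned char>(c);
        return h;
    }
};

// unordered_set<Key, Hash, KeyEqual>: KeyEqual defaults to
// std::equal_to<Key>, which uses operator== on the keys.
using StringSet = std::unordered_set<std::string, StringHash>;
```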

 Lookup Key k: 1. compute h = hash(k) 2. see if k is in the list in hashtable[h]  Time for lookup = time for step 1 + time for step 2  Worst case for step 2: all keys hash to the same index; O(# keys in table) = O(N)

 If the hash function distributes keys uniformly: the probability that hash(k) = h is 1/TableSize for every h in the range 0 to TableSize - 1  Then the expected number of keys per bucket is N/TableSize; if N ≤ TableSize, that is at most 1

 The loophole compacts to: IF the hash function distributes keys uniformly, AND the subset of keys actually stored distributes uniformly, AND # of keys ≤ TableSize, AND the hash function is O(1), THEN the average time for lookup is O(1)

 Insert Key k: compute h = hash(k); put k in the table at or near table[h]  Complexity: the hash function should be O(1); collision resolution is O(N) (chained: must check all keys in the list; probing: a probe may hit every other filled cell)  Worst case: O(N); loophole average case: O(1)

 Delete Key k: compute h = hash(k); remove k from at or near table[h]  Complexity: same as lookup and insert; O(N) in the worst case, O(1) in the loophole average case


 Loophole (limited collisions): O(1) average complexity for lookup, insert, and delete  Worst-case times: insert is O(N) even with the loophole, since rehashing makes this possible; lookup and delete are O(N)

 An alternative implementation for sets and maps, but…  With a balanced tree, all operations are O(log N): safe, middle-of-the-road performance  With hash implementations you gamble: potentially O(1) operations, potentially O(N) operations  Some operations are not efficient: print in sorted order; find largest/smallest

 Use a hash table only when you can be positive there will be a small # of hash key collisions: not just a small probability, but an actual worst-case small # of collisions  1. All keys are known in advance and hashing doesn’t cause a large # of collisions  2. The map/set will always store all keys: no collisions due to the modulus, no key similarities due to a selected sample
