CMSC 341 Hashing.

Slides:



Advertisements
Similar presentations
Hashing General idea Hash function Separate Chaining Open Addressing
Advertisements

Lecture 6 Hashing. Motivating Example Want to store a list whose elements are integers between 1 and 5 Will define an array of size 5, and if the list.
Lecture 11 oct 6 Goals: hashing hash functions chaining closed hashing application of hashing.
What we learn with pleasure we never forget. Alfred Mercier Smitha N Pai.
Lecture 10 Sept 29 Goals: hashing dictionary operations general idea of hashing hash functions chaining closed hashing.
1 CSE 326: Data Structures Hash Tables Autumn 2007 Lecture 14.
Hash Tables1 Part E Hash Tables  
Hash Tables1 Part E Hash Tables  
Tirgul 7. Find an efficient implementation of a dynamic collection of elements with unique keys Supported Operations: Insert, Search and Delete. The keys.
Lecture 11 oct 7 Goals: hashing hash functions chaining closed hashing application of hashing.
Hashing General idea: Get a large array
L. Grewe. Computing hash function for a string Horner’s rule: (( … (a 0 x + a 1 ) x + a 2 ) x + … + a n-2 )x + a n-1 ) int hash( const string & key )
Lecture 6 Hashing. Motivating Example Want to store a list whose elements are integers between 1 and 5 Will define an array of size 5, and if the list.
Hashing. Hashing as a Data Structure Performs operations in O(c) –Insert –Delete –Find Is not suitable for –FindMin –FindMax –Sort or output as sorted.
Hashtables David Kauchak cs302 Spring Administrative Talk today at lunch Midterm must take it by Friday at 6pm No assignment over the break.
CS 202, Spring 2003 Fundamental Structures of Computer Science II Bilkent University1 Hashing CS 202 – Fundamental Structures of Computer Science II Bilkent.
IKI 10100: Data Structures & Algorithms Ruli Manurung (acknowledgments to Denny & Ade Azurat) 1 Fasilkom UI Ruli Manurung (Fasilkom UI)IKI10100: Lecture8.
1.  We’ll discuss the hash table ADT which supports only a subset of the operations allowed by binary search trees.  The implementation of hash tables.
DATA STRUCTURES AND ALGORITHMS Lecture Notes 7 Prepared by İnanç TAHRALI.
Hashing Vishnu Kotrajaras, PhD. What do we want to do? Insert Delete find (constant time) No sorting No Findmin findmax.
Data Structures Hash Tables. Hashing Tables l Motivation: symbol tables n A compiler uses a symbol table to relate symbols to associated data u Symbols:
Hashing Sections 10.2 – 10.3 CS 302 Dr. George Bebis.
WEEK 1 Hashing CE222 Dr. Senem Kumova Metin
Lecture 17 April 11, 11 Chapter 5, Hashing dictionary operations general idea of hashing hash functions chaining closed hashing.
Chapter 11 Hash Tables © John Urrutia 2014, All Rights Reserved1.
COSC 2007 Data Structures II Chapter 13 Advanced Implementation of Tables IV.
Tirgul 11 Notes Hash tables –reminder –examples –some new material.
Hashtables. An Abstract data type that supports the following operations: –Insert –Find –Remove Search trees can be used for the same operations but require.
Hashtables David Kauchak cs302 Spring Administrative Midterm must take it by Friday at 6pm No assignment over the break.
CMSC 341 Hashing Readings: Chapter 5. Announcements Midterm II on Nov 7 Review out Oct 29 HW 5 due Thursday CMSC 341 Hashing 2.
Hashing Vishnu Kotrajaras, PhD Nattee Niparnan, PhD.
1 Designing Hash Tables Sections 5.3, 5.4, 5.5, 5.6.
Fundamental Structures of Computer Science II
CE 221 Data Structures and Algorithms
Hashing (part 2) CSE 2011 Winter March 2018.
Hashing.
Data Structures Using C++ 2E
Hashing Problem: store and retrieving an item using its key (for example, ID number, name) Linked List takes O(N) time Binary Search Tree take O(logN)
Hashing Alexandra Stefan.
Week 8 - Wednesday CS221.
Hashing Alexandra Stefan.
Data Structures Using C++ 2E
Hash functions Open addressing
Advanced Associative Structures
Lecture 17 April 11, 11 Chapter 5, Hashing dictionary operations
Hash Tables.
Dictionaries and Their Implementations
Hashing.
CMSC 341 Hashing 12/2/2018.
CSCE 3110 Data Structures & Algorithm Analysis
CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.
Hashing Alexandra Stefan.
CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.
CS202 - Fundamental Structures of Computer Science II
CMSC 341 Hashing 2/18/2019.
CMSC 341 Hashing.
Hashing Sections 10.2 – 10.3 Lecture 26 CS302 Data Structures
EE 312 Software Design and Implementation I
CMSC 341 Hashing 4/11/2019.
Data Structures – Week #7
CMSC 341 Hashing 4/27/2019.
Collision Handling Collisions occur when different elements are mapped to the same cell.
Hashing Vishnu Kotrajaras, PhD.
Hashing.
CMSC 341 Lecture 12.
What we learn with pleasure we never forget. Alfred Mercier
CMSC 341 Lecture 12.
Hashing.
EE 312 Software Design and Implementation I
Lecture-Hashing.
Presentation transcript:

CMSC 341 Hashing

The Basic Problem We have lots of data to store. We desire efficient – O( 1 ) – performance for insertion, deletion and searching. Too much (wasted) memory is required if we use an array indexed by the data’s key. The solution is a “hash table”. 8/30/2007 CMSC 341 Hashing

Hash Table Basic Idea Desired Properties of h(k) 1 2 m-1 Basic Idea The hash table is an array of size ‘m’ The storage index for an item determined by a hash function h(k): U  {0, 1, …, m-1} Desired Properties of h(k) easy to compute uniform distribution of keys over {0, 1, …, m-1} when h(k1) = h(k2) for k1, k2  U , we have a collision U is the universe of key values h(k) is a function from the set of all possible key values to an int in the range 0 to (m-1) inclusive Hashing integers is easy. The key values in the set U are already integers. H(k) just needs to distribute them over the range 0…m-1 in some uniform way. There are basically two ways to do this. 8/30/2007 CMSC 341 Hashing

Division Method The hash function: h( k ) = k mod m where m is the table size. m must be chosen to spread keys evenly. Poor choice: m = a power of 10 Poor choice: m = 2b, b> 1 A good choice of m is a prime number. Table should be no more than 80% full. Choose m as smallest prime number greater than mmin, where mmin = (expected number of entries)/0.8 Factor of 10 what if keys all end in zero? 70 mod 10 = 0 80 mod 10 = 0 90 mod 10 = 0 Power of 2 h)k) = k mod 2^b just extracts the lower order bits 19 mod 4 = 3 10011 mod 2^2 =00011 (the bottom two bits) 8/30/2007 CMSC 341 Hashing

Multiplication Method The hash function: h( k ) =  m( kA -  kA  )  where A is some real positive constant. A very good choice of A is the inverse of the “golden ratio.” Given two positive numbers x and y, the ratio x/y is the “golden ratio” if  = x/y = (x+y)/x The golden ratio: x2 - xy - y2 = 0  2 -  - 1 = 0  = (1 + sqrt(5))/2 = 1.618033989… ~= Fibi/Fibi-1 Note that kA - kA lies between 0 and 1 8/30/2007 CMSC 341 Hashing

Multiplication Method (cont.) Because of the relationship of the golden ratio to Fibonacci numbers, this particular value of A in the multiplication method is called “Fibonacci hashing.” Some values of h( k ) = m(k -1 - k -1 ) = 0 for k = 0 = 0.618m for k = 1 (-1 = 1/ 1.618… = 0.618…) = 0.236m for k = 2 = 0.854m for k = 3 = 0.472m for k = 4 = 0.090m for k = 5 = 0.708m for k = 6 = 0.326m for k = 7 = … = 0.777m for k = 32 8/30/2007 CMSC 341 Hashing

8/30/2007 CMSC 341 Hashing

Non-integer Keys In order to have a non-integer key, must first convert to a positive integer: h( k ) = g( f( k ) ) with f: U  integer g: I  {0 .. m-1} Suppose the keys are strings. How can we convert a string (or characters) into an integer value? Lots of possibilities. Most of which are not very good. 1. Sum the ascii values of the characters in the string. If the string S has length N, and the characters are c0, …, cn-1, then f(k) = sum(ci) if g is the division method, then h(k) = (sum(ci)) mod tablesize Problem: if string length is short, the integers will not span the table (since ascii values are 0…127), max integer will be 127n 2. Compute a polynomial function using the characters as the constants. Let B be the base of the polynomial f(k) = sum(ciB^i) What should be the value of B? Try smallest prime number > 32 = 37 Use prime number in order to involve all bits in integer Prob: integer might overflow and become negative. 8/30/2007 CMSC 341 Hashing

Horner’s Rule static int hash(String key, int tableSize) { int hashVal = 0; for (int i = 0; i < key.length(); i++) hashVal = 37 * hashVal + key.charAt(i); hashVal %= tableSize; if(hashVal < 0) hashVal += tableSize; return hashVal; } 8/30/2007 CMSC 341 Hashing

HashTable Class public class SeparateChainingHashTable<AnyType> { public SeparateChainingHashTable( ){/* Later */} public SeparateChainingHashTable(int size){/*Later*/} public void insert( AnyType x ){ /*Later*/ } public void remove( AnyType x ){ /*Later*/} public boolean contains( AnyType x ){/*Later */} public void makeEmpty( ){ /* Later */ } private static final int DEFAULT_TABLE_SIZE = 101; private List<AnyType> [ ] theLists; private int currentSize; private void rehash( ){ /* Later */ } private int myhash( AnyType x ){ /* Later */ } private static int nextPrime( int n ){ /* Later */ } private static boolean isPrime( int n ){ /* Later */ } } Template parameter HashedObj implies all the properties of Object: has a constructor of no args has a copy constructor has a destructor has an assignment operator Plus: has operator == or != there is a function template <class HashedObj> int hash(const HashedObj &key, int tablesize) Hash functions of class are note template functions. Overload the template functions for specific classes: int hash(const string &key, int tablesize); int hash(const int &key, int tablesize); 8/30/2007 CMSC 341 Hashing

HashTable Ops boolean contains( AnyType x ) void insert (AnyType x) Returns true if x is present in the table. void insert (AnyType x) If x already in table, do nothing. Otherwise, insert it, using the appropriate hash function. void remove (AnyType x) Remove the instance of x, if x is present. Ptherwise, does nothing void makeEmpty() It is up to the hashtable user to provide a HashedObj that cannot possibly be in the table. The text provides no way to test for this. Perhaps HashTable should have a method: template <class HashedObj> bool HashTable<HashedObj>::isNotFound(const HashedObj &x); Why not return an iterator from find -- as we did in list? Could do that. It would have advantages: iter.isPastEnd() if not found iter.retrieve() to get the object Question: can ITEM_NOT_FOUND be inserted? ANS: Yes. What if it is? Then can no longer distinguish find successful/unsuccessful. 8/30/2007 CMSC 341 Hashing

Hash Methods private int myhash( AnyType x ) { int hashVal = x.hashCode( ); hashVal %= theLists.length; if( hashVal < 0 ) hashVal += theLists.length; return hashVal; } 8/30/2007 CMSC 341 Hashing

Handling Collisions Collisions are inevitable. How to handle them? Separate chaining hash tables Store colliding items in a list. If m is large enough, list lengths are small. Insertion of key k hash( k ) to find the proper list. If k is in that list, do nothing, else insert k on that list. Asymptotic performance If always inserted at head of list, and no duplicates, insert = O(1) for best, worst and average cases Ex: hash first 10 perfect squares using k mod 10: tablesize = 10, m = 10 insert 0, 1, 4, 9, 16, 25, 36, 49, 64, 81 8/30/2007 CMSC 341 Hashing

Hash Class for Separate Chaining To implement separate chaining, the private data of the hash table is an array of Lists. The hash functions are written using List functions private List<AnyType> [ ] theLists; 8/30/2007 CMSC 341 Hashing

Performance of contains( ) Hash k to find the proper list. Call contains( ) on that list which returns a boolean. Performance best: worst: average Performance best: selected list is empty or key is first  O(1) worst: Let N be number of elements in the hash table. Worst case: all N elements are in one list (all have the same hash value) and key not there  O(n) average: Suppose there are M buckets and N elements in table. Then, expected list length = N/M  O(N/M) = O(N) if M is small = O(1) if M is large = N/M is called the load factor of the table. It is important to keep the load factor from getting too large. If N <= M, <= 1 and O(N/M)  O(1) where N/M is constant. 8/30/2007 CMSC 341 Hashing

Performance of remove( ) Remove k from table Hash k to find proper list. Remove k from list. Performance best worst average Performance: best: k is 1st element on list, or list is empty: O(1) worst: all elements on one list : O(n) average: O(N/M) : O(1) for <= 1 So, what’s the big deal? Performance for hash table and list are same best and worst-- Right. But average performance for a well-designed hash table is much better: O(1) So, why not always use a well-designed hash table? 1. Does not support ordered output. (No findMin or findMax) 2. Does not support finding all elements in a given range 3. How to design the hash function is not always clear. 8/30/2007 CMSC 341 Hashing

Handling Collisions Revisited Probing hash tables All elements stored in the table itself (so table should be large. Rule of thumb: m >= 2N) Upon collision, item is hashed to a new (open) slot. Hash function h: U x {0,1,2,….}  {0,1,…,m-1} h( k, i ) = ( h’( k ) + f( i ) ) mod m for some h’: U  { 0, 1,…, m-1} and some f( i ) such that f(0) = 0 Each attempt to find an open slot (i.e. calculating h( k, i )) is called a probe 8/30/2007 CMSC 341 Hashing

HashEntry Class for Probing Hash Tables In this case, the hash table is just an array private static class HashEntry<AnyType>{ public AnyType element; // the element public boolean isActive; // false if deleted public HashEntry( AnyType e ) { this( e, true ); } public HashEntry( AnyType e, boolean active ) { element = e; isActive = active; } } // The array of elements private HashEntry<AnyType> [ ] array; // The number of occupied cells private int currentSize; 8/30/2007 CMSC 341 Hashing

Linear Probing Use a linear function for f( i ) f( i ) = c * i Example: h’( k ) = k mod 10 in a table of size 10 , f( i ) = i So that h( k, i ) = (k mod 10 + i ) mod 10 Insert the values U={89,18,49,58,69} into the hash table Example: h’(k) = k mod 10 in a table of size 10 (not prime, but easy to calculate) U={89,18,49,58,69} f(I) = I 1. 89 hashes to 9 2. 18 hashes to 8 3. 49 hashes to 9, collides with 89 h(k,1) = (49%10+1)%10=0 4. 58 hashes to 8, collides with 18 h(k,1)=(58 % 10 + 1 ) % 10=9, collides with 89 h(k,2)=(58 % 10+2)%10=0, collides with 49 h(k,3)=(58 % 10+3)%10=1 5. 69 hashes to 9, collides with 89 h(69,1) = (h’(69)+f(1))mod 10 = 0, collides with 49 h(69,2) = (h’(69+f(2))mod 10 =0, collides with 58 h(69,3) = (h’(69)+f(3))mod 10 = 2 8/30/2007 CMSC 341 Hashing

Linear Probing (cont.) Problem: Clustering Asymptotic Performance When the table starts to fill up, performance  O(N) Asymptotic Performance Insertion and unsuccessful find, average  is the “load factor” – what fraction of the table is used Number of probes  ( ½ ) ( 1+1/( 1- )2 ) if   1, the denominator goes to zero and the number of probes goes to infinity 8/30/2007 CMSC 341 Hashing

Linear Probing (cont.) Remove Can’t just use the hash function(s) to find the object and remove it, because objects that were inserted after X were hashed based on X’s presence. Can just mark the cell as deleted so it won’t be found anymore. Other elements still in right cells Table can fill with lots of deleted junk 8/30/2007 CMSC 341 Hashing

Quadratic Probing Use a quadratic function for f( i ) f( i ) = c2i2 + c1i + c0 The simplest quadratic function is f( i ) = i2 Example: Let f( i ) = i2 and m = 10 Let h’( k ) = k mod 10 So that h( k, i ) = (k mod 10 + i2 ) mod 10 Insert the value U={89, 18, 49, 58, 69 } into an initially empty hash table 1. 89 hashes to 9 2. 18 hashes to 8 3. 49 hashes to 9, collision with 89 4. 58 hashes to 8, collides with 18 5. 69 hashes to 9 8/30/2007 CMSC 341 Hashing

Quadratic Probing (cont.) Advantage: Reduced clustering problem Disadvantages: Reduced number of sequences No guarantee that empty slot will be found if λ ≥ 0.5, even if m is prime If m is not prime, may not find an empty slot even if λ < 0.5 For example: h(k,I) = (h’(k)+I^2)mod M h(k, I) = h(k) + 0 + 1 + 4 + 9 + 16 + 25 + 26 with m = 10, h’(k) = k mod 10 key I=0 1 2 3 4 5 | 6 7 8 9 10 0 0 1 4 9 6 5 | 6 9 4 1 0 1 1 2 5 0 7 6 | 7 0 5 2 1 2 2 3 6 1 8 7 | 8 1 6 3 2 duplicate values to right of line I cannot exceed 5 in this table, only 6 slot available for each key If table size is prime, an open slot is guaranteed when <= 1/2. 8/30/2007 CMSC 341 Hashing

Double Hashing Let f( i ) use another hash function f( i ) = i * h2( k ) Then h( k, I ) = ( h’( k ) + * h2( k ) ) mod m And probes are performed at distances of h2( k ), 2 * h2( k ), 3 * h2( k ), 4 * h2( k ), etc Choosing h2( k ) Don’t allow h2( k ) = 0 for any k. A good choice: h2( k ) = R - ( k mod R ) with R a prime smaller than m Characteristics No clustering problem Requires a second hash function Works best when table size is prime Quadratic is reasonably good, faster, simpler, and does not require second hash function -- hash functions for keys like strings are expensive to compute 8/30/2007 CMSC 341 Hashing

Rehashing If the table gets too full, the running time of the basic operations starts to degrade. For hash tables with separate chaining, “too full” means more than one element per list (on average) For probing hash tables, “too full” is determined as an arbitrary value of the load factor. To rehash, make a copy of the hash table, double the table size, and insert all elements (from the copy) of the old table into the new table Rehashing is expensive, but occurs very infrequently. 8/30/2007 CMSC 341 Hashing