CMSC 341 Hashing 4/11/2019.

Hash Table
Basic idea: an array with slots 0, 1, 2, …, m-1 in which items are stored.
The storage index for an item is determined by a hash function h(k): U → {0, 1, …, m-1}
  U is the universe of key values
  h(k) is a function from the set of all possible key values to an int in the range 0 to (m-1) inclusive
Desired properties of h(k)
  easy to compute
  uniform distribution of keys over {0, 1, …, m-1}
When h(k1) = h(k2) for distinct k1, k2 ∈ U, we have a collision.
Hashing integers is easy. The key values in the set U are already integers, so h(k) just needs to distribute them over the range 0 … m-1 in some uniform way. There are basically two ways to do this.

Division Method
The function: h(k) = k mod m, where m is the table size.
m must be chosen to spread keys evenly. Poor choices:
  Ex: m = a factor of 10. What if the keys all end in zero?
    70 mod 10 = 0
    80 mod 10 = 0
    90 mod 10 = 0
  Ex: m = 2^b, b > 1. h(k) = k mod 2^b just extracts the b low-order bits:
    19 mod 4 = 3
    10011 mod 2^2 = 00011 (the bottom two bits)
A good choice of m is a prime number.
We also want the table to be no more than 80% full. Choose m as the smallest prime number greater than m_min, where m_min = (expected number of entries)/0.8
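The table-size rule and the division method can be sketched as follows. The function names are hypothetical, and the sketch assumes non-negative integer keys; it computes m_min as entries * 5 / 4 in integer arithmetic, which is the same as dividing by 0.8.

```cpp
#include <cassert>

// Trial-division primality test; adequate for choosing a table size.
bool isPrime(int n) {
    if (n < 2) return false;
    for (int d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Smallest prime strictly greater than m_min = expectedEntries / 0.8.
int chooseTableSize(int expectedEntries) {
    assert(expectedEntries > 0);
    int m = expectedEntries * 5 / 4 + 1;  // first candidate above floor(m_min)
    while (!isPrime(m)) ++m;
    return m;
}

// Division method: h(k) = k mod m (assumes k >= 0).
int divisionHash(int k, int m) {
    assert(k >= 0 && m > 0);
    return k % m;
}
```

For 80 expected entries, m_min = 100 and the chosen size is 101. With m = 10, the keys 70, 80 and 90 all land in slot 0, which is the problem described above.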

Multiplication Method
The function: h(k) = ⌊m(kA − ⌊kA⌋)⌋, where A is some positive real constant.
Note that kA − ⌊kA⌋ (the fractional part of kA) lies between 0 and 1.
A very good choice of A is the inverse of the “golden ratio.”
Given two positive numbers x and y, the ratio x/y is the golden ratio φ if φ = x/y = (x+y)/x
The golden ratio:
  x^2 − xy − y^2 = 0
  ⇒ φ^2 − φ − 1 = 0
  ⇒ φ = (1 + sqrt(5))/2 = 1.618… ≈ Fib_i/Fib_(i-1)

Multiplication Method (cont.)
Because of the relationship of the golden ratio to Fibonacci numbers, using this particular value of A in the multiplication method is called “Fibonacci hashing.”
Some values of h(k) = ⌊m(kφ^-1 − ⌊kφ^-1⌋)⌋, where φ^-1 = 1/1.618… = 0.618… (shown before the final floor):
  k = 0:   0
  k = 1:   0.618m
  k = 2:   0.236m
  k = 3:   0.854m
  k = 4:   0.472m
  k = 5:   0.090m
  k = 6:   0.708m
  k = 7:   0.326m
  …
  k = 32:  0.777m
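A minimal sketch of the multiplication method, with Fibonacci hashing as the special case A = 1/φ = (sqrt(5) − 1)/2. The function names are hypothetical, and double-precision arithmetic is assumed to be accurate enough for the small keys used here.

```cpp
#include <cassert>
#include <cmath>

// Multiplication method: h(k) = floor(m * frac(k * A)).
int multiplicationHash(unsigned k, int m, double A) {
    assert(m > 0 && A > 0.0);
    double product = k * A;
    double frac = product - std::floor(product);  // fractional part, in [0, 1)
    return static_cast<int>(std::floor(m * frac));
}

// Fibonacci hashing: A is the inverse of the golden ratio, about 0.618.
int fibonacciHash(unsigned k, int m) {
    const double A = (std::sqrt(5.0) - 1.0) / 2.0;
    return multiplicationHash(k, m, A);
}
```

With m = 1000, keys 1, 2 and 3 map to 618, 236 and 854, matching the 0.618m, 0.236m and 0.854m entries in the list of values.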

Non-integer Keys
In order to hash a non-integer key, we must first convert it to a positive integer:
  h(k) = g(f(k)) with f: U → int and g: int → {0 .. m-1}
Suppose the keys are strings. How can we convert a string (or characters) into an integer value? There are lots of possibilities, most of which are not very good.
1. Sum the ASCII values of the characters in the string.
   If the string has length n and the characters are c_0, …, c_(n-1), then f(k) = sum(c_i)
   If g is the division method, then h(k) = (sum(c_i)) mod tablesize
   Problem: if the strings are short, the integers will not span the table. Since ASCII values are 0 … 127, the max integer will be 127n.
2. Compute a polynomial function using the characters as the coefficients.
   Let B be the base of the polynomial: f(k) = sum(c_i B^i)
   What should be the value of B? Try 37, the smallest prime number > 32. Use a prime number in order to involve all bits in the integer.
   Problem: the integer might overflow and become negative.
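The two string-to-integer conversions can be compared with a short sketch. The function names are hypothetical, and unsigned arithmetic is an assumption made here so that overflow wraps around instead of going negative.

```cpp
#include <cassert>
#include <string>

// Option 1: sum of the character codes.
unsigned sumOfChars(const std::string &key) {
    unsigned total = 0;
    for (unsigned char c : key) total += c;
    return total;
}

// Option 2: polynomial sum(c_i * B^i), evaluated term by term.
unsigned polynomialOfChars(const std::string &key, unsigned B = 37) {
    assert(B > 1);
    unsigned total = 0;
    unsigned power = 1;  // B^i for the current character
    for (unsigned char c : key) {
        total += c * power;
        power *= B;
    }
    return total;
}
```

An 8-character string of code-127 characters sums to only 127 * 8 = 1016, so option 1 cannot reach most slots of a large table. Option 1 also gives "abc" and "cab" the same value (294), whereas the polynomial separates them.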

Horner’s Rule
hashval = 37*hashval + key[i];

int hash(const string &key, int tablesize)
{
    int hashval = 0;

    // f(k) by Horner's rule
    for (int i = 0; i < key.length(); i++)
        hashval = 37*hashval + key[i];

    // g(k) by division method
    hashval %= tablesize;
    if (hashval < 0)
        hashval += tablesize;
    return hashval;
}

The template parameter HashedObj implies all the properties of Object:
  has a constructor of no args
  has a copy constructor
  has a destructor
  has an assignment operator
Plus:
  has operator== or operator!=
  there is a function
    template <class HashedObj> int hash(const HashedObj &key, int tablesize)
The hash functions for specific classes are not template functions. Overload the template function for each specific class:
    int hash(const string &key, int tablesize);
    int hash(const int &key, int tablesize);
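A caveat on the Horner's-rule routine: signed integer overflow is undefined behavior in C++, so relying on hashval going negative is not portable. The variant below is a sketch under that assumption, not the course's code. It accumulates in an unsigned int, where wraparound is well defined, and the name hashString is hypothetical.

```cpp
#include <cassert>
#include <string>

// Horner's rule with base 37, accumulated in unsigned arithmetic so that
// overflow wraps modulo 2^N instead of invoking undefined behavior.
int hashString(const std::string &key, int tablesize) {
    assert(tablesize > 0);
    unsigned hashval = 0;
    for (unsigned char c : key)
        hashval = 37u * hashval + c;
    // Division method; the result is always in 0 .. tablesize-1.
    return static_cast<int>(hashval % static_cast<unsigned>(tablesize));
}
```

Because the accumulator can never be negative, the negative-value adjustment after the modulus is no longer needed.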

HashTable Class
template <class HashedObj>
class HashTable {
  public:
    explicit HashTable(const HashedObj &notFound, int size = 101);
    HashTable(const HashTable &rhs)
      : ITEM_NOT_FOUND(rhs.ITEM_NOT_FOUND), theLists(rhs.theLists)
      { /* no code */ }

    const HashedObj &find(const HashedObj &x) const;
    void makeEmpty();
    void insert(const HashedObj &x);
    void remove(const HashedObj &x);
    const HashTable &operator=(const HashTable &rhs);

  private:
    vector<List<HashedObj> > theLists;
    const HashedObj ITEM_NOT_FOUND;
};

Hash Table Ops
const HashedObj &find(const HashedObj &x) const;
    returns the HashedObj in the table, if present; otherwise, returns ITEM_NOT_FOUND
void insert(const HashedObj &x);
    if x is already in the table, does nothing; otherwise inserts it, using the appropriate hash function
void remove(const HashedObj &x);
    removes the instance of x, if x is present; otherwise, does nothing
void makeEmpty();

It is up to the hash table user to provide a HashedObj that cannot possibly be in the table. The text provides no way to test for this. Perhaps HashTable should have a method:
    template <class HashedObj> bool HashTable<HashedObj>::isNotFound(const HashedObj &x);
Why not return an iterator from find, as we did in list? Could do that. It would have advantages:
    iter.isPastEnd() if not found
    iter.retrieve() to get the object
Question: can ITEM_NOT_FOUND be inserted? Answer: Yes. What if it is? Then we can no longer distinguish a successful find from an unsuccessful one.

Handling Collisions
Collisions are inevitable. How to handle them?
One possibility: separate chaining (aka open hashing)
  store colliding items in a list
  if m is large enough, list lengths are small
Insertion of key k
  hash(k) to find the bucket
  if k is on that list, do nothing; else, insert k on that list
Asymptotic performance
  if always inserted at head of list, and no duplicates, insert = O(1): best, worst, average
Ex: hash the first 10 perfect squares using k mod 10 (tablesize m = 10)
  insert 0, 1, 4, 9, 16, 25, 36, 49, 64, 81
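The perfect-squares example can be worked through with a small sketch of separate chaining. It uses std::list and non-negative int keys rather than the course's List and HashedObj types, and the class and function names are hypothetical. Because it checks for duplicates before inserting, an insert costs a scan of the chain rather than the O(1) of a blind head insertion.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

class ChainedIntTable {
  public:
    explicit ChainedIntTable(int m) : buckets(m) { assert(m > 0); }

    // If k is already on its list, do nothing; otherwise insert at the head.
    void insert(int k) {
        std::list<int> &chain = buckets[hash(k)];
        if (std::find(chain.begin(), chain.end(), k) == chain.end())
            chain.push_front(k);
    }

    bool contains(int k) const {
        const std::list<int> &chain = buckets[hash(k)];
        return std::find(chain.begin(), chain.end(), k) != chain.end();
    }

    std::size_t chainLength(int slot) const { return buckets[slot].size(); }

  private:
    // Division method; assumes k >= 0.
    int hash(int k) const { return k % static_cast<int>(buckets.size()); }

    std::vector<std::list<int>> buckets;
};

// Inserts 0, 1, 4, ..., 81 into a table of size 10.
ChainedIntTable firstTenSquares() {
    ChainedIntTable t(10);
    for (int i = 0; i < 10; ++i) t.insert(i * i);
    return t;
}
```

In the resulting table, slots 1, 4, 6 and 9 each hold two keys, slots 0 and 5 hold one, and slots 2, 3, 7 and 8 stay empty, because a perfect square never ends in 2, 3, 7 or 8.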

Find Performance
Find
  hash k to find the bucket
  do a find on that list, which returns a listItr
  if itr.isPastEnd(), return ITEM_NOT_FOUND; otherwise, return itr.retrieve()
Performance
  best: the selected list is empty or the key is first → O(1)
  worst: Let N be the number of elements in the hash table. All N elements are in one list (all have the same hash value) and the key is not there → O(N)
  average: Suppose there are M buckets and N elements in the table. Then the expected list length = N/M → O(N/M)
    = O(N) if M is small
    = O(1) if M is large
λ = N/M is called the load factor of the table. It is important to keep the load factor from getting too large.
If N <= M, then λ <= 1 and O(N/M) → O(1), where N/M is constant.

Remove Performance
Remove k from table
  hash k to find the bucket
  remove k from the list
Performance
  best: k is the first element on the list, or the list is empty: O(1)
  worst: all elements on one list: O(N)
  average: O(N/M), which is O(1) for λ <= 1
So, what’s the big deal? Best and worst performance for a hash table and a list are the same. Right. But average performance for a well-designed hash table is much better: O(1).
So, why not always use a well-designed hash table?
1. Does not support ordered output. (No findMin or findMax)
2. Does not support finding all elements in a given range.
3. How to design the hash function is not always clear.

Handling Collisions Revisited
Open addressing (aka closed hashing)
  all elements are stored in the table itself (so the table should be large; rule of thumb: M >= 2N)
  upon collision, the item is hashed to a new (open) slot
Hash function
  h: U × {0, 1, 2, …} → {0, 1, …, M-1}
  h(k, i) = (h’(k) + f(i)) mod M
  for some h’: U → {0, 1, …, M-1} and f(0) = 0
Each try is called a probe.

Linear Probing
Function: f(i) = ci
Example: h’(k) = k mod 10 in a table of size 10 (not prime, but easy to calculate), f(i) = i, keys inserted in the order 89, 18, 49, 58, 69
1. 89 hashes to 9
2. 18 hashes to 8
3. 49 hashes to 9, collides with 89
   h(49,1) = (49 % 10 + 1) % 10 = 0
4. 58 hashes to 8, collides with 18
   h(58,1) = (58 % 10 + 1) % 10 = 9, collides with 89
   h(58,2) = (58 % 10 + 2) % 10 = 0, collides with 49
   h(58,3) = (58 % 10 + 3) % 10 = 1
5. 69 hashes to 9, collides with 89
   h(69,1) = (h’(69) + f(1)) mod 10 = 0, collides with 49
   h(69,2) = (h’(69) + f(2)) mod 10 = 1, collides with 58
   h(69,3) = (h’(69) + f(3)) mod 10 = 2
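The linear-probing trace can be checked with a short sketch. It assumes non-negative int keys, f(i) = i, and a sentinel value EMPTY for unused slots; the names are hypothetical.

```cpp
#include <cassert>
#include <vector>

const int EMPTY = -1;  // sentinel marking an unused slot

// Inserts k by linear probing: slot = (k mod m + i) mod m for i = 0, 1, 2, ...
// Returns the slot used, or -1 if the table is full.
int insertLinear(std::vector<int> &table, int k) {
    assert(k >= 0 && !table.empty());
    const int m = static_cast<int>(table.size());
    for (int i = 0; i < m; ++i) {
        int slot = (k % m + i) % m;
        if (table[slot] == EMPTY) {
            table[slot] = k;
            return slot;
        }
    }
    return -1;
}

// Builds the table from the example: keys 89, 18, 49, 58, 69 with m = 10.
std::vector<int> linearProbingExample() {
    std::vector<int> table(10, EMPTY);
    for (int k : {89, 18, 49, 58, 69}) insertLinear(table, k);
    return table;
}
```

The five keys end up in slots 9, 8, 0, 1 and 2 respectively, so slots 8 through 2 form one contiguous cluster.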

Linear Probing (cont)
Problem: Clustering
  when the table starts to fill up, performance → O(N)
Asymptotic Performance
  insertion and unsuccessful find: average # probes ≈ (1/2)(1 + 1/(1-λ)^2)
  if λ → 1, the denominator goes to zero and the number of probes goes to infinity
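To get a feel for how quickly the probe count grows, the estimate (1/2)(1 + 1/(1-λ)^2) can be evaluated directly. The function name is hypothetical.

```cpp
#include <cassert>

// Approximate average number of probes for an insertion or unsuccessful
// find under linear probing, as a function of the load factor lambda.
double expectedProbesLinear(double lambda) {
    assert(lambda >= 0.0 && lambda < 1.0);
    double d = 1.0 - lambda;
    return 0.5 * (1.0 + 1.0 / (d * d));
}
```

At λ = 0.5 the estimate is 2.5 probes, at λ = 0.75 it is 8.5, and at λ = 0.9 it is about 50.5.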

Linear Probing (cont)
Remove
  Can’t just use the hash function(s) to find the object and empty its cell, because objects that were inserted after x were placed based on x’s presence; emptying the cell would cut their probe sequences short.
  Can just mark the cell as deleted so x won’t be found anymore.
    Other elements are still in the right cells
    Table can fill with lots of deleted junk
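Marking cells as deleted can be sketched with a three-state cell. This assumes non-negative int keys, linear probing with f(i) = i, and reuse of deleted cells on insert; the class and member names are hypothetical.

```cpp
#include <cassert>
#include <vector>

enum class State { Empty, Active, Deleted };

struct Cell {
    int key = 0;
    State state = State::Empty;
};

class LinearProbeTable {
  public:
    explicit LinearProbeTable(int m) : cells(m) { assert(m > 0); }

    // Returns false if k is already present or the table is full.
    bool insert(int k) {
        assert(k >= 0);
        if (contains(k)) return false;
        const int m = size();
        for (int i = 0; i < m; ++i) {
            Cell &c = cells[(k % m + i) % m];
            if (c.state != State::Active) {  // Empty or Deleted: reuse it
                c.key = k;
                c.state = State::Active;
                return true;
            }
        }
        return false;
    }

    bool contains(int k) const { return findSlot(k) >= 0; }

    // Lazy deletion: mark the cell, do not empty it.
    void remove(int k) {
        int slot = findSlot(k);
        if (slot >= 0) cells[slot].state = State::Deleted;
    }

  private:
    int size() const { return static_cast<int>(cells.size()); }

    // Probes until k is found or an Empty cell ends the sequence.
    // Deleted cells are skipped, so later keys stay reachable.
    int findSlot(int k) const {
        const int m = size();
        for (int i = 0; i < m; ++i) {
            int slot = (k % m + i) % m;
            const Cell &c = cells[slot];
            if (c.state == State::Empty) return -1;
            if (c.state == State::Active && c.key == k) return slot;
        }
        return -1;
    }

    std::vector<Cell> cells;
};
```

For example, after inserting 89, 49 and 69 into a table of size 10 and then removing 49, a find for 69 still succeeds because the probe passes over the deleted cell in slot 0 instead of stopping there.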

Quadratic Probing
Function: f(i) = c2 i^2 + c1 i + c0
Example: f(i) = i^2, m = 10, keys inserted in the order 89, 18, 49, 58, 69
1. 89 hashes to 9
2. 18 hashes to 8
3. 49 hashes to 9, collides with 89
4. 58 hashes to 8, collides with 18
5. 69 hashes to 9, collides with 89
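The slide stops at the collisions. The sketch below carries the same example through to the final slots, assuming h’(k) = k mod m, f(i) = i^2, non-negative int keys and an EMPTY sentinel; the names are hypothetical.

```cpp
#include <cassert>
#include <vector>

const int EMPTY = -1;  // sentinel marking an unused slot

// Inserts k by quadratic probing: slot = (k mod m + i*i) mod m.
// Returns the slot used, or -1 if no free slot was found in m probes.
int insertQuadratic(std::vector<int> &table, int k) {
    assert(k >= 0 && !table.empty());
    const int m = static_cast<int>(table.size());
    for (int i = 0; i < m; ++i) {
        int slot = (k % m + i * i) % m;
        if (table[slot] == EMPTY) {
            table[slot] = k;
            return slot;
        }
    }
    return -1;
}

// Builds the table from the example: keys 89, 18, 49, 58, 69 with m = 10.
std::vector<int> quadraticProbingExample() {
    std::vector<int> table(10, EMPTY);
    for (int k : {89, 18, 49, 58, 69}) insertQuadratic(table, k);
    return table;
}
```

Under these assumptions 49 lands in slot 0 (9 + 1), 58 in slot 2 (8 + 4, after finding slot 9 occupied), and 69 in slot 3 (9 + 4, after finding slot 0 occupied).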

Quadratic Probing (cont.)
Advantage: reduced clustering problem
Disadvantages:
  reduced number of probe sequences
  no guarantee that an empty slot will be found if λ >= 0.5 or if the table size is not prime
For example: h(k, i) = (h’(k) + i^2) mod M, so the slots probed are h’(k) + 0, 1, 4, 9, 16, 25, 36, … (mod M)
With M = 10 and h’(k) = k mod 10:
  i            0   1   2   3   4   5   6
  i^2          0   1   4   9  16  25  36
  i^2 mod 10   0   1   4   9   6   5   6
From i = 6 onward the offsets repeat earlier values, so i cannot usefully exceed 5 in this table: only 6 slots are available for each key.
If the table size is prime, an open slot is guaranteed when λ <= 1/2.
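The limit on reachable slots can be checked by enumerating i^2 mod M. Both function names below are hypothetical.

```cpp
#include <cassert>
#include <set>

// Number of distinct offsets i*i mod m over i = 0 .. m-1, i.e. how many
// different slots a single key can ever probe.
int distinctQuadraticOffsets(int m) {
    assert(m > 0);
    std::set<int> seen;
    for (int i = 0; i < m; ++i)
        seen.insert(static_cast<int>((static_cast<long long>(i) * i) % m));
    return static_cast<int>(seen.size());
}

// Smallest i whose offset i*i mod m repeats an earlier offset.
int firstRepeatedProbe(int m) {
    assert(m > 0);
    std::set<int> seen;
    for (int i = 0;; ++i) {
        int offset = static_cast<int>((static_cast<long long>(i) * i) % m);
        if (!seen.insert(offset).second) return i;  // already seen: a repeat
    }
}
```

For M = 10 there are 6 distinct offsets and the first repeat occurs at i = 6. For the prime M = 11 the first 6 probes are also distinct, which is just over half the table. For M = 16 only 4 offsets are reachable and the sequence already repeats at i = 4.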

Double Hashing
Use two hash functions: h’1(k), h’2(k)
  h(k, i) = (h’1(k) + i h’2(k)) mod M
Choosing h’2(k)
  don’t allow h’2(k) = 0 for any k
  a good choice: h’2(k) = R - (k mod R), with R a prime smaller than M
Characteristics
  No clustering problem
  Requires a second hash function
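A minimal sketch of the double-hashing probe function, assuming h’1(k) = k mod M and h’2(k) = R - (k mod R) for non-negative int keys. The function name and the example values M = 10, R = 7 are assumptions chosen for illustration.

```cpp
#include <cassert>

// i-th probe position under double hashing:
//   h(k, i) = (h1(k) + i * h2(k)) mod M
// with h1(k) = k mod M and h2(k) = R - (k mod R), which lies in 1..R
// and therefore is never 0.
int doubleHashProbe(int k, int i, int M, int R) {
    assert(k >= 0 && i >= 0 && M > 0 && R > 0 && R < M);
    int h1 = k % M;
    int h2 = R - (k % R);
    return (h1 + i * h2) % M;
}
```

With M = 10 and R = 7, the keys 49, 58 and 69 have step sizes 7, 5 and 1, so their first alternative slots are 6, 3 and 0. Keys that collide on their initial slot therefore usually follow different probe sequences.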
