Lecture 17 April 11, 11 Chapter 5, Hashing dictionary operations general idea of hashing hash functions chaining closed hashing.

Dictionary operations: search, insert, delete. Applications: database search, books in a library, patient records, GIS data, etc.; web page caching (web search); combinatorial search (game trees).

Dictionary operations: search, insert, delete. Costs (comparisons and data movements combined, assuming keys can be compared with <, >, and = outcomes):

            ARRAY                 LINKED LIST
            sorted    unsorted    sorted    unsorted
Search      O(log n)  O(n)        O(n)      O(n)
Insert      O(n)      O(1)        O(n)      O(n)
Delete      O(n)      O(n)        O(n)      O(n)

Exercise: Create a similar table separately for data movements and for comparisons.

Performance goals for dictionary operations: O(n) is too inefficient. Goals, and data structures that achieve them:
(a) O(log n) on average - binary search tree
(b) O(log n) in the worst case - balanced binary search tree (AVL tree)
(c) O(1) on average - hashing (but the worst case is O(n))

Hashing o An important and widely useful technique for implementing dictionaries. o Constant time per operation (on average). o Worst-case time proportional to the size of the set for each operation (just like the array and linked list implementations).

General idea: U = set of all possible keys (e.g., 9-digit SS#). If n = |U| is not very large, a simple way to support dictionary operations is: map each key e in U to a unique integer h(e) in the range 0 .. n - 1, and use a Boolean array H[0 .. n - 1] to record which keys are present.
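This ideal case is a direct-address table. A minimal sketch, assuming a small illustrative universe {0, ..., n-1} (the class name and interface are inventions for this example):

```cpp
#include <vector>

// Direct-address table: one Boolean slot per possible key.
// Only feasible when the universe of keys {0, ..., n-1} is small.
class DirectTable {
public:
    explicit DirectTable(int n) : present(n, false) {}
    void insert(int key)       { present[key] = true;  }
    void remove(int key)       { present[key] = false; }
    bool search(int key) const { return present[key];  }
private:
    std::vector<bool> present;
};
```

All three operations are O(1), but the space is proportional to |U|, which is exactly what motivates hashing when U is large.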

General idea

Ideal case not realistic: U, the set of all possible keys, is usually very large, so we can't create an array of size n = |U|. Instead, create an array H of size m much smaller than n; the number of keys actually present at any time will usually be much smaller than n. The mapping from U -> {0, 1, ..., m - 1} is called a hash function. Example: D = students currently enrolled in courses, U = set of all SS#'s, hash table of size m = 1000, hash function h(x) = last three digits.

Example (continued): Insert student "Dan" (the figure shows his SS#). h(SS#) = 871, so Dan is placed in bucket 871 of the hash table.

Example (continued): Insert student "Tim". h(SS#) = 871, the same as Dan's bucket: a collision.

Hash Functions: If h(k1) = h(k2), then k1 and k2 collide at that slot. There are two approaches to resolve collisions.

Collision Resolution Policies Two ways to resolve: (1) Open hashing, also known as separate chaining (2) Closed hashing, a.k.a. open addressing Chaining: keys that collide are stored in a linked list.

Previous example: Insert student "Tim". h(SS#) = 871, same as Dan's. With chaining, Tim is appended to the linked list in bucket 871, after Dan.

Open Hashing: Each hash table slot holds a pointer to the head of a linked list. All elements that hash to a particular bucket are placed on that bucket's linked list. Records within a bucket can be ordered in several ways: by order of insertion, by key value, or by frequency of access.

Open Hashing Data Organization (figure: buckets 0 .. D-1, each pointing to a linked list of the keys that hash there).

Implementation of open hashing - search

bool contains( const HashedObj & x ) const
{
    const list<HashedObj> & whichList = theLists[ myhash( x ) ];
    return find( whichList.begin( ), whichList.end( ), x ) != whichList.end( );
}

find is a function template from the STL header <algorithm>. Its code is essentially:

template <class InputIterator, class T>
InputIterator find( InputIterator first, InputIterator last, const T & value )
{
    for( ; first != last; ++first )
        if( *first == value )
            break;
    return first;
}

Implementation of open hashing - insert

bool insert( const HashedObj & x )
{
    list<HashedObj> & whichList = theLists[ myhash( x ) ];
    if( find( whichList.begin( ), whichList.end( ), x ) != whichList.end( ) )
        return false;
    whichList.push_back( x );
    return true;
}

The new key is inserted at the end of the list. Note that whichList must be a reference; copying the list would insert into a temporary.

Implementation of open hashing - delete
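The delete slide's code did not survive the transcript. The following is a self-contained sketch in the same style as the contains/insert routines above; the theLists and myhash members mirror those routines, and std::hash stands in for the string hash function discussed later:

```cpp
#include <algorithm>
#include <functional>
#include <list>
#include <string>
#include <vector>

// Minimal separate-chaining table matching the slides' contains/insert,
// plus the missing remove. std::hash replaces the slides' base-37 hash.
template <typename HashedObj>
class ChainedHashTable {
public:
    explicit ChainedHashTable(int size = 101) : theLists(size) {}

    bool contains(const HashedObj & x) const {
        const std::list<HashedObj> & whichList = theLists[myhash(x)];
        return std::find(whichList.begin(), whichList.end(), x) != whichList.end();
    }

    bool insert(const HashedObj & x) {
        std::list<HashedObj> & whichList = theLists[myhash(x)];
        if (std::find(whichList.begin(), whichList.end(), x) != whichList.end())
            return false;
        whichList.push_back(x);
        return true;
    }

    // Delete: find x in its bucket's list and unlink it.
    bool remove(const HashedObj & x) {
        std::list<HashedObj> & whichList = theLists[myhash(x)];
        typename std::list<HashedObj>::iterator itr =
            std::find(whichList.begin(), whichList.end(), x);
        if (itr == whichList.end())
            return false;          // x was not present
        whichList.erase(itr);      // O(1) once the node is found
        return true;
    }

private:
    std::vector<std::list<HashedObj>> theLists;

    int myhash(const HashedObj & x) const {
        return static_cast<int>(std::hash<HashedObj>{}(x) % theLists.size());
    }
};
```

As with search, the cost is dominated by scanning the bucket's list; the erase itself is constant time.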

Choice of hash function A good hash function should: be easy to compute distribute the keys uniformly to the buckets use all the fields of the key object.

Example: the key is a string over {a, ..., z, 0, ..., 9, _}. Suppose the hash table size is n = 10007. (Choose the table size to be a prime number.) A good hash function: interpret the string as a number in base 37 and compute it mod 10007. h("word") = ? With "w" = 23, "o" = 15, "r" = 18 and "d" = 4: h("word") = (23 * 37^3 + 15 * 37^2 + 18 * 37^1 + 4) % 10007
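The base-37 computation above can be checked directly; this sketch uses the letter values the example implies ('a' = 1, ..., 'z' = 26) and handles lowercase letters only:

```cpp
#include <string>

// Base-37 string hash, mod a prime table size.
// Letter values follow the slide's example: 'a' = 1, ..., 'z' = 26.
// Assumes the key consists of lowercase letters only.
int base37Hash(const std::string & key, int tableSize) {
    long long value = 0;
    for (char c : key)
        value = value * 37 + (c - 'a' + 1);   // Horner evaluation in base 37
    return static_cast<int>(value % tableSize);
}
```

Evaluating h("word") this way gives 1186224 % 10007 = 5398.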

Computing hash function for a string. Horner's rule: (( ... (a_0 x + a_1) x + a_2 ) x + ... + a_{n-2}) x + a_{n-1}

int hash( const string & key )
{
    int hashVal = 0;
    for( int i = 0; i < key.length( ); i++ )
        hashVal = 37 * hashVal + key[ i ];
    return hashVal;
}

Computing hash function for a string (continued):

int myhash( const HashedObj & x ) const
{
    int hashVal = hash( x );
    hashVal %= theLists.size( );
    return hashVal;
}

Alternatively, we can apply % theLists.size() after each iteration of the loop in the hash function, which also avoids overflow:

int myHash( const string & key )
{
    int hashVal = 0;
    int s = theLists.size( );
    for( int i = 0; i < key.length( ); i++ )
        hashVal = ( 37 * hashVal + key[ i ] ) % s;
    return hashVal;
}

Analysis of open hashing/chaining: Open hashing uses more memory than open addressing (because of the pointers), but is generally more efficient in terms of time. If the arriving keys are random and the hash function is good, keys will be distributed evenly, so each list will be roughly the same size. Let n = the number of keys present in the hash table and m = the number of buckets (lists). Then each bucket holds roughly n/m keys. If we can estimate n and choose m ~ n, the average bucket size will be about 1 (and most buckets will hold a small number of items).

Analysis continued. Average time per dictionary operation: with m buckets and n elements in the dictionary, the average bucket holds n/m elements. λ = n/m is called the load factor. Insert, search, and remove each take O(1 + λ) time (the 1 accounts for computing the hash function). If we can choose m ~ n, this is constant time per operation on average. (Assuming each element is equally likely to hash to any bucket, the running time is constant, independent of n.)

Closed Hashing. Associated with closed hashing is a rehash strategy: "If we try to place x in bucket h(x) and find it occupied, try alternative locations h1(x), h2(x), etc., successively until an empty cell is found. If all the cells have been probed, the hash table is full." h(x) is called the home bucket. The simplest rehash strategy is called linear hashing (linear probing): h_i(x) = (h(x) + i) % D. In general, the collision resolution strategy generates a sequence of hash table addresses (the probe sequence); each slot is tested until an empty one is found (probing).

Closed Hashing Example: m = 8; keys a, b, c, d have hash values h(a) = 3, h(b) = 0, h(c) = 4, h(d) = 3. With b in slot 0, a in slot 3, and c in slot 4: where do we insert d? Slot 3 is already filled. Probe sequence using linear hashing:
h1(d) = (h(d)+1) % 8 = 4 % 8 = 4 (filled by c)
h2(d) = (h(d)+2) % 8 = 5 % 8 = 5 * (empty - d goes here)
h3(d) = (h(d)+3) % 8 = 6 % 8 = 6
etc. The probe sequence wraps around to the beginning of the table.

Operations Using Linear Hashing. Test for membership (search): examine h(k), h1(k), h2(k), ..., until we find k, reach an empty bucket, or return to the home bucket. Case 1: successful search -> return true. Case 2: unsuccessful search (empty bucket reached) -> return false. Case 3: unsuccessful search and the table is full (probe sequence returns to the home bucket). If deletions are not allowed, this strategy works! What if there are deletions?

Dictionary Operations with Linear Hashing. What if there are deletions? If we reach an empty bucket, we cannot be sure that k is not somewhere else: the empty bucket may have been occupied when k was inserted. We need a special placeholder, deleted, to distinguish a bucket that was never used from one that once held a value.

Implementation of closed hashing Code slightly modified from the text. // CONSTRUCTION: an approximate initial size or default of 101 // // ******************PUBLIC OPERATIONS********************* // bool insert( x ) --> Insert x // bool remove( x ) --> Remove x // bool contains( x ) --> Return true if x is present // void makeEmpty( ) --> Remove all items // int hash( string str ) --> Global method to hash strings There is no distinction between hash function used in closed hashing and open hashing. (I.e., they can be used in either context interchangeably.)

template <typename HashedObj>
class HashTable
{
  public:
    explicit HashTable( int size = 101 ) : array( nextPrime( size ) )
      { makeEmpty( ); }

    bool contains( const HashedObj & x )
      { return isActive( findPos( x ) ); }

    void makeEmpty( )
    {
        currentSize = 0;
        for( int i = 0; i < array.size( ); i++ )
            array[ i ].info = EMPTY;
    }

    bool insert( const HashedObj & x )
    {
        int currentPos = findPos( x );
        if( isActive( currentPos ) )
            return false;
        array[ currentPos ] = HashEntry( x, ACTIVE );
        if( ++currentSize > array.size( ) / 2 )
            rehash( );   // rehash when load factor exceeds 0.5
        return true;
    }

    bool remove( const HashedObj & x )
    {
        int currentPos = findPos( x );
        if( !isActive( currentPos ) )
            return false;
        array[ currentPos ].info = DELETED;
        return true;
    }

    enum EntryType { ACTIVE, EMPTY, DELETED };

  private:
    struct HashEntry
    {
        HashedObj element;
        EntryType info;

        HashEntry( const HashedObj & e = HashedObj( ), EntryType i = EMPTY )
          : element( e ), info( i ) { }
    };

    vector<HashEntry> array;
    int currentSize;

    bool isActive( int currentPos ) const
      { return array[ currentPos ].info == ACTIVE; }

    int findPos( const HashedObj & x )
    {
        int offset = 1;            // int offset = s_hash( x );  /* double hashing */
        int currentPos = myhash( x );

        while( array[ currentPos ].info != EMPTY &&
               array[ currentPos ].element != x )
        {
            currentPos += offset;  // compute the ith probe
                                   // offset += 2;  /* quadratic probing */
            if( currentPos >= array.size( ) )
                currentPos -= array.size( );
        }
        return currentPos;
    }
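The fragments above omit a few members (nextPrime, rehash, myhash). The following compilable miniature shows the same ACTIVE/EMPTY/DELETED scheme end to end; the integer keys, fixed table size, and simple modular hash are simplifications for this sketch:

```cpp
#include <vector>

// Miniature closed-hashing table with linear probing and lazy deletion,
// mirroring the slides' ACTIVE/EMPTY/DELETED scheme. Integer keys and a
// fixed (prime) table size keep the sketch short; no rehashing is done.
class ProbingTable {
    enum EntryType { ACTIVE, EMPTY, DELETED };
    struct Entry { int key; EntryType info; };
    std::vector<Entry> array;

    int findPos(int x) const {
        int pos = x % static_cast<int>(array.size());
        while (array[pos].info != EMPTY && array[pos].key != x) {
            ++pos;                                    // linear probe
            if (pos >= static_cast<int>(array.size()))
                pos -= array.size();                  // wrap around
        }
        return pos;
    }

public:
    explicit ProbingTable(int size = 11) : array(size, Entry{0, EMPTY}) {}

    bool contains(int x) const { return array[findPos(x)].info == ACTIVE; }

    bool insert(int x) {
        int pos = findPos(x);
        if (array[pos].info == ACTIVE) return false;
        array[pos] = Entry{x, ACTIVE};
        return true;
    }

    bool remove(int x) {
        int pos = findPos(x);
        if (array[pos].info != ACTIVE) return false;
        array[pos].info = DELETED;                    // lazy deletion
        return true;
    }
};
```

The DELETED marker is what keeps keys that were inserted past a now-removed entry reachable: the probe loop walks over DELETED slots instead of stopping, as a genuinely EMPTY slot would make it do.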

Performance Analysis - Worst Case. Initialization: O(m), m = number of buckets. Insert and search: O(n), n = number of elements currently in the table. Suppose close to n elements in the table form a single chain, and we search for an x that is not in the table. It may happen that h(x) is the start address of that very long chain; then it takes O(c) time, c ~ n, to conclude failure. No better than an unsorted array.

Example: clustering with h(k) = k % 11. (I) Suppose buckets 0, 1, 2 and 7, 8, 9 are occupied.
1. An element with home bucket 0, 1, or 2 is probed forward to bucket 3, and so is one with home bucket 3 itself: p = 4/11 that the next record goes to bucket 3.
2. Similarly, records hashing to 7, 8, or 9 end up in bucket 10.
3. Only records hashing to 4 end up in bucket 4 (p = 1/11); the same holds for 5 and 6.
(II) Insert 1052 (home bucket 7): it lands in bucket 10, and the occupied run 7..10 now wraps around into 0..2, so the next element lands in bucket 3 with p = 8/11.

Performance Analysis - Average Case. Distinguish between successful and unsuccessful searches. Delete = successful search for the record to be deleted. Insert = unsuccessful search along its probe sequence. The expected cost of hashing is a function of how full the table is: the load factor λ = n/m.

Random probing model vs. linear probing model. It can be shown that the average costs under linear hashing (probing) are:
Insertion: (1/2)(1 + 1/(1 - λ)^2)
Deletion (successful search): (1/2)(1 + 1/(1 - λ))
Random probing: suppose we create a sequence of hash functions h, h', ..., all independent of each other. Then:
Insertion: 1/(1 - λ)
Deletion: (1/λ) ln(1/(1 - λ))

Random probing – analysis of insertion (unsuccessful search). What is the expected number of times one should roll a die before getting a 4? Answer: 6 (the probability of success is 1/6). More generally, if the probability of success is p, the expected number of repetitions until success is 1/p. If the current load factor is λ, then the probability of success is 1 - λ, since the proportion of empty slots is 1 - λ.

Improved Collision Resolution. Linear probing: h_i(x) = (h(x) + i) % D. All buckets in the table are candidates for a new record before the probe sequence returns to the home position; clustering of records leads to long probe sequences. Linear probing with increment c > 1: h_i(x) = (h(x) + ic) % D, with c a constant other than 1; records with adjacent home buckets will not follow the same probe sequence. Double hashing: h_i(x) = (h(x) + i·g(x)) % D, where g is another hash function used as the increment. This avoids the clustering problems associated with linear probing.
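The three probe-sequence generators can be sketched side by side. The secondary hash g(x) = 7 - (x % 7) is a common textbook choice, assumed here; D should be prime so every slot is reachable:

```cpp
// Probe-sequence generators for closed hashing; D is the table size.
int linearProbe(int home, int i, int D)         { return (home + i) % D; }
int linearProbeC(int home, int i, int c, int D) { return (home + i * c) % D; }

// Double hashing: a secondary hash supplies the step size.
// g(x) = 7 - (x % 7) is an illustrative choice; it is never 0.
int secondary(int x)                            { return 7 - (x % 7); }
int doubleHashProbe(int x, int home, int i, int D) {
    return (home + i * secondary(x)) % D;
}
```

With D = 11, two keys that share a home bucket usually follow different probe sequences under double hashing because their step sizes g(x) differ, which is exactly what breaks up clusters.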

Comparison with Closed Hashing. Worst-case performance is O(n) for both. The average case is a small constant in both when λ is small. Closed hashing uses less space. Open hashing's behavior is not sensitive to the load factor, and there is no need to resize the table, since list memory is dynamically allocated.


Random probing – analysis of insertion (unsuccessful search). What is the expected number of times one should roll a die before getting a 4? Answer: 6 (the probability of success is 1/6). More generally, if the probability of success is p, the expected number of repetitions until success is 1/p. Probes are assumed to be independent. In the case of insertion, success means finding an empty slot to insert into.

Proof for the case insertion: 1/(1 - λ). Recall: a geometric distribution involves a sequence of independent random experiments, each with outcome success (with probability p) or failure (with probability 1 - p). We repeat the experiment until we get a success. What is the expected number of trials performed? Answer: 1/p. In the case of insertion, success means finding an empty slot, so the probability of success is 1 - λ. Thus, the expected number of probes is 1/(1 - λ).
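The 1/p claim is easy to sanity-check by simulation (the seed and trial count below are arbitrary choices for the sketch):

```cpp
#include <random>

// Estimate the expected number of independent Bernoulli(p) trials until
// the first success; the geometric distribution predicts 1/p.
double meanTrialsUntilSuccess(double p, int numExperiments, unsigned seed) {
    std::mt19937 gen(seed);
    std::bernoulli_distribution success(p);
    long long total = 0;
    for (int e = 0; e < numExperiments; ++e) {
        int trials = 1;
        while (!success(gen)) ++trials;   // keep trying until we succeed
        total += trials;
    }
    return static_cast<double>(total) / numExperiments;
}
```

With p = 1/6 (the die example) the estimate comes out close to 6; with p = 1 - λ it estimates the expected insertion cost under random probing.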


Another hash function - the Multiplication Method. Choose m to be a power of 2 (m = 2^p) and a constant 0 < A < 1; then h(k) = floor(m · (kA mod 1)), i.e., multiply k by A, keep the fractional part, and scale by m. (Knuth suggests A = (sqrt(5) - 1)/2 ≈ 0.618.) For example, with k = 123456 and m = 512: kA ≈ 76300.0041, so h(k) = floor(512 × 0.0041...) = 2.
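A sketch of the multiplication method as described above, using Knuth's suggested A = (sqrt(5) - 1)/2; double precision is adequate for keys of this magnitude:

```cpp
#include <cmath>

// Multiplication-method hash: h(k) = floor(m * frac(k * A)),
// with m a power of two and A = (sqrt(5) - 1) / 2 (Knuth's suggestion).
int multHash(long long k, int m) {
    const double A = (std::sqrt(5.0) - 1.0) / 2.0;   // ~0.6180339887
    double kA   = k * A;
    double frac = kA - std::floor(kA);               // fractional part of k*A
    return static_cast<int>(m * frac);
}
```

For k = 123456 this gives h(k) = 2 with m = 512, and h(k) = 67 with m = 2^14 = 16384.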

Multiplication Method: Implementation (figure). With a w-bit word: precompute floor(A · 2^w), multiply it by the key to get a 2w-bit product (a high-order word and a low-order word), keep the low-order word, and extract its p high-order bits as h(key).