Fundamental Structures of Computer Science II

Presentation transcript:

Fundamental Structures of Computer Science II Hashing - 2 CS 202 – Fundamental Structures of Computer Science II Bilkent University Computer Engineering Department CS 202, Spring 2003 Fundamental Structures of Computer Science II Bilkent University

Outline
Collision Resolution Techniques
  Separate Chaining (we have seen this)
  Open Addressing
    Linear Probing
    Quadratic Probing
    Double Hashing
Rehashing
Extendible Hashing

Open Addressing
Separate chaining uses linked lists, which requires implementing a second data structure. In some languages, creating new nodes for the linked lists is expensive and slows down the system.
In open addressing, all items are stored in the hash table itself. If a collision occurs, alternative cells are tried until an empty cell is found.

Open Addressing
The cells that are tried successively can be expressed formally as h_0(x), h_1(x), h_2(x), ...
h_0(x) is the initial cell, where the collision occurs. h_1(x), h_2(x), ... are the alternative cells.
h_i(x) = (hash(x) + f(i)) mod TableSize
f(i) is the collision resolution strategy (function), with f(0) = 0.

Open Addressing
There are various open addressing schemes:
  Linear Probing: h_i(x) = (hash(x) + f(i)) mod TableSize with f(i) = i, where i >= 0
  Quadratic Probing: h_i(x) = (hash(x) + f(i)) mod TableSize with f(i) = i^2, where i >= 0
  Double Hashing: h_i(x) = (hash1(x) + i * hash2(x)) mod TableSize, where i >= 0
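The three schemes differ only in how the offset from the home cell is computed. The sketch below illustrates this; the function names (linearOffset, quadraticOffset, probeCell, doubleHashCell) are hypothetical and do not come from the slides.

```cpp
#include <cassert>
#include <cstddef>

// Offsets f(i) for the two single-hash strategies.
std::size_t linearOffset(std::size_t i)    { return i; }      // f(i) = i
std::size_t quadraticOffset(std::size_t i) { return i * i; }  // f(i) = i^2

// i-th cell tried: h_i(x) = (hash(x) + f(i)) mod TableSize.
// `home` is hash(x); `offset` is f(i).
std::size_t probeCell(std::size_t home, std::size_t offset, std::size_t tableSize) {
    assert(tableSize > 0);
    return (home + offset) % tableSize;
}

// Double hashing: h_i(x) = (hash1(x) + i * hash2(x)) mod TableSize.
// `h1` is hash1(x); `h2` is hash2(x) and must be nonzero.
std::size_t doubleHashCell(std::size_t h1, std::size_t h2, std::size_t i,
                           std::size_t tableSize) {
    assert(tableSize > 0 && h2 != 0);
    return (h1 + i * h2) % tableSize;
}
```

For example, with TableSize = 10 and home cell 9, the first linear probe wraps around to cell 0: probeCell(9, linearOffset(1), 10).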

Linear Probing
In linear probing, f is a linear function of i, typically f(i) = i.
When a collision occurs, cells are tried sequentially in search of an empty cell, wrapping around when the end of the array is reached.
Example: insert the items 89, 18, 49, 58, 69 into an empty hash table. The table size is 10, the hash function is hash(x) = x mod 10, and the collision resolution strategy is f(i) = i.

Example (linear probing)

Cell  Empty table  After 89  After 18  After 49  After 58  After 69
0                                      49        49        49
1                                                58        58
2                                                          69
3
4
5
6
7
8                            18        18        18        18
9                  89        89        89        89        89

Primary cluster cells: 8, 9, 0, 1, 2

Probe sequences (linear probing)

Key x  h_0(x)  h_1(x)  h_2(x)  h_3(x)  Number of probes
89     9                               1
18     8                               1
49     9       0                       2
58     8       9       0       1       4
69     9       0       1       2       4
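The example can be replayed with a few lines of code. This is a minimal sketch, not the slides' implementation: the name insertLinear and the use of -1 as an empty-cell marker are assumptions made for illustration, and keys are assumed to be non-negative integers hashed with x mod TableSize.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int EMPTY_CELL = -1;  // assumed marker for an unused cell

// Inserts `key` using linear probing, f(i) = i.
// Returns the number of cells examined, counting the initial cell,
// or 0 if the table is full and the key could not be placed.
std::size_t insertLinear(std::vector<int>& table, int key) {
    assert(!table.empty() && key >= 0);
    const std::size_t n = table.size();
    const std::size_t home = static_cast<std::size_t>(key) % n;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t cell = (home + i) % n;  // wrap around at the end
        if (table[cell] == EMPTY_CELL) {
            table[cell] = key;
            return i + 1;
        }
    }
    return 0;
}
```

Inserting 89, 18, 49, 58, 69 into a table of size 10 fills cells 9, 8, 0, 1, 2 in that order, which is the primary cluster described in the example.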

Primary Clustering
Blocks of occupied cells (clusters) start to form. A key that hashes into a cluster requires several attempts to resolve the collision, and when it is finally placed it is added to the cluster, making the cluster bigger. This is called primary clustering.

Performance

Collision Resolution Analysis
Assume collision resolution is random: f(i) is a random number between 0 and TableSize-1.
The load factor is λ (the fraction of cells that are full), so the fraction of cells that are empty is 1-λ.
The expected number of cells to probe in an unsuccessful search is then 1/(1-λ).

Cost of an average successful search
The cost of a successful search for item x is equal to the cost of inserting x, which was done earlier.
As items are inserted the load factor increases, so later items cost more to insert.
The average cost of a successful search over N items is therefore computed from the insertion costs of those N items.
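The averaging step can be written out explicitly. This is a sketch under the random-probing assumption used above; treating the load factor as a continuous variable x is an approximation introduced here for illustration.

```latex
% Cost of an insertion (an unsuccessful search) at load factor x:
%   I(x) = 1 / (1 - x)
% Average this cost over the insertions that raised the load from 0 to \lambda:
\[
  S(\lambda) \;=\; \frac{1}{\lambda}\int_{0}^{\lambda}\frac{1}{1-x}\,dx
            \;=\; \frac{1}{\lambda}\,\ln\frac{1}{1-\lambda}
\]
% Example: \lambda = 0.5 gives S = 2\ln 2 \approx 1.39 probes,
% compared with 1/(1-\lambda) = 2 probes for an unsuccessful search.
```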

[Figure: Load factor versus number of probes. Curves: linear probing (unsuccessful search), linear probing (successful search), random probing (unsuccessful search), random probing (successful search).]

Linear Probing
As a rule of thumb:
  Linear probing is a bad idea if the load factor is expected to grow beyond 0.5.
  If linear probing is to be used and the load factor exceeds 0.5, rehashing should be used to grow the hash table.
Comments:
  Linear probing causes primary clustering.
  The collision resolution function is simple to evaluate.
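The 0.5 threshold can be motivated with the standard approximations for linear probing. These formulas are not stated in the slides; they are the well-known textbook estimates, quoted here as background.

```latex
% Expected number of probes with linear probing at load factor \lambda:
\[
  \text{insertion / unsuccessful search:}\quad
  \tfrac{1}{2}\left(1 + \frac{1}{(1-\lambda)^{2}}\right)
  \qquad
  \text{successful search:}\quad
  \tfrac{1}{2}\left(1 + \frac{1}{1-\lambda}\right)
\]
% \lambda = 0.5:  2.5 probes per insertion,  1.5 per successful search.
% \lambda = 0.9: 50.5 probes per insertion,  5.5 per successful search.
```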

Quadratic Probing
Eliminates primary clustering.
The collision resolution function is quadratic: f(i) = i^2.
Causes secondary clustering.
Rules of thumb for using quadratic probing:
  TableSize should be prime.
  The load factor should be less than 0.5; otherwise the table needs to be rehashed.

Example (quadratic probing)

Cell  Empty table  After 89  After 18  After 49  After 58  After 69
0                                      49        49        49
1
2                                                58        58
3                                                          69
4
5
6
7
8                            18        18        18        18
9                  89        89        89        89        89

Primary clusters eliminated.

Probe sequences (quadratic probing)

Key x  h_0(x)  h_1(x)  h_2(x)  h_3(x)  Number of probes
89     9                               1
18     8                               1
49     9       0                       2
58     8       9       2               3
69     9       0       3               3

Quadratic Probing
There is no guarantee of finding an empty cell if the table is more than half full.
If the table is less than half full, quadratic probing is guaranteed to find an empty cell in which a colliding item can be inserted.
The table size must be prime for this guarantee to hold.
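The role of the prime table size can be checked experimentally. The helper below is a hypothetical illustration (the name distinctQuadraticCells is not from the slides): it counts how many different cells the first few quadratic probes reach.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Counts the distinct cells visited by the first `probes` quadratic probes
// h_i = (home + i^2) mod tableSize, for i = 0 .. probes-1.
std::size_t distinctQuadraticCells(std::size_t home, std::size_t tableSize,
                                   std::size_t probes) {
    assert(tableSize > 0);
    std::set<std::size_t> seen;
    for (std::size_t i = 0; i < probes; ++i)
        seen.insert((home + i * i) % tableSize);
    return seen.size();
}
```

With a prime size of 11, the first 6 probes reach 6 different cells, but probing further never reaches a 7th, which is why the table must stay at least half empty. With a non-prime size of 16, the first 8 probes reach only 4 different cells.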

Theorem
If quadratic probing is used and the table size is prime, then a new element can always be inserted if the table is at least half empty.
Proof: Let the table size T be a prime number greater than 3. We will first show that, for a given key x that needs to be inserted, the first k = ceil(T/2) probe locations (counting the initial one) are all distinct. Namely, h_0(x), h_1(x), h_2(x), ..., h_{k-1}(x) are all distinct.
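The remainder of the proof is not preserved in this transcript. A standard argument, given here as a sketch using the symbols T and h_i(x) from the theorem, runs as follows.

```latex
% Suppose two of these locations coincide: h_i(x) = h_j(x)
% with 0 \le i < j \le \lfloor T/2 \rfloor. Then
\begin{align*}
  \mathrm{hash}(x) + i^{2} &\equiv \mathrm{hash}(x) + j^{2} \pmod{T} \\
  j^{2} - i^{2}            &\equiv 0 \pmod{T} \\
  (j - i)(j + i)           &\equiv 0 \pmod{T}
\end{align*}
% Since T is prime, T divides (j - i) or (j + i).
% But 0 < j - i < T and 0 < j + i \le 2\lfloor T/2 \rfloor - 1 < T,
% so neither factor is divisible by T: a contradiction.
% Hence the first \lceil T/2 \rceil locations are distinct. If the table is at
% least half empty, at most \lfloor T/2 \rfloor cells are occupied, so at least
% one of these \lceil T/2 \rceil distinct locations is empty.
```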


Notes to keep in mind
The table must be at least half empty (load factor smaller than 0.5).
The table size must be prime.
Deletions should be lazy: the item is not removed, only marked as deleted. Otherwise a later search could stop too early, because the deleted cell may have caused a collision that pushed another item past it. The cell is still needed to reach the next item in the probe sequence.

Hash Table Class with Quadratic Probing

    template <class HashedObj>
    class HashTable
    {
      public:
        explicit HashTable( const HashedObj & notFound, int size = 101 );
        HashTable( const HashTable & rhs )
          : ITEM_NOT_FOUND( rhs.ITEM_NOT_FOUND ),
            array( rhs.array ), currentSize( rhs.currentSize ) { }

        const HashedObj & find( const HashedObj & x ) const;

        void makeEmpty( );
        void insert( const HashedObj & x );
        void remove( const HashedObj & x );

        const HashTable & operator=( const HashTable & rhs );

        enum EntryType { ACTIVE, EMPTY, DELETED };

Hash Table Class with Quadratic Probing (continued)

      private:
        struct HashEntry
        {
            HashedObj element;
            EntryType info;      // ACTIVE, EMPTY, or DELETED

            HashEntry( const HashedObj & e = HashedObj( ), EntryType i = EMPTY )
              : element( e ), info( i ) { }
        };

        vector<HashEntry> array;
        int currentSize;
        const HashedObj ITEM_NOT_FOUND;

        bool isActive( int currentPos ) const;
        int findPos( const HashedObj & x ) const;
        void rehash( );
    };

Find

    template <class HashedObj>
    const HashedObj & HashTable<HashedObj>::find( const HashedObj & x ) const
    {
        int currentPos = findPos( x );
        if( isActive( currentPos ) )
            return array[ currentPos ].element;
        else
            return ITEM_NOT_FOUND;
    }

FindPos
Since (i-1)^2 = i^2 - 2i + 1, we have i^2 = (i-1)^2 + (2i - 1): the ith probe is (2i - 1) cells past the (i-1)th probe.

    template <class HashedObj>
    int HashTable<HashedObj>::findPos( const HashedObj & x ) const
    {
        int collisionNum = 0;
        int currentPos = hash( x, array.size( ) );

        while( array[ currentPos ].info != EMPTY &&
               array[ currentPos ].element != x )       // search for item
        {
            currentPos += 2 * ++collisionNum - 1;       // compute ith probe
            if( currentPos >= array.size( ) )           // wrap around
                currentPos -= array.size( );
        }
        return currentPos;
    }

Insert

    template <class HashedObj>
    void HashTable<HashedObj>::insert( const HashedObj & x )
    {
        // Insert x as active
        int currentPos = findPos( x );
        if( isActive( currentPos ) )
            return;                                      // already present: return without inserting
        array[ currentPos ] = HashEntry( x, ACTIVE );    // create an active hash entry

        // Rehash; see Section 5.5
        if( ++currentSize > array.size( ) / 2 )          // load factor greater than 0.5
            rehash( );                                   // double the hash table size
    }

Remove

    /**
     * Remove item x from the hash table.
     */
    template <class HashedObj>
    void HashTable<HashedObj>::remove( const HashedObj & x )
    {
        int currentPos = findPos( x );
        if( isActive( currentPos ) )                     // item to be deleted found
            array[ currentPos ].info = DELETED;          // lazy deletion: mark only
    }

Quadratic Probing Review
Causes secondary clustering: elements that hash to the same position probe the same alternative cells.
Load factor should not exceed 0.5.
Table size should be a prime number.

Double Hashing
Two hash functions are used: h_i(x) = (hash1(x) + i * hash2(x)) mod TableSize, where i >= 0.
[Figure: a hash table with cells up to TableSize-1. The initial try is at hash1(x); the 1st, 2nd, and 3rd probes are at hash1(x) + hash2(x), hash1(x) + 2 hash2(x), and hash1(x) + 3 hash2(x), each a step of hash2(x) past the previous one.]

Double Hashing Tips
The choice of hash2(x) is very important. A poor choice does not help to resolve collisions.
hash2 should never evaluate to zero.
The table size should be prime.
hash2(x) = R - (x mod R) works as a second hash function, where R is a prime number.

Example
TableSize is again 10.
1st hash function: hash1(x) = x mod 10
2nd hash function: hash2(x) = 7 - (x mod 7)

Example (double hashing)

Cell  Empty table  After 89  After 18  After 49  After 58  After 69
0                                                          69
1
2
3                                                58        58
4
5
6                                      49        49        49
7
8                            18        18        18        18
9                  89        89        89        89        89

Primary and secondary clusters eliminated.

Probe sequences (double hashing)

Key x  h_0(x)  h_1(x)  h_2(x)  h_3(x)  Number of probes
89     9                               1
18     8                               1
49     9       6                       2
58     8       3                       2
69     9       0                       2
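The double hashing example can be replayed in code. This is a minimal sketch under the example's assumptions (hash1(x) = x mod TableSize, hash2(x) = R - (x mod R) with R = 7); the names secondHash, insertDouble, and EMPTY_CELL are hypothetical and not taken from the slides.

```cpp
#include <cassert>
#include <vector>

const int EMPTY_CELL = -1;  // assumed marker for an unused cell

// Second hash function from the example: R - (x mod R); never zero.
int secondHash(int key, int r) { return r - (key % r); }

// Inserts `key` with double hashing and returns the cell it was placed in,
// or -1 if no empty cell was found within tableSize probes.
int insertDouble(std::vector<int>& table, int key, int r) {
    assert(!table.empty() && key >= 0 && r > 0);
    const int n = static_cast<int>(table.size());
    const int step = secondHash(key, r);
    for (int i = 0; i < n; ++i) {
        const int cell = (key % n + i * step) % n;  // h_i(x)
        if (table[cell] == EMPTY_CELL) {
            table[cell] = key;
            return cell;
        }
    }
    return -1;
}
```

Inserting 89, 18, 49, 58, 69 into a table of size 10 with R = 7 places them in cells 9, 8, 6, 3, 0. Because 10 is not prime, some step sizes cycle through only part of the table, so the sketch can report failure even when empty cells remain.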

Double Hashing
Eliminates primary and secondary clustering.
Two hash functions are computed, so the cost per operation is higher.
If the table size is not prime, we can run out of alternative positions much more quickly.

Extendible Hashing
All methods so far assume that the hash table fits in memory. For large amounts of data this may not be true, and the data items must then reside on disk.
A directory that makes it easy to reach the data items can be kept in memory. If the directory is too big, it too can be stored on disk.

N: number of items to be stored
M: maximum number of items that can be stored in a disk block
D: number of bits used by the root (directory); here D = 2
d_L: number of leading bits that all items in leaf L have in common

Directory (root): 00  01  10  11

Disk blocks (leaves):
Directory entry  d_L  Items
00               (2)  000100  001000  001010
01               (2)  010100  011000
10               (2)  100000  101000  101100  101110
11               (2)  111000  111001

After insertion of 100100: the leaf splits and the directory splits (D = 3)

Directory (root): 000  001  010  011  100  101  110  111

Disk blocks (leaves):
Directory entries  d_L  Items
000, 001           (2)  000100  001000  001010  001011
010, 011           (2)  010100  011000
100                (3)  100000  100100
101                (3)  101000  101100  101110
110, 111           (2)  111000  111001

After insertion of 000000: the leaf splits (the directory is unchanged)

Directory (root): 000  001  010  011  100  101  110  111

Disk blocks (leaves):
Directory entries  d_L  Items
000                (3)  000000  000100
001                (3)  001000  001010  001011
010, 011           (2)  010100  011000
100                (3)  100000  100100
101                (3)  101000  101100  101110
110, 111           (2)  111000  111001
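In the figures above, the directory entry for a key is given by the key's leading D bits. A minimal sketch of that lookup follows; the function name directoryIndex and the fixed key width parameter are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Returns the directory slot for `key`: its leading `depth` bits (D),
// where keys are `keyBits` wide (6 in the figures above).
std::uint32_t directoryIndex(std::uint32_t key, unsigned keyBits, unsigned depth) {
    assert(keyBits >= 1 && keyBits <= 32 && depth <= keyBits);
    if (depth == 0) return 0;          // a single-entry directory
    return key >> (keyBits - depth);   // drop the low-order bits
}
```

For the key 100100 (decimal 36), the slot is 10 (decimal 2) when D = 2 and 100 (decimal 4) after the directory split to D = 3.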