
1 CSCI 210 Data Structures and Algorithms
Prof. Amr Goneid, AUC
Part 5. Dictionaries (2): Hash Tables

2 Dictionaries (2): Hash Tables
Hash Tables as Dictionaries
Hashing Process
Collision Handling: Open Addressing
Collision Handling: Chaining
Properties of Hash Functions
Template Class Hash Table
Performance

3 1. Hash Tables as Dictionaries
Simple containers such as tables, stacks and queues permit access to elements by position or order of insertion. A dictionary is a form of container that permits access by content.

4 The Dictionary Data Structure
A dictionary DS should support the following main operations (see the interface sketch below):
Insert (D, x): insert item x in dictionary D
Delete (D, x): delete item x from D
Search (D, k): search for key k in D
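
As a minimal illustration only (the names Dictionary, insert, remove and search are assumptions, not the course's actual class), such an interface could be declared in C++ as:

template <class KeyType, class DataType>
class Dictionary
{
public:
    // Insert (D, x): insert item x (a key with its data) in dictionary D
    virtual bool insert(const KeyType& key, const DataType& data) = 0;
    // Delete (D, x): delete the item with the given key from D
    virtual bool remove(const KeyType& key) = 0;
    // Search (D, k): search for key k in D
    virtual bool search(const KeyType& key) const = 0;
    virtual ~Dictionary() {}
};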

5 The Dictionary Data Structure
Examples:
Unsorted arrays and linked lists: permit linear search
Sorted arrays: permit binary search
Ordered lists: permit linear search
Binary search trees (BST): fast support of all dictionary operations
Hash tables: fast retrieval by hashing the key directly to a position

6 The Dictionary Data Structure
There are three types of dictionaries:
Static dictionaries: built once and never changed, so they need to support search but not insertion or deletion. These are best implemented using arrays or hash tables with linear probing.
Semi-dynamic dictionaries: support insertion and search queries, but not deletion. These can be implemented as arrays, linked lists or hash tables with linear probing.

7 The Dictionary Data Structure
Fully dynamic dictionaries: need fast support of all dictionary operations. Binary search trees are best. Hash tables also work well for fully dynamic dictionaries, provided chaining is used as the collision resolution mechanism.

8 The Dictionary Data Structure
In revision part R3, we present two dictionary data structures that support all the basic operations. Both are linear structures and so employ linear search, i.e. O(n); they are suitable for small to medium-sized data.
The first uses a run-time array to implement an ordered list and is suitable when we know the maximum data size.
The second uses a linked list and is suitable when we do not know the size of the data to be inserted.

9 Hash Tables as Dictionaries
Dictionaries implemented as linear lists perform searching through matching; linear search costs O(n) comparisons. Dictionaries implemented as BSTs also search by matching, but the search cost is O(h), where h is the tree height; for balanced trees this is O(log n). Some situations require even faster search. This can be achieved by using dictionaries based on hash tables, which are excellent dictionary data structures, particularly if deletion need not be supported.

10 Hash Tables as Dictionaries
Hashing applies a function to the search key so that we can determine where the item will appear in an array (the hash table) without looking at the other items (direct search). We also do not care about the sorted order of the keys. The function used is called a "hash function". Under ideal circumstances the cost of search is constant, independent of the number (n) of keys, i.e. it is O(1).

11 2. Hashing Process
For a hash table of size (n):
h = hash(key), h = 0, 1, 2, ..., n-1
The basic hash function converts the key to an integer and takes the value of this integer mod the size of the hash table.
(Slide diagram: a key is mapped by hash(key) to a slot h in a table of n slots indexed 0 to n-1, giving O(1) direct access.)
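
A minimal sketch of this basic remainder hash in C++ (MaxSize is an assumed table-size constant, and the key is assumed to be a nonnegative integer):

const int MaxSize = 11;                         // assumed table size

int hashKey(long key)
{
    // The key is already an integer; take it mod the table size.
    return static_cast<int>(key % MaxSize);     // result is in 0 .. MaxSize-1, O(1)
}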

12 Collision
It could happen that two keys hash to the same position, e.g. for a table of size 11 and two keys 55 and 66: 55 % 11 = 0 and 66 % 11 = 0. Two distinct keys mapped to the same location are called "synonyms" and the situation is called a "collision". There are different ways to handle collisions; one of them is open addressing with linear probing.

13 3. Collision Handling: Open Addressing

14 Collision Handling: Open Addressing / Linear Probing
In open addressing we use a simple rule to decide where to put a new item when its intended slot h is already occupied. A popular probe sequence is linear probing: we always put the item in the next unoccupied cell. If slot h is occupied, the next slot to probe is h = (h+1) mod MaxSize. When searching for a given item, we go to its intended location and search sequentially; if we find an empty cell before we find the item, it does not exist anywhere in the table.

15 Example
Consider inserting the following sequence of keys into a hash table of size n = 11: {55, 35, 66, 76, 59, 48, 84, 70}. Assume a simple hashing function h = hash(key) = key % n, and assume the table is initially empty, using -1 as the empty marker: [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]

16 Example
55 → 0 and 35 → 2. Table: [55, -1, 35, -1, -1, -1, -1, -1, -1, -1, -1]

17 Example
66 → 0 collides with 55. Table: [55, -1, 35, -1, -1, -1, -1, -1, -1, -1, -1]

18 Example
66 → 0, so it is put in the next available slot (slot 1). Table: [55, 66, 35, -1, -1, -1, -1, -1, -1, -1, -1]

19 Example
76 → 10 and 59 → 4. Table: [55, 66, 35, -1, 59, -1, -1, -1, -1, -1, 76]

20 Example
48 → 4 collides with 59. Table: [55, 66, 35, -1, 59, -1, -1, -1, -1, -1, 76]

21 Example
48 → 4, so it is put in the next available slot (slot 5). Table: [55, 66, 35, -1, 59, 48, -1, -1, -1, -1, 76]

22 Example
84 → 7. Table: [55, 66, 35, -1, 59, 48, -1, 84, -1, -1, 76]

23 Example
70 → 4 collides with 59. Table: [55, 66, 35, -1, 59, 48, -1, 84, -1, -1, 76]

24 Example
70 → 4, so it is put in the next available slot (slot 6). Table: [55, 66, 35, -1, 59, 48, 70, 84, -1, -1, 76]

25 Example
What happens if we have to probe beyond the end of the table? For example, 54 → 10 collides with 76. Table: [55, 66, 35, -1, 59, 48, 70, 84, -1, -1, 76]

26 Example
So we do a circular search, h = (h+1) % n, and 54 → 10 ends up in slot 3. Table: [55, 66, 35, 54, 59, 48, 70, 84, -1, -1, 76]
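
The whole sequence of insertions can be reproduced with a short, self-contained C++ sketch (an illustration assuming the same size-11 table and -1 empty marker; not the course's code):

#include <iostream>

int main()
{
    const int n = 11;
    int table[n];
    for (int i = 0; i < n; i++) table[i] = -1;       // initially empty table

    const int keys[] = {55, 35, 66, 76, 59, 48, 84, 70, 54};
    for (int key : keys)
    {
        int h = key % n;                             // hash to the intended slot
        while (table[h] != -1) h = (h + 1) % n;      // linear probing with wrap-around
        table[h] = key;
    }

    for (int i = 0; i < n; i++) std::cout << table[i] << ' ';
    std::cout << '\n';                               // prints: 55 66 35 54 59 48 70 84 -1 -1 76
    return 0;
}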

27 Demo: Linear Probing

28 Insertion Algorithm
bool insert (key, data)
{
  if (table is not full)
  {
    h = hash(key);                 // Hash key to slot h
    while (slot h is not empty)
      h = (h + 1) % MaxSize;       // Circular advance (linear probing)
    insert key and data at slot h;
    return true;
  }
  else return false;
}
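
A possible C++ rendering of this routine (a sketch with assumed globals keyTable, dataTable, emptyKey and occupied; the course's actual class members may differ):

const int  MaxSize = 11;               // assumed table size
const long emptyKey = -1;              // assumed marker for an empty slot
long keyTable[MaxSize];                // assumed to be pre-filled with emptyKey
int  dataTable[MaxSize];
int  occupied = 0;                     // number of occupied slots

bool insert(long key, int data)
{
    if (occupied == MaxSize) return false;        // table is full
    int h = static_cast<int>(key % MaxSize);      // hash key to slot h
    while (keyTable[h] != emptyKey)
        h = (h + 1) % MaxSize;                    // circular advance (linear probing)
    keyTable[h]  = key;                           // insert key and data at slot h
    dataTable[h] = data;
    occupied++;
    return true;
}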

29 Search Algorithm
Searching for a key in a hash table using open addressing faces three situations:
Slot h is empty: the key does not exist.
There is a match at slot h: the key is found.
Another key occupies slot h: we do a circular search until one of the above situations occurs, or we return to the starting point, in which case the key does not exist.

30 Search Algorithm
bool search (key)
{
  if (table is not empty)
  {
    h = hash(key);                 // Hash key to slot h
    start = h;                     // Starting slot
    while (true)
    {
      if (slot h is empty) return false;
      if (there is a match at h) return true;
      h = (h + 1) % MaxSize;       // Circular advance
      if (h == start) return false;
    }
  }
  else return false;
}
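
The matching C++ sketch of the search (same assumed declarations as in the insertion sketch; note that if the table is empty, every slot holds emptyKey, so the first test already returns false):

const int  MaxSize = 11;               // assumed table size
const long emptyKey = -1;              // assumed marker for an empty slot
long keyTable[MaxSize];                // assumed to be filled by the insertion routine

bool search(long key)
{
    int h = static_cast<int>(key % MaxSize);         // hash key to slot h
    int start = h;                                    // remember the starting slot
    while (true)
    {
        if (keyTable[h] == emptyKey) return false;    // empty slot: key does not exist
        if (keyTable[h] == key)      return true;     // match at slot h
        h = (h + 1) % MaxSize;                        // circular advance
        if (h == start) return false;                 // back at the start: key does not exist
    }
}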

31 4. Collision Handling: Chaining
Chaining is another collision resolution mechanism. A smaller table is used in which each location (slot) is associated with a linked list. Synonyms of a key in a slot are stored in the linked list associated with that slot. Searching is done by hashing the key to a main slot and, if the key is not found there, conducting a linear search of the associated linked list.
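
A minimal sketch of chaining using the standard library (assumed names; the course's template class would look different):

#include <list>
#include <vector>

const int Q = 11;                                   // assumed number of main slots
std::vector< std::list<long> > table(Q);            // one linked list (chain) per slot

void insertChained(long key)
{
    table[key % Q].push_back(key);                  // hash to a main slot, append to its chain
}

bool searchChained(long key)
{
    for (long k : table[key % Q])                   // linear search of that slot's chain
        if (k == key) return true;
    return false;
}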

32 Example
h = key % 11. The keys {55, 66, 44, 33, 89, 45, 67, 35, 47, 36, 59, 60, 38, 71, 27} are chained as:
slot 0: 55 → 66 → 44 → 33
slot 1: 89 → 45 → 67
slot 2: 35
slot 3: 47 → 36
slot 4: 59
slot 5: 60 → 38 → 71 → 27

33 5. Properties of Hash Functions
A hash function is usually specified in two steps:
Hash code map: h1(key) → an integer K
Compression map: h2(K) → [0, N-1]
i.e. h(key) = h2(h1(key))

34 Properties of Hash Functions
A hash function should be simple, fast and single-valued. It should scatter h over the range 0 to MaxSize-1, i.e. it should provide a uniform distribution of hash values, and it should not cluster keys in regions of the table. Using a prime number for MaxSize reduces clustering. The key to efficiency is using a table large enough to contain many holes (empty slots).

35 Properties of Hash Functions
There are many hash functions with varying performance. For numeric keys, random hashing is very good: if x is the key, a large integer is obtained as
K = (α x + β) % m, with α = β = m = 65536
The hashed value is then computed as h = K % MaxSize.
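
A sketch of this computation (the values of α and β below are assumed for illustration only; the slide fixes m = 65536, and MaxSize is an assumed table size):

const long m = 65536;                         // modulus from the slide
const long alphaC = 25173, betaC = 13849;     // assumed illustrative constants
const int  MaxSize = 101;                     // assumed table size (a prime)

int randomHash(long x)
{
    long K = (alphaC * x + betaC) % m;        // large intermediate integer K
    return static_cast<int>(K % MaxSize);     // h = K % MaxSize
}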

36 Properties of Hash Functions
For a string key S consisting of characters {S0, S1, ..., SL-1}, we may use one of several functions that combine the characters into an integer hash code.
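
One common choice, shown here only as an assumed example and not necessarily the formula on the original slide, combines the characters with a polynomial (Horner's rule) hash and then compresses the result:

#include <string>

const int MaxSize = 101;                        // assumed table size (a prime)

int hashString(const std::string& S)
{
    const unsigned long b = 31;                 // assumed base of the polynomial
    unsigned long K = 0;
    for (char c : S)
        K = K * b + static_cast<unsigned char>(c);   // K = S0*b^(L-1) + ... + S(L-1) (Horner)
    return static_cast<int>(K % MaxSize);       // compress to a table slot
}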

37 Other Hash Functions
Hash code maps:
Memory addresses as integers (K)
Partition the bits of the key into components of fixed length (e.g. 8 or 16 bits) and sum the components
Hash compression maps:
Divide: h2(K) = K mod N
Multiply, add and divide (MAD): h2(K) = (aK + b) mod N, with a mod N ≠ 0
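
Putting the two-step scheme of slides 33 and 37 together, a tiny sketch (N, a and b are assumed illustrative values satisfying a mod N ≠ 0):

const long N = 97;                     // assumed table size (a prime)
const long a = 33, b = 7;              // assumed constants, a % N != 0

long h1(long key)     { return key; }                  // hash code map (trivial for integer keys)
long h2(long K)       { return (a * K + b) % N; }      // MAD compression map
long hashOf(long key) { return h2(h1(key)); }          // h(key) = h2(h1(key))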

38 6. ADT HashTable
As an example, we consider a HashTable ADT that supports most dictionary functions, but not deletion. The table is implemented as a dynamic array, a simple remainder hashing function is used, and linear probing handles collisions.

39 HashTable ADT Operations
Constructor: construct an empty table
Destructor: destroy the table
MakeTableEmpty: empty the whole table
TableIsEmpty: return true if the table is empty
TableIsFull: return true if the table is full
Occupancy: return the number of occupied slots
Insert: insert key and data in a slot
Search: search for a key
Retrieve: retrieve the data part of the current slot
Update: update the data part of the current slot
Traverse: traverse the whole table
A sketch of a possible class declaration follows.
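
The member names below follow the operation list above, but the template parameters and exact signatures are assumptions, not the course's actual header:

template <class KeyType, class DataType>
class HashTable
{
public:
    HashTable(int maxSize);                 // constructor: construct an empty table
    ~HashTable();                           // destructor: destroy the table
    void MakeTableEmpty();                  // empty the whole table
    bool TableIsEmpty() const;              // true if the table is empty
    bool TableIsFull() const;               // true if the table is full
    int  Occupancy() const;                 // number of occupied slots
    bool Insert(const KeyType& key, const DataType& data);   // insert key and data in a slot
    bool Search(const KeyType& key);        // search for a key (sets the current slot)
    DataType Retrieve() const;              // data part of the current slot
    void Update(const DataType& data);      // update the data part of the current slot
    void Traverse() const;                  // traverse the whole table
private:
    int hash(const KeyType& key) const;     // simple remainder hashing
    KeyType*  keys;                         // table implemented as dynamic arrays
    DataType* items;
    int maxSize, occupied, current;
};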

40 7. Performance: Linear Probing
Although searching in a hash table is supposed to be of complexity O(1), collisions increase the search cost. Consider a hash table of size m. Let P(n,m) be the probability that no collisions happen when n keys are inserted into the table. Then P(1,m) = m/m = 1, P(2,m) = (m/m)·((m-1)/m), etc.

41 Performance: Linear Probing
Generally:
P(n,m) = (m/m) · ((m-1)/m) · ((m-2)/m) · ... · ((m-n+1)/m)

42 Performance: Linear Probing
For m = 100, this probability is about 50% when n = 12 and is almost 0 when n = 30.
(Slide plot: P(n,m) versus n for m = 100.)
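
These numbers can be checked with a few lines of C++ (a sketch, not part of the slides):

#include <iostream>

// Probability that inserting n keys into a table of m slots causes no collision.
double P(int n, int m)
{
    double p = 1.0;
    for (int i = 0; i < n; i++)
        p *= static_cast<double>(m - i) / m;   // the i-th key must avoid the i occupied slots
    return p;
}

int main()
{
    std::cout << P(12, 100) << '\n';           // about 0.50
    std::cout << P(30, 100) << '\n';           // about 0.01
    return 0;
}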

43 Performance: Linear Probing
An important factor is the load factor α = (number of keys) / MaxSize, i.e. the occupancy. Let S(α) be the average cost of a successful search for a key, and U(α) the average cost of an unsuccessful search. The problem of deriving these costs was solved by Donald Knuth in 1962.

44 Performance: Linear Probing
The solution is:
S(α) ≈ (1/2)(1 + x) for successful search
U(α) ≈ (1/2)(1 + x^2) for unsuccessful search
where x = 1/(1 - α) and α is the load factor. The following table shows how the costs grow with the load factor:
α       66%    75%    90%
S(α)    2      2.5    5.5
U(α)    5      8.5    50.5

45 Performance: Double Hashing
In case of collision, a second hashing function is used to hash the key to the next probe position:
h = [h1(key) + h2(key)] mod MaxSize
Average case analysis (Knuth):
Successful search: S(α) ≈ -ln(1 - α)/α
Unsuccessful search: U(α) ≈ 1/(1 - α)
α       2/3    0.9
S(α)    1.6    2.55
U(α)    3      10
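
A hedged sketch of double-hashing insertion (the particular h2 below is an assumed example of a second hash function; the slide only states that it supplies the next probe position):

const int  MaxSize = 11;                       // assumed table size (a prime)
const long emptyKey = -1;                      // assumed marker for an empty slot
long keyTable[MaxSize];                        // assumed pre-filled with emptyKey

int h1(long key) { return static_cast<int>(key % MaxSize); }
int h2(long key) { return 1 + static_cast<int>(key % (MaxSize - 1)); }   // assumed, never 0

bool insertDouble(long key)
{
    int h = h1(key);
    for (int probes = 0; probes < MaxSize; probes++)
    {
        if (keyTable[h] == emptyKey) { keyTable[h] = key; return true; }
        h = (h + h2(key)) % MaxSize;           // advance by h2(key): h = [h1 + h2] mod MaxSize, repeated
    }
    return false;                              // no free slot found
}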

46 Performance: Chaining
Let n = total number of keys and Q = number of main slots. For n >> Q, the average chain length is L = n/Q.
Best case: T(n) = 1
Worst case: T(n) = L + 1 = n/Q + 1
Average case: T(n) = n/(2Q) + 1
For example, with n = 1000 keys and Q = 100 slots, an average search costs about 1000/200 + 1 = 6 probes.
(Slide diagram: a table of Q main slots, each holding a chain of average length L.)

47 Learn on your own about:
Hashing functions
Buckets and chaining
Double hashing

