Design and Analysis of Algorithms


Design and Analysis of Algorithms
Khawaja Mohiuddin, Assistant Professor, Department of Computer Sciences
Bahria University, Karachi Campus
Contact: khawaja.mohiuddin@bimcs.edu.pk
Lecture # 8 – Hashing

Hashing: Topics To Cover
- Hash table fundamentals
- Chaining
- Open addressing
  - Linear probing
  - Quadratic probing
  - Pseudorandom probing
  - Double hashing
  - Ordered hashing

Hash Table Fundamentals A hash table maps data to locations in a data structure. It associates a key with a value (typically a larger record). Hash tables are also called associative arrays or dictionaries. The process of mapping a key to a location for use by the hash table is called hashing.

Hash Table Fundamentals (contd.) A hash table stores items in a way that lets you calculate an item's location in the table directly, using a hash function. For example, suppose you want to look up an employee's information by employee ID. Allocate an array of 100 items and store the employee with ID N at position N mod 100 in the array; N mod 100 is the hash function. So an employee with ID 2190 would go in position 90, an employee with ID 2817 would go in position 17, and an employee with ID 3078 would go in position 78. To find a particular employee, you would simply calculate the ID mod 100 and look at the corresponding array entry. This is an O(1) operation that's even faster than interpolation search.
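The employee-ID example above can be sketched as follows (a minimal sketch; the function and record names are illustrative, and no collision handling is attempted yet):

```python
# Direct hashing by employee ID into a 100-slot array, as in the
# slide's example. Names here are hypothetical, not from the slides.
TABLE_SIZE = 100
table = [None] * TABLE_SIZE

def hash_id(employee_id):
    # The slide's hash function: N mod 100.
    return employee_id % TABLE_SIZE

def store(employee_id, record):
    table[hash_id(employee_id)] = (employee_id, record)

def lookup(employee_id):
    # O(1): compute the position directly and check it.
    entry = table[hash_id(employee_id)]
    if entry is not None and entry[0] == employee_id:
        return entry[1]
    return None

store(2190, "Alice")  # goes in position 90
store(2817, "Bob")    # goes in position 17
store(3078, "Carol")  # goes in position 78
```

A lookup then touches exactly one array entry, which is where the O(1) claim comes from.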

Hash Table Fundamentals (contd.) Good hashing functions spread out key values so that they don’t all go to the same position in the table. In particular, key values are often similar, so a good hashing function maps similar key values to dissimilar locations in the table. If you put enough values in a hash table, eventually you’ll find two keys that hash to the same value. That’s called a collision. When that occurs, you need a collision-resolution policy that determines what to do. Another feature that is useful but not provided by some hash tables is the ability to remove a hashed key. Different kinds of hash tables use different methods to provide these features.

Chaining A hash table with chaining uses a collection of entries called buckets to hold key values. Each bucket is the top of a linked list holding the items that map to that bucket. Typically the buckets are arranged in an array, so you can use a simple hashing function to determine a key’s bucket. For example, if you have N buckets, and the keys are numeric, you could map the key K to bucket number K mod N.

Chaining (contd.) To add a key to the hash table, you map the key to a bucket using the hash function, and then add a new cell to the bucket’s linked list. Hashing the key to find its bucket takes O(1) steps. Adding the value to the top of the linked list takes O(1) steps. If the hash table uses B buckets and holds a total of N items, and the items are reasonably evenly distributed, each bucket’s linked list holds roughly N / B items. So checking that a new item isn’t already present in a bucket takes O(N / B) steps. That means adding an item to the hash table takes a total of O(1) + O(N / B) = O(N / B) steps.

Chaining (contd.) To find an item, you hash its key to see which bucket should hold it and then traverse that bucket’s linked list until you find the item or come to the end of the list. If you get to the end of the list, you can conclude that the item isn’t in the hash table. This also takes O(N / B) steps. To remove an item, hash its key as usual to find its bucket, and then remove the item from the bucket’s linked list. Hashing the item takes O(1) steps, and removing it takes O(N / B) steps, so the total time is O(N / B).

Chaining (contd.) A hash table with chaining can expand and shrink as needed, so you don’t need to resize it if you don’t want to. If the linked lists become too long, however, finding and removing items will take a long time. In that case you may want to enlarge the table to make more buckets. When you rehash the table, you know that you will not be adding any duplicate items, so you don’t need to search to the end of each bucket’s linked list, looking for duplicates. That allows you to rehash all the items in O(N) time.
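The chaining operations described on the last few slides can be sketched as follows (a minimal sketch; Python lists stand in for the buckets' linked lists, and keys are assumed numeric):

```python
# Minimal hash table with chaining: B buckets, key K maps to
# bucket K mod B, each bucket holds the keys that hash to it.
class ChainedHashTable:
    def __init__(self, num_buckets=10):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[key % len(self.buckets)]

    def insert(self, key):
        bucket = self._bucket(key)
        if key not in bucket:   # O(N/B) duplicate check
            bucket.append(key)

    def find(self, key):
        # O(1) hash plus an O(N/B) scan of one bucket.
        return key in self._bucket(key)

    def remove(self, key):
        bucket = self._bucket(key)
        if key in bucket:
            bucket.remove(key)
```

Insert, find, and remove each hash once and then scan a single bucket, matching the O(N / B) analysis above.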

Open Addressing Advantage of chaining: a chained hash table can hold any number of values without changing the number of buckets. Disadvantage of chaining: putting too many items in the buckets makes searching through them slower. In open addressing, the values are stored in an array, and some sort of calculation serves as the hashing function, mapping values into positions in the array. For example, if a hash table uses an array with M entries, a simple hashing function might map the key value K into array position K mod M. Different variations of open addressing use different hashing functions and collision-resolution policies.

Open Addressing (contd.) The collision-resolution policy produces a sequence of locations in the array for a value. If a value maps to a location that is already in use, the algorithm tries other locations until it either finds an empty location or concludes that it cannot find one. The sequence of locations that the algorithm tries for a value is called its probe sequence. If the average probe sequence length is only 1 or 2, adding and locating items has O(1) runtime. Open addressing is fast but has some disadvantages. Performance degrades if the array becomes too full. If the array contains N items and is completely full, it takes O(N) time to conclude that an item is not present in the array. Even finding items that are present can be very slow.

Open Addressing (contd.) If the array becomes too full, you can resize it to make it bigger and give the hash table a smaller fill percentage. To do that, create a new array and rehash the items into it. If the new array is reasonably large, it should take O(1) time to rehash each item, for a total runtime of O(N).

Removing Items in Open Addressing Open addressing does not allow you to remove items the way chaining does. An item in the array might be part of another item's probe sequence; if you remove it, you may break that probe sequence, so you can no longer find the second value. One solution to this problem is to mark the item as deleted instead of resetting the array entry to the empty value. When you search for a value, you continue searching past deleted entries. When you insert a new value into the hash table, you can place it in a previously deleted entry if you find one in the probe sequence. One drawback of this approach is that if you add and then remove many items, the table may become full of deleted entries. That won't slow down insertions, but it will make searching for items slower.
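The deleted-marker idea can be sketched as follows (a minimal sketch that assumes linear probing with stride 1 as the probe sequence; the class and sentinel names are illustrative):

```python
# Open addressing with a "deleted" marker (tombstone). Searches stop
# at truly EMPTY slots but continue past DELETED ones.
EMPTY, DELETED = object(), object()

class OpenAddressTable:
    def __init__(self, size=11):
        self.slots = [EMPTY] * size

    def _probe(self, key):
        # Linear probe sequence for key: start, start+1, ... (mod size).
        start = key % len(self.slots)
        for i in range(len(self.slots)):
            yield (start + i) % len(self.slots)

    def insert(self, key):
        for p in self._probe(key):
            # A deleted entry can be reused for a new value.
            if self.slots[p] is EMPTY or self.slots[p] is DELETED:
                self.slots[p] = key
                return True
        return False  # table is full

    def find(self, key):
        for p in self._probe(key):
            if self.slots[p] is EMPTY:
                return False          # a truly empty slot ends the search
            if self.slots[p] == key:  # DELETED slots just fall through
                return True
        return False

    def remove(self, key):
        for p in self._probe(key):
            if self.slots[p] is EMPTY:
                return
            if self.slots[p] == key:
                self.slots[p] = DELETED  # mark, don't empty
                return
```

Removing a value in the middle of another value's probe sequence leaves a tombstone, so the later value can still be found.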

Linear Probing in Open Addressing In linear probing, the collision-resolution policy adds a constant number to each location to generate the probe sequence. This constant is called the stride and is usually set to 1. Each time the algorithm adds 1, it takes the result modulo the size of the array, so the sequence wraps around to the beginning of the array if necessary. For example, suppose the hash table's array contains 10 entries, so 71 maps to location 71 mod 10 = 1. That location already contains the value 61, so the algorithm moves to the next location in the value's probe sequence, location 2. That location is also occupied, so the algorithm moves on to location 3. That location is empty, so the algorithm places 71 there.
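The example above can be reproduced in code (a sketch; the value 42 occupying location 2 is a made-up stand-in, since the slide only says that location is occupied):

```python
# Linear probing with stride 1, wrapping with mod, as described above.
def linear_probe_insert(table, value):
    n = len(table)
    for i in range(n):
        pos = (value + i) % n  # next location in the probe sequence
        if table[pos] is None:
            table[pos] = value
            return pos
    return -1  # table is full

table = [None] * 10
table[1] = 61  # location 71 mod 10 = 1 already holds 61
table[2] = 42  # location 2 is also occupied (42 is a hypothetical occupant)
pos = linear_probe_insert(table, 71)  # probes 1, 2, then settles at 3
```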

Linear Probing in Open Addressing (contd.) Advantages: It is very simple, and a probe sequence will eventually visit every location in the array, so the algorithm can insert an item if any space is left. A disadvantage: linear probing suffers from primary clustering, an effect in which items added to the table tend to cluster into large blocks of contiguous array entries that are all full. This is a problem because it leads to long probe sequences: if you try to add a new item that hashes to any of the entries in a cluster, the item's probe sequence will not find an empty location until it crosses the whole cluster.

Linear Probing in Open Addressing (contd.) In the example program, the hash table's array has 101 entries and holds 50 values. If the items were evenly distributed within the array, the probe sequence for every item in the table would have a length of 1, and the probe sequences for items not in the table would have lengths of 1 or 2, depending on whether the initial hashing mapped the item to an occupied location. However, the program shows that the hash table's average probe sequence length is 2.42, a bit above what you would get with an even distribution. The situation is worse with higher load factors.

Quadratic Probing in Open Addressing Instead of adding a constant stride to locations to create a probe sequence, the quadratic probing algorithm adds the square of the number of locations it has already tried. If K, K + 1, K + 2, K + 3, ... is the probe sequence created by linear probing, the sequence created by quadratic probing is K, K + 1², K + 2², K + 3², ... . Now, if two items map to different positions in the same cluster, they don't follow the same probe sequences and don't necessarily end up adding to the cluster. In the figure, the value 71 has the probe sequence 1, 1 + 1² = 2, 1 + 2² = 5, 1 + 3² = 10, so it doesn't add to the cluster.
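The quadratic probe sequence above can be generated directly (a sketch; positions wrap with mod just as in linear probing):

```python
# Quadratic probing: the i-th probe is K + i*i (mod table size),
# where K is the key's initial position.
def quadratic_probe_sequence(key, table_size, max_probes):
    start = key % table_size
    return [(start + i * i) % table_size for i in range(max_probes)]
```

With K = 1 this reproduces the sequence 1, 2, 5, 10 from the slide's example.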

Quadratic Probing in Open Addressing (contd.) The example program shows that quadratic probing gives a shorter average probe sequence length than linear probing. In this example, quadratic probing gave an average probe sequence length of 1.92, whereas linear probing gave an average length of 2.42.

Quadratic Probing in Open Addressing (contd.) Disadvantage: Quadratic probing reduces primary clustering, but it can suffer from secondary clustering. In secondary clustering, values that map to the same initial position in the array follow the same probe sequence, so they create a cluster. This cluster is spread out through the array, but it still results in longer probe sequences for the items that map to the same initial position. Quadratic probing also has the drawback that it may fail to find an empty location for a value even if a few empty positions are left in the table. Because of how a quadratic probe sequence jumps farther and farther through the array, it may jump over an empty position and not find it.

Pseudorandom Probing in Open Addressing Pseudorandom probing is similar to linear probing, except that the stride is given by a pseudorandom function of the initially mapped location. In other words, if a value initially maps to position K, its probe sequence is K, K + p, K + 2 * p, ..., where p is determined by a pseudorandom function of K. Like quadratic probing, pseudorandom probing prevents primary clustering. Also like quadratic probing, pseudorandom probing is subject to secondary clustering, because values that map to the same initial position follow the same probe sequences. Pseudorandom probing may also skip over some unused entries and fail to insert an item even though the table isn’t completely full.
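One way to sketch pseudorandom probing in code (the particular pseudorandom function, derived here by hashing K's digits with SHA-256, is an assumption; the slides do not specify one):

```python
import hashlib

# The stride p is a deterministic pseudorandom function of the
# initial position K alone, so equal initial positions always get
# equal strides. That is exactly why secondary clustering occurs.
def pseudorandom_stride(k, table_size):
    digest = hashlib.sha256(str(k).encode()).digest()
    return 1 + digest[0] % (table_size - 1)  # stride in 1..table_size-1

def probe_sequence(key, table_size, max_probes):
    k = key % table_size
    p = pseudorandom_stride(k, table_size)
    return [(k + i * p) % table_size for i in range(max_probes)]
```

Two keys that map to the same initial position (such as 13 and 23 in a 10-entry table) get identical probe sequences, illustrating the secondary clustering mentioned above.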

Double Hashing in Open Addressing Instead of using a pseudorandom function of the initial location to create a stride value, Double Hashing uses a second hashing function to map the original value to a stride. For example, suppose the values A and B both initially map to position K. In pseudorandom probing, a pseudo-random function F1 generates a stride p = F1(K). Then both values use the probe sequence K, K + p, K + 2 * p, K + 3 * p, ... . In contrast, double hashing uses a pseudorandom hash function F2 to map the original values A and B to two different stride values pA = F2(A) and pB = F2(B). The two probe sequences start at the same value K, but after that they are different. Double hashing eliminates primary and secondary clustering. However, like pseudorandom probing, double hashing may skip some unused entries and fail to insert an item even though the table isn’t completely full.
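The difference can be sketched as follows (both hash functions here are illustrative assumptions; the point is only that the stride depends on the original value, not just its initial position):

```python
# Double hashing: a second hash function of the original value
# (not of its initial position) supplies the stride.
def h1(value, table_size):
    return value % table_size

def h2(value, table_size):
    # A hypothetical second hash; it only needs to give a nonzero
    # stride and to differ for values that share an initial position.
    return 1 + (value // table_size) % (table_size - 1)

def double_hash_sequence(value, table_size, max_probes):
    start, stride = h1(value, table_size), h2(value, table_size)
    return [(start + i * stride) % table_size for i in range(max_probes)]
```

Values 13 and 23 both start at position 3 in a 10-entry table, but their strides differ, so their probe sequences diverge after the first location, which is what eliminates secondary clustering.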

Ordered Hashing in Open Addressing In some applications, it is more important that the program be able to find values quickly than to insert them quickly, for example a program that uses a dictionary, address book, or product lookup table. A hash table with chaining can find items more quickly if its linked lists are sorted: when searching for an item, the algorithm can stop if it ever finds an item that is larger than the target item. The pseudocode on the next slide shows, at a high level, how to find an item in an ordered hash table.

Ordered Hashing in Open Addressing (contd.) Pseudocode to find an item in an ordered hash table:

    // Return the location of the key in the array, or -1 if it is not present.
    Integer: FindValue(Integer: array[], Integer: key)
        Integer: probe = <Initial location in key's probe sequence.>
        While true  // Repeat forever.
            // See if we found the item.
            If (array[probe] == key) Then Return probe
            // See if we found an empty spot.
            If (array[probe] == EMPTY) Then Return -1
            // See if we passed where the item should be.
            If (array[probe] > key) Then Return -1
            // Try the next location in the probe sequence.
            probe = <Next location in key's probe sequence.>
        End While
    End FindValue

Ordered Hashing in Open Addressing (contd.) For the given pseudocode to work, each probe sequence in the table must be properly ordered so that you can search the table quickly. Unfortunately, you often cannot add the items to a hash table in sorted order, because you don't know that order when you start. Fortunately, there is a way to create an ordered hash table no matter what order you add the items in. To add an item, follow its probe sequence as usual. If you find an empty spot, insert the item, and you're done. If you find a spot containing a value that is larger than the new value, replace it with the new value, and then rehash the larger value. As you rehash the larger item, you may encounter another, even larger value. If that happens, drop the item you're currently hashing in that position and rehash the newly displaced, larger item. Continue the process until you find an empty spot for whatever item you're currently hashing. The pseudocode on the next slide shows the process at a high level.

Ordered Hashing in Open Addressing (contd.) Pseudocode to create an ordered hash table:

    AddItem(Integer: array[], Integer: key)
        Integer: probe = <Initial location in key's probe sequence.>
        While true  // Repeat forever.
            // See if we found an empty spot.
            If (array[probe] == EMPTY) Then
                array[probe] = key
                Return
            End If
            // See if we found a value greater than "key".
            If (array[probe] > key) Then
                // Place the key here and rehash the larger item.
                Integer: temp = array[probe]
                array[probe] = key
                key = temp
            End If
            // Try the next location in the probe sequence.
            probe = <Next location in key's probe sequence.>
        End While
    End AddItem
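The two pseudocode routines can be made runnable (a sketch that assumes linear probing with stride 1 as the probe sequence, None as the EMPTY marker, and a table that is never completely full):

```python
# Ordered hashing with linear probing: each probe sequence stays
# sorted, so searches can stop early at a larger value.
def ordered_add(table, key):
    n = len(table)
    probe = key % n
    while True:
        if table[probe] is None:       # empty spot: done
            table[probe] = key
            return
        if table[probe] > key:         # place key here, rehash larger item
            table[probe], key = key, table[probe]
        probe = (probe + 1) % n        # next location in probe sequence

def ordered_find(table, key):
    n = len(table)
    probe = key % n
    while True:
        if table[probe] == key:
            return probe
        # Stop at an empty slot or once we pass where key would be.
        if table[probe] is None or table[probe] > key:
            return -1
        probe = (probe + 1) % n
```

Inserting 13, 23, and then 3 into a 10-entry table (all three map to position 3) leaves them stored in sorted order along the shared probe sequence, so a failed search for 14 stops as soon as it reaches 23.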