Design and Analysis of Algorithms
Khawaja Mohiuddin, Assistant Professor, Department of Computer Sciences, Bahria University, Karachi Campus
Lecture # 8 – Hashing
Hashing – Topics to Cover
Hash table fundamentals
Chaining
Open addressing: linear probing, quadratic probing, pseudorandom probing, double hashing, ordered hashing
Hash Table Fundamentals
A hash table maps data to locations in a data structure. It associates a key with a value (a larger record). Hash tables are also called associative arrays or dictionaries. The process of mapping a key to a location for use by the hash table is called hashing.
Hash Table Fundamentals (contd.)
A hash table stores items in a way that lets you calculate an item’s location in the table directly, using a hash function. For example, suppose you want to look up an employee’s information by employee ID. Allocate an array of 100 items and store the employee with ID N at position N mod 100 (the hash function). An employee with ID 2190 would go in position 90, an employee with ID 2817 would go in position 17, and an employee with ID 3078 would go in position 78. To find a particular employee, you simply calculate the ID mod 100 and look at the corresponding array entry. This is an O(1) operation, even faster than interpolation search.
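The employee-ID scheme above can be sketched in a few lines of Python (the record strings and function names are mine, purely for illustration):

```python
# Sketch of the slide's employee-ID example: an array of 100 slots
# indexed by ID mod 100. Names and records here are hypothetical.
TABLE_SIZE = 100
table = [None] * TABLE_SIZE

def hash_id(employee_id):
    """The hash function: map an employee ID to an array position."""
    return employee_id % TABLE_SIZE

def store(employee_id, record):
    """Place the employee's record directly at its hashed position."""
    table[hash_id(employee_id)] = (employee_id, record)

def lookup(employee_id):
    """O(1) lookup: compute the position directly, with no searching."""
    entry = table[hash_id(employee_id)]
    if entry is not None and entry[0] == employee_id:
        return entry[1]
    return None

store(2190, "record for 2190")  # goes in position 90
store(2817, "record for 2817")  # goes in position 17
store(3078, "record for 3078")  # goes in position 78
```

Note that this toy version has no collision handling yet; two IDs that share the same last two digits would fight over one slot, which is exactly the problem the following slides address.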
Hash Table Fundamentals (contd.)
Good hashing functions spread out key values so that they don’t all go to the same position in the table. In particular, key values are often similar, so a good hashing function maps similar key values to dissimilar locations in the table. If you put enough values in a hash table, eventually two keys will hash to the same location. That’s called a collision. When a collision occurs, you need a collision-resolution policy that determines what to do. Another useful feature, not provided by some hash tables, is the ability to remove a hashed key. Different kinds of hash tables use different methods to provide these features.
Chaining A hash table with chaining uses a collection of entries called buckets to hold key values. Each bucket is the top of a linked list holding the items that map to that bucket. Typically the buckets are arranged in an array, so you can use a simple hashing function to determine a key’s bucket. For example, if you have N buckets, and the keys are numeric, you could map the key K to bucket number K mod N.
Chaining (contd.) To add a key to the hash table, you map the key to a bucket using the hash function, and then add a new cell to the bucket’s linked list. Hashing the key to find its bucket takes O(1) steps. Adding the value to the top of the linked list takes O(1) steps. If the hash table uses B buckets and holds a total of N items, and the items are reasonably evenly distributed, each bucket’s linked list holds roughly N / B items. So checking that a new item isn’t already present in a bucket takes O(N / B) steps. That means adding an item to the hash table takes a total of O(1) + O(N / B) = O(N / B) steps.
Chaining (contd.) To find an item, you hash its key to see which bucket should hold it and then traverse that bucket’s linked list until you find the item or come to the end of the list. If you get to the end of the list, you can conclude that the item isn’t in the hash table. This also takes O(N / B) steps. To remove an item, hash its key as usual to find its bucket, and then remove the item from the bucket’s linked list. Hashing the item takes O(1) steps, and removing it takes O(N / B) steps, so the total time is O(N / B).
Chaining (contd.) A hash table with chaining can expand and shrink as needed, so you don’t need to resize it if you don’t want to. If the linked lists become too long, however, finding and removing items will take a long time. In that case you may want to enlarge the table to make more buckets. When you rehash the table, you know that you will not be adding any duplicate items, so you don’t need to search to the end of each bucket’s linked list, looking for duplicates. That allows you to rehash all the items in O(N) time.
Open Addressing
Advantage of chaining: the table can hold any number of values without changing the number of buckets.
Disadvantage of chaining: putting too many items in the buckets makes searching through them slower.
In open addressing, the values are stored in an array, and some sort of calculation serves as the hashing function, mapping values into positions in the array. For example, if a hash table uses an array with M entries, a simple hashing function might map the key value K into array position K mod M. Different variations of open addressing use different hashing functions and collision-resolution policies.
Open Addressing (contd.)
A collision-resolution policy produces a sequence of locations in the array for a value. If a value maps to a location that is already in use, the algorithm tries other locations until it either finds an empty location or concludes that it cannot find one. The sequence of locations that the algorithm tries for a value is called its probe sequence. If the average probe sequence length is only 1 or 2, adding and locating items takes O(1) time. Open addressing is fast but has some disadvantages. Performance degrades if the array becomes too full: if the array contains N items and is completely full, it takes O(N) time to conclude that an item is not present, and even finding items that are present can be very slow.
Open Addressing (contd.)
If the array becomes too full, you can resize it to make it bigger and give the hash table a smaller fill percentage. To do that, create a new array and rehash the items into it. If the new array is reasonably large, it should take O(1) time to rehash each item, for a total runtime of O(N).
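The resize-and-rehash step can be sketched as follows (a minimal sketch assuming linear probing with stride 1; the function name is mine):

```python
# Rehash every item from the old array into a larger one. If the new
# array is reasonably large, each insertion finds a slot quickly, so
# the whole pass is roughly O(N).
def rehash(old_table, new_size):
    new_table = [None] * new_size
    for item in old_table:
        if item is None:
            continue                        # skip empty slots
        pos = item % new_size               # hash into the NEW size
        while new_table[pos] is not None:   # probe for an empty slot
            pos = (pos + 1) % new_size
        new_table[pos] = item
    return new_table
```

Note that positions change when the size changes: an item at `item % old_size` generally lands somewhere else under `item % new_size`, which is why every item must be rehashed rather than copied.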
Removing Items in Open Addressing
Open addressing does not allow you to remove items the way chaining does. An item in the array might be part of another item’s probe sequence. If you remove an item, you may break the other item’s probe sequence, so you can no longer find the second value. One solution to this problem is to mark the item as deleted instead of resetting the array’s entry to the empty value. When you search for a value, you continue searching if you find the deleted value. When you insert a new value into the hash table, you can place it in a previously deleted entry if you find one in the probe sequence. One drawback of this approach is that if you add and then remove many items, the table may become full of deleted entries. That won’t slow down insertions but will make searching for items slower.
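A sketch of this lazy-deletion idea, assuming linear probing with stride 1 (the EMPTY/DELETED markers and function names are illustrative):

```python
EMPTY = None
DELETED = "DELETED"   # tombstone marker for removed entries

def find(table, key):
    """Return key's index, or -1; keep probing past DELETED entries."""
    m = len(table)
    pos = key % m
    for _ in range(m):                # visit at most every location
        if table[pos] == key:
            return pos
        if table[pos] is EMPTY:       # a truly empty slot ends the search
            return -1
        pos = (pos + 1) % m           # DELETED or another key: keep going
    return -1

def delete(table, key):
    """Mark the entry DELETED rather than emptying it, so other items'
    probe sequences that pass through this slot are not broken."""
    pos = find(table, key)
    if pos >= 0:
        table[pos] = DELETED
```

If 1 and 11 both hash to slot 1 (so 11 probes on to slot 2), deleting 1 and emptying its slot would make 11 unfindable; the tombstone keeps the probe path intact.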
Linear Probing in Open Addressing
In linear probing, the collision-resolution policy adds a constant number to each location to generate the probe sequence. This constant is called the stride and is usually set to 1. Each time the algorithm adds the stride, it takes the result modulo the size of the array, so the sequence wraps around to the beginning of the array if necessary. For example, suppose the hash table’s array contains 10 entries, so the value 71 maps to location 71 mod 10 = 1. That location already contains the value 61, so the algorithm moves to the next location in the value’s probe sequence, location 2. That location is also occupied, so the algorithm moves to the next location, location 3. That location is empty, so the algorithm places 71 there.
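The insertion just described can be written as code (the function name is mine, and the value 42 at location 2 is a made-up occupant; the slide only says the location is occupied):

```python
def insert_linear(table, value):
    """Insert value with stride-1 linear probing; return the index
    used, or -1 if the table is full."""
    m = len(table)
    pos = value % m
    for _ in range(m):            # probe at most m locations
        if table[pos] is None:
            table[pos] = value
            return pos
        pos = (pos + 1) % m       # wrap around via the modulus
    return -1

table = [None] * 10
table[1] = 61                     # location 1 holds 61, as in the slide
table[2] = 42                     # location 2 is also occupied
```

Inserting 71 into this table probes locations 1 and 2, finds both occupied, and settles at location 3, matching the walkthrough above.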
Linear Probing in Open Addressing (contd.)
Advantages: It is very simple, and a probe sequence will eventually visit every location in the array, so the algorithm can insert an item if any space is left. Disadvantage: primary clustering, an effect in which items added to the table tend to cluster, forming large blocks of contiguous array entries that are all full. This is a problem because it leads to long probe sequences: if you try to add a new item that hashes to any of the entries in a cluster, the item’s probe sequence will not find an empty location until it crosses the whole cluster.
Linear Probing in Open Addressing (contd.)
In the example program, the hash table’s array has 101 entries and holds 50 values. If the items were evenly distributed within the array, the probe sequence for every item in the table would have a length of 1, and the probe sequences for items not in the table would have lengths of 1 or 2, depending on whether the initial hashing mapped the item to an occupied location. However, the program shows that the hash table’s average probe sequence length is 2.42, a bit above what you would get with an even distribution. The situation is worse with higher load factors.
Quadratic Probing in Open Addressing
Instead of adding a constant stride to locations, the quadratic probing algorithm adds the square of the number of locations it has already tried. If K, K + 1, K + 2, K + 3, ... is the probe sequence created by linear probing, the sequence created by quadratic probing is K, K + 1², K + 2², K + 3², ... . If two items map to different positions in the same cluster, they no longer follow the same probe sequence and don’t necessarily end up adding to the cluster. In the figure, the value 71 has the probe sequence 1, 1 + 1² = 2, 1 + 2² = 5, 1 + 3² = 10, so it doesn’t add to the cluster.
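The quadratic probe sequence K, K + 1², K + 2², ... (reduced modulo the table size) can be generated directly (the helper name is mine):

```python
def quadratic_probes(key, table_size, max_probes):
    """Return the first max_probes locations in key's quadratic
    probe sequence: start, start + 1^2, start + 2^2, ... (mod size)."""
    start = key % table_size
    return [(start + i * i) % table_size for i in range(max_probes)]
```

For a value starting at location 1 in a large table, this yields 1, 2, 5, 10, matching the sequence in the slide’s example.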
Quadratic Probing in Open Addressing (contd.)
The example program shows that quadratic probing gives a shorter average probe sequence length than linear probing. In this example, quadratic probing gave an average probe sequence length of 1.92, whereas linear probing gave an average length of 2.42.
Quadratic Probing in Open Addressing (contd.)
Disadvantage: Quadratic probing reduces primary clustering, but it can suffer from secondary clustering. In secondary clustering, values that map to the same initial position in the array follow the same probe sequence, so they create a cluster. This cluster is spread out through the array, but it still results in longer probe sequences for the items that map to the same initial position. Quadratic probing also has the drawback that it may fail to find an empty location for a value even if a few empty positions are left in the table. Because of how a quadratic probe sequence jumps farther and farther through the array, it may jump over an empty position and not find it.
Pseudorandom Probing in Open Addressing
Pseudorandom probing is similar to linear probing, except that the stride is given by a pseudorandom function of the initially mapped location. In other words, if a value initially maps to position K, its probe sequence is K, K + p, K + 2 * p, ..., where p is determined by a pseudorandom function of K. Like quadratic probing, pseudorandom probing prevents primary clustering. Also like quadratic probing, pseudorandom probing is subject to secondary clustering, because values that map to the same initial position follow the same probe sequences. Pseudorandom probing may also skip over some unused entries and fail to insert an item even though the table isn’t completely full.
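A sketch of pseudorandom probing; the stride p is a deterministic pseudorandom function of the initial position K. The particular mixing constants below are an arbitrary illustrative choice, not taken from the slides:

```python
def pseudorandom_stride(start, table_size):
    """Derive a stride in 1 .. table_size - 1 from the start position
    (linear-congruential-style mixing; any deterministic function of
    the start position would do)."""
    return (start * 1103515245 + 12345) % (table_size - 1) + 1

def pseudorandom_probes(key, table_size, max_probes):
    start = key % table_size
    p = pseudorandom_stride(start, table_size)   # p depends only on start
    return [(start + i * p) % table_size for i in range(max_probes)]
```

Because the stride depends only on the initial position, two keys that map to the same start follow identical sequences, which is exactly the secondary clustering described above.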
Double Hashing in Open Addressing
Instead of using a pseudorandom function of the initial location to create a stride value, double hashing uses a second hashing function to map the original value to a stride. For example, suppose the values A and B both initially map to position K. In pseudorandom probing, a pseudorandom function F1 generates a stride p = F1(K), and both values use the probe sequence K, K + p, K + 2 * p, K + 3 * p, ... . In contrast, double hashing uses a second hash function F2 to map the original values A and B to two different stride values pA = F2(A) and pB = F2(B). The two probe sequences start at the same value K, but after that they differ. Double hashing eliminates primary and secondary clustering. However, like pseudorandom probing, double hashing may skip some unused entries and fail to insert an item even though the table isn’t completely full.
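A sketch of double hashing; the second hash operates on the original value (not its initial position), so two values that collide at the same position still get different probe sequences. The second hash function below is an illustrative choice:

```python
def double_hash_probes(value, table_size, max_probes):
    """Probe sequence start, start + stride, start + 2*stride, ...
    where the stride comes from a second hash of the value itself."""
    start = value % table_size
    # Second hash function; the "1 +" guarantees a nonzero stride.
    stride = 1 + (value // table_size) % (table_size - 1)
    return [(start + i * stride) % table_size for i in range(max_probes)]
```

For example, with a table of size 10, the values 3 and 13 both start at position 3, but their strides differ (1 versus 2), so their probe sequences diverge immediately after the first location.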
Ordered Hashing in Open Addressing
In some applications, it is more important that the program be able to find values quickly than to insert them quickly; for example, a program that uses a dictionary, an address book, or a product lookup table. A hash table with chaining can find items more quickly if its linked lists are sorted: when searching for an item, the algorithm can stop if it ever finds an item that is larger than the target item. The pseudocode on the next slide shows at a high level how you can find an item in an ordered hash table.
Ordered Hashing in Open Addressing (contd.)
Pseudocode to find an item in an ordered hash table:

// Return the location of the key in the array, or -1 if it is
// not present.
Integer: FindValue(Integer: array[], Integer: key)
    Integer: probe = <Initial location in key's probe sequence.>

    // Repeat forever.
    While true
        // See if we found the item.
        If (array[probe] == key) Then Return probe

        // See if we found an empty spot.
        If (array[probe] == EMPTY) Then Return -1

        // See if we passed where the item should be.
        If (array[probe] > key) Then Return -1

        // Try the next location in the probe sequence.
        probe = <Next location in key's probe sequence.>
    End While
End FindValue
Ordered Hashing in Open Addressing (contd.)
For the given pseudocode to work, each probe sequence in the table must be properly ordered so that you can search the table quickly. Unfortunately, you often cannot add the items to a hash table in sorted order, because you don’t know that order when you start. Fortunately, there is a way to create an ordered hash table no matter how you add the items. To add an item, follow its probe sequence as usual. If you find an empty spot, insert the item, and you’re done. If you find a spot containing a value that is larger than the new value, replace it with the new value, and then rehash the larger value. As you rehash the larger item, you may encounter another, even larger value. If that happens, drop the item you’re currently hashing into the new position and rehash the even larger item. Continue the process until you find an empty spot for whatever item you’re currently hashing. The pseudocode on the next slide shows the process at a high level.
Ordered Hashing in Open Addressing (contd.)
Pseudocode to add an item to an ordered hash table:

AddItem(Integer: array[], Integer: key)
    Integer: probe = <Initial location in key's probe sequence.>

    // Repeat forever.
    While true
        // See if we found an empty spot.
        If (array[probe] == EMPTY) Then
            array[probe] = key
            Return
        End If

        // See if we found a value greater than "key".
        If (array[probe] > key) Then
            // Place the key here and rehash the other item.
            Integer: temp = array[probe]
            array[probe] = key
            key = temp
        End If

        // Try the next location in the probe sequence.
        probe = <Next location in key's probe sequence.>
    End While
End AddItem
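The two routines can be rendered concretely in Python, assuming linear probing with stride 1 (the slides leave the probe sequence abstract) and a table that always keeps at least one empty slot:

```python
def add_ordered(table, key):
    """Insert key, displacing any larger value it meets and continuing
    to rehash the displaced value until an empty slot is found."""
    pos = key % len(table)
    while True:
        if table[pos] is None:        # empty spot: insert and stop
            table[pos] = key
            return
        if table[pos] > key:          # displace the larger value and
            table[pos], key = key, table[pos]   # keep rehashing it
        pos = (pos + 1) % len(table)  # next location in probe sequence

def find_ordered(table, key):
    """Return key's index, or -1 as soon as we hit an empty slot or a
    value larger than key (we passed where key would have to be)."""
    pos = key % len(table)
    while True:
        if table[pos] == key:
            return pos
        if table[pos] is None or table[pos] > key:
            return -1
        pos = (pos + 1) % len(table)
    return -1
```

The payoff is in `find_ordered`: an unsuccessful search can stop at the first larger value instead of walking the entire probe sequence to an empty slot.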