Data Structures Hashing
Hashing
- Other search techniques require multiple comparisons.
- The goal of hashing is to reduce the number of comparisons to one; that is, we wish to locate the key immediately.
CIS265/506: Chapter 11 Hashing
Hashing
- A hash search is a search in which the key, through an algorithmic function, determines the location of the data.
- Hashing is a key-to-address transformation in which the keys map to addresses in a list.
Hash Function
- F(key) = address
- The hash function takes a key as its input and produces an address as its output.
Mapping a Key Value Into a Record Location
[Figure: the keys 102002, 107095, and 111060 pass through the hash function, which produces the addresses 5, 2, and 100 in a list with addresses [001] through [100]. Occupied addresses shown: 001 Harry Lee, 002 Sarah Trapp, 005 Vu Nguyen, 007 Ray Black, 100 John Adams.]
Synonyms
- Two (or more) different key values may produce the same location.
- We call the set of keys that hash to the same location in our list synonyms.
Collisions
- If the actual data that we insert into our list contain two or more synonyms, then we will have a collision.
- A collision is the event that occurs when a hashing algorithm produces an address for insertion that is already occupied.
- The address produced by the hashing algorithm is called the home address.
- The memory that contains all the home addresses is called the prime area.
- When two keys collide at a home address, we must resolve the collision by placing one of the keys and its data at another location.
Searching
- If we use one hashing algorithm to insert a key, we must use the same algorithm to find it!
- Each calculation of an address and test for success (is the key and its data in the list?) is called a probe.
Looking for the Ideal Hashing Function
- Simple to use
- Instantaneous computation
- No waste of space
- Symmetrical (data placement and retrieval use the same mechanism)
Some examples follow.
Direct Hashing
- The key is the address, without any algorithmic manipulation.
- The supporting data structure must contain an element for every possible key.
- Uses are somewhat limited.
- Never a risk of synonyms (Why?)
Direct Hashing
- Situation #1: Let's say we need to look at daily sales figures for one month.
- We could set up an array with 31 distinct elements.
- We could use the day as the key (and address) and the sales figures as the data.

Key   Data
01    27.61
02    32.45
03    34.21
...   ...
29    81.33
30    65.99
31    00.00

- Since we are dealing with direct hashing, the address and the key are the same:
  dailySales[current_day] = dailySales[current_day] + sale_amount
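The assignment above can be sketched in Python as follows. This is a minimal illustration, not the course's implementation: the function name `record_sale` and the choice to leave index 0 unused (so that day 1 maps to element 1) are assumptions.

```python
# Direct hashing: the day of the month is both the key and the array index.
DAYS_IN_MONTH = 31

# Assumption: index 0 is left unused so that day 1 maps to daily_sales[1].
daily_sales = [0.0] * (DAYS_IN_MONTH + 1)


def record_sale(current_day: int, sale_amount: float) -> None:
    """Add one sale to the running total for the given day (1..31)."""
    if not 1 <= current_day <= DAYS_IN_MONTH:
        raise ValueError(f"day out of range: {current_day}")
    daily_sales[current_day] += sale_amount


# Two sales on day 1 and one sale on day 2.
record_sale(1, 20.00)
record_sale(1, 7.61)
record_sale(2, 32.45)
```

No search is needed: reading `daily_sales[1]` goes straight to the total for day 1.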
Direct Hashing
- That sounds easy! What are some drawbacks?
- Wasteful in terms of space (e.g., an SSN or CSU school ID produces a very large prime area).
- How do you find every possible case?
Subtraction Method
- If our keys were sequential, but did not start from one (or zero), we could use the subtraction method.
- Example: the keys run from 1000 to 1100. We would subtract 1000 from the key to determine its address.
- Same problems and issues as the direct method.
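A minimal sketch of the subtraction method for the key range in the example. The names `FIRST_KEY`, `LAST_KEY`, and `subtraction_hash` are hypothetical, and the sketch assumes zero-based addresses.

```python
FIRST_KEY = 1000
LAST_KEY = 1100

# One element per possible key: 101 elements for keys 1000..1100.
table = [None] * (LAST_KEY - FIRST_KEY + 1)


def subtraction_hash(key: int) -> int:
    """Map a key in FIRST_KEY..LAST_KEY to a zero-based address."""
    if not FIRST_KEY <= key <= LAST_KEY:
        raise ValueError(f"key out of range: {key}")
    return key - FIRST_KEY


table[subtraction_hash(1042)] = "data for key 1042"
```

As with direct hashing, the table needs an element for every possible key, whether or not that key is ever used.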
Modulo-Division Method
- Also known as the division-remainder method.
- Divides the key by the array size and uses the remainder plus one to produce the address:
  address = (key MODULUS listSize) + 1
- Can work with any list size.
- If the list size is a prime number, there tend to be fewer collisions.
Modulo-Division Method - Example
- If we needed to store 300 pieces of information, we would choose an array size of 307 (the next prime number above 300).
- Now, assume we have key 121267.
- Q. In what array location (address) is this key stored?
  121267 / 307 = 395 with a remainder of 2
  address = (key MODULUS listSize) + 1
          = (121267 modulus 307) + 1
          = 2 + 1
          = 3
- A. The address is 3.
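The worked example can be checked with a short Python sketch. The function name `modulo_division_hash` is an assumption; the formula is the one used in the example (remainder plus one, giving addresses 1 through listSize).

```python
def modulo_division_hash(key: int, list_size: int) -> int:
    """Return an address in 1..list_size using the remainder plus one."""
    return key % list_size + 1


# The slide's example: key 121267 in a list of 307 elements.
address = modulo_division_hash(121267, 307)
```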
Digit Extraction Method
- Selected digits are extracted from the key and used as the address. In this example the first, third, and fourth digits are used:

key      address
379452   394
121267   112
378845   388
Mid-square Method
- The key is squared and the address is selected from the middle of the squared number.
- The full squared number may be too large for the computer! If the key has 6 digits, the product can have 12 digits, which is larger than the size of an integer on many computers.
- Example: assume we have a 4-digit address (0000-9999).
- Q. What is the address of the key 9452?
- A. 9452 * 9452 = 89340304, so the address is the middle four digits: 3403.
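A minimal sketch of the mid-square calculation. The function name and the rule for centering the extracted digits are assumptions. Python integers do not overflow, so the size concern mentioned above does not arise here, although it would in a language with fixed-width integers.

```python
def midsquare_hash(key: int, address_digits: int = 4) -> int:
    """Square the key and return the middle address_digits digits."""
    squared = str(key * key).zfill(address_digits)
    start = (len(squared) - address_digits) // 2
    return int(squared[start:start + address_digits])


# 9452 squared is 89340304; the middle four digits are 3403.
address = midsquare_hash(9452)
```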
Folding Methods: Fold Shift Method
- The key value is divided into parts whose size matches the address size.
- The left and right parts are shifted and added to the middle part.
- Example: address size 3, key 123456789

    123   (left)
    456   (middle)
  + 789   (right)
  -----
   1368

- The leading 1 is discarded, so the address for the key 123456789 is 368.
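A minimal sketch of the fold shift calculation. The function name is hypothetical, and splitting the key from the left into address-sized parts is an assumption for keys whose length is not a multiple of the address size.

```python
def fold_shift_hash(key: int, address_digits: int = 3) -> int:
    """Split the key into address-sized parts, add them, drop the carry."""
    digits = str(key)
    parts = [digits[i:i + address_digits]
             for i in range(0, len(digits), address_digits)]
    total = sum(int(part) for part in parts)
    # Keep only the low-order address_digits digits (discard the overflow).
    return total % 10 ** address_digits


address = fold_shift_hash(123456789)  # 123 + 456 + 789 = 1368, keep 368
```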
Folding Methods: Fold Boundary Method
- The key value is divided into parts whose size matches the address size.
- The left and right parts are folded (their digits are reversed) and added to the middle part.
- Example: address size 3, key 123456789

    321   (left, reversed)
    456   (middle)
  + 987   (right, reversed)
  -----
   1764

- The leading 1 is discarded, so the address for the key 123456789 is 764.
- Note that the two folding methods produce different addresses!
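A minimal sketch of the fold boundary calculation. The function name is hypothetical, and reversing only the first and last parts is an assumption for keys that split into more than three parts.

```python
def fold_boundary_hash(key: int, address_digits: int = 3) -> int:
    """Reverse the outer parts, add all parts, and drop the carry."""
    digits = str(key)
    parts = [digits[i:i + address_digits]
             for i in range(0, len(digits), address_digits)]
    if len(parts) > 1:
        parts[0] = parts[0][::-1]    # fold the left part
        parts[-1] = parts[-1][::-1]  # fold the right part
    total = sum(int(part) for part in parts)
    return total % 10 ** address_digits


address = fold_boundary_hash(123456789)  # 321 + 456 + 987 = 1764, keep 764
```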
Collision Resolution
- Whenever we are not using a direct one-to-one mapping of keys to addresses, there is a potential for a collision.
[Figure: the same hashed list as before, with addresses [001] through [100]. A new key, 202002, hashes to address 5, which is already occupied. Uh-oh! Two keys hash to the same address.]
A Little Bit of Theory
- Because collisions must be anticipated when hashing, it is necessary to keep some empty elements in the list.
- As a rule of thumb, a hashed list should never be more than 75% full.
Load Factor
- The number of elements in the list divided by the number of physical elements allocated for the list, expressed as a percentage.
- Assigned the symbol alpha (α).
- k: the number of filled elements
- n: the total number of elements allocated
- α = (k / n) × 100
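As a quick illustration of the formula, the sketch below computes the load factor for a hypothetical list. The function name and the sample sizes (300 filled elements in a 307-element list) are assumptions chosen for illustration.

```python
def load_factor(filled: int, allocated: int) -> float:
    """Return the load factor as a percentage: (k / n) * 100."""
    return filled / allocated * 100


# 300 filled elements in a 307-element list is far above 75% full.
alpha = load_factor(300, 307)

# With 400 allocated elements, the same data gives exactly 75%.
alpha_roomier = load_factor(300, 400)
```

Under the 75% rule of thumb, 300 elements would therefore call for a list of at least 400 elements.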
Clustering
- As data are added to a list and collisions are resolved, some hashing algorithms tend to cause data to group within the list.
- The tendency of data to build up unevenly across a hashed list is called clustering.
- A high degree of clustering decreases search efficiency.
Primary Clustering
- Primary clustering occurs when data become clustered around a home address.
- Easy to identify.
Secondary Clustering
- Secondary clustering occurs when data become grouped along a collision path throughout a list.
- Not easy to identify.
- Rapidly decreases search efficiency.
- The data are widely distributed across the list, so the list appears to be evenly distributed.
- If the data all lie along a well-traveled collision path, the time to locate a requested element of data can become large.
Secondary Clustering
- Example: assume you have a group of n people, and we use their birthdates as keys.
- What size group is needed before you find two (or more) people with the same birthday (not necessarily the same year, but the same date)?
- Factoid: if there are 23 or more people in a group, there is a better than 50% chance that two of them have the same birthday.
- If we extrapolate this statistical curiosity to our hashing methods, we could say: if we have a list of 365 empty addresses (one for each day in a non-leap year) and the keys hash uniformly, there is a better than 50% chance of a collision within the first 23 inserts.
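The birthday figure can be checked numerically. The sketch below assumes every address is equally likely for every key (an idealized hash function); the function name is hypothetical.

```python
def collision_probability(inserts: int, list_size: int = 365) -> float:
    """Probability that at least two of the inserts share an address."""
    p_no_collision = 1.0
    for already_placed in range(inserts):
        # The next key must miss every address that is already occupied.
        p_no_collision *= (list_size - already_placed) / list_size
    return 1.0 - p_no_collision


p_22 = collision_probability(22)
p_23 = collision_probability(23)
```

Under this assumption the probability first exceeds 50% at the 23rd insert, even though the list is then only about 6% full.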
Open Addressing Collision Resolution Methods
- Four variations on the theme:
  - Linear probe
  - Quadratic probe
  - Double hashing
  - Key offset
- Collisions are resolved by placing the offending key at an unoccupied address in the prime area.
Linear Probe
- The simplest method.
- If there is a collision, we add one to the address and try to insert the data in that location.
- If that fails, add one and try again.
- Keep going until there is success.
[Figure: the same hashed list, with addresses [001] through [100]. The new key 202002 hashes to address 5, which is already occupied. Collision!]
Linear Probe
- Advantages:
  - Easy to implement
  - Data tend to remain near the home address
- Disadvantages:
  - Tends to produce primary clustering
  - Makes search algorithms more complex, especially after data have been deleted
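A minimal sketch of insertion and search with a linear probe. Several details are assumptions rather than part of the slides: the class and method names, the modulo-division home address, wrapping from the last address back to address 1, and the omission of deletion (which, as noted above, complicates searching).

```python
class LinearProbeTable:
    """Open addressing with a linear probe; addresses run 1..list_size."""

    def __init__(self, list_size: int):
        self.list_size = list_size
        self.slots = [None] * (list_size + 1)  # index 0 is unused

    def _home(self, key: int) -> int:
        return key % self.list_size + 1

    def _next(self, address: int) -> int:
        # Add one, wrapping from the last address back to address 1.
        return address % self.list_size + 1

    def insert(self, key: int, data) -> int:
        """Store the entry and return the address where it landed."""
        address = self._home(key)
        for _ in range(self.list_size):
            entry = self.slots[address]
            if entry is None or entry[0] == key:
                self.slots[address] = (key, data)
                return address
            address = self._next(address)
        raise OverflowError("list is full")

    def search(self, key: int):
        """Return the data for key, or None if the key is absent."""
        address = self._home(key)
        for _ in range(self.list_size):
            entry = self.slots[address]
            if entry is None:
                return None
            if entry[0] == key:
                return entry[1]
            address = self._next(address)
        return None
```

In a 7-element list, keys 10, 17, and 24 all have home address 4, so they end up at addresses 4, 5, and 6: a small primary cluster.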
Quadratic Probe
- Similar to the linear probe, but the increment is the collision probe number squared:
  - Try (probe) number 1: add 1²
  - Try (probe) number 2: add 2² to try #1
  - Try (probe) number 3: add 3² to try #2
- Advantage: eliminates primary clustering.
- Disadvantages:
  - Secondary clustering remains.
  - Inefficient because of the time required to square the probe number.
  - It is not possible to generate a new address for every element in the list. If the list size is a prime number, it is possible to reach at least half the elements in the list.
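A minimal sketch of the quadratic probe sequence described above, where each try adds the square of the probe number to the previous try. The function name and the wrap-around rule (addresses 1..listSize) are assumptions.

```python
def quadratic_probe_sequence(home: int, list_size: int, max_probes: int):
    """Return the addresses visited after a collision at the home address."""
    address = home
    sequence = []
    for probe in range(1, max_probes + 1):
        # Add probe squared to the previous address, wrapping into 1..list_size.
        address = (address - 1 + probe * probe) % list_size + 1
        sequence.append(address)
    return sequence


first_tries = quadratic_probe_sequence(1, 100, 4)  # 1+1, 2+4, 6+9, 15+16
small_list = quadratic_probe_sequence(1, 7, 7)
```

In the 7-element list, the probes from home address 1 never reach addresses 4 and 5, which illustrates why not every element can be reached.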
Double Hashing
- Rather than using an arithmetic probe function (e.g., the linear probe), we re-hash the address.
- Prevents primary clustering.
Pseudorandom Collision Resolution
- We use a pseudorandom number generator such as:
  y = ((ax + c) modulo listSize) + 1
  where x is the address at which the collision occurred, and a and c are pre-defined constants.
- A relatively simple solution.
- Once a collision occurs, there is only one collision resolution path, which is followed by all the keys.
- Can create significant secondary clustering.
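A minimal sketch of the pseudorandom formula. It assumes that x is the address where the collision occurred; the constants a = 3 and c = 5, the list size of 307, and the function name are arbitrary choices for illustration.

```python
def pseudorandom_next_address(collision_address: int, list_size: int,
                              a: int = 3, c: int = 5) -> int:
    """Apply y = ((a*x + c) modulo listSize) + 1 to the collided address."""
    return (a * collision_address + c) % list_size + 1


# Every key that collides at address 2 is sent to the same next address.
next_address = pseudorandom_next_address(2, 307)
```

Because the result depends only on the collided address and not on the key, all keys that collide there follow one path, which is the source of the secondary clustering noted above.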
Key Offset
- Key offset is a double hashing method that produces different collision paths for different keys.
- Calculates the new address as a function of the old address and the key:
  offset = [key / listSize]   (integer quotient)
  address = ((offset + oldAddress) modulo listSize) + 1
- Example, for key 166702 and listSize 307:
  oldAddress = (166702 modulo 307) + 1 = 2
  offset = [166702 / 307] = 543
  address = ((543 + 2) modulo 307) + 1 = 239
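The key offset example can be reproduced with the sketch below. The function name is hypothetical, and the second key (572556) is an extra value picked for illustration because it has the same home address as 166702.

```python
def key_offset_next_address(key: int, old_address: int, list_size: int) -> int:
    """Compute the next address from the key and the collided address."""
    offset = key // list_size  # integer quotient
    return (offset + old_address) % list_size + 1


LIST_SIZE = 307

# Both keys have home address 2 under (key modulo 307) + 1 ...
home_a = 166702 % LIST_SIZE + 1
home_b = 572556 % LIST_SIZE + 1

# ... but their offsets differ, so they follow different collision paths.
next_a = key_offset_next_address(166702, home_a, LIST_SIZE)
next_b = key_offset_next_address(572556, home_b, LIST_SIZE)
```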
Linked List Resolution
- A major disadvantage of open addressing is that each collision resolution increases the probability of future collisions.
- This is eliminated by using linked lists.
[Figure: a prime area with addresses [001] through [307] holding the records 30451 Harry Lee, 00432 Sarah Trapp, 02305 Vu Nguyen, 23007 Ray Black, and 47100 John Adams. Synonyms such as 49742 Peter Smith and 86351 Harry Eagle are stored in linked lists attached to their home addresses.]
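A minimal sketch of linked list resolution. Python lists stand in for the linked lists, and the class name, method names, and modulo-division home address are assumptions, not the hash function used in the figure.

```python
class ChainedHashTable:
    """Each home address holds a chain of (key, data) entries."""

    def __init__(self, list_size: int = 307):
        self.list_size = list_size
        # Index 0 is unused so that addresses run 1..list_size.
        self.chains = [[] for _ in range(list_size + 1)]

    def _home(self, key: int) -> int:
        return key % self.list_size + 1

    def insert(self, key: int, data) -> None:
        chain = self.chains[self._home(key)]
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, data)  # replace the existing entry
                return
        chain.append((key, data))  # synonyms share the chain

    def search(self, key: int):
        for existing_key, data in self.chains[self._home(key)]:
            if existing_key == key:
                return data
        return None
```

A collision only lengthens the chain at one home address; it never occupies another key's home address, so it does not make later collisions more likely.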
Bucket Hashing
- Another way to avoid collisions is to hash to "buckets", or nodes, that are large enough to hold multiple keys.
- Collisions are postponed until the buckets are filled.
- Wasteful in terms of space.
[Figure: main buckets and overflow buckets. Each main bucket holds several keys and ends with a pointer to an overflow bucket; each overflow entry holds a key and a record pointer. Keys shown in the main buckets: 340, 460; 321, 761, 091; 022, 072, 522; 399, 089. Keys shown in the overflow buckets: 981, 182, 652.]
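A minimal sketch of bucket hashing with overflow. It assumes ten buckets of capacity three, with the bucket chosen by the key modulo 10; this is inferred from the way the figure's keys appear to group by last digit and is not stated in the slides. The class name is hypothetical, and the per-bucket overflow lists simplify the shared overflow chain in the figure.

```python
class BucketTable:
    """Hash keys into fixed-capacity buckets with overflow lists."""

    def __init__(self, bucket_count: int = 10, capacity: int = 3):
        self.bucket_count = bucket_count
        self.capacity = capacity
        self.main = [[] for _ in range(bucket_count)]
        self.overflow = [[] for _ in range(bucket_count)]

    def insert(self, key: int) -> str:
        """Store the key and report whether it went to 'main' or 'overflow'."""
        number = key % self.bucket_count
        if len(self.main[number]) < self.capacity:
            self.main[number].append(key)
            return "main"
        self.overflow[number].append(key)  # the bucket is already full
        return "overflow"

    def contains(self, key: int) -> bool:
        number = key % self.bucket_count
        return key in self.main[number] or key in self.overflow[number]
```

With these assumptions, inserting 321, 761, and 91 fills bucket 1, and a fourth synonym such as 981 is the first to go to overflow.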
File Organizations
- Heap or unordered: places the records on disk in no particular order.
- Sorted or sequential: orders the records by a particular field.
- Hashed: uses a hash function to determine record placement.
- B-Trees: uses a tree structure to determine location.
Files of Unordered Records (Heap Files)
- Records are placed in the file in the chronological order in which they are inserted.
- Inserting is very efficient: the last disk block is copied to a buffer, the new record is added, and the block is rewritten to the disk.
- Searching requires a linear search, one record at a time, which is very expensive.
- Deletion:
  - Find the correct block.
  - Copy the block into a buffer, delete the record, and rewrite the block back to the disk.
  - Leaves wasted, empty space.
Files of Ordered Records (Sorted Files)
- We can order records based on the value of a field in the record, called the "ordering field".
- If the ordering field is also the key field (guaranteed to be unique), we call the field the ordering key for the file.
- Reading in key order is very efficient: no sorting is required.
- Finding the next record usually requires no additional block accesses, because the next record is usually in the same block as the previous one.
- Binary search techniques give faster access than a linear search.
Hashed Files
- Called a "hashed" or "direct" file.
- The search condition is such that a record is found, or found to be absent, on a single attempt.
Internal Hashing
- Hashing is typically implemented through an array of records.
- We have seen this technique already.
External Hashing
- Hashing for disk files is called external hashing.
- The target address space is made of "buckets".
- The hash function maps a key to a relative bucket number.
- A table in the file header converts the bucket number into the corresponding disk block address.
- Fastest possible access for retrieving an arbitrary record.
- Most hash functions do not maintain records in hash address order.
- Requires a pre-allocated amount of space.
- What happens when the file grows?
Dynamic Hashing
- The number of buckets is not fixed, but rather grows and shrinks as needed.
- Once a bucket overflows, it is split and a new bucket is created.
- The records are redistributed.
- Internal nodes guide the search: left pointer = 1, right pointer = 0.
- Leaf nodes hold a pointer to a bucket (a bucket address).
[Figure: data file buckets reached through the directory tree. There is one bucket each for records whose hash values start with 000, 001, 01, 10, 110, and 111.]
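A minimal sketch of how a hash value is matched to one of the six buckets in the figure. A flat table of bit prefixes stands in for the directory tree of internal and leaf nodes, and the bucket numbers 0 through 5 and the function name are hypothetical. Reading the hash value one bit at a time plays the role of following one pointer per internal node.

```python
# Prefixes taken from the figure; the bucket numbers are hypothetical.
DIRECTORY = {"000": 0, "001": 1, "01": 2, "10": 3, "110": 4, "111": 5}


def find_bucket(hash_bits: str) -> int:
    """Read the hash value bit by bit until a leaf (bucket) is reached."""
    for length in range(1, len(hash_bits) + 1):
        prefix = hash_bits[:length]
        if prefix in DIRECTORY:
            return DIRECTORY[prefix]
    raise KeyError(f"no bucket for hash value {hash_bits}")


bucket = find_bucket("0110")  # starts with 01
```

Splitting an overflowing bucket would replace one prefix with two longer ones, for example replacing 01 with 010 and 011, and redistribute its records between them.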