LEARNING OBJECTIVES
- O(1), O(N) and O(log N) access times.
- Hashing:
  - definition
  - how it works
  - examples
  - collisions
  - pros and cons of hashing
CPSC 231 Hashing (D.H.)

O(1), O(N) and O(log_k N) access
- O(1) access to a file means that the access time to a record is constant: it does not depend on the file's size.
- O(N) access means that the access time is proportional to the file size, where N is the number of records in the file.
- O(log_k N) access means that the access time is proportional to log_k N.

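To make the three growth rates concrete, the following minimal sketch counts worst-case record accesses for each case. The function names are hypothetical, and the model is deliberately simplified: it assumes a sequential scan for O(N), a k-way search structure for O(log_k N), and a record stored at its home address for O(1).

```python
def sequential_probes(n):
    """Worst-case accesses when every one of n records may have to be read."""
    return n

def tree_probes(n, k=2):
    """Worst-case accesses when each access narrows the search by a factor of k."""
    probes, reach = 0, 1
    while reach < n:          # integer loop avoids floating-point log rounding
        reach *= k
        probes += 1
    return max(probes, 1)

def hashed_probes(n):
    """Accesses when the record sits at its home address: independent of n."""
    return 1

for n in (1_000, 1_000_000):
    print(n, sequential_probes(n), tree_probes(n), hashed_probes(n))
```

Going from 1,000 to 1,000,000 records multiplies the sequential count by 1,000, adds only 10 accesses to the binary (k = 2) case, and leaves the hashed case unchanged.
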
Which of the file organization methods discussed so far has O(N) access time?
Which of the file organization methods discussed so far has O(log_k N) access time?

Hashing - O(1) access time
Hashing is a technique for generating a home address for a given key; ideally that address is unique to the key.
Hashing is used when rapid access to a key (or to its corresponding record) is required.
Hashing can be used to:
- access records in a file
- access items in arrays in memory
- access directories of a file system

How does hashing work?
- A hash function h is applied to a key K to generate an address: home address = h(K).
- The home address of a key is the address the hash function produces for that key.
- If a record is stored at its home address, then the access time to it is O(1).

Example of hashing
Suppose you want to store 75 records in a file in which the key to each record is a person’s name.
Suppose that you set aside space for 1000 records (assuming that your file will grow).
Use the following hash function h(K) for each key: take the ASCII values of the first two characters of the last name and multiply them. Then take the last three digits of the result as the home address.

Example cont.
For the name BALL:
h(K) = h("BALL") = last three digits of (66 * 65) = last three digits of 4290 = 290
So 290 is the home address of the key BALL. This means that if record 290 in the file is available, the record with the key BALL will be stored at this location.

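The hash function from this example can be sketched in a few lines of Python. The name `simple_hash` is hypothetical, and the sketch assumes the key has at least two characters; taking the remainder modulo 1000 is one way to obtain the last three digits.

```python
def simple_hash(key):
    """Multiply the ASCII codes of the first two characters and
    keep the last three digits of the product as the home address."""
    product = ord(key[0]) * ord(key[1])
    return product % 1000   # last three digits, so addresses fall in 0..999

print(simple_hash("BALL"))  # 66 * 65 = 4290, so this prints 290
```
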
Hashing vs indexing - two important differences
- With hashing, the address generated appears to be random: there is no obvious connection between the key and the location of the corresponding record. For this reason hashing is sometimes referred to as randomizing.
- With hashing, two different keys may be transformed to the same address, so two records may be sent to the same place in the file. When this occurs, it is called a collision and some means must be found to deal with it.

Collisions
A collision occurs when two or more keys produce the same home address.
Give an example of a key that will produce a collision with the key BALL from the example of hashing.

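One way to see collisions with the example hash function is to group a handful of names by home address. The names below are made up for illustration, and `simple_hash` is the same hypothetical sketch of the "multiply the first two ASCII codes" function; any two keys that share their first two letters, in either order, necessarily collide under it.

```python
def simple_hash(key):
    """Last three digits of the product of the first two ASCII codes."""
    return (ord(key[0]) * ord(key[1])) % 1000

names = ["BALL", "BAKER", "ABBOTT", "TREE"]   # illustrative keys only

by_address = {}
for name in names:
    by_address.setdefault(simple_hash(name), []).append(name)

for address, keys in sorted(by_address.items()):
    status = "collision" if len(keys) > 1 else "ok"
    print(address, keys, status)
```

Here BALL, BAKER and ABBOTT all land on address 290, because 66 * 65 and 65 * 66 give the same product, while TREE gets address 888.
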
What to do about collisions?
- Collisions cause problems: we cannot put two different records in the same place.
- The ideal solution is a hashing algorithm that avoids collisions altogether. Such an algorithm is called a perfect hashing algorithm.
- A perfect hashing algorithm is usually very difficult (or impossible) to find.

Practical solutions to reduce the number of collisions
- Spread out the records: find a hashing algorithm that spreads the records randomly among the available addresses.
- Use extra memory: it is easier to avoid collisions if we have only a few records to distribute among many addresses than if we have about the same number of records as addresses. (Problem: fragmentation.)

Practical solutions to reduce the number of collisions (cont.)
- Put more than one record at a single address, e.g. make physical records big enough to hold 5 data records. Each home address can then hold up to 5 synonyms (two or more different keys that hash to the same address).
- Addresses that can hold multiple records are called buckets.

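A minimal sketch of the bucket idea, assuming a bucket size of 5 as in the slide and the earlier "multiply the first two ASCII codes" hash function. The names `BUCKET_SIZE`, `simple_hash` and `store` are hypothetical, the buckets are kept in an in-memory dictionary rather than a file, and the sample keys are made up.

```python
BUCKET_SIZE = 5   # data records per home address (the slide's example value)

def simple_hash(key):
    """Last three digits of the product of the first two ASCII codes."""
    return (ord(key[0]) * ord(key[1])) % 1000

def store(buckets, key):
    """Add key to its home bucket; return False if that bucket is full."""
    bucket = buckets.setdefault(simple_hash(key), [])
    if len(bucket) >= BUCKET_SIZE:
        return False          # overflow: still needs a collision strategy
    bucket.append(key)
    return True

buckets = {}
synonyms = ["BALL", "BAKER", "BARNES", "BATES", "BAXTER", "BANKS"]
results = [store(buckets, name) for name in synonyms]
print(results)   # the first five synonyms fit; the sixth overflows
```

Buckets absorb a limited number of synonyms without any extra searching, but a full bucket still has to overflow somewhere.
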
A Simple Hashing Algorithm
Let’s look at an algorithm that randomizes home addresses much better than the hash function presented before. This algorithm has the following three steps:
1. Represent the key in numerical form.
2. Fold and add.
3. Divide by a prime number and use the remainder as the address.

Step one: Represent the Key in Numerical Form
If the key is a number, then this step is already accomplished.
If the key is a string of characters, we may use ASCII codes to convert it to numerical form.
E.g. BALL, padded with four blank spaces:
B  A  L  L  (blank spaces)
66 65 76 76 32 32 32 32

Step two: Fold and Add
Folding and adding means chopping off pieces of the number and adding them together.
E.g.
Folding: 6665 | 7676 | 3232 | 3232
Adding: 6665 + 7676 + 3232 + 3232 = 20805
In order to avoid an overflow in the addition, one may choose to use the mod function with a prime number, e.g. 19,397 (see text p. 469-470):
20805 mod 19397 = 1408

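Steps one and two can be sketched as follows. The function name `fold_and_add` and its parameters are hypothetical; the sketch assumes keys made of upper-case letters and blanks (so every ASCII code has two digits), pads or truncates the key to 8 characters, and applies the modulus after each addition, which gives the same result as applying it once at the end.

```python
def fold_and_add(key, width=8, modulus=19397):
    """Convert key to ASCII codes, fold them into two-character pieces,
    and add the pieces, keeping the running sum below modulus."""
    padded = key.ljust(width)[:width]        # pad with blanks, e.g. "BALL    "
    codes = [ord(ch) for ch in padded]       # step one: numerical form
    total = 0
    for i in range(0, width, 2):             # step two: fold in pairs
        piece = codes[i] * 100 + codes[i + 1]    # 66, 65 -> 6665
        total = (total + piece) % modulus    # guard against overflow
    return total

print(fold_and_add("BALL"))   # (6665 + 7676 + 3232 + 3232) mod 19397 = 1408
```
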
Step three: Divide by a Prime Number Close to the Size of the Address Space
The purpose of this step is to ensure that the final result of the calculation falls within the range of record addresses in the file.
This can be done by taking the current result mod the maximum size of the file.

Step three cont.
If we decide that the file size is 100 records, then we compute a = s mod n, where s is the sum from step two and n is the number of addresses:
a = 1408 mod 100 = 8
(8 is the home address of the key BALL.)
What would be the home address of BALL if we allow 1000 records?

Choosing a Prime Number for n
Choosing the divisor n can have a major effect on how well the records are spread out.
A prime number is usually used for the divisor because primes tend to distribute remainders much more uniformly than non-primes.
E.g. instead of using 100 in the previous example, we could use 101.

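Putting the three steps together, a minimal sketch of the whole algorithm might look like this. The names `fold_and_add` and `home_address` are hypothetical, and the sketch makes the same assumptions as the slides' BALL example: upper-case keys padded with blanks to 8 characters and 19,397 as the overflow-guard modulus.

```python
def fold_and_add(key, width=8, modulus=19397):
    """Steps one and two: ASCII codes, folded in pairs and added."""
    padded = key.ljust(width)[:width]
    total = 0
    for i in range(0, width, 2):
        piece = ord(padded[i]) * 100 + ord(padded[i + 1])
        total = (total + piece) % modulus
    return total

def home_address(key, n_addresses):
    """Step three: the remainder after dividing by the number of addresses."""
    return fold_and_add(key) % n_addresses

print(home_address("BALL", 100))   # 1408 mod 100 = 8
print(home_address("BALL", 101))   # 1408 mod 101 = 95 (prime divisor)
```

A single key cannot show how evenly a divisor spreads records; comparing the address distribution over a realistic set of keys would be the way to check the effect of choosing 101 over 100.
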
Progressive overflow
Progressive overflow is a technique for handling collisions by storing a record in the next available address after its home address.
Progressive overflow is not the most efficient overflow handling technique, but it is one of the simplest and is adequate for many applications. (See fig. 11.4, 11.5, p. 487-488.)

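A minimal in-memory sketch of progressive overflow, using a Python list in place of the file and the earlier "multiply the first two ASCII codes" hash function. The names `simple_hash`, `insert` and `search` are hypothetical, and wrapping around from the last address to the first is an assumption of this sketch.

```python
def simple_hash(key):
    """Last three digits of the product of the first two ASCII codes."""
    return (ord(key[0]) * ord(key[1])) % 1000

def insert(table, key):
    """Store key at its home address, or at the next free address after it."""
    n = len(table)
    home = simple_hash(key) % n
    for step in range(n):
        slot = (home + step) % n      # assumed: wrap around at the end
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no free address left")

def search(table, key):
    """Scan forward from the home address until the key or a free slot."""
    n = len(table)
    home = simple_hash(key) % n
    for step in range(n):
        slot = (home + step) % n
        if table[slot] is None:
            return None               # a free slot means the key is absent
        if table[slot] == key:
            return slot
    return None

table = [None] * 1000
slots = [insert(table, name) for name in ("BALL", "BAKER", "ABBOTT")]
print(slots)                      # home address 290, then 291 and 292
print(search(table, "ABBOTT"))    # found after two extra accesses
```

Records that overflow are no longer at their home address, so finding them takes more than one access.
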
Record Deletion in Hashed Files
Deleting a record from a hashed file is complicated by two requirements:
- the slot freed by the deletion must not be allowed to hinder later searches; and
- it should be possible to reuse the freed slot for later additions.
See the example in fig. 11.9 on page 499.

Tombstones for Handling Deletions
A tombstone is a special marker placed in the key field of a record to mark it as no longer valid.
Tombstones solve two problems associated with the deletion of records:
- the freed space does not break a sequential search for a record (WHY?)
- the freed space is easily recognized and can be reclaimed later (HOW?)
(See fig. 11.10, 11.11, p. 500.)

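The sketch below adds tombstones to an in-memory progressive-overflow table. All names are hypothetical, the tombstone is represented by a Python sentinel object rather than a marker written to a key field on disk, and the insert routine assumes the key is not already present in the table.

```python
EMPTY = None
TOMBSTONE = object()   # assumed stand-in for the special marker in the key field

def simple_hash(key):
    """Last three digits of the product of the first two ASCII codes."""
    return (ord(key[0]) * ord(key[1])) % 1000

def probe_sequence(table, key):
    """Addresses visited by progressive overflow, starting at the home address."""
    n = len(table)
    home = simple_hash(key) % n
    return [(home + step) % n for step in range(n)]

def search(table, key):
    for slot in probe_sequence(table, key):
        if table[slot] is EMPTY:
            return None            # only a truly empty slot ends the search
        if table[slot] == key:     # a tombstone never equals a real key
            return slot
    return None

def insert(table, key):
    for slot in probe_sequence(table, key):
        if table[slot] is EMPTY or table[slot] is TOMBSTONE:
            table[slot] = key      # tombstoned slots are reclaimed here
            return slot
    raise RuntimeError("no free address left")

def delete(table, key):
    slot = search(table, key)
    if slot is not None:
        table[slot] = TOMBSTONE    # mark as deleted instead of emptying
    return slot

table = [EMPTY] * 1000
for name in ("BALL", "BAKER", "ABBOTT"):
    insert(table, name)            # addresses 290, 291, 292
delete(table, "BAKER")             # address 291 becomes a tombstone
print(search(table, "ABBOTT"))     # still found at 292, past the tombstone
print(insert(table, "BANKS"))      # reuses the tombstoned address 291
```

If the delete had simply emptied address 291, the search for ABBOTT would have stopped there and wrongly reported the key as missing.
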
Hashing - Pros and Cons
Pros:
- Hashing can provide faster access than most other file organizations, usually with very little storage overhead, and it is adaptable to most primary keys.
- Ideally, hashing makes it possible to find any record with only one disk access.
Cons:
- The primary disadvantage of hashing is that hashed files may not be sorted by key.