1
Hashing Indirect Address Translation
Jim Skon
2
Key Based Searching
Often we have records of data, where each record is identified by a “key”.
Example: An insurance company with 50,000 customers, look up customer by:
- Customer number
- Last Name, First Name
- Phone Number
Records are stored on Disk Drive.
What is the optimal search strategy based on matching a key?
3
Key Based Searching
Example: A core network switch has a table to route packets flowing through it. Each packet has a destination address. The switch must maintain a routing table that tells which port (link) to send the packet to in order to send it toward the destination.
Address              Out Port
23:43:15:43:23:10    23
43:24:75:34:62:52    12
76:E2:E1:D0:1E:66
D9:10:9D:00:64:6F    43
. . .
47:BE:46:DF:71:8F    2
Must switch packets at rate of 10,000 or more packets a second.
4
Key Based Searching
What is the optimal search strategy based on matching a key?
What is the impact of whether the data being searched is stored in memory or on a secondary storage device?
Typical Access Time:
- Memory access time: ns
- Flash access time: 1-3 ms
- Disk access time: 9-15 ms
Secondary is times slower. What does this mean?
5
Key Based Searching
Claim: Generalized key based search is O(log n) per lookup
- Binary search, or tree traversal
What if we constrain some of the factors?
- Key range and type
- Ultimate size of the file
6
Direct Address Translation
Direct translation
When the Primary Key (PK) and the relative record position (RRP) are the same, we say there is a direct translation. Simple direct access lookup systems use this technique.
Index   First    Last
1       Bill     Yeakus
2       -
3
4       Cindy    Smith
5       Randy    Nelson
. . .
5234    Joseph   Blithers
9999
4 Digit Employee ID: 5234
7
Indirect Address Translation
Analysis – O(1)
Direct translation - problems
The PKs may not be numeric:
- Names
- Alphanumeric IDs
8
Indirect Address Translation
Direct translation - problems
Only a small percent of the possible range of PKs may actually have records assigned to them.
Consider a keyfield for an employee file that is a 9 digit ID number (e.g. Social Security Number). The company has 200 employees. Since the IDs may have any of the 10^9 values, the file will have to be huge (10^9 records!).
Thus the file will have a packing density of:
200 records used / 10^9 total = 2 x 10^-7 = 0.00002%
9
Indirect Address Translation
Hashing
A common technique of indirect translation is hashing: a solution in which the broad range of PK values is transformed into the smaller range of RRP values. Hashing uses a hashing function to translate the key values into the smaller range of the RRP values.
10
Indirect Address Translation
Hashing Algorithms
Development of a hashing function requires careful attention.
The algorithm should distribute the keys as evenly as possible across the range of addresses.
Because there are more possible keys than addresses, some different keys must necessarily map to the same address.
11
Key Transformation Algorithms
3 general steps to convert a key to a RRP address:
1) If key is not numeric, convert it into a numeric form, without losing information.
2) Operate on the numeric key using an algorithm which converts the keys into a spread of numbers of the order of magnitude of the address numbers required.
3) The resulting numbers are multiplied by a constant which compresses the address into the precise range of addresses.
12
Key Transformation Algorithms
Example: Key is a 9 Digit Number. Destination file has 7000 records.
Step 1 - Not needed (already a number).
Step 2 - Divide key by 10,000 to get a remainder between 0000 and 9999.
Step 3 - Multiply the value from step 2 by .7 to put the number within the range 0000 to 6999.
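The three steps for this example can be sketched in Python as follows. This is a minimal illustration, not the course's reference code: the function name `key_to_rrp` is hypothetical, and the divisor of 10,000 is an assumption inferred from the target range 0000 to 6999 after scaling by .7.

```python
def key_to_rrp(key: int, file_size: int = 7000, spread: int = 10_000) -> int:
    """Map a numeric key to a relative record position (RRP)."""
    # Step 1 is skipped: the key is already numeric.
    # Step 2: reduce the key to a spread of numbers 0..spread-1.
    reduced = key % spread
    # Step 3: compress into 0..file_size-1 (7000/10000 is the factor .7).
    # Integer arithmetic avoids floating point rounding surprises.
    return reduced * file_size // spread


print(key_to_rrp(123456789))  # 6789 * .7 = 4752.3, truncated to 4752
```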
13
Key Transformation Algorithms
Example: What would happen if we skip step 2 and simply compress the number from step 1? What about clustered insertions? (Keys with contiguous values.)
14
Key Transformation Algorithms - Division
The key is divided by a number approximately equal to the number of available addresses, and the remainder is taken as the RRP. A prime number or number with no small factors is used.
15
Key Transformation Algorithms - Division
Example: records have 6-digit key, 5000 RRPs desired.
Divide by 4997 and use the remainder.
Consider key 142536: 142536 / 4997 = 28 remainder 2620. Use 2620 as RRP.
How do you suppose this method would work with clustered insertions?
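A small Python sketch of the division method, using the divisor 4997 and the key 142536 from the example. The function name `division_hash` is hypothetical. The last line tries a run of contiguous keys to show how clustered insertions behave under this transform.

```python
def division_hash(key: int, divisor: int = 4997) -> int:
    """Division method: the remainder after dividing by the divisor is the RRP."""
    return key % divisor


quotient, rrp = divmod(142536, 4997)
print(quotient, rrp)  # 28 2620

# Contiguous keys map to contiguous addresses, so a cluster of keys
# spreads over distinct RRPs instead of piling onto one address.
print([division_hash(k) for k in range(142536, 142541)])
# [2620, 2621, 2622, 2623, 2624]
```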
16
Key Transformation Algorithms - Extraction
Select digits from different parts of key.
Example: Records with 10-digit key, 5000 RRPs desired.
Choose 3rd, 5th, 8th and 9th digits. For the example key, the selected digits are 8625.
Compress into RRP range: INT(8625 * .5) = 4312. Use 4312 as RRP.
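A minimal Python sketch of extraction. The example key did not survive in the slide text, so the key `1284607259` below is a hypothetical one, chosen only because its 3rd, 5th, 8th and 9th digits are 8, 6, 2 and 5. The function name `extraction_hash` is also an assumption.

```python
def extraction_hash(key: str, positions=(3, 5, 8, 9), scale: float = 0.5) -> int:
    """Pick digits at the given 1-based positions, then compress by `scale`."""
    digits = "".join(key[p - 1] for p in positions)
    # int() truncates toward zero, matching INT() in the example.
    return int(int(digits) * scale)


# Hypothetical 10-digit key whose selected digits are 8625.
print(extraction_hash("1284607259"))  # INT(8625 * .5) = 4312
```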
17
Key Transformation Algorithms - Folding
Digits in the key are folded inward like folding paper. Then the digits are added. Folding tends to be more appropriate for large keys.
18
Key Transformation Algorithms - Folding
Example
Fold the key left at the 4th digit and right at the 3rd digit. For the example key this results in 4137 and 735.
Add the two resulting values: 4137 + 735 = 4872
Compress into RRP range: 4872 x .5 = 2436. Use 2436 as RRP.
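One way to read the folding example, sketched in Python under stated assumptions: folding reverses the digits of the folded part (as when paper is folded over), only the two folded parts are added, and the key `7314537` is a hypothetical 7-digit key chosen so that the folds give 4137 and 735. The original key is not recoverable from the slide, and the function name `fold_hash` is an assumption.

```python
def fold_hash(key: str, left: int = 4, right: int = 3, scale: float = 0.5) -> int:
    """Fold the first `left` and last `right` digits inward, add, compress."""
    left_part = int(key[:left][::-1])     # first 4 digits, reversed by the fold
    right_part = int(key[-right:][::-1])  # last 3 digits, reversed by the fold
    return int((left_part + right_part) * scale)


# Hypothetical key: "7314" folds to 4137 and "537" folds to 735.
print(fold_hash("7314537"))  # (4137 + 735) * .5 = 2436
```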
19
Key Transformation Algorithms - Mid-square method
Square the key, and use the central digits of the result.
Example: Let records have 6-digit key, and 5000 RRPs desired.
Square the key value and take the central digits, here 1651.
Compress into RRP range: INT(1651 x .5) = 825. Use 825 as RRP.
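A Python sketch of the mid-square method. The key used on the slide is missing, so the key 123456 below is hypothetical, and treating the "central digits" as the middle four digits of the square padded to 12 digits is an assumption. The function name `mid_square_hash` is hypothetical as well.

```python
def mid_square_hash(key: int, key_digits: int = 6, take: int = 4,
                    scale: float = 0.5) -> int:
    """Square the key, take the central `take` digits, compress by `scale`."""
    # Pad so every square of a 6-digit key has exactly 12 digits.
    squared = str(key * key).zfill(2 * key_digits)
    start = (len(squared) - take) // 2
    central = int(squared[start:start + take])
    return int(central * scale)


# 123456 squared is 15241383936; padded: 0152 4138 3936; central digits 4138.
print(mid_square_hash(123456))  # INT(4138 * .5) = 2069
```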
20
Key Transformation Algorithms - Selection
The best way to choose a transform is to take the key set for the file and simulate using different transforms. Choose the one which distributes the records most evenly. The division method seems to be the best general transform.
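The simulation idea can be sketched in Python: run each candidate transform over the key set and count synonyms. The key set, the two transforms, and the name `synonym_count` are all illustrative assumptions. The "scale only" transform is what happens when step 2 is skipped and the raw key is simply compressed.

```python
from collections import Counter


def synonym_count(keys, transform, table_size: int) -> int:
    """Number of records that land on an address some earlier record uses."""
    counts = Counter(transform(k) % table_size for k in keys)
    return sum(c - 1 for c in counts.values())


# Hypothetical clustered key set: 3000 contiguous 6-digit keys.
keys = range(100_000, 103_000)

transforms = {
    "division by 4997": lambda k: k % 4997,
    "scale only": lambda k: k * 5000 // 1_000_000,
}
for name, transform in transforms.items():
    print(name, synonym_count(keys, transform, 5000))
```

On this key set the division transform produces no synonyms, while compressing the raw key squeezes all 3000 records onto 15 addresses.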
21
Important hashing considerations
When designing a practical hashing scheme, several important issues must be addressed:
record distribution
A hashing function needs to be picked which will evenly distribute the records throughout the RRP range.
Different key sets will have different distribution patterns. Thus the hashing function chosen may depend on the patterns of keys in the data set.
22
Important hashing considerations
synonyms
Two or more PKs which transform to the same RRP address.
The goal is to devise a hashing function for a given set of keys which will minimize synonyms.
It is, however, statistically beyond reason to totally avoid synonyms. Not only would all keys need to be known in advance, but only a minuscule fraction of the possible algorithms will work!
23
Important hashing considerations
collisions
When a new record hashes to an address already in use by another record.
The new record and the existing record are called synonyms.
The result is called an overflow.
A scheme must be devised to handle overflows efficiently.
24
Important hashing considerations
packing density
The ratio of records stored in a file to addresses available in the file.
Typically the best packing density is 80-90%.
For a given number of records, the larger the file, the lower the probability of an overflow.
There is thus a trade-off between space and efficiency.
25
Techniques for handling collisions
Strategies for collision resolution:
1. Create the file so that each address (physical record) can hold several logical records (usually synonyms). Called Composite Records or buckets. This is particularly good for secondary storage devices … why?
2. Develop algorithms for relocating records which collide somewhere else.
26
Composite Records or buckets
Reduce number of RRPs, but increase the size of each to hold several records. Each RRP (called a bucket) now holds several logical records.
[Figure: a file of 12 single-record addresses (1-12) beside a file of 4 buckets (1-4)]
27
Composite Records or buckets
Buckets are arrays of logical records.
bucket size - number of records/bucket
There is now room for several synonyms in each bucket.
Probability of overflow is reduced. Overflow now only occurs when a bucket is full.
Overall file size need not increase: if the bucket size is 5, then reduce the number of physical records by a factor of 5.
28
Composite Records or buckets
May be implemented by having each file record be an array of logical records.
Example: Consider two half full files.
[Figure: a half full file of 12 single-record addresses (1-12) beside a half full file of 4 buckets (1-4)]
Probability of Overflow?
29
Composite Records or buckets
Trade-offs
- As bucket size increases, probability of an overflow is greatly reduced.
- As bucket size increases, time to read in and scan a bucket increases.
Typical bucket sizes range from 5 to 30.
Ideal bucket size is often a multiple of the disk sector or track size.
What is the extreme case of having the longest possible bucket?
30
Handling overflows
Increasing bucket size will reduce, but not eliminate, overflows. They must be dealt with.
Many algorithms exist for handling overflows, including:
1. Progressive overflow
2. Separate overflow area
3. Chained Progressive overflow
31
Progressive overflow Adding new record
If home address is full, try the next record. If next address is full, try the next, and so on. If at end of file, wrap around to record 0. If the search continues until the home address is reached again, the file is full.
32
Progressive overflow Finding a record If in home bucket, success!
Else if home bucket not full, search fails. Else if home bucket full, go search next bucket. Keep searching successive buckets until either found, or a non-full bucket is searched.
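The add and find rules for progressive overflow can be sketched together in Python. This is a minimal in-memory model, not a disk file: the class name `BucketFile`, the use of `key % n` as the home address, and the small sizes are assumptions for illustration. Deletion is deliberately left out here.

```python
class BucketFile:
    """Buckets with progressive overflow (no deletion support)."""

    def __init__(self, n_buckets: int, bucket_size: int):
        self.buckets = [[] for _ in range(n_buckets)]
        self.bucket_size = bucket_size

    def _home(self, key: int) -> int:
        return key % len(self.buckets)

    def add(self, key: int) -> bool:
        n = len(self.buckets)
        home = self._home(key)
        for step in range(n):
            bucket = self.buckets[(home + step) % n]  # wraps to bucket 0
            if len(bucket) < self.bucket_size:
                bucket.append(key)
                return True
        return False  # came back around to the home bucket: file is full

    def find(self, key: int) -> bool:
        n = len(self.buckets)
        home = self._home(key)
        for step in range(n):
            bucket = self.buckets[(home + step) % n]
            if key in bucket:
                return True
            if len(bucket) < self.bucket_size:
                return False  # a non-full bucket ends the search
        return False


f = BucketFile(n_buckets=5, bucket_size=2)
for key in (0, 5, 10):      # all three have home bucket 0
    f.add(key)
print(f.buckets[:2])        # [[0, 5], [10]]
print(f.find(10), f.find(15))  # True False
```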
33
Progressive overflow Finding a record
Note that as file fills, search length will increase. What are some enhancements? Each bucket has flag indicating if bucket has really overflowed
34
Progressive overflow Delete record
Can't simply remove, or find may not work correctly Must mark each record as used, unused, or deleted.
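A Python sketch of why the three states matter, using one record per address for brevity. The class name `ProbeTable` and the sentinel names `EMPTY` and `DELETED` are hypothetical. A deleted slot keeps the search going, while a never-used slot ends it.

```python
EMPTY, DELETED = object(), object()  # sentinels for unused and deleted slots


class ProbeTable:
    """Progressive overflow with used / unused / deleted slot states."""

    def __init__(self, size: int):
        self.slots = [EMPTY] * size

    def _probe(self, key: int):
        n = len(self.slots)
        for step in range(n):
            yield (key + step) % n  # home address first, then wrap around

    def add(self, key: int) -> bool:
        for i in self._probe(key):
            if self.slots[i] is EMPTY or self.slots[i] is DELETED:
                self.slots[i] = key  # deleted slots may be reused
                return True
        return False

    def find(self, key: int) -> bool:
        for i in self._probe(key):
            if self.slots[i] is EMPTY:
                return False  # never-used slot: key cannot be further on
            if self.slots[i] is not DELETED and self.slots[i] == key:
                return True
        return False

    def delete(self, key: int) -> bool:
        for i in self._probe(key):
            if self.slots[i] is EMPTY:
                return False
            if self.slots[i] is not DELETED and self.slots[i] == key:
                self.slots[i] = DELETED  # mark, do not empty
                return True
        return False


t = ProbeTable(7)
for key in (3, 10, 17):   # all hash to address 3; stored at 3, 4, 5
    t.add(key)
t.delete(10)              # slot 4 becomes DELETED, not EMPTY
print(t.find(17))         # True: the search steps over the deleted slot
```

If slot 4 were simply emptied, the search for 17 would stop there and wrongly report that the record is missing.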
35
Progressive overflow Evaluation
- simple
- robust
- searches may get very long
- clustering
36
Progressive overflow Alternate version - skip x records each time, where x is relatively prime to the number of records. Reduces the problem of record clustering.
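A short Python check of why the skip distance should share no common factor with the number of records: only then does the probe sequence reach every address. The function name `probe_sequence` and the sizes are illustrative assumptions.

```python
from math import gcd


def probe_sequence(home: int, step: int, n_records: int) -> list[int]:
    """Addresses visited when skipping `step` records each time."""
    return [(home + i * step) % n_records for i in range(n_records)]


# gcd(3, 10) == 1: every one of the 10 addresses is visited once.
print(gcd(3, 10), probe_sequence(2, 3, 10))
# 1 [2, 5, 8, 1, 4, 7, 0, 3, 6, 9]

# gcd(4, 10) == 2: the sequence cycles through only 5 addresses.
print(gcd(4, 10), sorted(set(probe_sequence(2, 4, 10))))
# 2 [0, 2, 4, 6, 8]
```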
37
Separate overflow area
Buckets contain pointers which may point to a record in a special overflow area. Records (or buckets) are linked together in the overflow area as a linked list. What happens if there are a lot of synonyms for a few home addresses?
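A minimal Python model of a separate overflow area, with one record per home address and list indexes standing in for file pointers. The class name `OverflowFile` and its layout are assumptions. `find` also reports how many records were read, which shows what happens when many synonyms share one home address.

```python
class OverflowFile:
    """Home addresses plus a separate, linked overflow area."""

    def __init__(self, n_home: int):
        self.home = [None] * n_home  # one record per home address
        self.head = [None] * n_home  # pointer into the overflow area
        self.overflow = []           # entries are [key, next_pointer]

    def add(self, key: int) -> None:
        h = key % len(self.home)
        if self.home[h] is None:
            self.home[h] = key
        else:
            # Link the new record at the front of this address's chain.
            self.overflow.append([key, self.head[h]])
            self.head[h] = len(self.overflow) - 1

    def find(self, key: int) -> tuple[bool, int]:
        """Return (found, number of records read)."""
        h = key % len(self.home)
        reads = 1
        if self.home[h] == key:
            return True, reads
        pointer = self.head[h]
        while pointer is not None:
            reads += 1
            stored, pointer_next = self.overflow[pointer]
            if stored == key:
                return True, reads
            pointer = pointer_next
        return False, reads


f = OverflowFile(5)
for key in (1, 6, 11, 16):  # four synonyms for home address 1
    f.add(key)
print(f.find(1))   # (True, 1)
print(f.find(6))   # (True, 4): the whole chain is walked
```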
38
Separate overflow area
39
Chained Progressive overflow
similar to progressive, but pointers link synonyms together for quicker searches.
40
Perfect Hashing Static keyset Need for O(1) worst case access.
In book – proof that if n keys are stored in a table of size m = n^2, the probability is less than 1/2 that there are any collisions. But storing such a table takes a lot of space. Or does it?
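The bound can be sketched as follows, assuming the hash function is drawn at random from a universal family, so that any two distinct keys collide with probability at most 1/m. That assumption and the symbol X (the number of colliding pairs) are not on the slide.

```latex
\begin{align*}
E[X] &\le \binom{n}{2}\cdot\frac{1}{m}
      = \frac{n(n-1)}{2}\cdot\frac{1}{n^{2}}
      < \frac{1}{2}, \\
\Pr[X \ge 1] &\le E[X] < \frac{1}{2}
      \qquad\text{(Markov's inequality)}.
\end{align*}
```

So a randomly chosen function is collision free with probability greater than 1/2, and a few retries are enough to find one.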
41
Perfect Hashing Two Steps
Hash into a small primary index with a typical hash function: h(k) = ((ak + b) mod p) mod m
The primary index has (as needed) pointers to secondary indexes of variable size.
Each secondary index j stores its own parameters: its size (mj), and the aj and bj for that sub index.
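The two steps can be sketched in Python for a static set of distinct integer keys. This is a simplified illustration, not the book's construction: the names `build` and `lookup`, the choice p = 2^31 - 1, the fixed seed, and sizing each secondary index as the square of its group size are assumptions, and the usual retry of the primary function to bound total space is omitted.

```python
import random

P = 2_147_483_647  # prime 2**31 - 1; keys are assumed to be smaller than P


def build(keys, seed: int = 0):
    """Build a two-level table for a static set of distinct keys."""
    rng = random.Random(seed)
    m = len(keys)
    a, b = rng.randrange(1, P), rng.randrange(P)
    groups = [[] for _ in range(m)]
    for k in keys:
        groups[((a * k + b) % P) % m].append(k)

    secondary = []
    for group in groups:
        mj = len(group) ** 2  # quadratic size makes collisions unlikely
        while True:           # retry (aj, bj) until the group is collision free
            aj, bj = rng.randrange(1, P), rng.randrange(P)
            slots = [None] * mj
            collision = False
            for k in group:
                i = ((aj * k + bj) % P) % mj
                if slots[i] is not None:
                    collision = True
                    break
                slots[i] = k
            if not collision:
                break
        secondary.append((aj, bj, mj, slots))
    return (a, b, m), secondary


def lookup(table, key: int) -> bool:
    """Two hash evaluations and one comparison, whatever the key."""
    (a, b, m), secondary = table
    aj, bj, mj, slots = secondary[((a * key + b) % P) % m]
    return mj > 0 and slots[((aj * key + bj) % P) % mj] == key


keys = [10, 22, 37, 40, 52, 60, 70, 72, 75]
table = build(keys)
print(all(lookup(table, k) for k in keys), lookup(table, 11))  # True False
```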
42
Perfect Hashing