Published by Lillian Bates. Modified over 9 years ago.
1
File Structure, SNU-OOPSLA Lab
Chap 11. Hashing
Seoul National University, School of Computer Science and Engineering, Object-Oriented Systems Laboratory (SNU-OOPSLA-LAB)
Professor Hyoung-Joo Kim
File Structures by Folk, Zoellick and Riccardi
2
Chapter Objectives
• Introduce the concept of hashing
• Examine the problem of choosing a good hashing algorithm, present a reasonable one in detail, and describe some others
• Explore several approaches for reducing collisions, including storing several records per address
• Develop and use mathematical tools for analyzing performance differences resulting from the use of different hashing techniques
• Examine problems associated with file deterioration (record deletions) and discuss some solutions
• Discuss collision resolution techniques
• Examine the effects of patterns of record access on performance
3
Contents
11.1 Introduction
11.2 A Simple Hashing Algorithm
11.3 Hashing Functions and Record Distribution
11.4 How Much Extra Memory Should Be Used?
11.5 Collision Resolution by Progressive Overflow
11.6 Storing More Than One Record per Address: Buckets
11.7 Making Deletions
11.8 Other Collision Resolution Techniques
11.9 Patterns of Record Access
4
Overview (1)
• O(1) access to files
• A variation of the relative file
• The record number of a record is not arbitrary; it is obtained by applying a hashing function H to the primary key: H(key)
• The record numbers generated should be uniformly and randomly distributed, with 0 ≤ H(key) < N
5
Overview (2)
• A hash function is like a black box that produces an address every time you drop in a key
• All parts of the key should be used by the hashing function H, so that many records with similar keys do not all hash to the same location
• Given two random keys X, Y and N slots, the probability that H(X) = H(Y) is 1/N; when this happens, X and Y are called synonyms and a collision occurs
6
11.1 Introduction
• Hash function h(K): transforms a key K into an address
• Hash vs. other access methods
  - Sequential search: O(N)
  - Binary search: O(log_2 N)
  - B-tree (B+-tree) index: O(log_k N), where k is the number of records in an index node
  - Hash: O(1)
7
A Simple Hashing Scheme (1)
Name     ASCII codes of first two letters   Product             Home address
BALL     66 65                              66 × 65 = 4,290     290
LOWELL   76 79                              76 × 79 = 6,004     004
TREE     84 82                              84 × 82 = 6,888     888
(The home address is taken from the last three digits of the product, which is why LOWELL's home address is 4.)
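The scheme in this table can be sketched in a few lines of Python. The function name `simple_hash` is hypothetical, and taking the last three digits of the product is an assumption based on the next slide placing LOWELL at address 4.

```python
def simple_hash(name: str) -> int:
    """Multiply the ASCII codes of the first two letters and keep
    the last three decimal digits as the home address (assumed rule)."""
    product = ord(name[0]) * ord(name[1])
    return product % 1000


for name in ("BALL", "LOWELL", "TREE"):
    product = ord(name[0]) * ord(name[1])
    print(f"{name:6s} product={product:5d} home address={simple_hash(name):03d}")
```

Only the first two letters are used, so any two names that share them collide, which is one reason the slides go on to look for better functions.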
8
A Simple Hashing Scheme (2)
[Figure: the key K = LOWELL is passed to the hash function h(K), which yields address 4 (LOWELL's home address); the LOWELL record is stored at address 4 of a file whose addresses are numbered 0, 1, 2, 3, 4, 5, 6, ...]
9
Idea behind Hash-based Files
• A record with hash key i is stored in node i
• All records with hash key h are stored in node h
• Primary blocks of data-level nodes are stored sequentially
• The contents of the root node can be expressed by a simple function:
  address of the data-level node for the record with primary key k = address of node 0 + H(k)
• In the literature on hash-based files, primary blocks of data-level nodes are called buckets
10
e.g. Hash-based File
[Figure: a root node with entries 0 to 6, each pointing to a data-level node (bucket). Keys shown in the buckets, in transcript order: 70, 15, 50, 1, 30, 51, 10, 45, 3, 11, 60, 61, 124, 40, 20, 55, 57, 14, 15]
11
Hashing (1)
• Hashing functions: consider a primary key consisting of a string of 12 letters and a file with 100,000 slots. Since 26^12 >> 10^5, synonyms (collisions) are inevitable!
12
Hashing (2)
• Possible means of hashing
  - First, because 12 characters = 12 bytes = 3 (32-bit) words, partition the key into 3 words and fold them, either by
    (a) adding modulo 2^32, or
    (b) combining with exclusive OR, or
    (c) inventing your own method
  - Next, let R = V mod N, where R is the record number, V is the value obtained above, and N is the number of slots, so 0 ≤ R < N
• If N has many small factors, a poor distribution leading to many collisions can occur. Normally, N is chosen to be a prime number
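A minimal sketch of the two folding variants (a) and (b), assuming ASCII keys padded with blanks to 12 bytes and big-endian words. The function names and the choice N = 100003 (a prime close to 100,000) are illustrative assumptions, not taken from the slides.

```python
import struct


def _three_words(key: str) -> tuple:
    """Pad or truncate the key to 12 bytes and split it into three 32-bit words."""
    data = key.ljust(12)[:12].encode("ascii")
    return struct.unpack(">3I", data)  # big-endian byte order is an assumption


def fold_add(key: str) -> int:
    """Variant (a): add the three words modulo 2^32."""
    return sum(_three_words(key)) % 2**32


def fold_xor(key: str) -> int:
    """Variant (b): combine the three words with exclusive OR."""
    w0, w1, w2 = _three_words(key)
    return w0 ^ w1 ^ w2


def record_number(key: str, n_slots: int, fold=fold_xor) -> int:
    """R = V mod N, so 0 <= R < N."""
    return fold(key) % n_slots


print(record_number("LOWELL", 100003))
print(record_number("LOWELL", 100003, fold=fold_add))
```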
13
Hashing (3)
• Let M = number of records, N = number of available slots, and P(k) = probability of k records hashing to the same slot
• Then
  $P(k) = \binom{M}{k}\left(\frac{1}{N}\right)^{k}\left(1-\frac{1}{N}\right)^{M-k} \approx \frac{f^{k}}{e^{f}\,k!}$
  where f is the loading factor M/N
• As f → 1, P(0) → 1/e and P(1) → 1/e. The other (1 - 1/e) of the records must hash into (1 - 2/e) of the slots, for an average of about 2.4 records per such slot. So many synonyms!!
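The approximation and the 2.4 figure can be checked numerically. This is a sketch under the slide's definitions (M records, N slots, f = M/N); the function names and the sample size M = N = 1000 are arbitrary choices for illustration.

```python
from math import comb, exp, factorial


def binomial_p(k: int, m: int, n: int) -> float:
    """Exact probability that k of the m records hash to one given slot."""
    return comb(m, k) * (1 / n) ** k * (1 - 1 / n) ** (m - k)


def poisson_p(k: int, f: float) -> float:
    """Poisson approximation f^k / (e^f * k!)."""
    return f**k * exp(-f) / factorial(k)


m = n = 1000  # loading factor f = 1
for k in range(4):
    print(k, round(binomial_p(k, m, n), 4), round(poisson_p(k, 1.0), 4))

# With f = 1: a fraction p1 of the records sit alone in their slot;
# the rest share the slots that hold two or more records.
p0, p1 = poisson_p(0, 1.0), poisson_p(1, 1.0)
avg_records_per_crowded_slot = (1 - p1) / (1 - p0 - p1)
print(round(avg_records_per_crowded_slot, 2))
```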
14
Collision
• Collision
  - A situation in which a record is hashed to an address that does not have sufficient room to store the record
  - Perfect hashing: impossible!
  - Different key, same hash value (different record, same address)
• Solutions
  - Spread out the records
  - Use extra memory
  - Put more than one record at a single address
15
11.2 A Simple Hashing Algorithm
Step 1. Represent the key in numerical form
• If the key is a string: take the ASCII codes
• If the key is a number: nothing to be done
e.g. LOWELL = 76 79 87 69 76 76 32 32 32 32 32 32
     (L O W E L L followed by 6 blanks)
16
Step 2. Fold and Add
• Fold: 76 79 | 87 69 | 76 76 | 32 32 | 32 32 | 32 32
• Add the parts into one integer (suppose we use a 15-bit integer representation, so 32767 is the limit):
  7679 + 8769 + 7676 + 3232 + 3232 + 3232 = 33820 > 32767 (overflow!)
• Largest addend: 9090 ('ZZ')
• Largest safe intermediate result: 32767 - 9090 = 23677; the prime 19937, which lies below this bound, is used as the largest allowable result
• Ensure that no intermediate sum exceeds this limit by using 'mod':
  (7679 + 8769) mod 19937 = 16448
  (16448 + 7676) mod 19937 = 4187
  (4187 + 3232) mod 19937 = 7419
  (7419 + 3232) mod 19937 = 10651
  (10651 + 3232) mod 19937 = 13883
17
Step 3. Divide by the size of the address space
• a = s mod n
  a: home address
  s: the sum produced in step 2
  n: the number of addresses in the file
• e.g. a = 13883 mod 100 = 83 (s = 13883, n = 100)
• A prime number is usually used for the divisor, because primes tend to distribute remainders much more uniformly than nonprimes do
• So we choose a prime number as close as possible to the desired size of the address space (e.g. for a file with 75 records, a good choice for n is 101; the file will then be 74.3 percent full)
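Steps 1 to 3 can be put together as a short sketch. It assumes keys made only of uppercase letters and blanks (so every ASCII code has two decimal digits), a 12-character key field, and 19937 as the cap on intermediate sums; the name `hash_key` is hypothetical.

```python
def hash_key(key: str, n_addresses: int, max_sum: int = 19937) -> int:
    """Fold-and-add hash: ASCII codes, pairwise folding, mod by the file size."""
    # Step 1: numeric form, padded with blanks to 12 characters.
    codes = [ord(c) for c in key.ljust(12)[:12]]

    # Step 2: fold two codes into one number (76, 79 -> 7679) and add,
    # keeping every intermediate sum below max_sum.
    total = 0
    for i in range(0, 12, 2):
        pair = codes[i] * 100 + codes[i + 1]  # valid for two-digit codes only
        total = (total + pair) % max_sum

    # Step 3: divide by the size of the address space.
    return total % n_addresses


print(hash_key("LOWELL", 100))  # the slide's example: 13883 mod 100
print(hash_key("LOWELL", 101))  # the same sum with a prime divisor
```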
18
11.3 Hashing Functions and Record Distributions
Distributing records among addresses
[Figure: three panels (Best, Acceptable, Worst), each mapping records A to G onto addresses 1 to 10]
19
Some other hashing methods
• Better-than-random
  - Examine keys for a pattern
  - Fold parts of the key
  - Divide the key by a number
• When the better-than-random methods do not work: randomize!
  - Square the key and take the middle digits
  - Radix transformation
20
11.4 How Much Extra Memory Should Be Used?
• Packing density = (number of records) / (number of spaces) = r / N
• The more densely the records are packed, the more likely a collision will occur
21
Poisson Distribution
$p(x) = \frac{(r/N)^{x}\, e^{-r/N}}{x!}$ (Poisson distribution)
N = the number of available addresses
r = the number of records to be stored
x = the number of records assigned to a given address
p(x): the probability that a given address will have x records assigned to it after the hashing function has been applied to all r records (that is, the probability that x records collide at that address)
22
Predicting Collisions for Different Packing Densities
• Number of addresses with no record assigned: N × p(0)
• Number of addresses with exactly one record assigned: N × p(1)
• Number of addresses with two or more records assigned: N × [p(2) + p(3) + p(4) + ...]
• Number of overflow records: 1 × N p(2) + 2 × N p(3) + 3 × N p(4) + ...
• Percentage of overflow records: [1 × N p(2) + 2 × N p(3) + 3 × N p(4) + ...] / r × 100
23
The larger the space, the fewer the overflows
(N addresses, r records, packing density = r/N)
Packing density (%)   Synonyms as % of records
 10                    4.8
 20                    9.4
 30                   13.6
 40                   17.6
 50                   21.4
 60                   24.8
 70                   28.1
 80                   31.2
 90                   34.1
100                   36.8
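The table can be approximately regenerated from the Poisson formula. This sketch assumes one record per address and measures overflow as a percentage of the r records; the function names and the truncation point `max_x` are illustrative choices. The computed values agree with the table to within about 0.1.

```python
from math import exp, factorial


def poisson(x: int, density: float) -> float:
    """p(x) for a mean of r/N = density records per address."""
    return density**x * exp(-density) / factorial(x)


def overflow_percent(density: float, max_x: int = 30) -> float:
    """Overflow records as a percentage of all records.

    An address that receives x >= 2 records overflows x - 1 of them.
    """
    overflow_per_address = sum(
        (x - 1) * poisson(x, density) for x in range(2, max_x)
    )
    # N * overflow_per_address / r * 100, with r = density * N
    return 100 * overflow_per_address / density


for percent in range(10, 101, 10):
    print(f"{percent:3d}%  {overflow_percent(percent / 100):5.1f}")
```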
24
11.5 Collision Resolution by Progressive Overflow
• Progressive overflow (= linear probing)
• To insert a new record:
  1. Take the home address if it is empty
  2. Otherwise, search the following addresses in sequence until an empty one is found
  3. If the end of the address space is reached, wrap around to the beginning
25
Progressive Overflow (Cont'd)
[Figure: the key York goes through the hash routine and gets home address 6, which is busy. The 2nd and 3rd tries are also busy; the 4th try finds an open slot, which becomes York's actual address. Addresses 5 to 9 are shown, with existing records Novak, Rosen, Jasper and Morely.]
26
Progressive Overflow (Cont'd)
[Figure: the key Blue goes through the hash routine and gets address 99, the last address of the file, which is occupied by Jello. The search wraps around to addresses 0, 1, 2, ...]
27
Progressive Overflow (Cont'd)
• To search for a record with hash function value k: starting from home address k, look at successive records until
  - the record is found, or
  - an open address is encountered
• Worst case: the record does not exist and the file is full
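Insertion and search by progressive overflow can be sketched as follows. The file is modelled as an in-memory list in which `None` marks an open address; the function names are hypothetical, and the home address is passed in rather than computed, so that any hash function can be used.

```python
def insert(table: list, key: str, home: int) -> int:
    """Store key at its home address or at the next open address, wrapping around."""
    n = len(table)
    for step in range(n):
        addr = (home + step) % n
        if table[addr] is None:
            table[addr] = key
            return addr
    raise RuntimeError("file is full")


def search(table: list, key: str, home: int):
    """Return the address of key, or None once an open address is reached."""
    n = len(table)
    for step in range(n):
        addr = (home + step) % n
        if table[addr] is None:
            return None  # open address: the key cannot be further along
        if table[addr] == key:
            return addr
    return None  # worst case: every address examined, file full


# The wrap-around example: address 99 is the last one and is already taken.
table = [None] * 100
insert(table, "Jello", 99)
print(insert(table, "Blue", 99))   # wraps to address 0
print(search(table, "Blue", 99))
```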
28
Progressive Overflow (Cont'd)
• Search length: the number of accesses required to retrieve a record (from secondary memory)
Actual address   Record     Home address   Search length
20               Adams...   20             1
21               Bates...   21             1
22               Cole...    21             2
23               Dean...    22             2
24               Evans...   20             5
25               (empty)
Average search length = total search length / total number of records = 11 / 5 = 2.2
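The average can be recomputed from the home and actual addresses. In this sketch the home addresses of Cole (21) and Dean (22) are inferred from the search lengths 1, 1, 2, 2, 5 shown on the slide, and no wrap-around is assumed.

```python
# (record, home address, actual address)
records = [
    ("Adams", 20, 20),
    ("Bates", 21, 21),
    ("Cole", 21, 22),
    ("Dean", 22, 23),
    ("Evans", 20, 24),
]


def search_length(home: int, actual: int) -> int:
    """Number of addresses examined, from home up to and including actual."""
    return actual - home + 1


lengths = [search_length(home, actual) for _, home, actual in records]
average = sum(lengths) / len(lengths)
print(lengths, average)
```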
29
Progressive Overflow (Cont'd)
• With a perfect hashing function: average search length = 1
• Average search lengths of no greater than 2.0 are generally considered acceptable
[Figure: average search length (vertical axis, 1 to 5) plotted against packing density in percent (horizontal axis, 20 to 100)]
30
11.6 Storing More Than One Record per Address: Buckets
• Bucket: a block of records sharing the same address
Key     Home address
Green   30
Hall    30
Jenks   32
King    33
Land    33
Marks   33
Nutt    33
Bucket address   Bucket contents
30               Green... Hall...
31               (empty)
32               Jenks...
33               King... Land... Marks...
(Nutt... is an overflow record)
31
Effects of Buckets on Performance
• N: number of addresses
• b: number of records that fit in a bucket
• bN: number of available locations for records
• Packing density = r / bN
• Number of overflow records: N × [1 × p(b+1) + 2 × p(b+2) + 3 × p(b+3) + ...]
• As the bucket size gets larger, performance continues to improve
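The effect of bucket size can be illustrated with the overflow formula above. The figures r = 750 with either N = 1000, b = 1 or N = 500, b = 2 are hypothetical values chosen so that both files have the same packing density of 75 percent; the function name and the truncation point `max_x` are also illustrative.

```python
from math import exp, factorial


def expected_overflow_percent(r: int, n: int, b: int, max_x: int = 60) -> float:
    """Overflow records as a percentage of r, for n buckets of size b."""
    mean = r / n  # average number of records hashing to one bucket address

    def p(x: int) -> float:
        return mean**x * exp(-mean) / factorial(x)

    # A bucket that receives x > b records overflows x - b of them.
    overflow = n * sum((x - b) * p(x) for x in range(b + 1, max_x))
    return 100 * overflow / r


print(round(expected_overflow_percent(750, 1000, 1), 1))  # one record per address
print(round(expected_overflow_percent(750, 500, 2), 1))   # two records per bucket
```

At the same packing density, the file with two-record buckets is expected to overflow noticeably fewer records.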
32
Bucket Implementation
[Figure: each bucket holds a collision counter followed by its record slots; unused slots are filled with slashes]
An empty bucket:   0 | ///// ///// ///// ///// /////
Two entries:       2 | ARNSWORTH JONES ///// ///// /////
A full bucket:     5 | ARNSWORTH JONES STOCKTON BRICE TROOP
• Collision counter ≤ bucket size
33
Bucket Implementation (Cont'd)
• Initializing and loading
  - Create empty space
  - Use the hash value to find the bucket in which to store each record
  - If the home bucket is full, continue to look at successive buckets
• Problems arise when
  - no empty space exists
  - duplicate keys occur
34
11.7 Making Deletions
• The slot freed by a deletion disturbs later searches: a search that stops at the first open address would miss records stored beyond the freed slot
• Use tombstones, and reuse the freed slots
Record   Home address
Adams    5
Jones    6
Morris   6
Smith    5
Address   Before     After deleting Morris
5         Adams...   Adams...
6         Jones...   Jones...
7         Morris...  ###### (a tombstone for Morris)
8         Smith...   Smith...
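A minimal sketch of tombstones with progressive overflow, using the four records of this slide. The home addresses of Morris (6) and Smith (5) follow the example; the table size of 10, the function names, and the absence of duplicate-key checks are assumptions made for illustration.

```python
TOMBSTONE = "######"


def probe(table: list, home: int):
    """Yield addresses in progressive-overflow order, wrapping around once."""
    n = len(table)
    for step in range(n):
        yield (home + step) % n


def insert(table: list, key: str, home: int) -> int:
    """Use the first open or tombstoned slot (freed slots are reused)."""
    for addr in probe(table, home):
        if table[addr] is None or table[addr] == TOMBSTONE:
            table[addr] = key
            return addr
    raise RuntimeError("file is full")


def search(table: list, key: str, home: int):
    """Skip over tombstones; stop only at a truly open address."""
    for addr in probe(table, home):
        if table[addr] is None:
            return None
        if table[addr] == key:
            return addr
    return None


def delete(table: list, key: str, home: int):
    """Replace the record with a tombstone so later searches keep going."""
    addr = search(table, key, home)
    if addr is not None:
        table[addr] = TOMBSTONE
    return addr


table = [None] * 10
for name, home in [("Adams", 5), ("Jones", 6), ("Morris", 6), ("Smith", 5)]:
    insert(table, name, home)

delete(table, "Morris", 6)
print(table[5:9])
print(search(table, "Smith", 5))  # still reachable past the tombstone
```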
35
11.8 Other Collision Resolution Techniques
• Double hashing: avoid clustering by using a second hash function for overflow records
• Chained progressive overflow: each home address contains a pointer to the next record with the same home address
• Chaining with a separate overflow area: move all overflow records to a separate overflow area
• Scatter tables: the hash file contains only pointers to records (like indexing)
36
Overflow File
• When building the file, if a collision occurs, place the new synonym into a separate area of the file called the overflow section
• This method is not recommended:
  - if there is a high load factor, either the overflow section will itself overflow, or it will be organized sequentially and performance will suffer;
  - if there is a low load factor, much space is wasted
37
Linear Probing (1)
• When a synonym is identified, search forward from the address given by the hash function (the natural address) until an empty slot is located, and store the record there
• This is an example of open addressing (examining a predictable sequence of slots for an empty one)
38
Linear Probing (2)
Insertion sequence:
key:   A  S  E  A  R   C  H  I  N   G  E  X  A  M   P   L   E
hash:  1  0  5  1  18  3  8  9  14  7  5  5  1  13  16  12  5
Memory space after all insertions (slots 15 and 17 stay empty):
address:  0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18
key:      S  A  A  C  A  E  E  G  H  I  X   E   L   M   N   -   P   -   R
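The example can be replayed in code. The hash values on the slide match "position of the letter in the alphabet, mod 19", so that function is assumed here; the names `letter_hash` and `build` are hypothetical.

```python
def letter_hash(ch: str, m: int = 19) -> int:
    """Position of the letter in the alphabet (A = 1), modulo the table size."""
    return (ord(ch) - ord("A") + 1) % m


def build(keys: str, m: int = 19) -> list:
    """Insert each key by linear probing; assumes fewer keys than slots."""
    table = [None] * m
    for ch in keys:
        addr = letter_hash(ch, m)
        while table[addr] is not None:  # slot taken: try the next one
            addr = (addr + 1) % m
        table[addr] = ch
    return table


keys = "ASEARCHINGEXAMPLE"
print([letter_hash(ch) for ch in keys])
table = build(keys)
print("".join(ch or "." for ch in table))  # "." marks an empty slot
```

Note how X, whose natural address is 5, ends up at address 10: the cluster formed by the keys at addresses 5 to 9 has to be walked past first.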
39
Rehashing (1)
• Another form of open addressing
• In linear probing, when a synonym occurs, the record number r is incremented by 1 and the next location is searched
• In rehashing, a second hash function gives the displacement: D = (FOLD(key) mod P) + 1, where P < N is another prime number
• This method has the advantage of avoiding congestion, because each synonym under the first hash function is likely to use a different displacement D, and thus examines a different sequence of slots
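A sketch of the probe sequence under rehashing. The slides do not define FOLD, so a simple sum of character codes stands in for it here, and the same value is assumed to supply the natural address (mod N) and the displacement (mod P, plus 1); all function names are hypothetical.

```python
def fold(key: str) -> int:
    """Stand-in for FOLD(key): the sum of the character codes (an assumption)."""
    return sum(ord(c) for c in key)


def probe_sequence(key: str, n: int, p: int) -> list:
    """Slots examined for key: natural address, then steps of size D."""
    home = fold(key) % n
    d = fold(key) % p + 1  # displacement D, always between 1 and p
    return [(home + i * d) % n for i in range(n)]


def insert(table: list, key: str, p: int) -> int:
    """Place key in the first empty slot of its probe sequence."""
    for addr in probe_sequence(key, len(table), p):
        if table[addr] is None:
            table[addr] = key
            return addr
    raise RuntimeError("file is full")


# "A" and "T" are synonyms under the first hash (both map to slot 8 when N = 19),
# but they use different displacements, so their probe sequences diverge.
print(probe_sequence("A", 19, 3)[:4])
print(probe_sequence("T", 19, 3)[:4])
```

Because N is prime and 1 ≤ D < N, every probe sequence visits all N slots before repeating.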
40
Rehashing (where P = 3) (2)
Insertion sequence:
key:   A  S  E  A  R   C  H  I  N   G  E  X  A  M   P   L   E
hash:  1  0  5  1  18  3  8  9  14  7  5  5  1  13  16  12  5
[Figure: the memory space (addresses 0 to 18) as the keys are inserted with rehashing; the final slot layout is not recoverable from the transcript]
41
Chaining without Replacement (1)
• Uses pointers to build linked lists within the file; each linked list contains one set of synonyms; the head of the list is the record at the natural address of these synonyms
• When a record is added whose natural address is occupied, it is added to the list whose head is at the natural address
• Linear probing is usually used to find where to put a new synonym, although rehashing could be used as well
42
Chaining without Replacement (2)
• A problem is that linked lists can coalesce: suppose that H(R1) = H(R2) = H(R4) = i and H(R3) = H(R5) = i+1, and records are added in the order R1, R2, R3, R4, R5. Then the lists for natural addresses i and i+1 coalesce. Periodic reorganization shortens such lists.
• Let FWD and BWD be forward and backward pointers along these chains (doubly linked)
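The coalescing scenario can be reproduced with a small sketch. It keeps only a forward pointer per slot (the BWD pointer is omitted for brevity), finds free slots by linear probing, and takes i = 0 in a table of 8 slots; these choices and the function names are assumptions for illustration.

```python
def insert(table: list, key: str, home: int) -> int:
    """Chaining without replacement: a slot is [key, fwd]; records never move."""
    n = len(table)
    if table[home] is None:
        table[home] = [key, None]
        return home

    # Follow the chain that passes through the natural address to its tail.
    tail = home
    while table[tail][1] is not None:
        tail = table[tail][1]

    # Find a free slot by linear probing from the natural address.
    for step in range(1, n):
        free = (home + step) % n
        if table[free] is None:
            break
    else:
        raise RuntimeError("file is full")

    table[free] = [key, None]
    table[tail][1] = free  # link the new record to the end of the chain
    return free


def chain_from(table: list, home: int) -> list:
    """Keys met when following forward pointers from a natural address."""
    keys, addr = [], home
    while addr is not None and table[addr] is not None:
        keys.append(table[addr][0])
        addr = table[addr][1]
    return keys


i = 0
table = [None] * 8
for key, home in [("R1", i), ("R2", i), ("R3", i + 1), ("R4", i), ("R5", i + 1)]:
    insert(table, key, home)

print(chain_from(table, i))      # synonyms of i and of i+1 on one list
print(chain_from(table, i + 1))
```

R2 occupies slot i+1 before R3 arrives, so R3 has to join the chain that runs through R2, and from then on the two synonym sets share one list.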
43
Chaining without Replacement (3)
[Figure: records R1, R2, R3, R4, R5 linked into a single coalesced chain; the list heads (H) are at natural addresses i and i+1]
44
Chaining with Replacement (4)
• Eliminates the problem with deletion that made abandonment necessary
• Further reduces search lengths
• When inserting a new record, if the slot at its natural address is occupied by a record for which it is not the natural address, that record is relocated so the new record can be placed at its natural address
• Synonym chains can never coalesce, so a record can be deleted even if it is the head of a chain, simply by moving the second record on the chain to its natural address (ABANDON is thus no longer necessary)
45
Chaining with Replacement (5)
[Figure, three stages, with list heads marked H:
 after R1, R2 added: R1 is the head at address i, and R2 follows it at i+1;
 after R3 hashes to i+1: R3 takes address i+1 as a head, and R2 is relocated;
 finally: 2 separate chains, R1 - R2 - R4 and R3 - R5]
46
11.9 Patterns of Record Access
• Pareto Principle (80/20 rule of thumb): 80% of the accesses are performed on 20% of the records!
• The concept of "the Vital Few and the Trivial Many"
  - 20% of the fishermen catch 80% of the fish
  - 20% of the burglars steal 80% of the loot
• If we know the patterns of record access ahead of time, we can do many intelligent and effective things!
• Sometimes we can know or guess the access patterns
• Very useful hints for file systems or DBMSs
• Intelligent placement of records
  - faster accesses
  - fewer collisions
47
Let's Review!
11.1 Introduction
11.2 A Simple Hashing Algorithm
11.3 Hashing Functions and Record Distribution
11.4 How Much Extra Memory Should Be Used?
11.5 Collision Resolution by Progressive Overflow
11.6 Storing More Than One Record per Address: Buckets
11.7 Making Deletions
11.8 Other Collision Resolution Techniques
11.9 Patterns of Record Access