Published by Andrew Hodges. Modified over 9 years ago.
1
Hashing
- Static Hashing
- Dynamic Hashing
2
– 2 – Sungkyunkwan University, Hyoung-Kee Choi ©
Symbol table ADT
We define the symbol table as a set of name-attribute pairs.
- In a thesaurus, the name is a word, and the attribute is a list of synonyms for the word.
- In a symbol table for a compiler, the name is an identifier, and the attributes might include an initial value and a list of lines that use the identifier.
Generally we would want to perform the following operations on any symbol table:
- Determine if a particular name is in the table
- Retrieve the attributes of that name
- Modify the attributes of that name
- Insert a new name and its attributes
- Delete a name and its attributes
3
There are only three basic operations on symbol tables: searching, inserting, and deleting.
To implement these operations, we could use the binary search tree introduced in Section 5.7, whose operations take O(n) time in the worst case, or some other binary tree with O(log n) complexity.
In this chapter we examine a technique for search, insert, and delete operations that has very good expected performance. This technique is referred to as hashing.
Unlike search tree methods, which rely on identifier comparisons to perform a search, hashing relies on a formula called the hash function.
4
Static hashing
In static hashing, we store the identifiers in a fixed-size table called a hash table.
- We use an arithmetic function, f, to determine the address, or location, of an identifier, x, in the table. Thus, f(x) gives the hash, or home address, of x in the table.
- The hash table ht is stored in sequential memory locations that are partitioned into b buckets, ht[0], …, ht[b-1]. Each bucket has s slots.
5
Definitions:
- The identifier density of a hash table is the ratio n/T, where n is the number of identifiers in the table and T is the total number of possible identifiers.
- The loading density or loading factor of a hash table is α = n/(sb).
- Two identifiers, i1 and i2, are synonyms with respect to f if f(i1) = f(i2).
- An overflow occurs when we hash a new identifier, i, into a full bucket.
- A collision occurs when we hash two nonidentical identifiers into the same bucket.
- When the bucket size is 1, collisions and overflows occur simultaneously.
6
When no overflow occurs, the time required to enter, delete, or search for identifiers does not depend on the number of identifiers n in use; it is O(1).
Since the ratio b/T is usually small, we cannot avoid collisions altogether.
Example 8.1: b = 26, s = 2, f(x) = the first character of x (a maps to bucket 0, b to bucket 1, …, z to bucket 25).

Bucket   Slot 0   Slot 1
0        acos     atan
1
2        char     ceil
3        define
4        exp
5        float    floor
6
…
25

Hash table with 26 buckets and two slots per bucket
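The difference between a collision and an overflow in Example 8.1 can be sketched in C. This is only an illustration under stated assumptions: the names `bucket`, `first_char_hash`, `insert`, and the `HT_*` result codes are hypothetical, identifiers are assumed to start with a lower-case ASCII letter, and overflow handling is deliberately left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

#define BUCKETS 26   /* b in Example 8.1 */
#define SLOTS    2   /* s in Example 8.1 */
#define MAX_KEY 16   /* assumed maximum identifier length, including '\0' */

typedef struct {
    char slot[SLOTS][MAX_KEY];   /* an empty string marks a free slot */
} bucket;

typedef enum { HT_INSERTED, HT_COLLISION, HT_OVERFLOW } outcome;

/* f(x) = first character of x, mapped so that 'a' -> 0, ..., 'z' -> 25. */
int first_char_hash(const char *x)
{
    assert(islower((unsigned char)x[0]));
    return x[0] - 'a';
}

/* Try to store x in its home bucket.
   HT_INSERTED : the bucket was empty.
   HT_COLLISION: the bucket already held a different identifier,
                 but a slot was still free, so x was stored.
   HT_OVERFLOW : every slot of the home bucket is taken; x is not stored. */
outcome insert(bucket ht[], const char *x)
{
    bucket *home = &ht[first_char_hash(x)];
    for (int i = 0; i < SLOTS; i++) {
        if (home->slot[i][0] == '\0') {
            snprintf(home->slot[i], MAX_KEY, "%s", x);
            return i == 0 ? HT_INSERTED : HT_COLLISION;
        }
    }
    return HT_OVERFLOW;
}
```

With a zero-initialized table, inserting acos and then atan gives a collision without overflow, because bucket 0 has two slots. Once char and ceil occupy bucket 2, a further identifier starting with c (for example clock, which is not in the table above) overflows.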
7
Hashing functions
A hash function, f, transforms an identifier, x, into a bucket address in the hash table.
- We want a hash function that is easy to compute and that minimizes the number of collisions.
- To reduce collisions, the hash function should depend on all the characters in an identifier.
- Hashing functions should be unbiased. That is, if we randomly choose an identifier, x, from the identifier space, the probability that f(x) = i is 1/b for all buckets i.
- We call a hash function that satisfies the unbiased property a uniform hash function.
8
Four types of uniform hash functions:
- Mid-square
- Division
- Folding
- Digit analysis
Mid-square
We compute the function f by squaring the identifier and then using an appropriate number of bits from the middle of the square to obtain the bucket address.
Since the middle bits of the square usually depend upon all the characters in an identifier, there is a high probability that different identifiers will produce different hash addresses.
The number of bits used to obtain the bucket address depends on the table size. If we use r bits, the range of the values is 2^r, that is, bucket addresses 0 through 2^r - 1.
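A minimal C sketch of the mid-square idea follows. It assumes the identifier has already been converted to a 16-bit integer, so that its square fits in 32 bits; the function name `mid_square` and that key width are illustrative choices, not part of the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Mid-square hash: square a 16-bit key and return r bits taken from the
   middle of the 32-bit square. The result lies in 0 .. 2^r - 1. */
uint32_t mid_square(uint16_t key, unsigned r)
{
    assert(r >= 1 && r <= 16);
    uint32_t square = (uint32_t)key * key;
    unsigned shift = (32 - r) / 2;          /* low-order bits to discard */
    uint32_t mask = (UINT32_C(1) << r) - 1; /* keeps exactly r bits */
    return (square >> shift) & mask;
}
```

For example, the key 0x0F00 has the square 0x00E10000; with r = 8 the function discards the low 12 bits and keeps the next 8, giving 0x10.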
9
Division
We divide the identifier x by some number M and use the remainder as the hash address for x:
f(x) = x % M
This gives bucket addresses that range from 0 to M - 1, where M is the table size.
The choice of M is critical.
- If M is divisible by 2, then odd keys are mapped to odd buckets and even keys are mapped to even buckets.
- When many identifiers are permutations of each other, a biased use of the table results.
A good choice for M would be a prime number such that M does not divide r^k ± a for small k and a, where r is the radix of the character set.
In practice, choose M such that it has no prime divisors less than 20.
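The division method can be sketched in C as follows. The slides do not say how an identifier becomes a number; this sketch assumes the characters are treated as digits in radix 256, and the name `division_hash` is hypothetical.

```c
#include <assert.h>

/* Division hash of a string. The identifier is read as a number in
   radix 256 and reduced modulo m after every character (Horner's rule),
   so the intermediate value never exceeds 32 bits. */
unsigned division_hash(const char *x, unsigned m)
{
    assert(m > 0 && m < (1u << 23));   /* keeps h * 256 + 255 below 2^32 */
    unsigned h = 0;
    for (; *x != '\0'; x++)
        h = (h * 256u + (unsigned char)*x) % m;
    return h;
}
```

This also illustrates why M must not divide r^k ± a. With r = 256 and M = 255 (which divides 256 - 1), the hash degenerates into the sum of the character codes, so the permutations "ab" and "ba" land in the same bucket. With the prime M = 251 they are separated (buckets 81 and 85).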
10
Folding
We partition the identifier x into several parts. All parts, except for the last one, have the same length. We then add the parts together to obtain the hash address for x. There are two ways of carrying out this addition.
Shift folding: we shift all parts except for the last one, so that the least significant bit of each part lines up with the corresponding bit of the last part. We then add the parts together to obtain f(x).
Ex: suppose that we have divided the identifier x into the following parts: x1 = 123, x2 = 203, x3 = 241, x4 = 112, and x5 = 20. We would align x1 through x4 with x5 and add. This gives us a hash address of 699.
Folding at the boundaries: we reverse every other partition before adding.
Ex: suppose the identifier x is divided into the same partitions as in shift folding. We would reverse the second and fourth partitions, that is, x2 = 302 and x4 = 211, and add the partitions. This gives us a hash address of 897.
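Both folding variants can be reproduced with a short C sketch. It assumes the identifier is given as a string of decimal digits, as in the examples above; the function name `fold` and its parameters are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fold a string of decimal digits into a hash address.
   The string is cut into parts of part_len digits (the last part may be
   shorter) and the parts are added.
   at_boundaries == 0: shift folding, every part is read left to right.
   at_boundaries != 0: folding at the boundaries, every second part
                       (2nd, 4th, ...) is read right to left. */
unsigned fold(const char *digits, size_t part_len, int at_boundaries)
{
    assert(part_len > 0);
    size_t n = strlen(digits);
    unsigned sum = 0;
    int part_no = 1;

    for (size_t start = 0; start < n; start += part_len, part_no++) {
        size_t len = (n - start < part_len) ? n - start : part_len;
        unsigned part = 0;
        if (at_boundaries && part_no % 2 == 0) {
            for (size_t i = len; i-- > 0; )          /* reversed part */
                part = part * 10 + (unsigned)(digits[start + i] - '0');
        } else {
            for (size_t i = 0; i < len; i++)
                part = part * 10 + (unsigned)(digits[start + i] - '0');
        }
        sum += part;
    }
    return sum;
}
```

For the digit string "12320324111220" and three-digit parts, `fold(..., 3, 0)` adds 123 + 203 + 241 + 112 + 20 = 699, and `fold(..., 3, 1)` adds 123 + 302 + 241 + 211 + 20 = 897, matching the two examples.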
11
Digit analysis
Digit analysis is used with static files. A static file is one in which all the identifiers are known in advance.
Using this method:
- We first transform the identifiers into numbers using some radix, r.
- We then examine the digits of each identifier, deleting those digits that have the most skewed distribution.
- We continue deleting digits until the number of remaining digits is small enough to give an address in the range of the hash table.
Of these methods, the one most suitable for general-purpose applications is the division method with a divisor, M, such that M has no prime factors less than 20.
12
Overflow Handling (1/8)
Linear open addressing (linear probing)
- Compute f(x) for identifier x.
- Examine the buckets ht[(f(x)+j) % TABLE_SIZE], 0 ≤ j ≤ TABLE_SIZE, in this order. For each bucket examined, one of four cases applies:
  1. The bucket contains x (x is already in the table).
  2. The bucket contains the empty string (insert x into it).
  3. The bucket contains a nonempty string other than x (examine the next bucket, wrapping around circularly at the end of the table).
  4. We return to the home bucket ht[f(x)]. The table is full, so we report an error condition and exit.
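The four cases can be sketched in C as below. The hash function is an assumption (the character codes are summed and the sum is divided by the table size), and the return codes, `linear_insert`, and `linear_search` are hypothetical names chosen for this sketch. The table must start out with every key set to the empty string.

```c
#include <assert.h>
#include <string.h>

#define TABLE_SIZE 13
#define MAX_CHAR   10

typedef struct {
    char key[MAX_CHAR];    /* the empty string marks an empty bucket */
} element;

/* Assumed hash: add the character codes, then take the remainder. */
int hash(const char *key)
{
    unsigned sum = 0;
    while (*key != '\0')
        sum += (unsigned char)*key++;
    return (int)(sum % TABLE_SIZE);
}

/* Returns the bucket where key was stored, -1 if key is already in the
   table, or -2 if the probe wrapped around to the home bucket (table full). */
int linear_insert(element ht[], const char *key)
{
    assert(key[0] != '\0' && strlen(key) < MAX_CHAR);
    int home = hash(key);
    for (int j = 0; j < TABLE_SIZE; j++) {
        int i = (home + j) % TABLE_SIZE;
        if (strcmp(ht[i].key, key) == 0)
            return -1;                    /* case 1: bucket contains x */
        if (ht[i].key[0] == '\0') {
            strcpy(ht[i].key, key);       /* case 2: empty bucket */
            return i;
        }
        /* case 3: some other identifier, examine the next bucket */
    }
    return -2;                            /* case 4: back at home bucket */
}

/* Returns the bucket holding key, or -1 if key is not in the table. */
int linear_search(const element ht[], const char *key)
{
    int home = hash(key);
    for (int j = 0; j < TABLE_SIZE; j++) {
        int i = (home + j) % TABLE_SIZE;
        if (ht[i].key[0] == '\0')
            return -1;
        if (strcmp(ht[i].key, key) == 0)
            return i;
    }
    return -1;
}
```

With this assumed hash, "if" and "function" both hash to bucket 12; inserting "if" first sends "function" around the end of the table into bucket 0.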
13
Overflow Handling (2/8)
Additive transformation and division
[Figure: hash table with linear probing (13 buckets, 1 slot per bucket) after insertion]
14
Overflow Handling (3/8)
Problem of linear probing
- Identifiers tend to cluster together.
- Adjacent clusters tend to coalesce.
- Both effects increase the search time.
Example: suppose we enter the C built-in functions into a 26-bucket hash table in the order given below. The hash function uses the first character in each function name.
Enter sequence: acos, atoi, char, define, exp, ceil, cos, float, atol, floor, ctime
[Figure: hash table with linear probing (26 buckets, 1 slot per bucket)]
Retrieving each identifier once examines 1, 2, 1, 1, 1, 4, 5, 3, 9, 5, and 9 buckets respectively, so the average number of key comparisons is 41/11 = 3.73.
15
Overflow Handling (4/8)
Alternative techniques to improve the open addressing approach:
- Quadratic probing
- Rehashing
- Random probing
Rehashing
- Try f_1, f_2, …, f_m in sequence if a collision occurs.
- Disadvantage: comparison of identifiers with different hash values.
- Use chains to resolve collisions.
16
Overflow Handling (5/8)
Quadratic probing
- Linear probing searches buckets (f(x)+i) % b.
- Quadratic probing uses a quadratic function of i as the increment.
- Examine buckets f(x), (f(x)+i^2) % b, (f(x)-i^2) % b, for 1 <= i <= (b-1)/2.
- When b is a prime number of the form 4j+3, where j is an integer, the quadratic search examines every bucket in the table.

Prime   j      Prime   j
3       0      43      10
7       1      59      14
11      2      127     31
19      4      251     62
23      5      503     125
31      7      1019    254
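The claim about primes of the form 4j+3 can be checked with a small C sketch. The function names `quadratic_probes` and `distinct_buckets` are hypothetical, and the subtraction is written as `home + b - i^2 % b` so that the C remainder operator never sees a negative value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Store the quadratic probe sequence for home bucket `home` in a table of
   b buckets into seq[] (capacity at least b) and return its length:
   home, home + 1, home - 1, home + 4, home - 4, ... (all modulo b). */
size_t quadratic_probes(unsigned home, unsigned b, unsigned seq[])
{
    assert(b >= 1 && home < b);
    size_t n = 0;
    seq[n++] = home;
    for (unsigned i = 1; i <= (b - 1) / 2; i++) {
        unsigned sq = (unsigned)(((unsigned long long)i * i) % b);
        seq[n++] = (home + sq) % b;
        seq[n++] = (home + b - sq) % b;   /* (home - i^2) mod b, non-negative */
    }
    return n;
}

/* Count how many different buckets the probe sequence visits.
   Returns 0 if memory cannot be allocated. */
unsigned distinct_buckets(unsigned home, unsigned b)
{
    unsigned *seq = malloc(b * sizeof *seq);
    unsigned char *seen = calloc(b, 1);
    unsigned count = 0;
    if (seq != NULL && seen != NULL) {
        size_t n = quadratic_probes(home, b, seq);
        for (size_t k = 0; k < n; k++) {
            if (!seen[seq[k]]) {
                seen[seq[k]] = 1;
                count++;
            }
        }
    }
    free(seq);
    free(seen);
    return count;
}
```

For b = 7 and home bucket 0 the sequence is 0, 1, 6, 4, 3, 2, 5, which covers the whole table. For b = 13, a prime of the form 4j+1, the same procedure reaches only 7 of the 13 buckets.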
17
Overflow Handling (6/8)
Chaining
Linear probing and its variations perform poorly because inserting an identifier requires the comparison of identifiers with different hash values.
In this approach we maintain a list of synonyms for each bucket. To insert a new element:
- Compute the hash address f(x).
- Examine the identifiers in the list for f(x).
Since we would not know the sizes of the lists in advance, we should maintain them as linked chains.
The experimental evaluation indicates that chaining performs better than linear open addressing.
18
Overflow Handling (7/8)
Results of hash chaining
Enter sequence: acos, atoi, char, define, exp, ceil, cos, float, atol, floor, ctime
f(x) = first character of x
Average number of key comparisons = 21/11 = 1.91
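The figure of 21 comparisons can be reproduced with the following C sketch. It assumes each new identifier is appended to the end of its chain and that a successful search counts every node it compares, including the matching one. The type `node` and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 26

typedef struct node {
    const char *key;
    struct node *link;
} node;

/* f(x) = first character of x, 'a' -> 0 (lower-case ASCII assumed). */
int first_char_hash(const char *x)
{
    assert(x[0] >= 'a' && x[0] <= 'z');
    return x[0] - 'a';
}

/* Append key to the end of its chain; returns 0 if malloc fails. */
int chain_append(node *ht[], const char *key)
{
    node *p = malloc(sizeof *p);
    if (p == NULL)
        return 0;
    p->key = key;
    p->link = NULL;
    node **tail = &ht[first_char_hash(key)];
    while (*tail != NULL)
        tail = &(*tail)->link;
    *tail = p;
    return 1;
}

/* Key comparisons needed to find key, or 0 if it is absent. */
int comparisons_to_find(node *ht[], const char *key)
{
    int count = 0;
    for (const node *p = ht[first_char_hash(key)]; p != NULL; p = p->link) {
        count++;
        if (strcmp(p->key, key) == 0)
            return count;
    }
    return 0;
}

/* Insert all keys, then search for each once and add up the comparisons.
   Returns -1 if memory runs out. */
int total_comparisons(const char *keys[], size_t n)
{
    node *ht[TABLE_SIZE] = {NULL};
    int total = 0;

    for (size_t i = 0; i < n && total == 0; i++)
        if (!chain_append(ht, keys[i]))
            total = -1;
    for (size_t i = 0; i < n && total >= 0; i++)
        total += comparisons_to_find(ht, keys[i]);

    for (int b = 0; b < TABLE_SIZE; b++) {
        while (ht[b] != NULL) {
            node *next = ht[b]->link;
            free(ht[b]);
            ht[b] = next;
        }
    }
    return total;
}
```

For the eleven identifiers above, the chains are a: acos, atoi, atol (1 + 2 + 3), c: char, ceil, cos, ctime (1 + 2 + 3 + 4), d: define (1), e: exp (1), and f: float, floor (1 + 2), for a total of 21.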
19
Overflow Handling (8/8)
Comparison: in Figure 8.7, the values in each column give the average number of bucket accesses made in searching eight different tables with 33,575, 24,050, 4909, 3072, 2241, 930, 762, and 500 identifiers each.
- Chaining performs better than linear open addressing.
- We can see that division is generally superior to the other hash functions.
[Figure 8.7: average number of bucket accesses per identifier retrieved]
20
Chaining: declarations and insertion function (MAX_CHAR, TABLE_SIZE, and hash() are assumed to be defined elsewhere).

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char key[MAX_CHAR];
        /* other fields */
    } element;

    typedef struct list *list_pointer;
    typedef struct list {
        element item;
        list_pointer link;
    } list;

    list_pointer hash_table[TABLE_SIZE];

    /* Append item to the chain of its home bucket; exit on a duplicate key. */
    void chain_insert(element item, list_pointer ht[])
    {
        int hash_value = hash(item.key);
        list_pointer ptr, trail = NULL, lead = ht[hash_value];

        /* Walk the chain; trail ends up pointing at the last node. */
        for (; lead; trail = lead, lead = lead->link)
            if (!strcmp(lead->item.key, item.key)) {
                fprintf(stderr, "The key is in the table\n");
                exit(1);
            }

        ptr = (list_pointer)malloc(sizeof(list));
        if (ptr == NULL) {
            fprintf(stderr, "The memory is full\n");
            exit(1);
        }
        ptr->item = item;
        ptr->link = NULL;
        if (trail)
            trail->link = ptr;
        else
            ht[hash_value] = ptr;
    }
21
Dynamic hashing
Motivation for dynamic hashing
- One of the most important classes of software is the DBMS. A key characteristic of a DBMS is that the amount of information can vary a great deal over time.
- Various data structures have been suggested for storing the data in a DBMS. In this section, we examine an extension of hashing that permits the technique to be used by a DBMS.
- Traditional hashing schemes are not ideal because we must statically allocate a portion of memory to hold the hash table.
- Dynamic hashing, also referred to as extendible hashing, can accommodate dynamically increasing and decreasing file size without penalty.
22
Dynamic hashing using directories
Example: an identifier consists of two characters, and each character is represented by 3 bits.
We would like to place these identifiers into a table that has four pages. Each page can hold no more than two identifiers, and the pages are indexed by the 2-bit sequences 00, 01, 10, 11, respectively.
We use the two low-order bits of each identifier to determine the page address of the identifier.

Identifier   Binary representation
a0           100 000
a1           100 001
b0           101 000
b1           101 001
c0           110 000
c1           110 001
c2           110 010
c3           110 011
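The page address computation can be sketched in C. The encoding rule below (letter code 4 + offset from 'a', digit stored in binary) is inferred from the three letter codes shown in the table and is an assumption for any other letter; `encode` and `low_bits` are hypothetical names.

```c
#include <assert.h>

/* Encode a two-character identifier such as "c2" in 6 bits:
   3 bits for the letter (a = 100, b = 101, c = 110, assumed d = 111)
   followed by 3 bits for the digit 0..7. */
unsigned encode(const char *id)
{
    assert(id[0] >= 'a' && id[0] <= 'd');
    assert(id[1] >= '0' && id[1] <= '7');
    unsigned letter = 4u + (unsigned)(id[0] - 'a');
    unsigned digit = (unsigned)(id[1] - '0');
    return (letter << 3) | digit;
}

/* The k low-order bits of value, used as the page (or directory) index. */
unsigned low_bits(unsigned value, unsigned k)
{
    assert(k < 32);
    return value & ((1u << k) - 1u);
}
```

With two bits, a0 and b0 map to page 00, a1 and b1 to page 01, c2 to page 10, and c3 to page 11. Asking for more bits separates identifiers that share a page: `low_bits(encode("c5"), 3)` is 101 in binary, while a1 and b1 give 001.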
23
[Figure: three tries whose leaves point to pages.
(a) Two-level trie on four pages: a0, b0 | c2 | a1, b1 | c3
(b) Inserting c5 with overflow: a0, b0 | c2 | a1, b1 | c5 | c3
(c) Inserting c1 with overflow: a0, b0 | c2 | a1, c1 | b1 | c5 | c3]
We use the term trie to denote a binary tree in which we locate an identifier by following its bit sequence.
Notice that this trie has nodes that always branch in two directions corresponding to 0 or 1. Only the leaf nodes of the trie contain a pointer to a page.
24
From this example we can see that two major problems exist.
- The access time for a page depends on the number of bits needed to distinguish the identifiers.
- If the identifiers have a skewed distribution, the tree is also skewed.
How can we avoid these problems?
- To avoid the skewed distribution of identifiers, a hash function is used.
- To avoid the long search down the trie, the trie is mapped to a directory.
A directory is a table of page pointers. If k bits are needed to distinguish the identifiers, the directory has 2^k entries indexed 0, 1, 2, 3, …, 2^k - 1.
25
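[Figure: the three tries represented as directories with 2, 3, and 4 index bits. Each entry holds a page pointer (a to f); page contents are shown at the first entry that points to the page.]

2-bit directory
00 -> a (a0, b0)
01 -> c (a1, b1)
10 -> b (c2)
11 -> d (c3)

3-bit directory
000 -> a (a0, b0)
001 -> c (a1, b1)
010 -> b (c2)
011 -> e (c3)
100 -> a
101 -> d (c5)
110 -> b
111 -> e

4-bit directory
0000 -> a (a0, b0)
0001 -> c (a1, c1)
0010 -> b (c2)
0011 -> f (c3)
0100 -> a
0101 -> e (c5)
0110 -> b
0111 -> f
1000 -> a
1001 -> d (b1)
1010 -> b
1011 -> f
1100 -> a
1101 -> e
1110 -> b
1111 -> f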
26
Advantage of a directory:
- Using a directory to represent a trie allows the table of identifiers to grow and shrink dynamically.
- Accessing any page requires only two steps. First, use the hash function to find the address of the directory entry. Then, retrieve the page associated with the address.
Disadvantage of a directory:
- If the keys are not uniformly divided among the pages, the directory can grow quite large. However, most of the entries point to the same pages.
- To prevent this from happening, we cannot use the bit sequence of the keys themselves. Instead we translate the bits into a random sequence using a uniform hash function as discussed in the previous section.
- We need a family of hash functions because, at any point, we may require a different number of bits to distinguish the new key.
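The two-step page access and the growth of the directory can be sketched in C. The `page` and `directory` types and both function names are hypothetical; the sketch assumes the directory is indexed by the low-order bits of the hashed key and that doubling simply duplicates every pointer, leaving the actual page split to other code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    const char *id[2];     /* each page holds at most two identifiers */
    int count;
} page;

typedef struct {
    unsigned depth;        /* k: number of low-order bits used as the index */
    page **entry;          /* 2^k page pointers, allocated with malloc */
} directory;

/* Step 1: the k low-order bits of the hashed key select a directory entry.
   Step 2: that entry yields the page. */
page *find_page(const directory *d, unsigned hashed_key)
{
    return d->entry[hashed_key & ((1u << d->depth) - 1u)];
}

/* Double the directory from 2^k to 2^(k+1) entries. Entries i and i + 2^k
   differ only in the new leading bit, so both start out pointing to the
   page that entry i pointed to. Returns 1 on success and 0 if memory runs
   out (the directory is then left unchanged). */
int double_directory(directory *d)
{
    assert(d->depth < 31);
    size_t old_size = (size_t)1 << d->depth;
    page **grown = realloc(d->entry, 2 * old_size * sizeof *grown);
    if (grown == NULL)
        return 0;
    for (size_t i = 0; i < old_size; i++)
        grown[old_size + i] = grown[i];
    d->entry = grown;
    d->depth++;
    return 1;
}
```

After doubling, most entries still share pages with their counterparts, which is why a directory can grow large while pointing to only a few distinct pages.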
27
Our solution is a family of hash functions hash_i, where hash_i is simply hash_(i-1) with either a zero or one appended as the new leading bit of the result.
Thus hash(key, i) might be a function that produces a random number of i bits from the identifier key.
Some important twists are associated with this approach.
- For example, suppose a page identified by i bits overflows. We allocate a new page and rehash the identifiers into those two pages.
- The identifiers in both pages have their low-order i bits in common. We refer to these pages as buddies.
- When the number of identifiers in two buddy pages is no more than the capacity of a single page, we coalesce the two pages into one.
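One way to obtain such a family is to take the i low-order bits of a single fixed hash value. The sketch below assumes 32-bit FNV-1a as that base hash, which is a stand-in and not something the slides prescribe; `base_hash`, `hash_bits`, and `buddy` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed base hash: 32-bit FNV-1a over the characters of key. */
uint32_t base_hash(const char *key)
{
    uint32_t h = UINT32_C(2166136261);
    for (; *key != '\0'; key++) {
        h ^= (unsigned char)*key;
        h *= UINT32_C(16777619);
    }
    return h;
}

/* hash(key, i): the i low-order bits of the base hash. Going from i - 1
   to i bits keeps the old bits and adds one new leading bit. */
uint32_t hash_bits(const char *key, unsigned i)
{
    assert(i <= 32);
    if (i == 32)
        return base_hash(key);
    return base_hash(key) & ((UINT32_C(1) << i) - 1);
}

/* Address of the buddy of a page addressed with i bits (i >= 1):
   the same low-order i - 1 bits with the leading bit flipped. */
uint32_t buddy(uint32_t address, unsigned i)
{
    assert(i >= 1 && i <= 32);
    return address ^ (UINT32_C(1) << (i - 1));
}
```

For instance, with 4-bit addresses the pages 0001 and 1001 are buddies: `buddy(1, 4)` is 9 and `buddy(9, 4)` is 1. When a page identified by i bits splits, its identifiers are redistributed by `hash_bits(key, i + 1)` between the two buddy addresses.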