External Memory Hashing
Hash Tables
Hash function h: search key → [0…B-1]. Buckets are blocks, numbered [0…B-1].
Big idea: if a record with search key K exists, then it must be in bucket h(K). A lookup costs one disk I/O if there is only one block per bucket.
Example: h(d) = 0, h(e) = h(c) = 1, …
Hash Table Lookup: for record(s) with search key K, compute h(K) and search that bucket.
Hash Table Insertion
Put the record in bucket h(K) if it fits; otherwise create an overflow block. Overflow blocks are part of the bucket.
Example: insert a record with search key g, where h(g) = 1.
Delete c, where h(c) = 1?
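The lookup, insertion and deletion rules above can be sketched in Python. This is only an illustration under stated assumptions: keys are integers, h(K) = K mod B, a block holds two records, and a bucket is modeled as a list of blocks whose first element is the primary block and whose later elements are overflow blocks. The class and method names are hypothetical, not part of the slides.

```python
class StaticHashTable:
    """Toy static hash table: B buckets, each a chain of fixed-size blocks."""

    def __init__(self, num_buckets, block_capacity=2):
        self.num_buckets = num_buckets          # B
        self.block_capacity = block_capacity    # records per block (assumed)
        # One primary (initially empty) block per bucket.
        self.buckets = [[[]] for _ in range(num_buckets)]

    def h(self, key):
        # Assumed hash function for integer keys.
        return key % self.num_buckets

    def lookup(self, key):
        # Only bucket h(key) can contain the key; scan all of its blocks.
        return any(key in block for block in self.buckets[self.h(key)])

    def insert(self, key):
        chain = self.buckets[self.h(key)]
        for block in chain:
            if len(block) < self.block_capacity:
                block.append(key)
                return
        chain.append([key])                     # no room: new overflow block

    def delete(self, key):
        chain = self.buckets[self.h(key)]
        for block in chain:
            if key in block:
                block.remove(key)
                break
        # Drop overflow blocks that became empty; keep the primary block.
        chain[1:] = [block for block in chain[1:] if block]


table = StaticHashTable(num_buckets=4)
for key in (1, 5, 9):          # all three keys hash to bucket 1
    table.insert(key)
state_after_inserts = [list(block) for block in table.buckets[1]]
table.delete(9)                # empties the overflow block, which is dropped
state_after_delete = [list(block) for block in table.buckets[1]]
```

The sketch leaves a half-empty primary block in place after a deletion; whether to move records back from overflow blocks is a separate design choice that the slide leaves open.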
What if the File Grows too Large?
Efficiency is highest if #records < #buckets × max #(records/bucket).
If the file grows, we need a dynamic hashing method to maintain this relationship.
Extensible hashing: double the number of buckets when needed.
Linear hashing: add one more bucket as appropriate.
Dynamic Hashing Framework
Hash function h produces a sequence of k bits. Only some of the bits are used at any time to determine placement of keys in buckets.
Extensible Hashing (buckets may share blocks!)
Keep a parameter i = number of bits from the beginning of h(K) that determine the bucket.
Bucket array now = pointers to blocks. A block can serve several buckets.
For each block, a parameter j ≤ i tells how many bits of h(K) determine membership in the block; i.e., a block represents 2^(i-j) buckets that share the first j bits of their number.
Example An extensible hash table when i=1:
Extensible Hash Table Insert
If the record with key K fits in the block B pointed to by the bucket-array entry for h(K), put it there.
If not, let j be the number of bits that block B represents.
1. j = i:
a. Set i := i+1.
b. Double the bucket array, so that it now has 2^i entries (for the new value of i).
c. Let w be an old array entry. Both new entries, w0 and w1, point to the same block that w used to point to.
d. Split B into two blocks and distribute the records of B according to their (j+1)st bit; set j := j+1 for both blocks; fix the pointers in the bucket array, so that entries that formerly pointed to B now point either to B or to the new block, depending on the (j+1)st bit of the entry's number.
2. j < i: do as in 1.d.
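The insertion procedure above can be sketched in Python as follows. This is an illustration, not a definitive implementation: it assumes 4-bit hash values, two records per block, an initial table with i = 1, and it stores the hash values themselves in place of records. The names Block, ExtensibleHashTable, K_BITS and BLOCK_CAPACITY are hypothetical.

```python
K_BITS = 4            # length k of a hash value in bits (assumed)
BLOCK_CAPACITY = 2    # records per block (assumed)


class Block:
    def __init__(self, j):
        self.j = j        # number of leading hash bits this block represents
        self.keys = []


class ExtensibleHashTable:
    def __init__(self):
        self.i = 1
        self.directory = [Block(1), Block(1)]   # bucket array of block pointers

    def _index(self, hv):
        return hv >> (K_BITS - self.i)          # first i bits of the hash value

    def lookup(self, hv):
        return hv in self.directory[self._index(hv)].keys

    def insert(self, hv):
        while True:
            block = self.directory[self._index(hv)]
            if len(block.keys) < BLOCK_CAPACITY:
                block.keys.append(hv)
                return
            if block.j == K_BITS:
                raise OverflowError("more equal hash values than fit in a block")
            if block.j == self.i:
                # Case j = i: entry w becomes w0 and w1, both pointing to w's block.
                self.directory = [b for b in self.directory for _ in range(2)]
                self.i += 1
            self._split(block)                  # step d, used in both cases

    def _split(self, block):
        block.j += 1
        new_block = Block(block.j)
        shift = K_BITS - block.j                # position of the (old j + 1)st bit
        old_keys, block.keys = block.keys, []
        for key in old_keys:
            target = new_block if (key >> shift) & 1 else block
            target.keys.append(key)
        # Entries that pointed to the old block move to the new one if their
        # own (old j + 1)st bit is 1.
        for idx, b in enumerate(self.directory):
            if b is block and (idx >> (self.i - block.j)) & 1:
                self.directory[idx] = new_block


table = ExtensibleHashTable()
for hv in (0b0001, 0b1001, 0b1100, 0b1010, 0b0111, 0b0000, 0b1000):
    table.insert(hv)
```

After these insertions the sketch ends with i = 3, and the entries for 00x and 01x still share one block each, since those blocks represent only two bits.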
Example
Insert record with h(K) = 1010. [Figure: the table before and after the insertion.]
Example: Next
Next: records with h(K) = 0000 and h(K) = 0111. The bucket for 0... gets split, but i stays at 2.
Then: record with h(K) = 1000. It overflows the bucket for 10..., so i is raised to 3.
[Figure: the table currently and after the insertions.]
More on Extensible Hashing
Two ideas:
(a) Use i of the k bits of h(Key), e.g. the first i bits of h(Key) = 00110101; i grows over time.
(b) Use a directory, the Bucket Address Table (BAT), which maps h(Key)[i] (the i bits in use) to a bucket.
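Example – Extensible Hashing – Insert
h(K) is 4 bits; 2 keys/bucket. Insert: 1010.
Before (i = 1): entry 0 → block {0001} (j = 1); entry 1 → block {1001, 1100} (j = 1).
After (new directory, i = 2): entries 00 and 01 → block {0001} (j = 1); entry 10 → block {1001, 1010} (j = 2); entry 11 → block {1100} (j = 2).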
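Example – Extensible Hashing – Insert
h(K) is 4 bits; 2 keys/bucket. Insert: 0111, 0000.
Before (i = 2): entries 00 and 01 → block {0001} (j = 1); entry 10 → block {1001, 1010} (j = 2); entry 11 → block {1100} (j = 2).
After (i = 2): entry 00 → block {0000, 0001} (j = 2); entry 01 → block {0111} (j = 2); entries 10 and 11 unchanged.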
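Example – Extensible Hashing – Insert
h(K) is 4 bits; 2 keys/bucket. Insert: 1011.
Before (i = 2): entry 00 → block {0000, 0001}; entry 01 → block {0111}; entry 10 → block {1001, 1010}; entry 11 → block {1100}.
The block for 10 is full and has j = i = 2, so the directory doubles to i = 3 (entries 000, 001, 010, 011, 100, 101, 110, 111) and the block is split on the third bit: entry 100 → block {1001}, entry 101 → block {1010, 1011}. The other blocks keep j = 2 and are each shared by two entries.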
Extensible Hash Tables
Advantages: a lookup never searches more than one data block. We hope that the bucket array fits in main memory.
Defects: doubling the bucket array could make it too large to fit in main memory. Skewed key distributions are a problem. E.g., let 1 block = 2 records, and suppose that three records have hash values that agree in the first 20 bits (but not all three in the 21st). Then we would have i = 21 and 2^21 ≈ two million bucket-array entries, even though we have only 3 records!
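The arithmetic behind the skew example can be checked with a short Python sketch. It assumes 32-bit hash values and computes the smallest i for which no i-bit prefix is shared by more records than fit in one block; the function name and the three sample hash values are made up for illustration.

```python
from collections import Counter


def directory_bits_needed(hash_values, k, block_capacity):
    """Smallest i such that no i-bit prefix is shared by more than
    block_capacity of the given k-bit hash values (None if impossible)."""
    for i in range(k + 1):
        prefix_counts = Counter(hv >> (k - i) for hv in hash_values)
        if max(prefix_counts.values()) <= block_capacity:
            return i
    return None


K = 32                               # assumed hash length in bits
prefix = 0xABCDE << (K - 20)         # an arbitrary common 20-bit prefix
values = [
    prefix,                          # 21st bit is 0
    prefix | (1 << 11),              # 21st bit is 1
    prefix | (1 << 11) | 1,          # 21st bit is 1, differs further right
]

bits = directory_bits_needed(values, K, block_capacity=2)
entries = 2 ** bits                  # size of the bucket array
```

With these sample values the sketch gives bits = 21 and entries = 2,097,152, matching the figure quoted on the slide.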
Linear Hashing
The number of buckets n is always chosen so that r < p × (n × bucket capacity), where r is the number of records and p is typically between 0.5 and 0.9.
Overflow blocks are permitted.
If r > p × (n × bucket capacity), one new bucket is allocated and one old bucket needs to be rehashed.
Number of bits i needed to index n buckets = ⌈log2(n)⌉.
Buckets are numbered [0…n-1], where 2^(i-1) < n ≤ 2^i.
Example: i = 1, n = 2 (# of buckets), r = 3 (# of records). These parameters are also part of the structure.
Linear Hashing
Use i bits from the right (low-order) end of h(K).
Buckets are numbered [0…n-1], where 2^(i-1) < n ≤ 2^i.
Let the last i bits of h(K) be a_(i-1) a_(i-2) … a_0 = integer m.
If m < n, then the record belongs to bucket m.
If n ≤ m < 2^i, then the record belongs to bucket m - 2^(i-1), that is, the bucket we would get if we changed a_(i-1) (which must be 1) to 0.
Lookup in Linear Hash Table
For record(s) with search key K, compute h(K) and search the corresponding bucket, chosen by the same rule that is used to place records on insertion. If the record we wish to look up isn't there, it can't be anywhere else.
E.g., with i = 2, n = 3, r = 4: look up a key that hashes to 1010, and then a key that hashes to 1011.
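The bucket-selection rule used for lookup can be written as a small Python function. This is a sketch under the slide's definitions (i low-order bits, n buckets); the function name bucket_for is hypothetical.

```python
def bucket_for(hash_value, i, n):
    """Bucket number for a hash value in a linear hash table that
    currently uses i low-order bits and has n buckets."""
    m = hash_value & ((1 << i) - 1)      # integer formed by the last i bits
    if m < n:
        return m
    # Bucket m does not exist yet: clear the leading bit a_(i-1).
    return m - (1 << (i - 1))


# The two lookups from the example, with i = 2 and n = 3.
first = bucket_for(0b1010, i=2, n=3)     # last two bits 10 = 2 < 3
second = bucket_for(0b1011, i=2, n=3)    # last two bits 11 = 3 >= 3
```

Here the key hashing to 1010 is searched for in bucket 10, while the key hashing to 1011 is searched for in bucket 01, because bucket 11 does not exist yet.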
Linear Hash Table Insert
Pick an upper limit on capacity, e.g., 85% (1.7 records/bucket in our example).
If an insertion exceeds the capacity limit, set n := n + 1.
If the new n is 2^i + 1, set i := i + 1. No change in bucket numbers is needed; just imagine a leading 0.
Need to split bucket n - 2^(i-1), where n is the old value, because there is now a bucket numbered (old) n.
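The insertion steps above can be sketched in Python. The sketch assumes two records per block, an 85% capacity limit, an initial state of n = 1 and i = 1, and it stores hash values in place of records; each bucket is a plain list, so overflow blocks are implicit in lists longer than the block capacity. All names are hypothetical.

```python
BLOCK_CAPACITY = 2    # records per block (assumed)
THRESHOLD = 0.85      # capacity limit p (assumed)


class LinearHashTable:
    def __init__(self):
        self.i = 1            # number of low-order hash bits in use
        self.n = 1            # number of buckets
        self.r = 0            # number of records
        self.buckets = [[]]   # bucket = list of hash values (overflow implicit)

    def _bucket_for(self, hv):
        m = hv & ((1 << self.i) - 1)
        return m if m < self.n else m - (1 << (self.i - 1))

    def lookup(self, hv):
        return hv in self.buckets[self._bucket_for(hv)]

    def insert(self, hv):
        self.buckets[self._bucket_for(hv)].append(hv)
        self.r += 1
        if self.r > THRESHOLD * self.n * BLOCK_CAPACITY:
            self._add_bucket()

    def _add_bucket(self):
        self.n += 1
        if self.n == (1 << self.i) + 1:
            self.i += 1                       # bucket numbers gain a leading 0
        new_index = self.n - 1                # the bucket numbered (old) n
        split_index = new_index - (1 << (self.i - 1))
        self.buckets.append([])
        # Rehash only the bucket being split; its records go either back
        # to split_index or to the new bucket.
        old_records, self.buckets[split_index] = self.buckets[split_index], []
        for hv in old_records:
            self.buckets[self._bucket_for(hv)].append(hv)


table = LinearHashTable()
for hv in (0b0000, 0b1010, 0b1111, 0b0101, 0b0001, 0b1100):
    table.insert(hv)
```

Running the six insertions leaves the sketch with r = 6, n = 4 and i = 2, with buckets 00, 01, 10 and 11 holding {0000, 1100}, {0101, 0001}, {1010} and {1111} respectively.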
Example
Insert records with h(K) = 0000, 1010, 1111, 0101, 0001, 1100.
After inserting 0000: r=1, n=1, i=1; bucket 0: 0000.
Example
Next insertion: h(K) = 1010.
Before: r=1, n=1, i=1; bucket 0: 0000.
After: r=2, n=2, i=1; bucket 0: 0000, 1010; bucket 1: empty.
Capacity limit exceeded; increment n.
Example
Next insertion: h(K) = 1111.
Before: r=2, n=2, i=1; bucket 0: 0000, 1010; bucket 1: empty.
After: r=3, n=2, i=1; bucket 0: 0000, 1010; bucket 1: 1111.
Example
Next insertion: h(K) = 0101.
Before: r=3, n=2, i=1; bucket 0: 0000, 1010; bucket 1: 1111.
After: r=4, n=3, i=2; bucket 00: 0000; bucket 01: 1111, 0101; bucket 10: 1010.
Capacity limit exceeded; increment n, which causes i to be incremented as well.
Example
Next insertion: h(K) = 0001.
Before: r=4, n=3, i=2; bucket 00: 0000; bucket 01: 1111, 0101; bucket 10: 1010.
After: r=5, n=3, i=2; bucket 00: 0000; bucket 01: 1111, 0101, 0001 (one overflow block); bucket 10: 1010.
As long as the capacity limit is not exceeded, we can add overflow blocks.
Example
Next insertion: h(K) = 1100.
Before: r=5, n=3, i=2; bucket 00: 0000; bucket 01: 1111, 0101, 0001; bucket 10: 1010.
After: r=6, n=4, i=2; bucket 00: 0000, 1100; bucket 01: 0101, 0001; bucket 10: 1010; bucket 11: 1111.
Capacity limit exceeded; increment n.
Exercise Suppose we want to insert keys with hash values: 0000…1111 in a linear hash table with 100% capacity threshold. Assume that a block can hold three records.