Hash Table indexing and Secondary Storage Hashing
4/6/2009 COMP Mount Allison University

In Memory
An array of B buckets, indexed from 0 to B-1.
Each bucket is the head of a linked list.
The bucket index is determined by a hash function h(k), where k is the key.
A common hash function is h(k) = k % B, the remainder of k divided by B.
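
The in-memory structure described above can be sketched roughly as follows. This is only an illustration: the class name `ChainedHashTable`, its methods, and the use of Python lists in place of linked lists are assumptions made for the example, not part of the slides.

```python
class ChainedHashTable:
    """In-memory hash table: B buckets, each holding a chain of (key, value) pairs."""

    def __init__(self, num_buckets=8):
        self.B = num_buckets
        # Python lists stand in for the linked lists of the slide (an assumption).
        self.buckets = [[] for _ in range(self.B)]

    def h(self, k):
        # h(k) = k % B maps any integer key to a bucket index in 0..B-1.
        return k % self.B

    def insert(self, k, value):
        chain = self.buckets[self.h(k)]
        for idx, (key, _) in enumerate(chain):
            if key == k:
                chain[idx] = (k, value)  # key already present: overwrite
                return
        chain.append((k, value))

    def lookup(self, k):
        for key, value in self.buckets[self.h(k)]:
            if key == k:
                return value
        return None


table = ChainedHashTable(num_buckets=5)
table.insert(12, "a")
table.insert(7, "b")  # 12 % 5 == 7 % 5 == 2, so both keys share bucket 2
```

Keys that collide simply extend the chain of their bucket, so a lookup scans one chain only.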

Secondary Storage Hashing
Static hash table: a fixed number of buckets.
Dynamic hash table: the number of buckets can grow.

Static Hash Table
The bucket array consists of blocks rather than pointers to linked lists.
Records that the hash function maps to a given bucket are stored in that bucket's block.
If the block has no room left, a chain of overflow blocks can be added to the bucket.
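
A minimal sketch of a static hash table with overflow chains, kept in memory for simplicity. The names `Block`, `StaticHashFile` and `BLOCK_CAPACITY`, the capacity of 2 records per block, and the hash function k % B are all assumptions chosen for illustration.

```python
BLOCK_CAPACITY = 2  # assumed number of records per block, kept tiny on purpose


class Block:
    def __init__(self):
        self.records = []
        self.overflow = None  # next block in the overflow chain, if any


class StaticHashFile:
    def __init__(self, num_buckets):
        self.B = num_buckets  # fixed for the lifetime of the table
        self.buckets = [Block() for _ in range(self.B)]

    def insert(self, key):
        block = self.buckets[key % self.B]
        # Walk the chain until a block with free space is found,
        # appending a new overflow block when the chain is exhausted.
        while len(block.records) >= BLOCK_CAPACITY:
            if block.overflow is None:
                block.overflow = Block()
            block = block.overflow
        block.records.append(key)

    def lookup(self, key):
        """Return (found, blocks_read) for the given key."""
        block, reads = self.buckets[key % self.B], 0
        while block is not None:
            reads += 1
            if key in block.records:
                return True, reads
            block = block.overflow
        return False, reads


f = StaticHashFile(num_buckets=4)
for key in (0, 4, 8):  # all three keys hash to bucket 0
    f.insert(key)
```

The `blocks_read` count shows the cost of a fixed B: once chains grow, a lookup may have to read several blocks.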

Dynamic Hash Tables
The number of buckets B approximates the number of records divided by the number of records that fit in a block, i.e. there is about one block per bucket.
Extensible hashing: B grows by doubling.
Linear hashing: B grows by 1.

Extensible Hashing
The buckets are represented by an array of pointers to blocks, instead of an array that holds the data itself.
The length of the array is always a power of 2, so in a growing step the number of buckets doubles.
There is not necessarily a data block for every bucket; some buckets can share a block if the total number of records in those buckets fits in one block.

Extensible Hashing
The hash function computes a sequence of k bits for each key.
The bucket numbers use only some of those bits, say the i most significant bits (i <= k).
Therefore the bucket array has 2^i entries.
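
Taking the i most significant bits of a k-bit value amounts to a right shift by k - i. The helper name `bucket_number` is hypothetical and the hash value is passed in as a plain integer.

```python
def bucket_number(hash_bits, k, i):
    """Return the i most significant bits of a k-bit hash value."""
    if not 0 <= i <= k:
        raise ValueError("i must lie between 0 and k")
    return hash_bits >> (k - i)


print(bucket_number(0b1010, k=4, i=1))  # 1
print(bucket_number(0b1010, k=4, i=2))  # 2, i.e. binary 10
```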

Extensible Hashing
Advantage: when looking for a record, we never need to search more than one data block.
Disadvantage: for large i, doubling the array is a substantial amount of work.

Extensible Hashing
Disadvantage: for large i, the bucket array may no longer fit in memory.
Example: with i = 32 the array has 2^32 (about 4 billion) entries. If every pointer is 32 bits (4 bytes), the array occupies 4 bytes x 4 billion = 16 GB.
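
The arithmetic of this example can be checked in a few lines. The function name `directory_bytes` is hypothetical, and the 4-byte pointer size is the slide's own assumption.

```python
def directory_bytes(i, pointer_bytes=4):
    """Size in bytes of a bucket array with 2**i pointer entries."""
    return (2 ** i) * pointer_bytes


size = directory_bytes(32)
print(size)            # 17179869184
print(size / 2 ** 30)  # 16.0, i.e. 16 GB counted in units of 2**30 bytes
```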

Extensible Hashing
1. Every key has 4 bits; the most significant bit is used to determine the bucket number.
2. The number 1 appearing in the nub of each block (let's call it j) indicates the number of bits used to determine membership of records in that block.

Extensible Hashing
Insertion, when the record belongs in a full block B whose nub holds j:
If j = i, double the length of the bucket array from 2^i to 2^(i+1) entries and increment i by 1. Now j < i, so continue with the next case.
If j < i, split block B in two, distribute the records of B between the two blocks based on their (j+1) most significant bits, set j to j+1 in both blocks, and adjust the pointers in the bucket array to point to the proper blocks.

Extensible Hashing
1. Let's insert 1010 into this structure. It has to go to block 1, but there is no room.
2. So we have to split the block.
3. Since i = j, we first increment i and double the size of the bucket array.
4. Then we can split block 1 into two blocks.

Extensible Hashing
1. Block 1 has now been split into blocks 10 and 11.
2. We now use two bits to determine the proper block for every record.
3. Note that the first block still uses one bit, so both buckets 00 and 01 point to it.
4. If we insert 0000, it goes to the block pointed to by buckets 00 and 01.
5. If we insert 0111, based on i = 2 it has to go to the same block, and there is no room.
6. Since j < i, we can simply split that block into two and adjust the proper bucket pointers.
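
The insertion rules and the walkthrough above can be sketched as follows. Several things are assumptions made only for this illustration: the class names `Block` and `ExtensibleHash`, a block capacity of 2 records, keys that serve as their own 4-bit hash values, and the three starting records 0001, 1001 and 1100 (the figure with the initial contents is not part of the text). The inserts of 1010, 0000 and 0111 follow the slides.

```python
K = 4          # bits produced by the hash function (keys are their own hash here)
CAPACITY = 2   # assumed number of records per block


class Block:
    def __init__(self, j):
        self.j = j          # number of bits that determine membership in this block
        self.records = []


class ExtensibleHash:
    def __init__(self):
        self.i = 1                              # bits used by the bucket array
        self.directory = [Block(1), Block(1)]   # 2**i pointers to blocks

    def _index(self, key):
        return key >> (K - self.i)              # i most significant bits

    def insert(self, key):
        block = self.directory[self._index(key)]
        while len(block.records) >= CAPACITY:
            if block.j == K:
                raise OverflowError("block cannot be split any further")
            if block.j == self.i:
                # Double the array: old entry m becomes new entries 2m and 2m+1.
                self.directory = [b for b in self.directory for _ in range(2)]
                self.i += 1
            self._split(block)
            block = self.directory[self._index(key)]
        block.records.append(key)

    def _split(self, block):
        j = block.j + 1
        low, high = Block(j), Block(j)
        for rec in block.records:
            bit = (rec >> (K - j)) & 1          # the (j+1)-th bit decides the side
            (high if bit else low).records.append(rec)
        for m, b in enumerate(self.directory):
            if b is block:
                bit = (m >> (self.i - j)) & 1
                self.directory[m] = high if bit else low


t = ExtensibleHash()
for key in (0b0001, 0b1001, 0b1100):   # assumed starting contents
    t.insert(key)
t.insert(0b1010)   # block 1 is full and j == i: double the array, then split
t.insert(0b0000)   # fits in the block shared by buckets 00 and 01
t.insert(0b0111)   # that block is full and j < i: split without doubling
```

After these inserts the array has four entries, each pointing to its own block, and i is still 2 because the last split did not require doubling.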

Linear Hashing
The number of buckets B is always chosen so that the average number of records per bucket is a fixed fraction, say 80%, of the number of records that fill one block.
Since blocks cannot always be split, overflow blocks are permitted.

Linear Hashing
The number of bits used to number the entries of the bucket array is i = ceiling(log2 B), where B is the current number of buckets.
These bits are always taken from the right (low-order) end of the bit sequence produced by the hash function.
We treat those i bits as a binary integer m.
If m < B, bucket m exists and the record goes there.
If B <= m < 2^i, bucket m does not exist yet, and we place the record in bucket m - 2^(i-1).
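
The addressing rule can be written as a small function. The name `bucket_for` is hypothetical, the parameter `n` plays the role of B, and `(n - 1).bit_length()` is used as an integer way of computing ceiling(log2 n).

```python
def bucket_for(hash_value, n):
    """Return the bucket that currently holds hash_value when n buckets exist."""
    i = (n - 1).bit_length()           # ceiling(log2 n) without floating point
    m = hash_value & ((1 << i) - 1)    # the i low-order bits as an integer
    if m < n:
        return m                       # bucket m exists
    return m - (1 << (i - 1))          # bucket m does not exist yet: drop the leading 1 bit


print(bucket_for(0b1010, n=3))  # 2
print(bucket_for(0b0111, n=3))  # 1, because bucket 3 does not exist yet
```

Subtracting 2^(i-1) clears the leading bit of m, so the record lands in the bucket that will later be split to create bucket m.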

Linear Hashing
1. i is the number of bits used to address the buckets; here only the rightmost bit is used.
2. n is the number of buckets (called B above).
3. r is the number of records.
4. We keep r/n <= 1.7, so the average occupancy of a bucket does not exceed 85% of the capacity of a block (a block holds 2 records here).

Linear Hashing
1. To insert 0101: since the bit sequence ends in 1, the record goes to bucket 1.
2. There is room, so it can go there.
3. However, r/n now exceeds 1.7, so we raise n to 3; then i = ceiling(log2 3) = 2.

Linear Hashing
1. Now we insert 0001. It has to go to bucket 01, since its last two bits are 01.
2. However, that bucket is full.
3. We add an overflow block.
4. The ratio of records to buckets is 5/3, still less than 1.7, so we don't create a new bucket.

Linear Hashing
1. Now let's insert 0111. This has to go to bucket m = 11 (binary).
2. m = 11 (binary) = 3 (decimal) = n, the number of buckets. Buckets are numbered 0 to n-1, so bucket m does not exist.
3. We place the record in bucket m - 2^(i-1), i.e. 3 - 2 = 1 (decimal) = 01 (binary).
4. However, r/n = 6/3 now exceeds 1.7, so we create a new bucket, numbered 11.
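
The inserts walked through above can be reproduced with a small sketch. The class name `LinearHash` and its methods are hypothetical, and the three starting records 0000, 1010 and 1111 are assumed, since the figure with the initial contents is not part of the text. Each bucket is a Python list that stands for its block together with any overflow blocks, so overflow chains are not modelled separately.

```python
class LinearHash:
    def __init__(self, max_load=1.7):
        self.n = 2                  # number of buckets
        self.r = 0                  # number of records
        self.max_load = max_load    # threshold for r/n
        self.buckets = [[], []]     # each list = a bucket's block plus overflow

    @property
    def i(self):
        return (self.n - 1).bit_length()    # ceiling(log2 n)

    def _address(self, key):
        m = key & ((1 << self.i) - 1)       # i low-order bits
        return m if m < self.n else m - (1 << (self.i - 1))

    def insert(self, key):
        self.buckets[self._address(key)].append(key)
        self.r += 1
        if self.r / self.n > self.max_load:
            self._add_bucket()

    def _add_bucket(self):
        new = self.n                        # number of the bucket being created
        self.n += 1
        self.buckets.append([])
        i = self.i                          # may have grown by one bit
        old = new - (1 << (i - 1))          # bucket that held the new bucket's records
        mask = (1 << i) - 1
        stay = [k for k in self.buckets[old] if k & mask != new]
        move = [k for k in self.buckets[old] if k & mask == new]
        self.buckets[old], self.buckets[new] = stay, move


lh = LinearHash()
for key in (0b0000, 0b1010, 0b1111):    # assumed starting contents
    lh.insert(key)
lh.insert(0b0101)   # r/n = 4/2 > 1.7: bucket 10 is added, bucket 00 is split
lh.insert(0b0001)   # goes to bucket 01; r/n = 5/3 stays below 1.7
lh.insert(0b0111)   # redirected to bucket 01; r/n = 6/3 > 1.7: bucket 11 is added
```

Note that the bucket that gets split is determined by the number of the new bucket, not by which bucket happened to overflow.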

Linear Hashing
1. Suppose we look for a key whose last two bits are 10.
2. Since i = 2, we look in bucket number m = 10 (binary) = 2 (decimal).
3. Since m < n, the bucket exists.
1. Now let's look for a key whose last two bits are 11.
2. It must be in bucket 11.
3. But 11 (binary) = 3 (decimal) = n, so the bucket doesn't exist.
4. We redirect to bucket 01 (binary) = 1 (decimal); remember m - 2^(i-1).
5. If the key is not there, it is certainly not in the table.
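
The two lookups can be sketched as below. The function name `lookup`, the bucket contents, and the search keys 1010 and 1011 are assumptions chosen to match the described state with i = 2 and n = 3; the keys used on the original slide are not recoverable from the text.

```python
def lookup(buckets, key):
    """Return (bucket_number, found) for key in a linear hash table."""
    n = len(buckets)
    i = (n - 1).bit_length()        # ceiling(log2 n)
    m = key & ((1 << i) - 1)        # i low-order bits of the key
    if m >= n:
        m -= 1 << (i - 1)           # bucket m does not exist: redirect
    return m, key in buckets[m]


# Assumed contents for n = 3 buckets (00, 01, 10).
buckets = [[0b0000], [0b1111, 0b0101, 0b0001], [0b1010]]

print(lookup(buckets, 0b1010))  # (2, True): m = 2 < n, the bucket exists
print(lookup(buckets, 0b1011))  # (1, False): m = 3 = n, redirected to bucket 1
```

Only one bucket (with its overflow blocks) is ever searched, which is why a miss in the redirected bucket settles the question.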