1
15.5 Two-Pass Algorithms Based on Hashing 115 ChenKuang Yang
2
Outline: Partitioning Relations by Hashing; A Hash-Based Algorithm for Duplicate Elimination; Hash-Based Grouping and Aggregation; Hash-Based Union, Intersection, and Difference; The Hash-Join Algorithm
3
The essential idea behind all these hash-based algorithms: if the data is too big to store in main memory, hash all the tuples of the argument or arguments using an appropriate hash key.
4
Partitioning Relations by Hashing
Partition R into M-1 buckets of roughly equal size. Associate one buffer with each bucket and use the remaining buffer to read R one block at a time. Each tuple t in the block is hashed to bucket h(t) and copied to the appropriate buffer. This assumes that a tuple is never too large to fit in an empty buffer.
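The partitioning step can be sketched in Python as follows. This is a minimal illustration, not the textbook's pseudocode: the names `partition`, `num_buckets`, and `block_size` are assumptions, Python's built-in `hash` stands in for h, and a list of blocks per bucket stands in for the disk.

```python
def partition(tuples, num_buckets, block_size=2):
    """Hash every tuple into one of num_buckets buckets.

    Each bucket owns one in-memory buffer. A full buffer is appended to the
    bucket's list of blocks, which stands in for writing a block to disk.
    """
    buffers = [[] for _ in range(num_buckets)]   # one buffer per bucket
    buckets = [[] for _ in range(num_buckets)]   # "disk": list of blocks per bucket
    for t in tuples:
        i = hash(t) % num_buckets                # bucket number h(t)
        buffers[i].append(t)
        if len(buffers[i]) == block_size:        # buffer full: flush it
            buckets[i].append(buffers[i])
            buffers[i] = []
    for i, buf in enumerate(buffers):            # flush partially filled buffers
        if buf:
            buckets[i].append(buf)
    return buckets


R = [(1, "a"), (2, "b"), (1, "a"), (3, "c"), (4, "d")]
buckets = partition(R, num_buckets=3)
```

With M memory buffers, `num_buckets` would be M-1 in the setting of this slide.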
5
A Hash-Based Algorithm for Duplicate Elimination
Two copies of the same tuple t will hash to the same bucket. We can examine one bucket at a time, perform δ on that bucket in isolation, and take as the answer the union of δ(R_i), where R_i is the portion of R that hashes to the ith bucket.
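A small Python sketch of this two-pass idea, under the assumption that each bucket fits in memory on the second pass. The function name `hash_distinct` and the in-memory lists that stand in for disk-resident buckets are illustrative, not from the slides.

```python
def hash_distinct(tuples, num_buckets):
    """Two-pass duplicate elimination: partition, then dedupe each bucket."""
    # Pass 1: copies of the same tuple always land in the same bucket.
    buckets = [[] for _ in range(num_buckets)]
    for t in tuples:
        buckets[hash(t) % num_buckets].append(t)

    # Pass 2: apply the one-pass duplicate elimination to each bucket alone.
    result = []
    for bucket in buckets:
        seen = set()
        for t in bucket:
            if t not in seen:
                seen.add(t)
                result.append(t)
    return result


distinct = hash_distinct([(1,), (2,), (1,), (3,), (2,)], num_buckets=2)
```

Because no tuple can appear in two different buckets, concatenating the per-bucket results gives the union of the δ(R_i) with no further checking.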
6
Hash-Based Grouping and Aggregation
To make sure that all tuples of the same group wind up in the same bucket, we must choose a hash function that depends only on the grouping attributes of the list L. If there are few groups, then we may actually be able to handle much larger relations R than is indicated by the B(R) ≤ M² rule, because the second pass needs only one record per group in memory.
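As an illustration, here is a hedged Python sketch that groups on one attribute and computes a SUM. The names `hash_group_sum`, `group_key`, and `value` are hypothetical, and SUM is chosen only as an example aggregate.

```python
def hash_group_sum(tuples, group_key, value, num_buckets):
    """Partition on the grouping attributes only, then aggregate per bucket."""
    # Pass 1: the hash depends only on the grouping attributes.
    buckets = [[] for _ in range(num_buckets)]
    for t in tuples:
        buckets[hash(group_key(t)) % num_buckets].append(t)

    # Pass 2: keep one running total per group seen in the current bucket.
    result = {}
    for bucket in buckets:
        totals = {}
        for t in bucket:
            g = group_key(t)
            totals[g] = totals.get(g, 0) + value(t)
        result.update(totals)   # groups never span buckets, so no key clashes
    return result


rows = [("a", 1), ("b", 2), ("a", 3), ("c", 5), ("b", 4)]
sums = hash_group_sum(rows, group_key=lambda t: t[0],
                      value=lambda t: t[1], num_buckets=2)
```

Note that `totals` holds one entry per group rather than one per tuple, which is why a relation with few groups can exceed the usual size bound.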
7
Hash-Based Union, Intersection, and Difference
When the operation is binary, we must make sure that we use the same hash function to hash tuples of both arguments. The one-pass algorithms for union, intersection, and difference require that the smaller operand occupies at most M-1 blocks.
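The following Python sketch shows the set versions of the three operations, applied bucket by bucket. It is an assumption-laden illustration: `partition` and `hash_set_op` are made-up names, Python sets play the role of the one-pass algorithms, and the bag versions are not covered.

```python
def partition(tuples, k):
    """Split tuples into k buckets using one shared hash function."""
    buckets = [[] for _ in range(k)]
    for t in tuples:
        buckets[hash(t) % k].append(t)
    return buckets


def hash_set_op(r, s, op, k=4):
    """Set union, intersection, or difference of r and s, one bucket pair at a time."""
    out = []
    # Both relations use the same hash function, so equal tuples of r and s
    # are guaranteed to meet in the same pair (R_i, S_i).
    for r_i, s_i in zip(partition(r, k), partition(s, k)):
        a, b = set(r_i), set(s_i)
        if op == "union":
            out.extend(a | b)
        elif op == "intersection":
            out.extend(a & b)
        elif op == "difference":
            out.extend(a - b)
        else:
            raise ValueError(f"unknown operation: {op}")
    return out


r = [(1,), (2,), (3,), (4,)]
s = [(3,), (4,), (5,)]
```

For example, `hash_set_op(r, s, "difference")` yields the tuples `(1,)` and `(2,)` in some order.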
8
The Hash-Join Algorithm
The join differs from the other operations in only one respect: the hash key must consist of just the join attributes. Then we can be sure that if tuples of R and S join, they will wind up in corresponding buckets R_i and S_i for some i.
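A minimal Python sketch of this algorithm follows. The names `hash_join`, `r_key`, and `s_key` are hypothetical, in-memory lists stand in for disk buckets, and the per-bucket dictionary is one possible way to do the one-pass join of R_i with S_i.

```python
from collections import defaultdict


def hash_join(r, s, r_key, s_key, k=4):
    """Two-pass hash join: partition both inputs on the join key, then join bucket pairs."""
    # Pass 1: hash both relations on the join attributes only.
    r_buckets = [[] for _ in range(k)]
    s_buckets = [[] for _ in range(k)]
    for t in r:
        r_buckets[hash(r_key(t)) % k].append(t)
    for u in s:
        s_buckets[hash(s_key(u)) % k].append(u)

    # Pass 2: join each pair of corresponding buckets in isolation.
    out = []
    for r_i, s_i in zip(r_buckets, s_buckets):
        index = defaultdict(list)        # in-memory lookup table built on S_i
        for u in s_i:
            index[s_key(u)].append(u)
        for t in r_i:
            for u in index.get(r_key(t), []):
                out.append(t + u)        # concatenate the matching tuples
    return out


R = [(1, "x"), (2, "y"), (2, "z")]
S = [(2, "p"), (3, "q")]
joined = hash_join(R, S, r_key=lambda t: t[0], s_key=lambda u: u[0])
```

Building the lookup table on the smaller relation's bucket keeps the memory needed in the second pass as low as possible.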
9
Saving Some Disk I/O
If there is more memory available on the first pass than we need to hold one block per bucket, then we have some opportunities to save disk I/O.
Hybrid hash-join: when we hash S, we can choose to keep m of the k buckets entirely in main memory, while keeping only one block for each of the other k-m buckets if