Download presentation
Presentation is loading. Please wait.
1
Query Execution 15.5 Two-pass Algorithms based on Hashing By Swathi Vegesna
2
Introduction Hashing is done if the data is too big to store in main memory buffers. – Hash all the tuples of the argument(s) using an appropriate hash key. – For all the common operations, there is a way to select the hash key so all the tuples that need to be considered together when we perform the operation have the same hash value. – This reduces the size of the operand(s) by a factor equal to the number of buckets.
3
Partitioning Relations by Hashing Algorithm: initialize M-1 buckets using M-1 empty buffers; FOR each block b of relation R DO BEGIN read block b into the Mth buffer; FOR each tuple t in b DO BEGIN IF the buffer for bucket h(t) has no room for t THEN BEGIN copy the buffer t o disk; initialize a new empty block in that buffer; END; copy t to the buffer for bucket h(t); END ; FOR each bucket DO IF the buffer for this bucket is not empty THEN write the buffer to disk;
4
Duplicate Elimination For the operation δ(R) hash R to M-1 Buckets. (Note that two copies of the same tuple t will hash to the same bucket) Do duplicate elimination on each bucket R i independently, using one-pass algorithm The result is the union of δ(R i ), where R i is the portion of R that hashes to the ith bucket
5
Requirements Number of disk I/O's: 3*B(R) – B(R) < M(M-1), only then the two-pass, hash-based algorithm will work In order for this to work, we need: – hash function h evenly distributes the tuples among the buckets – each bucket R i fits in main memory (to allow the one-pass algorithm) – i.e., B(R) ≤ M 2
6
Grouping and Aggregation Hash all the tuples of relation R to M-1 buckets, using a hash function that depends only on the grouping attributes (Note: all tuples in the same group end up in the same bucket) Use the one-pass algorithm to process each bucket independently Uses 3*B(R) disk I/O's, requires B(R) ≤ M 2
7
Union, Intersection, and Difference For binary operation we use the same hash function to hash tuples of both arguments. R U S we hash both R and S to M-1 R ∩ S we hash both R and S to 2(M-1) R-S we hash both R and S to 2(M-1) Requires 3(B(R)+B(S)) disk I/O’s. Two pass hash based algorithm requires min(B(R)+B(S))≤ M2
8
Hash-Join Algorithm Use same hash function for both relations; hash function should depend only on the join attributes Hash R to M-1 buckets R 1, R 2, …, R M-1 Hash S to M-1 buckets S 1, S 2, …, S M-1 Do one-pass join of R i and S i, for all i 3*(B(R) + B(S)) disk I/O's; min(B(R),B(S)) ≤ M 2
9
Sort based Vs Hash based For binary operations, hash-based only limits size to min of arguments, not sum Sort-based can produce output in sorted order, which can be helpful Hash-based depends on buckets being of equal size Sort-based algorithms can experience reduced rotational latency or seek time
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.