1 Improving Hash Join Performance through Prefetching
By SHIMIN CHEN (Intel Research Pittsburgh), ANASTASSIA AILAMAKI (Carnegie Mellon University), PHILLIP B. GIBBONS (Intel Research Pittsburgh), and TODD C. MOWRY (Carnegie Mellon University and Intel Research Pittsburgh)
Presented by Manisha Singh
2 Outline
- Overview
- Proposed Techniques
- Experimental Setup
- Performance Evaluation
- Conclusion
3 Hash Joins
- Used in the implementation of a relational database management system
- Two relations: build (small) and probe (large); a minimal sketch of the algorithm follows below
- Excessive random I/Os if the build relation and hash table cannot fit in memory
(Figure: the build relation is hashed into a hash table, which tuples of the probe relation then probe)
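To make the two phases concrete, here is a minimal in-memory hash join sketch in C. It is illustrative only: the bucket count, payload size, and chaining scheme are assumptions for the example, not the paper's implementation.

/* Minimal chained hash join sketch (illustrative assumptions:
 * NUM_BUCKETS, payload size, and chaining are not from the paper). */
#include <stdio.h>
#include <stddef.h>

#define NUM_BUCKETS 1024

typedef struct Tuple {
    int key;                /* 4-byte join attribute */
    char payload[20];       /* fixed-length payload */
    struct Tuple *next;     /* bucket chain link */
} Tuple;

static Tuple *buckets[NUM_BUCKETS];

static unsigned hash(int key) { return (unsigned)key % NUM_BUCKETS; }

/* Build phase: insert every tuple of the (small) build relation. */
static void build(Tuple *rel, size_t n) {
    for (size_t i = 0; i < n; i++) {
        unsigned b = hash(rel[i].key);
        rel[i].next = buckets[b];
        buckets[b] = &rel[i];
    }
}

/* Probe phase: look up every tuple of the (large) probe relation. */
static void probe(const Tuple *rel, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (Tuple *t = buckets[hash(rel[i].key)]; t; t = t->next)
            if (t->key == rel[i].key)
                printf("match: key=%d\n", t->key);  /* emit join result */
}

Every bucket access and every step down a chain is a dependent random memory reference, which is why the join spends most of its time on cache misses.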
4 Hash Join Performance
- Suffers from CPU cache stalls
- Most of the execution time is wasted on data cache misses: 82% for partition, 73% for join
- Caused by random access patterns in memory
5 Solution: Cache Prefetching
- Cache prefetching has been successfully applied to several types of applications
- This work exploits cache prefetching to improve hash join performance
6 Challenges to Cache Prefetching
- Difficult to obtain memory addresses early
  - Randomness of hashing prohibits address prediction
  - Data dependencies within the processing of a tuple
- Complexity of hash join code
  - Ambiguous pointer references
  - Multiple code paths
  - Cannot apply compiler prefetching techniques
7 Overcoming These Challenges
- Evaluate two new prefetching techniques:
  - Group prefetching: hides cache miss latency across a group of tuples
  - Software-pipelined prefetching: avoids the intermittent stalls at group boundaries
8 Group Prefetching
- Hides cache miss latency across a group of tuples
- Combines the processing of a group of tuples into a single loop body and rearranges the operations into stages
- Processes all tuples of the group for one stage, then moves to the next stage
- Adds prefetch instructions to the algorithm: each code stage issues prefetches for the memory references of the next code stage (sketched below)
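Here is a sketch of how the probe loop might look with group prefetching, using the GCC/Clang __builtin_prefetch intrinsic and the Tuple, hash(), and buckets definitions from the hash join sketch above. The group size and the two-stage split are illustrative; the paper's implementation uses more stages (e.g., also prefetching hash cells and matching build tuples).

/* Group prefetching sketch for the probe loop. */
#define GROUP_SIZE 8  /* illustrative group size */

static void probe_group_prefetch(const Tuple *rel, size_t n) {
    size_t i;
    for (i = 0; i + GROUP_SIZE <= n; i += GROUP_SIZE) {
        unsigned b[GROUP_SIZE];

        /* Stage 0: hash every tuple in the group and prefetch its
         * bucket header, so the GROUP_SIZE cache misses overlap. */
        for (int j = 0; j < GROUP_SIZE; j++) {
            b[j] = hash(rel[i + j].key);
            __builtin_prefetch(&buckets[b[j]], 0 /* read */, 3);
        }

        /* Stage 1: the headers should now be in cache; walk chains. */
        for (int j = 0; j < GROUP_SIZE; j++)
            for (Tuple *t = buckets[b[j]]; t; t = t->next)
                if (t->key == rel[i + j].key)
                    printf("match: key=%d\n", t->key);
    }
    /* Leftover tuples (i..n-1) would be processed without prefetching. */
}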
9 Group Prefetching
10 Software-Pipelined Prefetching
- Overlaps cache misses across different code stages of different tuples
- The code stages of the same tuple are processed in subsequent iterations
- Can overlap the cache miss latency of a tuple across all the processing in an iteration (sketched below)
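For comparison, the same probe loop in software-pipelined form might look like the sketch below (again reusing the definitions from the hash join sketch; the prefetch distance is an illustrative constant, and the pipeline prologue/epilogue are folded into the two if-tests rather than written out as in the paper).

/* Software-pipelined prefetching sketch for the probe loop. */
#include <stdlib.h>

#define PF_DIST 4  /* illustrative prefetch distance, in tuples */

static void probe_swp_prefetch(const Tuple *rel, size_t n) {
    /* b[i] carries tuple i's bucket index between its two stages. */
    unsigned *b = malloc(n * sizeof *b);
    if (!b) return;

    for (size_t i = 0; i < n + PF_DIST; i++) {
        /* Stage 0 for tuple i: hash it and prefetch its bucket header. */
        if (i < n) {
            b[i] = hash(rel[i].key);
            __builtin_prefetch(&buckets[b[i]], 0 /* read */, 3);
        }
        /* Stage 1 for tuple i - PF_DIST: its prefetch was issued
         * PF_DIST iterations ago, so the header should have arrived. */
        if (i >= PF_DIST) {
            size_t k = i - PF_DIST;
            for (Tuple *t = buckets[b[k]]; t; t = t->next)
                if (t->key == rel[k].key)
                    printf("match: key=%d\n", t->key);
        }
    }
    free(b);
}

Unlike group prefetching, there is no group boundary where the pipeline drains, which is why this version can keep hiding latency steadily, at the cost of the extra book-keeping array.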
11 Software-Pipelined Prefetching
12 Group vs. Software-Pipelined Prefetching
- Hiding latency:
  - Software-pipelined prefetching is always able to hide all latencies
- Book-keeping overhead:
  - Software-pipelined prefetching has more overhead
- Code complexity:
  - Group prefetching is easier to implement
  - The natural group boundary provides a place to perform any remaining processing (e.g., handling read-write conflicts)
  - It is also a natural place to send outputs to the parent operator if a pipelined operator is needed
13 Experimental Setup
- A simple schema is used for both the build and probe relations
- Every tuple contains a 4-byte join attribute and a fixed-length payload
- Joins are performed without selections or projections
- The join phase is assumed to use 50MB of memory to join a pair of build and probe partitions
14 Performance Evaluation
- Hash join is CPU-bound with reasonable I/O bandwidth
- The "main total" time is the elapsed real time of an algorithm phase
- The "worker I/O stall" time is the largest I/O stall time among the individual worker threads
15 Performance Evaluation cont..
- User-mode CPU cache performance: join phase performance
- The proposed techniques achieved X speedups over the original hash join
16 Performance Evaluation cont..
- Join performance with varying memory latency
- The prefetching techniques remain effective even when the processor/memory speed gap increases dramatically
17 Performance Evaluation cont..
18 Some Practical Issues
Some issues may arise when implementing these prefetching techniques in a production DBMS that targets multiple architectures and is distributed as binaries:
1. The syntax of prefetch instructions often differs across architectures and compilers (see the sketch below)
2. Some architectures do not support faulting prefetches
3. Several architectures (e.g., network processors) require software to explicitly manage the caches
4. Pre-set parameters for the group size and the prefetch distance may be suboptimal on machines with very different configurations
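Issue 1 is commonly handled by hiding the compiler-specific intrinsic behind a single macro; a sketch for GCC/Clang and MSVC on x86 follows (the macro name is ours, and the no-op fallback also covers targets with no usable prefetch, which relates to issues 2 and 3):

/* Portable read-prefetch macro (sketch; PREFETCH_READ is our name). */
#if defined(__GNUC__) || defined(__clang__)
#  define PREFETCH_READ(addr) __builtin_prefetch((addr), 0, 3)
#elif defined(_MSC_VER)
#  include <xmmintrin.h>
#  define PREFETCH_READ(addr) _mm_prefetch((const char *)(addr), _MM_HINT_T0)
#else
#  define PREFETCH_READ(addr) ((void)(addr))  /* no-op fallback */
#endif

For issue 4, the group size and prefetch distance can likewise be made run-time tunables instead of compile-time constants, so one distributed binary can be adapted per machine.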
19 Conclusion
- Even though prefetching is a promising technique for improving CPU cache performance, applying it to the hash join algorithm is not straightforward, due to the dependencies within the processing of a single tuple and the randomness of hashing
- Experimental results demonstrated that hash join performance can be improved by using the group prefetching and software-pipelined prefetching techniques
- Several practical issues arise when the techniques are used in a DBMS that targets multiple architectures
20 Thank you Questions?