Improving Hash Join Performance Through Prefetching
Shimin Chen, Phillip B. Gibbons‡, Todd C. Mowry, Anastassia Ailamaki
Carnegie Mellon University    ‡Intel Research Pittsburgh
- 2 - Hash Join
Simple hash join:
  Build a hash table on the smaller (build) relation
  Probe the hash table using the larger (probe) relation
Random access patterns are inherent in hashing
Excessive random I/Os if the build relation and hash table cannot fit in memory
(Figure: build relation, hash table, probe relation)
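For concreteness, here is a minimal C sketch of the simple hash join above. Everything in it is illustrative rather than taken from the paper's engine: the type names (tuple_t, hash_cell_t, bucket_t, hash_table_t), the 100-byte tuple layout, and the trivial modulo hash are assumptions chosen only to match the slides' vocabulary.

/* Minimal in-memory simple hash join sketch; illustrative types and sizes. */
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t key; char payload[96]; } tuple_t;          /* ~100B tuples, as in the experiments */
typedef struct { uint32_t hash; tuple_t *build_tuple; } hash_cell_t; /* (hash code, build tuple ptr)        */
typedef struct { int count; hash_cell_t *cells; } bucket_t;          /* bucket header + cell array          */
typedef struct { uint32_t nbuckets; bucket_t *buckets; } hash_table_t;

static uint32_t hash_key(uint32_t k, uint32_t nbuckets) { return k % nbuckets; }

/* Build: insert every tuple of the (smaller) build relation into the hash table. */
static void build(hash_table_t *ht, tuple_t *build_rel, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        bucket_t *b = &ht->buckets[hash_key(build_rel[i].key, ht->nbuckets)];
        b->cells = realloc(b->cells, (b->count + 1) * sizeof(hash_cell_t));
        b->cells[b->count].hash = build_rel[i].key;   /* the key doubles as the hash code in this sketch */
        b->cells[b->count].build_tuple = &build_rel[i];
        b->count++;
    }
}

/* Probe: for each tuple of the (larger) probe relation, walk its bucket.
   Every hash-table access is effectively a random memory reference. */
static size_t probe(hash_table_t *ht, tuple_t *probe_rel, size_t n)
{
    size_t matches = 0;
    for (size_t i = 0; i < n; i++) {
        bucket_t *b = &ht->buckets[hash_key(probe_rel[i].key, ht->nbuckets)];
        for (int c = 0; c < b->count; c++)
            if (b->cells[c].build_tuple->key == probe_rel[i].key)
                matches++;   /* a real join would emit an output tuple here */
    }
    return matches;
}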
- 3 - I/O Partitioning
Avoid excessive random disk accesses
Join pairs of build and probe partitions separately
Sequential I/O patterns for relations and partitions
Hash join is CPU-bound with reasonable I/O bandwidth
(Figure: build and probe relations divided into matching partitions)
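A sketch of the partition step in the same spirit: scan the relation sequentially and append each tuple to its hash partition's file. The partition count, file names, local part_tuple_t layout, and missing error handling are all illustrative; a real engine would buffer output pages per partition rather than keep hundreds of streams open.

/* Illustrative I/O partitioning sketch: hash each tuple to one of NPART files. */
#include <stdio.h>
#include <stdint.h>

#define NPART 800   /* e.g., 800 partitions for a 1GB relation, as on the next slide */

typedef struct { uint32_t key; char payload[96]; } part_tuple_t;  /* 4B join key + fixed-length payload */

void partition_relation(const char *in_path)
{
    FILE *in = fopen(in_path, "rb");
    FILE *out[NPART];
    char name[64];
    for (int p = 0; p < NPART; p++) {
        snprintf(name, sizeof name, "part_%03d.bin", p);   /* hypothetical file naming */
        out[p] = fopen(name, "wb");
    }
    part_tuple_t t;
    while (fread(&t, sizeof t, 1, in) == 1) {
        int p = t.key % NPART;            /* same hash applied to build and probe side */
        fwrite(&t, sizeof t, 1, out[p]);  /* writes to each partition stay sequential  */
    }
    for (int p = 0; p < NPART; p++) fclose(out[p]);
    fclose(in);
}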
- 4 - Hash Join Cache Performance
Partition: divides a 1GB relation into 800 partitions
Join: a 50MB build partition with a 100MB probe partition
Detailed simulations based on a Compaq ES40 system
Most of the execution time is wasted on data cache misses: 82% for partition, 73% for join, because of random access patterns in memory
- 5 - Employing Partitioning for the Cache?
Cache partitioning: generating cache-sized partitions
Effective in main-memory databases [Shatdal et al., 94], [Boncz et al., 99], [Manegold et al., 00]
Two limitations when used in commercial databases:
1) Usually needs an additional in-memory partitioning pass, because the cache is much smaller than main memory; 50% worse than our techniques
2) Sensitive to cache sharing by multiple activities
(Figure: main memory, CPU cache, build partition, hash table)
- 6 - Our Approach: Cache Prefetching
Modern processors support:
  Multiple cache misses serviced simultaneously
  Prefetch assembly instructions for exploiting this parallelism
Overlap cache miss latency with computation
Successfully applied to:
  Array-based programs [Mowry et al., 92]
  Pointer-based programs [Luk & Mowry, 96]
  Database B+-trees [Chen et al., 01]
(Figure: CPU, L1 cache, L2/L3 cache, main memory, with outstanding prefetches such as pref 0(r2), pref 4(r7), pref 0(r3), pref 8(r9))
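On GCC and Clang these prefetch instructions are reachable from C through the __builtin_prefetch intrinsic. A tiny example unrelated to hash joins, with an arbitrary prefetch distance of 64 elements (8 cache lines of doubles):

/* Issue a software prefetch a fixed distance ahead of the current access. */
#include <stddef.h>

void sum_with_prefetch(const double *a, size_t n, double *out)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 64 < n)
            __builtin_prefetch(&a[i + 64], /*rw=*/0, /*locality=*/3); /* miss overlaps with later adds */
        s += a[i];
    }
    *out = s;
}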
- 7 - Challenges for Cache Prefetching
Difficult to obtain memory addresses early:
  Randomness of hashing prohibits address prediction
  Data dependencies within the processing of a tuple
  The naïve approach does not work
Complexity of hash join code:
  Ambiguous pointer references
  Multiple code paths
  Compiler prefetching techniques cannot be applied
- 8 - Our Solution
Dependencies are rare across subsequent tuples
Exploit inter-tuple parallelism: overlap the cache misses of one tuple with the computation and cache misses of other tuples
We propose two prefetching techniques:
  Group prefetching
  Software-pipelined prefetching
- 9 - Outline
Overview
Our Proposed Techniques
  Simplified Probing Algorithm
  Naïve Prefetching
  Group Prefetching
  Software-Pipelined Prefetching
  Dealing with Complexities
Experimental Results
Conclusions
- 10 - Simplified Probing Algorithm
foreach probe tuple {
  (0) compute bucket number;
  (1) visit header;
  (2) visit cell array;
  (3) visit matching build tuple;
}
(Figure: hash bucket headers, hash cells (hash code, build tuple ptr), build partition)
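The same four stages written out in C, reusing the hypothetical types from the earlier hash join sketch; each numbered step dereferences a different structure, so each can incur a cache miss.

/* Simplified probing algorithm with the four stages labeled. */
void probe_simplified(hash_table_t *ht, tuple_t *probe_rel, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t b = probe_rel[i].key % ht->nbuckets;     /* (0) compute bucket number */
        bucket_t *hdr = &ht->buckets[b];                  /* (1) visit header          */
        for (int c = 0; c < hdr->count; c++) {
            hash_cell_t *cell = &hdr->cells[c];           /* (2) visit cell array      */
            if (cell->hash == probe_rel[i].key &&
                cell->build_tuple->key == probe_rel[i].key) {
                /* (3) visit matching build tuple, compare keys, emit output */
            }
        }
    }
}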
- 11 - Naïve Prefetching
foreach probe tuple {
  (0) compute bucket number; prefetch header;
  (1) visit header; prefetch cell array;
  (2) visit cell array; prefetch matching build tuple;
  (3) visit matching build tuple;
}
(Timeline figure: the stages of consecutive tuples still run back to back, each separated by the cache miss latency)
Data dependencies make it difficult to obtain addresses early
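A C rendering of the naïve scheme, again with the hypothetical types from the earlier sketches: each prefetch is issued as soon as its address is known, but the very next statement already needs the data, so little or no latency is hidden.

/* Naïve prefetching: prefetches land too late to overlap with useful work. */
void probe_naive_prefetch(hash_table_t *ht, tuple_t *probe_rel, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t b = probe_rel[i].key % ht->nbuckets;     /* (0) compute bucket number */
        bucket_t *hdr = &ht->buckets[b];
        __builtin_prefetch(hdr);                          /* prefetch header           */
        int cnt = hdr->count;                             /* (1) visit header - stalls */
        hash_cell_t *cells = hdr->cells;
        __builtin_prefetch(cells);                        /* prefetch cell array       */
        for (int c = 0; c < cnt; c++) {                   /* (2) visit cell array      */
            if (cells[c].hash == probe_rel[i].key) {
                tuple_t *bt = cells[c].build_tuple;
                __builtin_prefetch(bt);                   /* prefetch build tuple      */
                if (bt->key == probe_rel[i].key) {        /* (3) visit build tuple     */
                    /* emit output tuple */
                }
            }
        }
    }
}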
- 12 - Group Prefetching
foreach group of probe tuples {
  foreach tuple in group { (0) compute bucket number; prefetch header; }
  foreach tuple in group { (1) visit header; prefetch cell array; }
  foreach tuple in group { (2) visit cell array; prefetch build tuple; }
  foreach tuple in group { (3) visit matching build tuple; }
}
(Timeline figure: within a group, the stages of different tuples overlap, hiding the miss latency)
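A group prefetching sketch for the probe loop, using the same hypothetical types. To keep the stages short it examines only the first cell of each bucket and leaves the n mod G leftover tuples to a simple loop; the paper's general algorithm handles multiple matches, empty buckets, and the remainder.

/* Group prefetching sketch: run each stage over a whole group of tuples,
   remembering per-tuple pointers between stages in small arrays. */
#include <stddef.h>

#define G 16   /* group size; a tunable parameter */

void probe_group_prefetch(hash_table_t *ht, tuple_t *probe_rel, size_t n)
{
    bucket_t    *hdr[G];
    hash_cell_t *cell[G];

    for (size_t base = 0; base + G <= n; base += G) {
        for (int k = 0; k < G; k++) {                     /* stage (0): bucket number + prefetch header   */
            hdr[k] = &ht->buckets[probe_rel[base + k].key % ht->nbuckets];
            __builtin_prefetch(hdr[k]);
        }
        for (int k = 0; k < G; k++) {                     /* stage (1): visit header + prefetch cells     */
            cell[k] = hdr[k]->count ? &hdr[k]->cells[0] : NULL;
            if (cell[k]) __builtin_prefetch(cell[k]);
        }
        for (int k = 0; k < G; k++) {                     /* stage (2): visit cell + prefetch build tuple */
            if (cell[k] && cell[k]->hash == probe_rel[base + k].key)
                __builtin_prefetch(cell[k]->build_tuple);
            else
                cell[k] = NULL;                           /* no candidate match for this tuple */
        }
        for (int k = 0; k < G; k++) {                     /* stage (3): visit matching build tuple        */
            if (cell[k] && cell[k]->build_tuple->key == probe_rel[base + k].key) {
                /* emit output tuple here */
            }
        }
    }
    /* the n % G leftover tuples would be probed with the simple loop */
}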
- 13 - Software-Pipelined Prefetching
Prologue;
for j = 0 to N-4 do {
  tuple j+3: (0) compute bucket number; prefetch header;
  tuple j+2: (1) visit header; prefetch cell array;
  tuple j+1: (2) visit cell array; prefetch build tuple;
  tuple j:   (3) visit matching build tuple;
}
Epilogue;
(Timeline figure: after the prologue, the stages of tuples j through j+3 overlap in every iteration until the epilogue)
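A software-pipelined sketch of the same simplified probe, with prefetching distance 1. Per-tuple intermediate pointers are kept in heap arrays and the prologue and epilogue are folded into bounds checks; as with the group sketch, only the first cell of a bucket is examined, and all names are illustrative.

/* Software-pipelined prefetching: tuple j+3 is at stage 0 while tuple j finishes stage 3. */
#include <stdlib.h>

void probe_swp_prefetch(hash_table_t *ht, tuple_t *probe_rel, long n)
{
    bucket_t    **hdr  = malloc(n * sizeof *hdr);    /* per-tuple bucket header pointer */
    hash_cell_t **cell = malloc(n * sizeof *cell);   /* per-tuple candidate cell        */

    for (long j = -3; j < n; j++) {
        long t0 = j + 3, t1 = j + 2, t2 = j + 1, t3 = j;
        if (t0 < n) {                                /* stage (0) for tuple j+3 */
            hdr[t0] = &ht->buckets[probe_rel[t0].key % ht->nbuckets];
            __builtin_prefetch(hdr[t0]);
        }
        if (t1 >= 0 && t1 < n) {                     /* stage (1) for tuple j+2 */
            cell[t1] = hdr[t1]->count ? &hdr[t1]->cells[0] : NULL;
            if (cell[t1]) __builtin_prefetch(cell[t1]);
        }
        if (t2 >= 0 && t2 < n) {                     /* stage (2) for tuple j+1 */
            if (cell[t2] && cell[t2]->hash == probe_rel[t2].key)
                __builtin_prefetch(cell[t2]->build_tuple);
            else
                cell[t2] = NULL;
        }
        if (t3 >= 0) {                               /* stage (3) for tuple j   */
            if (cell[t3] && cell[t3]->build_tuple->key == probe_rel[t3].key) {
                /* emit output tuple here */
            }
        }
    }
    free(hdr);
    free(cell);
}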
- 14 - Dealing with Multiple Code Paths
Multiple code paths:
  There could be 0 or many matches
  Hash buckets could be empty or full
Previous compiler techniques cannot handle this
(Figure: control-flow graph where node A branches on a condition to B, C, or F; C leads to D, F leads to G, and all paths rejoin at E)
Keep state information for the tuples being processed:
  Record the state taken
  Test the state to decide:
    Do nothing, if state = B
    Execute D, if state = C
    Execute G, if state = F
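One way such per-tuple state can look in code (the enum values, struct, and stage functions are made up for illustration, and the slide's three-way B/C/F split is collapsed to two states for brevity): the stage that branches records the outcome, and the next stage simply switches on it instead of re-deriving it.

/* Per-tuple state carried between stages in group prefetching. */
enum path_state { ST_EMPTY_BUCKET, ST_HAS_CELLS };

typedef struct {
    enum path_state st;
    hash_cell_t    *cell;      /* whatever the recorded path needs later */
} tuple_state_t;

void stage_visit_header(bucket_t *hdr[], tuple_state_t state[], int group_size)
{
    for (int k = 0; k < group_size; k++) {
        if (hdr[k]->count == 0) {
            state[k].st = ST_EMPTY_BUCKET;      /* like state B: nothing left to do      */
        } else {
            state[k].st   = ST_HAS_CELLS;       /* like states C/F: a later stage has work */
            state[k].cell = hdr[k]->cells;
            __builtin_prefetch(state[k].cell);
        }
    }
}

void stage_visit_cells(tuple_state_t state[], tuple_t probe[], int group_size)
{
    for (int k = 0; k < group_size; k++) {
        switch (state[k].st) {
        case ST_EMPTY_BUCKET:                   /* do nothing, if state = B               */
            break;
        case ST_HAS_CELLS:                      /* execute the recorded path's code       */
            if (state[k].cell->hash == probe[k].key)
                __builtin_prefetch(state[k].cell->build_tuple);
            break;
        }
    }
}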
- 15 - Dealing with Read-Write Conflicts
In hash table building:
  Use a busy flag in the bucket header to detect conflicts
  Postpone hashing the 2nd tuple until the 1st is finished
  A compiler cannot perform this transformation
(Figure: hash bucket headers and build tuples)
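A hedged sketch of the busy-flag idea for the build phase with group prefetching: if a tuple's bucket is already being updated by an earlier in-flight tuple of the same group, its insertion is deferred to the group boundary. The build_bucket_t layout, the fixed 64-tuple bound, the deferred list, and insert_cell are all illustrative, not the paper's code; hash_cell_t and tuple_t are the hypothetical types from the earlier sketches.

/* Busy-flag conflict detection during hash table building. */
#include <stdlib.h>
#include <stdint.h>

typedef struct {
    int          busy;    /* set while an in-flight tuple of the current group updates this bucket */
    int          count;
    hash_cell_t *cells;
} build_bucket_t;

static void insert_cell(build_bucket_t *b, tuple_t *t)      /* append (hash code, tuple ptr) */
{
    b->cells = realloc(b->cells, (b->count + 1) * sizeof(hash_cell_t));
    b->cells[b->count].hash = t->key;
    b->cells[b->count].build_tuple = t;
    b->count++;
}

/* Insert one group of build tuples; assumes group_size <= 64. */
void build_group(build_bucket_t *buckets, uint32_t nbuckets,
                 tuple_t *grp, int group_size)
{
    build_bucket_t *target[64];
    tuple_t        *deferred[64];
    int ndeferred = 0;

    for (int k = 0; k < group_size; k++) {          /* stage: compute bucket + prefetch header      */
        target[k] = &buckets[grp[k].key % nbuckets];
        __builtin_prefetch(target[k]);
    }
    for (int k = 0; k < group_size; k++) {          /* stage: insert unless a conflict is detected  */
        if (target[k]->busy) {
            deferred[ndeferred++] = &grp[k];        /* read-write conflict: postpone this tuple     */
        } else {
            target[k]->busy = 1;
            insert_cell(target[k], &grp[k]);
        }
    }
    for (int k = 0; k < group_size; k++)            /* group boundary: clear the busy flags         */
        target[k]->busy = 0;
    for (int i = 0; i < ndeferred; i++)             /* process the postponed tuples sequentially    */
        insert_cell(&buckets[deferred[i]->key % nbuckets], deferred[i]);
}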
- 16 - More Details in the Paper
General group prefetching algorithm
General software-pipelined prefetching algorithm
Analytical models
Discussion of important parameters: group size, prefetching distance
Implementation details
- 17 - Outline
Overview
Our Proposed Techniques
Experimental Results
  Setup
  Performance of Our Techniques
  Comparison with Cache Partitioning
Conclusions
- 18 - Experimental Setup
Relation schema: 4-byte join attribute + fixed-length payload
No selection, no projection
50MB memory available for the join phase
Detailed cycle-by-cycle simulations:
  1GHz superscalar processor
  Memory hierarchy based on the Compaq ES40
- 19 - Joining a Pair of Build and Probe Partitions
A 50MB build partition joins a 100MB probe partition; 1:2 matching; the number of tuples decreases as the tuple size increases
Our techniques achieve 2.1-2.9X speedups over the original hash join
- 20 - Varying Memory Latency
(Figure: execution time in M cycles vs. processor-to-memory latency)
A 50MB build partition joins a 100MB probe partition; 1:2 matching; 100B tuples
150 cycles: default parameter; 1000 cycles: memory latency expected in the future
Our techniques achieve 9X speedups over the baseline at 1000 cycles
The absolute performance of our techniques at 1000 cycles stays very close to that at 150 cycles
- 21 - Comparison with Cache Partitioning
Cache partitioning: generating cache-sized partitions [Shatdal et al., 94], [Boncz et al., 99], [Manegold et al., 00]
An additional in-memory partition step after I/O partitioning
At least 50% worse than our prefetching schemes
A 200MB build relation joins a 400MB probe relation; 1:2 matching
(Figure: partitioning + join execution time)
- 22 - Robustness: Impact of Cache Interference
Cache partitioning relies on exclusive use of the cache
Periodically flushing the cache models worst-case interference
Results are normalized to the execution time with no flushes
Cache partitioning degrades by 8-38%
Our prefetching schemes are very robust
- 23 - Conclusions
Exploited inter-tuple parallelism
Proposed group prefetching and software-pipelined prefetching; prior prefetching techniques cannot handle the code complexity
Our techniques achieve dramatically better performance:
  2.1-2.9X speedups for the join phase
  1.4-2.6X speedups for the partition phase
  9X speedups at the 1000-cycle memory latency expected in the future, with absolute performance close to that at 150 cycles
Robust against cache interference, unlike cache partitioning
Our prefetching techniques are effective for hash joins
- 24 - Thank you!
- 25 - Backup Slides
- 26 - Is Hash Join CPU-Bound?
Quad-processor Pentium III, four disks
A 1.5GB build relation, a 3GB probe relation
Main thread: GRACE hash join
Background I/O thread per disk: I/O prefetching and writing
Hash join is CPU-bound with reasonable I/O bandwidth; still large room for CPU performance improvement
(Figure: partition phase and join phase measurements)
Setup: 550MHz CPUs, 512MB RAM, Seagate Cheetah X15 36LP SCSI disks (max transfer rate 68 MByte/sec), Linux 2.4.18; 100B tuples, 4B keys, 1:2 matching; striping unit = 256KB; 10 measurements, std < 10% of mean or std < 1s
- 27 - Hiding Latency Within a Group
Hide cache miss latency across multiple tuples within a group
The group size can be increased to hide most cache miss latency for hash joins
Generic algorithm and analytical model (please see the paper)
There are gaps between groups
(Timeline figure: overlapped tuple stages for group size = 3 and group size = 5)
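A rough back-of-envelope form of this trade-off (the paper's analytical model is the authoritative version): a prefetch issued for a tuple in one stage is first used one stage later, after roughly the per-stage work of the other G - 1 tuples in the group. Writing W for the computation per tuple per stage and L for the cache miss latency (both symbols introduced here for illustration), the miss is fully hidden when roughly

  (G - 1) * W >= L,   equivalently   G >= L / W + 1

which is why increasing the group size hides more of the latency, up to the point where other overheads dominate.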
- 28 - Prefetching Distance
Prefetching distance (D): the number of iterations between two subsequent code stages for a single tuple
Increase the prefetching distance to hide all cache miss latency
Generic algorithm and analytical model (please see the paper)
(Timeline figure: pipelined tuple stages for D = 1 and D = 2)
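Again as a rough sketch rather than the paper's exact model: with prefetching distance D, a prefetch issued for a tuple is first consumed D full iterations of the pipelined loop later. Writing W_iter for the work of one loop iteration (all four stages, one tuple each) and L for the miss latency, both illustrative symbols, the latency is covered when roughly

  D * W_iter >= L,   i.e.   D >= L / W_iter

so a larger D hides longer latencies at the cost of more in-flight bookkeeping state.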
- 29 - Group Prefetching: Multiple Code Paths
We keep state information for the tuples in a group
One of the recorded states decides which code path to take
(Figure: the control-flow graph from slide 14 annotated with states: taking B sets st = 1, C sets st = 2, F sets st = 3; later, D executes if st = 2, G executes if st = 3, E executes if st = 2 or 3, and nothing happens otherwise)
- 30 - Prefetching Distance = D
Prologue;
for j = 0 to N-3D-1 do {
  tuple j+3D: compute hash bucket number; prefetch the target bucket header;
  tuple j+2D: visit the hash bucket header; prefetch the hash cell array;
  tuple j+D:  visit the hash cell array; prefetch the matching build tuple;
  tuple j:    visit the matching build tuple to compare keys and produce the output tuple;
}
Epilogue;
- 31 - Experiment Setup
We have implemented our own hash join engine:
  Relations are stored in files with a slotted page structure
  A simple XOR- and shift-based hash function is used
  Schemes: GRACE hash join (baseline), simple prefetching, group prefetching, software-pipelined prefetching
Experiment design:
  Same schema for build and probe relations: 4-byte key + fixed-length payload
  No selection and no projection
  50MB memory available for the joins
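The slides do not show the hash function itself; a typical XOR-and-shift-only mix looks like the following (an xorshift-style step chosen for illustration, not necessarily the authors' exact function, and bucket_of is a hypothetical helper):

/* A purely XOR- and shift-based 32-bit mix; illustrative only. */
#include <stdint.h>

static inline uint32_t xor_shift_hash(uint32_t k)
{
    k ^= k << 13;
    k ^= k >> 17;
    k ^= k << 5;
    return k;
}

static inline uint32_t bucket_of(uint32_t key, uint32_t nbuckets)
{
    return xor_shift_hash(key) % nbuckets;   /* reduce the hash to a bucket number */
}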
- 32 - Simulation Platform
Detailed cycle-by-cycle simulations:
  Out-of-order processor pipeline
  Integer multiply and divide latencies are based on the Pentium 4
  Memory hierarchy is based on the Compaq ES40, with memory system parameters projected for the near future
  Better prefetching support, including TLB prefetching
- 33 - Simulation Parameters
Processor pipeline parameters:
  Clock rate: 1 GHz
  Issue width: 4 insts/cycle
  Functional units: 2 Int, 2 FP, 1 Int Divide, 2 Mem, 1 Branch
  Reorder buffer size: 128 insts
  Integer multiply/divide: 15/56 cycles
  All other integer: 1 cycle
  Branch prediction scheme: gshare
Memory parameters:
  Line size: 64 bytes
  Primary inst cache: 64 KB, 2-way set assoc.
  Primary data cache: 64 KB, 4-way set assoc.
  Miss handlers: 32 for data, 2 for inst
  DTLB: 64 entries, fully assoc.
  DTLB miss handlers: 1
  Page size: 8 KB
  Unified secondary cache: 1 MB, 4-way set assoc.
  Primary-to-secondary miss latency: 15 cycles (plus contention)
  DTLB miss latency: 20 cycles
  Primary-to-memory miss latency: 150 cycles (plus contention)
  Main memory bandwidth: 1 access per 10 cycles
- 34 - Simulator vs. Real Machine
Better prefetching support:
  Prefetches are never ignored (our prefetches are not hints!)
  TLB prefetching: when a prefetch incurs a TLB miss, the TLB entry is loaded
  More miss handlers: 32 for data
- 35 - Varying Group Size and Prefetching Distance
Too small: latencies are not fully hidden
Too large: many prefetched cache lines are replaced by other memory references
Similar performance even when the latency increases to 1000 cycles!
(Figure: hash table probing, varying the parameters for the 20B-tuple case in the previous figure)
- 36 - Breakdowns of Cache Misses to Understand the Tuning Curves
- 37 - Cache Performance Breakdowns
Our schemes indeed hide most of the data cache miss latency
Overheads lead to larger portions of busy time
(Figure: join with 100B tuples, partition with 800 partitions)
- 38 - Partition Phase Performance
When the number of partitions is small, use simple prefetching
When the number of partitions is large, use group or software-pipelined prefetching
Combined: 1.9-2.6X speedups over the baseline
- 39 - Cache Partitioning Schemes
Two ways to employ cache partitioning in a GRACE join:
  Direct cache: generate the cache-sized partitions in the I/O partition phase
    Requires generating a larger number of smaller I/O partitions
    Bounded by the available memory and by the requirements of the underlying storage manager
    So cache partitioning may not be usable when the joining relations are very large
  Two-step cache: generate the cache-sized partitions in the join phase as a preprocessing step
Not robust with multiple activities going on:
  Requires exclusive use of (part of) the cache
  Performance penalty due to cache conflicts
- 40 - Comparison with Cache Partitioning
(Figure: I/O partition phase, join phase, and overall performance)
Direct cache suffers from the larger number of partitions generated in the I/O partition phase
Two-step cache suffers from the additional partition step
Our schemes are the best (slightly better than direct cache)
- 41 - Robustness: Impact of Cache Interference
Performance degradation when the cache is periodically flushed (worst-case cache interference)
Direct cache and two-step cache degrade by 15-67% and 8-38%, respectively
Our prefetching schemes are very robust
'100' corresponds to the join phase execution time when there is no cache flush
- 42 - Group Prefetching vs. Software-Pipelined Prefetching
Hiding latency: software-pipelined prefetching is always able to hide all latencies (according to our analytical model)
Book-keeping overhead: software-pipelined prefetching has more overhead
Code complexity: group prefetching is easier to implement
  The natural group boundary provides a place to do the processing that was left over (e.g., for read-write conflicts)
  It is also a natural place to send outputs to the parent operator if a pipelined operator is needed
- 43 - Challenges in Applying Prefetching
Try to hide the latency within the processing of a single tuple (example: hash table probing)
Does not work:
  Dependencies essentially form a critical path
  Randomness makes address prediction almost impossible
(Figure: hash bucket headers, hash cell array, build partition)
- 44 - to - 46 - (Backup copies of the Naïve Prefetching, Group Prefetching, and Software-Pipelined Prefetching slides 11-13.)