
1  Improving Hash Join Performance Through Prefetching
   Shimin Chen, Phillip B. Gibbons, Todd C. Mowry, Anastassia Ailamaki
   Carnegie Mellon University / Intel Research Pittsburgh

2  Hash Join
  Simple hash join:
    Build a hash table on the smaller (build) relation
    Probe the hash table using the larger (probe) relation
  Random access patterns are inherent in hashing
  Excessive random I/Os if the build relation and hash table cannot fit in memory
  [Figure: the build relation feeding a hash table that is probed by the probe relation]

3  I/O Partitioning
  Avoid excessive random disk accesses
  Join pairs of build and probe partitions separately
  Sequential I/O patterns for relations and partitions
  Hash join is CPU-bound with reasonable I/O bandwidth
  [Figure: the build and probe relations split into corresponding partitions]

4  Hash Join Cache Performance
  Partition: divides a 1GB relation into 800 partitions
  Join: a 50MB build partition with a 100MB probe partition
  Detailed simulations based on a Compaq ES40 system
  Most of the execution time is wasted on data cache misses
    82% for partition, 73% for join
    Caused by random access patterns in memory

5  Employing Partitioning for Cache?
  Cache partitioning: generating cache-sized partitions
    Effective in main-memory databases [Shatdal et al. 94], [Boncz et al. 99], [Manegold et al. 00]
  Two limitations when used in commercial databases:
    1) Usually needs an additional in-memory partitioning pass
       The cache is much smaller than main memory
       50% worse than our techniques
    2) Sensitive to cache sharing by multiple activities
  [Figure: a build partition and its hash table in main memory, much larger than the CPU cache]

6  Our Approach: Cache Prefetching
  Modern processors support:
    Multiple cache misses serviced simultaneously
    Prefetch assembly instructions for exploiting this parallelism
  Overlap cache miss latency with computation
  Successfully applied to:
    Array-based programs [Mowry et al. 92]
    Pointer-based programs [Luk & Mowry 96]
    Database B+-trees [Chen et al. 01]
  [Figure: prefetch instructions (pref 0(r2), pref 4(r7), pref 0(r3), pref 8(r9)) pulling data from main memory through the L2/L3 and L1 caches toward the CPU]

7  Challenges for Cache Prefetching
  Difficult to obtain memory addresses early:
    Randomness of hashing prohibits address prediction
    Data dependencies within the processing of a tuple
    The naïve approach does not work
  Complexity of hash join code:
    Ambiguous pointer references
    Multiple code paths
    Compiler prefetching techniques cannot be applied

8  Our Solution
  Dependencies are rare across subsequent tuples
    Exploit inter-tuple parallelism
    Overlap the cache misses of one tuple with the computation and cache misses of other tuples
  We propose two prefetching techniques:
    Group prefetching
    Software-pipelined prefetching

9  Outline
  Overview
  Our Proposed Techniques
    Simplified Probing Algorithm
    Naïve Prefetching
    Group Prefetching
    Software-Pipelined Prefetching
    Dealing with Complexities
  Experimental Results
  Conclusions

10  Simplified Probing Algorithm

  foreach probe tuple {
    (0) compute bucket number;
    (1) visit header;
    (2) visit cell array;
    (3) visit matching build tuple;
  }

  [Figure: hash bucket headers point to hash cell arrays of (hash code, build tuple ptr) pairs, which point into the build partition]
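A minimal C sketch of this probe loop, under assumed data layouts (tuple_t, cell_t, bucket_t, hash_key, and emit_join_result are illustrative names, not the paper's actual code). Steps (1)-(3) each chase a pointer into a different cache line, which is where the misses on the previous slide come from.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct { uint32_t key; char payload[96]; } tuple_t;   /* e.g. 100-byte tuples */
    typedef struct { uint32_t hash; tuple_t *build; } cell_t;     /* hash cell: (hash code, build tuple ptr) */
    typedef struct { int ncells; cell_t *cells; } bucket_t;       /* hash bucket header */

    static uint32_t hash_key(uint32_t k) { return k * 2654435761u; }                       /* placeholder hash */
    static void emit_join_result(const tuple_t *b, const tuple_t *p) { (void)b; (void)p; } /* output elided */

    void probe_partition(bucket_t *buckets, uint32_t nbuckets, tuple_t *probe, int nprobe)
    {
        for (int i = 0; i < nprobe; i++) {
            uint32_t h = hash_key(probe[i].key);          /* (0) compute bucket number */
            bucket_t *b = &buckets[h % nbuckets];
            int n = b->ncells;                            /* (1) visit header: first likely cache miss */
            cell_t *cells = b->cells;
            for (int c = 0; c < n; c++) {
                if (cells[c].hash == h) {                 /* (2) visit cell array: second likely miss */
                    tuple_t *m = cells[c].build;
                    if (m->key == probe[i].key)           /* (3) visit matching build tuple: third miss */
                        emit_join_result(m, &probe[i]);
                }
            }
        }
    }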

11  Naïve Prefetching

  foreach probe tuple {
    (0) compute bucket number; prefetch header;
    (1) visit header; prefetch cell array;
    (2) visit cell array; prefetch matching build tuple;
    (3) visit matching build tuple;
  }

  Data dependencies make it difficult to obtain addresses early
  [Figure: timeline of stages 0-3 per tuple; each prefetch is issued right before its data is needed, so the cache miss latency remains exposed]
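The same loop with the slide's prefetches added, sketched with the GCC/Clang __builtin_prefetch intrinsic instead of raw pref assembly (types and helpers as in the sketch above). Because each prefetch address only becomes known an instant before the dependent load, the miss latency stays almost entirely exposed, which is the point of this slide.

    void probe_partition_naive(bucket_t *buckets, uint32_t nbuckets, tuple_t *probe, int nprobe)
    {
        for (int i = 0; i < nprobe; i++) {
            uint32_t h = hash_key(probe[i].key);          /* (0) compute bucket number */
            bucket_t *b = &buckets[h % nbuckets];
            __builtin_prefetch(b);                        /* prefetch header */
            int n = b->ncells;                            /* (1) visit header: still stalls */
            cell_t *cells = b->cells;
            __builtin_prefetch(cells);                    /* prefetch cell array */
            for (int c = 0; c < n; c++) {
                if (cells[c].hash == h) {                 /* (2) visit cell array: still stalls */
                    __builtin_prefetch(cells[c].build);   /* prefetch matching build tuple */
                    tuple_t *m = cells[c].build;          /* (3) visit build tuple: still stalls */
                    if (m->key == probe[i].key)
                        emit_join_result(m, &probe[i]);
                }
            }
        }
    }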

12  Group Prefetching

  foreach group of probe tuples {
    foreach tuple in group {
      (0) compute bucket number; prefetch header;
    }
    foreach tuple in group {
      (1) visit header; prefetch cell array;
    }
    foreach tuple in group {
      (2) visit cell array; prefetch build tuple;
    }
    foreach tuple in group {
      (3) visit matching build tuple;
    }
  }

  [Figure: within a group, the stages of different tuples overlap, hiding the miss latency]
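A sketch of group prefetching for the simplified probe, under simplifying assumptions: a fixed group size G, at most one hash-code match per probe tuple, and the final partial group handled by the plain loop. The paper's general algorithm keeps richer per-tuple state to deal with empty buckets and multiple matches (see the later slides). By the time a stage dereferences a pointer for one tuple, the prefetches for the other tuples in the group are already in flight, so their misses overlap.

    enum { G = 16 };                                      /* group size: a tunable parameter */

    void probe_partition_group(bucket_t *buckets, uint32_t nbuckets, tuple_t *probe, int nprobe)
    {
        uint32_t h[G]; bucket_t *b[G]; cell_t *cells[G]; int ncells[G]; tuple_t *match[G];
        int base;

        for (base = 0; base + G <= nprobe; base += G) {
            for (int k = 0; k < G; k++) {                 /* stage 0 */
                h[k] = hash_key(probe[base + k].key);     /* compute bucket number */
                b[k] = &buckets[h[k] % nbuckets];
                __builtin_prefetch(b[k]);                 /* prefetch header */
            }
            for (int k = 0; k < G; k++) {                 /* stage 1 */
                ncells[k] = b[k]->ncells;                 /* visit header */
                cells[k]  = b[k]->cells;
                __builtin_prefetch(cells[k]);             /* prefetch cell array */
            }
            for (int k = 0; k < G; k++) {                 /* stage 2 */
                match[k] = NULL;
                for (int c = 0; c < ncells[k]; c++)       /* visit cell array */
                    if (cells[k][c].hash == h[k]) { match[k] = cells[k][c].build; break; }
                if (match[k])
                    __builtin_prefetch(match[k]);         /* prefetch matching build tuple */
            }
            for (int k = 0; k < G; k++)                   /* stage 3: visit matching build tuple */
                if (match[k] && match[k]->key == probe[base + k].key)
                    emit_join_result(match[k], &probe[base + k]);
        }
        for (; base < nprobe; base++)                     /* leftover tuples: fall back to the plain loop */
            probe_partition(buckets, nbuckets, &probe[base], 1);
    }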

13  Software-Pipelined Prefetching

  Prologue;
  for j = 0 to N-4 do {
    tuple j+3: (0) compute bucket number; prefetch header;
    tuple j+2: (1) visit header; prefetch cell array;
    tuple j+1: (2) visit cell array; prefetch build tuple;
    tuple j:   (3) visit matching build tuple;
  }
  Epilogue;

  [Figure: the software pipeline; each iteration advances four tuples by one stage, with a prologue filling and an epilogue draining the pipeline]
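A sketch of software-pipelined prefetching with prefetching distance 1, reusing the types from the sketches above and the same single-match simplification. Per-tuple state lives in caller-supplied scratch arrays of length nprobe, and the slide's explicit prologue and epilogue are folded into the bounds checks on each stage for brevity; the paper generalizes j+1/j+2/j+3 to j+D/j+2D/j+3D for a prefetching distance D.

    /* h, b, match are caller-supplied scratch arrays, each of length nprobe */
    void probe_partition_pipelined(bucket_t *buckets, uint32_t nbuckets, tuple_t *probe, int nprobe,
                                   uint32_t *h, bucket_t **b, tuple_t **match)
    {
        for (int j = -3; j < nprobe; j++) {               /* j < 0 plays the prologue,            */
            int t0 = j + 3, t1 = j + 2, t2 = j + 1, t3 = j;   /* j near nprobe plays the epilogue */
            if (t0 < nprobe) {                            /* tuple t0, stage 0 */
                h[t0] = hash_key(probe[t0].key);          /* compute bucket number */
                b[t0] = &buckets[h[t0] % nbuckets];
                __builtin_prefetch(b[t0]);                /* prefetch header */
            }
            if (t1 >= 0 && t1 < nprobe)                   /* tuple t1, stage 1: visit header */
                __builtin_prefetch(b[t1]->cells);         /* prefetch cell array */
            if (t2 >= 0 && t2 < nprobe) {                 /* tuple t2, stage 2: visit cell array */
                match[t2] = NULL;
                for (int c = 0; c < b[t2]->ncells; c++)
                    if (b[t2]->cells[c].hash == h[t2]) { match[t2] = b[t2]->cells[c].build; break; }
                if (match[t2])
                    __builtin_prefetch(match[t2]);        /* prefetch matching build tuple */
            }
            if (t3 >= 0 &&                                /* tuple t3, stage 3: visit build tuple */
                match[t3] && match[t3]->key == probe[t3].key)
                emit_join_result(match[t3], &probe[t3]);
        }
    }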

14  Dealing with Multiple Code Paths
  Multiple code paths:
    There could be 0 or many matches
    Hash buckets could be empty or full
  Previous compiler techniques cannot handle this
  Keep state information for the tuples being processed:
    Record the state
    Test the state to decide: do nothing if state = B; execute D if state = C; execute G if state = F
  [Figure: control-flow graph A -> (B | C | F) -> (D | G) -> E, where a condition selects the branch]
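One way to picture the state-keeping idea, with illustrative names rather than the paper's actual code: a stage records which branch each tuple took, and the following stage simply dispatches on that record instead of relying on straight-line control flow.

    /* Illustrative per-tuple state recorded at one stage boundary. */
    enum path_state { ST_B, ST_C, ST_F };                 /* which of B/C/F this tuple went through */

    struct tuple_ctx {
        enum path_state st;
        /* ...plus whatever pointers the next stage will need... */
    };

    /* Earlier stage: record the state instead of branching ahead. */
    static void record_path(struct tuple_ctx *t, int cond1, int cond2)
    {
        t->st = cond1 ? ST_B : (cond2 ? ST_C : ST_F);
    }

    /* Later stage: test the state to decide what to run for this tuple. */
    static void resume_path(struct tuple_ctx *t)
    {
        switch (t->st) {
        case ST_B: /* do nothing           */ break;
        case ST_C: /* execute code block D */ break;
        case ST_F: /* execute code block G */ break;
        }
    }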

15  Dealing with Read-write Conflicts
  In hash table building:
    Use a busy flag in the bucket header to detect conflicts
    Postpone hashing the 2nd tuple until the 1st has been fully processed
  A compiler cannot perform this transformation
  [Figure: hash bucket headers pointing to build tuples]
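A sketch of how the busy flag could work during group-prefetched hash table building, under stated assumptions: the bucket header gains a busy field, the group size is at most 64, insert_tuple stands in for the remaining (prefetched) insertion stages, and conflicting tuples are deferred and inserted after the group completes. None of these names come from the paper.

    typedef struct { int ncells; cell_t *cells; int busy; } build_bucket_t;   /* header + busy flag */

    void insert_tuple(build_bucket_t *b, tuple_t *t);     /* ordinary insert; stands in for later stages */

    void build_one_group(build_bucket_t *buckets, uint32_t nbuckets,
                         tuple_t *build, int base, int gsize)      /* assumes gsize <= 64 */
    {
        build_bucket_t *slot[64];
        tuple_t        *deferred[64];
        int ndeferred = 0;

        for (int k = 0; k < gsize; k++) {                 /* stage 0: claim buckets, issue prefetches */
            build_bucket_t *b = &buckets[hash_key(build[base + k].key) % nbuckets];
            if (b->busy) {                                /* an earlier tuple in this group owns it */
                slot[k] = NULL;
                deferred[ndeferred++] = &build[base + k]; /* postpone the 2nd tuple */
                continue;
            }
            b->busy = 1;
            slot[k] = b;
            __builtin_prefetch(b);
        }
        for (int k = 0; k < gsize; k++)                   /* later stage: the misses now overlap */
            if (slot[k]) insert_tuple(slot[k], &build[base + k]);
        for (int k = 0; k < gsize; k++)                   /* group boundary: release the claims */
            if (slot[k]) slot[k]->busy = 0;
        for (int k = 0; k < ndeferred; k++)               /* finish the postponed tuples sequentially */
            insert_tuple(&buckets[hash_key(deferred[k]->key) % nbuckets], deferred[k]);
    }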

16  More Details In Paper
  General group prefetching algorithm
  General software-pipelined prefetching algorithm
  Analytical models
  Discussion of the important parameters: group size, prefetching distance
  Implementation details

17  Outline
  Overview
  Our Proposed Techniques
  Experimental Results
    Setup
    Performance of Our Techniques
    Comparison with Cache Partitioning
  Conclusions

18  Experimental Setup
  Relation schema: 4-byte join attribute + fixed-length payload
  No selection, no projection
  50MB of memory available for the join phase
  Detailed cycle-by-cycle simulations:
    1GHz superscalar processor
    Memory hierarchy based on the Compaq ES40

19  Joining a Pair of Build and Probe Partitions
  Our techniques achieve 2.1-2.9X speedups over the original hash join
  [Figure: a 50MB build partition joins a 100MB probe partition, 1:2 matching; the number of tuples decreases as tuple size increases]

20  Varying Memory Latency
  150 cycles: the default parameter
  1000 cycles: the memory latency expected in the future
  Our techniques achieve 9X speedups over the baseline at 1000 cycles
  The absolute performance of our techniques at 1000 cycles stays very close to that at 150 cycles
  [Figure: execution time (M cycles) vs. processor-to-memory latency; a 50MB build partition joins a 100MB probe partition, 1:2 matching, 100B tuples]

21  Comparison with Cache Partitioning
  Cache partitioning: generating cache-sized partitions [Shatdal et al. 94], [Boncz et al. 99], [Manegold et al. 00]
  Requires an additional in-memory partition step after I/O partitioning
  At least 50% worse than our prefetching schemes
  [Figure: partitioning + join time; a 200MB build relation joins a 400MB probe relation, 1:2 matching]

22  Robustness: Impact of Cache Interference
  Cache partitioning relies on exclusive use of the cache
  Periodically flushing the cache models the worst-case interference
  Execution times are self-normalized to the no-flush case
  Cache partitioning degrades by 8-38%
  Our prefetching schemes are very robust

23  Conclusions
  Exploited inter-tuple parallelism
  Proposed group prefetching and software-pipelined prefetching
    Prior prefetching techniques cannot handle the code complexity
  Our techniques achieve dramatically better performance:
    2.1-2.9X speedups for the join phase
    1.4-2.6X speedups for the partition phase
    9X speedups at the 1000-cycle memory latency expected in the future
      Absolute performance remains close to that at 150 cycles
    Robust against cache interference, unlike cache partitioning
  Our prefetching techniques are effective for hash joins

24  Thank you!

25  Backup Slides

26  Is Hash Join CPU-bound?
  Quad-processor Pentium III, four disks
  A 1.5GB build relation, a 3GB probe relation
  Main thread: GRACE hash join
  Background I/O thread per disk: I/O prefetching and writing
  Hash join is CPU-bound with reasonable I/O bandwidth
  Still large room for CPU performance improvement
  [Figure: partition-phase and join-phase times. Setup: 550MHz CPUs, 512MB RAM, Seagate Cheetah X15 36LP SCSI disks (max transfer rate 68MB/sec), Linux 2.4.18; 100B tuples, 4B keys, 1:2 matching; striping unit = 256KB; 10 measurements, std < 10% of mean or std < 1s]

27  Hiding Latency within a Group
  Cache miss latency is hidden across the multiple tuples within a group
  The group size can be increased to hide most of the cache miss latency for hash joins
  Generic algorithm and analytical model (please see the paper)
  There are gaps between groups
  [Figure: stage timelines for group size 3 vs. group size 5]

28  Prefetching Distance
  Prefetching distance (D): the number of iterations between two subsequent code stages for a single tuple
  Increase the prefetching distance to hide all cache miss latency
  Generic algorithm and analytical model (please see the paper)
  [Figure: stage timelines for D=1 vs. D=2]

29  Group Prefetching: Multiple Code Paths
  We keep state information for the tuples in a group
  The recorded state decides which code path to take in a later stage
  [Figure: the control-flow graph of slide 14 annotated with the state recorded on each branch (st=1, st=2, st=3) and the tests that later select D, G, or E]

30  Prefetching Distance = D

  Prologue;
  for j = 0 to N-3D-1 do {
    tuple j+3D: compute hash bucket number; prefetch the target bucket header;
    tuple j+2D: visit the hash bucket header; prefetch the hash cell array;
    tuple j+D:  visit the hash cell array; prefetch the matching build tuple;
    tuple j:    visit the matching build tuple to compare keys and produce the output tuple;
  }
  Epilogue;

31  Experiment Setup
  We have implemented our own hash join engine:
    Relations are stored in files with a slotted page structure
    A simple XOR- and shift-based hash function is used
    GRACE hash join (baseline), simple prefetching, group prefetching, software-pipelined prefetching
  Experiment design:
    Same schema for the build and probe relations: 4-byte key + fixed-length payload
    No selection and no projection
    50MB of memory available for the joins

32  Simulation Platform
  Detailed cycle-by-cycle simulations:
    Out-of-order processor pipeline
    Integer multiply and divide latencies based on the Pentium 4
    Memory hierarchy based on the Compaq ES40, with memory system parameters set for the near future
  Better prefetching support:
    Supports TLB prefetching

33  Simulation Parameters

  Processor pipeline parameters:
    Clock rate: 1 GHz
    Issue width: 4 insts/cycle
    Functional units: 2 Int, 2 FP, 1 Int Divide, 2 Mem, 1 Branch
    Reorder buffer size: 128 insts
    Integer multiply/divide: 15/56 cycles
    All other integer: 1 cycle
    Branch prediction scheme: gshare

  Memory parameters:
    Line size: 64 bytes
    Primary inst cache: 64 KB, 2-way set-assoc.
    Primary data cache: 64 KB, 4-way set-assoc.
    Miss handlers: 32 for data, 2 for inst
    DTLB: 64 entries, fully-assoc.
    DTLB miss handlers: 1
    Page size: 8 KB
    Unified secondary cache: 1 MB, 4-way set-assoc.
    Primary-to-secondary miss latency: 15 cycles (plus contention)
    DTLB miss latency: 20 cycles
    Primary-to-memory miss latency: 150 cycles (plus contention)
    Main memory bandwidth: 1 access per 10 cycles

34  Simulator vs. Real Machine
  Better prefetching support:
    Prefetches are never ignored; our prefetches are not hints!
    TLB prefetching: when a prefetch incurs a TLB miss, the TLB load is performed
  More miss handlers: 32 for data

35  Varying Group Size and Prefetching Distance
  Too small: latencies are not fully hidden
  Too large: many prefetched cache lines are replaced by other memory references
  Similar performance even when the latency increases to 1000 cycles!
  [Figure: hash table probing time while varying these parameters for the 20B-tuple case of the previous figure]

36  Breakdowns of Cache Misses to Understand the Tuning Curves

37  Cache Performance Breakdowns
  Our schemes indeed hide most of the data cache miss latency
  Overheads lead to larger portions of busy time
  [Figure: execution time breakdowns for join (100B tuples) and partition (800 partitions)]

38  Partition Phase Performance
  When the number of partitions is small, use simple prefetching
  When the number of partitions is large, use group or software-pipelined prefetching
  Combined: 1.9-2.6X speedups over the baseline

39  Cache Partitioning Schemes
  Two ways to employ cache partitioning in a GRACE join:
    Direct cache: generate the cache partitions in the I/O partition phase
    Two-step cache: generate the cache partitions in the join phase as a preprocessing step
  Direct cache requires generating a larger number of smaller I/O partitions:
    Bounded by the available memory
    Bounded by the requirements of the underlying storage manager
    So cache partitioning may not be usable when the joining relations are very large
  Not robust with multiple activities going on:
    Requires exclusive use of (part of) the cache
    Performance penalty due to cache conflicts

40  Comparison with Cache Partitioning
  Direct cache suffers from the larger number of partitions generated in the I/O partition phase
  Two-step cache suffers from the additional partition step
  Our schemes are the best (slightly better than direct cache)
  [Figure: I/O partition phase, join phase, and overall performance]

41  Robustness: Impact of Cache Interference
  Performance degradation when the cache is periodically flushed (the worst cache interference)
  Direct cache and two-step cache degrade by 15-67% and 8-38%, respectively
  Our prefetching schemes are very robust
  [Figure: '100' corresponds to the join-phase execution time when there is no cache flush]

42  Group Prefetching vs. Software-Pipelined Prefetching
  Hiding latency:
    Software-pipelined prefetching is always able to hide all latencies (according to our analytical model)
  Book-keeping overhead:
    Software-pipelined prefetching has more overhead
  Code complexity:
    Group prefetching is easier to implement
    The natural group boundary provides a place to do any processing left over (e.g., for read-write conflicts)
    It is also a natural place to send outputs to the parent operator if a pipelined operator is needed

43  Challenges in Applying Prefetching
  Try to hide the latency within the processing of a single tuple
    Example: hash table probing
  This does not work:
    The dependencies essentially form a critical path
    Randomness makes address prediction almost impossible
  [Figure: hash bucket headers, hash cell array, and build partition]

44  Naïve Prefetching

  foreach probe tuple {
    compute bucket number; prefetch header;
    visit header; prefetch cell array;
    visit cell array; prefetch matching build tuple;
    visit matching build tuple;
  }

  Data dependencies make it difficult to obtain addresses early
  [Figure: timeline showing each stage's cache miss latency remaining exposed]

45  Group Prefetching

  foreach group of probe tuples {
    foreach tuple in group { compute bucket number; prefetch header; }
    foreach tuple in group { visit header; prefetch cell array; }
    foreach tuple in group { visit cell array; prefetch matching build tuple; }
    foreach tuple in group { visit matching build tuple; }
  }

  [Figure: timeline showing the stages of the tuples in a group overlapping]

46  Software-Pipelined Prefetching

  Prologue;
  for j = 0 to N-4 do {
    tuple j+3: compute bucket number; prefetch header;
    tuple j+2: visit header; prefetch cell array;
    tuple j+1: visit cell array; prefetch matching build tuple;
    tuple j:   visit matching build tuple;
  }
  Epilogue;

  [Figure: the software pipeline with its prologue and epilogue]

