@ Carnegie Mellon Databases

Inspector Joins
Shimin Chen, Phillip B. Gibbons, Todd C. Mowry, Anastassia Ailamaki
Carnegie Mellon University; Intel Research Pittsburgh

Exploiting Information about Data
- The ability to improve a query depends on information quality.
- General stats on relations are inadequate:
  - They may lead to incorrect decisions for specific queries.
  - This is especially true for join queries.
- Previous approaches exploiting dynamic information:
  - Collecting information from previous queries
  - Multi-query optimization [Sellis’88]
  - Materialized views [Blakeley et al. 86]
  - Join indices [Valduriez’87]
  - Dynamic re-optimization of query plans [Kabra&DeWitt’98] [Markl et al. 04]
- This study exploits the inner structure of hash joins.

Exploiting Multi-Pass Structure of Hash Joins
- Idea: examine the actual data in the I/O partitioning phase.
- Extract useful information to improve the join phase.
- The extra information greatly helps phase 2.
[Diagram: I/O partitioning phase (with inspection), followed by the join phase]

Using Extracted Information
- Enable a new join phase algorithm:
  - Reduces the primary performance bottleneck in hash joins, i.e. poor CPU cache performance
  - Optimized for multi-processor systems
- Choose the most suitable join phase algorithm for special input cases.
[Diagram: inspection during I/O partitioning produces extracted information, which is used to decide the join phase algorithm: simple hash join, cache partitioning, cache prefetching, or the new algorithm]

Outline
- Motivation
- Previous hash join algorithms
- Hash join performance on SMP systems
- Inspector join
- Experimental results
- Conclusions

GRACE Hash Join
- I/O partitioning phase: divide the input relations into partitions with a hash function.
- Join phase (simple hash join): build a hash table, then probe the hash table.
- Random memory accesses cause poor CPU cache performance.
- Over 70% of execution time is stalled on cache misses!
[Diagram: build and probe partitions; hash table]

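The two phases of the GRACE hash join can be illustrated with a minimal in-memory sketch. It assumes relations are Python lists of `(key, payload)` tuples and that Python's built-in `hash` stands in for the partitioning hash function; the name `grace_hash_join` and its parameters are hypothetical and not taken from the talk.

```python
from collections import defaultdict


def grace_hash_join(build, probe, num_partitions=4):
    """Join two lists of (key, payload) tuples on key (illustrative sketch)."""
    # I/O partitioning phase: split both relations with the same hash function,
    # so matching keys always land in the same partition pair.
    build_parts = [[] for _ in range(num_partitions)]
    probe_parts = [[] for _ in range(num_partitions)]
    for key, payload in build:
        build_parts[hash(key) % num_partitions].append((key, payload))
    for key, payload in probe:
        probe_parts[hash(key) % num_partitions].append((key, payload))

    # Join phase (simple hash join) on each pair of partitions.
    results = []
    for b_part, p_part in zip(build_parts, probe_parts):
        table = defaultdict(list)
        for key, payload in b_part:          # build the hash table
            table[key].append(payload)
        for key, payload in p_part:          # probe the hash table
            for b_payload in table.get(key, ()):
                results.append((key, b_payload, payload))
    return results
```

In a real system the partitions are written to disk and each one is sized to fit in memory; the hash table lookups in the join phase are the random memory accesses the slide refers to.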
Cache Partitioning [Shatdal et al. 94] [Boncz et al.’99] [Manegold et al.’00]
- Recursively produce cache-sized partitions after I/O partitioning.
- Avoid cache misses when joining cache-sized partitions.
- Drawback: overhead of re-partitioning.
[Diagram: memory-sized build and probe partitions re-partitioned into cache-sized partitions]

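A rough sketch of the recursive re-partitioning step follows. It assumes tuples are `(key, payload)` pairs, that cache capacity is expressed as a tuple count, and that each recursion level uses a different "digit" of the hash value; `cache_partition`, `cache_tuples`, `fanout`, and `max_level` are illustrative names, not the cited authors' code.

```python
def cache_partition(build, probe, cache_tuples, fanout=4, level=0, max_level=8):
    """Recursively split a (build, probe) partition pair until the build side
    holds at most cache_tuples tuples. Returns a list of (build, probe) pairs."""
    # Stop when the build side fits, or when heavy key duplication makes
    # further splitting pointless (max_level is an assumed safety bound).
    if len(build) <= cache_tuples or level == max_level:
        return [(build, probe)]

    build_parts = [[] for _ in range(fanout)]
    probe_parts = [[] for _ in range(fanout)]
    divisor = fanout ** level  # pick a different hash "digit" per level
    # Every tuple is copied once per level: this is the re-partitioning cost.
    for tup in build:
        build_parts[(hash(tup[0]) // divisor) % fanout].append(tup)
    for tup in probe:
        probe_parts[(hash(tup[0]) // divisor) % fanout].append(tup)

    pairs = []
    for b, p in zip(build_parts, probe_parts):
        pairs.extend(cache_partition(b, p, cache_tuples, fanout, level + 1, max_level))
    return pairs
```

Each resulting pair can then be joined with a simple hash join whose hash table stays cache-resident.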
Cache Prefetching [Chen et al. 04]
- Reduce the impact of cache misses:
  - Exploit available memory bandwidth.
  - Overlap cache misses and computations.
  - Insert cache prefetch instructions into code.
- Still incurs the same number of cache misses.
[Diagram: build and probe partitions; hash table]

Hash Joins on SMP Systems
- Previous studies mainly focus on uni-processors.
- Memory bandwidth is precious.
- Each processor joins a pair of partitions in the join phase.
[Diagram: four CPUs, each with its own cache, connected to main memory over a shared bus; CPU i joins Build i with Probe i, for i = 1..4]

Previous Algorithms on SMP Systems
- Join phase performance of joining a 500MB relation and a 2GB relation (details later in the talk)
- Aggregate performance degrades dramatically over 4 CPUs.
- Reduce data movement (memory to memory, memory to cache).
[Figures: wall-clock time; aggregate time on all CPUs. Annotations: re-partition cost, bandwidth sharing]

Inspector Joins
- Extracted information: a summary of matching relationships
  - Every K contiguous pages in a build partition form a sub-partition.
  - The summary tells which sub-partition(s) every probe tuple matches.
[Diagram: a build partition divided into sub-partitions 0, 1, and 2; a probe partition; the summary of matching relationships is produced during I/O partitioning and used in the join phase]

Cache-Stationary Join Phase
- Recall cache partitioning: re-partition cost (copying cost).
- We want to achieve zero copying.
[Diagram: build partition, probe partition, and hash table in the CPU cache]

Cache-Stationary Join Phase
- Joins a sub-partition and its matching probe tuples.
- A sub-partition is small enough to fit in the CPU cache.
- Cache prefetching handles the remaining cache misses.
- Zero copying for generating recursive cache-sized partitions.
[Diagram: build partition divided into sub-partitions 0, 1, and 2; probe partition; hash table in the CPU cache]

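The cache-stationary idea can be sketched as follows, under several assumptions: a sub-partition is modeled as `sub_size` consecutive build tuples rather than K pages, each probe tuple carries the list of sub-partition IDs it may match (the summary from the partitioning phase), and prefetching is omitted. The function name and data layout are hypothetical.

```python
from collections import defaultdict


def cache_stationary_join(build_part, probe_part, sub_size):
    """build_part: list of (key, payload).
    probe_part: list of (key, payload, sub_ids), where sub_ids lists the
    sub-partitions the probe tuple may match (false positives allowed)."""
    num_subs = (len(build_part) + sub_size - 1) // sub_size

    # Preprocessing: for each sub-partition, record the positions of the probe
    # tuples that may match it. Only positions are stored; no tuple is copied.
    probe_idx = [[] for _ in range(num_subs)]
    for pos, (_key, _payload, sub_ids) in enumerate(probe_part):
        for s in sub_ids:
            probe_idx[s].append(pos)

    results = []
    for s in range(num_subs):
        # Build a small hash table on one sub-partition; it is meant to stay
        # in the CPU cache while its matching probe tuples are processed.
        table = defaultdict(list)
        for i in range(s * sub_size, min((s + 1) * sub_size, len(build_part))):
            key, payload = build_part[i]
            table[key].append(payload)
        for pos in probe_idx[s]:
            key, payload, _ = probe_part[pos]
            for b_payload in table.get(key, ()):
                results.append((key, b_payload, payload))
    return results
```

A false positive in the summary only costs one extra lookup that finds nothing, so the result stays correct.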
Filters in I/O Partitioning
- How to extract the summary efficiently?
- Extend the filter scheme used in commercial hash joins.
- Conventional single-filter scheme:
  - Represent all build join keys.
  - Filter out probe tuples having no matches.
[Diagram: the filter is constructed while partitioning the build relation into memory-sized partitions, and tested while partitioning the probe relation]

Background: Bloom Filter
- A bit vector
- A key is hashed d times (e.g. d=3) and represented by d bits.
- Construct: for every build join key, set its 3 bits in the vector.
- Test: given a probe join key, check whether all its 3 bits are 1.
  - Discard the tuple if some bits are 0.
  - May have false positives.
[Diagram: filter bit vector with Bit0 = H0(key), Bit1 = H1(key), Bit2 = H2(key)]

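The construct and test operations can be written as a small sketch. The way the d hash functions are derived here (salting a BLAKE2b digest with the function index) is an assumption made for illustration; the talk does not say which hash functions are used, and the class and method names are hypothetical.

```python
import hashlib


class BloomFilter:
    """A bit vector in which every key is represented by d bits."""

    def __init__(self, num_bits, d=3):
        self.num_bits = num_bits
        self.d = d
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # H_0 .. H_{d-1}: assumed here to be salted BLAKE2b digests.
        for i in range(self.d):
            digest = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, key):
        # Construct: set the key's d bits in the vector.
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def might_contain(self, key):
        # Test: every one of the key's d bits must be 1.
        # True may be a false positive; False is always correct.
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(key))
```

During I/O partitioning, `add` would be called for every build join key, and a probe tuple would be discarded when `might_contain` returns False.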
Multi-Filter Scheme
- Single filter: relates a probe tuple to the entire build relation.
- Our goal: relate a probe tuple to sub-partitions.
- Construct a filter for every sub-partition.
- Replace a single large filter with multiple small filters.
[Diagram: a single filter over the build relation versus a multi-filter with one filter per sub-partition (Sub0,0 to Sub0,2 for partition 0, Sub1,0 to Sub1,2 for partition 1, Sub2,0 to Sub2,2 for partition 2)]

Testing Multi-Filters
- When partitioning the probe relation:
  - Test a probe tuple against all the filters of a partition.
  - This tells which sub-partition(s) the tuple may have matches in.
  - Store the summary of matching relationships in the partitions.
[Diagram: probe relation tested against the multi-filter while being split into partitions 0, 1, and 2]

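A minimal sketch of building and testing the multi-filter for one partition is shown below. It assumes each small filter is a Python integer used as a bit vector, that a sub-partition is `sub_size` consecutive build keys, and that bit positions come from salted BLAKE2b digests; all names are illustrative.

```python
import hashlib

D = 3  # bits per key, as in the d=3 example


def bit_positions(key, num_bits):
    """The D bit positions of a key (hash choice is an assumption)."""
    return [
        int.from_bytes(
            hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest(), "little"
        ) % num_bits
        for i in range(D)
    ]


def build_multi_filter(build_keys, sub_size, num_bits):
    """One small filter per sub-partition of a build partition."""
    num_subs = (len(build_keys) + sub_size - 1) // sub_size
    filters = [0] * num_subs
    for idx, key in enumerate(build_keys):
        s = idx // sub_size  # sub-partition holding this build tuple
        for pos in bit_positions(key, num_bits):
            filters[s] |= 1 << pos
    return filters


def matching_sub_partitions(key, filters, num_bits):
    """Sub-partitions in which a probe key may have matches."""
    positions = bit_positions(key, num_bits)
    return [s for s, f in enumerate(filters)
            if all((f >> pos) & 1 for pos in positions)]
```

The list returned by `matching_sub_partitions` is the per-tuple summary that would be stored alongside the probe tuple in its partition; an empty list means the tuple can be discarded.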
Minimizing Cache Misses for Testing Filters
- Single-filter scheme:
  - Compute 3 bit positions.
  - Test 3 bits.
- Multi-filter scheme, if there are S sub-partitions in a partition:
  - Compute 3 bit positions.
  - Test the same 3 bits for every filter, altogether 3*S bits.
  - May cause 3*S cache misses!
[Diagram: a probe tuple tested against the S filters of a partition]

Vertical Filters for Testing
- Bits at the same position are contiguous in memory.
- 3 cache misses instead of 3*S cache misses!
- Horizontal-to-vertical conversion after partitioning the build relation
  - Very small overhead in practice
[Diagram: the S filters of a partition stored so that the bits at each position are contiguous in memory]

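The layout change can be sketched as a bit-matrix transpose. This sketch assumes the S horizontal filters are Python integers of `num_bits` bits each, and that the vertical layout stores one S-bit word per bit position; `to_vertical` and `probe_vertical` are hypothetical names.

```python
def to_vertical(filters, num_bits):
    """Horizontal-to-vertical conversion: vertical[pos] has bit s set
    exactly when filter s has bit pos set."""
    vertical = [0] * num_bits
    for s, f in enumerate(filters):
        for pos in range(num_bits):
            if (f >> pos) & 1:
                vertical[pos] |= 1 << s
    return vertical


def probe_vertical(positions, vertical, num_subs):
    """Test one probe key against all S filters at once.
    positions are the key's bit positions; one word is read per position,
    so 3 positions mean 3 reads instead of 3*S.
    Returns a bitmask whose bit s is set if sub-partition s may match."""
    mask = (1 << num_subs) - 1
    for pos in positions:
        mask &= vertical[pos]
    return mask
```

With filters `[0b0110, 0b0011, 0b1010]` and positions `[1, 2]`, only filter 0 has both bits set, so the returned mask is `0b001`. When S bits fit within one cache line, each word read touches a single line, which is where the reduction from 3*S misses to 3 comes from.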
More Details in Paper
- Moderate memory space requirement for filters
- Summary information representation in intermediate partitions
- Preprocessing for the cache-stationary join phase
- Prefetching for improving efficiency and robustness

Experimental Setup
- Relation schema: 4-byte join attribute + fixed-length payload
- No selection, no projection
- 50MB of memory per CPU available for the join phase
- The same join algorithm runs on every CPU, joining different partitions.
- Detailed cycle-by-cycle simulations:
  - A shared-bus SMP system with 1.5GHz processors
  - Memory hierarchy based on the Itanium 2 processor

Partition Phase Wall-Clock Time
- I/O partitioning can take advantage of multiple CPUs:
  - Cut the input relations into equal-sized chunks.
  - Partition one chunk on every CPU.
  - Concatenate the outputs from all CPUs.
- Enhanced cache partitioning: cache partitioning + advanced prefetching
- Inspection incurs very small overhead.
- Workload: 500MB joins 2GB; 100B tuples, 4B keys; 50% of probe tuples have no matches; a build tuple matches 2 probe tuples.
[Figure: partition phase wall-clock time versus number of CPUs used; series: GRACE, cache prefetching, cache partitioning, enhanced cache partitioning, inspector join]

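The chunk, partition, and concatenate steps can be sketched as below. This is only an illustration of the data flow: it assumes `(key, payload)` tuples and uses a thread pool as a stand-in for the CPUs, which in CPython does not give real CPU parallelism for this kind of work. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_chunk(chunk, num_partitions):
    """Partition one chunk of (key, payload) tuples by hash of the key."""
    parts = [[] for _ in range(num_partitions)]
    for key, payload in chunk:
        parts[hash(key) % num_partitions].append((key, payload))
    return parts


def parallel_partition(relation, num_partitions, num_cpus):
    # Cut the input relation into (nearly) equal-sized chunks, one per CPU.
    chunk_len = max(1, (len(relation) + num_cpus - 1) // num_cpus)
    chunks = [relation[i:i + chunk_len] for i in range(0, len(relation), chunk_len)]

    # Partition one chunk on every worker.
    with ThreadPoolExecutor(max_workers=num_cpus) as pool:
        per_cpu = list(pool.map(lambda c: partition_chunk(c, num_partitions), chunks))

    # Concatenate the outputs from all workers, partition by partition.
    return [[t for parts in per_cpu for t in parts[p]]
            for p in range(num_partitions)]
```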
Join Phase Aggregate Time
- Inspector join achieves significantly better performance when 8 or more CPUs are used:
  - X speedups over cache prefetching
  - X speedups over enhanced cache partitioning
- Workload: 500MB joins 2GB; 100B tuples, 4B keys; 50% of probe tuples have no matches; a build tuple matches 2 probe tuples.
[Figure: join phase aggregate time versus number of CPUs used; series: GRACE, cache prefetching, cache partitioning, enhanced cache partitioning, inspector join]

Results on Choosing Suitable Join Phase
- Case #1: a large number of duplicate build join keys
  - Choose enhanced cache partitioning when a probe tuple on average matches 4 or more sub-partitions.
- Case #2: nearly sorted input relations
  - Surprisingly, cache-stationary join is very good.
[Diagram: inspection during I/O partitioning produces extracted info, which is used to decide the join phase algorithm: simple hash join, cache partitioning, cache prefetching, or cache-stationary join]

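The rule for case #1 can be expressed as a small decision function. The sketch assumes the summary is available as one list of sub-partition IDs per probe tuple and that the average is taken over all probe tuples in the summary; only the threshold of 4 comes from the slide, and the function name and return strings are hypothetical.

```python
def choose_join_phase(sub_ids_per_probe_tuple, threshold=4.0):
    """Pick a join phase algorithm from the extracted summary.
    sub_ids_per_probe_tuple: for each probe tuple, the list of sub-partitions
    it may match."""
    counts = [len(sub_ids) for sub_ids in sub_ids_per_probe_tuple]
    # Assumption: average over every probe tuple present in the summary.
    avg_matches = sum(counts) / len(counts) if counts else 0.0
    if avg_matches >= threshold:
        # Many duplicate build keys: each probe tuple would be visited once
        # per matching sub-partition, so cache-stationary join works poorly.
        return "enhanced cache partitioning"
    return "cache-stationary join"
```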
Conclusions
- Exploit the multi-pass structure for higher-quality information about data.
- Achieve significantly better cache performance:
  - 1.6X speedups over previous cache-friendly algorithms
  - When 8 or more CPUs are used
- Choose the most suitable algorithms for special input cases.
- The idea may be applicable to other multi-pass algorithms.

Thank You!

Partition Phase Wall-Clock Time
- I/O partitioning can take advantage of multiple CPUs:
  - Cut the input relations into equal-sized chunks.
  - Partition one chunk on every CPU.
  - Concatenate the outputs from all CPUs.
- Inspection incurs very small overhead.
- Workload: 500MB joins 2GB; 100B tuples, 4B keys; 50% of probe tuples have no matches; a build tuple matches 2 probe tuples.
[Figure: partition phase wall-clock time versus number of CPUs used; series: GRACE, cache prefetching, cache partitioning, inspector join]

Join Phase Aggregate Time
- Inspector join achieves significantly better performance when 8 or more CPUs are used:
  - X speedups over cache prefetching
  - X speedups over enhanced cache partitioning
- Workload: 500MB joins 2GB; 100B tuples, 4B keys; 50% of probe tuples have no matches; a build tuple matches 2 probe tuples.
[Figure: join phase aggregate time versus number of CPUs used; series: GRACE, cache prefetching, cache partitioning, inspector join]

CPU-Cache-Friendly Hash Joins
- Recent studies focus on CPU cache performance:
  - I/O partitioning gives good I/O performance.
  - Random memory accesses cause poor CPU cache performance.
- Cache Partitioning [Shatdal et al. 94] [Boncz et al.’99] [Manegold et al.’00]
  - Recursively produce cache-sized partitions from memory-sized partitions.
  - Avoid cache misses during the join phase.
  - Pay the re-partitioning cost.
- Cache Prefetching [Chen et al. 04]
  - Exploit memory system parallelism.
  - Use prefetches to overlap multiple cache misses and computations.
[Diagram: build and probe partitions; hash table]

Example Special Input Cases
- Example case #1: a large number of duplicate build join keys
  - Count the average number of sub-partitions a probe tuple matches.
  - The tuple must be checked against all possible sub-partitions.
  - If the count is too large, cache-stationary join works poorly.
- Example case #2: nearly sorted input relations
  - A merge-based join phase might be better?
[Diagram: a probe tuple in the probe partition matching sub-partitions 0, 1, and 2 of the build partition]

Varying Number of Duplicates per Build Join Key
- Choose enhanced cache partitioning when a probe tuple on average matches 4 or more sub-partitions.
[Figure: join phase aggregate performance]

Nearly Sorted Cases
- Sort both input relations, then randomly move 0%-5% of tuples.
- Surprisingly, cache-stationary join is very good:
  - Even better than merge join when over 1% of tuples are out of order.
[Figure: join phase aggregate performance]

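One way to generate such an input is sketched here. The slide only says that 0%-5% of tuples are randomly moved after sorting; the exact move procedure below (remove a random tuple and reinsert it at a random position) and the names `nearly_sorted`, `fraction`, and `seed` are assumptions for illustration.

```python
import random


def nearly_sorted(tuples, fraction, seed=0):
    """Sort the input, then move roughly `fraction` of the tuples
    (e.g. 0.0 to 0.05) to random positions."""
    rng = random.Random(seed)  # seeded so the workload is reproducible
    data = sorted(tuples)
    num_moves = int(len(data) * fraction)
    for _ in range(num_moves):
        # Take one tuple out and reinsert it somewhere else at random.
        tup = data.pop(rng.randrange(len(data)))
        data.insert(rng.randrange(len(data) + 1), tup)
    return data
```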
Analyzing Nearly Sorted Case
- Partitions are also nearly sorted.
- Probe tuples matching a sub-partition are almost contiguous.
- Memory behavior is similar to that of merge join.
- No cost for sorting out-of-order tuples.
[Diagram: nearly sorted build partition (sub-partitions 0, 1, and 2) and probe partition; a probe tuple]