Improving Database Performance on Simultaneous Multithreading Processors. Jingren Zhou (Microsoft Research), John Cieslewicz (Columbia University), Kenneth A. Ross (Columbia University), Mihir Shah (Columbia University).
Simultaneous Multithreading (SMT). Available on modern CPUs: “Hyperthreading” on the Pentium 4 and Xeon, IBM POWER5, Sun UltraSparc IV. Challenge: design software to efficiently utilize SMT. This talk: database software on an Intel Pentium 4 with Hyperthreading.
Superscalar Processor (no SMT). [Diagram: a single instruction stream flowing through a superscalar pipeline (up to 2 instructions/cycle) over time.] Improved instruction-level parallelism. CPI = 3/4.
SMT Processor. [Diagram: two instruction streams interleaved in the pipeline over time.] Improved thread-level parallelism: more opportunities to keep the processor busy. But sometimes SMT does not work so well. CPI = 5/8.
Stalls. [Diagram: two instruction streams over time; one thread stalls while the other keeps executing.] CPI = 3/4: progress despite the stalled thread. Stalls are due to cache misses (~276 cycles for an L2 miss), branch mispredictions (20-30 cycles), etc.
Memory Consistency. [Diagram: two instruction streams over time accessing the same cache line.] Detecting a conflicting access to a common cache line flushes the pipeline and synchronizes the cache with RAM: a “MOMC Event” on the Pentium 4 (~300+ cycles).
SMT Processor. Exposes multiple “logical” CPUs (one per instruction stream) on one physical CPU (~5% extra silicon to duplicate thread state information). Better than single threading: increased thread-level parallelism; improved processor utilization when one thread blocks. Not as good as two physical CPUs: CPU resources are shared, not replicated.
SMT Challenges. Resource competition: shared execution units, shared cache. Thread coordination: locking, etc. has high overhead. False sharing: MOMC events.
Approaches to using SMT: (1) Ignore it, and write single-threaded code. (2) Naïve parallelism: pretend the logical CPUs are physical CPUs. (3) SMT-aware parallelism: parallel threads designed to avoid SMT-related interference. (4) Use one thread for the algorithm, and another to manage resources, e.g., to avoid stalls for cache misses.
Naïve Parallelism. Treat the SMT processor as if it were multi-core. Databases are already designed to utilize multiple processors, so no code modification is needed. But it uses shared processor resources inefficiently: cache pollution/interference, competition for execution units.
SMT-Aware Parallelism. Exploit intra-operator parallelism: divide the input and use a separate thread to process each part, e.g., one thread for even tuples, one for odd tuples. No explicit partitioning step is required. Sharing the input involves multiple readers only, so there are no MOMC events, because two reads don’t conflict.
SMT-Aware Parallelism (cont.) Sharing output is challenging: it requires thread coordination for the output, and read/write and write/write conflicts on common cache lines cause MOMC events. “Solution:” partition the output. Each thread writes to a separate memory buffer to avoid memory conflicts; this needs an extra merge step in the consumer of the output stream, and it is difficult to maintain input order in the output. A sketch follows.
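A minimal sketch of this idea, not the authors' implementation: the Tuple type, the qualifies predicate, and the function names are hypothetical. One thread handles even-indexed tuples and the other odd-indexed ones, each appending to its own output buffer that the caller merges afterwards.

```cpp
// Sketch of SMT-aware intra-operator parallelism (hypothetical types/names).
// Two threads scan the same shared input without an explicit partitioning pass:
// the input is only read, so there are no MOMC conflicts. Each thread writes to
// its own output buffer, and the consumer merges the two buffers afterwards.
#include <cstddef>
#include <thread>
#include <vector>

struct Tuple { int key; int payload; };

static bool qualifies(const Tuple& t) { return t.key % 10 == 0; }  // placeholder predicate

// Each thread starts at a different offset and strides by 2 over the shared input.
static void scan_partition(const std::vector<Tuple>& input, std::size_t start,
                           std::vector<Tuple>& local_out) {
    for (std::size_t i = start; i < input.size(); i += 2) {
        if (qualifies(input[i])) local_out.push_back(input[i]);
    }
}

std::vector<Tuple> smt_aware_scan(const std::vector<Tuple>& input) {
    std::vector<Tuple> out_even, out_odd;            // separate buffers: no write/write conflicts
    std::thread t0(scan_partition, std::cref(input), 0, std::ref(out_even));
    std::thread t1(scan_partition, std::cref(input), 1, std::ref(out_odd));
    t0.join();
    t1.join();
    // Extra merge step in the consumer; note that the original input order is lost.
    std::vector<Tuple> merged(out_even);
    merged.insert(merged.end(), out_odd.begin(), out_odd.end());
    return merged;
}
```

A real operator would presumably also keep the two output buffers on separate cache lines so the writers never touch a common line.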
Managing Resources for SMT. Cache misses are a well-known performance bottleneck for modern database systems: mainly L2 data cache misses, but also L1 instruction cache misses [Ailamaki et al 98]. Goal: use a “helper” thread to avoid cache misses in the “main” thread by loading future memory references into the cache (an explicit load, not a prefetch).
Data Dependency. Memory references that depend upon a previous memory access exhibit a data dependency. E.g., a hash table lookup: hash bucket, then overflow cells, then the tuple.
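A sketch of such a lookup under an assumed layout (the HashBucket, OverflowCell, and Tuple definitions are illustrative, not from the paper): each load's address comes from the previous load, so the misses cannot be overlapped.

```cpp
// Hypothetical hash-table layout showing the dependency chain: the bucket must be
// loaded before the overflow cell's address is known, and the cell before the
// tuple's address. Each dependent load can be an L2 miss.
#include <cstddef>

struct Tuple        { int key; /* ... payload ... */ };
struct OverflowCell { std::size_t hash_val; Tuple* tuple; OverflowCell* next; };
struct HashBucket   { OverflowCell* first; };

Tuple* probe(HashBucket* buckets, std::size_t num_buckets, int key, std::size_t hash) {
    HashBucket* bucket = &buckets[hash % num_buckets];                   // load: hash bucket
    for (OverflowCell* cell = bucket->first; cell; cell = cell->next) {  // loads: overflow cells
        if (cell->hash_val == hash && cell->tuple->key == key)           // load: the tuple itself
            return cell->tuple;
    }
    return nullptr;
}
```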
Data Dependency (cont.) Data dependencies make instruction-level parallelism harder. Modern architectures provide prefetch instructions: a non-blocking request that data be brought into the cache. Pitfalls: prefetch instructions are frequently dropped; they are difficult to tune; and too much prefetching can pollute the cache.
Staging Computation. [Diagram: the probe path through hash bucket A, overflow cells B and C, and the tuple.] 1. Preload A. 2. (other work) 3. Process A. 4. Preload B. 5. (other work) 6. Process B. 7. Preload C. 8. (other work) 9. Process C. 10. Preload Tuple. 11. (other work) 12. Process Tuple. (Assumes each element is a cache line.)
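A sketch of how the staged pattern might look over a batch of probes, reusing the illustrative layout above. The x86 _mm_prefetch intrinsic issues the non-blocking preloads; the group size is an assumption, and only the bucket access is staged here (a full implementation would also stage the cell and tuple loads).

```cpp
// Sketch of staged computation ("preload, do other work, process") over a batch
// of probes. The "other work" that hides each miss is simply the prefetching and
// processing of the other probes in the group. Prefetches are hints and may be
// dropped by the hardware.
#include <cstddef>
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0

void probe_batch(HashBucket* buckets, std::size_t num_buckets,
                 const int* keys, const std::size_t* hashes, std::size_t n,
                 Tuple** results) {
    constexpr std::size_t GROUP = 16;                              // illustrative group size
    for (std::size_t g = 0; g < n; g += GROUP) {
        const std::size_t end = (g + GROUP < n) ? g + GROUP : n;
        // Stage 1: "preload A" for every probe in the group (non-blocking hints).
        for (std::size_t i = g; i < end; ++i)
            _mm_prefetch(reinterpret_cast<const char*>(&buckets[hashes[i] % num_buckets]),
                         _MM_HINT_T0);
        // Stage 2: "process A" (and the rest of each probe); the bucket is now
        // likely cache resident, hiding part of the miss latency.
        for (std::size_t i = g; i < end; ++i)
            results[i] = probe(buckets, num_buckets, keys[i], hashes[i]);
    }
}
```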
Staging Computation (cont.) By overlapping memory latency with other work, some cache miss latency can be hidden; many probes are “in flight” at the same time. Algorithms need to be rewritten, e.g., Chen et al. [2004], Harizopoulos et al. [2004].
Work-Ahead Set: Main Thread. Writes a memory address + computation state to the work-ahead set and retrieves a previous address + state. The hope is that the helper thread can preload the data before retrieval by the main thread. The algorithm is correct whether or not the helper thread succeeds at preloading data, because the helper thread is read-only. A sketch follows the figure below.
[Figure: the work-ahead set is a fixed-size array of (state, address) entries; the main thread cycles through the slots, replacing older entries (e.g., A-F posted in state 1) with newer ones (e.g., H-L posted in state 2).]
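A minimal sketch of the data structure and the main thread's side of it, consistent with the slides and the figure; the field names, the context pointer, and post_and_take are assumptions for illustration, not the paper's code (the 128-entry size is the one used in the experiments).

```cpp
// Hypothetical work-ahead set: a small circular array of (state, address) entries
// shared between the two threads. The main thread swaps its current item in,
// takes out the item it deposited WAS_SIZE calls earlier, and resumes computation
// on that older item, whose data the helper thread has hopefully already loaded.
// Correctness never depends on the helper thread: the entry is complete either way.
#include <cstddef>

constexpr std::size_t WAS_SIZE = 128;         // the size used in the experiments

struct WorkAheadEntry {
    int         state;                        // which stage of the probe this item is in
    const void* address;                      // memory the next stage will touch
    void*       context;                      // per-item computation state (illustrative)
};

WorkAheadEntry work_ahead_set[WAS_SIZE];      // written by the main thread, read by the helper

// Main thread: publish the next future reference and retrieve the oldest one.
WorkAheadEntry post_and_take(std::size_t& pos, WorkAheadEntry next) {
    WorkAheadEntry previous = work_ahead_set[pos];   // retrieve a previous address + state
    work_ahead_set[pos] = next;                      // write the new address + state
    pos = (pos + 1) % WAS_SIZE;                      // main thread iterates forward
    return previous;                                 // process this (hopefully cached) item
}
```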
Work-Ahead Set: Helper Thread. Reads memory addresses from the work-ahead set and loads their contents, so the data becomes cache resident. It tries to preload the data before the main thread cycles around; if successful, the main thread experiences cache hits.
[Figure: the helper thread walks the work-ahead set entries and touches each posted address with an explicit load, e.g., “temp += *slot[i]”.]
Iterate Backwards! The helper thread traverses the work-ahead set in reverse: i = (i - 1) mod size. Why? See paper.
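A sketch of the helper loop consistent with the slides: it only reads the work-ahead set, uses an explicit load (the slide's temp += *slot[i]) rather than a prefetch so the load cannot be dropped, and iterates backwards. The done flag and variable names are assumptions.

```cpp
// Helper thread: read-only with respect to the work-ahead set. It dereferences the
// address in each entry so the data is actually pulled into the shared cache, and
// iterates backwards over the slots (i = (i - 1) mod size; see the paper for why).
#include <atomic>
#include <cstddef>

extern WorkAheadEntry work_ahead_set[WAS_SIZE];  // defined alongside the main thread
extern std::atomic<bool> done;                   // assumed shutdown flag, set by the main thread

void helper_thread() {
    volatile long temp = 0;                      // sink so the explicit loads are not optimized away
    std::size_t i = 0;
    while (!done.load(std::memory_order_relaxed)) {
        const void* addr = work_ahead_set[i].address;
        if (addr != nullptr)
            temp += *static_cast<const volatile char*>(addr);   // "temp += *slot[i]"
        i = (i + WAS_SIZE - 1) % WAS_SIZE;       // iterate backwards over the work-ahead set
    }
    (void)temp;
}
```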
Helper Thread Speed. If the helper thread is faster than the main thread (more computation than memory latency): the helper thread should not preload twice (wasted CPU cycles); see the paper for how to stop redundant loads. If the helper thread is slower: no special tuning is necessary; the main thread will absorb some cache misses.
Work-Ahead Set Size. Too large: cache pollution; preloaded data evicts other preloaded data before it can be used. Too small: thread contention; many MOMC events because the work-ahead set spans few cache lines. Just right: experimentally determined, but use the smallest size within the acceptable range (performance plateaus) so that cache space is available for other purposes (for us, 128 entries). The data structure itself is much smaller than the L2 cache.
Experimental Workload. Two operators: probe phase of Hash Join; CSB+-Tree Index Join. Operators run in isolation and in parallel. Intel VTune used to measure hardware events.
Hardware: CPU: Pentium 4 | Memory: 2 GB DDR | L1, L2 size: 8 KB, 512 KB | L1, L2 cache-line size: 64 B, 128 B | L1 miss latency: 18 cycles | L2 miss latency: 276 cycles | MOMC latency: ~300+ cycles.
Experimental Outline 1. Hash join 2. Index lookup 3. Mixed: Hash join and index lookup
Hash Join Comparative Performance
Hash Join L2 Cache Misses Per Tuple
CSB+-Tree Index Join Comparative Performance
CSB+-Tree Index Join L2 Cache Misses Per Tuple
Parallel Operator Performance [chart; data labels: 52%, 55%, 20%]
Parallel Operator Performance [chart; data labels: 26%, 29%]
Conclusion.
                     Naïve parallel   SMT-Aware      Work-Ahead
Impl. effort         Small            Moderate       Moderate
Data format          Unchanged        Split output   Unchanged
Data order           Unchanged        Changed        Unchanged*
Performance (row)    Moderate         High           High
Performance (col)    Moderate         High           Moderate
Control of cache     No               No             Yes