
1 Improving Database Performance on Simultaneous Multithreading Processors
Jingren Zhou, Microsoft Research, jrzhou@microsoft.com
John Cieslewicz, Columbia University, johnc@cs.columbia.edu
Kenneth A. Ross, Columbia University, kar@cs.columbia.edu
Mihir Shah, Columbia University, ms2604@columbia.edu

2 Simultaneous Multithreading (SMT)
- Available on modern CPUs:
  - "Hyperthreading" on the Pentium 4 and Xeon
  - IBM POWER5
  - Sun UltraSPARC IV
- Challenge: design software to efficiently utilize SMT.
  - This talk: database software
[Image: an Intel Pentium 4 with Hyperthreading]

3 Superscalar Processor (no SMT)
[Figure: one instruction stream issuing into a superscalar pipeline (up to 2 instructions/cycle) over time]
- Improved instruction-level parallelism
- CPI = 3/4 in the example shown

4 SMT Processor
[Figure: two interleaved instruction streams sharing the pipeline over time; CPI = 5/8 in the example shown]
- Improved thread-level parallelism
- More opportunities to keep the processor busy
- But sometimes SMT does not work so well

5 Stalls
[Figure: two instruction streams; one stalls while the other continues, so CPI = 3/4 - progress despite the stalled thread]
- Stalls are caused by cache misses (200-300 cycles for an L2 miss), branch mispredictions (20-30 cycles), etc.

6 Memory Consistency
[Figure: two instruction streams accessing the same cache line]
- When the processor detects conflicting accesses to a common cache line, it flushes the pipeline and synchronizes the cache with RAM.
- This is a "MOMC Event" (Memory Order Machine Clear) on the Pentium 4, costing 300-350 cycles.

7 SMT Processor
- Exposes multiple "logical" CPUs (one per instruction stream).
- One physical CPU (~5% extra silicon to duplicate thread state information).
- Better than single threading:
  - Increased thread-level parallelism
  - Improved processor utilization when one thread blocks
- Not as good as two physical CPUs:
  - CPU resources are shared, not replicated

8 SMT Challenges
- Resource competition:
  - Shared execution units
  - Shared cache
- Thread coordination:
  - Locking, etc. has high overhead
- False sharing:
  - MOMC events

9 Approaches to Using SMT
- Ignore it, and write single-threaded code.
- Naïve parallelism: pretend the logical CPUs are physical CPUs.
- SMT-aware parallelism: parallel threads designed to avoid SMT-related interference.
- Use one thread for the algorithm and another to manage resources, e.g., to avoid stalls on cache misses.

10 Naïve Parallelism
- Treat the SMT processor as if it were multi-core.
- Databases are already designed to utilize multiple processors, so no code modification is needed.
- Uses shared processor resources inefficiently:
  - Cache pollution / interference
  - Competition for execution units

11 SMT-Aware Parallelism
- Exploit intra-operator parallelism.
- Divide the input and use a separate thread to process each part:
  - E.g., one thread for even tuples, one for odd tuples.
  - No explicit partitioning step is required.
- Sharing the input involves multiple readers:
  - No MOMC events, because two reads do not conflict.

12 SMT-Aware Parallelism (cont.)
- Sharing the output is challenging:
  - Thread coordination is needed for output.
  - Read/write and write/write conflicts on common cache lines cause MOMC events.
- "Solution": partition the output.
  - Each thread writes to a separate memory buffer to avoid memory conflicts.
  - An extra merge step is needed in the consumer of the output stream.
  - It is difficult to maintain input order in the output.
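
To make the even/odd input split and the partitioned output concrete, here is a minimal sketch (the Tuple layout, the predicate, and the use of std::thread with per-thread std::vector buffers are illustrative assumptions, not the paper's operator): both threads read the shared input, each appends its matches to its own buffer, and the consumer merges the buffers afterwards, which is why the original input order is not preserved.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

struct Tuple { std::uint32_t key; std::uint32_t payload; };

// Hypothetical predicate standing in for an operator's selection condition.
static bool qualifies(const Tuple& t) { return t.key % 7 == 0; }

// Each thread scans every other tuple (thread 0: even positions, thread 1: odd)
// and writes to its own output vector, so the threads never write to a shared
// cache line and avoid MOMC events on the output.
static void scan_partition(const std::vector<Tuple>& input, std::size_t start,
                           std::size_t stride, std::vector<Tuple>& out) {
    for (std::size_t i = start; i < input.size(); i += stride)
        if (qualifies(input[i])) out.push_back(input[i]);
}

int main() {
    std::vector<Tuple> input(1000000);
    for (std::size_t i = 0; i < input.size(); ++i)
        input[i] = {static_cast<std::uint32_t>(i), 0};

    std::vector<Tuple> out0, out1;   // separate per-thread output buffers
    std::thread t0(scan_partition, std::cref(input), 0, 2, std::ref(out0));
    std::thread t1(scan_partition, std::cref(input), 1, 2, std::ref(out1));
    t0.join();
    t1.join();

    // Extra merge step in the consumer; the original input order is lost.
    std::vector<Tuple> merged;
    merged.reserve(out0.size() + out1.size());
    merged.insert(merged.end(), out0.begin(), out0.end());
    merged.insert(merged.end(), out1.begin(), out1.end());
    return merged.empty() ? 1 : 0;
}
```

Because the two output vectors live in separate allocations, the writers never touch a common cache line, which is exactly what partitioning the output is meant to guarantee.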

13 Managing Resources for SMT
- Cache misses are a well-known performance bottleneck for modern database systems:
  - Mainly L2 data cache misses, but also L1 instruction cache misses [Ailamaki et al. 1999].
- Goal: use a "helper" thread to avoid cache misses in the "main" thread.
  - Load future memory references into the cache.
  - Use an explicit load, not a prefetch.

14 Data Dependency
- Memory references that depend upon a previous memory access exhibit a data dependency.
- E.g., a hash table lookup: the hash bucket, its overflow cells, and finally the tuple must be visited in sequence.
[Figure: a hash bucket with a chain of overflow cells leading to the tuple]
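
A small sketch of that dependency (the cell layout here is an illustrative assumption, not the paper's hash table): the address of each overflow cell is known only after the previous cell has been loaded, so the cache misses within a single probe cannot be overlapped with each other.

```cpp
#include <cstdint>

// Illustrative chained hash bucket: each cell must be loaded before the
// address of the next cell (or of the matching tuple) is known.
struct Cell {
    std::uint32_t key;
    const void*   tuple;   // payload for a matching key
    const Cell*   next;    // overflow chain
};

// Every iteration's load depends on the result of the previous load, so the
// processor cannot issue the next cache miss until the current one resolves.
const void* probe(const Cell* bucket, std::uint32_t key) {
    for (const Cell* c = bucket; c != nullptr; c = c->next)
        if (c->key == key) return c->tuple;
    return nullptr;
}
```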

15 Data Dependency (cont.)
- Data dependencies make instruction-level parallelism harder.
- Modern architectures provide prefetch instructions:
  - They request that data be brought into the cache.
  - They are non-blocking.
- Pitfalls:
  - Prefetch instructions are frequently dropped.
  - They are difficult to tune.
  - Too much prefetching can pollute the cache.
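
As a concrete example of such an instruction, gcc and clang expose prefetching through the __builtin_prefetch intrinsic (a hint that maps to a hardware prefetch on x86). The sketch below, using an illustrative chained-bucket layout, prefetches the next probe's head bucket while the current probe is processed; the hint may be dropped by the hardware, and the prefetch distance has to be tuned.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Cell { std::uint32_t key; const Cell* next; };

// Issue a non-blocking prefetch for the next probe's head bucket while the
// current probe is processed.  __builtin_prefetch is only a hint: it may be
// dropped, and prefetching too far ahead can pollute the cache.
std::size_t count_matches(const std::vector<const Cell*>& buckets,
                          const std::vector<std::uint32_t>& keys) {
    std::size_t matches = 0;                     // buckets and keys assumed equal size
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (i + 1 < buckets.size())
            __builtin_prefetch(buckets[i + 1]);  // hint only; harmless if dropped
        for (const Cell* c = buckets[i]; c != nullptr; c = c->next)
            if (c->key == keys[i]) { ++matches; break; }
    }
    return matches;
}
```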

16 Staging Computation
[Figure: a hash bucket A, overflow cells B and C, and the tuple, each on its own cache line]
1. Preload A.
2. (other work)
3. Process A.
4. Preload B.
5. (other work)
6. Process B.
7. Preload C.
8. (other work)
9. Process C.
10. Preload the tuple.
11. (other work)
12. Process the tuple.
(Assumes each element is a cache line.)
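
A minimal sketch of staging over a batch of probes, assuming a chained-cell layout like the earlier sketch: each pass advances every still-active probe by one cell and prefetches that probe's next cell, so the "other work" that hides one probe's miss is simply progress on the other probes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Cell {
    std::uint32_t key;
    const void*   tuple;
    const Cell*   next;
};

// Advance a whole batch of probes one step at a time: prefetch each probe's
// current cell, then on a later pass consume it and prefetch its successor,
// so the cache-miss latencies of different probes overlap.
void probe_batch(const std::vector<const Cell*>& buckets,
                 const std::vector<std::uint32_t>& keys,
                 std::vector<const void*>& results) {
    const std::size_t n = keys.size();           // buckets/results assumed same size
    std::vector<const Cell*> cur(buckets);       // each probe's position in its chain
    for (std::size_t i = 0; i < n; ++i) {
        results[i] = nullptr;
        if (cur[i]) __builtin_prefetch(cur[i]);  // "Preload" step for every probe
    }
    bool active = true;
    while (active) {
        active = false;
        for (std::size_t i = 0; i < n; ++i) {    // "Process" step for every probe,
            const Cell* c = cur[i];              // then preload its next element
            if (c == nullptr) continue;
            if (c->key == keys[i]) {
                results[i] = c->tuple;           // match found; this probe is done
                cur[i] = nullptr;
                continue;
            }
            cur[i] = c->next;
            if (cur[i]) { __builtin_prefetch(cur[i]); active = true; }
        }
    }
}
```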

17 Staging Computation (cont.)
- By overlapping memory latency with other work, some cache miss latency can be hidden.
- Many probes are "in flight" at the same time.
- Algorithms need to be rewritten, e.g., Chen et al. [2004], Harizopoulos et al. [2004].

18 Work-Ahead Set: Main Thread
- Writes a memory address plus computation state to the work-ahead set.
- Retrieves a previously inserted address and state.
- The hope is that the helper thread preloads the data before it is retrieved by the main thread.
- The result is correct whether or not the helper thread succeeds at preloading the data:
  - The helper thread is read-only.
(A combined sketch of the main and helper threads follows slide 23.)

19 Work-Ahead Set Data Structure
[Figure: an array of (state, address) entries; the main thread has inserted entries A-F, each with state 1]

20 Work-Ahead Set Data Structure
[Figure: the same array after further insertions; entries G-L have been added, now carrying states 1 and 2]

21 Work-Ahead Set: Helper Thread
- Reads memory addresses from the work-ahead set and loads their contents.
- The data becomes cache resident.
- It tries to preload the data before the main thread cycles around.
- If it succeeds, the main thread experiences cache hits.
(See the sketch after slide 23.)

22 Work-Ahead Set Data Structure
[Figure: the helper thread walks the (state, address) entries, touching each address with an explicit read such as "temp += *slot[i]"]

23 Iterate Backwards!
[Figure: the helper thread visits the slots in reverse order, i = (i - 1) mod size]
- Why? See the paper.
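
Slides 18-23 describe the two halves of the work-ahead set; the sketch below puts them together, assuming a fixed 128-slot array of (address, state) entries, std::thread, and relaxed atomics for the shared slots - all illustrative choices rather than the paper's exact implementation. The main thread retrieves the entry it inserted a full cycle earlier and then inserts the address it will need in the future; the helper thread walks the slots backwards and touches each address with an explicit load rather than a prefetch.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr std::size_t kSlots = 128;   // size chosen experimentally on the slides

// The work-ahead set: for each slot, the address the main thread will need
// later plus the computation state required to resume (here just a probe id).
std::atomic<const void*> g_addr[kSlots];
std::size_t              g_state[kSlots];
std::atomic<bool>        g_done{false};

// Helper thread: iterate backwards over the slots (i = (i - 1) mod size) and
// touch each address with an explicit load rather than a prefetch, so the
// cache line is brought in even when a prefetch hint would have been dropped.
void helper() {
    std::size_t i = 0;
    volatile std::uint8_t sink = 0;   // keeps the loads from being optimized away
    while (!g_done.load(std::memory_order_relaxed)) {
        i = (i + kSlots - 1) % kSlots;
        if (const void* p = g_addr[i].load(std::memory_order_relaxed))
            sink += *static_cast<const volatile std::uint8_t*>(p);
    }
    (void)sink;
}

int main() {
    // Illustrative "operator": touch one byte of many scattered records.
    std::vector<std::uint8_t> data(64u * 1024u * 1024u, 1);
    const std::size_t probes = 1u << 20;

    for (auto& a : g_addr) a.store(nullptr);
    std::thread h(helper);

    std::uint64_t total = 0;
    std::size_t pos = 0;
    for (std::size_t p = 0; p < probes; ++p) {
        // Retrieve the entry inserted kSlots iterations ago.  If the helper
        // thread got to it, this is a cache hit; if not, the main thread just
        // absorbs the miss - the result is correct either way.
        if (const void* old = g_addr[pos].load(std::memory_order_relaxed))
            total += *static_cast<const std::uint8_t*>(old);

        // Insert the address (and resumption state) needed in the future.
        g_state[pos] = p;
        g_addr[pos].store(&data[(p * 7919u) % data.size()],
                          std::memory_order_relaxed);
        pos = (pos + 1) % kSlots;
    }
    g_done.store(true);
    h.join();
    return total > 0 ? 0 : 1;
}
```

In a complete operator the main thread would also drain the last kSlots pending entries after the input is exhausted, and the stored state would tell it where to resume each suspended probe; the sketch omits both for brevity.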

24 Helper Thread Speed
- If the helper thread is faster than the main thread:
  - There is more computation than memory latency.
  - The helper thread should not preload an entry twice (wasted CPU cycles).
  - See the paper for how redundant loads are stopped.
- If the helper thread is slower:
  - No special tuning is necessary.
  - The main thread will absorb some cache misses.

25 Work-Ahead Set Size
- Too large: cache pollution.
  - Preloaded data evicts other preloaded data before it can be used.
- Too small: thread contention.
  - Many MOMC events, because the work-ahead set spans few cache lines.
- Just right: experimentally determined.
  - Use the smallest size within the acceptable range (performance plateaus), so that cache space remains available for other purposes (here, 128 entries).
- The data structure itself is much smaller than the L2 cache.

26 Experimental Workload
- Two operators:
  - Probe phase of a hash join
  - CSB+-tree index join
- Operators run in isolation and in parallel.
- Intel VTune is used to measure hardware events.
Experimental platform:
  CPU: Pentium 4, 3.4 GHz
  Memory: 2 GB DDR
  L1, L2 size: 8 KB, 512 KB
  L1, L2 cache-line size: 64 B, 128 B
  L1 miss latency: 18 cycles
  L2 miss latency: 276 cycles
  MOMC latency: ~300+ cycles

27 Experimental Outline
1. Hash join
2. Index lookup
3. Mixed: hash join and index lookup

28 Hash Join Comparative Performance

29 Hash Join L2 Cache Misses Per Tuple

30 CSB+-Tree Index Join Comparative Performance

31 CSB+-Tree Index Join L2 Cache Misses Per Tuple

32 Parallel Operator Performance
[Chart; annotated values: 52%, 55%, 20%]

33 Parallel Operator Performance
[Chart; annotated values: 26%, 29%]

34 Conclusion
                     Naïve parallel   SMT-aware      Work-ahead
  Impl. effort       Small            Moderate       Moderate
  Data format        Unchanged        Split output   Unchanged
  Data order         Unchanged        Changed        Unchanged*
  Performance (row)                   Moderate       High
  Performance (col)  Moderate         High           Moderate
  Control of cache   No               No             Yes

