DBMSs on a Modern Processor: Where Does Time Go?
Anastassia Ailamaki
Joint work with David DeWitt, Mark Hill, and David Wood at the University of Wisconsin-Madison
© 1999 Anastassia Ailamaki

Higher DBMS Performance
- Sophisticated, powerful new processors + compute- and memory-intensive DB applications = suboptimal performance for DBMSs
- Where is query execution time spent? Look for performance bottlenecks in the processor and memory components
Outline
- Introduction
- Background
- Query execution time breakdown
- Experimental results
- Conclusions
Hardware Performance Evaluation
- Benchmarks: SPEC, SPLASH, LINPACK
- Enterprise servers run commercial applications
- How do database systems perform?
The New DBMS Bottleneck
- The earlier bottleneck was I/O; today's applications are memory and compute intensive
- Modern platforms offer:
  - sophisticated execution hardware
  - fast, non-blocking caches and memory
- Still, DBMS hardware behavior is suboptimal compared to scientific workloads
An Execution Pipeline
[Diagram: a fetch/decode unit, a dispatch/execute unit, and a retire unit share an instruction pool; the L1 I-cache and L1 D-cache are backed by a unified L2 cache and main memory]
Features: branch prediction, non-blocking caches, out-of-order execution
Where Does Time Go?
- "Measured" and "estimated" components
- Computation, plus stalls: memory, branch mispredictions, hardware resources
- Overlap opportunity: while Load A is outstanding, the independent computation D = B + C can execute, and Load E can issue
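The breakdown above can be sketched as a simple additive cost model: total time is computation plus the three stall components, minus whatever the out-of-order engine manages to overlap. The counter values below are hypothetical examples, not measurements from the study.

```python
# Sketch of the execution-time model: T = computation + stalls - overlap.
# All cycle counts here are made-up illustrative numbers.

def time_breakdown(t_comp, t_mem, t_branch, t_resource, t_overlap):
    """Return total cycles and the fraction of time spent stalled."""
    stalls = t_mem + t_branch + t_resource
    total = t_comp + stalls - t_overlap
    return total, (stalls - t_overlap) / total

total, stall_frac = time_breakdown(
    t_comp=40, t_mem=45, t_branch=10, t_resource=15, t_overlap=10)
print(total, round(stall_frac, 2))  # → 100 0.6
```

With these example numbers, 60% of the cycles are stalls, matching the "stalls at least 50% of time" finding later in the talk.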
Setup and Methodology
- Four commercial DBMSs: A, B, C, D
- A 6400 PII Xeon/MT running Windows NT 4
- Used the processor's hardware performance counters
- Range selection (sequential, indexed): select avg(a3) from R where a2 > Lo and a2 < Hi
- Equijoin (sequential): select avg(a3) from R, S where R.a2 = S.a1
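The two microbenchmarks can be re-created in miniature, here on SQLite rather than the four commercial systems studied; the table contents, sizes, and selectivity bounds below are made up, and only the schemas R(a1, a2, a3) and S(a1, a2, a3) follow the slide.

```python
import sqlite3

# Hypothetical in-memory re-creation of the two microbenchmark queries.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE R (a1 INTEGER, a2 INTEGER, a3 REAL)")
con.execute("CREATE TABLE S (a1 INTEGER, a2 INTEGER, a3 REAL)")
rows = [(i, i % 100, float(i)) for i in range(1000)]
con.executemany("INSERT INTO R VALUES (?, ?, ?)", rows)
con.executemany("INSERT INTO S VALUES (?, ?, ?)", rows)

# Range selection; Lo/Hi control selectivity (an index on R.a2
# would turn this into the indexed variant).
lo, hi = 10, 20
sel = con.execute("SELECT avg(a3) FROM R WHERE a2 > ? AND a2 < ?",
                  (lo, hi)).fetchone()[0]

# Equijoin (a3 qualified as R.a3, since both tables have an a3 column).
join = con.execute("SELECT avg(R.a3) FROM R, S WHERE R.a2 = S.a1"
                   ).fetchone()[0]
print(sel, join)  # → 465.0 499.5
```

Keeping the queries this simple is the point of the methodology: every parameter (selectivity, access path, record layout) stays under the experimenter's control.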
Why Simple Queries?
- Easy to set up and run
- Fully controllable parameters
- Enable iterative hypotheses
- Allow isolating the behavior of basic loops (this workload is not suited to comparing DBMS speed)
- Building blocks for complex workloads?
Execution Time Breakdown (%)
- Stalls account for at least 50% of execution time
- Memory stalls are the major bottleneck
CPI (Clocks Per Instruction) Breakdown
- CPI is high compared to scientific workloads
- Indexed access incurs more memory stalls per instruction
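CPI is simply cycles divided by retired instructions, and each stall source can be expressed per instruction the same way. The counter values below are hypothetical; on a Pentium II they would come from events such as CPU_CLK_UNHALTED and INST_RETIRED.

```python
# Sketch: CPI and its per-instruction stall decomposition from
# hardware counters. Values are illustrative, not measured.
def cpi_breakdown(cycles, instructions, mem_stall, br_stall, res_stall):
    return {
        "CPI": cycles / instructions,
        "memory": mem_stall / instructions,
        "branch": br_stall / instructions,
        "resource": res_stall / instructions,
        "computation": (cycles - mem_stall - br_stall - res_stall)
                       / instructions,
    }

b = cpi_breakdown(3_000_000, 1_000_000,
                  mem_stall=1_200_000, br_stall=300_000,
                  res_stall=500_000)
print(b["CPI"], b["memory"])  # → 3.0 1.2
```

A CPI of 3 on a machine whose minimum CPI is well below 1 is the sense in which "CPI is high" here.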
Memory Stalls Breakdown (%)
- The role of the L1 data cache is unimportant
- L1 instruction and L2 data stalls dominate
- Memory bottlenecks vary across DBMSs and queries
Effect of Record Size (10% selectivity, sequential scan)
- L2D stalls increase: reduced spatial locality plus page crossing (except system D)
- L1I stalls increase: page-boundary crossing costs
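A hypothetical back-of-the-envelope model of why larger records hurt a sequential scan: each record spans more L2 cache lines (less useful data per fetched line), and fewer records fit per page, so per-record page-boundary overhead rises. The line and page sizes below are assumptions, not parameters reported by the study.

```python
LINE = 32      # assumed Pentium II L2 cache-line size (bytes)
PAGE = 8192    # an assumed slotted-page size (bytes)

def scan_cost_per_record(record_size):
    """Cache lines touched per record, and records packed per page."""
    lines_per_record = record_size / LINE
    records_per_page = PAGE // record_size
    return lines_per_record, records_per_page

print(scan_cost_per_record(20))   # → (0.625, 409)
print(scan_cost_per_record(200))  # → (6.25, 40)
```

Going from 20-byte to 200-byte records multiplies the lines fetched per record tenfold while cutting records per page by an order of magnitude, consistent with the L2D and page-crossing trends on the slide.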
Memory Bottlenecks
- Memory is important:
  - the memory-processor performance gap keeps increasing
  - deeper memory hierarchies are expected
- Stalls due to L2 cache data misses:
  - compulsory or repeated misses
  - L2 caches grow (8 MB), but will be slower
- Stalls due to L1 I-cache misses:
  - buffer-pool code is expensive
  - the L1 I-cache is not likely to grow as much as the L2
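Stall estimates of this kind are typically computed as miss counts multiplied by per-level miss penalties; a sketch of that style of estimate, with penalties that are assumptions rather than measured Pentium II latencies:

```python
# Sketch: estimated memory-stall cycles = misses x assumed penalty
# per hierarchy level. Penalties and miss counts are hypothetical.
def memory_stalls(l1i_misses, l1d_misses, l2_misses,
                  l1_penalty=4, l2_penalty=60):
    # L1 misses that hit in L2 pay l1_penalty cycles;
    # L2 misses go all the way to main memory.
    return (l1i_misses + l1d_misses) * l1_penalty + l2_misses * l2_penalty

print(memory_stalls(10_000, 5_000, 2_000))  # → 180000
```

Note how the model explains the slide's worry: a bigger but slower L2 raises l2_penalty, and a growing processor-memory gap raises it further, so L2 data misses stay expensive even as the cache grows.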
Branch Mispredictions Are Expensive
- Misprediction rates are low, but their contribution to stall time is significant
- Branch behavior is a compiler task, but it is decisive for L1I performance
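Why a low misprediction rate can still matter: every miss flushes the pipeline, so the cost scales with branch frequency times rate times a per-miss penalty. The ~15-cycle penalty and the workload figures below are assumed for illustration, not taken from the study.

```python
# Sketch: stall-CPI contribution of branch mispredictions.
def branch_stall_cpi(instructions, branch_frequency,
                     mispredict_rate, penalty=15):
    branches = instructions * branch_frequency
    stall_cycles = branches * mispredict_rate * penalty
    return stall_cycles / instructions

# 20% branches with only a 3% miss rate still costs ~0.09
# cycles per instruction:
print(round(branch_stall_cpi(1_000_000, 0.20, 0.03), 4))  # → 0.09
```

Against a theoretical minimum CPI well under 1, an extra 0.09 cycles per instruction from a "low" miss rate is a noticeable tax, which is the slide's point.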
Branch Mispredictions vs. L1 I-cache Misses
- More branch mispredictions incur more L1I misses
- Index code is more complicated and needs optimization
Resource-related Stalls
- Two components: dependency-related stalls (T_DEP) and functional-unit stalls (T_FU)
- High T_DEP for all systems: low ILP opportunity
- A's sequential scan: memory-unit load buffers?
Microbenchmarks vs. TPC: CPI Breakdown
- The sequential-scan breakdown is similar to TPC-D
- Secondary-index access and TPC-C: higher CPI, more memory stalls
Conclusions
- The execution time breakdown shows clear trends
- L1I and L2D are the major memory bottlenecks
- We need to:
  - reduce page-crossing costs
  - optimize the instruction stream
  - optimize data placement in the L2 cache
  - reduce stalls at all levels
- TPC may not be necessary to locate bottlenecks