DBMSs on a Modern Processor: Where Does Time Go? Anastassia Ailamaki Joint work with David DeWitt, Mark Hill, and David Wood at the University of Wisconsin-Madison.

© 1999 Anastassia Ailamaki

Higher DBMS Performance
Sophisticated, powerful new processors + compute- and memory-intensive DB apps = suboptimal performance for DBMSs.
Where is query execution time spent? Look for performance bottlenecks in processor and memory components.

Outline
Introduction
Background
Query execution time breakdown
Experimental results
Conclusions

Hardware Performance Evaluation
Benchmarks: SPEC, SPLASH, LINPACK
Enterprise servers run commercial apps
How do database systems perform?

The DBMS's New Bottleneck
The earlier bottleneck was I/O; today's DB apps are memory- and compute-intensive.
Modern platforms offer:
- sophisticated execution hardware
- fast, non-blocking caches and memory
Still, DBMSs' hardware behavior is suboptimal compared to scientific workloads.

An Execution Pipeline
[Diagram: fetch/decode unit, dispatch/execute unit, and retire unit connected through an instruction pool; L1 I-cache and L1 D-cache backed by a unified L2 cache and main memory. Features: branch prediction, non-blocking caches, out-of-order execution.]

Where Does Time Go?
"Measured" and "estimated" components:
- Computation
- Stalls: memory, branch mispredictions, hardware resources
Overlap opportunity: Load A / D = B + C / Load E (independent work can proceed while a load is outstanding).
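The breakdown on this slide is arithmetic over hardware counter readings. The sketch below is a hedged illustration of that accounting, not the authors' tool: the `counters` dictionary, its key names, and the optional `overlap_cycles` credit are invented for the example.

```python
# Minimal sketch of the time-breakdown model: total cycles split into
# computation plus three stall components, with cycles that overlap
# computation credited back once. All names here are illustrative.

def time_breakdown(counters, overlap_cycles=0):
    """Return each component as a fraction of total execution cycles."""
    stalls = (counters["memory_stall"]
              + counters["branch_stall"]
              + counters["resource_stall"])
    total = counters["total_cycles"]
    # Computation is whatever is left after non-overlapped stall cycles.
    computation = total - stalls + overlap_cycles
    return {
        "computation": computation / total,
        "memory": counters["memory_stall"] / total,
        "branch": counters["branch_stall"] / total,
        "resource": counters["resource_stall"] / total,
    }
```

With zero overlap the four fractions sum to 1; a positive overlap credit models loads hidden behind independent instructions, as in the Load A / D = B + C / Load E example.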

Setup and Methodology
Four commercial DBMSs: A, B, C, D
6400 PII Xeon/MT running Windows NT 4
Used processor performance counters
Range selection (sequential, indexed):
  select avg(a3) from R where a2 > Lo and a2 < Hi
Equijoin (sequential):
  select avg(a3) from R, S where R.a2 = S.a1

Why Simple Queries?
Easy to set up and run
Fully controllable parameters
Enable iterative hypotheses
Isolate the behavior of basic loops (the workload is not meant for comparing speed)
Building blocks for complex workloads?

Execution Time Breakdown (%)
Stalls account for at least 50% of execution time; memory stalls are the major bottleneck.

CPI (Clocks Per Instruction) Breakdown
CPI is high compared to scientific workloads.
Indexed access incurs more memory stalls per instruction.
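CPI and its stall share fall straight out of two counters. The numbers below are invented for illustration, not measurements from the talk; only the ratio cycles/instructions is the point.

```python
# Hedged sketch: CPI and its memory-stall component from hypothetical
# counter samples for a sequential and an indexed scan. The sample
# values are made up; indexed access shows the slide's pattern of
# fewer instructions but far more stall cycles.

def cpi(cycles, instructions):
    """Clocks per instruction: the talk's headline metric."""
    return cycles / instructions

samples = {
    "sequential": {"cycles": 1.2e9, "instr": 0.8e9, "mem_stall": 0.4e9},
    "indexed":    {"cycles": 2.4e9, "instr": 0.6e9, "mem_stall": 1.4e9},
}

for name, s in samples.items():
    print(f"{name}: CPI={cpi(s['cycles'], s['instr']):.2f}, "
          f"memory-stall CPI={cpi(s['mem_stall'], s['instr']):.2f}")
```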

Memory Stalls Breakdown (%)
The L1 data cache plays a minor role; L1 instruction and L2 data stalls dominate.
Memory bottlenecks vary across DBMSs and queries.

Effect of Record Size (10% sequential scan)
L2D stalls increase: worse locality + page-boundary crossing (except system D)
L1I stalls increase: page-boundary crossing costs
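The locality effect behind this slide can be sketched with simple boundary arithmetic: a record that does not align to cache-line (or page) boundaries touches more lines as it grows. The 32-byte line size below is a typical late-1990s value chosen for the example, not necessarily the measured platform's.

```python
# Hedged sketch: how many cache lines one record touches, depending on
# its size and its starting offset within the page. Line size is an
# assumed typical value, not the paper's exact parameter.

def lines_touched(record_bytes, offset, line_bytes=32):
    """Cache lines spanned by a record starting at byte `offset`."""
    first = offset // line_bytes
    last = (offset + record_bytes - 1) // line_bytes
    return last - first + 1

# A 100-byte record always spans at least 4 of the 32-byte lines,
# and 5 when it straddles one extra boundary.
print(lines_touched(100, 0))    # 4
print(lines_touched(100, 29))   # 5
```

Unaligned records thus pay extra misses as they grow, matching the L2D increase the slide attributes to locality plus boundary crossing.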

Memory Bottlenecks
Memory is important:
- increasing memory-processor performance gap
- deeper memory hierarchies expected
Stalls due to L2 cache data misses:
- compulsory or repeated
- L2 grows (8MB), but will be slower
Stalls due to L1 I-cache misses:
- buffer pool code is expensive
- L1 I-cache not likely to grow as much as L2
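A back-of-envelope way to see why L2 data misses dominate: estimated stall cycles are misses times the miss penalty, and the L2 penalty is a full round trip to main memory. The penalties and miss counts below are illustrative round numbers for a late-1990s processor, not the talk's measurements.

```python
# Hedged estimate: stall cycles = misses x per-level miss penalty.
# Both the penalties and the miss counts are invented for illustration.

PENALTY = {"L1I": 10, "L1D": 10, "L2": 70}  # cycles, assumed values

def stall_cycles(misses):
    """Estimated stall cycles per cache level."""
    return {lvl: n * PENALTY[lvl] for lvl, n in misses.items()}

est = stall_cycles({"L1I": 5_000_000, "L1D": 1_000_000, "L2": 2_000_000})
# Even with fewer events, L2 dominates because its penalty is the
# memory round trip: 2M x 70 = 140M cycles vs. 5M x 10 = 50M for L1I.
```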

Branch Mispredictions Are Expensive
Misprediction rates are low, but their contribution to stall time is significant.
Reducing them is largely a compiler task, but decisive for L1I performance.

Branch Mispredictions vs. L1 I-cache Misses
More branch mispredictions incur more L1I misses.
Index code is more complicated and needs optimization.

Resource-Related Stalls
Two components: dependency-related stalls (T_DEP) and functional-unit-related stalls (T_FU).
T_DEP is high for all systems: low ILP opportunity.
A's sequential scan: memory unit load buffers?

Microbenchmarks vs. TPC CPI Breakdown
The sequential-scan breakdown is similar to TPC-D.
Secondary-index access and TPC-C show higher CPI and more memory stalls.

Conclusions
Execution time breakdown shows trends.
L1I and L2D are the major memory bottlenecks.
We need to:
- reduce page-crossing costs
- optimize the instruction stream
- optimize data placement in the L2 cache
- reduce stalls at all levels
TPC may not be necessary to locate bottlenecks.