DBMSs On A Modern Processor: Where Does Time Go? by A. Ailamaki, D.J. DeWitt, M.D. Hill, and D. Wood, University of Wisconsin-Madison, Computer Science Dept.

Presentation transcript:

DBMSs On A Modern Processor: Where Does Time Go? by A. Ailamaki, D.J. DeWitt, M.D. Hill, and D. Wood, University of Wisconsin-Madison, Computer Science Dept., Madison, WI. Presented by Derwin Halim

Agenda
- Database and DBMS
- Motivation for DBMS performance study
- Proposed DBMS performance study
- Processor model
- Query execution time breakdown
- Database workload
- Experimental setup and results
- Conclusion

Database and DBMS
- A database is a collection of data, typically describing the activities of one or more related organizations: entities and relationships
- A DBMS (Database Management System) is software designed to assist in maintaining and utilizing large collections of data

Motivation for DBMS Performance Study
- DBMSs are becoming compute and memory bound
- Modern processors do not improve database system performance to the same extent as they improve scientific workloads
- Contrasting commercial DBMSs and identifying common characteristics is difficult
- There is an urgent need to evaluate and understand the processor and memory behavior of commercial DBMSs on existing hardware platforms

Proposed DBMS Performance Study
- Analyze the execution time breakdown of several different commercial DBMSs on the same hardware platform
- Use a workload consisting of simple queries on a memory-resident database
- Isolate basic operations and identify common trends across the DBMSs
- Identify and analyze bottlenecks and propose solutions

Processor Model: Basic Pipeline Operation

Processor Model: Handling Pipeline Stalls
- Non-blocking caches
- Out-of-order execution
- Speculative execution with branch prediction

Query Execution Time Breakdown
T_Q = T_C + T_M + T_B + T_R - T_OVL
where T_Q is the query execution time, T_C is useful computation time, T_M is memory (cache and TLB) stall time, T_B is the branch misprediction penalty, T_R is resource (dependency and functional unit) stall time, and T_OVL is the stall time that overlaps with computation, subtracted to avoid double counting.
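
To make the model concrete, here is a minimal Python sketch that plugs hypothetical stall-time components into the breakdown. It illustrates the formula only, it is not the paper's measurement code, and all numbers are made-up placeholders.

# Minimal sketch of the execution-time breakdown model (illustrative only).
# T_C: useful computation, T_M: memory stalls, T_B: branch misprediction
# stalls, T_R: resource stalls, T_OVL: stall time hidden by out-of-order,
# non-blocking execution (subtracted so overlapped stalls are not counted twice).

def query_execution_time(t_c, t_m, t_b, t_r, t_ovl):
    """T_Q = T_C + T_M + T_B + T_R - T_OVL (all in the same time unit)."""
    return t_c + t_m + t_b + t_r - t_ovl

# Hypothetical per-query times in milliseconds.
t_q = query_execution_time(t_c=20.0, t_m=45.0, t_b=10.0, t_r=15.0, t_ovl=12.0)
print(f"T_Q = {t_q:.1f} ms")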

Database Workload
- Single-table range selections and two-table equijoins over a memory-resident database, running a single command stream
- Eliminates dynamic and random parameters
- Isolates the basic operations: sequential access and index selection
- Allows examination of processor and memory behavior without I/O interference

Database Workload
- Table: create table R (a1 integer not null, a2 integer not null, a3 integer not null, <rest of fields>)
- Sequential range selection: select avg(a3) from R where a2 < Hi and a2 > Lo
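
For illustration, a small Python sketch of the sequential range selection's behavior over an in-memory table: scan every record, test the predicate on a2, and average a3 over the qualifying rows. The record layout, table size, and Lo/Hi bounds below are assumptions chosen for the example, not the paper's exact parameters.

# Illustrative only: the query is a pure sequential scan with one
# data-dependent branch (the a2 predicate), which is the basic operation
# the study isolates. Rows and bounds are made up.
rows = [{"a1": i, "a2": i % 1000, "a3": float(i % 7)} for i in range(100_000)]

def sequential_range_selection(table, lo, hi):
    """select avg(a3) from R where a2 < hi and a2 > lo"""
    total, count = 0.0, 0
    for row in table:            # sequential access over the whole table
        if lo < row["a2"] < hi:  # selectivity is controlled by lo and hi
            total += row["a3"]
            count += 1
    return total / count if count else None

print(sequential_range_selection(rows, lo=100, hi=200))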

Database Workload
- Indexed range selection: construct a non-clustered index on R.a2, then resubmit the range selection
- Sequential join: select avg(R.a3) from R, S where R.a2 = S.a1
- S contains 40,000 100-byte records, each of which joins with 30 records in R

Experimental Setup: Hardware and Software Platform
- 400 MHz Pentium II Xeon/MT workstation
- 512 MB main memory, 100 MHz system bus
- Out-of-order engine and speculative instruction execution
- Non-blocking caches
- Separate first-level data and instruction caches, unified second-level cache
- 4 commercial DBMSs on Windows NT 4.0 Service Pack 4
- Measurements via the processor's event counters and Intel's emon tool

Experimental Setup: PII Xeon Cache Characteristics

Experimental Setup: Measuring Stall Time Components
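
The slide's measurement details are not reproduced in this transcript, but the paper's general approach is to derive each stall component from hardware event counters. The Python sketch below shows the shape of that calculation for the memory component: per-level miss counts multiplied by a per-miss penalty. The counter names and penalty values are assumptions chosen for illustration, not the measured Pentium II Xeon latencies or emon's actual output.

# Illustrative sketch: estimate memory stall cycles as the sum over cache/TLB
# levels of (miss count x assumed per-miss penalty), ignoring any overlap.
ASSUMED_PENALTY_CYCLES = {  # hypothetical per-miss penalties, in cycles
    "L1I": 10,
    "L1D": 10,
    "L2": 70,
    "ITLB": 30,
}

def memory_stall_cycles(miss_counts):
    return sum(ASSUMED_PENALTY_CYCLES[level] * misses
               for level, misses in miss_counts.items())

# Made-up counter readings for one query run.
counters = {"L1I": 2_000_000, "L1D": 500_000, "L2": 300_000, "ITLB": 10_000}
print("estimated memory stall cycles:", memory_stall_cycles(counters))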

Results: Execution Time Breakdown
- The processor spends most of its time stalled
- The problem will be exacerbated by the ever-increasing processor-memory speed gap
- Bottlenecks shift: reducing one stall component exposes another

Results: Memory Stalls Breakdown
- L1 D-cache, L2 I-cache, and ITLB stall times are insignificant
- Focus on the L1 I-cache and L2 D-cache stall time components

Results: L2 D-cache Stall Time
- Depends on the position of the accessed data within the record and on the record size
- An L2 D-cache miss is much more expensive than an L1 D-cache miss
- Only gets worse as the processor-memory performance gap increases
- Larger caches come with longer latencies
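
A rough model of the record-size and field-position effect (my own back-of-the-envelope sketch, not the paper's analysis): for a sequential scan that reads one fixed-size field per record from a table far larger than the cache, count how many distinct cache lines must be brought in per record. Small records let several accessed fields share a 32-byte line, so misses are amortized; 100-byte records put each accessed field on its own line, and a field placed near a line boundary touches two lines.

# Illustrative only: assumes records packed back to back starting at a line
# boundary, a cold cache, and 32-byte lines (Pentium II era).
def lines_touched_per_record(record_size, field_offset, n_records=10_000,
                             field_size=4, line_size=32):
    lines = set()
    for i in range(n_records):
        first = (i * record_size + field_offset) // line_size
        last = (i * record_size + field_offset + field_size - 1) // line_size
        lines.update(range(first, last + 1))
    return len(lines) / n_records

for rec_size, offset in ((8, 0), (32, 0), (100, 0), (100, 30)):
    print(f"record={rec_size:3d}B, field offset={offset:2d}: "
          f"{lines_touched_per_record(rec_size, offset):.2f} lines/record")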

Results: L1 I-cache Stall Time
- L1 I-cache misses are difficult to overlap and cause a serial bottleneck in the pipeline
- L1 cache size vs. latency trade-off
- L1 I-cache misses increase as the data record size increases, due to:
  - Inclusion: an L2 cache replacement forces an L1 cache replacement
  - OS interrupts: periodic context switching
  - Page boundary crossings
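
One way to see why instruction cache behavior is so sensitive in a DBMS (a general, assumed model rather than a result stated on this slide): if the code footprint executed per record exceeds the 16 KB L1 I-cache, the per-record loop no longer fits and most instruction lines are refetched on every iteration instead of being amortized across records.

# Crude sketch: lines beyond L1 I-cache capacity are assumed to be refetched
# on every per-record iteration; lines that fit are fetched once and then hit.
def icache_lines_refetched_per_record(code_footprint_bytes,
                                      l1i_size=16 * 1024, line_size=32):
    total_lines = code_footprint_bytes // line_size
    resident_lines = min(total_lines, l1i_size // line_size)
    return total_lines - resident_lines

for footprint_kb in (8, 16, 64):
    refetched = icache_lines_refetched_per_record(footprint_kb * 1024)
    print(f"per-record code footprint={footprint_kb:2d}KB -> "
          f"~{refetched} instruction lines refetched per record")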

Results: Branch Mis-prediction
- Causes a serial bottleneck and additional instruction cache misses
- 40% BTB (branch target buffer) misses on average => the processor falls back to static prediction more often
- L1 I-cache misses follow the branch mis-prediction behavior

Results: Resource Stall Time
- Dominated by dependency and/or functional unit stalls
- Dependency stalls, caused by low ILP, are the most important resource stalls (except for System A)
- Functional unit stalls are caused by contention in the execution units

Results: Simple Queries vs. TPC Benchmarks
- Simple queries vs. TPC-D (DSS):
  - Similar CPI breakdown
  - Still dominated by L1 I-cache and L2 D-cache misses
- Simple queries vs. TPC-C (OLTP):
  - The CPI of TPC-C is much higher
  - Resource stalls are higher
  - Dominated by L2 data and instruction cache misses

Conclusion
- Memory stalls are a serious performance bottleneck
- Focus on L1 I-cache and L2 D-cache misses
- Improvements should address all of the stall components, because bottlenecks can shift
- The simple-query workload offers a methodological advantage
- TPC-D has a similar execution time breakdown, while TPC-C incurs more second-level cache and resource stalls