Database Servers on Chip Multiprocessors: Limitations and Opportunities
Nikos Hardavellas, with Ippokratis Pandis, Ryan Johnson, Naju Mancheril, Anastassia Ailamaki, and Babak Falsafi

Hardware Integration Trends

[Figure: traditional multiprocessors (one core with private L1 and L2 per chip, shared memory) vs. chip multiprocessors (multiple cores and caches integrated on one chip).]

Processor designers are constantly looking for new ways to exploit the transistors available on chip more efficiently. As Moore's Law continues, we have moved from the era of pipelined architectures (the 80s) through the era of ILP (the 90s) into the era of multithreaded and multicore processors. This shift poses an imminent technical and research challenge: adapting high-performance data management software to a shifting hardware landscape, because modern DBMS are optimized primarily for pipelined and ILP architectures. At the same time, technological advances have allowed on-chip cache sizes to increase, leading to large but slow caches. In this work we investigate the combined effect of these two trends on DBMS performance.

- Moore's Law: 2x transistors = 2x cores, 2x caches
- Trend toward larger but slower caches

© Hardavellas

Contributions

We show that:
- L2 caches are growing bigger and slower
  - The bottleneck shifts from memory to L2
  - Absolute DBMS performance drops
  - Must enhance DBMS L1 locality
- Hardware parallelism scales exponentially
  - DBMS cannot exploit parallelism under light load
  - Need inherent DBMS parallelism

Methodology

- Flexus simulator (developed at CMU): cycle-accurate, full-system
- OLTP: TPC-C, 100 warehouses, in memory
- DSS: TPC-H throughput, 1 GB database, in memory; scan- and join-bound queries (1, 6, 13, 16)
- Saturated: 64/16 clients (OLTP/DSS)
- Unsaturated (light load): 1 client

Observation #1: Bottleneck Shift to L2-hit Stalls

[Figure: execution-time breakdown vs. L2 cache size on a 4-core CMP running DSS, with markers for the cache sizes of the PIII Xeon 500 (1999), Xeon 7100 (2006), and Itanium2 9050 (2006).]

To drive the point home, we first look at how the performance bottlenecks shift as the L2 cache size increases on a 4-core CMP running DSS. Performance studies in the recent literature fall on the left side of the graph (1-4 MB), where memory stalls are the dominant execution-time component and L2-hit stalls are virtually non-existent. As we move to the right side of the graph, however, L2-hit stalls rise from oblivion to become the dominant component. As shown in the paper, this bottleneck shift has severe ramifications for DBMS performance: instead of a significant speedup (up to 1.7x) from lower miss rates as the cache grows, the increased cache latency causes performance to degrade by up to 30%. At the largest cache size we simulated, the system realizes only half of its potential performance.

Bottleneck shifts from memory stalls to L2-hit stalls

Impact of L2-hit Stalls

Shifting the bottleneck to L2-hit stalls has severe ramifications for DBMS performance. Instead of a significant speedup from lower miss rates as cache size increases, the increased cache latency causes performance to degrade by up to 30%. At the largest cache size we simulated, only half of the potential performance is realized, a significant loss.

- Increasing cache size reduces throughput
- Must enhance L1 locality
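The interplay between miss rate and hit latency behind these two slides can be sketched with a toy average-memory-access-time model. All rates and latencies below are illustrative assumptions, not figures from the study; the point is only the qualitative shift.

```python
# Toy AMAT-style model (illustrative assumptions, not data from the paper):
# a bigger L2 captures more L1 misses but serves each hit more slowly,
# so exposed stalls shift from memory misses to L2 hits.

def stall_breakdown(l2_mb):
    """Return (l2_hit_stalls, mem_stalls) in cycles per memory access."""
    l1_miss_rate = 0.05                             # assumed fixed L1 miss rate
    l2_hit_rate = min(0.95, 0.60 + 0.05 * l2_mb)    # bigger cache -> fewer misses
    l2_latency = 10 + 2 * l2_mb                     # bigger cache -> slower hits
    mem_latency = 300                               # assumed DRAM latency (cycles)
    l2_hit_stalls = l1_miss_rate * l2_hit_rate * l2_latency
    mem_stalls = l1_miss_rate * (1 - l2_hit_rate) * mem_latency
    return l2_hit_stalls, mem_stalls

for size_mb in (1, 4, 16, 26):
    hit, mem = stall_breakdown(size_mb)
    print(f"{size_mb:2d} MB L2: L2-hit stalls {hit:5.2f}  mem stalls {mem:5.2f}")
```

With these numbers, memory stalls dominate at small cache sizes while L2-hit stalls dominate at large ones, mirroring the left and right sides of the graph.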

Observation #2: Parallelism in Modern CMPs

- Fat Camp (FC): wide-issue, out-of-order cores, e.g., IBM Power5
- Lean Camp (LC): in-order, multithreaded cores, e.g., Sun UltraSPARC T1 (Niagara)

To address stalls, chip designers follow two distinct schools of thought, out-of-order execution and multithreading, which have found their way into CMPs as well. We therefore divide CMPs into two camps: the fat camp, built from wide-issue OOO cores (e.g., Power5), and the lean camp, built from simple in-order multithreaded cores (e.g., Niagara). The naming reflects the relative sizes of the individual cores. The two camps exhibit different behavior on the same workloads.

FC exploits parallelism within a thread; LC exploits it across threads

How the Camps Address Stalls

[Figure: instruction-slot timelines for an LC core (threads 1-3) and an FC core under saturated and unsaturated load. Time flows left to right; each box is an instruction slot (one processor cycle), with misses, data-stall cycles, and useful instructions distinguished by color.]

When running an unsaturated workload, e.g., a single thread, an LC core utilizes only one hardware context and executes the program sequentially, stalling the processor on every miss. By contrast, an FC core can exploit the available ILP to overlap stalls with computation or execute more instructions in a single cycle, leading to faster execution. When threads abound, however, the LC core exploits TLP (which database workloads have in abundance) to overlap stalls with other stalls or with computation, while the FC core is constrained by the limited ILP available in database workloads. Unsaturated workloads suffer primarily from lack of parallelism, as both entire cores in the CMP and hardware contexts within each core sit idle; decomposing a single request into multiple sub-tasks may therefore improve performance significantly. Saturated workloads, on the other hand, suffer primarily from exposed data stalls, which can be alleviated by enhancing the locality of the first-level cache.

- LC: stalls can dominate under unsaturated load
- FC: stalls are exposed in all cases
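The contrast between the two camps can be sketched with a back-of-the-envelope cycle model. The stall cost, thread counts, and ILP-overlap fraction below are invented for illustration; they are not simulator results.

```python
# Minimal sketch (assumed parameters, not simulator output) of how each
# camp hides a data stall. Every thread executes WORK instructions, each
# followed by a miss costing STALL cycles.

STALL = 20   # assumed stall cost per miss (cycles)
WORK = 100   # instructions per thread

def lean_core_cycles(threads):
    """In-order multithreaded (LC) core: a stall is hidden whenever
    another ready thread exists, so stalls overlap across threads."""
    if threads == 1:
        return WORK * (1 + STALL)       # every stall fully exposed
    # With enough threads the core almost always finds ready work;
    # only a final drain stall remains (crude approximation).
    return threads * WORK + STALL

def fat_core_cycles(ilp_overlap=0.5):
    """Wide-issue OOO (FC) core, one thread: ILP hides only a fraction
    of each stall, since database workloads expose limited ILP."""
    return WORK * (1 + STALL * (1 - ilp_overlap))

print("LC, 1 thread :", lean_core_cycles(1))   # unsaturated: stalls dominate
print("LC, 4 threads:", lean_core_cycles(4))   # saturated: stalls overlapped
print("FC, 1 thread :", fat_core_cycles())     # stalls only partly hidden
```

Even this crude model reproduces the slide's conclusion: the FC core wins on a single thread, while the LC core delivers far better per-thread cost once threads abound.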

Prevalence of Data Stalls

We investigate the different behavior of the two camps by characterizing their execution time on OLTP and DSS workloads. The x-axis shows the FC and LC CMPs for each workload configuration; the y-axis shows the percentage of execution time. We run both unsaturated workloads, essentially measuring single-thread performance, and saturated workloads, where there is an abundance of software threads for the processors to execute. We observe that data stalls dominate execution, accounting for at least 64% of execution time on every combination of CMP design and workload except LC/saturated. While LC outperforms FC on saturated workloads, the exact opposite holds for unsaturated workloads. The distinct execution behavior of each configuration allows us to devise a list of requirements for modern DBMS to attain maximum performance.

DBMS need parallelism and L1D locality

Impact

- L2 caches are growing bigger and slower
- Hardware parallelism scales exponentially
- Bottlenecks shift; data stalls are exposed

DBMS must provide both fine-grain parallelism, across and within queries, and L1 locality.

http://www.cs.cmu.edu/~stageddb/ [...]

We believe that staged DBMS are uniquely positioned to address the constantly shifting performance bottlenecks. A staged DBMS decomposes a request into multiple sub-tasks that can execute in parallel in a pipelined fashion, naturally providing more parallelism. Because staged DBMS are constructed in a modular way and the modules are exposed to the execution system, the runtime environment can make intelligent resource-mapping and scheduling decisions to enhance locality, thereby improving performance.