OLTP on Hardware Islands Danica Porobic, Ippokratis Pandis*, Miguel Branco, Pınar Tözün, Anastasia Ailamaki Data-Intensive Application and Systems Lab,

Slides:



Advertisements
Similar presentations
© 2010 IBM Corporation What computer architects need to know about memory throttling WEED 2010 June 20, 2010 IBM Research – Austin Heather Hanson Karthick.
Advertisements

Dynamic Power Redistribution in Failure-Prone CMPs Paula Petrica, Jonathan A. Winter * and David H. Albonesi Cornell University *Google, Inc.
Chapter 3 Embedded Computing in the Emerging Smart Grid Arindam Mukherjee, ValentinaCecchi, Rohith Tenneti, and Aravind Kailas Electrical and Computer.
1 NUMA-aware algorithms: the case of data shuffling Yinan Li* Ippokratis Pandis Rene Mueller Vijayshankar Raman Guy Lohman *University of Wisconsin - MadisonIBM.
SE-292 High Performance Computing
From A to E: Analyzing TPCs OLTP Benchmarks Pınar Tözün Ippokratis Pandis* Cansu Kaynak Djordje Jevdjic Anastasia Ailamaki École Polytechnique Fédérale.
Virtual Hierarchies to Support Server Consolidation Michael Marty and Mark Hill University of Wisconsin - Madison.
Explicit HW and SW Hierarchies High-Level Abstractions for giving the system what it wants Mattan Erez The University of Texas at Austin Salishan 2011.
Critical Sections: Re-emerging Concerns for DBMS Ryan JohnsonIppokratis Pandis Anastasia Ailamaki Carnegie Mellon University École Polytechnique Féderale.
@ Carnegie Mellon Databases Data-oriented Transaction Execution VLDB 2010 Ippokratis Pandis Ryan Johnson Nikos Hardavellas Anastasia Ailamaki Carnegie.
ITEC 352 Lecture 25 Memory(2). Review RAM –Why it isnt on the CPU –What it is made of –Building blocks to black boxes –How it is accessed –Problems with.
A hierarchy independent approach (ongoing work) Michael Monerau Chris Hankin Courant Institute, NYU Ecole Normale Supérieure de Paris, France Imperial.
© 2010 Ippokratis Pandis Aether: A Scalable Approach to Logging VLDB 2010 Ryan Johnson Ippokratis Pandis Radu Stoica Manos Athanassoulis Anastasia Ailamaki.
RTR: 1 Byte/Kilo-Instruction Race Recording Min Xu Rastislav BodikMark D. Hill.
KAIST Computer Architecture Lab. The Effect of Multi-core on HPC Applications in Virtualized Systems Jaeung Han¹, Jeongseob Ahn¹, Changdae Kim¹, Youngjin.
Improving OLTP scalability using speculative lock inheritance Ryan Johnson, Ippokratis Pandis, Anastasia Ailamaki.
SE-292 High Performance Computing
Performance, Area and Bandwidth Implications on Large-Scale CMP Cache Design Li Zhao, Ravi Iyer, Srihari Makineni, Jaideep Moses, Ramesh Illikkal, Don.
Cache Design and Tricks Presenters: Kevin Leung Josh Gilkerson Albert Kalim Shaz Husain.
1 CIS 461 Compiler Design and Construction Fall 2014 Instructor: Hugh McGuire slides derived from Tevfik Bultan, Keith Cooper, and Linda Torczon Lecture-Module.
Parallel Processing with OpenMP
Dynamic Thread Assignment on Heterogeneous Multiprocessor Architectures Pree Thiengburanathum Advanced computer architecture Oct 24,
Multiprocessors— Large vs. Small Scale Multiprocessors— Large vs. Small Scale.
To Include or Not to Include? Natalie Enright Dana Vantrease.
Nikos Hardavellas, Northwestern University
2013/06/10 Yun-Chung Yang Kandemir, M., Yemliha, T. ; Kultursay, E. Pennsylvania State Univ., University Park, PA, USA Design Automation Conference (DAC),
1 Database Servers on Chip Multiprocessors: Limitations and Opportunities Nikos Hardavellas With Ippokratis Pandis, Ryan Johnson, Naju Mancheril, Anastassia.
PERFORMANCE ANALYSIS OF MULTIPLE THREADS/CORES USING THE ULTRASPARC T1 (NIAGARA) Unique Chips and Systems (UCAS-4) Dimitris Kaseridis & Lizy K. John The.
4/17/20151 Improving Memory Bank-Level Parallelism in the Presence of Prefetching Chang Joo Lee Veynu Narasiman Onur Mutlu* Yale N. Patt Electrical and.
Pipelining Hwanmo Sung CS147 Presentation Professor Sin-Min Lee.
Erhan Erdinç Pehlivan Computer Architecture Support for Database Applications.
Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.
S TRex Boosting Instruction Cache Reuse in OLTP Workloads Through Stratified Transaction Execution Islam Atta Pınar Tözün* Xin Tong Islam Atta Pınar Tözün*
Review: Multiprocessor Systems (MIMD)
Virtual Memory Virtual Memory Management in Mach Labels and Event Processes in Asbestos Ingar Arntzen.
1 Multi - Core fast Communication for SoPC Multi - Core fast Communication for SoPC Technion – Israel Institute of Technology Department of Electrical.
CPE 731 Advanced Computer Architecture Multiprocessor Introduction
Interactions Between Compression and Prefetching in Chip Multiprocessors Alaa R. Alameldeen* David A. Wood Intel CorporationUniversity of Wisconsin-Madison.
Authors: Tong Li, Dan Baumberger, David A. Koufaty, and Scott Hahn [Systems Technology Lab, Intel Corporation] Source: 2007 ACM/IEEE conference on Supercomputing.
Modularizing B+-trees: Three-Level B+-trees Work Fine Shigero Sasaki* and Takuya Araki NEC Corporation * currently with 1st Nexpire Inc.
Behavior of Synchronization Methods in Commonly Used Languages and Systems Yiannis Nikolakopoulos Joint work with: D. Cederman, B.
Task Scheduling for Highly Concurrent Analytical and Transactional Main-Memory Workloads Iraklis Psaroudakis (EPFL), Tobias Scheuer (SAP AG), Norman May.
Atlanta, Georgia TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS Handong Ye, Robert Pavel, Aaron Landwehr, Guang.
Pınar Tözün Anastasia Ailamaki SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads Islam Atta Andreas Moshovos.
StimulusCache: Boosting Performance of Chip Multiprocessors with Excess Cache Hyunjin Lee Sangyeun Cho Bruce R. Childers Dept. of Computer Science University.
Parallel and Distributed Systems Instructor: Xin Yuan Department of Computer Science Florida State University.
Architectural Characterization of an IBM RS6000 S80 Server Running TPC-W Workloads Lei Yang & Shiliang Hu Computer Sciences Department, University of.
Predicting Coherence Communication by Tracking Synchronization Points at Run Time Socrates Demetriades and Sangyeun Cho 45 th International Symposium in.
Srihari Makineni & Ravi Iyer Communications Technology Lab
CASH: REVISITING HARDWARE SHARING IN SINGLE-CHIP PARALLEL PROCESSOR
ICOM 6115: Computer Systems Performance Measurement and Evaluation August 11, 2006.
Distributed Computing CSC 345 – Operating Systems By - Fure Unukpo 1 Saturday, April 26, 2014.
1 Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly Koushik Chakraborty Philip Wells Gurindar Sohi
Storage Manager Scalability on CMPs Ippokratis Pandis CIDR Gong Show.
Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Hikmet Aras
Transactional Coherence and Consistency Presenters: Muhammad Mohsin Butt. (g ) Coe-502 paper presentation 2.
Computer Organization CS224 Fall 2012 Lesson 52. Introduction  Goal: connecting multiple computers to get higher performance l Multiprocessors l Scalability,
Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers Jack Sampson*, Rubén González†, Jean-Francois Collard¤, Norman P.
Conditional Memory Ordering Christoph von Praun, Harold W.Cain, Jong-Deok Choi, Kyung Dong Ryu Presented by: Renwei Yu Published in Proceedings of the.
1 Architectural Blueprints—The “4+1” View Model of Software Architecture (
SMP Basics KeyStone Training Multicore Applications Literature Number: SPRPxxx 1.
Introduction Goal: connecting multiple computers to get higher performance – Multiprocessors – Scalability, availability, power efficiency Job-level (process-level)
Xbox 360 Architecture Presenter: Ataç Deniz Oral Date: 30/11/06.
Computer Sciences Department University of Wisconsin-Madison
Reducing OLTP Instruction Misses with Thread Migration
Framework For Exploring Interconnect Level Cache Coherency
Introduction to parallel programming
Many-core Software Development Platforms
CLUSTER COMPUTING.
Database Servers on Chip Multiprocessors: Limitations and Opportunities Nikos Hardavellas With Ippokratis Pandis, Ryan Johnson, Naju Mancheril, Anastassia.
Presentation transcript:

OLTP on Hardware Islands Danica Porobic, Ippokratis Pandis*, Miguel Branco, Pınar Tözün, Anastasia Ailamaki Data-Intensive Application and Systems Lab, EPFL *IBM Research - Almaden

Hardware topologies have changed Topology Core to core latency Core Island CMP SMP of CMP low variable SMP high (~10ns) (10-100ns) (~100ns) 2 Variable latencies affect performance & predictability Should we favor shared-nothing software configurations?

The Dreaded Distributed Transactions 3 % Multisite Transactions Throughput Shared-Nothing Shared-Everything Shared-nothing OLTP configurations need to execute distributed transactions Extra lock holding time Extra sys calls for comm. Extra logging Extra 2PC logic/instructions ++ At least SE provide robust perf. Some opportunities for constructive interference (e.g. Aether logging) Shared-nothing configurations not the panacea!

Deploying OLTP on Hardware Islands 4 HW + Workload -> Optimal OLTP configuration Shared-everything Single address space Easy to program х Random read-write accesses х Many synchronization points х Sensitive to latency Shared-nothing Thread-local accesses х Distributed transactions х Sensitive to skew

Outline Introduction Hardware Islands OLTP on Hardware Islands Conclusions and future work 5

6 Multisocket multicores Communication latency varies significantly Core L1 L2 L1 L2 L1 L2 L3 Memory controller Core L1 L2 Core L1 L2 L1 L2 L1 L2 L3 Core L1 L2 L1 L3 Inter-socket links Memory controller Inter-socket links 50 cycles 500 cycles <10 cycles

OS Placement of application threads Counter microbenchmark TPC-C Payment 8socket x 10cores 4socket x 6cores 39% 7 Unpredictable 40% 47% ? ? ? ? ? ? ? ? ? ? ? ? SpreadIsland SpreadIsland Islands-aware placement matters

8 Impact of sharing data among threads Counter microbenchmarkTPC-C Payment – local-only 18.7x 516.8x 4.5x Fewer sharers lead to higher performance 8socket x 10cores4socket x 6cores Log scale (vs 8x) (vs 80x)

Outline Introduction Hardware Islands OLTP on Hardware Islands –Experimental setup –Read-only workloads –Update workloads –Impact of skew Conclusions and future work 9

Experimental setup Shore-MT –Top-of-the-line open source storage manager –Enabled shared-nothing capability Multisocket servers –4-socket, 6-core Intel Xeon E7530, 64GB RAM –8-socket, 10-core Intel Xeon E7-L8867, 192GB RAM Disabled hyper-threading OS: Red Hat Enterprise Linux 6.2, kernel Profiler: Intel VTune Amplifier XE 2011 Workloads: TPC-C, microbenchmarks 10

Microbenchmark workload Single-site version –Probe/update N rows from the local partition Multi-site version –Probe/update 1 row from the local partition –Probe/update N-1 rows uniformly from any partition –Partitions may reside on the same instance Input size: rows/core 11

Software System Configurations 1 Island Shared-everything 24 Islands Shared-nothing 4 Islands 12

13 Increasing % of multisite xcts: reads Contention for shared data No locks or latches Messaging overhead Fewer msgs per xct Finer grained configurations are more sensitive to distributed transactions

Where are the bottlenecks? Read case 14 4 Islands 10 rows Communication overhead dominates

15 Increasing size of multisite xct: read case Physical contention More instances per transaction All instances involved in a transaction Communication costs rise until all instances are involved in every transaction

16 Increasing % of multisite xcts: updates No latches 2 round of messages +Extra logging +Lock held longer Distributed update transactions are more expensive

Where are the bottlenecks? Update case 17 4 Islands 10 rows Communication overhead dominates

18 Increasing size of multisite xct: update case Efficient logging with Aether* More instances per transaction Increased contention *R. Johnson, et al: Aether: a scalable approach to logging, VLDB 2010 Shared everything exposes constructive interference

19 Effects of skewed input 4 Islands effectively balance skew and contention Few instances are highly loaded Still few hot instances Contention for hot data Larger instances can balance load

OLTP systems on Hardware Islands Shared-everything: stable, but non-optimal Shared-nothing: fast, but sensitive to workload OLTP Islands: a robust, middle-ground –Runs on close cores –Small instances limits contention between threads –Few instances simplify partitioning Future work: –Automatically choose and setup optimal configuration –Dynamically adjust to workload changes Thank you! 20