To Share or Not to Share? Ryan Johnson, Nikos Hardavellas, Ippokratis Pandis, Naju Mancheril, Stavros Harizopoulos**, Kivanc Sabirli, Anastasia Ailamaki, Babak Falsafi

Presentation transcript:

To Share or Not to Share?
Ryan Johnson, Nikos Hardavellas, Ippokratis Pandis, Naju Mancheril, Stavros Harizopoulos**, Kivanc Sabirli, Anastasia Ailamaki, Babak Falsafi
PARALLEL DATA LABORATORY, Carnegie Mellon University
**HP LABS

Ryan Johnson © April 15http://
Motivation For Work Sharing
  Query: What is the highest undergraduate GPA?
  Query: What is the average GPA in the ECE dept.?
  [Figure: two query plans over the Student and Dept tables, built from scan, join, aggregate, and output operators]

Motivation For Work Sharing
  Many queries in system
    – Similar requests
    – Redundant work
  Work Sharing
    – Detect redundant work
    – Compute results once and share
    – Big win for I/O, uniprocessors
  2x speedup for TPC-H queries [hariz05]
  [Figure: query plans over the Student and Dept tables, built from scan, join, aggregate, and output operators]

Work Sharing on Modern Hardware
  [Figure: speedup due to work sharing (WS) vs. number of shared queries, for 1 CPU and 8 CPU, with a "7x" annotation; block diagrams of cores with L1 and L2 caches and memory]
  Work sharing can hurt performance!

Contributions
  Observation
    – Work sharing can hurt performance on parallel hardware
  Analysis
    – Develop intuitive analytical model of work sharing
    – Identify trade-off between total work, critical path
  Application
    – Model-based policy outperforms static ones by up to 6x

Outline
  Introduction
  Part I: Intuition and Model
  Part II: Analysis and Experiments

Challenges of Exploiting Work Sharing
  Independent execution only?
    – Load reduction from work sharing can be useful
  Work sharing only?
    – Indiscriminate application can hurt performance
  To share or not to share?
    – System and workload dependent
    – Adapt decisions at runtime
  Must understand work sharing to exploit it fully

Work Sharing vs. Parallelism
  Independent Execution
  [Figure: timelines of Query 1 and Query 2 through the Scan, Join, and Aggregate stages, marking each query's response time and the critical paths; P = 4.33]

Work Sharing vs. Parallelism
  Shared Execution
  [Figure: timelines of Query 1 and Query 2 through the Scan, Join, and Aggregate stages, marking each query's response time and a "Penalty" interval; labels P = 2.75 and P = 4.33]
  Critical path now longer
  Total work and critical path both important

Understanding Work Sharing
  Performance depends on two factors: total work and critical path
  Work sharing presents a trade-off
    – Reduces total work
    – Potentially lengthens critical path
  Balance both factors or performance suffers

Basis for a Model
  "Closed" system
    – Consistent high load
    – Throughput computing
    – Assumed in most benchmarks
  Fixed number of clients
  Little's Law governs throughput
    – Higher response time = lower throughput
    – Total work not a direct factor!
  Load reduction secondary to response time
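
Little's Law ties the quantities on this slide together. As a minimal sketch (the function name, the optional think-time parameter, and the example numbers are illustrative assumptions, not taken from the talk), throughput in a closed system is the fixed client count divided by the time each client spends per request:

```python
def closed_system_throughput(num_clients: int,
                             response_time_s: float,
                             think_time_s: float = 0.0) -> float:
    """Little's Law for a closed system: X = N / (R + Z).

    num_clients     -- fixed number of clients N (each has one query in flight)
    response_time_s -- average response time R per query, in seconds
    think_time_s    -- client think time Z between queries (assumed 0 here)
    """
    cycle_time = response_time_s + think_time_s
    if num_clients < 1 or cycle_time <= 0:
        raise ValueError("need at least one client and a positive cycle time")
    return num_clients / cycle_time


if __name__ == "__main__":
    # Hypothetical numbers: 20 clients, 4 s per query without sharing.
    print(closed_system_throughput(20, 4.0))
    # If sharing stretched the response time to 5 s, throughput would drop,
    # no matter how much total work was saved.
    print(closed_system_throughput(20, 5.0))
```

Total work does not appear in the formula at all, which is the slide's point: load reduction only helps to the extent that it shortens response time.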

Predicting Response Time
  Case 1: Compute-bound
  Case 2: Critical path-bound
  Larger bottleneck determines response time
  Model provides u and p_max

An Analytical Model of Work Sharing
  Throughput for m queries and n processors
  [Equation not preserved in this transcript; its annotations follow]
    – u = requested utilization (improved by work sharing)
    – p_max = longest pipe stage (potentially worsened by work sharing)
  Sharing helpful when X_shared > X_alone
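
The slide's equation is not preserved in this transcript. One plausible reading of the two bottleneck cases is that a group of queries makes progress at the rate of whichever limit is tighter: the longest pipeline stage or the available processors. The sketch below encodes that reading only; the function names, the exact `min()` form, and all numbers are assumptions for illustration, not the authors' formula.

```python
def group_throughput(m: int, n: int, u: float, p_max: float) -> float:
    """Assumed throughput of m queries on n processors.

    u     -- processor time the whole group requests per unit of progress
    p_max -- time the longest pipeline stage needs per unit of progress
    """
    if m < 1 or n < 1 or u <= 0 or p_max <= 0:
        raise ValueError("m, n must be >= 1 and u, p_max must be positive")
    critical_path_rate = 1.0 / p_max   # Case 2: critical path-bound
    compute_rate = n / u               # Case 1: compute-bound
    return m * min(critical_path_rate, compute_rate)


def should_share(m: int, n: int,
                 alone: tuple[float, float],
                 shared: tuple[float, float]) -> bool:
    """Share only when the predicted X_shared exceeds X_alone.

    alone and shared are (u, p_max) pairs for the two execution strategies.
    """
    x_alone = group_throughput(m, n, *alone)
    x_shared = group_throughput(m, n, *shared)
    return x_shared > x_alone


if __name__ == "__main__":
    # Hypothetical parameters for 8 queries: sharing halves the requested
    # utilization (8.0 -> 4.0) but doubles the longest stage (0.5 -> 1.0).
    alone, shared = (8.0, 0.5), (4.0, 1.0)
    for cpus in (1, 32):
        print(cpus, "CPU:", "share" if should_share(8, cpus, alone, shared)
              else "do not share")
```

With these made-up parameters the single-processor case is compute-bound and favors sharing, while the 32-processor case is critical path-bound and does not.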

Outline
  Introduction
  Part I: Intuition and Model
  Part II: Analysis and Experiments

Experimental Setup
  Hardware
    – Sun T2000 "Niagara" with 16 GB RAM
    – 8 cores (32 threads)
    – Solaris processor sets vary effective CPU count
  Cordoba
    – Staged DBMS
    – Naturally exposes work sharing
    – Flexible work sharing policies
  1 GB TPC-H dataset
  Fixed client and CPU counts per run

Model Validation: TPC-H Q1
  Predicted vs. Measured Performance
  [Figure: speedup due to WS vs. number of shared queries; measured and model curves for 1, 2, 8, and 32 CPUs]
  Avg/max error: 5.7% / 22%

Model Validation: TPC-H Q4
  Behavior varies with both system and workload

Exploring WS vs. Parallelism
  Work sharing splits query into three parts
    Independent work
      – Per-query, parallel
      – Total work
    Serial work
      – Per-query, serial
      – Critical path
    Shared work
      – Computed once
      – "Free" after first query
  Example: Serial - 4%, Independent - 37%, Shared - 59%
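
To see how the example split plays out, here is a toy calculation using the 4% / 37% / 59% breakdown from the slide. The cost model around those percentages is an assumption made for illustration: each query costs one unit of work, the shared and serial parts run in a single pipeline stage that forms the critical path, and the elapsed time is the larger of the compute bound and the critical path bound.

```python
SERIAL, INDEPENDENT, SHARED = 0.04, 0.37, 0.59  # fractions from the slide


def potential_speedup(m: int, n: int) -> float:
    """Toy estimate of speedup from sharing m identical queries on n CPUs."""
    if m < 1 or n < 1:
        raise ValueError("m and n must be >= 1")

    # Unshared: every query repeats all three parts, queries run in parallel.
    alone_work = m * (SERIAL + INDEPENDENT + SHARED)
    alone_path = SHARED + SERIAL                 # one query's longest stage
    alone_time = max(alone_work / n, alone_path)

    # Shared: the shared part runs once, but the per-query serial work
    # piles up behind it on the same stage.
    shared_work = SHARED + m * (SERIAL + INDEPENDENT)
    shared_path = SHARED + m * SERIAL
    shared_time = max(shared_work / n, shared_path)

    return alone_time / shared_time


if __name__ == "__main__":
    for cpus in (1, 2, 8, 32):
        row = [round(potential_speedup(m, cpus), 2) for m in (1, 4, 16, 32)]
        print(f"{cpus:>2} CPU:", row)
```

Under these assumptions a value above 1 means sharing helps and a value below 1 means it hurts; with one processor the saved work dominates, while with 32 processors the accumulated serial work lengthens the critical path enough to make sharing a loss.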

Exploring WS vs. Parallelism
  [Figure: potential speedup (benefit from work sharing) vs. number of shared queries, one curve per CPU count, with a region marked "Saturated"]
  Behavior matches previously published results
  More processors shift bottleneck to critical path

Performance Impact of Serial Work
  [Figure: Impact of Serial Work (32 CPU): benefit from work sharing vs. number of shared queries, with curves labeled 0%, 1%, 2%, and 7%]
  Critical path quickly becomes major bottleneck

Model-guided Work Sharing
  Integrate predictive model into Cordoba
    – Predict benefit of work sharing for each new query
    – Consider multiple groups of queries at once
    – Shorter critical path, increased parallelism
  Experimental setup
    – Profile run with 2 clients, 2 CPUs
    – Extract model parameters with profiling tools
    – 20 clients submit mix of TPC-H Q1 and Q4
    – Compare against always-, never-share policies
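
The idea of considering several sharing groups at once can be sketched as a search over group counts. This is not Cordoba's implementation: the function names, the cost parameters (reusing the 59% / 37% / 4% split as defaults), and the throughput formula are all assumptions chosen to illustrate the trade-off. One group corresponds to "always share" and one group per query corresponds to "never share".

```python
import math


def predicted_throughput(m: int, n: int, groups: int,
                         shared: float, independent: float,
                         serial: float) -> float:
    """Assumed throughput when m queries are split into `groups` sharing groups."""
    largest_group = math.ceil(m / groups)
    # Each group pays for the shared work once; every query pays the rest.
    u = groups * shared + m * (independent + serial)
    # The longest stage serves the shared work plus one serial slice per member.
    p_max = shared + largest_group * serial
    return m * min(1.0 / p_max, n / u)


def best_group_count(m: int, n: int, shared: float = 0.59,
                     independent: float = 0.37, serial: float = 0.04) -> int:
    """Pick the number of sharing groups with the highest predicted throughput."""
    if m < 1 or n < 1:
        raise ValueError("m and n must be >= 1")
    return max(range(1, m + 1),
               key=lambda g: predicted_throughput(m, n, g, shared,
                                                  independent, serial))


if __name__ == "__main__":
    for cpus in (2, 8, 32):
        print(f"{cpus:>2} CPU: best number of groups for 20 queries =",
              best_group_count(20, cpus))
```

In this toy setting few processors favor a single large group, many processors favor no sharing, and intermediate machines land in between, which is the behavior a static always- or never-share policy cannot follow.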

Comparison of Work Sharing Strategies
  [Figure: throughput (queries/min) vs. query ratio (All Q1, 50/50, All Q4) for the always share, model guided, and never share policies; one panel for 2 CPU and one for 32 CPU]
  Model-based policy balances critical path and load

Related Work
  Many existing work sharing schemes
  Identification occurs at different stages in the query's lifetime (early to late)
    – Schema design: Materialized Views [rouss82]
    – Query compilation: Multiple Query Optimization [roy00]
    – Query execution: Staged DBMS [hariz05]
    – Buffer Pool Access: Synchronized Scanning [lang07]
  All allow pipelined query execution
  Model describes all types of work sharing

Conclusions
  Work sharing can hurt performance
    – Highly parallel, memory resident machines
  Intuitive analytical model captures behavior
    – Trade-off between load reduction and critical path
  Model-guided work sharing highly effective
    – Outperforms static policies by up to 6x

References
  [hariz05] S. Harizopoulos, V. Shkapenyuk, and A. Ailamaki. “QPipe: A Simultaneously Pipelined Relational Query Engine.” In Proc. SIGMOD,
  [lang07] C. Lang, B. Bhattacharjee, T. Malkemus, S. Padmanabhan, and K. Wong. “Increasing Buffer-Locality for Multiple Relational Table Scans through Grouping and Throttling.” In Proc. ICDE,
  [rouss82] N. Roussopoulos. “View Indexing in Relational Databases.” In ACM TODS, 7(2): ,
  [roy00] P. Roy, S. Seshadri, S. Sudarshan, and S. Bhobe. “Efficient and Extensible Algorithms for Multi Query Optimization.” In Proc. SIGMOD,