A first step towards GPU-assisted Query Optimization

Presentation transcript:

A first step towards GPU-assisted Query Optimization. Max Heimel, Volker Markl. Presented by Kostas Tzoumas. 08.11.2018

Agenda: Introduction & Motivation. A Proof of Concept. Evaluation. Conclusion & Future Work.

The advent of GPGPU in databases. Modern GPUs are fully programmable, massively parallel co-processors, and frameworks like OpenCL and CUDA make it possible to run arbitrary computations on them. Their raw compute power outperforms modern CPUs by a huge margin (roughly 10x): for comparison, the latest Sandy Bridge Intel CPUs reach ~100 GFLOPS, while the latest Nvidia GPUs reach ~1 TFLOPS. Graphics cards also couple this compute power with high-bandwidth memory, making GPUs a very interesting platform for accelerating databases. This has been an active research topic for several years, with the main focus on running database operators on the GPU: efficient GPU-accelerated versions of almost all major operators have been presented, including selection, join, aggregation, grouping, sorting, and transactions, and they often demonstrate significant speedups. However, there are some limitations when running database code on a graphics card.

Background: Graphics Card Architecture. A quick excursion on graphics card architecture to understand the next slide: a graphics card consists of the GPU plus its device memory, and the GPU can only operate on data located in that device memory, so all data has to pass through it. One might object that Nvidia offers unified host addressing (i.e., the GPU can access main memory), but this is only a convenience feature: behind the curtains, the data is still transferred to the device memory on the graphics card. All payload therefore has to pass through the PCI Express bus, which is comparably slow; this transfer is the bottleneck and should be avoided where possible. Because the transfer to the graphics card is slower than access to main memory, I/O-bound problems are always faster on the CPU if the data is not already on the graphics card: by the time the data has arrived in device memory, the CPU has already finished the computation. Consequences: minimize data transfer between host and graphics card, and remember that I/O-bound problems can only benefit if the data is already on the GPU.

Problems with GPU-accelerated operators. They are restricted to small data sets: the operators are typically I/O-bound and device memory is rather small (2-8 GB), so a hot set of GPU-resident data has to be selected and only that subset can be accelerated. A further problem is that unpredictably large result sizes might overflow the available memory. Query optimization also becomes more difficult: the plan space is inherently larger, and complicated cross-device cost functions are required. Finally, GPU-accelerated operators only benefit in-memory database systems: if the data is kept on disk, most time is spent on disk I/O, and the GPU can only accelerate the work that happens once the data is in main memory, which is typically negligible. Further research is required to make GPUs a useful platform for production database systems!

GPU-assisted Query Optimization. Suggestion: investigate using the GPU in query optimization itself. Query optimization is typically compute-intensive and compute-bound, produces negligible and predictable result sizes, and operates on comparably small input data. What do we gain from this? Faster optimization (not really an issue). More importantly, the additional compute power can be spent on "more expensive" operations: better cardinality estimates, reduced search-space pruning, and more complex optimization algorithms. Main goal: improved plan quality! This is an indirect acceleration of the database, without actually running operator code on the GPU, which would make it possible to accelerate even disk-based systems with a graphics card. The approach can complement existing work on GPU-accelerated operators.

Agenda: Introduction & Motivation. A Proof of Concept. Evaluation. Conclusion & Future Work.

GPU-Assisted Selectivity Estimation. [Architecture sketch: the SQL Parser turns a query into an AST, the Query Optimizer turns the AST into a plan, and the Execution Engine produces the result; a Statistics Collection component derives statistics from the relational data and stores them in the System Catalog for the optimizer to consult.] The query optimizer needs accurate estimates of intermediate result sizes. Selectivity estimation answers the question of how many tuples satisfy a given predicate; bad estimates often lead to the optimizer making wrong decisions. Idea: improve estimation quality by using the GPU. Keep a statistical model of the data in device memory, then run a more expensive, but more accurate, estimation algorithm. As a proof of concept, we introduce the idea of running selectivity estimation on the graphics card.

Background: Kernel Density Estimation. Kernel Density Estimation (KDE) is a non-parametric method for density estimation. General idea: use the observed points (a sample) to construct the estimator. Each observed point distributes "probability mass" to its neighborhood; the shape of this distribution is controlled by local density functions (kernels), and the estimate at a point is the total "probability mass" it receives from all observed points. Given a sample $x^{(1)}, \dots, x^{(n)}$, the estimator is $\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_H\!\left(x - x^{(i)}\right)$, where the kernel $K$ defines the shape of the local probability density and the bandwidth matrix $H$ controls the spread and orientation of the local density. A minimal sketch of how such an estimate is computed for a range predicate follows below.
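To make the estimator concrete, here is a minimal sequential C sketch of KDE-based selectivity estimation for a range predicate, assuming Gaussian kernels and a diagonal bandwidth matrix (a common choice; every identifier here is illustrative and not taken from the actual implementation):

```c
/*
 * Minimal sequential sketch of KDE-based selectivity estimation for a
 * range predicate [lo, hi], assuming Gaussian kernels and a diagonal
 * bandwidth matrix. All names are illustrative, not taken from the
 * actual PostgreSQL patch. Compile with: cc kde.c -lm
 */
#include <math.h>
#include <stdio.h>

#define DIMS 2

/* Probability mass that one Gaussian kernel, centered at a sample
 * point, places inside the query box; per dimension this is a
 * difference of Gaussian CDFs, expressed via erf(). */
static double kernel_contribution(const double point[DIMS],
                                  const double lo[DIMS],
                                  const double hi[DIMS],
                                  const double bandwidth[DIMS])
{
    double mass = 1.0;
    for (int d = 0; d < DIMS; d++) {
        double scale = bandwidth[d] * sqrt(2.0);
        mass *= 0.5 * (erf((hi[d] - point[d]) / scale) -
                       erf((lo[d] - point[d]) / scale));
    }
    return mass;
}

/* Selectivity estimate: the average contribution over all n sample
 * points. On the GPU, each contribution is computed by one thread and
 * the sum becomes a parallel reduction. */
static double kde_selectivity(const double *sample, int n,
                              const double lo[DIMS],
                              const double hi[DIMS],
                              const double bandwidth[DIMS])
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += kernel_contribution(&sample[i * DIMS], lo, hi, bandwidth);
    return sum / n;
}

int main(void)
{
    const double sample[] = { 0.1, 0.2,   0.5, 0.5,   0.9, 0.8 };
    const double lo[DIMS] = { 0.0, 0.0 }, hi[DIMS] = { 0.6, 0.6 };
    const double h[DIMS]  = { 0.1, 0.1 };
    printf("estimated selectivity: %f\n",
           kde_selectivity(sample, 3, lo, hi, h));
    return 0;
}
```

With a Gaussian kernel, the probability mass inside an interval has a closed form via the Gaussian CDF, which is why erf() appears instead of numerical integration.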

Why Kernel Density Estimation? It has nice statistical properties: KDE is proven to converge faster to the true distribution than histograms, and the probability estimate is continuous (smooth). It is simple to construct and maintain: it only requires collecting and maintaining a representative sample. KDE automatically captures correlation patterns within the data, so there is no need for independence assumptions or multidimensional histograms; however, the curse of dimensionality still applies (roughly, more than 10 dimensions become problematic). Finally, it is easy to run efficiently on a graphics card: the computation is trivial to parallelize by computing the local contributions from each sample point in parallel and then summing them up, and the only significant data transfer is copying the model to the device, which happens only once. A sketch of the corresponding device kernel follows below.
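The parallelization scheme described above maps directly onto an OpenCL kernel: one work-item per sample point computes that point's contribution, and a separate reduction step sums the results. The following is a hypothetical sketch of such a kernel, not the actual kernel from the implementation:

```c
/*
 * Hypothetical OpenCL C kernel for the data-parallel step: one
 * work-item per sample point computes that point's probability mass
 * inside the query box. A separate reduction (not shown) then sums
 * the contributions.
 */
__kernel void local_contributions(
    __global const float *sample,   /* n * dims values, resident in device memory */
    __global float *contribution,   /* one output slot per sample point */
    __constant float *lo,           /* lower bounds of the range predicate */
    __constant float *hi,           /* upper bounds of the range predicate */
    __constant float *bandwidth,
    unsigned int dims)
{
    size_t i = get_global_id(0);
    float mass = 1.0f;
    for (unsigned int d = 0; d < dims; d++) {
        float scale = bandwidth[d] * sqrt(2.0f);
        float x = sample[i * dims + d];
        mass *= 0.5f * (erf((hi[d] - x) / scale) -
                        erf((lo[d] - x) / scale));
    }
    contribution[i] = mass;
}
```

Summing contribution[0..n-1] on the device (e.g., with a tree reduction) and dividing by n then yields the selectivity estimate.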

Implementation Overview I. We integrated a GPU-accelerated KDE variant into the query optimizer of PostgreSQL. It supports estimates for range queries on real-valued columns, and its usage is completely transparent to the user: the system automatically generates the model on the GPU during ANALYZE if (a) a GPU is present, (b) GPU mode was activated by the administrator (SET enable_ocl TO true), and (c) the analyzed table is real-valued. The optimizer then transparently uses the GPU model whenever one is available. We use OpenCL to run code on the GPU: it is an open, platform-independent standard that allows running code on multiple hardware platforms (GPUs, multi-core CPUs, FPGAs, ...) and supports dynamic compilation of GPU code at runtime, which is used, for example, to dynamically unroll loops (see the sketch below). There are two main code parts: the estimator itself, and its integration into the Postgres optimizer (setup and glue code).
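Dynamic compilation is what enables the loop unrolling mentioned above: since OpenCL compiles kernel source at runtime, per-table constants such as the column count can be baked in as preprocessor macros before compilation. A sketch under that assumption (the helper name and the DIMS build flag are illustrative, not the actual flags used by the patch):

```c
/*
 * Sketch of OpenCL runtime compilation used to specialize the
 * estimator kernel: the column count is injected as a preprocessor
 * macro, so the kernel can loop over a compile-time constant that
 * the OpenCL compiler is able to unroll.
 */
#include <CL/cl.h>
#include <stdio.h>

cl_program build_estimator(cl_context ctx, cl_device_id dev,
                           const char *source, unsigned int dims)
{
    cl_int err;
    cl_program prog =
        clCreateProgramWithSource(ctx, 1, &source, NULL, &err);
    if (err != CL_SUCCESS)
        return NULL;

    /* Bake the dimensionality into the kernel at build time. */
    char options[64];
    snprintf(options, sizeof(options), "-DDIMS=%u", dims);

    err = clBuildProgram(prog, 1, &dev, options, NULL, NULL);
    return (err == CL_SUCCESS) ? prog : NULL;
}
```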

Implementation Overview II. Model creation ("generate the KDE model for a relation") runs once during the ANALYZE command; its main tasks are selecting the sample size and generating the sample. Model setup ("prepare the KDE model for a relation") runs once at server startup; it sets up all required resources on the GPU and transfers the sample to the graphics card's device memory. The estimator ("base-table selectivity estimation") runs whenever the optimizer needs a selectivity estimate and computes the estimated selectivity for the given query on the GPU. A host-side sketch of this lifecycle follows below.
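The important property of this lifecycle is that the sample crosses the PCI Express bus once, at setup time, while each subsequent estimate only sends the query bounds over and reads a single float back. A hypothetical host-side sketch (all function names are illustrative; the reduction kernel that leaves the summed contributions in result_buf[0] is assumed to exist):

```c
/*
 * Hypothetical host-side sketch of the model lifecycle: upload the
 * sample once, then answer each optimizer request with a kernel
 * launch plus a single-float read-back.
 */
#include <CL/cl.h>

/* Model setup: runs once at server startup. */
cl_mem upload_sample(cl_context ctx, cl_command_queue queue,
                     const float *sample, size_t bytes)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, &err);
    if (err != CL_SUCCESS)
        return NULL;
    /* The one significant transfer: sample -> device memory. */
    err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, bytes, sample,
                               0, NULL, NULL);
    return (err == CL_SUCCESS) ? buf : NULL;
}

/* Estimator: runs whenever the optimizer asks for a selectivity. */
float estimate(cl_command_queue queue, cl_kernel contrib_kernel,
               cl_mem result_buf, size_t n)
{
    float sum = 0.0f;
    clEnqueueNDRangeKernel(queue, contrib_kernel, 1, NULL, &n, NULL,
                           0, NULL, NULL);
    /* ... enqueue the reduction kernel here (omitted) ... */
    clEnqueueReadBuffer(queue, result_buf, CL_TRUE, 0, sizeof(float),
                        &sum, 0, NULL, NULL);
    return sum / (float)n;
}
```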

Agenda: Introduction & Motivation. A Proof of Concept. Evaluation. Conclusion & Future Work.

Evaluation Disclaimer. Evaluation system: CPU: Intel Xeon E5620 (quad-core @ 2.4 GHz); graphics card: Nvidia GTX 460 (336 compute cores, 4 GB device memory). Main goal: demonstrate that using a GPU can improve estimation quality. We compare runtime between GPU and CPU and demonstrate higher estimation quality. We will not demonstrate that improved estimates lead to better plans; this has been discussed extensively in previous work. Benchmarks against the CPU were made using the Intel OpenCL SDK: the identical codebase was used for both GPU and CPU, and the SDK automatically vectorizes and parallelizes the code to run efficiently on the CPU.

Evaluation: Performance. Is the GPU actually faster than the CPU? Data: random, synthetic data. Experiment: measure estimation time on both CPU and GPU with increasing sample size. The GPU gives a ~10x speed-up for identical sample sizes; put differently, given a fixed time budget, the GPU can use a ten-times-larger sample, and in general larger samples result in better estimation quality. Note that even for extremely large samples (~512 MB), the GPU computes an estimate in less than 25 ms.

Evaluation: Quality Impact of Sample Size. What impact does the increased sample size have on estimation quality? Data: a six-dimensional, real-world dataset from the UCI Machine Learning Repository. Experiment: measure the average relative estimation error with increasing sample size for random, non-empty queries. The estimation error behaves like $O(n^{-4/5})$ with respect to the sample size n, so increasing the sample by a factor of 10 reduces the error roughly by a factor of 7 (see the arithmetic below). [Plot annotations: the estimation error for vanilla Postgres lies roughly at the level marked in the figure; the rightmost point corresponds to the complete dataset being in the sample.]
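The factor of roughly seven follows directly from the stated scaling: with error $\varepsilon(n) \propto n^{-4/5}$,

```latex
\frac{\varepsilon(10n)}{\varepsilon(n)}
  = \frac{(10n)^{-4/5}}{n^{-4/5}}
  = 10^{-4/5} \approx \frac{1}{6.3},
```

i.e. close to a seven-fold error reduction for a ten-fold sample increase.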

Evaluation: Impact of Correlation. How does the estimator compare for correlated data? Data: synthetic, two-dimensional random normal data sets, generated with increasing covariance between the dimensions. Experiment: measure the average relative estimation error for Postgres and KDE on non-empty range queries.

Agenda: Introduction & Motivation. A Proof of Concept. Evaluation. Conclusion & Future Work.

Take-Home Messages. Modern GPUs offer a massive amount of raw computational power, and we want to harness this power in a database setting. However, GPU-accelerated database operators are hard to get right: device memory is limited, result sizes are unpredictable, and the PCI Express transfer is a bottleneck. The GPU will only speed up some cases, so cross-device query optimization is required. Our complementary approach is to utilize the GPU for query optimization itself: the additional compute power allows spending more resources on optimization, which can lead to better plan quality (e.g., due to less search-space pruning). In this way, the GPU can accelerate the database without running any operator code. Proof of concept: run selectivity estimation on the GPU. The additional compute power allows using more elaborate estimation techniques, which yields improved selectivity estimates and, in turn, improved plan quality.

Future Work. Identify further operations in the optimizer where additional compute power could improve quality, e.g. join-order enumeration and top-down query optimization. Continue evaluating GPU-accelerated selectivity estimation: are other estimation algorithms (histograms, wavelets) better suited? Can we extend the model to discrete and categorical data?

Thank you!