Towards A Combined Grouping and Aggregation Algorithm for Fast Query Processing in Columnar Databases with GPUs: A Technology Demonstration
Sina Meraji, sinamera@ca.ibm.com

Presentation transcript:

Towards A Combined Grouping and Aggregation Algorithm for Fast Query Processing in Columnar Databases with GPUs: A Technology Demonstration
Sina Meraji, sinamera@ca.ibm.com

Please Note: IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM's sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.

Outline
- DB2 BLU Acceleration
- Hardware Acceleration
- Nvidia GPU
- Key Analytic Database Operators
- GROUP BY GPU Kernels
- Our Acceleration Design

DB2 with BLU Acceleration: next-generation database
- Super Fast (query performance), Super Simple (load-and-go), Super Small (storage savings)
- Seamlessly integrated: built seamlessly into DB2, with consistent SQL, language interfaces, and administration, for dramatic simplification
- Hardware optimized: memory optimized, CPU optimized, I/O optimized

Customer results: Handelsbanken, a major European bank and one of the world's most profitable and highest security-rated banks. Its risk system injects 1/2 TB per night from 25 different source systems. Going from installing BLU to first query results took only 6 hours ("impressive load times"), and some queries achieved an almost 100x speed-up with literally no tuning.

DB2 with BLU Acceleration: The 7 Big Ideas

Hardware Acceleration
Use specific hardware to execute software functions faster. Popular accelerator technologies:
- SIMD: present in every CPU
- GPUs: easy to program
- FPGAs: hard to program

Nvidia GPU: NVIDIA Tesla K40 (Kepler architecture)
- Peak double-precision performance: 1.66 TFLOPs
- Peak single-precision performance: 5 TFLOPs
- High memory bandwidth: up to 288 GB/sec
- Memory size: 12 GB
- Number of cores: 2880

Key Analytic Database Operators
GROUP BY / Aggregation:
    SELECT column_name, aggregate_function(column_name) FROM table_name
    WHERE column_name operator value GROUP BY column_name;
Join:
    SELECT column_name(s) FROM table1
    JOIN table2 ON table1.column_name = table2.column_name;
Sort:
    SELECT column_name FROM table_name ORDER BY column_name;

Hardware Configuration
- POWER8 S824L: 2 sockets, 12 cores per socket, SMT-8, 512 GB
- Ubuntu LE 14.04.02 LTS
- GPU: 2x NVIDIA Tesla K40

Infrastructure
Adding support for Nvidia GPUs. CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA.
Memory management (pin/unpin memory):
- To run on the GPU, threads ask for pinned memory
- Pinning enables fast transfers to/from the GPU over PCI-E Gen3
- Transfer bandwidth will be improved in 2016 with NVLink on POWER
A minimal sketch of this pinned-transfer pattern is shown below.
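The slides do not show transfer code; the following is a minimal sketch of the standard CUDA pinned-memory pattern they allude to. Buffer names and sizes are illustrative assumptions, not BLU internals.

#include <cuda_runtime.h>

int main() {
    const size_t nBytes = 64 * 1024 * 1024;  // hypothetical 64 MB column chunk

    // Pinned (page-locked) host memory: required for truly asynchronous,
    // full-bandwidth transfers over PCI-E.
    int *hostBuf = NULL, *devBuf = NULL;
    cudaHostAlloc((void**)&hostBuf, nBytes, cudaHostAllocDefault);  // "pin"
    cudaMalloc((void**)&devBuf, nBytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // The async copy can overlap CPU-side pre-processing of the next chunk;
    // it only stays asynchronous because hostBuf is pinned.
    cudaMemcpyAsync(devBuf, hostBuf, nBytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(devBuf);
    cudaFreeHost(hostBuf);                                          // "unpin"
    return 0;
}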

GPU Scheduler
Each CPU thread can submit tasks to the GPU scheduler, along with the task's memory requirement on the GPU. The scheduler then:
- checks all the GPUs on the system,
- reserves the memory on a GPU,
- creates a stream, and
- returns to the CPU thread with a GPU number and stream id.
A rough sketch of this flow appears below.
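The talk does not show scheduler code; this is a hypothetical sketch of the flow just described, assuming the scheduler simply scans devices for enough free memory. The struct and function names are made up for illustration.

#include <cuda_runtime.h>

struct GpuTask { int device; cudaStream_t stream; void *reserved; };

bool scheduleOnGpu(size_t bytesNeeded, GpuTask *out) {
    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);
    for (int d = 0; d < nDevices; ++d) {           // check all GPUs on the system
        cudaSetDevice(d);
        size_t freeB = 0, totalB = 0;
        cudaMemGetInfo(&freeB, &totalB);
        if (freeB < bytesNeeded) continue;         // not enough room on this device
        if (cudaMalloc(&out->reserved, bytesNeeded) != cudaSuccess)
            continue;                              // reserve the memory on the GPU
        cudaStreamCreate(&out->stream);            // create a per-task stream
        out->device = d;                           // GPU number + stream back to caller
        return true;
    }
    return false;                                  // caller can fall back to CPU-only
}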

Our Acceleration Design
- Use parallel POWER8 threads for reading/pre-processing data
- Transfer the data to the GPU
- Have the GPU process the query
- Transfer the result back to the host machine

Hybrid Design: Use Both POWER8 and GPU for Query Processing
Decide where to execute the query dynamically at runtime:
- use the GPU only,
- use the CPU only, or
- use both.

GPU Kernels
We design and develop our own GPU runtime:
- fast custom kernels, e.g. GROUP BY and aggregation;
- Nvidia CUDA calls, e.g. atomic operations;
- Nvidia fast kernels, e.g. sort.

Group By/Aggregation
What does it do?
    SELECT C1, SUM(C2) FROM Simple_Table GROUP BY C1

Simple_Table:
C1  C2
9   98
2   21
9   92
3   38
9   90
2   22

Result:
C1  SUM(C2)
9   280
3   38
2   43

Group by/Aggregate Operation on the GPU
Hash-based algorithm: use the grouping keys and a hash function to insert keys into a hash table.
Aggregation:
- use Nvidia atomic CUDA calls for some data types (INT32, INT64, etc.);
- use locks for other data types (Double, FixedString, etc.).
Three main steps:
1. Initialize the hash table
2. Grouping/aggregation in a global hash table
3. Probing the global hash table to retrieve the groups

Hash Based GroupBy/Aggregate
    SELECT C1, SUM(C2) FROM Simple_Table GROUP BY C1
[Diagram: the rows of Simple_Table are inserted into a hash table of (key, value) entries in parallel ("Parallel HT Creation"); the hash table is then probed in parallel ("Parallel Probe") to produce the result groups, e.g. key 23 with aggregated value 7 and key 93 with aggregated value 1006.]

Assumptions
- Table has 2 integer columns: a 32-bit key and a 32-bit value
- Number of groups = 3% of the number of rows
- Comparing against a single-core CPU implementation
- Hash table is created in global memory
- Kernel 1: read the 32-bit key and then the 32-bit value; atomic operations to insert a value (CAS) and to aggregate (atomicAdd)
- Kernel 2: read the 64-bit key-value pair; 64-bit atomic operations
- Memcpy for data transfers
A simplified sketch of the Kernel 1 scheme is shown below.
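To make the Kernel 1 description concrete, here is a simplified sketch under stated assumptions: 32-bit keys, SUM aggregation, a pre-initialized global hash table (all key slots set to an EMPTY sentinel, values zeroed, per step 1 above), and linear probing. The actual BLU kernel's hash function and collision policy are not disclosed in the talk.

#define EMPTY 0xFFFFFFFFu  // assumes 0xFFFFFFFF never occurs as a real key

__global__ void groupByGlobal(const unsigned *keys, const int *vals, int n,
                              unsigned *htKeys, int *htVals, unsigned htSize) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned key  = keys[i];
    unsigned slot = key % htSize;                 // toy hash function
    for (;;) {
        // Try to claim an empty slot for this key (insert with CAS).
        unsigned prev = atomicCAS(&htKeys[slot], EMPTY, key);
        if (prev == EMPTY || prev == key) {
            atomicAdd(&htVals[slot], vals[i]);    // aggregate (SUM)
            return;
        }
        slot = (slot + 1) % htSize;               // collision: probe onward
    }
}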

Kernel 2 vs. Kernel 1 [performance chart not included in transcript]

Using Shared Memory
Atomic operations are costly. The new kernel (Kernel 3) uses shared memory to create block-local hash tables and global memory to merge the local hash tables, so there are far fewer atomic operations in global memory. It is much faster than operating directly on global memory and works well if the number of groups is small. A code sketch follows the diagram below.

Using Shared Memory (diagram)
[Each SM builds a local hash table of (key, aggregated value) pairs in shared memory; the local tables are merged in parallel into the global hash table, which is then probed in parallel to produce the result.]
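A sketch of the shared-memory scheme (Kernel 3), under the same illustrative assumptions as the previous sketch. Note the probing loop assumes each block's distinct groups fit in the local table, matching the slide's caveat that this works well when the number of groups is small; LOCAL_SIZE is an assumption.

#define LOCAL_SIZE 1024
#define EMPTY 0xFFFFFFFFu

__global__ void groupByShared(const unsigned *keys, const int *vals, int n,
                              unsigned *gKeys, int *gVals, unsigned gSize) {
    __shared__ unsigned sKeys[LOCAL_SIZE];
    __shared__ int      sVals[LOCAL_SIZE];

    // Step 1: initialize the block-local table.
    for (int s = threadIdx.x; s < LOCAL_SIZE; s += blockDim.x) {
        sKeys[s] = EMPTY;
        sVals[s] = 0;
    }
    __syncthreads();

    // Step 2: aggregate into shared memory (cheap on-chip atomics).
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned key = keys[i], slot = key % LOCAL_SIZE;
        for (;;) {
            unsigned prev = atomicCAS(&sKeys[slot], EMPTY, key);
            if (prev == EMPTY || prev == key) { atomicAdd(&sVals[slot], vals[i]); break; }
            slot = (slot + 1) % LOCAL_SIZE;
        }
    }
    __syncthreads();

    // Step 3: merge the local table into the global one -- only one global
    // atomic per (block, group) instead of one per input row.
    for (int s = threadIdx.x; s < LOCAL_SIZE; s += blockDim.x) {
        unsigned key = sKeys[s];
        if (key == EMPTY) continue;
        unsigned slot = key % gSize;
        for (;;) {
            unsigned prev = atomicCAS(&gKeys[slot], EMPTY, key);
            if (prev == EMPTY || prev == key) { atomicAdd(&gVals[slot], sVals[s]); break; }
            slot = (slot + 1) % gSize;
        }
    }
}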

Kernel 3 speedup (shared memory) [performance chart not included in transcript]

Kernel 4
We still have too many atomic operations on the same location if the number of groups is small. Solution: multiple small hash tables within each shared memory instead of one big hash table, merging all the small hash tables in global memory. This also gives better utilization of compute power and shared memory when handling lots of groups. A compact sketch follows the diagram below.

Multi-Level Hashing (diagram)
[Each SM (SM1 ... SM15) holds several small hash tables (Hash Table 1 ... Hash Table N) of (key, aggregated value) pairs in shared memory; all of them are merged into the global hash table, which is probed in parallel to produce the result.]
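A compact sketch of the Kernel 4 idea: several small sub-tables per block, with the sub-table chosen per warp (an illustrative policy), so that atomics on a hot grouping key are spread across N_SUBTABLES distinct locations instead of one. All sizes and names are assumptions.

#define N_SUBTABLES 4
#define SUB_SIZE    256
#define EMPTY 0xFFFFFFFFu

__device__ void insertLocal(unsigned *tKeys, int *tVals, unsigned key, int val) {
    unsigned slot = key % SUB_SIZE;
    for (;;) {
        unsigned prev = atomicCAS(&tKeys[slot], EMPTY, key);
        if (prev == EMPTY || prev == key) { atomicAdd(&tVals[slot], val); return; }
        slot = (slot + 1) % SUB_SIZE;
    }
}

__global__ void groupByMultiTable(const unsigned *keys, const int *vals, int n,
                                  unsigned *gKeys, int *gVals, unsigned gSize) {
    __shared__ unsigned sKeys[N_SUBTABLES][SUB_SIZE];
    __shared__ int      sVals[N_SUBTABLES][SUB_SIZE];

    // Initialize all sub-tables.
    for (int s = threadIdx.x; s < N_SUBTABLES * SUB_SIZE; s += blockDim.x) {
        (&sKeys[0][0])[s] = EMPTY;
        (&sVals[0][0])[s] = 0;
    }
    __syncthreads();

    // Each warp aggregates into its own sub-table, spreading contention.
    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    int sub = (threadIdx.x / warpSize) % N_SUBTABLES;
    if (i < n) insertLocal(sKeys[sub], sVals[sub], keys[i], vals[i]);
    __syncthreads();

    // Merge every sub-table into the global hash table (same CAS/probe
    // scheme as the previous kernel).
    for (int s = threadIdx.x; s < N_SUBTABLES * SUB_SIZE; s += blockDim.x) {
        unsigned key = (&sKeys[0][0])[s];
        if (key == EMPTY) continue;
        unsigned slot = key % gSize;
        for (;;) {
            unsigned prev = atomicCAS(&gKeys[slot], EMPTY, key);
            if (prev == EMPTY || prev == key) {
                atomicAdd(&gVals[slot], (&sVals[0][0])[s]);
                break;
            }
            slot = (slot + 1) % gSize;
        }
    }
}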

Kernel 4 [performance chart not included in transcript]

Aggregation
CUDA atomic operations:
- implemented in hardware (very fast);
- used for both global and shared memory;
- available for specific data types (INT32, INT64, etc.).
For other aggregation functions/data types, e.g. Double, use atomicCAS to build the atomic update (this is the standard recipe from the CUDA programming guide):

__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

Check the Nvidia docs for more details: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#atomic-functions

Aggregation (Continued)
Locks are used for data types larger than 64 bits (Decimal, FixedString). Each thread needs to:
1. acquire the lock associated with the corresponding entry in the hash table,
2. apply the AGG function, and
3. release the lock.
This is a costly operation. A hedged sketch of the pattern is shown below.
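The slides do not show the lock code; below is a sketch of the usual CUDA per-entry spinlock pattern (atomicCAS to acquire, atomicExch to release), using a pair of doubles as a stand-in for a 128-bit DECIMAL accumulator. This is illustrative only, and real code must structure the retry loop carefully, since naive spinlocks can livelock within a warp on pre-Volta GPUs.

__device__ void aggregateWide(int *lockForEntry, double2 *entry, double2 update) {
    bool done = false;
    while (!done) {
        if (atomicCAS(lockForEntry, 0, 1) == 0) {  // acquire the entry's lock
            entry->x += update.x;                  // apply the AGG function to a
            entry->y += update.y;                  // 128-bit value (2 doubles here)
            __threadfence();                       // publish the update before unlock
            atomicExch(lockForEntry, 0);           // release the lock
            done = true;
        }
    }
}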

Supported Data Types & AGG Functions

SQL type          | MAX            | MIN            | SUM                       | COUNT
SINT8             | Cast to SINT32 | Cast to SINT32 | Cast to SINT32            | AtomicCount
SINT16            | Cast to SINT32 | Cast to SINT32 | Cast to SINT32            | AtomicCount
SINT32            | AtomicMax      | AtomicMin      | AtomicAdd                 | AtomicCount
SINT64            | AtomicMax      | AtomicMin      | AtomicAdd                 | AtomicCount
REAL              | Use AtomicCAS  | Use AtomicCAS  | Use AtomicCAS             | AtomicCount
DOUBLE            | Use AtomicCAS  | Use AtomicCAS  | Use AtomicCAS             | AtomicCount
DECIMAL           | Lock           | Lock           | Use AtomicADD (2-3 steps) | AtomicCount
DATE              | Cast to SINT32 | Cast to SINT32 | N/A                       | AtomicCount
TIME              | Cast to SINT32 | Cast to SINT32 | N/A                       | AtomicCount
TIMESTAMP (64bit) | AtomicMax      | AtomicMin      | N/A                       | AtomicCount
FixedString       | Lock           | Lock           | N/A                       | AtomicCount

Acceleration Demonstration
Accelerating DB2 BLU query processing with Nvidia GPUs on POWER8 servers: a hardware/software innovation preview.
- Compare query acceleration of DB2 BLU with GPU vs. a non-GPU baseline
- Show CPU offload by demonstrating increased multi-user throughput with DB2 BLU with GPU

BLU Analytic Workload
A set of queries from existing BLU analytic workloads, on the TPC-DS database schema (based on a retail database with in-store, on-line, and catalog sales of merchandise):
- 15% of queries use the GPU heavily
- 50% of queries use the GPU moderately
- 35% of queries do not use the GPU at all
Benchmark configuration: 100 GB (raw) data set, 10 concurrent users.

Performance Result
~2x improvement in workload throughput. CPU offload plus improved query runtimes are the main factors; most individual queries improve in end-to-end run time.

Query                | Avg GPU duration (ms) | Avg CPU duration (ms)
BDI CQ1              | 22895                 | 30647
BDI CQ3              | 21377                 | 32018
GROUP BY Store Sales | 3253                  | 8850
GROUP BY Web Sales   | 1371                  | 3974
ROLAP Q11            | 25326                 | 27543
ROLAP Q2             | 31675                 | 56976
ROLAP Q20            | 31567                 | 45559
ROLAP Q24            | 30823                 | 40754
ROLAP Q5             | 27512                 | 51060
ROLAP Q9             | 22454                 | 22748
Simple Non-GPU Q15   | 223                   | 242
Simple Non-GPU Q33   | 416                   | 408
Simple Non-GPU Q47   | 557                   | 380
Simple Non-GPU Q48   | 691                   | 567

GPU Utilization
The DB2 BLU GPU demo technology attempts to balance GPU operations across the available GPU devices. [Utilization charts not included in transcript; the measurements were taken from the demo workload running in continuous mode.]

Summary
The hardware/software innovation preview demonstrated GPU acceleration that:
- improved DB2 BLU query throughput;
- uses both the POWER8 processor and Nvidia GPUs;
- relies on fast GPU kernels we designed and developed, plus Nvidia kernels, function calls, etc.
Hardware acceleration shows potential for faster execution time and CPU off-loading.