Skew Handling in Aggregate Streaming Queries on GPUs Georgios Koutsoumpakis 1, Iakovos Koutsoumpakis 1 and Anastasios Gounaris 2 1 Uppsala University,

Presentation transcript:

Skew Handling in Aggregate Streaming Queries on GPUs
Georgios Koutsoumpakis (1), Iakovos Koutsoumpakis (1) and Anastasios Gounaris (2)
(1) Uppsala University, Sweden
(2) Aristotle University of Thessaloniki, Greece

Talk Outline
1. Setting of our work
2. Our load-balancing framework
3. Load balancing techniques
4. Experimental results
5. Conclusions and future work
Anastasios Gounaris

Target applications
Data-intensive continuous aggregate queries (CQs)
– E.g., continuously report the average share price of each company in all European stock markets.
– They form the basis of many online analysis tasks.
– They implicitly assume a (possibly infinite) data stream.

Scalability requirements
CQs may be CPU-intensive due to the
– sheer amount of data
– possibly complex aggregate tasks
CQs may also be memory-intensive.
– E.g., continuously report the median share price of each company in all European stock markets in the last secs.
– We need to keep all the values within a (sliding) window of appropriate size.
The standard solution is parallelism.
– Partitioned parallelism has been widely investigated and used for CQs.
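The memory cost of a windowed median can be seen in a small sketch. This is only an illustration, not the authors' code: it assumes a count-based window (the slide's example is time-based), and the class name SlidingMedian and the group names are made up for the example.

```python
from collections import defaultdict, deque
from statistics import median


class SlidingMedian:
    """Per-group median over the last `window` values (count-based window, an assumption)."""

    def __init__(self, window):
        # One bounded buffer per group: every value inside the window must be kept.
        self.windows = defaultdict(lambda: deque(maxlen=window))

    def update(self, group, value):
        self.windows[group].append(value)  # the oldest value is dropped automatically
        return median(self.windows[group])


query = SlidingMedian(window=3)
stream = [("ACME", 10.0), ("ACME", 12.0), ("INIT", 5.0), ("ACME", 11.0), ("ACME", 20.0)]
results = [query.update(group, price) for group, price in stream]
print(results)
```

Unlike an average, which needs only a running sum and a count, the median forces each group to retain its whole window, which is why such queries are memory-intensive.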

Imbalance problems
In partitioned parallelism, each group is allocated to a distinct processor unit (PU).
If the workload is predictable, we can allocate an equal amount of work to each PU.
But often, it is not!
– E.g., continuously report the median size of messages originating from each IP, taking into account the last messages.
Skew problems arise when groups incur different amounts of workload.
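A toy calculation shows how skew hurts partitioned parallelism. The sketch assumes a round-robin assignment of groups to PUs and invented group sizes; neither comes from the talk.

```python
def pu_loads(group_sizes, num_pus):
    """Tuples handled by each PU when group g goes to PU g % num_pus (assumed policy)."""
    loads = [0] * num_pus
    for group, size in enumerate(group_sizes):
        loads[group % num_pus] += size
    return loads


uniform = [100] * 8
skewed = [520, 100, 60, 40, 30, 20, 20, 10]  # same total of 800 tuples, uneven groups

for name, sizes in (("uniform", uniform), ("skewed", skewed)):
    loads = pu_loads(sizes, num_pus=4)
    mean = sum(loads) / len(loads)
    print(name, loads, "slowest PU / mean load =", max(loads) / mean)
```

Since all PUs must finish before the next batch starts, the most loaded PU sets the pace, so the ratio of the maximum to the mean load is a rough indicator of the time lost to skew.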

Our Goal
Parallelise CQs on GPUs using CUDA.
Balance the load on-the-fly.
– Revise the assignment of groups to PUs.

A brief note on CUDA
CUDA stands for “Compute Unified Device Architecture”.
It is a general-purpose programming model that makes it easy for batches of threads to run on the GPU.
The GPU acts as a dedicated super-threaded, massively data-parallel co-processor.
Serial Code (host)
...
Parallel Kernel (device): KernelA<<<nBlk, nTid>>>(args);
Serial Code (host)
Parallel Kernel (device): KernelB<<<nBlk, nTid>>>(args);
The material of this slide is from David Kirk/NVIDIA and Wen-mei W. Hwu.


Main rationale
Data arrive continuously and we buffer them in batches,
– which are processed in iterations.
CPU responsibilities:
– To prepare the data in order to achieve coalesced memory access.
– To detect and correct imbalances.
GPU responsibilities:
– To perform the actual data processing.

Mappings on the CPU
We assume a fixed number of threads.
– Each group is fully processed by a single GPU thread.
We keep 2 hashmaps for group-to-thread and thread-to-group mappings.
[Figure: example Group → Thread and Thread → Group tables]
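One way to keep the two mappings consistent is sketched below. The class Mappings, its method names, and the sample assignment are hypothetical; the talk only states that two hashmaps are maintained.

```python
class Mappings:
    """Group-to-thread and thread-to-group hashmaps kept in sync (illustrative only)."""

    def __init__(self, num_threads):
        self.group_to_thread = {}
        self.thread_to_groups = {t: set() for t in range(num_threads)}

    def assign(self, group, thread):
        self.group_to_thread[group] = thread
        self.thread_to_groups[thread].add(group)

    def move(self, group, new_thread):
        # A load-balancing step: update both directions so lookups stay consistent.
        old_thread = self.group_to_thread[group]
        self.thread_to_groups[old_thread].discard(group)
        self.assign(group, new_thread)


mappings = Mappings(num_threads=3)
for group, thread in [(3, 0), (1, 0), (2, 1), (5, 2), (6, 2)]:
    mappings.assign(group, thread)

mappings.move(1, 1)  # reassign group 1 from thread 0 to thread 1
print(mappings.group_to_thread, mappings.thread_to_groups)
```

The group-to-thread map answers "where does this tuple go?" for every incoming tuple, while the thread-to-group map lets a balancing step enumerate the groups of an overloaded thread without scanning all groups.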

Operations on the CPU (repeated for each batch)
– Copies the next batch of the streaming data to a new matrix and counts the number of tuples of each thread.
– Reorders data so that groups of the same thread are together and creates the matrix threadDataIndicator.
– Check/correct imbalances.
– Copy data to GPU / launch the kernel.
Example:
Data stream (data matrix): id:5, attr:1 | id:2, attr:4 | id:3, attr:1 | id:1, attr:5 | id:2, attr:2 | id:6, attr:1 | …
Reordered data matrix: thread0: id:3, attr:1 | id:1, attr:5; thread1: id:2, attr:4 | id:2, attr:2; thread2: id:5, attr:1 | id:6, attr:1
threadDataIndicator: 0 2 4
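The counting and reordering steps amount to a stable counting sort keyed by thread. The sketch below reproduces the slide's example; the function name reorder_batch and the group-to-thread assignment (groups 3 and 1 on thread 0, group 2 on thread 1, groups 5 and 6 on thread 2) are inferred from the figure and should be read as assumptions.

```python
def reorder_batch(batch, group_to_thread, num_threads):
    """Return (reordered batch, threadDataIndicator, per-thread tuple counts)."""
    # Pass 1: count the tuples that belong to each thread.
    counts = [0] * num_threads
    for group_id, _attr in batch:
        counts[group_to_thread[group_id]] += 1

    # Exclusive prefix sum: the first position of each thread's slice.
    indicator = [0] * num_threads
    for t in range(1, num_threads):
        indicator[t] = indicator[t - 1] + counts[t - 1]

    # Pass 2: place each tuple in the next free slot of its thread's slice.
    next_slot = list(indicator)
    reordered = [None] * len(batch)
    for tup in batch:
        t = group_to_thread[tup[0]]
        reordered[next_slot[t]] = tup
        next_slot[t] += 1
    return reordered, indicator, counts


batch = [(5, 1), (2, 4), (3, 1), (1, 5), (2, 2), (6, 1)]  # (id, attr) pairs from the slide
group_to_thread = {3: 0, 1: 0, 2: 1, 5: 2, 6: 2}
reordered, indicator, counts = reorder_batch(batch, group_to_thread, num_threads=3)
print(reordered)
print(indicator)
```

Placing each thread's tuples contiguously is what prepares the data for the GPU, and the per-thread counts double as the workload figures that the imbalance check can use.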

Data on the GPU
Copied from the CPU:
– Reordered data matrix: thread0: id:3, attr:1 | id:1, attr:5; thread1: id:2, attr:4 | id:2, attr:2; thread2: id:5, attr:1 | id:6, attr:1
– threadDataIndicator: 0 2 4
Maintained on the GPU:
– Windows
– Group nextPos
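What a single GPU thread might do with these structures can be simulated on the CPU. This sketch rests on an interpretation that the slide does not spell out: each group has a fixed-size window used as a circular buffer, and nextPos is the slot to overwrite next. The function names and the averaging aggregate are illustrative.

```python
def thread_work(tid, reordered, indicator, windows, next_pos, window_size):
    """Process the slice of the reordered batch that belongs to thread `tid`."""
    start = indicator[tid]
    end = indicator[tid + 1] if tid + 1 < len(indicator) else len(reordered)
    for group_id, attr in reordered[start:end]:
        window = windows.setdefault(group_id, [None] * window_size)
        pos = next_pos.get(group_id, 0)
        window[pos] = attr                           # overwrite the oldest slot
        next_pos[group_id] = (pos + 1) % window_size  # advance circularly


def window_average(window):
    values = [v for v in window if v is not None]
    return sum(values) / len(values)


reordered = [(3, 1), (1, 5), (2, 4), (2, 2), (5, 1), (6, 1)]
indicator = [0, 2, 4]
windows, next_pos = {}, {}
for tid in range(3):  # on the GPU these iterations would be concurrent threads
    thread_work(tid, reordered, indicator, windows, next_pos, window_size=2)

print(window_average(windows[2]))  # average over group 2's window, which holds 4 and 2
```

Because each group is owned by exactly one thread, the threads never write to the same window, so no synchronisation between them is needed in this scheme.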



Load balancing algorithms - 1
Try to smooth differences between the workload of threads.
We use two heaps in order to detect tmax and tmin (the most and the least loaded thread) in O(1).
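A possible realisation of the two-heap idea uses a min-heap and a max-heap over thread loads. The sketch below is an assumption about the data structure, not the authors' implementation: it relies on Python's heapq with lazy invalidation, so reading tmax or tmin is O(1) whenever the heap top is current, and stale entries are discarded when they surface.

```python
import heapq


class LoadTracker:
    """Tracks thread loads and exposes the most/least loaded thread (illustrative)."""

    def __init__(self, loads):
        self.loads = list(loads)
        self.min_heap = [(load, t) for t, load in enumerate(self.loads)]
        self.max_heap = [(-load, t) for t, load in enumerate(self.loads)]
        heapq.heapify(self.min_heap)
        heapq.heapify(self.max_heap)

    def update(self, thread, new_load):
        # Push fresh entries; outdated ones are dropped lazily in _peek.
        self.loads[thread] = new_load
        heapq.heappush(self.min_heap, (new_load, thread))
        heapq.heappush(self.max_heap, (-new_load, thread))

    def _peek(self, heap, sign):
        # An entry is stale if its recorded load differs from the current load.
        while sign * heap[0][0] != self.loads[heap[0][1]]:
            heapq.heappop(heap)
        return heap[0][1]

    def tmin(self):
        return self._peek(self.min_heap, 1)

    def tmax(self):
        return self._peek(self.max_heap, -1)


tracker = LoadTracker([550, 120, 80, 50])
before = (tracker.tmax(), tracker.tmin())
tracker.update(0, 250)  # 300 tuples' worth of groups moved away from thread 0 ...
tracker.update(3, 350)  # ... and given to thread 3
after = (tracker.tmax(), tracker.tmin())
print(before, after)
```

Each load update costs O(log n), which is paid only for the two threads involved in a move, while the pair to rebalance next is read from the heap tops.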

Load balancing algorithms - 2
getFirst simply chooses the first group upon detection of the most imbalanced pair.
checkAll examines all the groups of the most loaded thread and moves the biggest one.
probCheck makes a probabilistic choice of the biggest group in the most loaded thread.
bestBalance examines all the groups of the most loaded thread and moves the one that leads to the smallest difference in the workload.
shift allows moves of groups only to neighboring threads.
– E.g., the first group of thread 14 can be moved only to thread 13.
shiftLocal does not detect tmax/tmin and checks only adjacent threads.
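Three of these heuristics differ only in which group of the most loaded thread they pick, which a short sketch can make concrete. The function names mirror the slide, but the bodies are a reading of the one-line descriptions: the group sizes are invented, and it is assumed that the chosen group moves from tmax to tmin. probCheck, shift and shiftLocal are left out because the slide gives too little detail to sketch them faithfully.

```python
def get_first(groups):
    """getFirst: take the first group of the most loaded thread."""
    return groups[0][0]


def check_all(groups):
    """checkAll: scan all groups and take the biggest one."""
    return max(groups, key=lambda g: g[1])[0]


def best_balance(groups, load_max, load_min):
    """bestBalance: take the group whose move leaves the smallest load difference."""
    def gap_after_move(group):
        size = group[1]
        return abs((load_max - size) - (load_min + size))
    return min(groups, key=gap_after_move)[0]


# (group id, tuples in the current batch) for the groups of the most loaded thread
groups_of_tmax = [(7, 40), (4, 400), (9, 230)]
load_max = sum(size for _gid, size in groups_of_tmax)  # 670
load_min = 50  # load of the least loaded thread

choices = {
    "getFirst": get_first(groups_of_tmax),
    "checkAll": check_all(groups_of_tmax),
    "bestBalance": best_balance(groups_of_tmax, load_max, load_min),
}
print(choices)
```

In this example moving the biggest group (400 tuples) overshoots and leaves a gap of 180, whereas the 230-tuple group leaves a gap of 160, which is why bestBalance and checkAll can disagree. getFirst does no scanning at all, which keeps the balancing step itself cheap.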

Experimental setting
Two systems used.
– PC1: Intel Core2 Duo E6750 CPU at 2.66 GHz; NVIDIA GTX 460 (GF104) graphics processor at 810 MHz on a PCIe v2.0 x16 slot (5 GB/s transfer rate).
– PC2: Intel P4 550 CPU at 3.4 GHz; NVIDIA GTX 550 Ti (GF116) at 910 MHz on a PCIe v1.1 x16 slot (2.5 GB/s transfer rate).
Three datasets.
– DS1: no imbalance.
– DS2: high imbalance; group sizes follow a Zipf distribution.
– DS3: low imbalance; group sizes follow a Zipf distribution but groups are randomly permuted.
Fixed parameters:
– Block size is fixed to 256 threads.
– Batch size is fixed to 50K tuples.
– Window size is 100 and there are always 40K groups.

Impact of imbalance
[Figure: PC1 without load balancing; time to process 100M tuples (2K iterations); grid size = 4]
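The figures quoted in the experimental setting can be cross-checked with a little arithmetic. The sketch assumes that K means 1,000, that M means 1,000,000, and that the grid size counts thread blocks, so the total number of threads is grid size times block size.

```python
BLOCK_SIZE = 256            # threads per block
BATCH_SIZE = 50_000         # tuples per batch
NUM_GROUPS = 40_000
TOTAL_TUPLES = 100_000_000

iterations = TOTAL_TUPLES // BATCH_SIZE  # one iteration per batch


def per_thread(grid_size):
    """Total threads, average groups per thread, average tuples per thread and batch."""
    threads = grid_size * BLOCK_SIZE
    return threads, NUM_GROUPS / threads, BATCH_SIZE / threads


print("iterations:", iterations)
for grid_size in (4, 64):
    threads, groups_per_thread, tuples_per_thread = per_thread(grid_size)
    print(grid_size, threads, round(groups_per_thread, 2), round(tuples_per_thread, 2))
```

Under these assumptions a grid size of 4 gives 1,024 threads with about 39 groups each, so there are many groups that can be reassigned, while a grid size of 64 gives 16,384 threads with fewer than 3 groups each.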

High Imbalance
Speedups of up to 4.27 are observed.
Increasing the grid size seems to work
– …but it is not always applicable!
Simple heuristics perform similarly to (if not better than) the most sophisticated ones.
Less sophisticated and approximate load balancing techniques are more appropriate for GPGPU
– Basically because they require less computational effort for the balancing itself.
[Figures: grid size = 4; grid size = 64]

Low imbalance
No technique is actually effective.
[Figures: grid size = 4; grid size = 64]


Summary
In this work we presented:
1. A GPGPU load balancing framework.
2. Load balancing algorithms.
Lessons learnt:
– Load imbalances can lead to serious performance degradations.
– In high imbalances, we have achieved speedups of more than 4 times.
– Load balancing techniques need not be very sophisticated.
– Small imbalances cannot be tackled.

Future Work - Points not considered
Varying dynamically the grid/block/batch size.
Investigation in light of the most recent dynamic parallelism extensions in Kepler architectures.
Handling of cases where the GPU capacity is lower than the data arrival rate
– Use of approximate/load shedding techniques.

Thank you!
… and apologies to all reviewers, whose comments have not been addressed due to tight time constraints.

Back-up slides - Overheads
For grid size 4, the CPU operations are (almost) fully hidden.
[Figures: grid size = 4; grid size = 64]
