Enabling Effective Utilization of GPUs for Data Management Systems


Enabling Effective Utilization of GPUs for Data Management Systems Xiaodong Zhang The Ohio State University

Moore's Law is reaching its end. We are approaching the final usable size limit: a transistor gate length of 5 nm. The minimum cost per transistor has been rising since the 28 nm chips of a few years ago.

Moore's Law created a powerful homogeneous ecosystem with a simple, one-size-fits-all computing abstraction: any task is divided into a sequence of standard execution time units; under the multiprogramming model, the OS assigns time units to each task in turn; data blocks are moved around in the memory hierarchy; and programming and execution models are standardized. Moore's Law continued to reduce execution time, but efficiency kept declining in both power and execution, and other models were not included, e.g., SIMD and other domain-specific ones. As Moore's Law ends, general-purpose computing needs a lot of external help, and hardware accelerators provide such support.

Continued Demand: High Throughput and Low Latency. [Figure: latency (locality) and throughput (parallelism) plotted against the number of concurrent processes, with an optimal point balancing the two.]

Hardware Devices Become Heterogeneous: CPU, GPU, FPGA. Data processing systems must efficiently utilize external hardware devices for continued high performance.

High Performance Computing is scale-up based. Scaling up means maximizing parallelism as the number of computing nodes and the workload scale, while minimizing latency. High scalability is achieved by hardware and software that attain high parallelism and exploit locality to reduce latency. High performance is gained by HPC scalability = parallelism / latency. Implementation: upgrading hardware and optimizing software.
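One way to read the slide's formula together with the earlier throughput/latency figure (a restatement of what is already stated, not a new result): as the number of concurrent processes n grows, parallelism raises throughput while contention raises latency, so the ratio peaks at the balanced "optimal point".

```latex
\mathrm{scalability}(n) \;=\; \frac{\mathrm{parallelism}(n)}{\mathrm{latency}(n)},
\qquad n = \text{number of concurrent processes}
```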

The same efforts achieve high scalability in big data (BD) systems. Scaling out means raising throughput as the number of computing nodes and the workload scale, subject to an acceptable latency. High scalability is achieved by maximizing parallelism to raise throughput and exploiting locality to reduce latency.

Sustained scalability is achieved in large data systems by finding a properly balanced point between parallelism and latency.

Architecture Differences: CPU vs. GPU. [Diagram: the CPU die is dominated by control logic and cache, the GPU die by massive ALUs.] Intel Xeon E5-2650v2 (2013): 2.3 billion transistors, 8 cores, 59.7 GB/s memory bandwidth. Nvidia GTX 780 (2013): 7 billion transistors, 2,304 cores, 288.4 GB/s memory bandwidth.

The GPU is fast at computing and at accessing its own memory. GPUs are powerful (2016) and have become easy to program with CUDA and OpenCL.

       Computing Power (GFLOPS)   Memory Bandwidth (GB/s)
  CPU  1,436                      102
  GPU  10,609 (7.4X)              720 (7.1X)

Advantages in Parallelism and Low Latency. The GPU has a massive number of parallel processing units, suited to processing a large number of simple and independent memory access operations, and it massively hides random memory access latency: with massive hardware threads and zero-overhead thread scheduling in hardware, the GPU effectively hides memory latency. [Speaker notes:] We find that GPUs have several advantages for kv stores. First, they have massive processing units: the operations in kv stores are simple, independent memory accesses, and the thousands of cores in GPUs are ideal for massive parallel processing. Second, they can massively hide memory access latency: as we have shown previously, index operations involve lots of random memory accesses, and GPUs can effectively hide them with …
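A minimal CUDA sketch of the access pattern described above: each thread performs one simple, independent (and effectively random) memory access, as in kv-store index probing. The open-addressing "index" here is hypothetical, not the talk's actual data structure; the point is that with a million resident threads, the hardware warp scheduler hides the random-access latency at zero scheduling cost.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// One probe per thread: a simple, independent, random memory access.
__global__ void probeIndex(const unsigned *keys, const unsigned *table,
                           unsigned tableSize, int *hit, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned slot = keys[i] % tableSize;   // random, independent access
        hit[i] = (table[slot] == keys[i]);
    }
}

int main() {
    const int n = 1 << 20;                     // a million probes in flight
    const unsigned tableSize = 1 << 22;
    unsigned *keys, *table; int *hit;
    cudaMalloc(&keys, n * sizeof(unsigned));
    cudaMalloc(&table, tableSize * sizeof(unsigned));
    cudaMalloc(&hit, n * sizeof(int));
    probeIndex<<<(n + 255) / 256, 256>>>(keys, table, tableSize, hit, n);
    cudaDeviceSynchronize();
    printf("probes issued: %d\n", n);
    cudaFree(keys); cudaFree(table); cudaFree(hit);
    return 0;
}
```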

Can we simply offload parallel operations from the CPU to GPUs to further improve performance? No. Since Spark is designed around the RDD, can we simply offload RDD operations to GPUs to get better performance? The answer is no, due to the mismatches between Spark's design and the GPU's unique characteristics.

Challenge #1: Mismatch of Programming Models. E.g., Spark is implemented in Scala and runs on top of the Java Virtual Machine, while the GPU is usually programmed with CUDA or OpenCL. A pinned memory area allocated by cudaMallocHost holds a "non-swappable" RDD. [Diagram: RDD tasks run in Spark executors on the JVM alongside GPU kernels; data is staged from OS pageable memory into pinned memory and then transferred over PCIe into GPU memory.]
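A minimal CUDA sketch of the staging path in the diagram, with illustrative buffer names and sizes: the GC may relocate "pageable" JVM heap data, so it is first copied into a page-locked buffer from cudaMallocHost, which the DMA engine can then transfer directly over PCIe.

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <cstdio>

int main() {
    const size_t n = 1 << 20;
    int *pinned = nullptr, *device = nullptr;
    cudaMallocHost(&pinned, n * sizeof(int));   // pinned (page-locked) host buffer
    cudaMalloc(&device, n * sizeof(int));       // destination in GPU memory

    // Step 1: copy out of pageable memory (in Spark-GPU, out of the JVM heap).
    int *pageable = new int[n]();
    std::memcpy(pinned, pageable, n * sizeof(int));

    // Step 2: DMA transfer from pinned memory over PCIe into GPU memory.
    cudaMemcpyAsync(device, pinned, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();

    printf("transfer complete\n");
    delete[] pageable;
    cudaFreeHost(pinned);
    cudaFree(device);
    return 0;
}
```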

Challenge #2: Mismatch in Execution Models and Data Formats. E.g., Spark uses a one-element-at-a-time iterator model over row-format data, while the GPU uses block processing in SIMD fashion over column-format data. [Diagram: a Filter over a People RDD with rows such as (Burk, CS, Professor, 110,000), (Terry, CS, Professor, 95,000), and (Luis, EE, Staff, 80,000), consumed one row at a time versus as a column block.]
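A minimal CUDA sketch of the block-oriented, columnar style the GPU favors, in contrast to Spark's row-at-a-time iterator. The column layout and the salary predicate are illustrative, not Spark-GPU's actual format: one thread evaluates the predicate on one value of a single column.

```cuda
// SIMD-style predicate over one column of a column-format block.
// Launch with, e.g., filterSalary<<<(n + 255) / 256, 256>>>(salary, flags, n, 100000);
__global__ void filterSalary(const int *salary, int *flags, int n, int limit) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        flags[i] = (salary[i] < limit);   // whole block filtered in parallel
}
```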

Challenge #3: Inability of time/space sharing and scheduling in GPUs or between the CPU and GPU. [Timeline diagram contrasting coarse-grained management, where each GPU task's JVM, host-to-GPU copy, kernel, and GPU-to-host copy stages run back to back and the four tasks of a 4-core CPU serialize on the GPU, with fine-grained management, where these operations of different GPU tasks can be overlapped.]
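A minimal CUDA sketch of the overlap the fine-grained timeline shows, using streams (the kernel body, counts, and buffer names are placeholders): each task's host-to-GPU copy, kernel, and GPU-to-host copy are issued on its own stream, so task i's kernel can run while task i+1's input is still in flight. True copy/compute overlap assumes the host buffers are pinned, e.g. allocated with cudaMallocHost.

```cuda
#include <cuda_runtime.h>

__global__ void work(int *d, int n) { /* per-task kernel body (placeholder) */ }

void runTasks(int **hIn, int **dBuf, int nTasks, int n) {
    cudaStream_t s[4];
    for (int i = 0; i < 4; ++i) cudaStreamCreate(&s[i]);
    for (int t = 0; t < nTasks; ++t) {
        cudaStream_t st = s[t % 4];          // round-robin over four streams
        cudaMemcpyAsync(dBuf[t], hIn[t], n * sizeof(int),
                        cudaMemcpyHostToDevice, st);       // host -> GPU
        work<<<(n + 255) / 256, 256, 0, st>>>(dBuf[t], n); // kernel
        cudaMemcpyAsync(hIn[t], dBuf[t], n * sizeof(int),
                        cudaMemcpyDeviceToHost, st);       // GPU -> host
    }
    cudaDeviceSynchronize();                  // wait for all streams
    for (int i = 0; i < 4; ++i) cudaStreamDestroy(s[i]);
}
```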

Evolution of the GPU Programming and Execution Environment:
- Manually program the GPU (e.g., a CUDA library). +: high performance. -: app-dependent; GPU only.
- Manually program both CPU and GPU (e.g., GPU-DB, Caffe, …). +: high performance on one. -: app-dependent; hard to coordinate.
- Separate interfaces: CPU under system APIs, plus the GPU (e.g., Hive, Spark, …). +: high performance on one; general purpose for the CPU. -: app-dependent for the GPU; hard to coordinate.
- An inclusive environment: CPU and GPU under advanced system APIs. +: retains all the merits; addresses all the limits. -: is the overhead affordable? It is a challenging task.

The Internet is an Inclusive Environment. Its foundation: the Uniform Resource Locator (URL), a character string that identifies a data source; the Hypertext Transfer Protocol (HTTP), for exchanging Web content among nodes; the Hypertext Markup Language (HTML), for expressing Web content; and the Web browser, an execution platform that retrieves content from other nodes. The first Web page in the world, http://info.cern.ch, appeared on August 6, 1991; there were fewer than 3,000 websites in 1994 and more than 1 billion websites in the world today. This inclusive environment supports all kinds of human interactions: Web pages, YouTube, Skype, WeChat, Internet shopping, …

Four different efforts to develop a GPU-inclusive system: (1) highly optimized libraries: commonly used components that are best suited to GPUs, with a well-defined interface for users of different applications; (2) specifically defined frameworks and abstractions: a programming framework that includes GPU execution; (3) domain-specific languages: a language for specifying a class of applications on GPUs; (4) inclusive system software: an OS that manages both the CPU and the GPU.

GPU Libraries: a simple and effective approach. Based on highly optimized algorithms and functions and written in a commonly adopted language such as CUDA, they can easily be incorporated into users' applications with minimal changes to the existing code. The PixelBox algorithm (VLDB'12) is such an example: we provide GPU solutions for pathology image processing, and the PixelBox algorithm accelerates computational geometry. The work has been adopted by commercial software (GPP) and is included in NVIDIA's GPU-Accelerated Libraries. The limits of this approach: it is GPU programming only, and the GPU remains an external device.
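A minimal CUDA C++ sketch of the library path the slide describes: a highly optimized GPU routine dropped into existing code with minimal changes. Thrust stands in here as a representative GPU-accelerated library; it is not the GPP/PixelBox API itself.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <vector>

// Replace an existing CPU sort with a library-provided GPU sort:
// the caller's code changes only at this call site.
void sortOnGpu(std::vector<int> &data) {
    thrust::device_vector<int> d(data.begin(), data.end()); // copy to GPU
    thrust::sort(d.begin(), d.end());                       // optimized GPU sort
    thrust::copy(d.begin(), d.end(), data.begin());         // copy back
}
```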

GPU-RDD: an abstraction to include the GPU in Spark. GPU-RDD is designed for GPU processing: it supports both column and row formats, manages data in OS native memory and GPU memory to minimize data movement overhead, and lets the application provide customized GPU functions that operate on the data.

System Architecture of Spark-GPU. [Architecture diagram: SQL queries and applications implemented with a procedural processing interface enter as user programs; the Spark-GPU driver contains a GPU-aware query optimizer, a GPU-aware task scheduler, and a GPU-aware resource manager; each Spark-GPU worker runs a Spark executor in a JVM on top of the OS and a GPU management library, executing tasks on both the CPU and the GPU.]

How does Spark become GPU-inclusive? [Diagram comparing four configurations: Spark (RDD) on the CPU with the GPU unused: poor performance, mismatches of both models; Spark (RDD) with an isolated GPU: sub-optimal performance, no coordination; Spark (RDD) with a specialized GPU library: sub-optimal performance, the GPU is independent, no coordination; Spark-GPU (RDD, GPU-RDD) spanning CPU and GPU: an effective approach, though Spark itself is changed.]

Domain-Specific Languages. Latte (PLDI 2016), a deep learning language: a language for describing deep neural networks, with a compiler that constructs an implicit data-flow graph with optimization. It supports data parallelism and can be used well in a GPU environment for deep learning applications. Merits and limits: highly efficient in domain-specific areas, but not general purpose.

System Software Support. Enhancing OS functions to manage GPUs: the OS not only launches a GPU task but also handles its process scheduling, memory management, and I/O, adding management functions (VLDB'14) such as multiprogramming in the GPU and virtual memory in the GPU. Merits and limits: it raises GPU efficiency and avoids page faults, but how do we coordinate the two different execution models?

Conclusion. Existing systems were not built to be inclusive. There are three mismatches between GPUs and the mainstream, and these mismatches also apply to other accelerators, such as FPGAs. Four different efforts exist to include the GPU in the mainstream; they all have merits and limits. We still have a long way to go to be totally inclusive: the best-suited operations should be switched to GPUs automatically, and the programming environment should not be GPU/CPU explicit.

Thank you