
synergy.cs.vt.edu VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units Shucai Xiao 1, Pavan Balaji 2, Qian Zhu 3, Rajeev Thakur 2, Susan Coghlan 2, Heshan Lin 1, Gaojin Wen 4, Jue Hong 4, and Wu-chun Feng 1 1. Virginia Tech 2. Argonne National Laboratory 3. Accenture Technology Labs 4. Chinese Academy of Sciences

Motivation
– GPUs are widely used as accelerators for scientific computation
  – Many applications are parallelized on GPUs
  – Speedups are reported compared to execution on CPUs

GPU-Based Supercomputers

Challenges of GPU Computing
– Provisioning limitations
  – Not all compute nodes are configured with GPUs
    – Budget and power-consumption considerations
    – Multiple stages of investment
– Programmability
  – Current GPU programming models: CUDA and OpenCL
  – CUDA and OpenCL support only the use of local GPUs

Our Contributions
– Virtual OpenCL (VOCL) framework for transparent virtualization of GPUs
  – Remote GPUs look like "virtual" local GPUs
    – A program can use non-local GPUs
    – A program can use more GPUs than can be installed locally
– Efficient resource management
– Optimization of data transfer across different machines
[Diagram: software stack — application, OpenCL API, VOCL library, with native OpenCL and MPI underneath]

Outline
– Motivation and Contributions
– Related Work
– VOCL Framework
– VOCL Optimization
– Experimental Results
– Conclusion & Future Work

Existing Frameworks for GPU Virtualization
– rCUDA
  – Good performance: relative overhead is about 2% compared to execution on a local GPU (GeForce 9800)
  – Lack of support for CUDA C extensions such as the kernel<<<...>>>() launch syntax
  – Partial support for asynchronous data transfer
– MOSIX-VCL
  – Transparent virtualization
  – Large overhead even for local GPUs (average overhead: local GPU 25.95%; remote GPU %)
  – No support for asynchronous data transfer


Virtual OpenCL (VOCL) Framework Components
– VOCL library and proxy process
[Diagram: on the local node, the application calls the OpenCL API through the VOCL library; MPI connects it to a proxy on the remote node, which calls the native OpenCL library to drive the GPU]

VOCL Library
– Located on each local node
– Implements OpenCL functionality
  – Application Programming Interface (API) compatibility: API functions in VOCL have the same interfaces as those in OpenCL, so VOCL is transparent to application programs
  – Application Binary Interface (ABI) compatibility: no recompilation is needed; static libraries need relinking, and an environment variable preloads the library for dynamically linked programs
– Deals with both local and remote GPUs in a system
  – Local GPUs: calls native OpenCL functions
  – Remote GPUs: uses MPI to send function calls to remote nodes
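The dual dispatch path above can be sketched as a thin routing layer. This is a minimal illustration, not VOCL's actual code: the names vocl_dispatch, call_native_opencl, and forward_over_mpi are hypothetical, and the two back ends are stubs standing in for the native OpenCL library and an MPI send to the proxy.

```c
/* Hypothetical sketch of VOCL's dual dispatch path: wrappers keep the
 * OpenCL signature, call the native library for local GPUs, and forward
 * the call over MPI for remote ones. Both back ends are stubs here. */
enum gpu_location { GPU_LOCAL, GPU_REMOTE };

struct vocl_device {
    enum gpu_location loc; /* where the GPU is installed */
    int node;              /* owning remote node, if remote */
};

static int native_calls;  /* calls into the local OpenCL library */
static int mpi_forwards;  /* calls packed into MPI messages */

static int call_native_opencl(void) { native_calls++; return 0; }

static int forward_over_mpi(int node) { (void)node; mpi_forwards++; return 0; }

/* Every VOCL API wrapper funnels through a dispatch like this one. */
int vocl_dispatch(const struct vocl_device *dev) {
    if (dev->loc == GPU_LOCAL)
        return call_native_opencl();    /* negligible-overhead path */
    return forward_over_mpi(dev->node); /* remote: message to the proxy */
}
```

Because the wrapper mirrors the OpenCL signature, the application never sees which branch was taken.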

VOCL Abstraction: GPUs on Multiple Nodes
– OpenCL object handle values
  – On the same node, each OpenCL object has a unique handle value
  – On different nodes, different OpenCL objects can share the same handle value (e.g., OCLH1 != OCLH2, but OCLH2 == OCLH3)
– VOCL abstraction: each OpenCL object is translated to a VOCL object with a distinct handle value (VOCLH1, VOCLH2, VOCLH3)

struct voclObj {
    voclHandle vocl;
    oclHandle  ocl;
    MPI_Comm   com;
    int        nodeIndex;
};
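A minimal sketch of this handle translation, assuming a simple linear table; the names vocl_register and vocl_lookup are illustrative, and the MPI communicator field is omitted to keep the sketch self-contained. Native handles are unique only per node, so VOCL hands the application a process-wide unique handle and translates it back on every call.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative handle-translation table (hypothetical names). */
typedef uintptr_t vocl_handle;
typedef uintptr_t ocl_handle;

struct vocl_obj {
    vocl_handle vocl; /* unique handle seen by the application */
    ocl_handle  ocl;  /* native handle, unique only on its node */
    int node_index;   /* node that owns the OpenCL object */
};

#define MAX_OBJS 64
static struct vocl_obj objs[MAX_OBJS];
static size_t n_objs;
static vocl_handle next_handle = 1;

/* Register a native handle; returns a fresh VOCL handle (0 on failure). */
vocl_handle vocl_register(ocl_handle ocl, int node_index) {
    if (n_objs >= MAX_OBJS)
        return 0;
    struct vocl_obj *o = &objs[n_objs++];
    o->vocl = next_handle++;
    o->ocl = ocl;
    o->node_index = node_index;
    return o->vocl;
}

/* Translate back before forwarding a call to the owning node. */
int vocl_lookup(vocl_handle h, ocl_handle *ocl, int *node_index) {
    for (size_t i = 0; i < n_objs; i++) {
        if (objs[i].vocl == h) {
            *ocl = objs[i].ocl;
            *node_index = objs[i].node_index;
            return 0;
        }
    }
    return -1; /* unknown handle */
}
```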

VOCL Proxy
– Daemon process, started by the administrator
– Located on each remote node
  – Receives data-communication requests (in a separate thread)
  – Receives input data from, and sends output data to, the application process
  – Calls native OpenCL functions for GPU computation
[Diagram: the application and VOCL library on the local node communicate via MPI with a proxy on each remote node; each proxy calls its native OpenCL library to drive its GPUs]


Overhead in VOCL
– Local GPUs
  – Translation between OpenCL and VOCL handles
  – Overhead is negligible
– Remote GPUs
  – Translation between VOCL and OpenCL handles
  – Data communication between different machines

Data Transfer: Between Host Memory and Device Memory
– Pipelining approach
  – With a single block, each stage is executed one after another
  – With multiple blocks, the first stage of one block can be overlapped with the second stage of another block
  – A buffer pool is pre-allocated for data storage in the proxy
[Diagram: data flows from CPU memory on the local node through the proxy's buffer pool on the remote node into GPU memory]
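The pipelining scheme above can be sketched as follows. This is an illustrative model, not VOCL's implementation: the two stages are stand-in memcpy calls for the network receive and the PCIe write, which in the real proxy run asynchronously so consecutive blocks overlap; only the block splitting and the buffer-pool round-robin are shown.

```c
#include <stddef.h>
#include <string.h>

#define POOL_SLOTS 2
#define BLOCK_SIZE 4096

static char pool[POOL_SLOTS][BLOCK_SIZE]; /* pre-allocated buffer pool */

/* Stage 1 stand-in: network receive into a pool slot. */
static void recv_from_network(char *dst, const char *src, size_t n) {
    memcpy(dst, src, n);
}

/* Stage 2 stand-in: copy from the pool slot into GPU memory. */
static void write_to_gpu(char *gpu, const char *src, size_t n) {
    memcpy(gpu, src, n);
}

/* Split the stream into blocks and cycle them through the pool; with
 * asynchronous stages, block i+1 arrives while block i drains to the GPU. */
void pipelined_write(char *gpu_mem, const char *net_stream, size_t total) {
    size_t off = 0;
    int slot = 0;
    while (off < total) {
        size_t n = total - off < BLOCK_SIZE ? total - off : BLOCK_SIZE;
        recv_from_network(pool[slot], net_stream + off, n);
        write_to_gpu(gpu_mem + off, pool[slot], n);
        slot = (slot + 1) % POOL_SLOTS; /* round-robin through the pool */
        off += n;
    }
}
```

Pre-allocating the pool avoids per-transfer allocation in the proxy and bounds its memory use regardless of the transfer size.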

Environment for Program Execution
– Node configuration
  – Local node: 2 AMD Magny-Cours CPUs, 64 GB memory
  – Remote node: 2 AMD Magny-Cours CPUs (64 GB memory) hosting 2 NVIDIA Tesla M2070 GPUs (6 GB global memory each), CUDA 3.2 (OpenCL 1.1 specification)
– Network connection: QDR InfiniBand

Micro-benchmark Results
– Continuously transfer a window of data blocks one after another
– Call clFinish() to wait for completion

for (i = 0; i < N; i++) {
    clEnqueueWriteBuffer();
}
clFinish();

– GPU memory write bandwidth increases from 50% to 80% of that of the local GPU

Kernel Argument Setting
Overhead of kernel execution for aligning one pair of sequences (6K letters) with Smith-Waterman:

[Table: runtime on local and remote GPUs, overhead, and number of calls for clSetKernelArg and clEnqueueNDRangeKernel; numeric values lost in extraction. Relative overhead: 42.97%.]

int a;
cl_mem b;
b = clCreateBuffer(...);
clSetKernelArg(hFoo, 0, sizeof(int), &a);
clSetKernelArg(hFoo, 1, sizeof(cl_mem), &b);
clEnqueueNDRangeKernel(..., hFoo, ...);

__kernel void foo(int a, __global int *b) {}

Kernel Argument Setting Caching
Overhead of functions related to kernel execution for aligning the same pair of sequences, with kernel arguments stored locally until launch:

[Table: same measurements as the previous slide after argument caching; numeric values lost in extraction. Relative overhead drops to 10.92%.]
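The caching idea can be sketched as follows; vocl_set_kernel_arg and vocl_enqueue_kernel are hypothetical names, and the MPI traffic is modeled by a message counter. Each clSetKernelArg is recorded locally instead of generating one small message, and a single message then carries the launch request together with all cached arguments.

```c
#include <stddef.h>
#include <string.h>

#define MAX_ARGS 16
#define MAX_ARG_BYTES 64

struct cached_arg {
    unsigned index;            /* kernel argument index */
    size_t size;               /* argument size in bytes */
    char value[MAX_ARG_BYTES]; /* copied argument value */
};

static struct cached_arg args[MAX_ARGS];
static int n_cached;
static int messages_sent; /* models round trips to the proxy */

/* Record the argument locally; no message is sent yet. */
int vocl_set_kernel_arg(unsigned index, size_t size, const void *value) {
    if (n_cached >= MAX_ARGS || size > MAX_ARG_BYTES)
        return -1;
    args[n_cached].index = index;
    args[n_cached].size = size;
    memcpy(args[n_cached].value, value, size);
    n_cached++;
    return 0;
}

/* One message carries the launch request plus all cached arguments. */
int vocl_enqueue_kernel(void) {
    messages_sent++;
    n_cached = 0; /* arguments consumed by this launch */
    return 0;
}
```

Collapsing many tiny argument messages into the launch message is what shrinks the relative overhead on this slide.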


Evaluation via Application Kernels
– Three application kernels: matrix multiplication, matrix transpose, and Smith-Waterman
– Measurements: program execution time, relative overhead, and the relationship to the time percentage of kernel execution

Matrix Multiplication
[Charts: time percentage of kernel execution; kernel execution time and performance overhead]
Multiple problem instances are issued consecutively:

for (i = 0; i < N; i++) {
    clEnqueueWriteBuffer();
    clEnqueueNDRangeKernel();
    clEnqueueReadBuffer();
}
clFinish();

Matrix Transpose
[Charts: time percentage of kernel execution; kernel execution time and performance overhead]
Multiple problem instances are issued consecutively:

for (i = 0; i < N; i++) {
    clEnqueueWriteBuffer();
    clEnqueueNDRangeKernel();
    clEnqueueReadBuffer();
}
clFinish();

Smith-Waterman
[Charts: time percentage of kernel execution; kernel execution time and performance overhead]
Two observations:
1. SW needs many kernel launches, and a large number of small messages are transferred
2. MPI in the proxy is initialized to support multiple threads, which handles small-message transfers poorly

for (i = 0; i < N; i++) {
    clEnqueueWriteBuffer();
    for (j = 0; j < M; j++) {
        clEnqueueNDRangeKernel();
    }
    clEnqueueReadBuffer();
}
clFinish();


Conclusions
– Virtual OpenCL framework
  – Based on the OpenCL programming model
  – Internally uses MPI for data communication
– VOCL framework optimizations
  – Kernel-argument caching
  – Pipelined GPU memory writes and reads
– Application kernel verification
  – SGEMM, n-body, matrix transpose, and Smith-Waterman
  – Reasonable virtualization cost

Future Work
Extensions to the VOCL framework:
– Live task migration (already done)
– Super-GPU
– Performance model for GPU utilization
– Resource-management strategies
– Energy-efficient computing

For More Information
– Shucai Xiao
– Synergy Lab website
Thanks! Questions?