A Source-to-Source OpenACC compiler for CUDA
Akihiro Tabuchi †1, Masahiro Nakao †2, Mitsuhisa Sato †1
†1 Graduate School of Systems and Information Engineering, University of Tsukuba
†2 Center for Computational Sciences, University of Tsukuba

Outline
- Background
- OpenACC
- Compiler Implementation
- Performance Evaluation
- Conclusion & Future Work

Background
Accelerator programming models
- CUDA (for NVIDIA GPUs)
- OpenCL (for various accelerators)
Accelerator programming is complex
- memory management, kernel functions, …
- low productivity & low portability
OpenACC was proposed to solve these problems

OpenACC
A directive-based programming model for accelerators
- supports C, C++ and Fortran
Offloading model
- offloads parts of the code to an accelerator
High productivity
- only directives need to be added
High portability
- runs on any accelerator as long as the compiler supports it

Example of OpenACC

int main(){
  int i;
  int a[N], b[N], c[N];
  /* initialize arrays 'a' and 'b' */
  #pragma acc parallel loop copyin(a,b) copyout(c)
  for(i = 0; i < N; i++){
    c[i] = a[i] + b[i];
  }
}

The directive specifies the data transfers as well as the offloading and parallelization of the loop.
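For comparison, here is a rough sketch of the same vector addition written directly in CUDA (illustrative code, not output of the presented compiler; the problem size N and the block size 128 are assumed). It shows the memory management and kernel boilerplate that the single directive replaces:

#include <cuda_runtime.h>

#define N 1024   /* assumed problem size for illustration */

__global__ void vecadd(const int *a, const int *b, int *c, int n){
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if(i < n) c[i] = a[i] + b[i];
}

int main(){
  int a[N], b[N], c[N];
  /* initialize arrays 'a' and 'b' */
  int *d_a, *d_b, *d_c;
  size_t bytes = N * sizeof(int);

  /* device memory allocation and host-to-device transfers, written by hand */
  cudaMalloc((void **)&d_a, bytes);
  cudaMalloc((void **)&d_b, bytes);
  cudaMalloc((void **)&d_c, bytes);
  cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

  /* explicit kernel launch with a hand-chosen block size */
  vecadd<<<(N + 127) / 128, 128>>>(d_a, d_b, d_c, N);

  /* copy the result back and release device memory */
  cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
  return 0;
}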

Purpose of Research
Design and implement an open-source OpenACC compiler
- Target language: C
- Target accelerator: NVIDIA GPU
- Source-to-source approach: C + OpenACC → C + CUDA API
  This approach leaves detailed machine-specific code optimization to NVIDIA's mature CUDA compiler
- The result of compilation is an executable file

Related Work
Commercial compilers
- PGI Accelerator compiler
- CAPS HMPP
- Cray compiler
Open-source compiler
- accULL, developed at the University of La Laguna in Spain
  - source-to-source translation
  - backends are CUDA and OpenCL
  - output is source code and a Makefile

OpenACC directives (OpenACC specification 1.0)
parallel, kernels, loop, data, host_data, update, wait, cache, declare, parallel loop, kernels loop

data construct

int a[4];
#pragma acc data copy(a)
{
  /* some code using 'a' */
}

The data construct manages data on the accelerator. If an array is specified in a "copy" clause:
1. Device memory allocation (at the beginning of the region)
2. Data transfer from host to device (at the beginning of the region)
3. Data transfer from device to host (at the end of the region)
4. Device memory release (at the end of the region)

(Slide figure: the array resides in both host memory and device memory while the computation runs on the device.)

Translation of data construct

int a[4];
#pragma acc data copy(a)
{
  /* some code using 'a' */
}

is translated to:

int a[4];
{
  void *_ACC_DEVICE_ADDR_a, *_ACC_HOST_DESC_a;
  _ACC_gpu_init_data(&_ACC_HOST_DESC_a, &_ACC_DEVICE_ADDR_a, a, 4*sizeof(int));  /* allocate 'a' on the GPU */
  _ACC_gpu_copy_data(_ACC_HOST_DESC_a, 400);   /* copy 'a' from host to GPU */
  {
    /* some code using 'a' */
  }
  _ACC_gpu_copy_data(_ACC_HOST_DESC_a, 401);   /* copy 'a' from GPU to host */
  _ACC_gpu_finalize_data(_ACC_HOST_DESC_a);    /* free 'a' on the GPU */
}

The host descriptor records the host address, device address and size of the array.
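As a rough sketch of how such runtime calls can be realized (an assumption for illustration; the actual runtime of the presented compiler may differ), the descriptor can record the host address, device address and size, and the wrappers can map directly onto the CUDA runtime API:

/* hypothetical sketch of the descriptor and runtime wrappers */
#include <cuda_runtime.h>
#include <stdlib.h>

typedef struct {
  void  *host_addr;     /* host address   */
  void  *device_addr;   /* device address */
  size_t size;          /* size in bytes  */
} _ACC_data_desc;

void _ACC_gpu_init_data(void **desc, void **dev_addr, void *host_addr, size_t size){
  _ACC_data_desc *d = (_ACC_data_desc *)malloc(sizeof(*d));
  d->host_addr = host_addr;
  d->size = size;
  cudaMalloc(&d->device_addr, size);            /* allocate on the GPU */
  *dev_addr = d->device_addr;
  *desc = d;
}

void _ACC_gpu_copy_data(void *desc, int direction){
  /* the constants 400/401 in the generated code are assumed to select the direction */
  _ACC_data_desc *d = (_ACC_data_desc *)desc;
  if(direction == 400)
    cudaMemcpy(d->device_addr, d->host_addr, d->size, cudaMemcpyHostToDevice);
  else
    cudaMemcpy(d->host_addr, d->device_addr, d->size, cudaMemcpyDeviceToHost);
}

void _ACC_gpu_finalize_data(void *desc){
  _ACC_data_desc *d = (_ACC_data_desc *)desc;
  cudaFree(d->device_addr);                     /* free on the GPU */
  free(d);
}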

parallel construct

#pragma acc parallel num_gangs(1) vector_length(128)
{
  /* code in the parallel region */
}

Code in a parallel region is executed on the device. OpenACC has three levels of parallelism, which map to CUDA as follows:
- gang → thread block
- worker → (warp)
- vector → thread
The number of gangs and workers and the vector length can be specified by clauses.

Translation of parallel construct

#pragma acc parallel num_gangs(1) vector_length(128)
{
  /* code in the parallel region */
}

is translated to a GPU kernel function and a kernel launch function:

/* GPU kernel function */
__global__ static void _ACC_GPU_FUNC_0_DEVICE( ... )
{
  /* code in the parallel region */
}

/* kernel launch function */
extern "C" void _ACC_GPU_FUNC_0( ... )
{
  dim3 _ACC_block(1, 1, 1), _ACC_thread(128, 1, 1);
  _ACC_GPU_FUNC_0_DEVICE<<<_ACC_block, _ACC_thread>>>( ... );
  _ACC_GPU_M_BARRIER_KERNEL();
}
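The call site in the translated host file (sample_tmp.c in the compilation flow shown later) then simply invokes the launch wrapper where the parallel region used to be. A minimal sketch, with a hypothetical argument list:

/* sketch of the translated call site (hypothetical arguments) */
extern void _ACC_GPU_FUNC_0(void *dev_a);

void translated_region(void *_ACC_DEVICE_ADDR_a){
  /* the original "#pragma acc parallel ... { ... }" block becomes a call
     to the launch wrapper defined in the generated .cu file */
  _ACC_GPU_FUNC_0(_ACC_DEVICE_ADDR_a);
}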

loop construct

/* inside a parallel region */
#pragma acc loop vector
for(i = 0; i < 256; i++){
  a[i]++;
}

The loop construct describes the parallelism of a loop:
- it distributes the loop iterations among gangs, workers or vector lanes
- two or more levels of parallelism can be specified for a single loop
Loops without a loop directive inside a parallel region are basically executed serially.

Translation of loop construct (1/3)

/* inside a parallel region */
#pragma acc loop vector
for(i = 0; i < N; i++){
  a[i]++;
}

1. A virtual index with the same length as the loop iteration space is prepared.
2. The virtual index is divided and distributed among blocks and/or threads.
3. Each thread calculates the value of the loop variable from its portion of the virtual index and executes the loop body.

Translation of loop construct (2/3)

/* inside a parallel region */
#pragma acc loop vector
for(i = 0; i < N; i++){
  a[i]++;
}

is translated to the following code inside the GPU kernel:

/* inside the GPU kernel */
int i, _ACC_idx;                                  /* _ACC_idx: virtual index            */
int _ACC_init, _ACC_cond, _ACC_step;              /* virtual index range variables      */
/* calculate this thread's range of the virtual index */
_ACC_gpu_init_thread_x_iter(&_ACC_init, &_ACC_cond, &_ACC_step, 0, N, 1);
for(_ACC_idx = _ACC_init; _ACC_idx < _ACC_cond; _ACC_idx += _ACC_step){
  _ACC_gpu_calc_idx(_ACC_idx, &i, 0, N, 1);       /* calculate 'i' from the virtual index */
  a[i]++;                                         /* loop body                            */
}
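A minimal device-side sketch of what the two helper calls may do, assuming the signatures shown above and a cyclic distribution of the virtual index over the threads of a block (the bodies are assumptions, not the runtime's actual source):

/* hypothetical implementations of the loop-distribution helpers */
__device__ static void _ACC_gpu_init_thread_x_iter(int *init, int *cond, int *step,
                                                   int lower, int upper, int stride){
  int n_iter = (upper - lower + stride - 1) / stride;   /* length of the virtual index space */
  *init = threadIdx.x;    /* each thread starts at its own thread id        */
  *cond = n_iter;         /* and walks the virtual index space [0, n_iter)  */
  *step = blockDim.x;     /* in strides of the block size                   */
}

__device__ static void _ACC_gpu_calc_idx(int idx, int *i, int lower, int upper, int stride){
  *i = lower + idx * stride;   /* recover the original loop variable ('upper' is unused here) */
}

A gang-distributed loop would stride over blockIdx.x in the same way, and a combined gang vector loop over both.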

Translation of loop construct (3/3)

Our compiler supports 2D blocking for nested loops:
- nested loops are distributed among the 2D thread blocks of a 2D grid in CUDA (the default block size is 16x16)
- this is not allowed in OpenACC 2.0, which provides the "tile" clause instead

#pragma acc loop gang vector
for(i = 0; i < N; i++)
  #pragma acc loop gang vector
  for(j = 0; j < N; j++)
    /* … */

The nested loop is distributed over the 2D grid of 2D blocks (see the sketch below).
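As an illustration of the mapping (assumed code, not the translator's exact output), each CUDA thread of a 16x16 block handles one (i, j) pair of the nested loop:

/* sketch of the 2D-blocking translation; block size 16x16 is the default */
__global__ static void nested_loop_kernel(float *a, int n){
  int i = blockIdx.y * blockDim.y + threadIdx.y;   /* outer loop index */
  int j = blockIdx.x * blockDim.x + threadIdx.x;   /* inner loop index */
  if(i < n && j < n){
    /* ... loop body using i and j, e.g. accessing a[i * n + j] ... */
  }
}
/* launch (sketch): dim3 block(16, 16); dim3 grid((n + 15) / 16, (n + 15) / 16); */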

Compiler Implementation
Our compiler translates C with OpenACC directives to C with CUDA API calls
- it reads C code with directives and outputs translated code
- it uses the Omni compiler infrastructure
Omni compiler infrastructure
- a set of programs for building source-to-source compilers with code analysis and transformation
- supports C and Fortran 95

Flow of Compilation

(Slide diagram: the compilation pipeline.)
1. sample.c (C with OpenACC directives) is parsed by the Omni Frontend into sample.xml (XcodeML).
2. The OpenACC translator converts it into sample_tmp.c (C with ACC API calls) and sample.cu (CUDA).
3. sample_tmp.c is compiled by a C compiler into sample_tmp.o; sample.cu is compiled by nvcc into sample.gpu.o.
4. The object files are linked with the ACC runtime to produce a.out.
The Omni Frontend and the OpenACC translator are part of the Omni compiler infrastructure.

Performance Evaluation
Benchmarks
- Matrix multiplication
- N-body problem
- NAS Parallel Benchmarks CG
Evaluation environment
- 1 node of a Cray XK6m-200
  - CPU: AMD Opteron Processor 6272 (2.1 GHz)
  - GPU: NVIDIA X2090 (MatMul, N-body), NVIDIA K20 (NPB CG)

Performance Comparison
We compare the following versions:
- Cray compiler
- Our compiler
- Hand-written CUDA
  - the code is written in CUDA and compiled by NVCC
  - the code does not use the GPU's shared memory
- Our compiler (2D blocking)
  - the code uses 2D blocking and is compiled by our compiler
  - this is applied only to matrix multiplication

Matrix multiplication
(Chart: performance results; annotated speedups of 4.6x, 5.5x, 1.5x and 1.4x.)
The performance of our compiler with 2D blocking and of the hand-written CUDA code is slightly lower.

Matrix multiplication
Our compiler achieves better performance than the Cray compiler
- the PTX code generated directly by the Cray compiler has more operations in the innermost loop
- our compiler outputs CUDA code, and NVCC generates more optimized PTX code
2D blocking gives lower performance
- the default 2D block size (16x16) is not adequate for this program
- the best block size was 512x2
- the hand-written CUDA code also uses 16x16 blocks

N-body
(Chart: performance results; annotated speedups of 5.4x, 31x, 0.95x and 1.2x.)
At the small problem size, the performance of our compiler is lower than that of the Cray compiler.

N-body
At small problem sizes, the performance becomes worse
- the utilization of the Streaming Multiprocessors (SMs) declines
  - each thread block of a kernel is executed on one SM
  - if the number of blocks is smaller than the number of SMs, the kernel performs poorly (see the worked example below)
Default block size
- Cray compiler: 128 threads / block
- Our compiler: 256 threads / block
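A worked example with illustrative numbers (the slide does not state the actual problem sizes): suppose the kernel assigns one thread per body and the small case has 1024 bodies. With our compiler's default of 256 threads per block, the kernel launches ceil(1024/256) = 4 thread blocks, while the X2090 GPU has 16 SMs, so at most 4 of them are busy. With the Cray default of 128 threads per block the same kernel launches 8 blocks, occupying twice as many SMs, which is consistent with the smaller slowdown of the Cray compiler at small sizes.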

NPB CG
(Chart: performance results; annotated speedups of 0.66x, 9.7x, 0.74x and 2.1x.)
The performance is lower than that of the CPU and of the Cray compiler.

NPB CG
At class S, the performance of the GPU is lower than that of the CPU
- the overheads are large compared with the kernel execution time:
  - launching kernel functions
  - synchronization with the device
  - data allocation / release / transfer
The overhead is larger than that of the Cray compiler
- reduction has a large overhead
The performance of the GPU kernels themselves is better than that of the Cray compiler

Conclusion
We implemented a source-to-source OpenACC compiler for CUDA
- C with OpenACC directives → C with CUDA API
- using the Omni compiler infrastructure
In most cases, the performance of the GPU code produced by our compiler is higher than that of a single CPU core
- speedup of up to 31x for N-body
Our compiler successfully exploits the CUDA backend through the source-to-source approach
- the performance is often better than that of the Cray compiler
There is room for performance improvement
- using suitable grid and block sizes
- reducing the overhead of synchronization and reduction

Future Work
Optimization
- tuning the block size at compile time
- reducing the overhead of synchronization and reduction
Support the full set of directives to conform to the OpenACC specification
- we will release our compiler at the next SC