
Programming with OmpSs
Vicenç Beltran
Seminaris d'Empresa 2013, Barcelona

2. Outline
Motivation
– Parallel programming
– Heterogeneous programming
OmpSs
– Philosophy
– Tool-chain
– Execution model
– Integration with CUDA/OpenCL
– Performance
Conclusions

3. Motivation: Parallel programming
– Pthreads: hard and error-prone (deadlocks, race conditions, …)
– OpenMP: limited to parallel loops on SMP machines
– MPI: message passing for clusters
– New parallel programming models (MapReduce, Intel TBB, PGAS, …): more powerful and safer, but the effort to port legacy applications is too high

4. Motivation: Heterogeneous programming
– Two main alternatives, CUDA and OpenCL (very similar), both consisting of:
  – An accelerator language (CUDA C / OpenCL C)
  – A host API for
    – Data transfers between the two address spaces (host memory and device memory), e.g.:
      cudaMemcpy(devh, h, sizeof(*h)*nr*DIM2_H, cudaMemcpyHostToDevice);
    – Kernel management (compilation, execution, …)
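To make the host-API burden concrete, here is a minimal CUDA host-side sketch (not taken from the slides; the kernel mirrors the scale_task_cuda example used later in the deck, and the array size is arbitrary): allocate device memory, copy inputs host-to-device, launch the kernel, and copy the results back.

// minimal_host.cu -- illustrative sketch of the host-side steps listed above
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(double *b, const double *c, double scalar, int size)
{
    int j = blockDim.x * blockIdx.x + threadIdx.x;
    if (j < size) b[j] = scalar * c[j];
}

int main(void)
{
    const int n = 1024;
    double h_c[1024], h_b[1024];
    for (int i = 0; i < n; i++) h_c[i] = i;

    double *d_b, *d_c;
    cudaMalloc((void **)&d_b, n * sizeof(double));                      // device address space
    cudaMalloc((void **)&d_c, n * sizeof(double));
    cudaMemcpy(d_c, h_c, n * sizeof(double), cudaMemcpyHostToDevice);   // host -> device

    scale<<<(n + 127) / 128, 128>>>(d_b, d_c, 10.0, n);                 // kernel launch

    cudaMemcpy(h_b, d_b, n * sizeof(double), cudaMemcpyDeviceToHost);   // device -> host
    printf("b[1] = %f\n", h_b[1]);

    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}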

5. Motivation: Heterogeneous programming (OpenCL example)
– Two main alternatives, CUDA and OpenCL (very similar), both consisting of:
  – An accelerator language (CUDA C / OpenCL C)
  – A host API for
    – Data transfers (two address spaces)
    – Kernel management (compilation, execution, …)

Main.c
// Initialize device, context, and buffers
...
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float4) * n, srcB, NULL);

// create the kernel
kernel = clCreateKernel(program, "dot_product", NULL);

// set the argument values
err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &memobjs[0]);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *) &memobjs[1]);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *) &memobjs[2]);

// set work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 1;

// execute the kernel
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                             global_work_size, local_work_size, 0, NULL, NULL);

// read the results
err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                          n * sizeof(cl_float), dst, 0, NULL, NULL);
...

kernel.cl
__kernel void dot_product(__global const float4 *a,
                          __global const float4 *b,
                          __global float *c)        // dot() returns a scalar float
{
    int gid = get_global_id(0);
    c[gid] = dot(a[gid], b[gid]);
}

6. Outline
Motivation
– Parallel programming
– Heterogeneous programming
OmpSs
– Philosophy
– Tool-chain
– Execution model
– Integration with CUDA/OpenCL
– Performance
Conclusions

7. OmpSs: Philosophy
– Based on and compatible with OpenMP
  – Write sequential programs and run them in parallel
  – Supports most of the OpenMP annotations
– Extends OpenMP with function tasks and parameter annotations
  – Provides dynamic parallelism and automatic dependency management

#pragma omp task input([size] c) output([size] b)
void scale_task(double *b, double *c, double scalar, int size)
{
    int j;
    for (j = 0; j < size; j++)
        b[j] = scalar * c[j];
}
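To show how such a task is used, here is a minimal sketch that is not on the slide; it assumes it lives in the same file as the scale_task definition above and mirrors the invocation pattern of the example on slide 11. The annotated function is called like a plain C function, each call becomes a task, and a taskwait synchronizes before the results are consumed.

// Hypothetical driver for the annotated scale_task above
int main(void)
{
    static double A[1024], B[1024];
    for (int i = 0; i < 1024; i++) A[i] = i;

    scale_task(B, A, 10.0, 1024);   // becomes a task; dependencies derived from input/output clauses
    scale_task(A, B, 0.5, 1024);    // runs after the first task, since it reads B, which the first one writes

    #pragma omp taskwait            // wait for both tasks before A and B are accessed again
    return 0;
}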

8. OmpSs: Tool-chain
– Mercurium
  – Source-to-source compiler
  – Supports C, C++ and Fortran
– Nanos++
  – Common execution runtime (C, C++ and Fortran)
  – Task creation, dependency management, task scheduling, …
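As a usage note not shown on the slides (so treat the exact driver names and flags as an assumption): Mercurium ships compiler drivers such as mcc for C and mcxx for C++, and an OmpSs program is typically built with an invocation like mcc --ompss -o scale scale.c; the resulting binary is linked against the Nanos++ runtime, which creates and schedules the tasks at run time.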

9. OmpSs: Execution model
– Dataflow execution model (dependencies derived from the input/output annotations)
– Dynamic task scheduling on the available resources

#pragma omp task inout([TS][TS]A)
void spotrf(float *A, int TS);
#pragma omp task input([TS][TS]T) inout([TS][TS]B)
void strsm(float *T, float *B, int TS);
#pragma omp task input([TS][TS]A, [TS][TS]B) inout([TS][TS]C)
void sgemm(float *A, float *B, float *C, int TS);
#pragma omp task input([TS][TS]A) inout([TS][TS]C)
void ssyrk(float *A, float *C, int TS);

void Cholesky(int NT, float *A[NT][NT])   // TS: tile size (global constant)
{
    for (int k = 0; k < NT; k++) {
        spotrf(A[k][k], TS);
        for (int i = k+1; i < NT; i++)
            strsm(A[k][k], A[k][i], TS);
        for (int i = k+1; i < NT; i++) {
            for (int j = k+1; j < i; j++)
                sgemm(A[k][i], A[k][j], A[j][i], TS);
            ssyrk(A[k][i], A[i][i], TS);
        }
    }
}

[Figure: the matrix is blocked into NT×NT tiles of TS×TS elements]

10. OmpSs: Integration with CUDA/OpenCL
– #pragma omp target device(CUDA|OCL)
  – Identifies the following function as a CUDA C / OpenCL C kernel
– #pragma omp task input(…) output(…), together with an ndrange(dim, size, block_size) clause
  – Specifies inputs/outputs as usual and provides the information needed to launch the kernel
– No need to modify the CUDA C code

kernel.cu
__global__ void scale_task_cuda(double *b, double *c, double scalar, int size)
{
    int j = blockDim.x * blockIdx.x + threadIdx.x;
    if (j < size) {
        b[j] = scalar * c[j];
    }
}
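For intuition only (this is not code generated by the runtime, and the exact rounding policy is an assumption): with ndrange(1, size, 128), the launch performed internally by Nanos++ corresponds roughly to the plain CUDA call below.

// Approximate equivalent of ndrange(1, size, 128) for the kernel above
static void launch_scale_task_cuda(double *d_b, double *d_c, double scalar, int size)
{
    dim3 block(128);
    dim3 grid((size + block.x - 1) / block.x);   // round the global size up to whole blocks
    scale_task_cuda<<<grid, block>>>(d_b, d_c, scalar, size);
}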

11. OmpSs: Integration with CUDA/OpenCL (example)

main.c
#pragma omp target device(smp) copy_deps
#pragma omp task input([size] c) output([size] b)
void scale_task(double *b, double *c, double scalar, int size)
{
    for (int j = 0; j < size; j++)
        b[j] = scalar * c[j];
}

#pragma omp target device(cuda) copy_deps ndrange(1, size, 128)
#pragma omp task input([size] c) output([size] b)
__global__ void scale_task_cuda(double *b, double *c, double scalar, int size);

double A[1024], B[1024], C[1024];
double D[1024], E[1024];

int main()
{
    ...
    scale_task_cuda(A, B, 10.0, 1024);  // T1: A and B are transferred to the device before the task executes
    scale_task_cuda(B, A, 0.01, 1024);  // T2: no data transfer; executes after T1
    scale_task(C, A, 2.0, 1024);        // T3: A has to be transferred to the host; can be done in parallel with T2
    scale_task_cuda(D, E, 5.0, 1024);   // T4: D and E have to be transferred to the GPU; can be done at the very beginning
    scale_task_cuda(B, C, 3.0, 1024);   // T5: C has to be transferred to the GPU; can be done when T3 finishes
    #pragma omp taskwait                // D and E are copied back to the host; afterwards any of A, B, C, D, E can be accessed
}

12. OmpSs: Performance
– Dataflow execution (asynchronous)
– Overlapping of data transfers and computation
  – CUDA streams / OpenCL asynchronous copies
– Data prefetching from/to CPUs/GPUs
– Low-level optimizations
[Figure: the Nanos++ management thread on the host side pipelines, at each step, copying the outputs of task i-1 (device-to-host stream), launching and executing the kernel of task i, and copying the inputs of task i+1 (host-to-device stream), synchronizing the streams as needed.]
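As background on the mechanism (a hand-written sketch of what Nanos++ automates from the task graph; the buffer names, sizes and chunking scheme are assumptions, not taken from the slides): overlap comes from issuing asynchronous copies and kernel launches on separate CUDA streams, using pinned host memory, and synchronizing only when results are needed.

// overlap_sketch.cu -- manual double-buffered overlap with CUDA streams
#include <cuda_runtime.h>

__global__ void scale(double *b, const double *c, double s, int n)
{
    int j = blockDim.x * blockIdx.x + threadIdx.x;
    if (j < n) b[j] = s * c[j];
}

int main(void)
{
    const int n = 1 << 20, chunks = 4, cn = n / chunks;
    double *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost((void **)&h_in, n * sizeof(double));   // pinned host memory enables async copies
    cudaMallocHost((void **)&h_out, n * sizeof(double));
    cudaMalloc((void **)&d_in, n * sizeof(double));
    cudaMalloc((void **)&d_out, n * sizeof(double));
    for (int i = 0; i < n; i++) h_in[i] = i;

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int c = 0; c < chunks; c++) {
        cudaStream_t st = s[c % 2];                       // alternate streams so copies and kernels overlap
        double *di = d_in + c * cn, *dout = d_out + c * cn;
        cudaMemcpyAsync(di, h_in + c * cn, cn * sizeof(double), cudaMemcpyHostToDevice, st);
        scale<<<(cn + 127) / 128, 128, 0, st>>>(dout, di, 2.0, cn);
        cudaMemcpyAsync(h_out + c * cn, dout, cn * sizeof(double), cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();                              // wait only once, when the results are needed

    cudaFree(d_in); cudaFree(d_out);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}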

13. Conclusions
OmpSs is a programming model that enables:
– Incremental parallelization of sequential code
– A data-flow (asynchronous) execution model
– Good support for heterogeneous environments
– Many optimizations under the hood
  – Advanced scheduling policies
  – Work stealing / load balancing
  – Data prefetching
– Advanced features
  – MPI task offload
  – Dynamic load balancing
  – The implements clause (alternative task implementations, see Appendix II)
OmpSs is open source
– Take a look at

14. Appendix: Input/output specification
– Whole (multidimensional) arrays
– Array ranges

int off_x = …, size_x = …, off_y = …, size_y = …;
#pragma omp target device(gpu) copy_deps
#pragma omp task input(A)                               \
                 output(A[i][j])                        \
                 output([2][3]A)                        \
                 output(A[off_x;size_x][off_y;size_y])
void foo_task(float A[SIZE][SIZE], int i, int j);

15. Appendix II: the "implements" clause

main.c
#pragma omp target device(smp) copy_deps
#pragma omp task input([size] c) output([size] b)
void scale_task(double *b, double *c, double scalar, int size)
{
    for (int j = 0; j < size; j++)
        b[j] = scalar * c[j];
}

#pragma omp target device(cuda) copy_deps ndrange(1, size, 128)
#pragma omp task input([size] c) output([size] b) implements(scale_task)
__global__ void scale_task_cuda(double *b, double *c, double scalar, int size);

double A[1024], B[1024], C[1024], D[1024], E[1024];

int main()
{
    ...
    scale_task(A, B, 10.0, 1024);  // T1
    scale_task(B, A, 0.01, 1024);  // T2
    scale_task(C, A, 2.0, 1024);   // T3
    scale_task(D, E, 5.0, 1024);   // T4
    scale_task(B, C, 3.0, 1024);   // T5
    #pragma omp taskwait           // can access any of A, B, C, D, E
}

kernel.cu
__global__ void scale_task_cuda(double *b, double *c, double scalar, int size)
{
    int j = blockDim.x * blockIdx.x + threadIdx.x;
    if (j < size) {
        b[j] = scalar * c[j];
    }
}

With implements(scale_task), scale_task_cuda is registered as an alternative (CUDA) implementation of scale_task; the application only calls scale_task, and the runtime chooses at run time whether each task instance executes the SMP or the CUDA version depending on the available resources.

16. Appendix III: Known issues
– Only functions that return void can be tasks
– No dependencies on parameters passed by value
– Local variables may "escape" the scope of the executing task, as in the example below

#pragma omp task out([size] tmp) out(*res)
void foo_task(int *tmp, int size, int *res);

int main(…)
{
    int res = 0;
    for (int i = 0; …) {
        int tmp[N];               // local array: may go out of scope before the task runs
        foo_task(tmp, N, &res);
    }
    #pragma omp taskwait
}
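One possible workaround (a sketch that is not on the slides; N and ITERS are placeholder constants): keep the buffer alive for as long as the task may need it, for example by hoisting it out of the loop and waiting for the task before the buffer is reused.

enum { N = 1024, ITERS = 16 };    // placeholder sizes for illustration

int main(void)
{
    int res = 0;
    int tmp[N];                   // hoisted: outlives every task created in the loop
    for (int i = 0; i < ITERS; i++) {
        foo_task(tmp, N, &res);
        #pragma omp taskwait      // ensure the task finished before tmp is reused
                                  // (serializes iterations; heap-allocated per-iteration
                                  //  buffers would allow more concurrency)
    }
    return 0;
}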

17. Hands-on
Account information
– Host: bscgpu1.bsc.es
– Username/password: nct01XXX / PwD.AE2013.XXX (XXX-> )
– My home: /home/nct/nct00002/seminario2003
First step
– Read the README file in each directory: hello_world, cholesky, nbody
Job queue system
– mnsubmit run.sh
– mnq
– mncancel
