
1 Compilers & Tools for HPC January 2014 www.pgroup.com

2 Who is PGI?
The Portland Group (PGI)
- Founded in 1989 – 20+ years in the HPC business
- Acquisition by NVIDIA closed on July 29th, 2013
Compilers & SW Development Tools
- PGI-developed compilers, debugger, performance profiler
- 3rd-party Integrated Development Environments (IDEs)
HPC & Technical Computing Market
- Nat'l Labs, Research Labs, Research Universities, Oil & Gas, Aerospace, Pharmaceutical, …

3 PGI Installations: PGI has over 25,000 users at over 5,000 sites worldwide.

4 NVIDIA and PGI
World-class HPC companies need world-class compiler teams. NVIDIA + PGI = integrated HPC system compilers.
As part of NVIDIA, PGI will:
- Continue to create world-class HPC Fortran/C/C++ compilers for CPU+Accelerator systems
- Accelerate development and propagation of OpenACC and CUDA Fortran
- Increase velocity on accelerator-enablement of HPC applications

5 www.pgroup.com
- C99, C++, Fortran 2003 compilers: optimizing, vectorizing, parallelizing
- Graphical parallel tools: PGDBG® debugger, PGPROF® profiler
- AMD, Intel, NVIDIA processors: PGI Unified Binary® technology for performance portability (see the sketch below)
- Linux, OS X, Windows: Visual Studio & Eclipse integration
- PGI Accelerator features: OpenACC C/C++/Fortran compilers, CUDA Fortran compiler, CUDA-x86
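With PGI Unified Binary technology, one executable carries optimized code paths for several CPU targets and dispatches among them at run time. A minimal sketch (the target names below are illustrative; the available -tp sub-options vary by release):

% pgcc -fast -tp=nehalem-64,bulldozer-64 -o myapp myapp.c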

8 PGI OpenACC Compilers

9 OpenACC: Open Programming Standard for Parallel Computing
"PGI OpenACC will enable programmers to easily develop portable applications that maximize the performance and power efficiency benefits of the hybrid CPU/GPU architecture of Titan."
-- Buddy Bland, Titan Project Director, Oak Ridge National Lab
"OpenACC is a technically impressive initiative brought together by members of the OpenMP Working Group on Accelerators, as well as many others. We look forward to releasing a version of this proposal in the next release of OpenMP."
-- Michael Wong, CEO, OpenMP Directives Board
OpenACC Members

10 OpenACC Directives

Program myscience
  ... serial code ...
  !$acc kernels
  do k = 1,n1
    do i = 1,n2
      ... parallel code ...
    enddo
  enddo
  !$acc end kernels
  ...
End Program myscience

OpenACC compiler directives are portable compiler hints: the compiler parallelizes the annotated code, and the model is designed for multicore CPUs and many-core GPUs/accelerators. The serial portions run on the CPU; the kernels region is offloaded to the GPU.
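Directives are activated at compile time; a typical build line (-acc and -Minfo=accel are standard PGI options, the file name is illustrative):

% pgfortran -acc -Minfo=accel myscience.f90

-Minfo=accel asks the compiler to report which loops it parallelized and what data movement it generated, which is the feedback loop used for tuning.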

11 How Do OpenACC Compilers Work?

#pragma acc kernels loop
for( i = 0; i < nrows; ++i ){
  float val = 0.0f;
  for( d = 0; d < nzeros; ++d ){
    j = i + offset[d];
    if( j >= 0 && j < nrows )
      val += m[i+nrows*d] * v[j];
  }
  x[i] = val;
}

The compiler splits this into x86 host code and GPU code, compiled and linked together into a single unified object.

x86 asm code (host side):

matvec:
  subq $328, %rsp
  ...
  call __pgi_cu_alloc
  ...
  call __pgi_cu_uploadx
  ...
  call __pgi_cu_launch2
  ...
  call __pgi_cu_downloadx
  ...
  call __pgi_cu_free
  ...

GPU asm code (device side):

.entry matvec_14_gpu(
  .reg .u32 %r
  ...
  cvt.s32.u32 %r1, %tid.x;
  mov.s32 %r2, 0;
  setp.ne.s32 $p1, %r1, %r2
  cvt.s32.u32 %r3, %ctaid.x;
  cvt.s32.u32 %r4, %ntid.x;
  mul.lo.s32 %r5, %r3, %r4;
  @%p1 bra $Lt_0_258;
  st.shared.s32 [__i2s], %r5
$Lt_0_258:
  bar.sync 0;
  ...

Compile, link, and execute as usual: no change to existing makefiles, scripts, IDEs, programming environment, etc.

12 OpenACC Coding Example

#pragma acc data \
  copy(b[0:n][0:m]) \
  create(a[0:n][0:m])
{
  for (iter = 1; iter <= p; ++iter){
    #pragma acc kernels
    {
      for (i = 1; i < n-1; ++i){
        for (j = 1; j < m-1; ++j){
          a[i][j] = w0*b[i][j] +
                    w1*(b[i-1][j]+b[i+1][j]+b[i][j-1]+b[i][j+1]) +
                    w2*(b[i-1][j-1]+b[i-1][j+1]+b[i+1][j-1]+b[i+1][j+1]);
        }
      }
      for( i = 1; i < n-1; ++i )
        for( j = 1; j < m-1; ++j )
          b[i][j] = a[i][j];
    }
  }
}

The data region keeps b resident in GPU memory across all p iterations (the slide's diagram shows host-memory B beside its successive GPU states S1(B) … Sp(B)); a is created on the device and never copied.

13 Matrix Multiply Source Code Size Comparison

OpenACC:

void
computeMM0_saxpy(float C[][WB], float A[][WA], float B[][WB],
                 int hA, int wA, int wB)
{
#pragma acc region
  {
#pragma acc for parallel vector(16) unroll(4)
    for (int i = 0; i < hA; ++i) {
      for (int j = 0; j < wB; ++j) {
        C[i][j] = 0.0;
      }
      for (int k = 0; k < wA; ++k) {
        for (int j = 0; j < wB; ++j) {
          C[i][j] = C[i][j] + A[i][k]*B[k][j];
        }
      }
    }
  }
}

OpenCL:

void matrixMulGPU(cl_uint ciDeviceCount, cl_mem h_A, float* h_B_data,
                  unsigned int mem_size_B, float* h_C )
{
  cl_mem d_A[MAX_GPU_COUNT];
  cl_mem d_C[MAX_GPU_COUNT];
  cl_mem d_B[MAX_GPU_COUNT];

  cl_event GPUDone[MAX_GPU_COUNT];
  cl_event GPUExecution[MAX_GPU_COUNT];

  // Create buffers for each GPU
  // Each GPU will compute sizePerGPU rows of the result
  int sizePerGPU = HA / ciDeviceCount;

  int workOffset[MAX_GPU_COUNT];
  int workSize[MAX_GPU_COUNT];

  workOffset[0] = 0;
  for(unsigned int i=0; i < ciDeviceCount; ++i)
  {
    // Input buffer
    workSize[i] = (i != (ciDeviceCount - 1)) ? sizePerGPU : (HA - workOffset[i]);

    d_A[i] = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, workSize[i] * sizeof(float) * WA, NULL, NULL);

    // Copy only assigned rows from host to device
    clEnqueueCopyBuffer(commandQueue[i], h_A, d_A[i], workOffset[i] * sizeof(float) * WA,
                        0, workSize[i] * sizeof(float) * WA, 0, NULL, NULL);

    // Create OpenCL buffer on device that will be initialized from the host memory on first use on device
    d_B[i] = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            mem_size_B, h_B_data, NULL);

    // Output buffer
    d_C[i] = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY, workSize[i] * WC * sizeof(float), NULL, NULL);

    // Set the args values
    clSetKernelArg(multiplicationKernel[i], 0, sizeof(cl_mem), (void *) &d_C[i]);
    clSetKernelArg(multiplicationKernel[i], 1, sizeof(cl_mem), (void *) &d_A[i]);
    clSetKernelArg(multiplicationKernel[i], 2, sizeof(cl_mem), (void *) &d_B[i]);
    clSetKernelArg(multiplicationKernel[i], 3, sizeof(float) * BLOCK_SIZE * BLOCK_SIZE, 0);
    clSetKernelArg(multiplicationKernel[i], 4, sizeof(float) * BLOCK_SIZE * BLOCK_SIZE, 0);

    if(i+1 < ciDeviceCount)
      workOffset[i + 1] = workOffset[i] + workSize[i];
  }

  // Execute multiplication on all GPUs in parallel
  size_t localWorkSize[] = {BLOCK_SIZE, BLOCK_SIZE};
  size_t globalWorkSize[] = {shrRoundUp(BLOCK_SIZE, WC), shrRoundUp(BLOCK_SIZE, workSize[0])};

  // Launch kernels on devices
  for(unsigned int i = 0; i < ciDeviceCount; i++)
  {
    // Multiplication - non-blocking execution
    globalWorkSize[1] = shrRoundUp(BLOCK_SIZE, workSize[i]);
    clEnqueueNDRangeKernel(commandQueue[i], multiplicationKernel[i], 2, 0, globalWorkSize, localWorkSize,
                           0, NULL, &GPUExecution[i]);
  }
  for(unsigned int i = 0; i < ciDeviceCount; i++)
  {
    clFinish(commandQueue[i]);
  }
  for(unsigned int i = 0; i < ciDeviceCount; i++)
  {
    // Non-blocking copy of result from device to host
    clEnqueueReadBuffer(commandQueue[i], d_C[i], CL_FALSE, 0, WC * sizeof(float) * workSize[i],
                        h_C + workOffset[i] * WC, 0, NULL, &GPUDone[i]);
  }
  // CPU sync with GPU
  clWaitForEvents(ciDeviceCount, GPUDone);

  // Release mem and event objects
  for(unsigned int i = 0; i < ciDeviceCount; i++)
  {
    clReleaseMemObject(d_A[i]);
    clReleaseMemObject(d_C[i]);
    clReleaseMemObject(d_B[i]);
    clReleaseEvent(GPUExecution[i]);
    clReleaseEvent(GPUDone[i]);
  }
}

__kernel void
matrixMul( __global float* C, __global float* A, __global float* B,
           __local float* As, __local float* Bs)
{
  int bx = get_group_id(0), tx = get_local_id(0);
  int by = get_group_id(1), ty = get_local_id(1);
  int aEnd = WA * BLOCK_SIZE * by + WA - 1;

  float Csub = 0.0f;

  for (int a = WA*BLOCK_SIZE*by, b = BLOCK_SIZE * bx;
       a <= aEnd; a += BLOCK_SIZE, b += BLOCK_SIZE*WB) {
    As[tx + ty * BLOCK_SIZE] = A[a + WA * ty + tx];
    Bs[tx + ty * BLOCK_SIZE] = B[b + WB * ty + tx];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int k = 0; k < BLOCK_SIZE; ++k)
      Csub += As[k + ty * BLOCK_SIZE] * Bs[tx + k * BLOCK_SIZE];
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = Csub;
}

CUDA C:

__global__ void matrixMul( float* C, float* A, float* B, int wA, int wB)
{
  int bx = blockIdx.x;
  int by = blockIdx.y;
  int tx = threadIdx.x;
  int ty = threadIdx.y;
  int aBegin = wA * BLOCK_SIZE * by;
  int aEnd = aBegin + wA - 1;
  int aStep = BLOCK_SIZE;
  int bBegin = BLOCK_SIZE * bx;
  int bStep = BLOCK_SIZE * wB;
  float Csub = 0;
  for (int a = aBegin, b = bBegin;
       a <= aEnd;
       a += aStep, b += bStep) {
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
    AS(ty, tx) = A[a + wA * ty + tx];
    BS(ty, tx) = B[b + wB * ty + tx];
    __syncthreads();
    for (int k = 0; k < BLOCK_SIZE; ++k)
      Csub += AS(ty, k) * BS(k, tx);
    __syncthreads();
  }
  int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
  C[c + wB * ty + tx] = Csub;
}

void
domatmul( float* C, float* A, float* B, unsigned int hA, unsigned int wA, unsigned wB )
{
  unsigned int size_A = WA * HA;
  unsigned int mem_size_A = sizeof(float) * size_A;
  unsigned int size_B = WB * HB;
  unsigned int mem_size_B = sizeof(float) * size_B;
  unsigned int size_C = WC * HC;
  unsigned int mem_size_C = sizeof(float) * size_C;
  float *d_A, *d_B, *d_C;

  cudaMalloc((void**) &d_A, mem_size_A);
  cudaMalloc((void**) &d_B, mem_size_B);
  cudaMalloc((void**) &d_C, mem_size_C);
  cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, mem_size_B, cudaMemcpyHostToDevice);

  dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
  dim3 grid(WC / threads.x, HC / threads.y);

  matrixMul<<<grid, threads>>>(d_C, d_A, d_B, WA, WB);

  cudaMemcpy(h_C, d_C, mem_size_C, cudaMemcpyDeviceToHost);

  cudaFree(d_A);
  cudaFree(d_B);
  cudaFree(d_C);
}

14 Accelerating SEISMIC_CPML from the University of Pau. Read this article online at www.pgroup.com/pginsider

15 SEISMIC_CPML Timings

Version          | MPI Processes | OpenMP Threads | GPUs | Time (sec) | Approx. Programming Time (min)
Original MPI/OMP | 2             | 4              | 0    | 951        |
ACC Steps 1/2    | 2             | 0              | 2    | 3100       | 10
ACC Step 3       | 2             | 0              | 2    | 550        | 60
ACC Step 4       | 2             | 0              | 2    | 124        | 120
ACC Step 5       | 2             | 0              | 2    | 120        |

5x in 5 hours!

System info: 4-core Intel Core i7 920 running at 2.67 GHz, with 2 Tesla C2070 GPUs. Problem size: 101x641x128.

17 Cloverleaf Mini-App Performance
Cloverleaf is a US National Labs Trinity/Coral mini-app benchmark: https://github.com/Warwick-PCAV/CloverLeaf/wiki/Performance-Table
(Chart: run-time; lower is better.) NVIDIA benchmarks: dual-socket Intel Xeon E5-2667.

18 OpenACC: Performance with Less Effort
Cloverleaf: http://www.computer.org/csdl/proceedings/sccompanion/2012/4956/00/4956a465-abs.html

19 Why Use OpenACC Directives?
- Productivity: a higher-level programming model, similar to OpenMP but designed for accelerated computing
- Portability: ignore the directives and the code still compiles for the host; the same directives are portable across different types of accelerators, giving performance portability (see the sketch below)
- Performance feedback: information unique to compilers, enabling incremental porting and tuning
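The portability point is easy to see in code. A minimal sketch (hypothetical SAXPY example, not from the slides): the directive is a hint, so a compiler that does not understand OpenACC simply ignores the pragma and builds ordinary serial C.

#include <stdlib.h>

/* y = a*x + y; with -acc the loop is offloaded, without it the
   pragma is ignored and the same source compiles as serial code. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

% pgcc -acc -Minfo=accel -c saxpy.c   (accelerator build)
% pgcc -c saxpy.c                     (host-only build, directive ignored)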

20 PGI CUDA Fortran

21 CUDA Fortran in a Nutshell

GPU code:

attributes(global) subroutine mm_kernel( A, B, C, N, M, L )
  real :: A(N,M), B(M,L), C(N,L), Cij
  integer, value :: N, M, L
  integer :: i, j, kb, k, tx, ty
  real, shared :: Asub(16,16), Bsub(16,16)
  tx = threadidx%x
  ty = threadidx%y
  i = (blockidx%x-1) * 16 + tx
  j = (blockidx%y-1) * 16 + ty
  Cij = 0.0
  do kb = 1, M, 16
    Asub(tx,ty) = A(i,kb+tx-1)
    Bsub(tx,ty) = B(kb+ty-1,j)
    call syncthreads()
    do k = 1,16
      Cij = Cij + Asub(tx,k) * Bsub(k,ty)
    enddo
    call syncthreads()
  enddo
  C(i,j) = Cij
end subroutine mm_kernel

Host code:

real, device, allocatable, dimension(:,:) :: Adev, Bdev, Cdev
...
allocate (Adev(N,M), Bdev(M,L), Cdev(N,L))
Adev = A(1:N,1:M)
Bdev = B(1:M,1:L)
call mm_kernel<<<grid, block>>>( Adev, Bdev, Cdev, N, M, L )
C(1:N,1:L) = Cdev
deallocate ( Adev, Bdev, Cdev )
...
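Building it is ordinary Fortran compilation; a sketch (the file name is illustrative):

% pgfortran -fast mmul.cuf

The .cuf suffix enables CUDA Fortran automatically; for other suffixes the -Mcuda flag does the same.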

22 CUDA Fortran / OpenACC Interoperability

module mymod
  real, dimension(:), allocatable, device :: xDev
end module
...
use mymod
...
allocate( xDev(n) )   ! allocates xDev in GPU memory
call init_kernel<<<grid, block>>>( xDev, n )
...
!$acc data copy( y(:) )   ! no need to copy xDev
...
!$acc kernels loop
do i = 1, n
  y(i) = y(i) + a*xDev(i)
enddo
...
!$acc end data

23 CUDA Fortran Supports Generic Interfaces and Overloading

module cublas
  ! isamax
  interface isamax
    integer function isamax(n, x, incx)
      integer :: n, incx
      real(4) :: x(*)
    end function
    integer function isamaxcu(n, x, incx) bind(c, name='cublasIsamax')
      integer, value :: n, incx
      real(4), device :: x(*)
    end function
  end interface
  ...

use cublas
real(4), device, allocatable :: xd(:)
real(4) x(N)
call random_number(x)
! Alloc xd in device memory, copy x to xd, invoke overloaded isamax
allocate(xd(N))
xd = x
j = isamax(N,xd,1)
! On the host, same name
k = isamax(N,x,1)

24 CUDA Fortran Supports Encapsulation

Isolate device data and accelerator kernel declarations in Fortran modules:

module mm
  real, device, allocatable :: a(:)
  real, device :: x, y(10)
  real, constant :: c1, c2(10)
  integer, device :: n
contains
  attributes(global) subroutine s( b )
  ...

Partition source into sections written and maintained by accelerator experts vs. those evolved by science and engineering domain experts.

25 !$CUF Kernel Directives Simplify Kernel Creation

module madd_device_module
  use cudafor
contains
  subroutine madd_dev(a,b,c,sum,n1,n2)
    real, dimension(:,:), device :: a,b,c
    real :: sum
    integer :: n1, n2
    type(dim3) :: grid, block
    !$cuf kernel do (2) <<<grid, block>>>
    do j = 1,n2
      do i = 1,n1
        a(i,j) = b(i,j) + c(i,j)
        sum = sum + a(i,j)
      enddo
    enddo
  end subroutine
end module

Equivalent hand-written CUDA kernels:

module madd_device_module
  use cudafor
  implicit none
contains
  attributes(global) subroutine madd_kernel(a,b,c,blocksum,n1,n2)
    real, dimension(:,:) :: a,b,c
    real, dimension(:) :: blocksum
    integer, value :: n1,n2
    integer :: i,j,tindex,tneighbor,bindex
    real :: mysum
    real, shared :: bsum(256)
    ! Do this thread's work
    mysum = 0.0
    do j = threadidx%y + (blockidx%y-1)*blockdim%y, n2, blockdim%y*griddim%y
      do i = threadidx%x + (blockidx%x-1)*blockdim%x, n1, blockdim%x*griddim%x
        a(i,j) = b(i,j) + c(i,j)
        mysum = mysum + a(i,j)  ! accumulates partial sum per thread
      enddo
    enddo
    ! Now add up all partial sums for the whole thread block.
    ! Compute this thread's linear index in the thread block;
    ! we assume 256 threads in the thread block.
    tindex = threadidx%x + (threadidx%y-1)*blockdim%x
    ! Store this thread's partial sum in the shared memory block
    bsum(tindex) = mysum
    call syncthreads()
    ! Accumulate all the partial sums for this thread block to a single value
    tneighbor = 128
    do while( tneighbor >= 1 )
      if( tindex <= tneighbor ) &
        bsum(tindex) = bsum(tindex) + bsum(tindex+tneighbor)
      tneighbor = tneighbor / 2
      call syncthreads()
    enddo
    ! One thread per block stores that block's partial sum
    if( tindex == 1 ) then
      bindex = blockidx%x + (blockidx%y-1)*griddim%x
      blocksum(bindex) = bsum(1)
    endif
  end subroutine
  ...
  subroutine madd_dev(a,b,c,dsum,n1,n2)
    ...
    call madd_kernel<<<grid, block>>>(a,b,c,blocksum,n1,n2)
    call madd_sum_kernel<<<1, 256>>>(blocksum,dsum,nb)
    r = cudaThreadSynchronize()  ! don't deallocate too early
    deallocate(blocksum)
  end subroutine
end module

26 2014

27 PGI 2014 Multi-core x64 Highlights
- Performance tuning: OpenMP 75% faster than GCC
- Comprehensive MPI support
- Free PGI for your MacBook

28 Industry-leading Multi-core x86 Performance
SPECompG_base2012 relative performance as measured by The Portland Group during the week of July 29, 2013. The number of OpenMP threads was set to match the number of cores on each system. SPEComp® is a registered trademark of the Standard Performance Evaluation Corporation (SPEC).

29 Comprehensive MPI Support
Debug, run, and profile MPI applications built with Open MPI, MVAPICH2, MPICH3, and SGI MPI.
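In practice this means the usual MPI compiler wrappers can drive the PGI compilers, and PGI's tools understand the resulting MPI programs. A sketch (wrapper names depend on the MPI installation; the file name is illustrative):

% mpif90 -fast myapp.f90 -o myapp
% mpirun -np 4 ./myapp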

30 What Is in Free PGI for OS X?
- OpenMP 3.1 & auto-parallelizing Fortran 2003 and C99 compilers (see the sketch below)
- Optimized for the latest multi-core x86-64 CPUs
- Supported on Mountain Lion and Mavericks with Xcode 5
- Includes a command-level parallel Fortran debugger
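As a sketch of the OpenMP support (hypothetical example, not from the slides), a reduction loop compiled with PGI's -mp flag:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    double sum = 0.0;
    /* Threads split the iterations; the reduction clause combines
       each thread's partial sum at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 1000000; ++i)
        sum += 1.0 / (double)i;
    printf("sum = %f (max threads = %d)\n", sum, omp_get_max_threads());
    return 0;
}

% pgcc -mp -fast sum.c -o sum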

31 PGI Accelerator 2014 Highlights
- Support for NVIDIA Tesla K40 and AMD Radeon GPUs
- OpenACC 2.0 features and optimizations
- CUDA Fortran and OpenACC debugging

32 OpenACC Performance Portability
CPU results are one core of an Intel Core i7-3930 CPU @ 3.20 GHz (Sandy Bridge).

33 PGI 14.1 OpenACC New Features
- Accelerator-side procedure calls using the routine directive
- Unstructured data lifetimes using enter data and exit data
- Comprehensive host_data support in Fortran
- declare create and declare device_resident
- Fortran deviceptr data clause
- Multidimensional dynamically allocated C/C++ arrays (see the sketch after this list)
- OpenACC 2.0 API function calls
- Calling of CUDA Fortran atomic functions in OpenACC regions
- Tuning, tuning, tuning … driven by SPEC ACCEL benchmarks, COSMO, NIM, Cloverleaf, WRF, Gaussian, proprietary customer codes, …
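For example, a dynamically allocated 2-D C array can appear directly in a data clause; a minimal sketch (hypothetical code, assuming the PGI 14.1 feature listed above):

#include <stdlib.h>

void scale(int n, int m, float factor)
{
    /* Pointer-to-pointer layout: n row pointers, each to m floats. */
    float **a = (float **)malloc(n * sizeof(float *));
    for (int i = 0; i < n; ++i) {
        a[i] = (float *)malloc(m * sizeof(float));
        for (int j = 0; j < m; ++j)
            a[i][j] = (float)(i + j);
    }

    /* The compiler generates the per-row transfers for a[0:n][0:m]. */
    #pragma acc parallel loop copy(a[0:n][0:m])
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            a[i][j] *= factor;

    for (int i = 0; i < n; ++i)
        free(a[i]);
    free(a);
}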

34 OpenACC Procedure Calls

Inlining required, single file:

void matvec(... ){
  float s = 0.0;
  #pragma acc loop vector reduction(+:s)
  for(int j=0; j<n; ++j)
    s += a[i*n+j]*v[j];
  x[i] = s;
}

void test(... ){
  #pragma acc parallel loop gang ...
  for(int i=0; i<n; ++i)
    matvec(a, v, x, i, n);
}

% pgcc -acc -Minline test.c

Separate compilation + link step:

#pragma acc routine vector
void matvec(...){
  float s = 0.0;
  #pragma acc loop vector reduction(+:s)
  for(int j=0; j<n; ++j)
    s += a[i*n+j]*v[j];
  x[i] = s;
}

void test(...){
  #pragma acc parallel loop gang ...
  for(int i=0; i<n; ++i)
    matvec(a, v, x, i, n);
}

% pgcc -acc -c matvec.c
% pgcc -acc -c test.c
% pgcc -acc test.o matvec.o

Simplifies porting, and allows separate compilation and libraries of device-side procedures.

35 OpenACC Unstructured Data Lifetimes

Initialization on host:

alloc(){
  x = (float*)malloc(...);
  ...
}
do_init(){
  alloc();
  for( i=0; i<n; ++i)
    x[i] = ...
}
int main(){
  do_init();
  #pragma acc data copy(x[0:n])
  for( time=0; time<n; ++time)
    process( time );
  genoutput();
}

Initialization on device:

alloc(){
  x = (float*)malloc(...);
  ...
  #pragma acc enter data create(x[0:n])
}
do_init(){
  alloc();
  #pragma acc parallel present(x[0:n])
  for( i=0; i<n; ++i)
    x[i] = ...
}
int main(){
  do_init();
  #pragma acc data present(x[0:n])
  for( time=0; time<n; ++time)
    process( time );
  #pragma acc exit data copyout(x[0:n])
  genoutput();
}

Enables fine-tuning of data movement through dynamic data management.

36 OpenACC host_data Directive

cudaMalloc needed:

void cudaproc(...){
  cudakernel<<<grid, block>>>( a, n );
}
void test(...){
#ifdef CUDA
  cudaMalloc( &a, sizeof(float)*n );
#else
  a = malloc( sizeof(float)*n );
#endif
  ...
  #pragma acc parallel loop deviceptr(a)
  for( i=0; i<n; ++i )
    a[i] = ...
#ifdef CUDA
  cudaproc( a, n );
#else
  hostproc( a, n );
#endif
}

% nvcc -c cproc.cu
% pgcc -DCUDA -acc test.c cproc.o -Mcuda

host_data allows a runtime test:

void cudaproc(...){
  cudakernel<<<grid, block>>>( a, n );
}
void test(...){
  a = malloc( sizeof(float)*n );
  ...
  #pragma acc data copy(a[0:n])
  {
    #pragma acc parallel loop
    for( i=0; i<n; ++i )
      a[i] = ...
    if( usingcuda ){
      #pragma acc host_data use_device(a)
      cudaproc( a, n );
    }else{
      hostproc( a, n );
    }
  }
}

% nvcc -c cproc.cu
% pgcc -acc test.c cproc.o -Mcuda

Enhances interoperability of CUDA and OpenACC, and enables use of GPUDirect.

37 OpenACC Atomic Operations

miniMD contains a race condition: multiple iterations of the parallel loop update the same elements of f.

#pragma acc data copyout(f[0:3*nall]) copyin(x[0:3*nall], numneigh[0:nlocal], neighbors[0:nlocal*maxneighs])
{
  // clear force on own and ghost atoms
  #pragma acc kernels loop
  for(int i = 0; i < nall; i++) {
    ……
  }
  #pragma acc kernels loop independent
  {
    for(int i = 0; i < nlocal; i++) {
      ……
      for(int k = 0; k < numneighs; k++) {
        j = neighs[k];
        ……
        if(GHOST_NEWTON || j < nlocal) {
          f[3 * j + 0] -= delx * force;
          f[3 * j + 1] -= dely * force;
          f[3 * j + 2] -= delz * force;
        }

With OpenACC atomic, the same updates become:

        if(GHOST_NEWTON || j < nlocal) {
          #pragma acc atomic update
          f[3 * j + 0] -= delx * force;
          #pragma acc atomic update
          f[3 * j + 1] -= dely * force;
          #pragma acc atomic update
          f[3 * j + 2] -= delz * force;
        }

Without compiler support for the OpenACC 2.0 atomic directives this fails and would otherwise require a total re-write. That support is currently scheduled for PGI 14.4; in the meantime …

38 OpenACC Atomic Operations

The same miniMD race condition can, in the meantime, be resolved by calling CUDA atomic functions directly in the OpenACC region:

#pragma acc data copyout(f[0:3*nall]) copyin(x[0:3*nall], numneigh[0:nlocal], neighbors[0:nlocal*maxneighs])
{
  // clear force on own and ghost atoms
  #pragma acc kernels loop
  for(int i = 0; i < nall; i++) {
    ……
  }
  #pragma acc kernels loop independent
  {
    for(int i = 0; i < nlocal; i++) {
      ……
      for(int k = 0; k < numneighs; k++) {
        j = neighs[k];
        ……
        if(GHOST_NEWTON || j < nlocal) {
          //#pragma acc atomic update
          atomicSubf(f[3 * j + 0], delx * force);
          //#pragma acc atomic update
          atomicSubf(f[3 * j + 1], dely * force);
          //#pragma acc atomic update
          atomicSubf(f[3 * j + 2], delz * force);
        }

39 Debugging CUDA Fortran with Allinea DDT
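A typical session (a sketch; the program name is illustrative, and DDT invocation details vary by Allinea installation) compiles with debug information and launches under the debugger:

% pgfortran -g -O0 -Mcuda mmul.cuf -o mmul
% ddt ./mmul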

40 Key Value-Add of the PGI Compilers
PGI compilers enable performance-portable programming across CPU+Accelerator systems and minimize re-coding for each new hardware advancement.
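Concretely, a single compile line can target both the host and an accelerator (a sketch; accelerator sub-options vary by release):

% pgfortran -acc -ta=tesla,host -Minfo=accel app.f90 -o app

The resulting PGI Unified Binary uses the GPU when one is present and falls back to the host cores otherwise.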
