Porting physical parametrizations to GPUs using compiler directives
X. Lapillonne, O. Fuhrer (Eidgenössisches Departement des Innern EDI, Bundesamt für Meteorologie und Klimatologie MeteoSchweiz)

Presentation transcript:

Porting physical parametrizations to GPUs using compiler directives
X. Lapillonne, O. Fuhrer
Eidgenössisches Departement des Innern EDI
Bundesamt für Meteorologie und Klimatologie MeteoSchweiz

2 02/05/2011 X. Lapillonne
Outline
- Computing with GPUs
- GPU implementation using PGI directives
- The example of the microphysics parametrization
- Summary

3 02/05/2011 X. Lapillonne
Computing on Graphical Processing Units (GPUs)
- Benefit from the highly parallel architecture of GPUs
- Higher peak performance at lower cost / power consumption
- High memory bandwidth

[Table comparing CPU (AMD Opteron) and GPU (Fermi C2070): Cores | Freq. (MHz) | Peak Perf. S.P. (GFLOPs) | Peak Perf. D.P. (GFLOPs) | Memory Bandwidth (GB/sec) | Power Cons. (W); the numerical entries were not preserved in the transcript]

4 02/05/2011 X. Lapillonne
Execution model
[Figure: sequential host (CPU) code launches kernels on the device (GPU); the kernel grid is made of blocks, each block of threads, with data transfers between host and device]
- Copy data from CPU to GPU
- Load the GPU program (kernel) and execute it:
  – The same kernel is executed by all threads
  – Threads are grouped in blocks: synchronized (similar to SIMD vectorization) and sharing data through shared memory
  – Blocks are arranged in a grid: threads from different blocks are independent
- Copy data back from GPU to CPU
Two levels of parallelism (blocks and threads).
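A minimal sketch of these three steps with PGI accelerator directives (the routine and array names are illustrative, not from the slides): the copyin clause is the CPU-to-GPU transfer, the parallelized loop is the kernel executed by the threads, and copyout brings the result back.

! Sketch only: copyin = CPU->GPU transfer, loop body = GPU kernel,
! copyout = GPU->CPU transfer.
subroutine double_field(n, a, b)
  implicit none
  integer, intent(in) :: n
  real, intent(in)    :: a(n)
  real, intent(out)   :: b(n)
  integer :: i
!$acc region copyin(a) copyout(b)
  do i = 1, n                 ! one iteration per GPU thread
     b(i) = 2.0 * a(i)
  end do
!$acc end region
end subroutine double_field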

5 02/05/2011 X. Lapillonne
Computing on Graphical Processing Units (GPUs)
To be efficient, the code needs to exploit fine-grained parallelism so as to execute thousands of threads in parallel.
GPU code:
– Programming-level approaches: OpenCL, CUDA, CUDA Fortran (PGI), …
  Best performance, but require a complete rewrite
– Directive-based approaches: PGI, OpenMP-acc, HMPP (CAPS)
  Smaller modifications to the original code
  The resulting code is still understandable by Fortran programmers and can be easily modified
  Possible performance sacrifice with respect to CUDA code
  No standard for the moment
Data transfer time between host and GPU may strongly reduce the overall performance.
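To illustrate what the programming-level route involves compared with the directive route shown on slide 9, here is a sketch of a saxpy-type loop in CUDA Fortran (module, kernel and array names are illustrative, not from the slides): the loop becomes an explicit kernel and the thread/block decomposition is managed by hand.

! Sketch only: an explicit CUDA Fortran kernel for y = a*x + y.
module saxpy_mod
  use cudafor
contains
  attributes(global) subroutine saxpy_kernel(n, a, x, y)
    integer, value :: n
    real, value    :: a
    real           :: x(n), y(n)   ! device arrays
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) y(i) = a * x(i) + y(i)
  end subroutine saxpy_kernel
end module saxpy_mod

! Host side (x_d, y_d declared with the device attribute):
!   call saxpy_kernel<<< (n + 255) / 256, 256 >>>(n, 2.0, x_d, y_d)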

6 02/05/2011 X. Lapillonne
Running COSMO with GPUs
1) Simple accelerator approach: kernels and data are launched for each part of the code.
[Diagram: time-step sequence Dynamics, Microphysics, Turbulence, Radiation (phys. parametrizations), I/O; only the individual parametrizations are offloaded to the GPU, each with its own data transfers]
The GPU gains may be strongly reduced by the large back-and-forth data transfers.
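As a sketch of why approach 1 is transfer-heavy (routine name, fields and arithmetic are purely illustrative, not the COSMO code): each accelerated section carries its own copy clauses, so every call moves the fields to the device and back.

! Sketch only: two accelerator regions, each with its own transfers.
subroutine physics_step_v1(ni, nk, t, qv)
  implicit none
  integer, intent(in) :: ni, nk
  real, intent(inout) :: t(ni,nk), qv(ni,nk)
  integer :: i, k
  ! "microphysics" kernel: t and qv are copied in and out here
!$acc region copy(t, qv)
  do k = 1, nk
     do i = 1, ni
        qv(i,k) = max(qv(i,k) - 1.0e-4, 0.0)
        t(i,k)  = t(i,k) + 0.25
     end do
  end do
!$acc end region
  ! "turbulence" kernel: t is transferred again for this region
!$acc region copy(t)
  do k = 1, nk
     do i = 1, ni
        t(i,k) = t(i,k) + 0.01 * (300.0 - t(i,k))
     end do
  end do
!$acc end region
end subroutine physics_step_v1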

7 02/05/2011 X. Lapillonne
Running COSMO with GPUs
2) Data remain on the device and are only sent to the CPU for I/O and communication.
[Diagram: time-step sequence with Dynamics (CUDA) and the parametrizations Microphysics, Turbulence, Radiation (directives) all running on the GPU; data is sent to the CPU only for I/O]
- Strongly reduces device-host data transfer time (one transfer per time step)
- Possible within the PGI directive framework
- Arrays declared on the GPU can be passed directly from the dynamical core (CUDA) to the physical parametrizations (directives) without an intermediate copy to the CPU (a sketch follows below)
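A minimal sketch of approach 2 under the PGI accelerator model (names and the trivial update are illustrative): a data region around the time loop keeps the fields resident on the GPU, so the compute regions inside no longer trigger transfers. How the host copy is refreshed for I/O in the middle of the run (e.g. an update-style directive) is left out here, since it depends on the compiler version.

! Sketch only: one host<->device transfer for the whole time loop.
subroutine integrate_v2(ni, nk, nsteps, t, qv)
  implicit none
  integer, intent(in) :: ni, nk, nsteps
  real, intent(inout) :: t(ni,nk), qv(ni,nk)
  integer :: i, k, nt
!$acc data region copy(t, qv)
  do nt = 1, nsteps
!$acc region                      ! no copy clauses: data already on the GPU
     do k = 1, nk
        do i = 1, ni
           qv(i,k) = max(qv(i,k) - 1.0e-4, 0.0)
           t(i,k)  = t(i,k) + 0.01
        end do
     end do
!$acc end region
  end do
!$acc end data region             ! fields copied back to the host once
end subroutine integrate_v2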

8 02/05/2011 X. Lapillonne
Outline
- Computing with GPUs
- GPU implementation using PGI directives
- The example of the microphysics parametrization
- Summary

9 02/05/2011 X. Lapillonne
GPU implementation using PGI directives
Kernels are generated at compilation time by adding directives to the code.
Example of a matrix multiply to be compiled for an accelerator:

!$acc region
do k = 1,n1
  do i = 1,n3
    c(i,k) = 0.0
    do j = 1,n2
      c(i,k) = c(i,k) + a(i,j) * b(j,k)
    enddo
  enddo
enddo
!$acc end region

- Grid and block sizes are set automatically by the compiler or can be tuned manually using the parallel and vector keywords.
- The mirror and reflected keywords allow GPU-resident data arrays to be declared, avoiding data transfers between multiple kernel calls (a usage sketch follows below).
- Based on experience with other codes (WRF, fluid dynamics) [1], directly adding directives to existing code may not be very efficient: some rewriting (loop reordering, …) is still required to get good performance.
- Limitations: calls to subroutines within an acc region need to be inlined, …
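A sketch of how the mirror and reflected keywords can be used (PGI accelerator model; module, routine and array names are illustrative, and the exact syntax should be checked against the compiler documentation): the mirrored module array keeps a device copy alive between kernel calls, and the reflected declaration tells the compiler that the dummy argument already lives on the GPU, so the call does not trigger a transfer.

! Sketch only: device-resident module array passed to a directive kernel.
module fields_mod
  implicit none
  real, allocatable :: t(:,:)
!$acc mirror(t)                  ! host and device copies allocated together
end module fields_mod

subroutine add_increment(nx, nk, a)
  implicit none
  integer, intent(in) :: nx, nk
  real                :: a(nx,nk)
!$acc reflected(a)               ! the caller passes the device-resident array
  integer :: i, k
!$acc region
  do k = 1, nk
     do i = 1, nx
        a(i,k) = a(i,k) + 1.0
     end do
  end do
!$acc end region
end subroutine add_increment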

10 02/05/2011 X. Lapillonne
Importance of loop re-ordering
Typical example for a column-based physical parametrization: i is the parallel direction, nlev = 60, N = 10000.

Version with i as the outermost loop (execution time: 192 μs):

!$acc region
do i=1,N
  ! init
  do k=1,nlev
    a(i,k)=0.0D0
  end do
  ! first layer
  a(i,1)=0.1D0
  ! vertical computation
  do k=2,nlev
    a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*a(i,k)
  end do
end do
!$acc end region

Version with i as the innermost loop (execution time: 574 μs):

!$acc region
! init
do k=1,nlev
  do i=1,N
    a(i,k)=0.0D0
  end do
end do
! first layer
do i=1,N
  a(i,1)=0.1D0
end do
! vertical computation
do k=2,nlev
  do i=1,N
    a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*a(i,k)
  end do
end do
!$acc end region

The stride-1 index loop should be the outermost loop to fully benefit from the vector-type parallelization.

11 02/05/2011 X. Lapillonne
Outline
- Computing with GPUs
- GPU implementation using PGI directives
- The example of the microphysics parametrization
- Summary

12 02/05/2011 X. Lapillonne
Test case: microphysics parametrization (two-category ice scheme)
- Calculates the rates of change of temperature, cloud water, cloud ice, water vapour, rain and snow due to cloud microphysical processes.
- The microphysics contains most features of physical parametrizations: column based, vertical integration, math intrinsics (log, exp, …) → compute bound.
- 6% of the total computation time is spent in the microphysics computation.
- The directives are tested using a stand-alone version of the microphysics.
- The validity of the GPU results is checked against reference CPU outputs.

13 02/05/2011 X. Lapillonne
Different implementations
Version 1 (v1) → Direct implementation:
- Few changes required; the existing structure of the code is kept
Version 2 (v2) → Optimized for GPU, with modifications in the routine:
– Move the innermost loop outside (parallel loop over the stride-1 index)
– Remove the mask calculations (used for vectorization)
– Scalarize intermediate arrays → the new scalar variables may be assigned to registers (a sketch follows below)
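A sketch of the scalarization step (illustrative names and an illustrative power law, not the actual microphysics code): once each GPU thread owns one column, a per-column work array such as zqrk can become a plain scalar that the compiler may keep in a register.

! Sketch only: scalar temporary replaces a former zqrk(ni) work array.
subroutine rain_flux_v2(ni, nk, qr, rho, flux)
  implicit none
  integer, intent(in) :: ni, nk
  real, intent(in)    :: qr(ni,nk), rho(ni,nk)
  real, intent(out)   :: flux(ni,nk)
  integer :: i, k
  real    :: zqrk                    ! thread-private scalar
!$acc region
  do i = 1, ni                       ! parallel loop over the stride-1 index
     do k = 1, nk                    ! sequential vertical loop per thread
        zqrk      = qr(i,k) * rho(i,k)
        flux(i,k) = 12.63 * zqrk**1.27   ! illustrative power law
     end do
  end do
!$acc end region
end subroutine rain_flux_v2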

14 02/05/2011 X. Lapillonne
Comparison with CPU results
- GPU code versions 1 and 2 are compared with a reference MPI-parallel code running on an AMD 2.6 GHz Opteron Istanbul (6 cores).
- GPU: Fermi card (C2070), using double precision.
- Test case (00z+3h forecast): nx x ny x nz = 80 x 60 x 60, nstep = 100.

15 02/05/2011 X. Lapillonne
Using the Fermi card and double precision reals
- A 10x speed-up is achieved with the Fermi card and code version 2.
- Based on the theoretical peak performance of the 6-core Opteron and the Fermi card, one would expect a 9x speed-up.
- Only a 2x speed-up remains when data transfer is taken into account (data transfer time is ~2 s while execution time is ~0.5 s).
- The version 2 code is about 6x faster than version 1.
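For reference, assuming the published double-precision peaks (roughly 515 GFLOPs for the Fermi C2070, and 6 cores x 2.6 GHz x 4 flops/cycle ≈ 62.4 GFLOPs for the Opteron Istanbul; these figures are assumptions based on vendor specifications, not taken from the slides), the expected ratio is

\[ \frac{P_{\mathrm{GPU}}}{P_{\mathrm{CPU}}} \approx \frac{515\ \mathrm{GFLOPs}}{62.4\ \mathrm{GFLOPs}} \approx 8.3, \]

which is roughly consistent with the 9x quoted above.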

16 02/05/2011 X. Lapillonne
Summary
- GPUs are massively parallel hardware which can provide much higher computational power than CPUs for comparable power consumption and price.
- A project is currently being carried out to port the full COSMO code to this architecture, using CUDA for the dynamical core and a directive-based approach for the parametrizations.
- Porting of the microphysics subroutine to the GPU was successfully carried out using PGI directives.
- Speed-up using the Fermi card and double precision reals with respect to the reference MPI CPU code running on a 6-core AMD Opteron: 10x without data transfer, 2x when considering data transfer.
- The large overhead of data transfer (4x the execution time) shows that going to the GPU is only viable if more computation is done on the device between two data transfers (i.e. all physics, or all physics + dynamics).
- The code optimized for the GPU is 6x faster than the direct GPU implementation: it is essential to reorder loops to take advantage of the synchronized parallelisation.

17 02/05/2011 X. Lapillonne
Next steps
GPU implementation and performance evaluation of other physical parametrizations in stand-alone code:
- Radiation: done, but awaiting a PGI bug fix
- Turbulence
Actual implementation in COSMO:
1. Introduction of the ICON physical parametrizations, with the new data structure t(nproma, ke, nblock).
2. PGI directives in COSMO. The GPU and CPU versions will have different loop orders; what is the best solution? (e.g. #ifdef, pre-processing, …) One possible #ifdef variant is sketched below.
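A sketch of one way to keep both loop orders in a single source (the preprocessor flag GPU_LOOP_ORDER and the routine are hypothetical, not from the slides): the GPU build puts the stride-1 i loop outside, while the CPU build keeps the vector-friendly i loop inside.

! Sketch only: two loop orders selected at compile time.
subroutine saturation_adjust(nproma, ke, t)
  implicit none
  integer, intent(in) :: nproma, ke
  real, intent(inout) :: t(nproma,ke)
  integer :: i, k
#ifdef GPU_LOOP_ORDER
!$acc region
  do i = 1, nproma          ! parallel loop over the stride-1 index
     do k = 1, ke
        t(i,k) = t(i,k) + 0.01
     end do
  end do
!$acc end region
#else
  do k = 1, ke              ! CPU / vector version: i loop innermost
     do i = 1, nproma
        t(i,k) = t(i,k) + 0.01
     end do
  end do
#endif
end subroutine saturation_adjust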

18 02/05/2011 X. Lapillonne Additional slides

19 02/05/2011 X. Lapillonne
Data structure
Considering the data format t(nvec, nz, nblock), with nblock = (nx x ny) / nvec.
[Figure: the horizontal x,y plane is split into consecutive chunks of nproma (= nvec) points: Block 1, Block 2, Block 3, …]
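A sketch of the corresponding index mapping (illustrative; the actual COSMO/ICON copy routines also handle partially filled last blocks): the nx*ny horizontal points are numbered consecutively with i running fastest and cut into chunks of nvec points, so point (i,j) lives at position ivec of block ib.

! Sketch only: map horizontal indices (i,j) to (ivec, ib).
subroutine ij_to_block(i, j, nx, nvec, ivec, ib)
  implicit none
  integer, intent(in)  :: i, j        ! horizontal indices (1-based)
  integer, intent(in)  :: nx, nvec    ! x-dimension and block length
  integer, intent(out) :: ivec, ib    ! position within block, block number
  integer :: ij
  ij   = (j - 1) * nx + (i - 1)       ! 0-based linear index over the x,y plane
  ib   = ij / nvec + 1
  ivec = mod(ij, nvec) + 1
end subroutine ij_to_block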

20 02/05/2011 X. Lapillonne
Test case: one-moment bulk microphysical parametrization (two-category ice scheme)
- Calculates the rates of change of temperature, cloud water, cloud ice, water vapour, rain and snow due to cloud microphysical processes.
- The microphysics contains most features of physical parametrizations: column based, vertical integration, math intrinsics (log, exp, …) → compute bound.
- 6% of the total computation time is spent in the microphysics computation.

Structure of the hydci_pp subroutine (schematic pseudo-code):

DO k=1,nz                        ! vertical loop
  ! Check for existence of rain and snow      <- mask for vectorization
  DO i,j=1,nhorizontal           ! horizontal loop
    IF ( rain(i,j,k) ) THEN
      ic1=ic1+1
      idx1(ic1)=i ; jdx1(ic1)=j
      ...
  ! Compute changes for grid points with rain and snow      <- actual computation
  DO i1d=1,ic1                   ! horizontal loop over masked points
    i = idx1(i1d); j = jdx1(i1d)
    compute(i,j,k)
  ! search for cloud ...
  ...

21 02/05/2011 X. Lapillonne
Different implementations
Version 1 (v1) → Direct implementation:
- Few changes required; mostly keeps the existing structure of the code, in particular the optimisations for vector machines.
- Only one parallel loop, over iblock (3rd dimension), to maximize the level of parallelisation; nvec is here set to 1.
- No parallel loop over the innermost index: does not take advantage of the vector-type parallelisation.
- Optimal execution time for nv1 = 128.

!$acc region do kernel, parallel, vector( nv1 ) &   ! vector(nv1) determines grid and block size (1D grid and blocks in this example)
!$acc private(zpkr,zpks,zprvr,zprvs,zvzr,zvzs) &
DO ib=1,nblock
  DO k=1,nz
    Do i=1,nvec
      ! microphysics computation (i,k,ib)

22 02/05/2011 X. Lapillonne
Different implementations
Version 2 (v2) → Optimized for GPU, with important changes in the routine:
– Move the innermost loop outside the levels (k) loop
– Remove the mask calculations (used for vectorization)
– Scalarize intermediate arrays → the new scalar variables may be assigned to registers
Two parallel loops; the synchronous parallel threads (vector) access contiguous memory.
Optimal execution time for nv1 = nv2 = 16 and nvec = 16.

!$acc region do parallel, vector( nv1 )
DO ib=1,nblock
!$acc do kernel, vector( nv2 )
  Do i=1,nvec              ! synchronous parallel threads (vector) access contiguous memory
    DO k=1,nz
      ! computation (i,k,ib)

23 02/05/2011 X. Lapillonne
Version 1

!$acc region do kernel, parallel, vector(128) &
!$acc private(zpkr,zpks,zprvr,zprvs,zvzr,zvzs) &
!$acc private(zcrim,zcagg,zbsdep,zvz0s,zn0s) &
!$acc private(z1orhog,zrho1o2,zqrk,zqsk) &
!$acc private(zdtdh,zzar,zzas) &
!$acc private(ic1, ic2, ic3, ic4, ic5, ic6) &
!$acc private(i,j,idx1,jdx1,idx2,jdx2,idx3,jdx3,idx4,jdx4,idx5,jdx5,idx6,jdx6) &
!$acc private(zeln7o8qrk,zeln27o16qrk,zeln13o8qrk,zeln3o16qrk,zeln13o12qsk,zeln5o24qsk,zeln2o3qsk) &
!$acc private(zcsdep,zcidep,zcslam,scau,scac,snuc,scfrz,simelt,sidep,ssdep,sdau,srim,sshed,sicri,srcri,sagg,siau,ssmelt,sev,srfrz) &
!$acc private(zqvsi,zims,zimr) &
!$acc private(zqvt,zqct,zqit) &
!$acc private(qrg,qsg,qvg,qcg,qig,tg,ppg,rhog) &
!$acc private(zcorr) &
!$acc private(zpres,zdummy)
loop_over_blocks: DO ib=1,nblock

  ! *********************************************************************
  ! Loop from the top of the model domain to the surface to calculate the
  ! transfer rates and sedimentation terms
  ! *********************************************************************

  ! Delete precipitation fluxes from previous timestep
!CDIR BEGIN COLLAPSE
  prr_gsp (:,:,ib) = 0.0_ireals
  prs_gsp (:,:,ib) = 0.0_ireals
  zpkr (:,:) = 0.0_ireals

  loop_over_levels: DO k = 1, ke

    IF ( ildiabf_lh==1 ) THEN
      ! initialize temperature increment due to latent heat
      tinc_lh(:,:,k,ib) = tinc_lh(:,:,k,ib) - t(:,:,k,ib)
    ENDIF

    ic1 = 0
    ic2 = 0
    DO j = jstartpar, jendpar
      DO i = istartpar, iendpar
        qrg = qr(i,j,k,ib)
        qsg = qs(i,j,k,ib)
        qvg = qv(i,j,k,ib)

24 02/05/2011 X. Lapillonne
Tuning the kernel schedule
- Tesla (T10) card, using single precision computation.
- Test case: nx x ny x nz = 100 x 100 x 60, nstep = 100.
- Try different block sizes by changing the vector(nv1) argument.
- Optimal execution time for nv1 = 128: t_e = 2.88 s. Additional data transfer time is t_d = 2.1 s.

25 02/05/2011 X. Lapillonne
Output from the compiler

Generating copyout(prr_gsp(:,:))
Generating copyout(prs_gsp(:,:))
Generating compute capability 1.3 binary
780, Loop is parallelizable
789, Loop is parallelizable
     Accelerator kernel generated
     780, !$acc do parallel, vector(16)
     789, !$acc do vector(16)
     Cached references to size [10] block of 'mma'
     Cached references to size [10] block of 'mmb'
     Using register for 'prr_gsp'
     Using register for 'prs_gsp'
     CC 1.3 : 64 registers; 100 shared, 556 constant, 36 local memory bytes; 25% occupancy

26 02/05/2011 X. Lapillonne
Version 2

!$acc region, parallel, vector(16) &
!$acc copyin(rho,p,mma,mmb,dz) &
!$acc copy(qs,qr,tinc_lh,t,qi,qc,qv) &
!$acc copyout( prr_gsp,prs_gsp)
loop_over_blocks: DO ib=1,nblock
!$acc do kernel, vector(16)
  loop_over_xdim: DO i = istartpar, iendpar

    ! Delete precipitation fluxes from previous timestep
    prr_gsp (i,ib) = 0.0_ireals
    prs_gsp (i,ib) = 0.0_ireals
    zpkr  = 0.0_ireals
    zpks  = 0.0_ireals
    zprvr = 0.0_ireals
    zprvs = 0.0_ireals
    zvzr  = 0.0_ireals
    zvzs  = 0.0_ireals

    ! *********************************************************************
    ! Loop from the top of the model domain to the surface to calculate the
    ! transfer rates and sedimentation terms
    ! *********************************************************************

    loop_over_levels: DO k = 1, ke

      IF ( ildiabf_lh==1 ) THEN
        ! initialize temperature increment due to latent heat
        tinc_lh(i,k,ib) = tinc_lh(i,k,ib) - t(i,k,ib)
      ENDIF

27 02/05/2011 X. Lapillonne
Tuning the kernel schedule
- The vector(nv1) and vector(nv2) arguments, and the data format nvec, can now be changed.
- Optimal execution time for nv1 = nv2 = 16 and nvec = 16: t_e = 0.64 s.
- Speed-up compared to version 1 is 4.5x.
- Data transfer time is again about 2 s, i.e. 3 times larger than the execution time.

28 02/05/2011 X. Lapillonne
Output from the compiler

Generating copyout(prr_gsp(1:ie,1,1:nblock))
Generating compute capability 1.3 binary
791, Loop is parallelizable
     Accelerator kernel generated
     791, !$acc do parallel, vector(256)
     Cached references to size [10] block of 'mma'
     Cached references to size [10] block of 'mmb'
     CC 1.3 : 64 registers; 100 shared, 1284 constant, 0 local memory bytes; 25% occupancy

29 02/05/2011 X. Lapillonne
Scaling with system size
[Figure: scaling with the horizontal system size N = nx x ny; the annotation notes that nx x ny = 100 x 100 does not provide enough parallelism]