1 First experiences with porting COSMO code to GPU using the F2C-ACC Fortran-to-CUDA compiler
Computational Mathematics and Applications Group
Cristiano Padrin (CASPUR), Piero Lanucara (CASPUR), Alessandro Cheloni (CNMCA)

2 The GPU explosion
A huge amount of computing power: exponential growth with respect to “standard” multicore CPUs.

3 Jazz Fermi GPU Cluster at CASPUR
785 MFlops/W
192 Intel cores
14336 GPU cores on 32 Fermi C2050
QDR InfiniBand interconnect
1 TB RAM
200 TB IB storage
14.3 Tflops peak, 10.1 Tflops Linpack
CASPUR awarded as a CUDA Research Center
The Jazz cluster is currently number 5 on the Little Green500 list

4 Introduction
The problem: porting large, legacy Fortran applications to GPGPU architectures.
CUDA is the de-facto standard, but only for C/C++ codes. There is no standard yet: several GPU Fortran compilers exist, both commercial (CAPS HMPP, PGI Accelerator and CUDA Fortran) and freely available (F2C-ACC), ...
Our choice: F2C-ACC (Govett), a directive-based compiler from NOAA.
Note: we have the full support of NOAA, and of Govett in particular, with experience on oceanographic and atmospheric models (including SMS).

5 How F2C-ACC participates “in make”
Build pipeline: filename.f90 → (F2C-ACC) → filename.m4 → (m4) → filename.cu → (nvcc) → filename.o
  $(F2C) $(F2COPT) filename.f90
  $(M4) filename.m4 > filename.cu
  $(NVCC) -c $(NVCC_OPT) -I$(INCLUDE) filename.cu

6 F2C-ACC Workflow
F2C-ACC translates Fortran code, with user-added directives, into CUDA (it relies on the m4 macro processor for the inter-language dependencies).
Some hand coding may be needed (see results).
Debugging and optimization tips (e.g. thread/block synchronization, out of memory, coalescing, occupancy, ...) have to be applied manually.
Compile and link against the CUDA libraries to create an executable to run.
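To make the workflow concrete, here is a minimal sketch of a Fortran loop nest annotated with F2C-ACC-style directives. The directive spellings (ACC$REGION ... BEGIN/END, ACC$DO PARALLEL, ACC$DO VECTOR) follow the published F2C-ACC examples, but the arguments and the kernel itself are illustrative assumptions, not code from this project; to a plain Fortran compiler the directives are just comments.

      ! Illustrative kernel only; directive arguments are assumptions.
      subroutine demo_loops(n, k, a, x, y)
        implicit none
        integer, intent(in)    :: n, k
        real,    intent(in)    :: a
        real,    intent(in)    :: x(n,k)
        real,    intent(inout) :: y(n,k)
        integer :: i, j
!ACC$REGION(<k>,<n>) BEGIN          ! assumed mapping: k threads per block, n blocks
!ACC$DO PARALLEL(1)
        do i = 1, n                 ! outer loop mapped to CUDA blocks
!ACC$DO VECTOR(1)
          do j = 1, k               ! inner loop mapped to CUDA threads
            y(i,j) = a*x(i,j) + y(i,j)
          end do
        end do
!ACC$REGION END
      end subroutine demo_loops

Running F2C-ACC on such a file produces the .m4 intermediate, which m4 expands into a .cu file for nvcc, as in the make pipeline of the previous slide.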

7 Himeno Benchmark

8 Himeno Benchmark: MPI version
1 Process - 1 GPU
2 Processes: 512 x 256 x 128
4 Processes: 512 x 256 x 64
8 Processes: 512 x 256 x 32
16 Processes: 512 x 256 x 16

9 Porting the Microphysics
In POMPA Task 6 we are exploring “the possibilities of a simple porting of specific physics or dynamics kernels to GPUs”. During the last workshop in Manno at CSCS, two different approaches emerged to deal with the problem: one based on PGI Accelerator directives and the other based on the F2C-ACC tool. The study has been done on the stand-alone Microphysics program optimized for GPU with PGI by Xavier Lapillonne and referenced on HPCforge.

10 Reference Code Structure
In the Microphysics program, the two nested do-loops over space inside the subroutine hydci_pp have been identified as the part to be accelerated via PGI directives.
[Diagram: the file mo_gscp_dwd.f90 contains MODULE mo_gscp_dwd with Subr. HYDCI_PP_INIT and Subr. HYDCI_PP; the accelerated part of HYDCI_PP (via PGI directives) uses elemental functions and Subr. SATAD from modules in other files; a MAIN program in a separate file drives it.]
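For orientation, a minimal sketch of what a loop nest accelerated with the PGI Accelerator model can look like. The !$acc region / !$acc end region directives are the pre-OpenACC PGI Accelerator syntax; the variable names, bounds and body are illustrative assumptions, not extracted from hydci_pp.

      ! Illustrative only: names and bounds do not come from hydci_pp.
      subroutine micro_step_pgi(ie, ke, t, qc)
        implicit none
        integer, intent(in)    :: ie, ke
        real,    intent(inout) :: t(ie,ke), qc(ie,ke)
        integer :: i, k
!$acc region
        do k = 1, ke            ! vertical levels
          do i = 1, ie          ! horizontal points
            ! ... point-wise microphysical tendencies would be computed here ...
            qc(i,k) = max(qc(i,k), 0.0)   ! placeholder update
          end do
        end do
!$acc end region
      end subroutine micro_step_pgi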

11 Reference Code Structure
[Diagram: simplified workflow of HYDCI_PP - presettings; 2 nested do-loops over “i and k” holding the computing and the calls to “SATAD” (the accelerated part); update of some globals; global output.]

12 Modified Code Structure
We proceeded to accelerate the same part of the code via F2C-ACC directives. Due to limitations of the current F2C-ACC release, the code structure has been partly modified, while the workflow has been left unchanged. The part of the code to be accelerated remains the same, but it has been extracted from the hydci_pp subroutine and a separate file containing a new subroutine has been created for it: accComp.f90.
[Diagram: the file mo_gscp_dwd.f90 keeps MODULE mo_gscp_dwd with Subr. HYDCI_PP_INIT and Subr. HYDCI_PP; the accelerated part, via F2C-ACC directives, now lives in Subr. accComp in the file accComp.f90.]

13 Modified Code Structure: why ?
The major limitations that drove the changes in the code are:
Modules are (for now) not supported → the necessary variables are passed to the called subroutines, and the called subroutines/functions are included in the file.
The F2C-ACC “--kernel” option isn't carefully tested → elemental functions and subroutines (“satad”) are inlined.
A sketch of the resulting subroutine interface is shown below.
[Diagram: same structure as the previous slide - Subr. accComp in accComp.f90 holds the part accelerated via F2C-ACC directives.]
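A minimal sketch of what the extracted subroutine can look like under these constraints: former module variables become explicit arguments and the saturation adjustment (satad) is inlined in the loop body. The argument list, array names and directive arguments here are illustrative assumptions, not the actual accComp.f90 interface.

      ! Illustrative sketch only: not the real accComp.f90 interface.
      subroutine accComp(ie, ke, zdt, pres, t, qv, qc, qr, qs)
        implicit none
        integer, intent(in)    :: ie, ke              ! horizontal / vertical extents
        real,    intent(in)    :: zdt                 ! time step, formerly a module variable
        real,    intent(in)    :: pres(ie,ke)
        real,    intent(inout) :: t(ie,ke), qv(ie,ke), qc(ie,ke), qr(ie,ke), qs(ie,ke)
        integer :: i, k
!ACC$REGION(<ke>,<ie>) BEGIN        ! directive arguments are assumptions
!ACC$DO PARALLEL(1)
        do i = 1, ie
!ACC$DO VECTOR(1)
          do k = 1, ke
            ! ... microphysics computations, formerly inside hydci_pp ...
            ! ... "satad" inlined here instead of being called as an elemental routine ...
          end do
        end do
!ACC$REGION END
      end subroutine accComp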

14 Modified Code Structure
Host / Device view
[Diagram: on the CPU side, MODULE mo_gscp_dwd with Subr. HYDCI_PP_INIT and Subr. HYDCI_PP; HYDCI_PP copies the data in to the GPU, the GPU runs Subr. accComp (the part accelerated via F2C-ACC directives), and the results are copied out back to the host.]

15 Results
Timesteps    250       500       750       1000
CPU          42,685    84,951    125,973   166,206
GPU          5,952     11,819    18,389    26,081
F2C-ACC      4,814     9,634     16,650    21,843

16 Results
The check.dat file produced by the run of the model built with F2C-ACC compares well with the check.dat file produced with the PGI version. In particular, we can see the comparison for one iteration between the F2C-ACC version and the Fortran version:
[Comparison output for the fields t and tinc_lh: number of errors, mean and max relative error, max absolute error and the (i, j, k) location of the maximum - (16, 58, 42) for t and (13, 53, 47) for tinc_lh; the numeric values were lost in the transcript.]

17 Conclusions
First results are encouraging: the performance of the F2C-ACC Microphysics is quite good.
F2C-ACC is directive-based (incremental parallelization): readable, and only one source code to maintain.
“Adjustable” CUDA code is generated: portability and efficiency.
Ongoing project: it is an “application-specific Fortran-to-CUDA compiler for performance evaluation”, with, for the moment, limited support for some advanced Fortran features (e.g. modules).
Check for correctness: intrinsics (e.g. reductions) and advanced Fermi features (e.g. FMA support) are not “automatically” handled by the F2C-ACC compiler.
