Slide 1: Using OpenACC in IFS Physics’ Cloud Scheme (CLOUDSC)
Sami Saarinen, ECMWF
Basic GPU Training, Sept 16-17, 2015
Slide 2: Background
- Back in 2014: adaptation of the IFS physics' cloud scheme (CLOUDSC) to new architectures as part of the ECMWF Scalability Programme
  - Emphasis was on GPU migration by use of OpenACC directives
- CLOUDSC consumes about 10% of IFS forecast time
  - Some 3500 lines of Fortran 2003 (before OpenACC directives)
- This presentation concentrates on comparing performance on:
  - Haswell: OpenMP version of CLOUDSC
  - NVIDIA GPU (K40): OpenACC version of CLOUDSC
Slide 3: Some earlier results
- Baseline results brought down from 40s to 0.24s on a K40 GPU
  - PGI 14.7 & CUDA 5.5 / 6.0 (runs performed ~3Q/2014)
  - The Cray CCE 8.4 OpenACC compiler was also tried
- OpenACC directives inserted automatically
  - By use of the acc_insert Perl script, followed by manual cleanup
  - Source code expanded from 3500 to 5000 lines in CLOUDSC!
- The code with OpenACC directives still sustains about the same performance as before on the Intel Xeon host side
- GPU computational performance was the same or better compared to Intel Haswell (36-core model, 2.3GHz)
- Data transfers added serious overheads
  - Strange DATA PRESENT testing & memory pinning slowdowns
Slide 4: The problem setup for this case study
- Given 160,000 grid point columns (NGPTOT)
  - Each with 137 levels (NLEV)
  - About 80,000 columns fit into one K40 GPU
- Grid point columns are independent of each other
  - So no horizontal dependencies here, but...
  - ... the level dependency prevents parallelization along the vertical dimension
- Arrays are organized in blocks of grid point columns (see the sketch below)
  - Instead of using ARRAY(NGPTOT, NLEV)...
  - ... we use ARRAY(NPROMA, NLEV, NBLKS)
  - NPROMA is a (runtime) fixed blocking factor
  - Arrays are OpenMP thread safe over NBLKS
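To make the blocking concrete, here is a minimal, hypothetical sketch of the layout described above. The program name and the NGPBLKS formula are illustrations, not taken from the slides; NPROMA = 100 is simply one value from the 10 to 100 range quoted for Haswell on Slide 5.

  PROGRAM blocked_layout_sketch
    ! Minimal sketch of the NPROMA-blocked storage described above:
    ! ARRAY(NPROMA, NLEV, NBLKS) instead of ARRAY(NGPTOT, NLEV).
    IMPLICIT NONE
    INTEGER, PARAMETER :: NGPTOT = 160000     ! grid point columns
    INTEGER, PARAMETER :: NLEV   = 137        ! vertical levels
    INTEGER :: NPROMA, NGPBLKS
    REAL(KIND=8), ALLOCATABLE :: ARRAY(:,:,:)

    NPROMA  = 100                             ! runtime-fixed blocking factor
    NGPBLKS = (NGPTOT + NPROMA - 1) / NPROMA  ! number of NPROMA blocks
    ALLOCATE(ARRAY(NPROMA, NLEV, NGPBLKS))

    ! Each block ARRAY(:,:,IBL) holds up to NPROMA independent columns, so
    ! OpenMP threads can safely own different IBL blocks ("thread safe over NBLKS").
    PRINT *, 'NPROMA =', NPROMA, '  blocks =', NGPBLKS
  END PROGRAM blocked_layout_sketch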
Slide 5: Hardware, compiler & NPROMAs used
- Haswell node: 24 cores @ 2.5GHz
  - 2 x NVIDIA K40c GPUs on each Haswell node via PCIe
  - Each GPU equipped with 12GB memory; CUDA 7.0
- PGI compiler 15.7 with OpenMP & OpenACC
  - -O4 -fast -mp=numa,allcores,bind -Mfprelaxed -tp haswell -Mvect=simd:256 [ -acc ]
  - Environment variables: PGI_ACC_NOSHARED=1, PGI_ACC_BUFFERSIZE=4M
- Typical good NPROMA value for Haswell: ~10 to 100
- Per GPU, NPROMA up to 80,000 for maximum performance
Slide 6: Haswell: Driving CLOUDSC with OpenMP

  REAL(kind=8) :: array(NPROMA, NLEV, NGPBLKS)

  !$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND)
  !$OMP DO SCHEDULE(DYNAMIC,1)
  DO JKGLO=1,NGPTOT,NPROMA              ! So-called NPROMA-loop
    IBL=(JKGLO-1)/NPROMA+1              ! Current block number
    ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)    ! Block length <= NPROMA
    CALL CLOUDSC ( 1, ICEND, NPROMA, KLEV, &
     &             array(1,1,IBL), &    ! ~65 arrays like this
     &             ... )
  END DO
  !$OMP END DO
  !$OMP END PARALLEL

Typical values for NPROMA in the OpenMP implementation: 10 to 100
Slide 7: OpenMP scaling (Haswell, in GFlops/s)
Slide 8: Development of the OpenACC/GPU version
- The driver code with the OpenMP loop was kept roughly unchanged
  - GPU-to-host data mapping (ACC DATA) added
  - Note that OpenACC can (in most cases) co-exist with OpenMP
  - Allows an elegant multi-GPU implementation
- CLOUDSC was pre-processed with the "acc_insert" Perl script
  - Allowed automatic creation of ACC KERNELS and ACC DATA PRESENT / CREATE clauses in CLOUDSC (a sketch follows below)
  - In addition, some minimal manual source code clean-up
- CLOUDSC performance on the GPU needs a very large NPROMA
  - Lack of multilevel parallelism (only across NPROMA, not NLEV)
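Slide 10 shows an actual compute kernel; as a rough, hypothetical illustration of the other half of what acc_insert generates, the data clauses inside CLOUDSC might look like the sketch below. The argument names, array shapes, the NCLV value and the PRESENT/CREATE split are assumptions pieced together from Slides 6, 9 and 10, not the actual CLOUDSC source.

  SUBROUTINE CLOUDSC (KIDIA, KFDIA, KLON, KLEV, PSTATE_q_loc)
    IMPLICIT NONE
    INTEGER, PARAMETER :: NCLV = 5                  ! placeholder; real value comes from an IFS module
    INTEGER, INTENT(IN) :: KIDIA, KFDIA, KLON, KLEV
    REAL(KIND=8), INTENT(INOUT) :: PSTATE_q_loc(KLON, KLEV)        ! one of the ~65 driver arrays
    REAL(KIND=8) :: ZQX(KLON, KLEV, NCLV), ZLNEG(KLON, KLEV, NCLV) ! local work arrays

    ! The driver-level !$acc data copyin/copyout region has already placed the
    ! argument arrays on the GPU, so here they are only tested for presence;
    ! local work arrays are created directly in device memory, never transferred.
    !$ACC DATA PRESENT(PSTATE_q_loc) CREATE(ZQX, ZLNEG)

    ! ... ACC KERNELS compute regions go here (e.g. the loop nest on Slide 10) ...

    !$ACC END DATA
  END SUBROUTINE CLOUDSC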
Slide 9: Driving OpenACC CLOUDSC with OpenMP

  !$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND) &
  !$OMP& PRIVATE(tid, idgpu) num_threads(NumGPUs)
  tid = omp_get_thread_num()                  ! OpenMP thread number
  idgpu = mod(tid, NumGPUs)                   ! Effective GPU# for this thread
  CALL acc_set_device_num(idgpu, acc_get_device_type())
  !$OMP DO SCHEDULE(STATIC)
  DO JKGLO=1,NGPTOT,NPROMA                    ! NPROMA-loop
    IBL=(JKGLO-1)/NPROMA+1                    ! Current block number
    ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)          ! Block length <= NPROMA
    !$acc data copyout(array(:,:,IBL),...) &  ! ~22 : GPU to Host
    !$acc&     copyin(array(:,:,IBL))         ! ~43 : Host to GPU
    CALL CLOUDSC (... array(1,1,IBL)...)      ! Runs on GPU#
    !$acc end data
  END DO
  !$OMP END DO
  !$OMP END PARALLEL

Typical values for NPROMA in the OpenACC implementation: > 10,000
Slide 10: Sample OpenACC coding of CLOUDSC

  !$ACC KERNELS LOOP COLLAPSE(2) PRIVATE(ZTMP_Q,ZTMP) ASYNC(IBL)
  DO JK=1,KLEV
    DO JL=KIDIA,KFDIA
      ztmp_q = 0.0_JPRB
      ztmp   = 0.0_JPRB
      !$ACC LOOP PRIVATE(ZQADJ) REDUCTION(+:ZTMP_Q,ZTMP)
      DO JM=1,NCLV-1
        IF (ZQX(JL,JK,JM)<RLMIN) THEN
          ZLNEG(JL,JK,JM) = ZLNEG(JL,JK,JM)+ZQX(JL,JK,JM)
          ZQADJ = ZQX(JL,JK,JM)*ZQTMST
          ztmp_q = ztmp_q + ZQADJ
          ztmp   = ztmp + ZQX(JL,JK,JM)
          ZQX(JL,JK,JM) = 0.0_JPRB
        ENDIF
      ENDDO
      PSTATE_q_loc(JL,JK) = PSTATE_q_loc(JL,JK) + ztmp_q
      ZQX(JL,JK,NCLDQV) = ZQX(JL,JK,NCLDQV) + ztmp
    ENDDO
  ENDDO
  !$ACC END KERNELS

ASYNC removes CUDA-thread syncs.
Slide 11: OpenACC scaling (K40c, in GFlops/s)
Slide 12: Timing (ms) breakdown: single GPU
Slide 13: Saturating GPUs with more work

  !$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND) &
  !$OMP& PRIVATE(tid, idgpu) num_threads(NumGPUs * 4)   ! More threads here
  tid = omp_get_thread_num()                  ! OpenMP thread number
  idgpu = mod(tid, NumGPUs)                   ! Effective GPU# for this thread
  CALL acc_set_device_num(idgpu, acc_get_device_type())
  !$OMP DO SCHEDULE(STATIC)
  DO JKGLO=1,NGPTOT,NPROMA                    ! NPROMA-loop
    IBL=(JKGLO-1)/NPROMA+1                    ! Current block number
    ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)          ! Block length <= NPROMA
    !$acc data copyout(array(:,:,IBL),...) &  ! ~22 : GPU to Host
    !$acc&     copyin(array(:,:,IBL))         ! ~43 : Host to GPU
    CALL CLOUDSC (... array(1,1,IBL)...)      ! Runs on GPU#
    !$acc end data
  END DO
  !$OMP END DO
  !$OMP END PARALLEL
Slide 14: Saturating GPUs with more work
- Consider a few performance-degrading facts at present:
  - Parallelism only in the NPROMA dimension in CLOUDSC
  - Updating 60-odd arrays back and forth every time step
  - OpenACC overhead related to data transfers & ACC DATA
- Can we do better? YES! We can enable concurrently executed kernels through OpenMP
  - Time-sharing GPU(s) across multiple OpenMP threads
  - About 4 simultaneous OpenMP host threads can saturate a single GPU in our CLOUDSC case
- Extra care must be taken to avoid running out of memory on the GPU
  - Needs ~4X smaller NPROMA: 20,000 instead of 80,000 (see the back-of-the-envelope sketch below)
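A back-of-the-envelope check of why NPROMA must shrink when a GPU is 4-way time-shared. The sketch only uses numbers quoted in the slides (137 levels, roughly 65 REAL(kind=8) arrays passed per block, 12GB per K40c, NPROMA of 80,000 vs. 20,000) and assumes, as a simplification, that every array is NPROMA x NLEV sized; it is an illustration, not measured data.

  PROGRAM nproma_memory_sketch
    ! Rough per-block device-memory estimate for CLOUDSC-like argument data:
    ! ~65 REAL(kind=8) arrays of shape (NPROMA, NLEV) per NPROMA block.
    IMPLICIT NONE
    INTEGER, PARAMETER :: NLEV = 137, NARRAYS = 65
    INTEGER :: nproma, nconcurrent
    REAL    :: gib_per_block

    ! One resident block at NPROMA = 80,000 (single kernel per GPU):
    nproma = 80000
    gib_per_block = REAL(NARRAYS) * nproma * NLEV * 8.0 / 2.0**30
    PRINT '(A,I6,A,F6.2,A)', 'NPROMA=', nproma, ' -> ~', gib_per_block, ' GiB per block'

    ! Four concurrent blocks (4-way time-shared GPU) at NPROMA = 20,000:
    nproma = 20000
    nconcurrent = 4
    gib_per_block = REAL(NARRAYS) * nproma * NLEV * 8.0 / 2.0**30
    PRINT '(A,I6,A,F6.2,A)', 'NPROMA=', nproma, ' -> ~', gib_per_block*nconcurrent, &
          ' GiB for 4 concurrent blocks'
    ! The ~65 argument arrays alone account for roughly 5.3 GiB in either case;
    ! GPU-resident work arrays come on top, which is why NPROMA has to drop by
    ! about 4x when four blocks are resident on one 12GB K40c at the same time.
  END PROGRAM nproma_memory_sketch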
Slide 15: Multiple copies of CLOUDSC per GPU (GFlops/s)
Slide 16: nvvp profiler shows the time-sharing impact
(Profiler timelines compare a GPU that is 4-way time-shared with a GPU fed with work by one OpenMP thread only.)
Slide 17: Timing (ms): 4-way time-shared vs. no T/S
(Chart compares a GPU that is 4-way time-shared with a GPU that is not time-shared.)
Slide 18: 24-core Haswell 2.5GHz vs. K40c GPU(s) (GFlops/s)
(T/S = GPUs time-shared)
Slide 19: Conclusions
- The CLOUDSC OpenACC prototype from 3Q/2014 was ported to ECMWF's tiny GPU cluster in 3Q/2015
- Since last time, the PGI compiler has improved and OpenACC overheads have been greatly reduced (PGI 14.7 vs. 15.7)
- With CUDA 7.0 and concurrent kernels, it seems that time-sharing (oversubscribing) GPUs with more work pays off
- Not surprisingly, GPUs can be saturated by having the multi-core host launch more data blocks onto them
- The outcome is not bad, considering we seem to be underutilizing the GPUs (parallelism just along NPROMA)