Slide 1: Using OpenACC in IFS Physics’ Cloud Scheme (CLOUDSC)
Sami Saarinen, ECMWF
Basic GPU Training, Sept 16-17, 2015
Slide 2: Background
- Back in 2014: adaptation of the IFS physics' cloud scheme (CLOUDSC) to new architectures as part of the ECMWF Scalability Programme
  - Emphasis was on GPU migration by use of OpenACC directives
- CLOUDSC consumes about 10% of IFS forecast time
  - Some 3500 lines of Fortran 2003 (before OpenACC directives)
- This presentation concentrates on comparing performance on:
  - Haswell: OpenMP version of CLOUDSC
  - NVIDIA GPU (K40): OpenACC version of CLOUDSC
Slide 3: Some earlier results
- Baseline results brought down from 40s to 0.24s on a K40 GPU
  - PGI 14.7 & CUDA 5.5 / 6.0 (runs performed ~3Q/2014)
  - The Cray CCE 8.4 OpenACC compiler was also tried
- OpenACC directives inserted automatically
  - By use of the acc_insert Perl script, followed by manual cleanup
  - Source code expanded from 3500 to 5000 lines in CLOUDSC!
- The code with OpenACC directives still sustains about the same performance as before on the Intel Xeon host side
- GPU computational performance was the same or better compared to Intel Haswell (36-core model, 2.3GHz)
- Data transfers added serious overheads
  - Strange DATA PRESENT testing & memory pinning slowdowns
Slide 4: The problem setup for this case study
- Given 160,000 grid point columns (NGPTOT)
  - Each with 137 levels (NLEV)
  - About 80,000 columns fit into one K40 GPU
- Grid point columns are independent of each other
  - So no horizontal dependencies here, but...
  - ... the level dependency prevents parallelization along the vertical dimension
- Arrays are organized in blocks of grid point columns (see the sketch below)
  - Instead of using ARRAY(NGPTOT, NLEV)...
  - ... we use ARRAY(NPROMA, NLEV, NBLKS)
  - NPROMA is a (runtime) fixed blocking factor
  - Arrays are OpenMP thread safe over NBLKS
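To make the blocking concrete, here is a minimal, hypothetical sketch of the layout described above. The program name and the NGPBLKS formula are illustrations, not taken from the slides; NPROMA = 100 is simply one value from the 10 to 100 range quoted for Haswell on Slide 5.

  PROGRAM blocked_layout_sketch
    ! Minimal sketch of the NPROMA-blocked storage described above:
    ! ARRAY(NPROMA, NLEV, NBLKS) instead of ARRAY(NGPTOT, NLEV).
    IMPLICIT NONE
    INTEGER, PARAMETER :: NGPTOT = 160000     ! grid point columns
    INTEGER, PARAMETER :: NLEV   = 137        ! vertical levels
    INTEGER :: NPROMA, NGPBLKS
    REAL(KIND=8), ALLOCATABLE :: ARRAY(:,:,:)

    NPROMA  = 100                             ! runtime-fixed blocking factor
    NGPBLKS = (NGPTOT + NPROMA - 1) / NPROMA  ! number of NPROMA blocks
    ALLOCATE(ARRAY(NPROMA, NLEV, NGPBLKS))

    ! Each block ARRAY(:,:,IBL) holds up to NPROMA independent columns, so
    ! OpenMP threads can safely own different IBL blocks ("thread safe over NBLKS").
    PRINT *, 'NPROMA =', NPROMA, '  blocks =', NGPBLKS
  END PROGRAM blocked_layout_sketch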
Slide 5: Hardware, compiler & NPROMAs used
- Haswell node: 24 cores @ 2.5GHz
  - 2 x NVIDIA K40c GPUs on each Haswell node via PCIe
  - Each GPU equipped with 12GB memory; CUDA 7.0
- PGI compiler 15.7 with OpenMP & OpenACC
  - -O4 -fast -mp=numa,allcores,bind -Mfprelaxed -tp haswell -Mvect=simd:256 [ -acc ]
  - Environment variables: PGI_ACC_NOSHARED=1, PGI_ACC_BUFFERSIZE=4M
- Typical good NPROMA value for Haswell: ~10 to 100
- Per GPU, NPROMA up to 80,000 for maximum performance
Slide 6: Haswell: Driving CLOUDSC with OpenMP

  REAL(kind=8) :: array(NPROMA, NLEV, NGPBLKS)

  !$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND)
  !$OMP DO SCHEDULE(DYNAMIC,1)
  DO JKGLO=1,NGPTOT,NPROMA              ! So-called NPROMA-loop
    IBL=(JKGLO-1)/NPROMA+1              ! Current block number
    ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)    ! Block length <= NPROMA
    CALL CLOUDSC ( 1, ICEND, NPROMA, KLEV, &
     &             array(1,1,IBL), &    ! ~65 arrays like this
     &             ... )
  END DO
  !$OMP END DO
  !$OMP END PARALLEL

Typical values for NPROMA in the OpenMP implementation: 10 to 100
Slide 7: OpenMP scaling (Haswell, in GFlops/s)
Slide 8: Development of the OpenACC/GPU version
- The driver code with the OpenMP loop was kept roughly unchanged
  - GPU-to-host data mapping (ACC DATA) added
  - Note that OpenACC can (in most cases) co-exist with OpenMP
  - Allows an elegant multi-GPU implementation
- CLOUDSC was pre-processed with the "acc_insert" Perl script
  - Allowed automatic creation of ACC KERNELS and ACC DATA PRESENT / CREATE clauses in CLOUDSC (a sketch follows below)
  - In addition, some minimal manual source code clean-up
- CLOUDSC performance on the GPU needs a very large NPROMA
  - Lack of multilevel parallelism (only across NPROMA, not NLEV)
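Slide 10 shows an actual compute kernel; as a rough, hypothetical illustration of the other half of what acc_insert generates, the data clauses inside CLOUDSC might look like the sketch below. The argument names, array shapes, the NCLV value and the PRESENT/CREATE split are assumptions pieced together from Slides 6, 9 and 10, not the actual CLOUDSC source.

  SUBROUTINE CLOUDSC (KIDIA, KFDIA, KLON, KLEV, PSTATE_q_loc)
    IMPLICIT NONE
    INTEGER, PARAMETER :: NCLV = 5                  ! placeholder; real value comes from an IFS module
    INTEGER, INTENT(IN) :: KIDIA, KFDIA, KLON, KLEV
    REAL(KIND=8), INTENT(INOUT) :: PSTATE_q_loc(KLON, KLEV)        ! one of the ~65 driver arrays
    REAL(KIND=8) :: ZQX(KLON, KLEV, NCLV), ZLNEG(KLON, KLEV, NCLV) ! local work arrays

    ! The driver-level !$acc data copyin/copyout region has already placed the
    ! argument arrays on the GPU, so here they are only tested for presence;
    ! local work arrays are created directly in device memory, never transferred.
    !$ACC DATA PRESENT(PSTATE_q_loc) CREATE(ZQX, ZLNEG)

    ! ... ACC KERNELS compute regions go here (e.g. the loop nest on Slide 10) ...

    !$ACC END DATA
  END SUBROUTINE CLOUDSC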
Slide 9: Driving OpenACC CLOUDSC with OpenMP

  !$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND) &
  !$OMP& PRIVATE(tid, idgpu) num_threads(NumGPUs)
  tid = omp_get_thread_num()                  ! OpenMP thread number
  idgpu = mod(tid, NumGPUs)                   ! Effective GPU# for this thread
  CALL acc_set_device_num(idgpu, acc_get_device_type())
  !$OMP DO SCHEDULE(STATIC)
  DO JKGLO=1,NGPTOT,NPROMA                    ! NPROMA-loop
    IBL=(JKGLO-1)/NPROMA+1                    ! Current block number
    ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)          ! Block length <= NPROMA
    !$acc data copyout(array(:,:,IBL),...) &  ! ~22 : GPU to Host
    !$acc&     copyin(array(:,:,IBL))         ! ~43 : Host to GPU
    CALL CLOUDSC (... array(1,1,IBL)...)      ! Runs on GPU#
    !$acc end data
  END DO
  !$OMP END DO
  !$OMP END PARALLEL

Typical values for NPROMA in the OpenACC implementation: > 10,000
Slide 10: Sample OpenACC coding of CLOUDSC

  !$ACC KERNELS LOOP COLLAPSE(2) PRIVATE(ZTMP_Q,ZTMP) ASYNC(IBL)
  DO JK=1,KLEV
    DO JL=KIDIA,KFDIA
      ztmp_q = 0.0_JPRB
      ztmp   = 0.0_JPRB
      !$ACC LOOP PRIVATE(ZQADJ) REDUCTION(+:ZTMP_Q,ZTMP)
      DO JM=1,NCLV-1
        IF (ZQX(JL,JK,JM)<RLMIN) THEN
          ZLNEG(JL,JK,JM) = ZLNEG(JL,JK,JM)+ZQX(JL,JK,JM)
          ZQADJ = ZQX(JL,JK,JM)*ZQTMST
          ztmp_q = ztmp_q + ZQADJ
          ztmp   = ztmp + ZQX(JL,JK,JM)
          ZQX(JL,JK,JM) = 0.0_JPRB
        ENDIF
      ENDDO
      PSTATE_q_loc(JL,JK) = PSTATE_q_loc(JL,JK) + ztmp_q
      ZQX(JL,JK,NCLDQV) = ZQX(JL,JK,NCLDQV) + ztmp
    ENDDO
  ENDDO
  !$ACC END KERNELS

ASYNC removes CUDA-thread syncs.
Slide 11: OpenACC scaling (K40c, in GFlops/s)
Slide 12: Timing (ms) breakdown: single GPU
Slide 13: Saturating GPUs with more work

  !$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND) &
  !$OMP& PRIVATE(tid, idgpu) num_threads(NumGPUs * 4)   ! More threads here
  tid = omp_get_thread_num()                  ! OpenMP thread number
  idgpu = mod(tid, NumGPUs)                   ! Effective GPU# for this thread
  CALL acc_set_device_num(idgpu, acc_get_device_type())
  !$OMP DO SCHEDULE(STATIC)
  DO JKGLO=1,NGPTOT,NPROMA                    ! NPROMA-loop
    IBL=(JKGLO-1)/NPROMA+1                    ! Current block number
    ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)          ! Block length <= NPROMA
    !$acc data copyout(array(:,:,IBL),...) &  ! ~22 : GPU to Host
    !$acc&     copyin(array(:,:,IBL))         ! ~43 : Host to GPU
    CALL CLOUDSC (... array(1,1,IBL)...)      ! Runs on GPU#
    !$acc end data
  END DO
  !$OMP END DO
  !$OMP END PARALLEL
Slide 14: Saturating GPUs with more work
- Consider a few performance-degrading facts at present:
  - Parallelism only in the NPROMA dimension in CLOUDSC
  - Updating 60-odd arrays back and forth every time step
  - OpenACC overhead related to data transfers & ACC DATA
- Can we do better? YES! We can enable concurrently executed kernels through OpenMP
  - Time-sharing GPU(s) across multiple OpenMP threads
  - About 4 simultaneous OpenMP host threads can saturate a single GPU in our CLOUDSC case
- Extra care must be taken to avoid running out of memory on the GPU
  - Needs ~4X smaller NPROMA: 20,000 instead of 80,000 (see the back-of-the-envelope sketch below)
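A back-of-the-envelope check of why NPROMA must shrink when a GPU is 4-way time-shared. The sketch only uses numbers quoted in the slides (137 levels, roughly 65 REAL(kind=8) arrays passed per block, 12GB per K40c, NPROMA of 80,000 vs. 20,000) and assumes, as a simplification, that every array is NPROMA x NLEV sized; it is an illustration, not measured data.

  PROGRAM nproma_memory_sketch
    ! Rough per-block device-memory estimate for CLOUDSC-like argument data:
    ! ~65 REAL(kind=8) arrays of shape (NPROMA, NLEV) per NPROMA block.
    IMPLICIT NONE
    INTEGER, PARAMETER :: NLEV = 137, NARRAYS = 65
    INTEGER :: nproma, nconcurrent
    REAL    :: gib_per_block

    ! One resident block at NPROMA = 80,000 (single kernel per GPU):
    nproma = 80000
    gib_per_block = REAL(NARRAYS) * nproma * NLEV * 8.0 / 2.0**30
    PRINT '(A,I6,A,F6.2,A)', 'NPROMA=', nproma, ' -> ~', gib_per_block, ' GiB per block'

    ! Four concurrent blocks (4-way time-shared GPU) at NPROMA = 20,000:
    nproma = 20000
    nconcurrent = 4
    gib_per_block = REAL(NARRAYS) * nproma * NLEV * 8.0 / 2.0**30
    PRINT '(A,I6,A,F6.2,A)', 'NPROMA=', nproma, ' -> ~', gib_per_block*nconcurrent, &
          ' GiB for 4 concurrent blocks'
    ! The ~65 argument arrays alone account for roughly 5.3 GiB in either case;
    ! GPU-resident work arrays come on top, which is why NPROMA has to drop by
    ! about 4x when four blocks are resident on one 12GB K40c at the same time.
  END PROGRAM nproma_memory_sketch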
Slide 15: Multiple copies of CLOUDSC per GPU (GFlops/s)
Slide 16: nvvp profiler shows the time-sharing impact
(Profiler timelines compare a GPU that is 4-way time-shared with a GPU fed with work by one OpenMP thread only.)
Slide 17: Timing (ms): 4-way time-shared vs. no T/S
(Chart compares a GPU that is 4-way time-shared with a GPU that is not time-shared.)
Slide 18: 24-core Haswell 2.5GHz vs. K40c GPU(s) (GFlops/s)
(T/S = GPUs time-shared)
Slide 19: Conclusions
- The CLOUDSC OpenACC prototype from 3Q/2014 was ported to ECMWF's tiny GPU cluster in 3Q/2015
- Since last time, the PGI compiler has improved and OpenACC overheads have been greatly reduced (PGI 14.7 vs. 15.7)
- With CUDA 7.0 and concurrent kernels, it seems that time-sharing (oversubscribing) GPUs with more work pays off
- Not surprisingly, GPUs can be saturated by having the multi-core host launch more data blocks onto them
- The outcome is not bad, considering we seem to be underutilizing the GPUs (parallelism just along NPROMA)