Slide 1 Using OpenACC in IFS Physics’ Cloud Scheme (CLOUDSC)
Sami Saarinen, ECMWF
Basic GPU Training, Sept 16-17, 2015

Slide 2 Background
- Back in 2014: adaptation of the IFS physics’ cloud scheme (CLOUDSC) to new architectures as part of the ECMWF Scalability Programme
- Emphasis was on GPU migration by use of OpenACC directives
- CLOUDSC consumes about 10% of IFS forecast time
- Some 3500 lines of Fortran 2003 – before OpenACC directives
- This presentation concentrates on comparing the performance of:
  - Haswell – OpenMP version of CLOUDSC
  - NVIDIA GPU (K40) – OpenACC version of CLOUDSC

Slide 3 Some earlier results
- Baseline results down from 40s to 0.24s on a K40 GPU
  - PGI 14.7 & CUDA 5.5 / 6.0 (runs performed ~ 3Q/2014)
  - The Cray CCE 8.4 OpenACC compiler was also tried
- OpenACC directives inserted automatically
  - By use of the acc_insert Perl script, followed by manual cleanup
  - Source code lines expanded from 3500 to 5000 in CLOUDSC!
- The code with OpenACC directives still sustains roughly the same performance as before on the Intel Xeon host side
- The GPU’s computational performance was the same or better compared to Intel Haswell (a model with 36 cores, 2.3GHz)
- Data transfers added serious overheads
  - Strange DATA PRESENT testing & memory pinning slowdowns

Slide 4 The problem setup for this case study
- Given 160,000 grid point columns (NGPTOT)
  - Each with 137 levels (NLEV)
  - About 80,000 columns fit into one K40 GPU
- Grid point columns are independent of each other
  - So there are no horizontal dependencies here, but the level dependency prevents parallelization along the vertical dimension
- Arrays are organized in blocks of grid point columns
  - Instead of using ARRAY(NGPTOT, NLEV), we use ARRAY(NPROMA, NLEV, NBLKS) – see the sketch below
  - NPROMA is a (runtime) fixed blocking factor
  - Arrays are OpenMP thread safe over NBLKS
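A minimal sketch of this blocking scheme, assuming the NGPTOT/NLEV/NPROMA values above (the program name, the FIELD array and the derived NGPBLKS are illustrative, not from the slides):

   PROGRAM blocking_sketch
     IMPLICIT NONE
     ! Values follow Slide 4; the array name FIELD is illustrative
     INTEGER, PARAMETER :: NGPTOT = 160000, NLEV = 137, NPROMA = 80000
     INTEGER, PARAMETER :: NGPBLKS = (NGPTOT + NPROMA - 1) / NPROMA
     REAL(KIND=8), ALLOCATABLE :: FIELD(:,:,:)
     INTEGER :: JGLO, IBL, JL
     ALLOCATE(FIELD(NPROMA, NLEV, NGPBLKS))
     ! Map a global column index JGLO onto (column-in-block JL, block IBL)
     DO JGLO = 1, NGPTOT
        IBL = (JGLO - 1) / NPROMA + 1      ! block number
        JL  = JGLO - (IBL - 1) * NPROMA    ! position inside the block
        FIELD(JL, 1, IBL) = 0.0_8          ! touch the top level, for example
     END DO
   END PROGRAM blocking_sketch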

Slide 5 Hardware, compiler & NPROMAs used
- Haswell node: 2.5GHz
- 2 x NVIDIA K40c GPUs on each Haswell node via PCIe
  - Each GPU equipped with 12GB memory – with CUDA 7.0
- PGI compiler 15.7 with OpenMP & OpenACC
  - -O4 -fast -mp=numa,allcores,bind -Mfprelaxed
  - -tp haswell -Mvect=simd:256 [ -acc ]
- Environment variables
  - PGI_ACC_NOSHARED=1
  - PGI_ACC_BUFFERSIZE=4M
- Typical good NPROMA value for Haswell: ~ 10 – 100
- Per-GPU NPROMA up to 80,000 for maximum performance

Slide 6 Haswell: Driving CLOUDSC with OpenMP

   REAL(kind=8) :: array(NPROMA, NLEV, NGPBLKS)

   !$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND)
   !$OMP DO SCHEDULE(DYNAMIC,1)
   DO JKGLO=1,NGPTOT,NPROMA              ! So called NPROMA-loop
      IBL=(JKGLO-1)/NPROMA+1             ! Current block number
      ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)   ! Block length <= NPROMA
      CALL CLOUDSC ( 1, ICEND, NPROMA, KLEV, &
           &         array(1,1,IBL), &   ! ~ 65 arrays like this
           &         ... )
   END DO
   !$OMP END DO
   !$OMP END PARALLEL

Typical values for NPROMA in the OpenMP implementation: 10 – 100

Slide 7 OpenMP scaling (Haswell, in GFlops/s)

Slide 8 Development of the OpenACC/GPU version
- The driver code with the OpenMP loop was kept roughly unchanged
  - GPU-to-host data mapping (ACC DATA) added
- Note that OpenACC can (in most cases) co-exist with OpenMP
  - This allows an elegant multi-GPU implementation (a sketch of how the GPU count might be queried follows below; the full driver is on the next slide)
- CLOUDSC was pre-processed with the “acc_insert” Perl script
  - This allowed automatic creation of ACC KERNELS and ACC DATA PRESENT / CREATE clauses in CLOUDSC
  - In addition, some minimal manual source code clean-up
- CLOUDSC performance on the GPU needs a very large NPROMA
  - Lack of multilevel parallelism (only across NPROMA, not NLEV)
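The slides use NumGPUs without showing where it comes from; a minimal sketch, assuming the standard OpenACC runtime API (openacc module), could look like this (the subroutine name is illustrative):

   SUBROUTINE query_num_gpus(NumGPUs)
     ! Sketch only: obtain the number of visible accelerator devices.
     ! acc_get_num_devices / acc_get_device_type are standard OpenACC runtime routines.
     USE openacc, ONLY: acc_get_num_devices, acc_get_device_type
     IMPLICIT NONE
     INTEGER, INTENT(OUT) :: NumGPUs
     NumGPUs = acc_get_num_devices(acc_get_device_type())  ! e.g. 2 K40c per node here
   END SUBROUTINE query_num_gpus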

Slide 9 Driving OpenACC CLOUDSC with OpenMP

   !$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND) &
   !$OMP&         PRIVATE(tid, idgpu) num_threads(NumGPUs)
   tid = omp_get_thread_num()                 ! OpenMP thread number
   idgpu = mod(tid, NumGPUs)                  ! Effective GPU# for this thread
   CALL acc_set_device_num(idgpu, acc_get_device_type())
   !$OMP DO SCHEDULE(STATIC)
   DO JKGLO=1,NGPTOT,NPROMA                   ! NPROMA-loop
      IBL=(JKGLO-1)/NPROMA+1                  ! Current block number
      ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)        ! Block length <= NPROMA
   !$acc data copyout(array(:,:,IBL),...) &   ! ~22 : GPU to Host
   !$acc&     copyin (array(:,:,IBL))         ! ~43 : Host to GPU
      CALL CLOUDSC (... array(1,1,IBL)...)    ! Runs on GPU# idgpu
   !$acc end data
   END DO
   !$OMP END DO
   !$OMP END PARALLEL

Typical values for NPROMA in the OpenACC implementation: > 10,000

Slide 10 Sample OpenACC coding of CLOUDSC

   !$ACC KERNELS LOOP COLLAPSE(2) PRIVATE(ZTMP_Q,ZTMP)
   DO JK=1,KLEV
     DO JL=KIDIA,KFDIA
       ztmp_q = 0.0_JPRB
       ztmp   = 0.0_JPRB
   !$ACC LOOP PRIVATE(ZQADJ) REDUCTION(+:ZTMP_Q, +:ZTMP)
       DO JM=1,NCLV-1
         IF (ZQX(JL,JK,JM) < RLMIN) THEN
           ZLNEG(JL,JK,JM) = ZLNEG(JL,JK,JM) + ZQX(JL,JK,JM)
           ZQADJ = ZQX(JL,JK,JM)*ZQTMST
           ztmp_q = ztmp_q + ZQADJ
           ztmp   = ztmp + ZQX(JL,JK,JM)
           ZQX(JL,JK,JM) = 0.0_JPRB
         ENDIF
       ENDDO
       PSTATE_q_loc(JL,JK) = PSTATE_q_loc(JL,JK) + ztmp_q
       ZQX(JL,JK,NCLDQV)   = ZQX(JL,JK,NCLDQV) + ztmp
     ENDDO
   ENDDO
   !$ACC END KERNELS ASYNC(IBL)

ASYNC removes CUDA-thread syncs
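Because the kernels are launched asynchronously, the host has to wait on the corresponding queue before data leave the device. A minimal, self-contained sketch of the pattern (illustrative, not CLOUDSC source; the async clause is placed on the kernels directive here, and the queue id iq plays the role of IBL above):

   PROGRAM async_wait_sketch
     IMPLICIT NONE
     INTEGER, PARAMETER :: N = 1000000
     REAL(KIND=8), ALLOCATABLE :: x(:)
     INTEGER :: i, iq
     ALLOCATE(x(N))
     iq = 1                          ! async queue id; CLOUDSC uses the block number IBL
   !$acc data copyout(x)
   !$acc kernels async(iq)
     DO i = 1, N
        x(i) = SQRT(REAL(i,8))
     END DO
   !$acc end kernels
   !$acc wait(iq)                    ! the queue must finish before the copy-out
   !$acc end data
     PRINT *, x(N)
   END PROGRAM async_wait_sketch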

Slide 11 OpenACC scaling (K40c, in GFlops/s)

Slide 12 Timing (ms) breakdown : single GPU

Slide 13 Saturating GPUs with more work

   !$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND) &
   !$OMP&         PRIVATE(tid, idgpu) num_threads(NumGPUs * 4)   ! More threads here
   tid = omp_get_thread_num()                 ! OpenMP thread number
   idgpu = mod(tid, NumGPUs)                  ! Effective GPU# for this thread
   CALL acc_set_device_num(idgpu, acc_get_device_type())
   !$OMP DO SCHEDULE(STATIC)
   DO JKGLO=1,NGPTOT,NPROMA                   ! NPROMA-loop
      IBL=(JKGLO-1)/NPROMA+1                  ! Current block number
      ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)        ! Block length <= NPROMA
   !$acc data copyout(array(:,:,IBL),...) &   ! ~22 : GPU to Host
   !$acc&     copyin (array(:,:,IBL))         ! ~43 : Host to GPU
      CALL CLOUDSC (... array(1,1,IBL)...)    ! Runs on GPU# idgpu
   !$acc end data
   END DO
   !$OMP END DO
   !$OMP END PARALLEL

Slide 14 Saturating GPUs with more work
- Consider a few performance-degrading factors at present:
  - Parallelism only in the NPROMA dimension in CLOUDSC
  - Updating 60-odd arrays back and forth every time step
  - OpenACC overhead related to data transfers & ACC DATA
- Can we do better? YES!
  - We can enable concurrently executed kernels through OpenMP: time-sharing GPU(s) across multiple OpenMP threads
  - About 4 simultaneous OpenMP host threads can saturate a single GPU in our CLOUDSC case
  - Extra care must be taken to avoid running out of memory on the GPU: needs ~4X smaller NPROMA, 20,000 instead of 80,000 (a rough sizing sketch follows below)
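A back-of-the-envelope sizing sketch of why NPROMA must shrink when several host threads share one 12GB K40, assuming roughly 65 REAL(8) arrays of shape (NPROMA, NLEV) resident per block (the array count is taken from Slide 6; everything else is illustrative):

   PROGRAM nproma_sizing_sketch
     IMPLICIT NONE
     INTEGER, PARAMETER :: NLEV = 137, NARRAYS = 65   ! ~65 arrays, as on Slide 6
     INTEGER :: tpg(2) = (/ 1, 4 /)                   ! host threads sharing one GPU
     INTEGER :: k, nproma
     REAL(KIND=8) :: gib
     DO k = 1, 2
        nproma = 80000 / tpg(k)                       ! per-thread block size
        gib = REAL(NARRAYS,8) * NLEV * nproma * 8.0_8 / 2.0_8**30
        PRINT '(A,I1,A,I6,A,F6.2,A)', 'threads/GPU=', tpg(k), &
             '  NPROMA=', nproma, '  ~', gib, ' GiB resident per thread'
     END DO
   END PROGRAM nproma_sizing_sketch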

Slide 15 Multiple copies of CLOUDSC per GPU (GFlops/s)

Slide 16 nvvp profiler shows the time-sharing impact
[Profiler timelines comparing a GPU that is 4-way time-shared with a GPU fed with work by one OpenMP thread only]

Slide 17 Timing (ms): 4-way time-shared vs. no time-sharing
[Timing charts comparing a GPU that is 4-way time-shared with a GPU that is not time-shared]

Slide 18 36-core Haswell 2.5GHz vs. K40c GPU(s) (GFlops/s)
[Chart; T/S = GPUs time-shared]

Slide 19 Conclusions
- The CLOUDSC OpenACC prototype from 3Q/2014 was ported to ECMWF’s tiny GPU cluster in 3Q/2015
- Since last time the PGI compiler has improved and OpenACC overheads have been greatly reduced (PGI 14.7 vs. 15.7)
- With CUDA 7.0 and concurrent kernels it seems that time-sharing (oversubscribing) GPUs with more work pays off
- Saturation of the GPUs can be achieved, not surprisingly, with the help of the multi-core host launching more data blocks onto the GPUs
- The outcome is not bad considering we seem to be underutilizing the GPUs (parallelism just along NPROMA)