Slide 1: Porting the physical parametrizations on GPU using directives
X. Lapillonne, O. Fuhrer
COSMO GM, 06/09/2011
Eidgenössisches Departement des Innern EDI, Bundesamt für Meteorologie und Klimatologie MeteoSchweiz

Slide 2: Outline
- Physics with 2d data structure
- Porting the physical parametrizations to GPU using directives
- Running COSMO on a hybrid GPU-CPU system

Slide 3: New data structure
2D data fields inside the physics packages, with one horizontal and one vertical dimension: f(nproma,ke), with nproma = ie x je / nblock.
Goals:
- The physics packages could be shared with the ICON code.
- Blocking strategy: all physics parametrizations can be computed while the data remains in the cache.
organize_physics should be structured as follows:

    call init_radiation
    call init_turbulence
    …
    do ib=1,nblock
      call copy_to_block
      call organize_radiation
      …
      call organize_turbulence
      call copy_back
    end do

Note: an OMP parallelization could be introduced around the block loop (see the sketch below). Data inside organize_scheme is in block form t_b(nproma,ke). Routines below organize_scheme will be shared with ICON. Fields are passed via the argument list: call fesft(t_b(:,:), …
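
A minimal sketch of how such a blocked organize_physics driver with an OpenMP loop over blocks could look. The copy_to_block/copy_back and organize_* interfaces are placeholders for illustration and do not reproduce the actual COSMO argument lists.

    subroutine organize_physics_blocked(nblock, nproma, ke)
      ! Hedged sketch of the blocked physics driver described on this slide;
      ! the called routines are placeholders, not the real COSMO interfaces.
      implicit none
      integer, intent(in) :: nblock, nproma, ke
      real(8) :: t_b(nproma, ke)          ! block copy of one 3d field
      integer :: ib

      call init_radiation
      call init_turbulence

      ! OMP parallelization around the block loop, as suggested in the note above
    !$omp parallel do private(ib, t_b)
      do ib = 1, nblock
         call copy_to_block(ib, t_b)      ! gather the data of this block
         call organize_radiation(t_b)     ! all parametrizations work on t_b(nproma,ke)
         call organize_turbulence(t_b)
         call copy_back(ib, t_b)          ! scatter the results back to the 3d fields
      end do
    !$omp end parallel do
    end subroutine organize_physics_blocked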

Slide 4: Current status
Base code: COSMO, 2d version of the microphysics (hydci_pp), radiation (Ritter-Geleyn) and turbulence (turbtran+turbdiff).
- For the moment, microphysics and radiation are in separate block loops.
- The turbulence scheme copies 3d fields (i.e. turbdiff(t(:,je,:) …).
Next steps:
- All 3 parametrizations (microphysics + radiation + turbulence) in a common block loop
- Performance analysis
- OMP parallelization (?)
Longer term:
- All parametrizations required for operational runs should be inside the block loop and in 2-dimensional form.

Slide 5: Outline
- Physics with 2d data structure
- Porting the physical parametrizations to GPU using directives
- Running COSMO on a hybrid GPU-CPU system

Slide 6: Computing on Graphics Processing Units (GPUs)
- Benefit from the highly parallel architecture of GPUs
- Higher peak performance at lower cost / power consumption
- High memory bandwidth
(Hardware comparison table, numerical values not preserved in the transcript: Cores, Freq. (GHz), Peak Perf. S.P. (GFLOPs), Peak Perf. D.P. (GFLOPs), Memory Bandwidth (GB/sec), Power Cons. (W), for CPU: AMD Magny-Cours and GPU: Fermi M…)

Slide 7: Execution model
(Diagram: sequential execution on the host (CPU), kernels running on parallel threads on the device (GPU), data transfers between the two.)
1. Copy data from CPU to GPU (CPU and GPU memory are separate)
2. Load the specific GPU program (kernel)
3. Execution: the same kernel is executed by all threads, SIMD parallelism (Single Instruction, Multiple Data)
4. Copy data back from GPU to CPU
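
These four steps map almost one-to-one onto accelerator directives. A minimal hedged sketch (array and loop names are illustrative; the fuller example follows on the next slide):

    subroutine exec_model_sketch(b, a, n, nlev)
      ! Illustrative mapping of the execution model onto PGI accelerator directives.
      implicit none
      integer, intent(in)  :: n, nlev
      real(8), intent(in)  :: b(n, nlev)
      real(8), intent(out) :: a(n, nlev)
      integer :: i, k
    !$acc data region local(a,b)
      ! 1. copy the input data from CPU to GPU
    !$acc update device(b)
      ! 2. load/launch the GPU kernel
    !$acc region
      ! 3. all threads execute the same kernel body (SIMD over the i index)
      do i = 1, n
         do k = 1, nlev
            a(i,k) = 2.0d0*b(i,k)
         end do
      end do
    !$acc end region
      ! 4. copy the results back from GPU to CPU
    !$acc update host(a)
    !$acc end data region
    end subroutine exec_model_sketch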

Slide 8: The directive approach, an example

Version 1, original loop order (t = 555 μs):

    !$acc data region local(a,b)
    !$acc update device(b)
    !initialization
    !$acc region
    do k=1,nlev
      do i=1,N
        a(i,k)=0.0D0
      end do
    end do
    !$acc end region
    ! first layer
    !$acc region
    do i=1,N
      a(i,1)=0.1D0
    end do
    !$acc end region
    ! vertical computation
    !$acc region
    do k=2,nlev
      do i=1,N
        a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*b(i,k)
      end do
    end do
    !$acc end region
    !$acc update host(a)
    !$acc end data region

Version 2, after loop reordering (t = 225 μs):

    !$acc data region local(a,b)
    !$acc update device(b)
    !initialization
    !$acc region do kernel
    do i=1,N
      do k=1,nlev
        a(i,k)=0.0D0
      end do
    end do
    !$acc end region
    ! first layer
    !$acc region
    do i=1,N
      a(i,1)=0.1D0
    end do
    !$acc end region
    ! vertical computation
    !$acc region do kernel
    do i=1,N
      do k=2,nlev
        a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*b(i,k)
      end do
    end do
    !$acc end region
    !$acc update host(a)
    !$acc end data region

Timings for N=1000, nlev=60: t = 555 μs vs. t = 225 μs.
Notes: PGI directives; loop reordering; 3 different kernels; the array "a" remains on the GPU between the different kernel calls.

Slide 9: Physical parametrizations on GPU using directives
The physical parametrizations are tested using standalone codes.
Currently ported parametrizations:
- PGI: microphysics (hydci_pp), radiation (fesft), turbulence (only turbdiff yet)
- OMP-acc (Cray): microphysics, radiation
GPU optimizations: loop reordering, replacement of arrays with scalars (see the sketch below).
Note: the hydci_pp, fesft and turbdiff subroutines represent respectively 6.7%, 8% and 7.3% of the total execution time of a typical COSMO-2 run.
The current OMP-acc directives are a subset of the PGI directives, and it is possible to write the PGI code such that there is an almost one-to-one translation to OMP-acc. A first investigation shows similar performance between the two compilers, but this would need further analysis.
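
One of the optimizations listed above, the replacement of arrays with scalars, could look as follows. This is a hedged, illustrative kernel (not taken from the parametrizations themselves) in which a temporary that used to live in a full zt(n,nlev) array is kept in a per-thread scalar, saving one array's worth of global-memory traffic.

    subroutine scalar_replacement_sketch(t, q, dt, n, nlev)
      ! Hedged illustration of "replacement of arrays with scalars": the scalar
      ! zt_s replaces a temporary array zt(n,nlev); names are illustrative only.
      implicit none
      integer, intent(in)    :: n, nlev
      real(8), intent(in)    :: t(n, nlev), dt
      real(8), intent(inout) :: q(n, nlev)
      real(8) :: zt_s
      integer :: i, k
    !$acc region
      do i = 1, n
         zt_s = 0.0d0
         do k = 1, nlev
            zt_s = 0.5d0*(t(i,k) + zt_s)     ! illustrative recurrence along k
            q(i,k) = q(i,k) + dt*zt_s
         end do
      end do
    !$acc end region
    end subroutine scalar_replacement_sketch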

Slide 10: Results, Fermi card using PGI directives
- The peak double-precision performance of a Fermi card is 515 GFlop/s, i.e. we are getting respectively 5%, 4.5% and 2.5% of peak performance for the microphysics, radiation and turbulence schemes.
- The theoretical memory bandwidth is 140 GB/s, but the maximum achievable is around 110 GB/s.
- Test domain: nx x ny x nz = 80 x 60 x 60

Slide 11: Results, comparison with CPU
- The parallel CPU code is run on a 12-core AMD Magny-Cours CPU; note, however, that there are no MPI communications in these standalone test codes.
- Note: the expected speed-up would be between 3x and 5x, depending on whether the problem is compute bound or memory-bandwidth bound.
- The overhead of the data transfers for the microphysics and turbulence is very large.

Slide 12: Comments on the observed performance
- The microphysics has the largest compute intensity (with respect to memory accesses) and as such is best suited for the GPU (see the roofline-style sketch after this slide).
- The lower speed-up observed for the radiation has to be put in perspective: it essentially comes from the fact that this scheme is very well optimized and vectorized on the CPU (~9% of peak performance).
- The turbulence scheme requires more memory accesses.
Next steps:
- Port the turbtran subroutine with PGI + additional tests and optimizations (October 2011)
- Further investigation of the radiation and turbulence schemes with Cray directives (November 2011)
- GPU version of microphysics + radiation + turbulence inside COSMO (November-December 2011)
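
The reasoning behind these comments can be made concrete with a roofline-style bound (not shown on the slide): attainable performance is limited by min(peak, compute intensity x memory bandwidth). The peak (515 GFlop/s) and achievable bandwidth (~110 GB/s) are the values from slide 10; the intensity values below are purely illustrative.

    program roofline_sketch
      ! Roofline-style estimate: attainable GFlop/s = min(peak, intensity * bandwidth).
      ! Peak and bandwidth are taken from slide 10; intensities are illustrative only.
      implicit none
      real(8), parameter :: peak_dp = 515.0d0      ! GFlop/s, Fermi double precision
      real(8), parameter :: bw      = 110.0d0      ! GB/s, achievable memory bandwidth
      real(8) :: intensity(3)
      integer :: i
      intensity = (/ 0.1d0, 0.25d0, 1.0d0 /)       ! flop per byte (illustrative)
      do i = 1, 3
         print '(a,f5.2,a,f7.1,a)', 'intensity ', intensity(i), &
               ' flop/byte -> ', min(peak_dp, intensity(i)*bw), ' GFlop/s attainable'
      end do
    end program roofline_sketch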

Slide 13: Outline
- Physics with 2d data structure
- Porting the physical parametrizations to GPU using directives
- Running COSMO on a hybrid GPU-CPU system

Slide 14: Possible future implementations in COSMO
(Two schematic diagrams showing the dynamics, the physical parametrizations (microphysics, turbulence, radiation) and I/O relative to the GPU; the labels "C++ - CUDA" and "Directives" indicate the two implementation approaches.)
- First approach: data movement between CPU and GPU for each ported routine.
- "Full GPU": data remain on the device and are only sent to the CPU for I/O and communication (a sketch of this data flow follows below).
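
A hedged sketch of what the "full GPU" data flow could look like with the same directives. The time loop, the field names and the write_output routine are assumptions for illustration, not COSMO code.

    subroutine full_gpu_sketch(t, tend, dt, n, nlev, nt, nout)
      ! Hedged sketch of the "full GPU" approach: fields stay on the device for
      ! the whole time loop and are copied back to the host only for output.
      implicit none
      integer, intent(in)    :: n, nlev, nt, nout
      real(8), intent(in)    :: dt, tend(n, nlev)
      real(8), intent(inout) :: t(n, nlev)
      integer :: itime, i, k
    !$acc data region local(t, tend)
    !$acc update device(t, tend)
      do itime = 1, nt
         ! dynamics and physics kernels all work on the device copy of the fields
    !$acc region
         do i = 1, n
            do k = 1, nlev
               t(i,k) = t(i,k) + dt*tend(i,k)
            end do
         end do
    !$acc end region
         if (mod(itime, nout) == 0) then
            ! copy back only when I/O (or communication) is needed
    !$acc update host(t)
            call write_output(t)
         end if
      end do
    !$acc end data region
    end subroutine full_gpu_sketch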

Slide 15: Running COSMO-2 on a hybrid system
(Diagram: multicore processors and GPUs.)
- One (or more) multicore CPUs
- Domain decomposition
- One GPU per subdomain (see the binding sketch below)
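
With domain decomposition and one GPU per subdomain, each MPI rank has to be bound to its own device. A minimal hedged sketch using the OpenACC runtime API; the one-device-per-local-rank, round-robin policy is an assumption, and device numbering may be implementation dependent.

    program bind_rank_to_gpu
      ! Hedged sketch: bind each MPI rank (i.e. each COSMO subdomain) to one GPU.
      use mpi
      use openacc
      implicit none
      integer :: ierr, rank, ndev

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

      ndev = acc_get_num_devices(acc_device_nvidia)        ! GPUs visible on the node
      if (ndev > 0) call acc_set_device_num(mod(rank, ndev), acc_device_nvidia)

      ! ... run the GPU-enabled dynamics/physics on this subdomain ...

      call MPI_Finalize(ierr)
    end program bind_rank_to_gpu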

Slide 16: Summary
- The porting of the microphysics, radiation and turbulence schemes to GPU was successfully carried out using a directive-based approach.
- Compared with a 12-core CPU, a speed-up between 2.4x and 6.5x was observed using one Fermi GPU card.
- These results are within the expected values considering the hardware properties.
- The large overhead of the data transfers shows that the "full GPU" approach (i.e. data remain on the GPU, all computations on the device) is the preferred approach for COSMO.

Slide 17: Additional slides

Slide 18: Comparison between PGI and OMP-acc

PGI accelerator directives:

    !$acc data region local(a)
    !time loop
    do itime=1,nt
      !initialization
      !$acc region
      do k=1,nlev
        do i=1,N
          a(i,k)=0.0D0
        end do
      end do
      !$acc end region
      ! first layer
      !$acc region do kernel
      do i=1,N
        a(i,1)=0.1D0
      end do
      !$acc end region
      ! vertical computation
      !$acc region do kernel
      do i=1,N
        do k=2,nlev
          a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*a(i,k)
        end do
      end do
      !$acc end region
    end do ! end time loop
    !$acc update host(a)
    !$acc end data region

OMP accelerator directives (Cray):

    !$omp acc_data acc_shared(a)
    !time loop
    do itime=1,nt
      !initialization
      !$omp acc_region_loop
      do k=1,nlev
        do i=1,N
          a(i,k)=0.0D0
        end do
      end do
      !$omp end acc_region_loop
      ! first layer
      !$omp acc_region_loop
      do i=1,N
        a(i,1)=0.1D0
      end do
      !$omp end acc_region_loop
      ! vertical computation
      !$omp acc_region_loop kernel
      do i=1,N
        do k=2,nlev
          a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*a(i,k)
        end do
      end do
      !$omp end acc_region_loop
    end do ! end time loop
    !$omp acc_update host(a)
    !$omp end acc_data

Slide 19: Craypat infos
(Craypat CPU profiles of the three routines; only the fraction of double-precision peak is legible for each routine in the transcript. Reported counters: user time, System to D1 refill, System to D1 bandwidth, D2 to D1 bandwidth, L2 to System bandwidth per core, HW FP Ops / user time.)
- MAIN_ / mo_gscp_dwd_hydci_pp_ (x10): 4.5% of peak (DP)
- MAIN_ / src_radiation_fesft_ (x1): 9.3% of peak (DP)
- MAIN_ / turbulence_diff_ref_turbdiff_ (x10): 3.4% of peak (DP)

Slide 20: Palu Results

Slide 21: Results, microphysics, double precision, Palu

Slide 22: Results, radiation, double precision, Palu

Slide 23: Results, turbulence, double precision, Palu