Porting the physical parametrizations on GPUs using directives
X. Lapillonne, O. Fuhrer, Cristiano Padrin, Piero Lanucara, Alessandro Cheloni
Eidgenössisches Departement des Innern EDI, Bundesamt für Meteorologie und Klimatologie MeteoSchweiz

Slide 2 (08/09/2011, COSMO GM, X. Lapillonne): Outline
- Computing with GPUs
- COSMO parametrizations on GPUs using directives
- Running COSMO on a hybrid system
- Summary

Slide 3 (08/09/2011, COSMO GM, X. Lapillonne): Computing on Graphical Processing Units (GPUs)
- Benefit from the highly parallel architecture of GPUs
- Higher peak performance at lower cost / power consumption
- High memory bandwidth
[Table: Cores, Frequency (GHz), Peak Perf. S.P. (GFLOPs), Peak Perf. D.P. (GFLOPs), Memory Bandwidth (GB/sec) and Power Consumption (W) for the CPU (AMD Magny-Cours) and the GPU (Fermi M2050).]

Slide 4 (08/09/2011, COSMO GM, X. Lapillonne): Execution model
[Diagram: sequential execution on the host (CPU), kernels executed by many parallel threads on the device (GPU), with data transfers between the two.]
- Copy data from CPU to GPU (CPU and GPU memory are separate)
- Load the specific GPU program (kernel)
- Execution: the same kernel is executed by all threads, SIMD parallelism (Single Instruction, Multiple Data)
- Copy the data back from GPU to CPU
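To make these steps concrete, here is a minimal CUDA Fortran sketch (not part of the presentation; the kernel, array names and launch configuration are purely illustrative), where one thread handles one array element:

module demo_kernels
  use cudafor
contains
  attributes(global) subroutine scale_array(a, s, n)
    integer, value :: n
    real(8), value :: s
    real(8) :: a(n)
    integer :: i
    ! every thread computes its global index and works on one element
    i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
    if (i <= n) a(i) = s*a(i)
  end subroutine scale_array
end module demo_kernels

program execution_model_demo
  use cudafor
  use demo_kernels
  implicit none
  integer, parameter :: n = 1000
  real(8) :: a(n)
  real(8), device :: a_d(n)

  a   = 1.0d0
  a_d = a                                               ! 1. copy data from CPU to GPU
  call scale_array<<<(n+127)/128,128>>>(a_d, 2.0d0, n)  ! 2./3. launch the kernel, executed by all threads
  a   = a_d                                             ! 4. copy the result back from GPU to CPU
  print *, 'a(1) =', a(1)
end program execution_model_demo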

Slide 5 (08/09/2011, COSMO GM, X. Lapillonne): Outline
- Computing with GPUs
- COSMO parametrizations on GPUs using directives
- Running COSMO on a hybrid system
- Summary

Slide 6 (08/09/2011, COSMO GM, X. Lapillonne): The directive approach, an example (PGI directives)

Original loop ordering (N=1000, nlev=60: t = 555 μs):

!$acc data region local(a,b)
!$acc update device(b)
! initialization
!$acc region
do k=1,nlev
  do i=1,N
    a(i,k)=0.0D0
  end do
end do
!$acc end region
! first layer
!$acc region
do i=1,N
  a(i,1)=0.1D0
end do
!$acc end region
! vertical computation
!$acc region
do k=2,nlev
  do i=1,N
    a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*b(i,k)
  end do
end do
!$acc end region
!$acc update host(a)
!$acc end data region

After loop reordering (N=1000, nlev=60: t = 225 μs):

!$acc data region local(a,b)
!$acc update device(b)
! initialization
!$acc region do kernel
do i=1,N
  do k=1,nlev
    a(i,k)=0.0D0
  end do
end do
!$acc end region
! first layer
!$acc region
do i=1,N
  a(i,1)=0.1D0
end do
!$acc end region
! vertical computation
!$acc region do kernel
do i=1,N
  do k=2,nlev
    a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*b(i,k)
  end do
end do
!$acc end region
!$acc update host(a)
!$acc end data region

Notes (PGI directives):
- Loop reordering: the vertical computation has a recurrence in k (a(i,k) depends on a(i,k-1)), so the independent i loop is made the outer, parallel one and each thread processes a whole column.
- 3 different kernels are generated.
- Array "a" remains on the GPU between the different kernel calls.

Slide 7 (08/09/2011, COSMO GM, X. Lapillonne): COSMO physical parametrizations with directives
Note: the directives are tested in standalone versions of the various parametrizations.
- OMP-acc: discussed within the OpenMP committee, currently only supported by a test version of the Cray compiler.
  + : possible future standard
  Currently ported: microphysics (hydci_pp), radiation (fesft)
- PGI directives: developed for the PGI compiler, quite advanced.
  + : most mature; the OMP-acc directives are a subset of the PGI directives
  - : vendor specific
  Currently ported: microphysics (hydci_pp), radiation (fesft), turbulence (turbdiff)
- F2C-ACC: developed by NOAA.
  + : freely available, generates CUDA code (possibility to further optimize and debug at this stage)
  - : ongoing project
  Currently ported: microphysics (hydci_pp)

Slide 8 (08/09/2011, COSMO GM, X. Lapillonne): COSMO physical parametrizations with directives
Specific GPU optimizations have been introduced:
- loop reordering (where necessary)
- replacement of arrays with scalars
For a typical COSMO-2 simulation on a CPU cluster, the subroutines hydci_pp, fesft and turbdiff account for 6.7%, 8% and 7.3% of the total execution time, respectively.
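A minimal sketch of the array-to-scalar replacement, in the same PGI directive style as the example above (the subroutine, the fields t and q and the temporary zt are purely illustrative, not taken from the COSMO parametrizations). Where a CPU version might store a full N x nlev work array, the scalar zt stays in a register of the thread that processes column i:

subroutine add_tendency(t, q, dt, N, nlev)
  implicit none
  integer, intent(in)    :: N, nlev
  real(8), intent(in)    :: t(N,nlev), dt
  real(8), intent(inout) :: q(N,nlev)
  real(8) :: zt
  integer :: i, k

  ! one thread per column i; because the temporary zt is a scalar,
  ! no extra N x nlev work array is read from or written to device memory
!$acc region do kernel
  do i = 1, N
    do k = 2, nlev
      zt     = 0.5d0*(t(i,k) + t(i,k-1))
      q(i,k) = q(i,k) + dt*zt
    end do
  end do
!$acc end region
end subroutine add_tendency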

Slide 9 (08/09/2011, COSMO GM, X. Lapillonne): Performance results: CPU/GPU comparison
- CPU: AMD 12-core "Magny-Cours", MPI-parallel code. Note: there is no MPI communication, as the parametrizations are column-independent.
- GPU: Fermi card (M2050)
- Test case: 00z+3h
- Codes are tested using one subdomain of size nx x ny x nz = 80 x 60 x 60.

Slide 10 (08/09/2011, COSMO GM, X. Lapillonne): Results: comparison with CPU
- Speed-up between 2.4x and 6.5x.
- Note: the expected speed-up would be between 3x and 5x, depending on whether the problem is compute bound or memory bandwidth bound.
- The overhead of data transfer for the microphysics and turbulence is very large.

Slide 11 (08/09/2011, COSMO GM, X. Lapillonne): Comparison PGI vs. F2C-ACC
- First investigations show that the F2C-ACC implementation is 1.2 times faster.
- Shown: execution + data transfer time for the microphysics, run on the same GPU.

Slide 12 (08/09/2011, COSMO GM, X. Lapillonne): Outline
- Computing with GPUs
- COSMO parametrizations on GPUs using directives
- Running COSMO on a hybrid system
- Summary

Slide 13 (08/09/2011, COSMO GM, X. Lapillonne): Possible future implementations in COSMO
[Two schematic diagrams with the components Dynamics, Microphysics, Turbulence, Radiation (phys. parametrization) and I/O, showing which parts run on the GPU.]
- Directives approach: data movement for each routine.
- Full GPU approach (C++, CUDA for the dynamics, directives for the physics): data remain on the device and are only sent to the CPU for I/O and communication.
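The second option can be sketched with the directive syntax used earlier: the fields live in one data region spanning many time steps, each scheme is a kernel operating on the device-resident arrays, and an update host is issued only when the CPU needs the data. This is a schematic example only (the field names, the "scheme" loop bodies and the output frequency are invented for illustration):

program device_resident_fields
  implicit none
  integer, parameter :: nx = 4800, nz = 60, nsteps = 10, nout = 5
  real(8) :: t(nx,nz), q(nx,nz)
  integer :: i, k, step

  t = 280.0d0
  q = 1.0d-3

!$acc data region local(t,q)
!$acc update device(t,q)
  do step = 1, nsteps
     ! "microphysics": one kernel working on device-resident data
!$acc region do kernel
     do i = 1, nx
        do k = 1, nz
           q(i,k) = max(q(i,k) - 1.0d-6, 0.0d0)
        end do
     end do
!$acc end region
     ! "turbulence": a second kernel, no host transfer in between
!$acc region do kernel
     do i = 1, nx
        do k = 2, nz
           t(i,k) = t(i,k) + 0.01d0*(t(i,k-1) - t(i,k))
        end do
     end do
!$acc end region
     ! fields are copied back to the CPU only when output (or communication) is needed
     if (mod(step, nout) == 0) then
!$acc update host(t,q)
        print *, 'step', step, 'mean T =', sum(t)/(nx*nz)
     end if
  end do
!$acc end data region
end program device_resident_fields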

Slide 14 (08/09/2011, COSMO GM, X. Lapillonne): Running COSMO-2 on a hybrid system
[Diagram: multicore processors and GPUs.]
- One (or more) multicore CPUs.
- Domain decomposition.
- One GPU per subdomain.
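With one MPI rank per subdomain, each rank has to attach to its own GPU before any accelerator region runs. A minimal sketch of such a binding (not from the COSMO code; it assumes the PGI accelerator runtime routines acc_get_num_devices and acc_set_device_num from the accel_lib module, and a simple round-robin mapping, whereas a production code would use the node-local rank):

program bind_rank_to_gpu
  use mpi
  use accel_lib
  implicit none
  integer :: ierr, rank, ngpus, mydev

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! number of NVIDIA GPUs visible on this node
  ngpus = acc_get_num_devices(acc_device_nvidia)

  ! simple round-robin mapping of ranks to devices
  ! (assumes the ranks of one node are numbered consecutively)
  mydev = mod(rank, ngpus)
  call acc_set_device_num(mydev, acc_device_nvidia)

  ! ... domain decomposition and GPU-ported parametrizations run here ...

  call MPI_Finalize(ierr)
end program bind_rank_to_gpu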

Slide 15 (08/09/2011, COSMO GM, X. Lapillonne): Outline
- Computing with GPUs
- COSMO parametrizations on GPUs using directives
- Running COSMO on a hybrid system
- Summary

Slide 16 (08/09/2011, COSMO GM, X. Lapillonne): Summary
- Within the POMPA project, investigations are carried out to port the COSMO code to GPU architectures.
- Different directive approaches are considered to port the physical parametrizations to this architecture: PGI, OMP-acc and F2C-ACC.
- Compared with a high-end 12-core CPU, a speed-up between 2.4x and 6.5x was observed using one Fermi GPU card with the PGI directives.
- These results are within the expected values, considering the hardware properties.
- The large overhead of data transfer shows that the full GPU approach (i.e. data remains on the GPU, all computation on the device) is the preferred approach for COSMO.
- First investigations on the microphysics show a speed-up of 1.2x for F2C-ACC with respect to the PGI implementation on the GPU.

Slide 17 (08/09/2011, COSMO GM, X. Lapillonne): Additional slides

Slide 18 (08/09/2011, COSMO GM, X. Lapillonne): Results, Fermi card using PGI directives
- The peak double-precision performance of a Fermi card is 515 GFlop/s, i.e. we obtain 5%, 4.5% and 2.5% of peak for the microphysics, radiation and turbulence schemes, respectively.
- Test domain: nx x ny x nz = 80 x 60 x 60.
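In absolute terms, these fractions correspond to roughly 0.05 x 515 ≈ 26 GFlop/s for the microphysics, 0.045 x 515 ≈ 23 GFlop/s for the radiation and 0.025 x 515 ≈ 13 GFlop/s for the turbulence (sustained double-precision rates derived from the percentages above).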

Slide 19 (08/09/2011, COSMO GM, X. Lapillonne): Scaling with system size N = nx x ny
[Plot of performance versus horizontal domain size; annotation at nx x ny = 100 x 100: not enough parallelism.]

Slide 20 (08/09/2011, COSMO GM, X. Lapillonne): The directive approach
Example with OMP-acc (a kernel is generated at loop level):

!$omp acc_data acc_shared(a,b,c)
!$omp acc_update acc(a,b)
!$omp acc_region
do k = 1,n1
  do i = 1,n3
    c(i,k) = 0.0
    do j = 1,n2
      c(i,k) = c(i,k) + a(i,j) * b(j,k)
    enddo
  enddo
enddo
!$omp end acc_region
!$omp acc_update host(c)
!$omp end acc_data

Slide 21 (08/09/2011, COSMO GM, X. Lapillonne): Computing on Graphical Processing Units (GPUs)
To be efficient, the code needs to take advantage of fine-grained parallelism so as to execute thousands of threads in parallel.
GPU code:
- Programming-level approach: OpenCL, CUDA, CUDA Fortran (PGI), ...
  Best performance, but requires a complete rewrite.
- Directive-based approach: OpenMP-acc, PGI, HMPP, F2C-ACC
  Smaller modifications to the original code.
  The resulting code is still understandable by Fortran programmers and can easily be modified.
  Possible performance sacrifice with respect to CUDA code.
  No standard for the moment (but work is ongoing within the OpenMP committee).
Data transfer time between the host and the GPU may strongly reduce the overall performance.