Is RRTMGP suited for GPU?

Expectations
- Embarrassingly parallel: columns can be split up and computed in parallel (sketched below)
- Memory-intensive computations: memory is faster on GPU than on CPU
- Answer to the title: Yes
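
A minimal sketch of what "embarrassingly parallel over columns" means in practice, assuming a trivial flux kernel; the subroutine and variable names are illustrative, not RRTMGP's:

    subroutine compute_columns(ncol, nlay, tau, flux)
      implicit none
      integer, intent(in)  :: ncol, nlay
      real,    intent(in)  :: tau(ncol, nlay)    ! optical depth, illustrative
      real,    intent(out) :: flux(ncol, nlay)
      integer :: icol, ilay

      ! Each column is independent, so the outer loop can be handed to the
      ! GPU as a whole; OpenACC distributes its iterations over threads.
      !$acc parallel loop copyin(tau) copyout(flux)
      do icol = 1, ncol
        do ilay = 1, nlay
          flux(icol, ilay) = exp(-tau(icol, ilay))   ! placeholder physics
        end do
      end do
    end subroutine compute_columns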

Context

Speed-up GPU vs CPU
- The expected speed-up is determined largely by memory performance.
- GDDR is faster than DDR, and this advantage is expected to grow further in future GPU generations.
- Memory bandwidth is approximately 1.5 - 2 times higher on GPU.
- Drawback: GDDR capacity is typically smaller than DDR.

Computations in RRTMGP
- Multiple components to parallelize: gas optics, flux solver, etc.
- Multiple sub-components, each with its own logic.
- Computations are relatively lightweight: terms and factors from multiple sources, often arrays, are combined using basic arithmetic.
- Static data can be parked in GPU memory, e.g. k-coefficients (sketched below).
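
A hedged sketch of "parking" static data on the GPU with OpenACC unstructured data directives; the module and array names are assumptions, not RRTMGP's actual identifiers:

    module k_coeffs_mod
      implicit none
      real, allocatable :: kmajor(:,:,:)   ! hypothetical k-coefficient table
    contains
      subroutine init_k_coeffs(ngpt, ntemp, npres)
        integer, intent(in) :: ngpt, ntemp, npres
        allocate(kmajor(ngpt, ntemp, npres))
        ! ... fill kmajor from the NetCDF input file ...
        ! Upload once; the table stays resident for the whole run, so
        ! gas-optics kernels never pay a host-to-device transfer for it.
        !$acc enter data copyin(kmajor)
      end subroutine init_k_coeffs

      subroutine free_k_coeffs()
        !$acc exit data delete(kmajor)
        deallocate(kmajor)
      end subroutine free_k_coeffs
    end module k_coeffs_mod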

Scale
Dimensions:
- Approx. 40,000+ columns
- 100 layers
- 250 pseudo-spectral
- 10 other

Memory Access Patterns
- Memory access is mostly sequential.
- Local interpolations interfere with perfectly sequential memory access, but these disruptions are at a local scale only.
- Array index ordering changes between components (see the sketch below):
  - Gas optics: (pseudo-spectral, layer, column)
  - Flux solver: (column, layer, pseudo-spectral)
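
To make the ordering difference concrete, here is a small illustrative sketch (array names are ours, not RRTMGP's). Fortran is column-major, so the leftmost index is the one consecutive threads should touch for coalesced access:

    program layout_demo
      implicit none
      integer, parameter :: ngpt = 250, nlay = 100, ncol = 400
      real, allocatable :: tau_gas(:,:,:), tau_sol(:,:,:)
      integer :: igpt, ilay, icol

      allocate(tau_gas(ngpt, nlay, ncol))   ! gas-optics layout
      allocate(tau_sol(ncol, nlay, ngpt))   ! flux-solver layout
      call random_number(tau_gas)

      ! Transposing between the two layouts is the price of giving each
      ! component its preferred (stride-1) access pattern.
      do igpt = 1, ngpt
        do ilay = 1, nlay
          do icol = 1, ncol
            tau_sol(icol, ilay, igpt) = tau_gas(igpt, ilay, icol)
          end do
        end do
      end do
    end program layout_demo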

Lessons learned

Overview
Compilers struggle with newer Fortran, OpenACC, and libraries:
- Fortran 2003
- NetCDF library for I/O
With OpenACC we tested: PGI and Cray.
Without OpenACC we tested: Intel, PGI, GNU, Cray, and NAG.

Success: Cray and OpenACC
We got gas optics to work, to the extent that it compiled and computed the correct answers to 15-digit precision on GPU.
- Works with !$ACC PARALLEL; !$ACC KERNELS crashes
- Error messages could be better
Issues:
- Member variables and OpenACC are not workable (see the workaround sketched below)
- Function calls within parallel regions are not supported by the compiler
- Optional arguments and OpenACC are not workable
- Defining dynamic dimensions of variables in member functions
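
A hedged sketch of a common workaround for the member-variable issue: alias the derived-type component to a plain local array before the region, so no this% reference appears inside OpenACC code. The type and names are illustrative:

    module solver_mod
      implicit none
      type :: solver_t
        real, pointer :: coeffs(:) => null()
      end type solver_t
    contains
      subroutine solve(this, ncol, flux)
        type(solver_t), intent(in)  :: this
        integer,        intent(in)  :: ncol
        real,           intent(out) :: flux(ncol)
        real, pointer :: coeffs(:)
        integer :: icol

        coeffs => this%coeffs   ! plain array alias; no this% in the region
        !$acc parallel loop copyin(coeffs) copyout(flux)
        do icol = 1, ncol
          flux(icol) = 2.0 * coeffs(icol)   ! placeholder computation
        end do
      end subroutine solve
    end module solver_mod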

PGI and NetCDF
Failure: PGI and NetCDF do not play nicely together.
- ERROR: Segmentation fault
- pgi/15.3, netcdf/4.3.3.1 on Janus @ rc.colorado.edu
This prevented us from testing OpenACC with PGI, even though the PGI compiler is one of the prime choices for OpenACC.
Q: What is the standard NetCDF library for Python? netCDF4, scipy.io.netcdf, or Scientific.IO.NetCDF?
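
For reference, a minimal example of the kind of NetCDF-Fortran read the build exercises; the file name, variable name, and dimensions are made up for illustration:

    program read_kdist
      use netcdf
      implicit none
      integer :: ncid, varid
      real, allocatable :: kmajor(:,:,:)

      allocate(kmajor(250, 100, 16))   ! illustrative dimensions
      call check( nf90_open("coefficients.nc", NF90_NOWRITE, ncid) )
      call check( nf90_inq_varid(ncid, "kmajor", varid) )
      call check( nf90_get_var(ncid, varid, kmajor) )
      call check( nf90_close(ncid) )
    contains
      subroutine check(status)
        integer, intent(in) :: status
        if (status /= NF90_NOERR) then
          print *, trim(nf90_strerror(status))
          stop 1
        end if
      end subroutine check
    end program read_kdist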

Intel
- Does not support OpenACC for practical purposes.
- A few hiccups with the Fortran 2003 standard, but overall "thumbs up".
- Side note: the compiler is sometimes too lenient in the syntax it accepts.
- Tested: intel/15.0.2, netcdf/4.3.3.1

GNU
- Does not support OpenACC for practical purposes.
- A few hiccups with the Fortran 2003 standard, but overall "thumbs up".
- Does not support some Fortran 2003 implicit memory allocations.
- Expected to be slower than other compilers.
- Tested: gnu/4.9.2, netcdf/4.3.3.1

Extra slides

Parallelism in RRTMGP
- Columns
- Layers
- Pseudo-spectral (gpts)
- Other

Strategies for OpenACC Parallelism
- Solver
- Gas optics

OpenACC – example gas optics
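
The code on this slide did not survive the transcript. As a stand-in, here is a hedged sketch of what an OpenACC gas-optics loop nest could look like, assuming a simple linear interpolation in temperature; all names and the formula are illustrative, not the RRTMGP source:

    subroutine gas_optics_tau(ncol, nlay, ngpt, fmajor, kmajor, tau)
      implicit none
      integer, intent(in)  :: ncol, nlay, ngpt
      real,    intent(in)  :: fmajor(nlay, ncol)      ! interpolation weight
      real,    intent(in)  :: kmajor(ngpt, nlay, 2)   ! k-coefficients, GPU-resident
      real,    intent(out) :: tau(ngpt, nlay, ncol)   ! gas-optics layout (gpt, lay, col)
      integer :: icol, ilay, igpt

      ! collapse(3) exposes ncol*nlay*ngpt independent iterations to the GPU;
      ! present(kmajor) assumes the table was uploaded once at initialization.
      !$acc parallel loop collapse(3) present(kmajor) copyin(fmajor) copyout(tau)
      do icol = 1, ncol
        do ilay = 1, nlay
          do igpt = 1, ngpt
            ! linear interpolation between two temperature entries
            tau(igpt, ilay, icol) =                                  &
                 (1.0 - fmajor(ilay, icol)) * kmajor(igpt, ilay, 1)  &
                     +  fmajor(ilay, icol)  * kmajor(igpt, ilay, 2)
          end do
        end do
      end do
    end subroutine gas_optics_tau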

Future Outlook
- C++ implementation
- Hackathons, e.g. http://www.openacc.org/content/openacc-hackathon-tu-dresdenforschungzentrum-julich