Is RRTMGP suited for GPU?

Expectations
- Embarrassingly parallel: columns can be split up and computed in parallel (sketched below)
- Memory-intensive computations: memory is faster on GPU than on CPU
- Answer to the title: Yes
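
A minimal sketch of what "embarrassingly parallel over columns" means in practice, assuming a trivial flux kernel; the subroutine and variable names are illustrative, not RRTMGP's:

    subroutine compute_columns(ncol, nlay, tau, flux)
      implicit none
      integer, intent(in)  :: ncol, nlay
      real,    intent(in)  :: tau(ncol, nlay)    ! optical depth, illustrative
      real,    intent(out) :: flux(ncol, nlay)
      integer :: icol, ilay

      ! Each column is independent, so the outer loop can be handed to the
      ! GPU as a whole; OpenACC distributes its iterations over threads.
      !$acc parallel loop copyin(tau) copyout(flux)
      do icol = 1, ncol
        do ilay = 1, nlay
          flux(icol, ilay) = exp(-tau(icol, ilay))   ! placeholder physics
        end do
      end do
    end subroutine compute_columns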

Context

Speed-up GPU vs CPU
- The expected speed-up is determined largely by memory performance.
- GDDR is faster than DDR, and this advantage is expected to grow further in future GPU generations.
- Memory bandwidth is approximately 1.5 - 2 times higher on GPU.
- Drawback: GDDR capacity is typically smaller than DDR.

Computations in RRTMGP
- Multiple components to parallelize: gas optics, flux solver, etc.
- Multiple sub-components, each with its own logic.
- Computations are relatively lightweight: terms and factors from multiple sources, often arrays, are combined using basic arithmetic.
- Static data can be parked in GPU memory, e.g. k-coefficients (sketched below).
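
A hedged sketch of "parking" static data on the GPU with OpenACC unstructured data directives; the module and array names are assumptions, not RRTMGP's actual identifiers:

    module k_coeffs_mod
      implicit none
      real, allocatable :: kmajor(:,:,:)   ! hypothetical k-coefficient table
    contains
      subroutine init_k_coeffs(ngpt, ntemp, npres)
        integer, intent(in) :: ngpt, ntemp, npres
        allocate(kmajor(ngpt, ntemp, npres))
        ! ... fill kmajor from the NetCDF input file ...
        ! Upload once; the table stays resident for the whole run, so
        ! gas-optics kernels never pay a host-to-device transfer for it.
        !$acc enter data copyin(kmajor)
      end subroutine init_k_coeffs

      subroutine free_k_coeffs()
        !$acc exit data delete(kmajor)
        deallocate(kmajor)
      end subroutine free_k_coeffs
    end module k_coeffs_mod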

Scale
Dimensions:
- Approx. 40,000+ columns
- 100 layers
- 250 pseudo-spectral
- 10 other

Memory Access Patterns
- Memory access is mostly sequential.
- Local interpolations interfere with perfectly sequential memory access, but these disruptions are at a local scale only.
- Array index ordering changes between components (see the sketch below):
  - Gas optics: (pseudo-spectral, layer, column)
  - Flux solver: (column, layer, pseudo-spectral)
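
To make the ordering difference concrete, here is a small illustrative sketch (array names are ours, not RRTMGP's). Fortran is column-major, so the leftmost index is the one consecutive threads should touch for coalesced access:

    program layout_demo
      implicit none
      integer, parameter :: ngpt = 250, nlay = 100, ncol = 400
      real, allocatable :: tau_gas(:,:,:), tau_sol(:,:,:)
      integer :: igpt, ilay, icol

      allocate(tau_gas(ngpt, nlay, ncol))   ! gas-optics layout
      allocate(tau_sol(ncol, nlay, ngpt))   ! flux-solver layout
      call random_number(tau_gas)

      ! Transposing between the two layouts is the price of giving each
      ! component its preferred (stride-1) access pattern.
      do igpt = 1, ngpt
        do ilay = 1, nlay
          do icol = 1, ncol
            tau_sol(icol, ilay, igpt) = tau_gas(igpt, ilay, icol)
          end do
        end do
      end do
    end program layout_demo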

Lessons learned

Overview
Compilers struggle with newer Fortran, OpenACC, and libraries:
- Fortran 2003
- NetCDF library for I/O
With OpenACC we tested: PGI and Cray.
Without OpenACC we tested: Intel, PGI, GNU, Cray, and NAG.

Success: Cray and OpenACC
We got gas optics to work, to the extent that it compiled and computed the correct answers to 15-digit precision on GPU.
- Works with !$ACC PARALLEL; !$ACC KERNELS crashes
- Error messages could be better
Issues:
- Member variables and OpenACC are not workable (see the workaround sketched below)
- Function calls within parallel regions are not supported by the compiler
- Optional arguments and OpenACC are not workable
- Defining dynamic dimensions of variables in member functions
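
A hedged sketch of a common workaround for the member-variable issue: alias the derived-type component to a plain local array before the region, so no this% reference appears inside OpenACC code. The type and names are illustrative:

    module solver_mod
      implicit none
      type :: solver_t
        real, pointer :: coeffs(:) => null()
      end type solver_t
    contains
      subroutine solve(this, ncol, flux)
        type(solver_t), intent(in)  :: this
        integer,        intent(in)  :: ncol
        real,           intent(out) :: flux(ncol)
        real, pointer :: coeffs(:)
        integer :: icol

        coeffs => this%coeffs   ! plain array alias; no this% in the region
        !$acc parallel loop copyin(coeffs) copyout(flux)
        do icol = 1, ncol
          flux(icol) = 2.0 * coeffs(icol)   ! placeholder computation
        end do
      end subroutine solve
    end module solver_mod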

PGI and NetCDF
Failure: PGI and NetCDF do not play nicely together.
- ERROR: Segmentation fault
- pgi/15.3, netcdf/4.3.3.1 on Janus @ rc.colorado.edu
This prevented us from testing OpenACC with PGI, even though the PGI compiler is one of the prime choices for OpenACC.
Q: What is the standard NetCDF library for Python? netCDF4, scipy.io.netcdf, or Scientific.IO.NetCDF?
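
For reference, a minimal example of the kind of NetCDF-Fortran read the build exercises; the file name, variable name, and dimensions are made up for illustration:

    program read_kdist
      use netcdf
      implicit none
      integer :: ncid, varid
      real, allocatable :: kmajor(:,:,:)

      allocate(kmajor(250, 100, 16))   ! illustrative dimensions
      call check( nf90_open("coefficients.nc", NF90_NOWRITE, ncid) )
      call check( nf90_inq_varid(ncid, "kmajor", varid) )
      call check( nf90_get_var(ncid, varid, kmajor) )
      call check( nf90_close(ncid) )
    contains
      subroutine check(status)
        integer, intent(in) :: status
        if (status /= NF90_NOERR) then
          print *, trim(nf90_strerror(status))
          stop 1
        end if
      end subroutine check
    end program read_kdist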

Intel
- Does not support OpenACC for practical purposes.
- A few hiccups with the Fortran 2003 standard, but overall "thumbs up".
- Side note: the compiler is sometimes too lenient in the syntax it accepts.
- Tested: intel/15.0.2, netcdf/4.3.3.1

GNU
- Does not support OpenACC for practical purposes.
- A few hiccups with the Fortran 2003 standard, but overall "thumbs up".
- Does not support some Fortran 2003 implicit memory allocations.
- Expected to be slower than other compilers.
- Tested: gnu/4.9.2, netcdf/4.3.3.1

Extra slides

Parallelism in RRTMGP
- Columns
- Layers
- Pseudo-spectral (gpts)
- Other

Strategies for OpenACC Parallelism
- Solver
- Gas optics

OpenACC – example gas optics
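
The code on this slide did not survive the transcript. As a stand-in, here is a hedged sketch of what an OpenACC gas-optics loop nest could look like, assuming a simple linear interpolation in temperature; all names and the formula are illustrative, not the RRTMGP source:

    subroutine gas_optics_tau(ncol, nlay, ngpt, fmajor, kmajor, tau)
      implicit none
      integer, intent(in)  :: ncol, nlay, ngpt
      real,    intent(in)  :: fmajor(nlay, ncol)      ! interpolation weight
      real,    intent(in)  :: kmajor(ngpt, nlay, 2)   ! k-coefficients, GPU-resident
      real,    intent(out) :: tau(ngpt, nlay, ncol)   ! gas-optics layout (gpt, lay, col)
      integer :: icol, ilay, igpt

      ! collapse(3) exposes ncol*nlay*ngpt independent iterations to the GPU;
      ! present(kmajor) assumes the table was uploaded once at initialization.
      !$acc parallel loop collapse(3) present(kmajor) copyin(fmajor) copyout(tau)
      do icol = 1, ncol
        do ilay = 1, nlay
          do igpt = 1, ngpt
            ! linear interpolation between two temperature entries
            tau(igpt, ilay, icol) =                                  &
                 (1.0 - fmajor(ilay, icol)) * kmajor(igpt, ilay, 1)  &
                     +  fmajor(ilay, icol)  * kmajor(igpt, ilay, 2)
          end do
        end do
      end do
    end subroutine gas_optics_tau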

Future Outlook
- C++ implementation
- Hackathons, e.g. http://www.openacc.org/content/openacc-hackathon-tu-dresdenforschungzentrum-julich