The Finite-Volume Dynamical Core on GPUs within GEOS-5
William Putman, Global Modeling and Assimilation Office, NASA GSFC
Programming Weather, Climate, and Earth-System Models on Heterogeneous Multi-Core Platforms, Boulder, CO, 9/8/11

Outline
- Motivation
- Test advection kernel
- Approach in GEOS-5
- Design for FV development
- Early results
- Status/future

Development Platform: NASA Center for Climate Simulation GPU Cluster
- 32 compute nodes, each with:
  - 2 hex-core 2.8 GHz Intel Xeon Westmere processors
  - 48 GB of memory per node
  - 2 NVIDIA M2070 GPUs, each with a dedicated x16 PCIe Gen2 connection
- Infiniband QDR interconnect
- 64 Graphical Processing Units in total

One Tesla GPU (M2070):
- 448 CUDA cores
- ECC memory
- 6 GB of GDDR5 memory
- 515 Gflop/s double-precision floating-point performance (peak)
- 1.03 Tflop/s single-precision floating-point performance (peak)
- 148 GB/s memory bandwidth
- 1 PCIe x16 Gen2 system interface

Motivation: Global Cloud Resolving GEOS-6
- We are pushing the resolution of global models into the 10- to 1-km range.
- GEOS-5 can fit a 5-day forecast at 10 km within the 3-hour window required for operations using 12,000 Intel Westmere cores.
- At current cloud-permitting resolutions (10 to 3 km), the required scaling to ~300,000 cores is reasonable (though such core counts are not readily available).
- Getting to global cloud resolving (1 km or finer) requires on the order of 10 million cores.
- Weak scaling of the cloud-permitting GEOS-5 model indicates the need for accelerators.
- ~90% of those computations are in the dynamics.

[Figures: PDF of average convective cluster brightness temperature; 3.5-km GEOS-5 simulated clouds]

Motivation: Idealized FV advection kernel
- The ultimate target is the FV dynamical core, which accounts for ~90% of the compute cycles at high resolution (1 to 10 km).
- The D-grid shallow-water routines are as costly as the non-hydrostatic dynamics (thus the first pieces to attack).
- An offline CUDA C demonstration kernel was developed for the 2-D advection scheme.
- Data transfers from the host to the device cost about 10-15%.
- For a 512x512 domain, the benchmark revealed up to an 80x speedup.
- Caveats: the benchmark was written entirely on the GPU (no data transfers); the single-CPU to single-GPU speedup compares CUDA C to C code.
- CUDA Profiler: used to profile.

[Figure: Fermi GPGPU, 16x 32-core streaming multiprocessors]
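The demonstration kernel on this slide was written in CUDA C; the rest of the port uses PGI CUDA Fortran, so a minimal CUDA Fortran sketch of the same idea is shown here, with a simple first-order upwind update standing in for the full PPM advection operator. The module, kernel, and variable names (advect_demo, upwind2d_dev, q, qn, cx, cy) are illustrative, not the actual kernel.

    module advect_demo
      use cudafor
      implicit none
    contains
      ! Hypothetical stand-in for the 2-D advection demo kernel:
      ! a first-order upwind update of tracer q with Courant numbers cx, cy.
      attributes(global) subroutine upwind2d_dev(q, qn, cx, cy, nx, ny)
        integer, value :: nx, ny
        real :: q(nx,ny), qn(nx,ny), cx(nx,ny), cy(nx,ny)
        integer :: i, j
        ! One thread per interior mesh point.
        i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
        j = (blockIdx%y - 1)*blockDim%y + threadIdx%y
        if (i > 1 .and. i < nx .and. j > 1 .and. j < ny) then
          qn(i,j) = q(i,j)                                      &
                  - max(cx(i,j),0.)*(q(i,j)   - q(i-1,j))       &
                  - min(cx(i,j),0.)*(q(i+1,j) - q(i,j))         &
                  - max(cy(i,j),0.)*(q(i,j)   - q(i,j-1))       &
                  - min(cy(i,j),0.)*(q(i,j+1) - q(i,j))
        end if
      end subroutine upwind2d_dev
    end module advect_demo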

Motivation: Idealized FV advection kernel - tuning
- The Finite-Volume kernel performs 2-dimensional advection on a 256x256 mesh.
- Blocks on the GPU are used to decompose the mesh in a similar fashion to MPI domain decomposition.
- An optimal distribution of blocks improves occupancy on the GPU.
- Targeting 100% occupancy and thread counts in multiples of the warp size (32).
- Best performance with 16, 32, or 64 threads in the Y-direction.
- Fermi, Compute 2.0 CUDA device: [Tesla M2050].
- Occupancy: the ratio of active warps to the maximum number of warps available, limited by the shared memory and registers used by each thread block.
- Warp: a collection of 32 threads.
- CUDA Profiler: used to profile and compute occupancy.

[Figure: Fermi GPGPU, 16x 32-core streaming multiprocessors; chart of performance vs. total number of threads]
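A minimal host-side sketch of the block/grid sizing described above, reusing the hypothetical upwind2d_dev kernel from the earlier sketch: the thread block is a multiple of the warp size (32x16 = 512 threads) and the grid covers the mesh via ceiling division.

    subroutine launch_upwind2d(q_dev, qn_dev, cx_dev, cy_dev, nx, ny)
      use cudafor
      use advect_demo, only: upwind2d_dev   ! hypothetical module from the earlier sketch
      implicit none
      integer, intent(in) :: nx, ny         ! e.g. nx = ny = 256
      real, device :: q_dev(nx,ny), qn_dev(nx,ny), cx_dev(nx,ny), cy_dev(nx,ny)
      type(dim3) :: tBlock, grid

      ! Threads per block in multiples of the warp size (32); here 32x16 = 512 threads.
      tBlock = dim3(32, 16, 1)
      ! Ceiling division so the grid of blocks covers the whole mesh.
      grid   = dim3((nx + tBlock%x - 1)/tBlock%x, (ny + tBlock%y - 1)/tBlock%y, 1)

      call upwind2d_dev<<<grid, tBlock>>>(q_dev, qn_dev, cx_dev, cy_dev, nx, ny)
    end subroutine launch_upwind2d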

Approach: GEOS-5 Modeling Framework and the FV3 dycore

Earth System Modeling Framework (ESMF)
- GEOS-5 uses a fine-grain component design, with light-weight ESMF components used down to the parameterization level.
- A hierarchical topology is used to create Composite Components, defining the coupling (relations) between parent and child components.
- As a result, an implementation of GEOS-5 residing entirely on GPUs is unrealistic; we must have data exchanges to the CPU for ESMF component connections.

Flexible Modeling System (FMS)
- A component-based modeling framework developed and implemented at GFDL.
- The MPP layer provides a uniform interface to different message-passing libraries and is used for all MPI communication in FV.
- The GPU implementation of FV will extend out to this layer and exchange data for halo updates between the GPU and CPU.

[Diagram: fv_dynamics -> dyn_core: halo updates; do 1,npz loop over c_sw, geopk, and the NH column-based solver; halo updates; do 1,npz loop over d_sw, geopk, and the NH column-based solver; halo updates; then tracer advection and vertical remapping]

PGI CUDA Fortran: CPU and GPU code co-exist in the same code base (#ifdef _CUDA).
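A minimal sketch of how CPU and GPU code paths might co-exist in one code base behind #ifdef _CUDA, as described above; all names here are illustrative, and the kernel body is a trivial stand-in rather than any actual GEOS-5 routine.

    module advect_step_mod
    #ifdef _CUDA
      use cudafor
    #endif
      implicit none
    contains

    #ifdef _CUDA
      ! Device kernel (GPU build only): trivial stand-in for an advection update.
      attributes(global) subroutine scale_dev(q, nx, ny, s)
        integer, value :: nx, ny
        real, value :: s
        real :: q(nx,ny)
        integer :: i, j
        i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
        j = (blockIdx%y - 1)*blockDim%y + threadIdx%y
        if (i <= nx .and. j <= ny) q(i,j) = s*q(i,j)
      end subroutine scale_dev
    #endif

      ! One public entry point; the GPU and CPU paths share the same interface.
      subroutine advect_step(q, nx, ny, s)
        integer, intent(in) :: nx, ny
        real, intent(in) :: s
        real, intent(inout) :: q(nx,ny)
    #ifdef _CUDA
        real, device, allocatable :: q_dev(:,:)
        integer :: istat
        allocate(q_dev(nx,ny))
        q_dev = q                                     ! host-to-device copy
        call scale_dev<<<dim3((nx+31)/32,(ny+15)/16,1), dim3(32,16,1)>>>(q_dev, nx, ny, s)
        istat = cudaDeviceSynchronize()
        q = q_dev                                     ! device-to-host copy
        deallocate(q_dev)
    #else
        q = s*q                                       ! CPU path: plain Fortran
    #endif
      end subroutine advect_step

    end module advect_step_mod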

Approach: Single Precision FV cubed
- FV was converted to single precision prior to beginning GPU development.
- 1.8x - 1.3x speedup.
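One common way to manage such a precision conversion is a single kind parameter chosen at build time; the sketch below is a generic illustration under that assumption (the parameter name FV_PREC and the macro SINGLE_FV are hypothetical, not the actual GEOS-5 mechanism).

    module fv_precision
      implicit none
    #ifdef SINGLE_FV
      ! At least 6 significant digits: 32-bit reals
      integer, parameter :: FV_PREC = selected_real_kind(6)
    #else
      ! At least 12 significant digits: 64-bit reals
      integer, parameter :: FV_PREC = selected_real_kind(12)
    #endif
    end module fv_precision

    ! Usage: declare dycore fields with the chosen kind, e.g.
    !   use fv_precision, only: FV_PREC
    !   real(FV_PREC), allocatable :: delp(:,:,:), pt(:,:,:)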

Approach: Domain Decomposition (MPI and GPU)
- MPI decomposition: 2-D in X,Y.
- GPU blocks are distributed in X,Y within the decomposed domain.

Approach: GEOS-5 Modeling Framework and the FV dycore

Bottom-up development
- Target kernels for 1-D and 2-D advection will be developed at the lowest level of FV (the tp_core module): fxppm/fyppm, xtp/ytp, fv_tp_2d.
- The advection kernels are reused throughout the c_sw and d_sw routines (the shallow-water equations): delp/pt/vort advection.
- At the dyn_core layer, halo regions will be exchanged between the host and the device.
- The device data is centrally located and maintained at a high level (fv_arrays) to preserve the object-oriented approach (and we can pin this memory as needed).

Test-driven development
- Offline test modules have been created to develop GPU kernels for tp_core.
- They are easily used to validate results against the CPU code.
- They improve development time by avoiding costly rebuilds of the full GEOS-5 code base.
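A minimal sketch of centrally maintained device data with pinned host mirrors, under the assumption of illustrative names (fv_arrays_gpu, delp_dev, delp_host) rather than the actual fv_arrays symbols; pinning the host buffers is what later allows asynchronous halo transfers.

    module fv_arrays_gpu
      use cudafor
      implicit none

      ! Device-resident copies of dycore state, allocated once and reused.
      real, device, allocatable :: delp_dev(:,:,:)
      real, device, allocatable :: u_dev(:,:,:), v_dev(:,:,:)

      ! Pinned (page-locked) host buffer so halo exchanges can use
      ! asynchronous transfers (cudaMemcpyAsync needs pinned memory to overlap).
      real, pinned, allocatable :: delp_host(:,:,:)

    contains

      subroutine fv_arrays_gpu_init(nx, ny, npz)
        integer, intent(in) :: nx, ny, npz
        allocate(delp_dev(nx, ny, npz))
        allocate(u_dev(nx, ny+1, npz), v_dev(nx+1, ny, npz))
        allocate(delp_host(nx, ny, npz))
      end subroutine fv_arrays_gpu_init

    end module fv_arrays_gpu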

Details of the Implementation: The FV advection scheme (PPM)
- 1-D flux-form operators
- Directionally split
- Cross-stream inner-operators
- The value at the edge is an average of two one-sided 2nd-order extrapolations across edge discontinuities
- Positivity for tracers
- Fitting by a cubic polynomial to find the value on the other edge of the cell:
  - vanishing 2nd derivative
  - local mean = cell mean of left/right cells
- ORD=7 details (4th order and continuous before monotonicity)...
- Sub-grid PPM distribution schemes [figure]
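For reference, the standard fourth-order PPM interface value on a uniform grid (before any monotonicity constraints), which is equivalent to averaging two one-sided second-order extrapolations across the edge, is:

    q(i+1/2) = 7/12 * ( q(i) + q(i+1) ) - 1/12 * ( q(i-1) + q(i+2) )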

Details of the Implementation: Serial offline test kernel for 2-D advection (fv_tp_2d with PGI CUDA Fortran)

GPU Code (kernel launch configurations shown generically as <<<grid,block>>>; argument lists elided as on the original slide):

    istat = cudaMemcpy(q_device, q, NX*NY)
    call copy_corners_dev<<<grid,block>>>()
    call xtp_dev<<<grid,block>>>()
    call intermediateQj_dev<<<grid,block>>>()
    call ytp_dev<<<grid,block>>>()
    call copy_corners_dev<<<grid,block>>>()
    call ytp_dev<<<grid,block>>>()
    call intermediateQi_dev<<<grid,block>>>()
    call xtp_dev<<<grid,block>>>()
    call yflux_average_dev<<<grid,block>>>()
    call xflux_average_dev<<<grid,block>>>()
    istat = cudaMemcpy(fy, fy_device, NX*NY)
    istat = cudaMemcpy(fx, fx_device, NX*NY)
    ! Compare fy/fx: bit-wise reproducible to CPU code
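A self-contained sketch of what such an offline test driver might look like in PGI CUDA Fortran, with the kernel launches elided and placeholder assignments standing in for the real fluxes; the program name and the validation check are illustrative only.

    program offline_tp_core_test
      use cudafor
      implicit none
      integer, parameter :: NX = 256, NY = 256
      real :: q(NX,NY), fx(NX,NY), fx_cpu(NX,NY)
      real, device :: q_device(NX,NY), fx_device(NX,NY)
      integer :: istat

      call random_number(q)

      ! Host-to-device copy (an array assignment q_device = q would also work).
      istat = cudaMemcpy(q_device, q, NX*NY)
      ! ... launch the advection kernels here, writing fluxes into fx_device ...
      fx_device = q_device            ! placeholder stand-in for the kernel output
      istat = cudaMemcpy(fx, fx_device, NX*NY)

      ! Reference CPU computation for validating the GPU path.
      fx_cpu = q                      ! placeholder stand-in for the CPU result
      if (all(fx == fx_cpu)) print *, 'bit-wise reproducible'
    end program offline_tp_core_test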

GPU Code (asynchronous, multi-stream variant; launch configurations shown generically as <<<grid,block>>>; argument lists elided as on the original slide):

    istat = cudaMemcpyAsync(qj_device, q, NX*NY, stream(2))
    istat = cudaMemcpyAsync(qi_device, q, NX*NY, stream(1))
    call copy_corners_dev<<<grid,block>>>()
    call xtp_dev<<<grid,block>>>()
    call intermediateQj_dev<<<grid,block>>>()
    call ytp_dev<<<grid,block>>>()
    call copy_corners_dev<<<grid,block>>>()
    call ytp_dev<<<grid,block>>>()
    call intermediateQi_dev<<<grid,block>>>()
    call xtp_dev<<<grid,block>>>()
    call yflux_average_dev<<<grid,block>>>()
    call xflux_average_dev<<<grid,block>>>()
    istat = cudaMemcpyAsync(fy, fy_device, NX*NY, stream(2))
    istat = cudaMemcpyAsync(fx, fx_device, NX*NY, stream(1))

Data is copied back to the host for export, but the GPU work can continue...

Details of the Implementation: Serial offline test kernel for 2-D advection (fv_tp_2d with PGI CUDA Fortran)
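For the asynchronous copies above to actually overlap with GPU work, the streams must be created and the host arrays must be page-locked (pinned); a minimal setup sketch using the standard CUDA Fortran runtime API, with illustrative names:

    subroutine setup_streams_and_buffers(nx, ny)
      use cudafor
      implicit none
      integer, intent(in) :: nx, ny
      integer(kind=cuda_stream_kind) :: stream(2)
      ! cudaMemcpyAsync only overlaps with compute when the host side is pinned.
      real, pinned, allocatable :: q(:,:), fx(:,:), fy(:,:)
      integer :: istat, i

      allocate(q(nx,ny), fx(nx,ny), fy(nx,ny))
      do i = 1, 2
        istat = cudaStreamCreate(stream(i))
      end do

      ! ... asynchronous copies and kernel launches go here, each tagged
      !     with stream(1) or stream(2) so the two directions overlap ...

      do i = 1, 2
        istat = cudaStreamSynchronize(stream(i))
        istat = cudaStreamDestroy(stream(i))
      end do
      deallocate(q, fx, fy)
    end subroutine setup_streams_and_buffers

Note that cudaMemcpyAsync silently falls back to a synchronous copy when the host buffer is ordinary pageable memory, which removes the overlap without any error.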

Details of the Implementation: D_SW - Asynchronous multi-stream

GPU Code (argument lists elided as on the original slide):

    call getCourantNumbersY(...stream(2))
    call getCourantNumbersX(...stream(1))
    call fv_tp_2d(delp...)
    call update_delp(delp,fx,fy,...)
    call update_KE_Y(...stream(2))
    call update_KE_X(...stream(1))
    call divergence_damping()
    call compute_vorticity()
    call fv_tp_2d(vort...)
    call update_uv(u,v,fx,fy,...)
    istat = cudaStreamSynchronize(stream(2))
    istat = cudaStreamSynchronize(stream(1))
    istat = cudaMemcpy(delp, delp_dev, NX*NY)
    istat = cudaMemcpy( u, u_dev, NX*(NY+1))
    istat = cudaMemcpy( v, v_dev, (NX+1)*NY)

Times for a 1-day 28-km shallow-water test case:
- 6 GPUs vs. 6 cores: 16.2x speedup
- 6 GPUs vs. 36 cores: 4.6x speedup
- 36 GPUs vs. 36 cores: 10.2x speedup

Status - Summary
- Most of D_SW is implemented on the GPU.
- Preliminary results are being generated (but need to be studied more).
- The C_SW routine is similar to D_SW, but has not been touched yet.
- Data transfers between host and device are done asynchronously when possible.
- Most data transfers will move up to the dyn_core level as implementation progresses, improving performance.
- Higher-level operations in dyn_core will be tested with pragmas (Kerr - GFDL).
- The non-hydrostatic core must be tackled (column based).
- Strong scaling potential?