Experience with COSMO MPI/OpenMP hybrid parallelization
Matthew Cordery, William Sawyer (Swiss National Supercomputing Centre)
Ulrich Schättler (Deutscher Wetterdienst)

NWP/Climate challenge: the challenges of making climate science applications scale
- Designed for different architectures (e.g. vector)
- Flat profiles (typical of environmental applications)
- Running multiple 100-year simulations (ensembles)
- Advection/reaction of O(100) chemical species
- Increasing resolution (cloud- and eddy-resolving)
- Communication between different models (with different grids and time scales)
- I/O requirements
- Etc.

What is COSMO-CLM?
- COSMO is an operational non-hydrostatic meso-to-micro-scale NWP system, used by MeteoSwiss, DWD, etc.
- The COSMO model in CLimate Mode (COSMO-CLM or CCLM) is a non-hydrostatic regional climate model.

COSMO NWP Applications
- DWD (Offenbach, Germany): NEC SX-8R, SX-9
- MeteoSwiss: Cray XT4; COSMO-7 and COSMO-2 use MPI tasks on 402 of 448 dual-core AMD nodes
- ARPA-SIM (Bologna, Italy): IBM Power5; up to 160 of 512 nodes at CINECA
- COSMO-LEPS (at ECMWF): running on the ECMWF Power6 as a member-state time-critical application
- HNMS (Athens, Greece): IBM Power4; 120 of 256 nodes
- IMGW (Warsaw, Poland): SGI Origin 3800; uses 88 of 100 nodes
- ARPA-SIM (Bologna, Italy): Linux Intel x86-64 cluster for testing (uses 56 of 120 cores)
- USAM (Rome, Italy): HP Linux cluster, dual-processor quad-core Xeon; system in preparation
- Roshydromet (Moscow, Russia): SGI
- NMA (Bucharest, Romania): still in the planning / procurement phase

CLM Community

Sample NWP output: precipitation and 2 m temperature (map figures).

Goal
- An operational 72-hour NWP forecast may take about half an hour of CPU time.
- By extension, a 100-year climate simulation on a similar grid would take several months of CPU time (roughly 12,000 segments of 72 hours at half an hour each, on the order of 250 days), and several simulations might be needed to extract useful science.
- Need: improve both performance and, if necessary, scaling.
- Our goal is to examine different hybrid programming schemes that provide substantial CPU-time savings without requiring extensive use of machine resources.
- First step: examine a mixed programming model using MPI and OpenMP.

Description
- The computational grid is a 3D rotated-latitude/longitude structured grid.
- Communications are through 2- or 3-line halo exchanges (a sketch follows below).
- Many loops are of the form:

    do k = 1, ke        ! height levels
      do j = 1, je      ! latitude
        do i = 1, ie    ! longitude
          ...
        end do
      end do
    end do

- Some 2D and 4D arrays, but of the same basic structure.
- Designed and optimized for vector architectures.
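The following is a minimal sketch of the halo-exchange idea mentioned above, not the actual COSMO exchange routine: a 2-line halo swap along one decomposed direction using MPI_Sendrecv. The field name, the grid sizes and the 1-D decomposition are assumptions made for this example.

    program halo_sketch
      use mpi
      implicit none
      integer, parameter :: ie = 64, je = 64, ke = 60, nhalo = 2   ! assumed local sizes and halo width
      real(8), allocatable :: f(:,:,:)
      integer :: ierr, myrank, nranks, left, right, ncount
      integer :: status(MPI_STATUS_SIZE)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

      ! local field with a halo of width nhalo in the decomposed (longitude) direction
      allocate(f(1-nhalo:ie+nhalo, je, ke))
      f = real(myrank, 8)
      ncount = nhalo * je * ke

      ! neighbours in a 1-D, non-periodic decomposition
      left  = myrank - 1; if (left  < 0)       left  = MPI_PROC_NULL
      right = myrank + 1; if (right >= nranks) right = MPI_PROC_NULL

      ! send the rightmost interior columns to the right neighbour,
      ! receive the left halo from the left neighbour
      call MPI_Sendrecv(f(ie-nhalo+1:ie, :, :), ncount, MPI_DOUBLE_PRECISION, right, 1, &
                        f(1-nhalo:0,     :, :), ncount, MPI_DOUBLE_PRECISION, left,  1, &
                        MPI_COMM_WORLD, status, ierr)

      ! mirror exchange for the right halo
      call MPI_Sendrecv(f(1:nhalo,       :, :), ncount, MPI_DOUBLE_PRECISION, left,  2, &
                        f(ie+1:ie+nhalo, :, :), ncount, MPI_DOUBLE_PRECISION, right, 2, &
                        MPI_COMM_WORLD, status, ierr)

      call MPI_Finalize(ierr)
    end program halo_sketch

A real exchange would cover both horizontal directions and typically many fields at once; this shows only the per-field, per-direction pattern.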

COSMO characteristics
- The main computational region is a time-stepping loop (sketched below) over:
  - a dynamical core that solves for fluid motion
  - a physical core that computes radiation transfer, precipitation, etc.
- Examine scaling of a 1-km configuration to elucidate 'hotspots':
  - … longitude points
  - 765 latitude points
  - 90 vertical points
  - 3 halo grid points
  - 4 I/O tasks
  - 1 hour of simulated time
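A rough, runnable illustration of the time-stepping structure described above (a sketch only; the subroutine names and the placeholder arithmetic are invented, not taken from the COSMO source):

    program timeloop_sketch
      implicit none
      integer, parameter :: nsteps = 10    ! assumed number of time steps
      real(8) :: state(1000)               ! stand-in for the model state
      integer :: ntstep

      state = 0.0d0
      do ntstep = 1, nsteps
        call dynamics_step(state)   ! dynamical core: fluid motion (fast waves, advection, ...)
        call physics_step(state)    ! physical core: radiation, precipitation, turbulence, ...
      end do
      print *, 'final mean state =', sum(state) / size(state)

    contains

      subroutine dynamics_step(s)
        real(8), intent(inout) :: s(:)
        s = s + 1.0d0                ! placeholder work
      end subroutine dynamics_step

      subroutine physics_step(s)
        real(8), intent(inout) :: s(:)
        s = 0.5d0 * s                ! placeholder work
      end subroutine physics_step

    end program timeloop_sketch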

1-km observations
- Computational intensity (flops per memory reference) is low (< …); … > 95%.
- The most expensive routine (fast_waves_rk) has a computational intensity of 0.2 regardless of core count.
- Few routines are load-imbalanced.
- MPI isn't a problem.
- I/O isn't a problem (for this particular problem configuration).
- Good scaling out to several thousand cores, though absolute performance is rather low, which implies the algorithms are memory-bandwidth limited.
- Percentage of peak is 3.5-4%.

Scaling observations
- The physical and dynamical cores scale well out to > 2000 cores.
- The dynamical core takes approximately 3x more time than the physical core, regardless of core count.
- I/O and MPI communications are not limiting factors.

Hybridization path
- Most computational work is encapsulated in multiply-nested loops in multiple subroutines that are called from a main driver loop.
- Most outer loops are over the number of levels, inner loops over latitude/longitude.
- Insert OpenMP PARALLEL DO directives on the outermost loop (see the sketch below).
  - Also attempted use of the OpenMP 3.0 COLLAPSE clause.
  - Over 600 directives inserted.
- Also enabled use of SSE instructions in all routines (previously only used in some routines).
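A minimal sketch of the directive insertion described above, applied to the loop form shown on the 'Description' slide; the array names, grid sizes and arithmetic are illustrative assumptions, not COSMO code.

    program omp_loop_sketch
      use omp_lib
      implicit none
      integer, parameter :: ie = 128, je = 128, ke = 60   ! assumed grid sizes
      real(8), allocatable :: a(:,:,:), b(:,:,:)
      integer :: i, j, k

      allocate(a(ie,je,ke), b(ie,je,ke))
      b = 1.0d0

      ! Thread the outermost (k) loop, as described on the slide
      !$omp parallel do private(i, j, k)
      do k = 1, ke
        do j = 1, je
          do i = 1, ie
            a(i,j,k) = 0.25d0 * b(i,j,k)
          end do
        end do
      end do
      !$omp end parallel do

      ! Variant with the OpenMP 3.0 COLLAPSE clause: fuses the k and j loops into
      ! one parallel iteration space (more iterations to balance across threads,
      ! but overuse can interfere with compiler optimization of the inner loop)
      !$omp parallel do collapse(2) private(i, j, k)
      do k = 1, ke
        do j = 1, je
          do i = 1, ie
            a(i,j,k) = a(i,j,k) + b(i,j,k)
          end do
        end do
      end do
      !$omp end parallel do

      print *, 'checksum =', sum(a), '  max threads =', omp_get_max_threads()
    end program omp_loop_sketch

With some 600 of these directives in the code, the per-loop fork/join and scheduling overhead adds up, which is one reason the later slides look at blocking and threading at a higher level.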

Figure: scaling comparison of the 'Hybrid' and 'Hybrid+fast' versions.

Observations
- It is important to have as much of the code as possible compiled to use SSE instructions.
- It is important not to overuse the COLLAPSE clause, which may interfere with compiler optimization.
- The code runs approximately as fast as the MPI-only version for most core counts.
- The dynamical and physical cores scale well, though the physical core shows a much more pronounced loss of scaling beyond 4 threads.
- Reducing the number of I/O tasks improves performance (why?) and reduces idle cores.

‘Climate grid’
- To examine a configuration closer to what will be used for a 100-year climate science run:
  - 102 latitudes
  - 102 longitudes
  - 60 height levels
  - 1 I/O task
  - 24 simulated hours

Observations
- Relatively weak (< 10%) improvement over the standard MPI code.
- Best results at 3 or 6 threads.
- Performance decreases at 4 threads, where the threads of one MPI task span a socket.
- The performance decrease going from six to twelve threads indicates that some performance bottlenecks remain.

Summary
- Loop-level parallelism can achieve some modest performance gains.
  - It can require many threaded loops -> OpenMP overhead.
  - It can require a lot more software engineering to maintain:
    - introducing new private variables into old loops
    - introducing new physics that needs to be threaded
  - It can be problematic when threaded loops involve arrays created with the Fortran allocate statement (see the sketch below).
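A minimal sketch of the allocatable-array point above (names and sizes are illustrative, not COSMO code): a per-thread scratch array has to be declared PRIVATE and allocated inside the parallel region, which is easy to overlook when retrofitting directives onto existing loops.

    program alloc_sketch
      use omp_lib
      implicit none
      integer, parameter :: ie = 128, ke = 60   ! assumed sizes
      real(8), allocatable :: work(:)           ! per-thread scratch array
      real(8) :: total(ke)
      integer :: i, k

      total = 0.0d0

      ! If 'work' were shared (the default), every thread would allocate and write
      ! into the same array: a race.  Declaring it PRIVATE gives each thread its own,
      ! initially unallocated copy, which must then be allocated inside the parallel
      ! region (allocatable private variables are supported from OpenMP 3.0 on).
      !$omp parallel private(work, i, k)
      allocate(work(ie))
      !$omp do
      do k = 1, ke
        do i = 1, ie
          work(i) = dble(i + k)                 ! placeholder computation
        end do
        total(k) = sum(work)
      end do
      !$omp end do
      deallocate(work)
      !$omp end parallel

      print *, 'total(1) =', total(1), '  max threads =', omp_get_max_threads()
    end program alloc_sketch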

What now?
- Could examine threading and blocking at a higher level (see the sketch below). This will require more extensive work on the code.
  - Improve data re-use.
  - Reduce memory footprint.
  - Attack the physics first and then the dynamical core?
  - Is it possible to design algorithms that can be tuned to cache size on different architectures, as is done in CAM (the Community Atmosphere Model)?
- Investigate new algorithms.
  - There is some promise here: refactoring fast_waves to be more cache friendly and to reuse data has already improved the performance of that routine by a factor of 2.
- Improve I/O.
  - Currently being investigated at CSCS.
- Improve the scaling of communications.
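A minimal sketch of what blocking and threading at a higher level might look like, in the spirit of the CAM-style cache tuning mentioned above; the block size, the array layout and the physics_block routine are hypothetical assumptions for this example.

    program blocking_sketch
      use omp_lib
      implicit none
      integer, parameter :: npts = 102*102, ke = 60   ! horizontal points and levels ('climate grid' sizes)
      integer, parameter :: nblock = 64               ! tunable block size, chosen to fit in cache
      real(8), allocatable :: field(:,:)
      integer :: ib, istart, iend

      allocate(field(npts, ke))
      field = 1.0d0

      ! Thread over blocks of horizontal points: each block stays cache-resident
      ! while all levels (and, in a real model, all physics processes) are applied to it.
      !$omp parallel do private(ib, istart, iend) schedule(dynamic)
      do ib = 1, (npts + nblock - 1) / nblock
        istart = (ib - 1) * nblock + 1
        iend   = min(ib * nblock, npts)
        call physics_block(field(istart:iend, :))
      end do
      !$omp end parallel do

      print *, 'mean =', sum(field) / (dble(npts) * dble(ke))

    contains

      subroutine physics_block(blk)
        ! Stand-in for a physics parameterization applied to one block of columns
        real(8), intent(inout) :: blk(:,:)
        integer :: k
        do k = 1, size(blk, 2)
          blk(:, k) = 0.99d0 * blk(:, k) + 0.01d0 * dble(k)
        end do
      end subroutine physics_block

    end program blocking_sketch

The block size (nblock here) becomes the single, architecture-dependent tuning knob, and one PARALLEL DO around the block loop replaces many per-loop directives.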