PP POMPA (WG6) Overview Talk COSMO GM11, Rome st Birthday.


Who is POMPA?
- ARPA-EMR: Davide Cesari
- C2SM/ETH: Xavier Lapillonne, Anne Roches, Carlos Osuna
- CASPUR: Stefano Zampini, Piero Lanucara, Cristiano Padrin
- Cray: Pozanovich Jeffrey, Roberto Ansaloni
- CSCS: Matthew Cordery, Mauro Biancho, Jean-Guillaume Piccinali, William Sawyer, Neil Stringfellow, Thomas Schulthess, Ugo Varetto
- DWD: Ulrich Schättler, Kristina Fröhlich
- KIT: Andrew Ferrone, Hartwig Anzt
- MeteoSwiss: Petra Baumann, Oliver Fuhrer, André Walser
- NVIDIA: Tim Schröder, Thomas Bradley
- Roshydromet: Dmitry Mikushin
- SCS: Tobias Gysi, Men Muheim, David Müller, Katharina Riedinger
- USAM: David Palella, Alessandro Cheloni, Pier Francesco Coppola
- USI: Daniel Ruprecht

Kickoff Workshop
- May 3-4 2011, hosted by CSCS in Manno
- 15 talks, 18 participants
- Goal: get to know each other, report on work already done, plan and coordinate future activities
- Revised project plan

Task Overview
- Task 1: Performance analysis and documentation
- Task 2: Redesign memory layout and data structures
  - Closely linked to work in Tasks 5 and 6
- Task 3: Improve current parallelization
- Task 4: Parallel I/O
  - Focus on NetCDF (which is still done from 1 core)
  - Technical problems
  - New person (Carlos Osuna, C2SM) starting work on 15.09.2011
- Task 5: Redesign implementation of dynamical core
- Task 6: Explore GPU acceleration
- Task 7: Implementation documentation
  - No progress

Performance Analysis
Goal
- Understand the code from a performance perspective (workflow, data movement, bottlenecks, problems, …)
- Guide and prioritize the work in the other tasks
- Try to ensure exchange of information and performance portability of developments

Performance Analysis (Task 1)
Work
- COSMO RAPS 5.0 benchmark with DWD, MeteoSwiss and IPCC/ETH runscripts on hpcforge.org (Ulrich Schättler, Oliver Fuhrer, Anne Roches)
- Workflow of RK timestep (Ulrich Schättler): http://www.c2sm.ethz.ch/research/COSMO-CCLM/hp2c_one_year_meeting/2a_schaettler
- Performance analysis
  - COSMO RAPS 5.0 on Cray XT4, XT5 and XE6 (Jean-Guillaume Piccinali, Anne Roches)
  - COSMO-ART (Oliver Fuhrer)
- Wiki page

Jean-Guillaume Piccinali and Anne Roches

Problem: Overfetching
- Computational intensity is the ratio of floating point operations (ops) to memory references (ref)
- When accessing a single array value, a complete cache line (64 Bytes = 8 double precision values) is loaded into L1 cache

    do i = 1+nbounlines, ie-nbounlines
      A(i) = 0.0d0
    end do

  … also loads A(1), A(2), A(3)
- If the subdomain on a processor is very small, many values loaded from memory never get used for computation
Array layout (figure): A(1) A(2) A(3) A(4) … A(ie-3) A(ie-2) A(ie-1) A(ie)
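The overfetching effect can be put into numbers with a small C sketch. It counts how many double precision values are pulled into cache by the interior loop and how many are actually written. The function names are hypothetical, and the sketch assumes that A(1) sits at the start of a 64-byte cache line, which the slide does not state.

```c
#include <stddef.h>

#define DOUBLES_PER_LINE 8  /* 64-byte cache line / 8-byte double */

/* Values pulled into cache by the loop i = 1+nbounlines .. ie-nbounlines
   (1-based bounds, as in the Fortran loop on the slide).
   Assumption: A(1) is aligned to the start of a cache line. */
size_t values_loaded(size_t ie, size_t nbounlines)
{
    if (ie <= 2 * nbounlines) return 0;          /* empty interior */
    size_t first = nbounlines;                   /* 0-based first index */
    size_t last = ie - nbounlines - 1;           /* 0-based last index */
    size_t first_line = first / DOUBLES_PER_LINE;
    size_t last_line = last / DOUBLES_PER_LINE;
    return (last_line - first_line + 1) * DOUBLES_PER_LINE;
}

/* Values the loop actually touches (the interior points). */
size_t values_used(size_t ie, size_t nbounlines)
{
    return ie > 2 * nbounlines ? ie - 2 * nbounlines : 0;
}
```

For example, with ie = 10 and nbounlines = 3 the loop writes 4 values but a full line of 8 is loaded, so half of the fetched data is never used.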

Performance Analysis: Wiki https://wiki.c2sm.ethz.ch/Wiki/ProjPOMPATask1

Improve Current Parallelization (Task 3)
- Loop level hybrid parallelization (OpenMP/MPI) (Matthew Cordery, Davide Cesari, Stefano Zampini)
  - No clear benefit of this approach vs. flat MPI parallelization
  - Approach suitable for memory bandwidth bound code?
  - Restructuring of code (into blocks) may help!
- Overlap communication with computation using non-blocking MPI calls (Stefano Zampini)
- Lumped halo-updates for COSMO-ART (Christoph Knote, Andrew Ferrone)
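As an illustration of what loop level threading inside one MPI subdomain looks like, here is a minimal C sketch. It is not COSMO code: the three-point average and the name smooth_interior are hypothetical stand-ins for a stencil loop, and the OpenMP directive only takes effect when the compiler has OpenMP enabled.

```c
#include <stddef.h>

/* Hypothetical three-point average over the interior of one MPI
   subdomain; in[0] and in[n-1] play the role of halo points. */
void smooth_interior(const double *in, double *out, size_t n)
{
    /* Loop-level threading: iterations are independent, so the threads
       of this rank can share them. Without OpenMP the loop runs serially. */
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (long i = 1; i < (long)n - 1; ++i) {
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
    }
}
```

The loop does one division and two additions per three loads and one store, so it is limited by memory bandwidth rather than by arithmetic, which is the kind of code the slide asks about.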

Halo exchange in COSMO (Stefano Zampini)
- 3 types of point-to-point communication: 2 partially non-blocking and 1 fully blocking (with MPI_SENDRECV)
- Halo swapping needs completion of the East to West exchange before starting the South to North communication (implicit corner exchange)
- New version which communicates corners (2x more messages)
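The implicit corner exchange can be checked with a serial C sketch that keeps all "ranks" in one process, so no MPI is needed. Everything here is an assumption for illustration: a 3x3 periodic rank grid, 2x2 interior points per rank, a halo of width 1, and the function names. Interior points hold the owning rank's id, so the value that ends up in a corner halo shows which rank it came from.

```c
#define P 3        /* ranks per direction (hypothetical) */
#define N 2        /* interior points per direction (hypothetical) */
#define M (N + 2)  /* local size including a halo of width 1 */

/* field[py][px][j][i]: local array of rank (px, py). */
static double field[P][P][M][M];

static int wrap(int p) { return (p + P) % P; }  /* periodic neighbours */

/* Interior points get the owner's rank id, halo points get -1. */
void init_fields(void)
{
    for (int py = 0; py < P; ++py)
        for (int px = 0; px < P; ++px)
            for (int j = 0; j < M; ++j)
                for (int i = 0; i < M; ++i) {
                    int interior = i >= 1 && i <= N && j >= 1 && j <= N;
                    field[py][px][j][i] = interior ? py * P + px : -1.0;
                }
}

/* Step 1: fill west/east halo columns, interior rows only. */
void exchange_east_west(void)
{
    for (int py = 0; py < P; ++py)
        for (int px = 0; px < P; ++px)
            for (int j = 1; j <= N; ++j) {
                field[py][px][j][0]     = field[py][wrap(px - 1)][j][N];
                field[py][px][j][M - 1] = field[py][wrap(px + 1)][j][1];
            }
}

/* Step 2: fill south/north halo rows over the full width, so the halo
   columns filled in step 1 travel along and land in the corners. */
void exchange_south_north(void)
{
    for (int py = 0; py < P; ++py)
        for (int px = 0; px < P; ++px)
            for (int i = 0; i < M; ++i) {
                field[py][px][0][i]     = field[wrap(py - 1)][px][N][i];
                field[py][px][M - 1][i] = field[wrap(py + 1)][px][1][i];
            }
}

/* Rank id found in the south-west corner halo of rank (px, py). */
int southwest_corner(int py, int px)
{
    return (int)field[py][px][0][0];
}
```

If the South to North step runs without the East to West step having completed, the corner halo still holds the initial -1, which is why the two phases cannot overlap unless corners are sent as separate messages.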

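New halo-exchange routine (Stefano Zampini)
Timeline (figure): Compute A, Send A, Receive A, Use A
- OLD: CALL exch_boundaries(A), with the communication time inside the single call
- NEW: CALL exch_boundaries(A,2) followed later by CALL exch_boundaries(A,3), with the communication time between the two calls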

Early results: COSMO2
- Figure panels: Total time (s) for model runs; Mean total time for RK dynamics
- Is Testany / Waitany the most efficient way to assure completion?
- Restructuring of code to find more work (B) could help!

Explore GPU Acceleration (Task 6)
Goal
- Investigate whether and how GPUs can be leveraged for numerical weather prediction with COSMO
Background
- Early investigations by Michalakes et al. using WRF physical parametrizations
- Full port of JMA next-generation model (ASUCA) to GPUs via a rewrite in CUDA
- New model developments (e.g. NIM at NOAA) which have GPUs as a target architecture in mind from the very start

GPU Motivation

Chip                 Architecture          Peak Performance      Memory Bandwidth     Power Consumption       Price per Node
Intel Westmere       6 cores @ 3.4 GHz     81.6 GFlops           32 GB/s              130 Watt                X $
NVIDIA Fermi M2090   512 cores @ 1.3 GHz   665 GFlops            155 GB/s             225 Watt                X $
Ratio (GPU / CPU)                          × 8 (compute bound)   × 5 (memory bound)   × 1.7 (“power bound”)
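The rounded factors in the table follow directly from the listed figures, as the short C sketch below shows. The struct and function names are hypothetical. The flops-per-watt comparison is a derived quantity that does not appear on the slide and is included only as an assumption about how the numbers might be combined.

```c
/* Figures copied from the table above. */
typedef struct {
    double peak_gflops;
    double bandwidth_gbs;
    double power_watt;
} chip_t;

static const chip_t westmere    = { 81.6,  32.0, 130.0 };
static const chip_t fermi_m2090 = { 665.0, 155.0, 225.0 };

/* GPU / CPU factor for compute bound code (peak performance). */
double compute_factor(const chip_t *gpu, const chip_t *cpu)
{
    return gpu->peak_gflops / cpu->peak_gflops;
}

/* GPU / CPU factor for memory bound code (memory bandwidth). */
double memory_factor(const chip_t *gpu, const chip_t *cpu)
{
    return gpu->bandwidth_gbs / cpu->bandwidth_gbs;
}

/* GPU / CPU factor in power consumption. */
double power_factor(const chip_t *gpu, const chip_t *cpu)
{
    return gpu->power_watt / cpu->power_watt;
}

/* Derived, not on the slide: gain in peak GFlops per Watt. */
double flops_per_watt_factor(const chip_t *gpu, const chip_t *cpu)
{
    return compute_factor(gpu, cpu) / power_factor(gpu, cpu);
}
```

With the table values, the three factors come out at roughly 8.1, 4.8 and 1.7, matching the rounded × 8, × 5 and × 1.7 on the slide.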

Programming GPUs
- Programming languages (OpenCL, CUDA C, CUDA Fortran, …)
  - Two codes to maintain
  - Highest control, but require complete rewrite
  - Highest performance (if done by expert)
- Directive based approach (PGI, OpenMP-acc, HMPP, …)
  - Smaller modifications to original code
  - The resulting code is still understandable by Fortran programmers and can be easily modified
  - Possible performance sacrifice (w.r.t. rewrite)
  - No standard for the moment
- Source-to-source translation (F2C-acc, Kernelgen, …)
  - One source code
  - Can achieve very good performance
  - Legacy codes often don’t map very well onto GPUs
  - Hard to debug

Challenges
- How to change a wheel on a moving car?
  - GPU hardware and programming models are rapidly changing
  - Several approaches are vendor bound and/or not part of a standard
  - COSMO is also rapidly evolving
- How to have a single readable code which also compiles onto GPUs?
  - Efficiency may require restructuring or even a change of algorithm
  - Directives jungle
- Efficient GPU implementation requires…
  - to execute all of COSMO on the GPU
  - enough fine grain parallelism (i.e. threads)

Explore GPU Acceleration (Task 6)
Work
- Source-to-source translation of the whole model (Dmitry Mikushin)
- Porting of physical parametrizations using PGI directives or f2c-acc (Xavier Lapillonne, Cristiano Padrin): next talk
- Rewrite of dynamical core for GPUs (Oliver Fuhrer): talk after next talk

HP2C OPCODE Project
- Additional proposal to the Swiss HP2C initiative to build an “OPerational COSMO DEmonstrator (OPCODE)”
- Project proposal accepted
- Project runs from 1 June 2011 until end of 2012
- Project lead: André Walser
- Project resources:
  - second contract with IT company SCS to continue collaboration until end of 2012
  - 2 new positions at MeteoSwiss for about 1 year
  - contribution to position at C2SM
  - contribution from CSCS

HP2C OPCODE Project
Main Goals
- Leverage the research results of the ongoing HP2C COSMO project
- Prototype implementation of the MeteoSwiss production suite making aggressive use of GPU technology
- Similar time-to-solution on hardware with substantially lower power consumption and price
- Hardware comparison (figure): Cray XT4 (3 cabinets) vs. GPU based hardware (a few rack units)

Thank you!
